Go Time – Episode #179

Event-driven systems

with Daniel Selans & Steve High

In this episode we talk with Daniel and Steve about their experience with event-driven systems and shed some light on what they are and who they might be for. We explore topics like the complexity of setting up an event-driven system, the need to embrace eventual consistency, useful tools for building event-driven systems, and more.

Sponsors

Teleport – Quickly access any resource anywhere using a Unified Access Plane that consolidates access controls and auditing across all environments - infrastructure, applications, and data. Try Teleport today in the cloud, self-hosted, or open source at goteleport.com

Retool – Retool makes it super simple to build back-office apps in hours, not days. The tool is built by engineers, explicitly for engineers. Learn more and try it for free at retool.com/changelog

Linode – Get $100 in free credit to get started on Linode. Linode is our cloud of choice and the home of Changelog.com. Head to linode.com/changelog OR text CHANGELOG to 474747 to get instant access to that $100 in free credit.

Notes & Links

  • Batch - Daniel’s company, which is a platform for working with message buses and event-driven systems.
  • RabbitMQ - An event/message bus tool.
  • MQTT - Another event/message bus option that is very simple.
  • etcd - A fast key/value store. Daniel talks about using it as a cache in the episode.
  • Plumber CLI - A tool written by Batch to help work with any message bus.
  • Event Sourcing - Martin Fowler’s article on event sourcing.
  • CUE - An encoding tool mentioned in the episode.
  • Code blocks example from Steve - In the show Steve mentions code blocks helping readability. This is an example of this.

Transcript

Play the audio to listen along while you enjoy the transcript. 🎧

Hello everybody, and welcome to Go Time. Today, Kris and I are joined by Daniel Selans and Steve High to talk about event-driven systems. Hey, Daniel. How are you?

Doing alright, what’s going on?

Not too much, just excited to learn about event-driven systems from you guys. Steve, how are you doing?

I’m doing great, thanks.

Awesome. And Kris, how are you?

Doing pretty well.

Okay. So Daniel, why don’t we just start with a little bit of background information about yourself? What experience do you have with event-driven systems, and why are you sort of the person to talk about it?

Well, I don’t know if I’m the exact person to talk about it…

Or a person to talk about it…

Yeah, I’ll take that seat, the king of the castle seat… So I’ve been in the industry for about – I don’t know, I don’t wanna age myself too much, but it’s been like 15-20 years. I’ve worked in all kinds of various places. I have a pretty serious background in data science. I did a lot of – well, back in the day it was known as systems integrations, and now it’s really just like automation stuff. After that, I spent a lot of time in fintech and the design space, and kind of all over the place, including APM space as well.

Most recently I was at a social startup called Community, which had the best event-driven system I had ever seen. They managed to pull off something that is fairly rare, which is a tiny startup implementing an amazing foundation, so that as they continued growing, they didn’t have to patch a bunch of holes.

[04:14] I’ve been building event-driven systems for a while, but we started playing around with Kafka between me and my now co-founder, just somebody I was working with at Community as well. We basically came up with this prototype of like an idea as to how we could basically simplify event-driven systems in the first place, and ended up submitting that to Y Combinator, just kind of for fun… And then thought that it was a fluke when the interview came up, and then we were like “Oh, well I guess this is real.” So we ended up getting accepted into Y Combinator. Then it started us on our track to basically build this stuff.

I’ve been exposed to event-driven systems for a long time, and now that I’m actually working on it full-time, it is more apparent than ever that it is still an area that is kind of unknown; people are generally afraid of it, that sort of a thing. So I’m here to really try to clear the space, clear the air, that sort of thing.

Cool. And your startup is at batch.sh. Do you wanna give a quick elevator pitch as to why somebody might wanna check it out?

Sure, yeah. Basically, we are a data pipeline company, essentially, that specializes in extracting data from message buses. We work with basically anything, any message bus tech. It could be Kafka, Rabbit, NATS, AWS SQS, and the list goes on and on. We’re basically message bus agnostic. You don’t necessarily have to do event-driven; as long as you’re doing something with message buses and the data that’s on them is important to you, and you need to look at it and be able to inspect it and that sort of thing, then you should definitely check it out. Or just shoot me a message and we could just chat about it.

Awesome. Steve, what about yourself? What is your experience with event-driven systems?

Currently, I’m with a company called NTWRK, which is billed as the QVC for Gen Z, which basically means that we drop really high-demand products at a given time, and [unintelligible 00:06:12.29] try to buy these things all at the same time. The event-driven stuff that we do right there - there’s a lot of transactional management that we have to worry about, a lot of state management that we have to worry about.

Previously, I was with a few other companies that had simple (or not quite so simple) message buses using Kafka, and MQTT, and other types of technologies like that. But back even further, my main background is actually embedded systems design… Message buses in that sphere look a little different than they do in the current flavor of technology, but at the end of the day it’s kind of the same concepts, so it translated pretty nicely for me to move all that knowledge to where we’re currently at, with an event-driven microservices architecture.

Awesome. You both mentioned some different technologies… Would one of you want to take a stab at explaining at a high level what an event-driven system actually is?

I’m gonna try to do it in a non-scientific way. Really, at the core of it, it’s a systems architecture that essentially requires you to – or it uses (usually) asynchronous messaging to communicate state. That means instead of service A talking to service B directly via REST or gRPC or whatever, you are emitting a message saying that some sort of a state change occurred, and you do not know in advance the audience it’s actually intended for, but somebody’s going to consume that message and do something with it.

Honestly, it’s a fairly simple concept. What you end up with is gaining a lot of reliability for introduced complexity, essentially. That’s at the core of it. What do you think, Steve? Are you gonna hype that up?

[07:56] Yeah. Basically, what Daniel said – the main crux of it is you are communicating the state of something in your business logic. It could be literally anything - a shopping cart, or a customer’s status, that kind of thing. The key is the coordination - I guess we’ll get into that in a little bit - of those events and translating of those events, and tolerance that you have for lost data, that sort of thing. That’s really where the crux of the design of the event-driven architecture really – or most of the time you should spend on such a thing should take place. That’s normally – in my experience at least, that’s where most of the time is spent, is dealing with edge-casy failure stuff.

It’s asynchronous events, sometimes synchronous… The receiver doesn’t really know where it’s coming from, they just know that they got a message. So it’s kind of that simple.

Okay. As a more concrete example, and let’s say I have a system where a user is signing up and paying for a plan, and then that sort of unlocks their account. What would that look like in an event-driven system? Would you still be talking with services, or would just events be used for certain parts of that?

You would definitely not talk to any services. All the services in an event-driven system usually utilize some sort of a message broker, like an event bus. So the idea would be that your frontend app is communicating with a main BFF (backend for frontend); the backend for frontend receives the request to charge somebody, or put in an order, or something like that, and what it’s going to do is it’s going to just emit a message saying “Hey, a new order has come in” or something like that.

Another service, let’s say a billing service, is going to pick up that order, because it’s listening to those messages for, let’s say, that routing key; it picks up the message, it does something with it, maybe it emits another message for another service to do something with that message as well, but let’s say it charges the person, and it goes to Stripe and it does everything. Then it would emit another message onto the same event bus that the frontend app or the BFF is basically listening on as well, and it says “Oh, it’s complete.” That’s essentially it. Obviously, the more business logic you have and the more decoupling you’re doing, you could have five services actually be working in that. You basically build it out as complex as you need, depending on what kind of scale you really need as well. It’s all growing out of necessity, not because it’s the proper way to do it from the get-go.
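As a rough illustration of the flow Daniel describes (not code from the show), here is a minimal Go sketch. The Bus interface and the in-memory implementation are hypothetical stand-ins for a real broker such as Rabbit or Kafka; the point is that services only emit and consume events, and never call each other directly.

```go
// Minimal sketch of the order flow described above, using a hypothetical
// in-process Bus. In a real system the bus would be RabbitMQ, Kafka, NATS, etc.
package main

import (
	"fmt"
	"sync"
)

// Event carries a routing key and an opaque payload.
type Event struct {
	Key     string // e.g. "order.created", "order.charged"
	Payload []byte
}

// Bus is the only thing services share: they emit events and subscribe to keys.
type Bus interface {
	Publish(e Event)
	Subscribe(key string, handler func(Event))
}

// memoryBus is a toy implementation, just enough to show the shape of the system.
type memoryBus struct {
	mu       sync.Mutex
	handlers map[string][]func(Event)
}

func newMemoryBus() *memoryBus {
	return &memoryBus{handlers: map[string][]func(Event){}}
}

func (b *memoryBus) Subscribe(key string, h func(Event)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[key] = append(b.handlers[key], h)
}

func (b *memoryBus) Publish(e Event) {
	b.mu.Lock()
	hs := append([]func(Event){}, b.handlers[e.Key]...)
	b.mu.Unlock()
	for _, h := range hs {
		h(e) // a real bus would deliver asynchronously and persist the event
	}
}

func main() {
	bus := newMemoryBus()

	// Billing service: reacts to new orders, then emits its own event.
	bus.Subscribe("order.created", func(e Event) {
		fmt.Printf("billing: charging for %s\n", e.Payload)
		bus.Publish(Event{Key: "order.charged", Payload: e.Payload})
	})

	// BFF: only cares that the charge eventually completed.
	bus.Subscribe("order.charged", func(e Event) {
		fmt.Printf("bff: order %s complete\n", e.Payload)
	})

	// The BFF receives an HTTP request and simply emits an event.
	bus.Publish(Event{Key: "order.created", Payload: []byte("order-42")})
}
```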

So obviously, anybody who’s just starting out with building an application, or just sort of getting a feel for it, it’s gonna seem intuitively easier to just talk to the services you need, and especially because then you’ll know that things have happened. So if somebody was considering this event-driven architecture, what are some of the benefits of doing that that might entice them to try it out?

One of the things is that you’ll probably notice immediately - depending on how you’ve written your service to begin with - is a performance bump. You will be able to dispatch these kind of dog and pony show things in the background, and then your UI can do things to make the experience seamless for the user. But really, what’s happening is you’re getting a performance bump from the fact that you are now asynchronously just batching hundreds or thousands (or even hundreds of thousands) of messages immediately, and then the rest of your architecture is delegating those tasks around, just to handle the disposition of those events.

Whereas with a synchronous architecture, RESTful request database calls, that sort of thing, you are tethered to your IO at that point. You really can’t escape the physical reality that you have this IO boundary around your service. In my opinion, that’s the biggest advantage. There are some other advantages… In general the distribution of work, but it also makes it a lot easier as an organization (let’s say) to create a common lexicon of types that you can then work from. It kind of forces you - within reason - to curate those things properly, instead of just kind of willy-nilly throw events over the fence, because you have to maintain a discipline to then unpack those events, any observability that you have to worry about, any of that stuff… It really forces you to think about how you’re communicating across your stack.

[12:06] I would say also that a massive benefit for us is you’re building a really solid foundation for the future. Instead of having to untether a massive mess at some point in time, once your company/org has grown, and having to try to decouple a huge monolith and so on, you already have gone down the right rails. You already are decoupled in the first place, and so on.

With that said, just to put it in a slight con - even though I love event-driven systems - it’s the fact that they are fairly complex. So it doesn’t necessarily mean that every startup should be doing event-driven. Actually, in most cases I would say they probably should not do event-driven, for a tiny startup. If you’re just doing a web shop, or something like that - maybe it’s not necessary. However, if you are planning to build something that has to be high throughput and high scale, it totally makes sense to get those foundations right. Like Steve said, I think that is one of my number one things that you gain out of event-driven systems, is speed and performance. Just the simple fact that there is nothing blocking any longer as you’re inputting some sort of data. I mean, you still have Kafka or something that’s basically buffering all the stuff behind the scenes, but really, now you’re essentially limited by how fast you’re able to write to Kafka… And that is incredible, because it’s a really simple service; all of a sudden, instead of just being able to do 1,000 requests a second, it can do 100,000. That’s a really big deal. But again, it depends purely on your use case. So what are the goals of the thing that you’re building? Are the goals super-high throughput and scalability, reliability, that sort of thing? Or is it “Man, just a little tiny side project”?

A question on that… It kind of sounds like doing this event-driven type of system would require either low coordination, or induce some kind of higher latency for coordination; if you have to use Kafka to talk to another service, and you need something back from it, you incur the cost of sending things over Kafka. So I guess for either of you, would you suggest that if your use case kind of skews toward that, you avoid event-driven, or is there some kind of hybrid pattern that you might be able to use, where event-driven can help you out?

I’m gonna use the cop-out answer for most of these nebulous – not even nebulous questions, but… Every single one of these questions can be answered with “It depends”, and this is definitely one of those times.

So if I was instrumenting my services, I would probably measure the amount of time it takes to run that loop, versus a direct connection… And I’d weigh that against the performance load that it puts on using the event bus vs. not using it, what that load looks like. And basically, those are engineering decisions based on what you’re willing to tolerate in terms of customer experience, and that sort of thing.

If it was a transaction, let’s say, where you’re making a purchase, and your money is involved, obviously, the tolerance for failure there is really low, if not non-existent. Whereas if you’re just refreshing a page, in the user experience side of things, that tolerance can go up a lot higher. So you basically have to make those value judgments.

There’s probably some mathematical formula that somebody came up with at some point. I normally just base it on starting with user experience and working my way back. What can a user tolerate before they either close the web page or put your app down?

I’d also mention – first off, “It depends” is a pretty sweet answer. It’s totally true. [laughs] For me also, when I’m looking at this sort of stuff, it depends on the number of events that you expect in there for a particular transaction to take place. If you are expecting there’s going to be 100 events that need to be exchanged, besides the fact that maybe you haven’t quite architected it correctly, at that point maybe it doesn’t quite make sense to do that.

[16:05] At the same time, if we go down to a slightly lower level, on a technical level - the difference between creating a connection to some HTTP service versus a pre-established connection to your event bus: it is going to be faster on the event bus 100%, always. So you creating six connections one after another to various different services, plus having circuit breakers (Hystrix-style circuit breakers or something like that) in place, is guaranteed to be slower than you emitting events. So I think generally speaking, if you’re doing under 50 events - I would say even under 20 events - to perform some sort of transaction, I think it’s fine and it’s generally negligible, in comparison to some sort of a RESTful call that you’re sending to someone over a service.

Can we take a step back one second? Daniel, I know you mentioned that Batch works with pretty much any event bus… But it dawned on me that everybody listening might not quite know what an event bus is. So at a high level, could we just make sure that they’re on the same page?

Totally. Yeah. An event bus is really just a fancy word for a message broker, which is really a fancy word for a queue. So a queue with different kinds of capabilities. Basically, it’s a centralized system which accepts messages, queues them up and sends them to consumers. That’s essentially it.

At the core of event-driven tech or event-driven architectures there’s always a message bus… And a message bus vs. an event bus - it’s synonymous, really. As we’ve built Batch, we’ve seen people using everything. That’s the reason we continued building out Batch so it supports all these different systems; not because we’re a bunch of geniuses and we’re like “Oh yeah, we need to add support for this and this.” No, because people were asking for it, and we’re like “Alright, if you want WebSphere queues - sure. I’ve never heard of it, but it sounds good. We’ll just add it anyway.”

People use MQTT, and NATS, and Kafka, and RabbitMQ, and GCP Pub/Sub… There’s a ton of them. But really, anything that is usable for transmitting a message and it being sent to somewhere else, that basically works as an event bus or a message bus.

Awesome. Thanks. And then going back to the discussion you guys were having about performance - I know that you said several times that it is definitely faster to use the event-driven system, because you can write these events really quickly. But I assume that one of the downsides to that is that you write an event, but you don’t actually know if people have consumed it or done anything with it yet. So it’s faster in the sense that you write things, but I assume you would have this eventual consistency problem where you can’t count on things just happening in a certain order. Is that true?

Absolutely. That’s a term that I just absolutely love. I think that eventual consistency in this case - you’ll have to be okay and accept eventual consistency into your heart, and say that this is okay, and I’m okay with it. At the core of it really is that you have really no guarantees. You can add guarantees into it if you want to, and you can have special acks and so on, but in general, if you just simply accept the fact that you know that it’s eventually going to become consistent, it’s okay, and it’s good enough. And I think that’s a building block for the entire architecture. Basically, you are okay with things just working, eventually. It’s totally possible that things are going to go down and they will not be immediate, but that’s okay, because at some point in time later in the future, let’s say when a service goes down, it doesn’t consume an event, it’ll come back up and then it’ll consume it, and the system will get back into a correct state again. Have you got anything to add, Steve?

I’m surprised you didn’t say the i-word, Dan. Would you like to say the i-word?

Idempotency?

There you go. That’s the crux of it, right? You’re firing off a bunch of asynchronous events and they’re not guaranteed to be delivered at the same time, so you have to make sure that when you read these events, that a change that occurs in event A does not undo the change that occurred in event B, or it’s basically just a no-op at that point. So again, it’s a design decision you have to make.

[20:10] Once you get used to doing that, the fear, uncertainty and doubt around an event-driven system goes down, because now you have things like [unintelligible 00:20:18.11] you have things like event retries, that sort of thing, to kind of help mitigate these failures that do happen, and they happen all the time, even in managed architectures like an AWS cloud environment. These things always happen to everybody. So you do have to do things to mitigate that stuff.

A part of it is also that – I mean, just the concept of event-driven, or even event sourcing in this case, is that you’re able to replay the events; that’s part of the reason why we’re doing this in the first place, is that when there is an outage or something bad happens, you can basically take those events and shoot them right back, and basically ensure that the system is going to go back into the correct state.

It’s what Steve said - I’d love to say the word “idempotency”, even though I don’t know if that’s the correct way to pronounce it, but anyway… So idempotency - it’s just a thing that you should build your services with from the get-go, and things are going to be okay after that.

So you brought up two things. One, the idempotency - and you were talking about replaying, so I assume that means that it’s uncommon to rely on messages being delivered exactly once, or events being delivered exactly once. Is that pretty true?

Ah, yes… The snake-oil that is exactly-once delivery. I believe that once we’ve come up with a perpetual motion machine, then yeah, that exactly-once delivery will also happen. It’ll be great. But until that happens, I do not believe in it, I have never seen it properly be implemented, ever. It is doable, I guess, in some really close circuits, and really controlled environments… But technically, it’s very hard to guarantee that, if it relies on even electricity.

So yeah, for event-driven systems you should not rely on exactly-once delivery, even if it sounds plausible. You should just not. Because at some point in time, someone is going to re-emit a message accidentally, twice, and then basically you’re in a world of hurt.

Instead, just build your systems with idempotency in mind, that there’s a possibility that an event is going to get duplicated. If you wanna do deduplication on your side, go for it. But the easiest one is like “Have I handled this before?” “Yes, I have.” Ignore, and just move on. It’s so much easier, the logic that’s involved in that.
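A minimal sketch of that “have I handled this before?” check, assuming every event carries a unique ID (the field names here are illustrative). The seen-set lives in memory; in practice it could just as well live in etcd or Redis so it survives restarts.

```go
// Idempotent consumer sketch: re-delivering the same event is a no-op.
package main

import (
	"fmt"
	"sync"
)

type Event struct {
	ID   string
	Body []byte
}

type Handler struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewHandler() *Handler {
	return &Handler{seen: make(map[string]bool)}
}

func (h *Handler) Handle(e Event) {
	h.mu.Lock()
	if h.seen[e.ID] {
		h.mu.Unlock()
		return // already processed; ignore and move on
	}
	h.seen[e.ID] = true
	h.mu.Unlock()

	fmt.Printf("processing %s: %s\n", e.ID, e.Body)
}

func main() {
	h := NewHandler()
	e := Event{ID: "evt-123", Body: []byte("charge customer 7")}
	h.Handle(e)
	h.Handle(e) // duplicate delivery: safely ignored
}
```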

So when you talk about handling that, what are some techniques that work well for making things idempotent? Is it just looking at an event ID of some sort? Like, “Oh, I’ve handled this one.” Or are there other techniques that work well?

Timestamps are pretty sweet. I like timestamps a lot. In the words of Steve, it depends. It depends on how important the dataset is. You can absolutely look at timestamps and basically say “The event that I have already handled has a timestamp in the future, and another event that comes in afterwards has got an older timestamp. Don’t worry about the older event; just dismiss it.”

Similarly, every service here can have its own data store as well. So that means you can have your own caches, and all kinds of stuff. So you can either put it in memory, and then basically keep track of all the messages that you’ve already handled, and so on - keep that stuff either in a cold cache, in etcd, or in mem, or wherever, in regards to keeping track of IDs, and that sort of a thing; you could do that, but it’s generally speaking not necessary, I don’t think.
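And a hedged sketch of the timestamp technique Daniel mentions: keep the newest timestamp seen per entity and dismiss anything older that arrives late. The event shape here is an assumption for illustration.

```go
// Timestamp-based staleness check: late, older events are dropped.
package main

import (
	"fmt"
	"time"
)

type CustomerUpdated struct {
	CustomerID string
	Status     string
	UpdatedAt  time.Time
}

// projection stores the latest status per customer plus when it was produced.
type projection struct {
	status   map[string]string
	lastSeen map[string]time.Time
}

func newProjection() *projection {
	return &projection{status: map[string]string{}, lastSeen: map[string]time.Time{}}
}

func (p *projection) apply(e CustomerUpdated) {
	if !e.UpdatedAt.After(p.lastSeen[e.CustomerID]) {
		return // older (or duplicate) event arriving late: dismiss it
	}
	p.status[e.CustomerID] = e.Status
	p.lastSeen[e.CustomerID] = e.UpdatedAt
}

func main() {
	p := newProjection()
	now := time.Now()
	p.apply(CustomerUpdated{"c1", "active", now})
	p.apply(CustomerUpdated{"c1", "pending", now.Add(-time.Minute)}) // stale, ignored
	fmt.Println(p.status["c1"])                                      // "active"
}
```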

So it sounds like, generally speaking, with an event-driven system you – do you just naturally get an event log or an audit log of some sort that shows you everything that has happened in your system, or is that something that you have to build into it?

That is the awesome part. Well, you’ll still need to build some sort of an archiver of sorts, but in general, as an event – because the event is the source of truth, you’re basically getting audit logging for free. That’s essentially what it comes down to. Every single thing that has happened within your system… If folks are familiar with concepts of like change data capture - that’s essentially it; you’re basically plugging into… Actually, let’s roll back. Change data capture. Change data capture is basically plugging into a database’s replication log, watching every single thing that changes. If you need some sort of compliance - let’s say like [unintelligible 00:25:23.10] or something like that, you probably will need that sort of stuff. And you’re gonna need to set up some sort of pipeline to actually read all those changes and record them somewhere.

With event-driven it’s the same exact thing, except there is no database. You have just a message bus. So you would basically record every single one of those events; those events are essentially your change data capture. You’re watching everything that has happened within your system. So you do not need to build yet another system to say like “Oh, now we’re gonna have audit logging”, that sort of thing. So it absolutely is an audit log. But that said, you will need to build some things on your own. That’s part of the reason why we built Batch as well, is because we didn’t want folks to have to build all this stuff on their own, on the side. It’s a decent-sized endeavor, but you totally get a lot of free functionality out of it as well.
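A rough sketch of the “archiver” idea (not Batch’s implementation): one consumer subscribes to everything and appends each event to an append-only file, which is a crude audit log. Error handling is abbreviated.

```go
// Minimal archiver: every event gets one line in an append-only file.
package main

import (
	"fmt"
	"os"
	"time"
)

func archive(f *os.File, routingKey string, payload []byte) error {
	// One line per event: timestamp, routing key, raw payload.
	_, err := fmt.Fprintf(f, "%s\t%s\t%s\n",
		time.Now().UTC().Format(time.RFC3339), routingKey, payload)
	return err
}

func main() {
	f, err := os.OpenFile("audit.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// In a real system these calls would be driven by a subscription to all topics.
	_ = archive(f, "order.created", []byte("order-42"))
	_ = archive(f, "order.charged", []byte("order-42"))
}
```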

I would imagine, at least in my mind, that some of the added functionality also comes down to things like seeding a developer database - that sounds like it’d be a lot easier if that process is really just taking a set of events and running them, and you don’t have to worry too much about how your system has changed over time, or anything. Those events should still theoretically replay and get you to the right place.

So is this something that ends up improving the developer experience when we’re not in production and we’re working locally, or how does that affect the developer experience building something?

To me, the ability to replay events - even though events are supposed to be delivered asynchronously, the same batch (no pun intended) of events being delivered in let’s say a test is really helpful, particularly around testing the boundaries of connected services that are connected via event propagation. I’m saying this as a consumer of batch.sh. But I’ve been able to replay a batch of events that I’ve emitted to the service in (let’s say) an integration test, or even a unit test, for that matter. It just reduces the amount of work I have to do. For me, the biggest positive of replay is – other than auditing, it would definitely be just testing with known inputs.

There’s a really big piece of event-driven – so I’ve mentioned the word “event sourcing”, and it’s just basically a sub-architecture of event-driven. Let me just quickly spell it out. Basically, in event sourcing, the idea is that you’re able to utilize the events to essentially build another data source. Because we’re considering the events the source of truth, it basically enables a developer to essentially spin up their own data store utilizing the events. But the data store is only pertinent to the service that they’re building. So that is incredible.

[28:07] So where in the past you would basically – let’s talk about a regular architecture - you would probably try to create a dump of some main monolithic database, import it into your own db and then build a feature off of that, and now you have a second problem, which is like “Well, now my service is either connected to this other db, or I have to keep them somehow in sync.” That stuff is gone, because you are essentially plugging into all the events, the ones that you’re interested in… Say you filter out and say “You know what - I know that there’s five billion events in there, but I only care about the billing events, because I’m the billing service.” Well, you basically siphon off only the billing events, build up your own data store with just the events and what they actually represent, and build the db as to how your service would best utilize it.
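To make that concrete, here is a small illustrative sketch (not from the show): a billing service ignores everything but billing events and folds them into its own local store, which it can rebuild at any time by replaying the stream from the beginning. The event keys and fields are assumptions.

```go
// Event-sourcing projection sketch: a service-local store derived purely from events.
package main

import "fmt"

type Event struct {
	Key    string
	UserID string
	Cents  int64
}

// BillingStore is the billing service's own "database", rebuilt from the stream.
type BillingStore struct {
	balances map[string]int64
}

func NewBillingStore() *BillingStore {
	return &BillingStore{balances: map[string]int64{}}
}

// Replay folds an event stream into local state, skipping events it doesn't own.
func (s *BillingStore) Replay(events []Event) {
	for _, e := range events {
		switch e.Key {
		case "billing.charged":
			s.balances[e.UserID] += e.Cents
		case "billing.refunded":
			s.balances[e.UserID] -= e.Cents
		default:
			// not a billing event: this service doesn't care
		}
	}
}

func main() {
	stream := []Event{
		{Key: "user.signed_up", UserID: "u1"},
		{Key: "billing.charged", UserID: "u1", Cents: 4900},
		{Key: "billing.refunded", UserID: "u1", Cents: 1000},
	}
	s := NewBillingStore()
	s.Replay(stream)
	fmt.Println(s.balances["u1"]) // 3900
}
```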

I’m glad you answered that one, because when I was writing the tweets, I didn’t quite realize the difference between the two… So I wrote one that had event sourcing, and Kris actually pointed that out to me; he’s like “Are we talking about event sourcing today, or just event-driven?” So does that answer your question, Kris, or do you have more–

No, I think that answers my question. I think the other thing that’s important there too is that there’s a distinction between the two a little bit, of event sourcing and event-driven. You don’t have to do event sourcing to do event-driven… And you don’t necessarily have to do event-driven if you do event sourcing. It can be useful on its own. But it’s important to not conflate the two, or you’ll wind up in a really awkward world most of the time.

Totally. We probably should have started off with that. They are extremely similar, and they work with each other as well. There’s quite a few of these sub-architectures. There’s CQRS as well, there’s saga patterns that you can utilize, and they all kind of work together. I just look at it as really event-driven is the big one; that is an umbrella term that basically encompasses all this other stuff. And there’s different ways how you can utilize it, basically, and one of them is event sourcing.

So you both mentioned replaying – you can technically replay events, especially if you just wanna seed a database, or do some testing… In my mind, this seems like something that would make a lot of testing simpler, because rather than having to spin six different services up and make sure they interact correctly, you can sort of just – it’s almost like a unit test, where it’s like “Here’s the input. Does it emit the event I want it to emit?” Is that generally what testing ends up looking like?

Totally. Here’s some anecdote time really quick - it was awesome. It was the first time I ever experienced this in my entire career. Basically, we were building a company and we were building out the architecture for it. We didn’t actually have a database behind it for the first six months, I believe. Everything was so event-driven. You can say that all we did was basically just write features, emit events, have another service consume the event, and then we moved on. And you’re utilizing some sort of caching, because you still need to bootstrap your services, right? You need to keep some sort of state within the service. So we would use etcd behind the scenes, and basically as we consume an event, we wanna say “Oh yeah, this event has been consumed” or “We’ve done something with this customer. Save it to etcd.” The service starts back up, it loads it from etcd, and we’re back into the same thing.

So in that particular case, a database, and these concepts of having to seed the db and so on - they’re all basically gone. You don’t really need to do that sort of stuff. So in regards to testing things, or whether it’s actually part of your CI and so on - yes, you’re essentially just emitting events. You rarely will have to do some sort of db mocks and so on to actually perform your unit tests or integration tests.
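A small illustrative sketch of testing with known inputs: feed recorded events into the handler and assert on what it emits, with no database or mocks involved. The types and names here are assumptions, kept in one file for the sketch.

```go
// Table-free example: known event in, expected event out.
package billing

import "testing"

type Event struct {
	Key     string
	Payload string
}

// HandleOrder is the unit under test: given an order event, it decides what to emit.
func HandleOrder(in Event) []Event {
	if in.Key != "order.created" {
		return nil
	}
	return []Event{{Key: "order.charged", Payload: in.Payload}}
}

func TestHandleOrderEmitsCharge(t *testing.T) {
	out := HandleOrder(Event{Key: "order.created", Payload: "order-42"})
	if len(out) != 1 || out[0].Key != "order.charged" {
		t.Fatalf("expected a single order.charged event, got %#v", out)
	}
}
```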

I feel like there’s – like, this is obviously all great, but one of the things that’s kind of always in my mind when it comes to event-driven stuff is “How do you manage a) designing the events well, and b) updating and changing those events when you need to add something?” What would be your suggestions to people that are getting started off with this? Because there’s a number of different paths. You can add things to events, you can make entirely new events… What are your thoughts on that topic?

[32:13] It’s almost my favorite topic. I don’t think it’s an unpopular opinion - I am 100% in the camp of you should use protobuf. You should not use anything else, you should just use protobuf and just call it a day, and follow the best guidelines for how to write protobuf schemas. Protobuf is not just for gRPC; it’s great for gRPC, but really, you can use it as a message envelope as well… And all the same patterns that apply to writing good types, how to deprecate them and so on for when you’re writing a gRPC service - all the same stuff applies still. Same with however your CI pipeline is set up to actually compile your protos - same exact thing applies to these schemas as well. You use proper tagging, you vendor them in your code, or whatever… Basically, all the same good practices that you would usually do for gRPC, you do the same thing for protobuf.

I would just say avoid JSON. If you do not want to have conflicts, that is how you avoid them. Having something that’s strict - and I think Steve touched on this briefly in the beginning - basically, having something that has a strict schema is your friend here, so that nobody can just go out of their way and just willy-nilly add some field, or change a type of something, and so on. If it is stamped down, such as with protobuf, then it’ll make everyone’s lives significantly easier.
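As an illustration of protobuf as the message envelope, here is a hedged Go sketch. The orderspb package and its fields stand in for a hypothetical generated schema; proto.Marshal produces the bytes that go onto the bus, and any consumer with the same schema can decode them.

```go
// Sketch of protobuf as the wire envelope for events. The orderspb import is a
// hypothetical generated package; swap in your own compiled schema.
package main

import (
	"fmt"
	"log"

	"google.golang.org/protobuf/proto"

	orderspb "example.com/yourorg/schemas/orders" // hypothetical generated package
)

func main() {
	msg := &orderspb.OrderCreated{
		OrderId:    "order-42",
		CustomerId: "c1",
		TotalCents: 4900,
	}

	payload, err := proto.Marshal(msg) // these bytes are what get published on the bus
	if err != nil {
		log.Fatal(err)
	}

	// On the consumer side, the same schema decodes it back into a typed value.
	var got orderspb.OrderCreated
	if err := proto.Unmarshal(payload, &got); err != nil {
		log.Fatal(err)
	}
	fmt.Println(got.GetOrderId())
}
```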

Yeah. The protobuf compiler is – just to illustrate what Daniel has just said. gRPC is just one of so many uses for protobuf, especially the compiler. Granted, some of the Go code that the protobuf compiler generates is pretty ugly… But it’s also highly performant, and it’s also just really easy to work with, although it’s a bit verbose. But you can write plugins for a protobuf compiler. You can do basically anything based off a common dialect of your abstract syntax tree, where you can kind of go in there and assign properties, create functions, that sort of thing. It’s kind of just a toolkit that you can use to do anything with.

The reason I like using protobuf is because it works across several stacks, and languages, and that sort of thing. So on the backend we’re writing Go code, on the frontend we’re writing TypeScript. We can share the same definitions, and we don’t even necessarily have to share the same entirety of the definition, but just a sub-slice. So you don’t wanna expose data to the frontend? Alright, write that into the compiler rules. You can expose just enough that the frontend experience does what it needs to do, but you’re still working off the same data definitions, the same schema. So when you update the schema and then do all the code gen, then as long as you’re using proper semver and all that stuff, you’re gonna get that to propagate across all your services, including clients, and that sort of thing… And that’s just really – I don’t wanna say it’s magical, because it certainly isn’t… But it’s kind of magical.

I wanna add a quick shout-out, by the way, to IntelliJ up in here… Because the IntelliJ protobuf plugin is so great. It works so well. And that is a massive part of the reason why I absolutely love protobuf for this… Because it does so much for you, in regards to even includes of other protos. The schemas that we utilize at my company are fairly complex. There’s tons and tons of different types in there included back and forth, between different places… And as long as your editor is able to support that stuff, it is super-awesome. I would say that’s the one thing that you’re totally gonna miss with JSON, almost guaranteed. You’re not gonna be able to construct really complex schemas that are representative of what you actually need. You’re probably going to cut some corners. And then you’re going to get bit when you forgot to add a comma somewhere…

[36:03] I feel like this is also a good use case where you could bring in one of the other topics we’ve had a couple weeks ago of using CUE, and kind of bringing CUE and protobufs together to help give you some of those constraints that you were talking about, Steve, around “What is valid in our protocol buffer?” Because I’ve definitely found that protobufs is a great universal format, but it’s missing some of that constraint, or required fields, or anything like that… And there’s good reasons why those things are in there, but at some level you need to express these sorts of things.

I’m a really big fan of CUE. You’re referring to cuelang, right? I’m a huge fan of CUE. Unfortunately, I’ve just really never had the use case to switch over to it. Every time I look at the docs, I’m like “This is just a really nice way to build out rules without having to actually code them.” It’s pretty awesome.

So if we’re using something like protobuf or CUE or anything like that really, as far as I can tell, it’s mostly language-agnostic. Are there any benefits to using – does Go bring any benefits to using it in an event-driven system over some other language?

I’ll start with this, because I’m sure we’ve both got opinions here… The primary benefit for me is all the concurrency primitives that are in Go. In event-driven stuff you tend to have to accept an event, read something, fire off some one-off jobs to do something specific in a service and so on, and being able to spin up a goroutine that is cheap and not having to think about threading, or how it affects your instance or whatever, is pretty awesome. It comes down to this simple concept: you’re not gonna be able to launch a thousand threads in Java - or it’s gonna be very difficult to - and expect that everything’s gonna work great. Whereas in Go, you really don’t even think about the concept that “Oh man, I’m gonna have to spin up 500 of these.” So that is super-awesome.
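A small sketch of the goroutine-per-event pattern Daniel describes, with a buffered channel acting as a cheap semaphore so a burst of events cannot spin up unbounded work. The message source here is just a channel of strings standing in for a bus subscription.

```go
// Each message is handled in its own goroutine, capped at maxInFlight at a time.
package main

import (
	"fmt"
	"sync"
)

func consume(messages <-chan string, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup

	for msg := range messages {
		sem <- struct{}{} // blocks once maxInFlight handlers are running
		wg.Add(1)
		go func(m string) {
			defer wg.Done()
			defer func() { <-sem }()
			fmt.Println("handling", m) // real work: charge, update cache, emit event...
		}(msg)
	}
	wg.Wait()
}

func main() {
	msgs := make(chan string)
	go func() {
		for i := 0; i < 10; i++ {
			msgs <- fmt.Sprintf("event-%d", i)
		}
		close(msgs)
	}()
	consume(msgs, 3)
}
```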

And from the perspective of protobuf - protobuf, the syntax itself is so similar to Go, and the type system still works the same way as well. Even the dreaded [unintelligible 00:38:07.27] there are very few surprises in any of it. It works very similarly between how you would write Go and how you would actually write the schemas themselves. And the toolset itself just works extremely well with Golang. I think it’s a fantastic package for it.

Go is so incredibly simple to get bootstrapped into, to start developing in this language… It is fantastic for building tiny services quickly, regardless of the architecture, whether it’s event-driven or whatever else. It doesn’t take very much.

One last part is also the quality of the libraries. That’s a really big deal. The quality of the libraries in Go for event buses or message buses is really great. The quality is quite high. As a result of working on this tool called Plumber, we basically have had to interface with pretty much every event bus under the sun. And it is written in Go, and we basically use the library that has the most stars, and they always work really well, even in production as well.

Quick side note then - for Kafka there is this Segment library, and it works fantastic. It is absolutely an excellent library. And the same way for all the other main message buses. The ones which are fairly popular are the ones that work really well. That’s basically it.

Just to add on to that - what makes Go a really good language for an event-driven system, to echo what Daniel just said, is the concurrency primitives in the language itself. For me – others may feel this way, others may not, but for me, it’s really easy to transition my thinking from the event bus to the language, because I can think in goroutines, if that even makes any sense. I can reason about the events being accepted into the system a lot easier than if I were to have to create a thread and then manage its lifecycle… Whereas in a goroutine – you have to know how your goroutines end, obviously, but it’s a lot easier. It’s just a lot easier, and a lot less cognitive overhead to worry about. So for me at least, that’s one of the bigger – because I only have so much space available up here, and the less I have to think about managing threads, the better.

So a follow-up question then, I guess… A lot of our listeners tend to be people who are learning Go, or who are sometimes new to programming, or maybe they’re even expert developers, but they just haven’t touched something like this… So what would you recommend – do you have any recommendations you give to people if they want to build a starter project using event-driven? Because I’m guessing they probably don’t wanna go build a production system at first shot; maybe, but it seems scary…

Alright, I’m going to give it a shot. I’m gonna put some opinions here; this is definitely opinion time. This is not the definitive way to do this, but… Throughout my career, I have basically come to a preferred setup for this sort of stuff. I will talk about the production side of things first, just to keep that out there. Basically, I think a production setup would consist of Kafka, and that would be utilized for high throughput messaging. And I would utilize something else, like let’s say RabbitMQ, for facilitating actual interservice communication. And the reason being is that Kafka is incredibly fast, but it’s kind of a beast to try to set up, and even writing a consumer and a producer in Go for it might be complicated. Also, on top of it, it doesn’t have a whole lot of routing capabilities. You cannot say “I want this type of message, but I don’t want this type of message”, or route based on headers, or based on the routing key that is being used when writing this stuff… I would rather choose Rabbit for that sort of stuff.

For somebody who’s starting right out, basically just Rabbit. Rabbit and protobuf. That’s essentially all you would need to be able to start building this sort of a system out. I would take care though to learn about all the different capabilities that Rabbit actually has in regards to – complex routing mechanisms you can come up with, because it will 100% influence your software architecture itself. There is a lot of stuff that I’ve even learned several years after running Rabbit in production and realizing that “Oh my god, I could have done it this way, and that would have saved on so much complexity.” Maybe some sort of a delivery pattern where it only reaches one specific service under some specific conditions, whatever.
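As a hedged illustration of the Rabbit routing Daniel is referring to, here is a sketch using the github.com/rabbitmq/amqp091-go client: a topic exchange plus a binding means the billing service only ever sees billing.* messages. The URL, exchange, queue, and routing keys are placeholders.

```go
// Topic-exchange routing sketch: producers publish by routing key, and each
// service binds its own queue to the keys it cares about.
package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// One shared topic exchange for all events.
	if err := ch.ExchangeDeclare("events", "topic", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// The billing service gets its own queue, bound only to billing.* routing keys.
	q, err := ch.QueueDeclare("billing-service", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := ch.QueueBind(q.Name, "billing.*", "events", false, nil); err != nil {
		log.Fatal(err)
	}

	// A producer just publishes with a routing key; it never names a consumer.
	err = ch.PublishWithContext(context.Background(), "events", "billing.charged",
		false, false, amqp.Publishing{ContentType: "application/octet-stream", Body: []byte("order-42")})
	if err != nil {
		log.Fatal(err)
	}

	msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for d := range msgs {
		log.Printf("billing got %s: %s", d.RoutingKey, d.Body)
		d.Ack(false)
	}
}
```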

[43:41] Another piece I mentioned a little bit earlier is etcd. I am a massive fan of it. Obviously, for a cold cache, you could use something like Redis, but the idea is that you want to utilize components that are scalable, that hopefully are distributed, and hopefully can horizontally scale. etcd is one of them. It can totally scale horizontally, and it’s extremely resilient to latency issues as well… Especially now that it also has gRPC transports inside of it, it makes a fantastic use case for just using it as a caching layer as well. Maybe somebody who’s starting out doesn’t need a caching layer, but if you’re doing something production-level, then I would say it makes sense.
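A minimal sketch of using etcd as that service-local cache, via the go.etcd.io/etcd/client/v3 package. Endpoints and keys are placeholders; the point is just "save what you've handled, reload it when the service restarts".

```go
// etcd-as-cache sketch: persist a bit of consumer state, reload it on startup.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Record that we've handled an event for this customer.
	if _, err := cli.Put(ctx, "billing/last-handled/c1", "evt-123"); err != nil {
		log.Fatal(err)
	}

	// On startup, reload the state instead of replaying everything from scratch.
	resp, err := cli.Get(ctx, "billing/last-handled/c1")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```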

I would say… And this is probably - even though we’re not on that segment yet - not a very popular opinion… I would advise people learning event propagation specifically to not rely so much on all the stuff that a library gives you, let’s say, but just focus on the wire protocol. Think about how things look going over the wire, which is why Etcd is actually really – Etcd is super-lightweight. Etcd is fast, it’s lightweight, it’s very simple… MQTT is the same way, they’re very easy to understand. Because I think you need to understand exactly how messages are communicated across the wire. You don’t necessarily need to know that in order to write the system, but you should understand how they’re propagated, what the actual protocol looks like… And then you can kind of step back into bigger and bigger realms of functionality. But I really truly believe that starting out, you should try to just stay as close to the metal as possible, even if it’s a toy implementation. That’s how I learned, so I think others may be the same way. I know some people just like to use the tools…

I think that’s a great point, because MQTT is incredibly simple. There is nothing to it. I would agree with that, totally. Start out with just actually understanding this simple transaction of “I emitted a message, and somebody else consumed the message.” And then from there on you can go on further, or whatever.

My point about Rabbit really is just the fact that I’ve been in that boat before, where I have written something and I utilized Rabbit for something, and I had to architect around the problem and come up with a crazy design, only to later find out that “Oh, it supports this fantastic feature, and I could have routed messages in this manner, and I wouldn’t have needed all this crazy complexity around it.”

So that’s just a general word of caution, that sort of stuff needs to be figured out. There’s not a whole lot you need to learn about MQTT. You put a message in, you take it out. That’s all you need to learn about it, really.

Yeah, that’s it.

But if you use more complex message buses like Rabbit, then you should probably look into it, because there’s all kinds of different paradigms.

[46:40] to [46:56]

I’ll just go first… It’s probably unpopular. I think so at least. I am super, super against continuous deployment. I cannot stand the concept of it, and I’ve seen it break so many different things, at the worst possible times as well. So I am a huge proponent instead of owning your deployment, and owning your deploys, and considering a ticket truly done only when it is actually deployed into production. It shouldn’t be thrown over the wall, it should be your responsibility to actually deploy this thing. I’ve gone to great lengths to institute that at various different organizations, and then be generally hated, but that’s okay…

To make sure I’m understanding this correctly, you’re okay with continuous integration for testing and that sort of stuff, you just don’t like the deployment aspect.

[47:52] Yeah, totally. That’s exactly it. I think that absolutely there needs to be CI, and your CI pipeline should also build the artifacts and so on as well for additional testing and whatever, but ultimately, when you click that Merge button in GitHub, it can totally kick off some sort of CI that’s also now going to build the artifact and push it to Docker Hub or some GitHub registry, but the deployment part itself should be actually manual, to some degree. And I’m not talking about 15 steps [unintelligible 00:48:25.22] But that should ultimately just be essentially a Kubernetes deployment; you’re just like kubectl deploy yaml, basically, which is pointing to the latest Docker image, or something like that.

I don’t really have a strong opinion one way or the other here, so I can’t really…

[laughs] The thing that I’ve seen in the past, throughout my career, is that basically the deployment part has been basically treated like the SRE/DevOps part, or like a QA type of thing, where it’s like “Well, we created this functionality”, threw it over the wall to QA, and QA is going to figure it out and see what’s happening, throw it back over the wall, and then the devs are going to say “That’s not totally right”, throw it back over the wall to QA. Same exact thing here - you own the thing, you built it, you know how it works, you know how it should be interacting and how it should respond and so on, and you are the best person to see it to its conclusion, essentially.

I don’t know if this is an unpopular opinion as well, but basically, if it’s in master, or in main, or whatever, then that is what should be deployed. That is what should be actually running in dev and in production.

I forget what project it was, but I’ve actually seen – like, taking what you’ve just said there about master being what should be running in production, I’ve seen a couple of open source codebases that take the weirder approach of master doesn’t necessarily always compile… And basically, if it’s not a versioned release, it’s not really expected that – and that one always threw me off… Because I’m kind of in the same mindset as you, is master should be something that anybody could grab, and it should work, and we should be good to go.

Yes, one hundred percent. It should work. Nobody should be expected to go into tags and start looking for “Oh, let me find the – I know how your project works… It probably uses the stable tag.” God forbid they used the incorrect tag, which is actually minor numbers means unstable. No, I don’t need to figure any of that stuff out. I just wanna go grab master and make it work. Anyway…

I feel like part of that is also built into the Go community, since for so long master is what go get would get; so if master doesn’t work, then no one’s gonna use your library, because it’s broken all the time. And maybe that wasn’t such a bad thing; maybe that was actually a really good thing for us as a community, and it taught us some good skills.

I would definitely say that git branching and how people decide to do that is something that – I don’t know if I’ve ever quite been in companies that all do it exactly the same. Granted, I haven’t been to a ton of different companies and I’ve worked for myself a lot more, but I’ve definitely seen different companies all do it differently so it’s sometimes interesting to see the reasoning behind it.

But yeah, as far as continuous deployment goes, I have no strong opinions one way or the other. I feel like in most of my projects, a huge chunk of my career has been on very small teams, where you’re pretty much responsible for everything regardless… So I completely relate with that. And I haven’t really had a lot of experience on the other side, where you have the opportunity to throw it over the wall. Maybe if I had the chance, I’d love to try it.

Oh, totally. That’s exactly what happens. It gets thrown over the wall, and – I mean, usually, there is going to be a dev that still sees that “Oh, it shipped” or whatever. But something is going to break at the most inopportune time, at 2 AM on a Thursday, and it’s going to be some SRE dealing with this thing and not realizing that this particular dev is responsible for it, because they didn’t test that particular edge case, or that sort of a thing. So yeah, it’s an ownership thing.

[51:58] And I think that also, we’ve gotten very used to the concept now of automating everything… Because automating something even six or seven years ago, you needed to have a decent skillset to be able to automate stuff in the first place. You needed to be a programmer of some sort to be able to do that sort of stuff. And now, I think every SRE is expected to be able to write code, essentially. So we have this ability now to say “Well, everything can be automated, everywhere”, and I guess what I’m saying is “Not everything should be automated.”

This reminds me of – I swear I read an article about GitHub, how they use some sort of bot or something, where basically developers would pretty much deploy a branch to production, verify things worked, and then that’s when it would get merged into master, I think… But I remember reading some article about their deploy process, because it was – on one hand, it looked rather chaotic, because I’m like “They have a lot of engineers, and if they’re all deploying things and verifying they work, that sounds slightly scary…” But at the same time, the ownership aspect of it, I definitely agree with - having somebody actually verify their stuff when they deploy it.

I think ownership - that’s a fantastic word for it. It is proper production level service ownership. That’s really what it’s about. You are owning the service from the beginning to the end; you’re owning its dependencies, you’re owning everything for it, including even its CI process, and if it has a CD process, you should own that, too. You’re owning everything.

Alright. Steve, do you have an unpopular opinion you would like to share?

Oh, man… Yeah, I shared this with Dan yesterday, and he didn’t like it at all, so I’m pretty sure this is unpopular. I think the overuse of err as an error variable - I think it makes code harder to read. Now, there’s a lot of guard rails around that statement… Obviously, you shouldn’t be writing 200 and 300-line functions. I don’t know, I think errors should in some way describe what the error actually is, even if you put an N, or a g in front of it.

I don’t know, I see the reuse of err too much, and to me it just makes code a little harder to read. As a corollary to that, I think there is another part of the language that people don’t use that often, and that is naked braces. You just have two mustache braces, and then – to me, I look at the code and I can just totally read it a lot cleaner, even though it does some things with scope as well… It just makes things a lot easier to read. An old guy like me, with failing eyes, it’s really hard for me to figure out where that err began. I just can’t. So just give it a better name, that’s my opinion.

I will definitely say that the error variable is one of those things where I feel like as I’ve had more experience with it, it’d be weird not seeing that as the variable name. I’m not saying necessarily it’d be worse, but it would just throw me off at first, because I’m just so used to seeing that. And I definitely get that throughout the lifecycle of a program, it’s kind of hard to – basically, that’s the one that gets reused by far the most throughout the program, so I get that aspect of it… I don’t know, I’d almost have to see an alternative approach before I could even give any sort of feedback, I think. And you mentioning the naked braces - to me, I’ve never found a use for them that I liked. So I’d love to see how you use them sometime. Examples.

I mostly use them in tests, actually.

If I wanna create this reference variable at the top, or – sorry, I can do a lot of copy-pastes in a test, and not have to worry about redeclaring, or the compiler yelling at me for redeclaring a variable. So it’s more out of laziness. But it also makes it easier to read. It’s almost like a stanza in a poem - you can very clearly understand “Okay, this is a very specific block of functionality”, versus just several lines of [unintelligible 00:55:56.24]
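An illustrative example of the bare-block style Steve describes (not his exact code): each block gets its own scope, so a copy-pasted stanza in a test can redeclare the same variable names without the compiler complaining, and each case reads as its own unit.

```go
// "Naked braces" in a test: each bare block scopes its own variables.
package thing

import "testing"

func Add(a, b int) int { return a + b }

func TestAdd(t *testing.T) {
	{
		got := Add(1, 2)
		if got != 3 {
			t.Errorf("Add(1, 2) = %d, want 3", got)
		}
	}

	{
		got := Add(-1, 1) // same name, new scope: no redeclaration error
		if got != 0 {
			t.Errorf("Add(-1, 1) = %d, want 0", got)
		}
	}
}
```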

It doesn’t get used very often.

[56:00] I forgot that that functionality even exists. I’ve been doing Go for like seven years… Literally, I’m getting pings of like [unintelligible 00:56:05.29] And now… Now I know. There is somebody. They added this feature for you, Steve.

Just for me. Just for me.

I’ve heard people talk about it in the past, and every time, I’ve never really seen a good concrete example of it… So maybe I have to bug you later and ask for an example, Steve, that we can share with our audience…

Yeah, absolutely.

Because it’s definitely something I’d love to see more examples of. Because I’ve seen enough people mention it that I’m like “I’m curious how it’s helping them”, but for whatever reason, my brain just hasn’t quite made that mental leap to figure out where I might use it.

I think on the error thing as well - I like reusing err; I don’t like having to come up with new names for errors… But I also feel like errors have become a sort of half-fulfilled promise in Go, because Rob Pike was really big on like “Errors are values.” They are not something different, they are not something separate. That is why we don’t have exceptions, or this separate class of way of handling errors - they are values, and you should treat them as values. And we as a community just never really followed all the way through with that… We were like, “Okay, yeah, they’re values…”, they’re in there, but we don’t treat them like all of the other values. We still have this like “It’s a special value”, and I think that allows us to have this laziness around naming it, because it’s just like “Oh, this is a special thing. It’s fine if you just always call it err. You don’t have to call it something else.” But we wouldn’t do that with other types of values in Go, because it would wind up making our code harder to read.

So I think I half-agree with you, even though I’m like, I don’t see the problem with errors. I can see how that is a sort of annoying thing, and like an inconsistency in the code that we wind up writing.

And I admittedly – I mean, if I’m in a codebase where that’s being done, I just go with the flow… Well, most of the time I just go with the flow; because I don’t wanna be the one to push my preferences on other people, even though it does take me another hot second to figure out where that error started.

Another one, interestingly enough - context - is the opposite. To me, ctx - everybody uses ctx for context… To me, you should never use anything else, because just by the design of context, it’s just one wrapped context after another. So if you’re creating a logger context or trace context, just keep it… Because then it’ll just propagate through your code and you don’t have to worry about “Oh, is this a child context that doesn’t belong to this parent over here?” kind of thing. So the difference to me is kind of night and day, versus errors.

Context is something where I feel like there’s only one, and you generally get it from one place. It’s not like you’re getting context from two different sources, and dealing with – I at least have never seen code where you get two contexts and somehow have to manage both of them… Whereas the error does technically come from multiple different places. And Kris, when you were talking about naming them as variables, the one thing that popped to my head is I wonder if you would even be able to get code accepted, that doesn’t have err somewhere in it. Even if it was “thing err” or something, sort of describing what the error was… People would probably be fine with that. But if you just named it something else, I don’t know if people would let that fly… Whereas if you have a map of people, it’s not like “people map.” You just call it “people”, or something. It is a special case where people don’t want to not have the word err in it, even if they are willing to just not use err.

Yeah. Some of these idioms are just totally organic… That’s one of them. It’s what everybody does, and we’re stuck with it.

It’s a weird dichotomy too, because as a group, I’ve noticed that most software engineers don’t actually like dealing with conflicts and error handling. People would just rather ignore it… So they’re like “Oh, I’m just gonna relegate it over here, but I’m gonna be mad if you built it into the actual flow of everything.” It’s like, if errors weren’t there, people would be like “Where are all of the problems that go wrong?” and it’s like, “Well, there’s a lot of different classes of ways that things can go wrong, and you can define different ways of expressing that. It doesn’t have to be an error.” So despite people not wanting to deal with errors most of the time, they’re very set on having the errors be very visible, so they can very visibly ignore them. If you just kind of build them in other ways, they get very mad. [laughter]

[01:00:16.19] I’m even thinking about - like, if you know that your function only returns one specific type of error that’s more concrete than the built-in error, I still don’t know if I’ve ever really seen functions that return that, instead of just returning the built-in error as the type… When in reality, returning a more specific error type would actually be way more useful, but we don’t ever do that.

Yeah. And then you get in the business of doing a type switch on the error itself, and that’s a code smell to me, too. Sometimes you kind of have to do that, but it kind of defeats the purpose…

That’s a great idea. That’s exactly what I like to do… Type switches on errors!

Yeah. In general to me, unwrapping an interface into a concrete type is a code smell, pretty much across the board. That may be an unpopular opinion, too.

You’ve got me curious now though, if like I went through a codebase and actually refactored it to be like “I’m gonna return the most specific error type I can in every case”, if I have an interface that defines a more specific error, or something… Just to see how that would end up.

I do imagine one of the issues would be that you couldn’t reuse that error variable all the time, because you’d have different types being returned, and that’s probably one of the reasons why people dislike it, is because they wanna keep reusing that error variable.

Listen, friends… We’re in a friends circle here. Okay. “Not found” as an error is a typical error that applies to pretty much everything. It’s basically the equivalent of 404. There is no reason why there couldn’t be an error that is called “not found err”, and you could just return that and call it a day, and that’s it, and it’s always there. And thus, you do not have to do “Oh, string contains this and this, and it’s not found, and it’s a not found case…” Because sometimes you do that, to see what kind of an error it is. So I’m just saying… You don’t necessarily have to do switches maybe, but you could probably just do an if on something, just to emit a [unintelligible 01:02:15.22] or something.

Oh, yeah. errors.Is is my best friend. I use it all the time. errors.Is and errors.As - they are a fundamental part of my workflow now, so to take those away from me would be terrible.
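
A small sketch of that idea using errors.Is and errors.As instead of matching on error strings; the ErrNotFound sentinel and ValidationError type are illustrative names, not something from the show:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a sentinel error callers can match with errors.Is.
var ErrNotFound = errors.New("not found")

// ValidationError is a richer error type callers can extract with errors.As.
type ValidationError struct{ Field string }

func (e *ValidationError) Error() string { return "invalid field: " + e.Field }

func lookup(id string) error {
	if id == "" {
		return &ValidationError{Field: "id"}
	}
	// Wrap the sentinel with context; errors.Is still matches through the wrap.
	return fmt.Errorf("lookup %q: %w", id, ErrNotFound)
}

func main() {
	if err := lookup("42"); errors.Is(err, ErrNotFound) {
		fmt.Println("treat as a 404:", err)
	}
	var vErr *ValidationError
	if err := lookup(""); errors.As(err, &vErr) {
		fmt.Println("bad input on field:", vErr.Field)
	}
}
```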

Yeah, it definitely seems like this is an area where it’d be nice if we could advance in some way away from – it feels like a lot of our error handling and errors end up being simplistic, like stuff people think about after the fact… Because it’s like, yeah, your errors should be rich, they should give you lots of information about “What went wrong?” so you can handle the cases, or retry, or do whatever you need to do… And most of the time it is just like “Here is an opaque string that you can go parse” and “Maybe I’ve implemented something internally, but I’ve just exposed that out as a bunch of opaque string. If I changed those strings, then you’re kind of screwed.” But once again, it takes a lot of energy to think about your error flows and your error cases.

I get a lot of flak for this, but I love bitmasks. I love using bitmasks for that purpose, because you can stuff 64 error cases into a 64-bit unsigned integer. And it could be any of these, or all of them, and you could just check it using a bitwise operator at the end, and it’s very fast, it’s very efficient, and it’s readable. But for whatever reason, bitmasks aren’t as popular as they should be.
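
A hedged sketch of what that bitmask approach could look like - the flag names are invented, but setting bits with | and testing them with & is standard Go:

```go
package main

import "fmt"

// Each failure class gets one bit in a uint64, so a single value can
// record up to 64 independent error conditions at once.
const (
	FlagTimeout  uint64 = 1 << iota // 1
	FlagNotFound                    // 2
	FlagBadInput                    // 4
)

func main() {
	var failures uint64
	failures |= FlagTimeout  // record a timeout
	failures |= FlagBadInput // and a validation problem

	// Check individual bits with a bitwise AND.
	if failures&FlagTimeout != 0 {
		fmt.Println("a timeout happened")
	}
	if failures&FlagNotFound == 0 {
		fmt.Println("nothing was missing")
	}
}
```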

I believe I’ve worked with some sort of a higher-level error library - pretty much at every place, there’s somebody that comes in as an afterthought and creates a higher-level error lib. That’s what Kris was talking about. You would want to have all kinds of stuff indicating where it came from, some sort of statuses, how you should react, is it fatal, is it not fatal? All that sort of stuff. And it almost always sucks. Something is busted with it. Either it doesn’t log at the right time when you want it to, or it logs too much, or it does double logs, or it doesn’t send a span or a trace somewhere the way you need it to… I’m down to complain about it; I don’t have a solution. I just wanna complain about it and say that it’s not great, and I wish it was better, but I don’t know how.

[01:04:21.19] For whatever reason, it just seems like an area where it’s hard as a developer to justify spending too much time. You’re getting paid to make things work, and it sometimes doesn’t feel like you’re getting paid to make handling the errors easier. I mean, yes, you have to handle the errors, but for whatever reason, upper management cares about the working thing; they don’t really put as much thought into the “Oh, did you also handle all these error cases?” …until something goes wrong. Then they care. But prior to then, they don’t care so much.

I think that’s like an ethos thing that we probably need to fix at some point. I always relate it back to my history as a writer, and I’m like “Nobody likes a story where there’s no conflict and nothing goes wrong.” So the fact that we build our software and people want us to build our software in this realm of like “Don’t really think about the errors or the conflicts or the problems. Things will mostly work all the time. It’ll be okay”, it’s weird to me, because the important stuff in a movie you go watch, or a book you read, is all of that conflict and the errors and how you resolve them… And I think properly handling that is what makes the difference between really great software and mediocre software. Right now, people are okay with building a lot of mediocre software.

I think users are getting increasingly annoyed at that, because a lot of those “turn it off, turn it on again” bugs are because someone didn’t handle some error case somewhere, or didn’t understand the semantics properly, and now everything’s busted, and no one knows where the problem is, so we just restart the whole world.

I was just gonna say that I will not approve a PR which doesn’t check the error from json.Marshal. You should check all errors; it doesn’t matter. I understand that the only error case for json.Marshal is if it’s like an infinite math number, or something like that; I get it. But I do not know what’s gonna happen in the next version of Go, when Rob decides that actually “You know what - if the hosting contains squirrels, I’m going to actually err.” So check all the errors, check everything. That was a convoluted error case, but you get what I’m saying.
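
For reference, the kind of check being described; json.Marshal really can fail (for example on NaN/Inf floats or unsupported types like channels), so the error is worth handling:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"math"
)

func main() {
	// +Inf cannot be represented in JSON, so this Marshal call fails.
	payload := map[string]float64{"ratio": math.Inf(1)}

	b, err := json.Marshal(payload)
	if err != nil {
		log.Fatalf("encoding payload: %v", err)
	}
	fmt.Println(string(b))
}
```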

Yeah. Another thing in codebases that I’ve seen that is generally inconsistent - depending on who wrote the code - is how errors are actually propagated. In your typical RESTful service you have this entry point, you dig down into the service layer, into your data layer, and you may hit an error. Some people like to log that error right where it happens, some people like to propagate it the whole way back up the stack… It’s one of those things where you have to pick one and stick with it, because if you don’t, then your observability is gonna be terrible, and everything is gonna suck. You’ve gotta be consistent.
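
One common way to keep that consistent is to wrap errors with context as they travel up and log only once at the entry point; a rough sketch with made-up layer names:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errNoRows = errors.New("no rows")

// dataLayer returns the raw error without logging it.
func dataLayer(id string) error {
	return fmt.Errorf("query user %q: %w", id, errNoRows)
}

// serviceLayer adds its own context and keeps propagating.
func serviceLayer(id string) error {
	if err := dataLayer(id); err != nil {
		return fmt.Errorf("load profile: %w", err)
	}
	return nil
}

func main() {
	// The entry point is the single place that logs, so each failure
	// shows up exactly once, with the full chain of context attached.
	if err := serviceLayer("42"); err != nil {
		log.Printf("request failed: %v", err)
	}
}
```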

And your bill for your logging platform is going to skyrocket as well.

Thinking about the errors stuff and just how hard it is to handle them well, Kris, you were saying that how you resolve conflict is a big part of what differentiates great and mediocre software… And one of the first things that comes to mind is when you’re submitting a form, and some forms will come back and literally tell you every little thing that’s wrong, and it’s really easy to figure out why your form didn’t go through; other ones, you get a generic “This didn’t work” message, and you’re like “Well, that sucks.” But weirdly enough, we’ve set things up so that it’s much easier to do the second than the first. It’s much easier to have the generic “Something’s wrong” than to actually show somebody “Here are the things that went wrong.”

I even went and did that at one point - I sat down and I was like “Alright, I want this form to literally highlight every field that goes wrong”, based on the errors that are coming back from a Go server, and trying to figure out the right way to do that was not the easiest thing in the world.
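
A rough sketch of returning per-field errors from a Go handler as JSON so a frontend can highlight each field; the handler name, field names, and response shape are assumptions, not anything from the show:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// fieldErrors maps form field names to human-readable messages.
type fieldErrors map[string]string

func signupHandler(w http.ResponseWriter, r *http.Request) {
	errs := fieldErrors{}
	if r.FormValue("email") == "" {
		errs["email"] = "email is required"
	}
	if len(r.FormValue("password")) < 8 {
		errs["password"] = "password must be at least 8 characters"
	}
	if len(errs) > 0 {
		// Return every problem at once so the UI can highlight each field.
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusUnprocessableEntity)
		if err := json.NewEncoder(w).Encode(map[string]fieldErrors{"errors": errs}); err != nil {
			log.Printf("writing response: %v", err)
		}
		return
	}
	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/signup", signupHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```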

But I feel like sometimes people just try to sweep that under the security rug. “It’s a security vulnerability if you tell people what went wrong.” Have you thought this through though? Have you really thought this through? Because I don’t think you have.

It’s like the same thing people will say when you go to the password reset and it can’t tell you if that email address actually exists in the accounts… And they’re like “Oh, it’s a security thing”, and I’m like “I can go sign up with that email address and it’ll tell me if it’s there or not. So I can figure this out already; you’re not helping anybody. So just tell me.” And those ones just seem like the same type of thing where they’re worried about a security thing that is completely vulnerable in some other way, so it just does not matter.

People like to do selective security thinking. It’s like, “If it’s convenient for me, then I’ll say that it’s a security problem. But if it’s something I’ll have to go think about for an hour, or go fix - I don’t know… We don’t really need to care about that security case.”

We can do that later…

The password reset forms just frustrate me, when they’re like “If you have an account, we’ve sent you an email”, and I’m like “That is not helping me at all right now.” You’ve basically told me nothing.

I literally just went through that flow that you were just describing just a few days ago, of trying to reset the password somewhere, and I didn’t know actually if I had an account there or not… And I went through that whole thing. Because I don’t know if it did it. Was there a mail server problem somewhere? So I went and signed up with an account, and I was like “Nope. Definitely no account here”, so I just created a brand new account. So yeah, I agree 100%.

Alright, I think that’s about a wrap for this episode. Daniel, Steve, thank you for joining us.

Thank you for having us.

Yeah, thank you.

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚
