Go Time – Episode #304

Foundations of Go performance

with Miriah Peterson & Bryan Boreham


In the first of a multi-part series, Ian & Johnny are joined by Miriah Peterson & Bryan Boreham to peel back the first layer of the things that matter when it comes to the performance of your Go programs.


Sponsors

Fly.io: The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Notes & Links


Chapters

1 00:00 It's Go Time (Ship It!)
2 01:24 Meet the guests
3 03:18 Setting the stage
4 04:44 Is Go the problem?
5 09:30 Why is THIS the bottleneck?
6 10:37 The expectations
7 13:58 Mechanical sympathy
8 20:36 pprof
9 24:48 Knowing if the function is optimized
10 30:46 One change at a time
11 32:31 Memory definitions
12 37:51 Garbage collection
13 42:49 When to use pointers
14 49:18 Maps
15 52:52 Unpopular opinions!
16 53:16 Miriah's unpop
17 55:03 Bryan's unpop
18 57:17 Johnny's unpop
19 58:51 Outro (Ship It!)

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Well, hello, hello, hello. Welcome to this episode of Go Time. I have a very special show for you today. And before I dive into what we’re going to be talking about - so that’s the good stuff - I wanted to introduce you to my co-host, Ian Lopshire. Ian, why don’t you tell the people hi?

[laughs] That’s good enough.

Just following directions.

Just following directions. Ah, it’s about to be good. Alright, so I brought on a couple of guests today with me to talk about performance. Oh, by the way, I’m Johnny. [laughs] I always forget to introduce myself. Oh, you know my voice, you know exactly who I am. Anyways, I brought on a couple of guests on the show today to talk about performance; more specifically, foundations of Go performance. So before I introduce those folks, or let them introduce themselves, the idea for the show, which I’m hoping is one of multiple in a series, around Go and performance matters, is to take you from zero to hero. To basically provide some guidance for you, perhaps the beginner, intermediate or even in some cases advanced Go programmer who is curious about what is available to you, from a tooling standpoint, from a sort of idiomatic standpoint, what is available to you when it comes to Go and performance of your programs. And also what should you be thinking about and looking at when it comes to writing efficient Go programs. So to help me with that discussion, I have my first guest, Miriah Peterson. Miriah, why don’t you introduce yourself to the people?

Well, hello. I am Miriah Peterson. If I could say two words about myself, I would say “Don’t trust my title as a data ops engineer, because data people don’t use Go, and I do.” So…

We will get into that later, for sure. [laughs] Also joining us today is Bryan Boreham. Hopefully I’ve pronounced your name correctly. Introduce yourself to the people, Bryan.

Yeah, hi there. I’m Bryan Boreham. I do a lot of performance optimization in Go. I’ve been working in Go for nearly 10 years now, and right now I work at Grafana Labs. I’m a Prometheus maintainer, as well.

Nice, nice. See, I told you this was going to be an interesting show. I brought on people who know what they’re doing, and know what they’re talking about. So let us sort of dive in here. Before I do, let’s level-set a little bit. So let me set the stage to help sort of drive the discussion. Imagine you’re a developer on a team, and you maintain one of several components, several services, several executables, however they’re deployed, whether they run on the CLI, maybe you’re building developer tooling, or whether they run out on a cluster somewhere… You are in charge of some services. And your team lead comes to you and says “Hey, this particular component, depending on how much data we feed into it, behaves more slowly, more unpredictably than some of our other services, and we think we might have either a performance issue, a bottleneck, maybe it’s CPU, maybe it’s memory… We don’t know. So I’m tasking you to identify whatever the problem might be, and fix it.” So I’m gonna play that role. I’m gonna be asking questions. So I assume that I don’t know a whole lot about Go and Go performance optimization or anything like that. I’m gonna be putting myself in those shoes, and I’m gonna start asking - not dumb questions, but naive questions, perhaps; I’m gonna play the role of somebody who doesn’t know, and wants to know. Alright? How does that sound to everybody?

Okay…

That sounds good. I’m seeing thumbs up, I’m seeing head nods. Alright. So tell me, from the get-go - I know Go’s design principles are all about simplicity and efficiency… I know it’s a garbage-collected language. First of all, I might need some comparison as to what that even means for me; garbage collection versus what? Can we set the ground floor here as to how I should be thinking about Go when it comes to design principles with regards to performance? Can you provide some starting point for me to reason about Go’s philosophy with regards to performance? Why don’t we start with you, Bryan? Can you get us started?

I was gonna say, before you get into the Go details, if the first thing we know is this component is slow, the next thing we want to know is what is it doing? Is it slow because it’s chewing up a lot of CPU, or is it slow because it’s waiting for something else? So usually, the something else is something on the network, or disk, or something like that. So that’s kind of step one, before you actually get into the Go code, or the details of the code, is what is it doing? I sit here shouting that at the screen most days. But because it’s Go Time, and it’s not Network Time, or Disk Time, we could assume that we went through that step, and we decided that the thing is slow because it’s sitting in your Go code, it’s chewing up a lot of CPU… Now what do we do? And I’m gonna say the step after, when you get to there, profiling is the good step.

[06:14] So once I know that, okay, I’ve ruled out – I know my program is… Let’s assume it’s some sort of a service that listens on a port, and gets some traffic and whatnot… How should I even be thinking about Go’s design and sort of philosophy? How should I even – how should I approach this problem when it comes to Go performance? What is the first rule of thumb I should be thinking about?

For sure. I think we should underscore a little bit about what Bryan said though, something I go back to quite often… So I’ve been doing software for only six years, and since I started, I’ve only ever worked on cloud services. So there’s a lot of background skills and understanding of profiling that comes from “Oh, I’ve done things on Linux”, or “Oh, I’ve had experience on different kernels, in different constrained environments.” That is honestly a foundational skill; that is how we begin understanding a lot of these problems. Because I think we obfuscate it, and go to “Oh, cloud resources are cheap, so we can just do the cheap stuff.”

So there is a lot of background things we need to do before we dive into the Go. And then yeah, once we say – because Go is faster than a lot of programming languages. Like, I’ve worked with Python, at my day job we do a lot of Ruby, and Go is just always going to be faster than those two. And so “Is Go the problem?” should always be the first question, which is what Bryan was saying. And then the next question is “Okay, it is something with the code that I’ve written. What tools are out there?” And luckily, we did choose Go, and a lot of the tools are shipped with the standard library. So that’s where we start.

I do like that initial approach, and I too did a lot of, in my case Ruby prior to sort of switching to Go, and I’ve found that as I made that transition - and Ian, I’m not sure if you had a similar situation, but as I made my transition, I’ve found that even my naive Go programs were faster than my most optimized Ruby programs. And this is no knock on Ruby. There’s just a different sort of performance profile for compiled static languages like Go than you will get from a language like Ruby or Python, and these kinds of things… In most situations. I don’t want to make sort of a blanket statement. But yeah, in my case, the kinds of problems I was solving, I could get a whole lot more bang for the buck… Which is why as a programmer who’s sort of switching to Go and you’re like “Okay, I just wrote this program, and it is miles ahead of whatever I was doing before”, you can go far and for a long time without ever having to worry about performance optimization, or anything like that… Unless you find yourself in a situation - especially depending on the scale of problems you’re working with - where there is the need, like our sort of hypothetical situation here, where your boss comes to you and says “Hey, we noticed that through some analytics that we’re tracking at a cluster level that this particular service is a bottleneck.” So with that in mind, Ian, I’d be interested to know sort of what the next step you’d take, knowing what Miriah and Bryan just put forth?

Yeah, I think I’d start to ask questions about “Why is this the bottleneck? Is it dropping requests? Is it responding slowly? Is it crawling to a halt, and just stopping every once in a while?” I think a lot of people have this idea that performance means going fast. But really, it’s “Am I falling within constraints that I need to fall within?” So I think my first step is figuring out what those constraints are, and then we can start doing the optimization to get there.

[unintelligible 00:10:05.17] Yeah, you might want to characterize, is it every request that’s slow? Or is it a particular kind of request? Or is it those ones coming in from a particular kind of user? Maybe you can characterize this, maybe you can’t… But it certainly helps, and it helps even more if it’s repeatable. The worst kind of problems are the ones that happen once in a while, and you don’t know why, and you can’t trigger it yourself. So being able to figure out what causes it, and how to repeat it can be really important.

I definitely wanna touch on sort of – and I’m gonna paraphrase what you just said, Ian, in terms of sort of expectations. What is the expectation from a performance standpoint of this particular service? Because I think those expectations are sometimes and often in production environments from a resource standpoint translated into a certain CPU allocation, or a certain memory allocation. I had sort of a nasty surprise when I first started working with orchestration tooling, for the Dockers and these kinds of things, and realizing that “Oh, my program–”, the way I experience the output of my program is that it’d be working and doing things, and it would reach a certain threshold, and all of a sudden, it’s almost like it would just stop dead in the middle of trying to perform some sort of action. It would just stop. And I’d be trying to debug, why is it that this thing was running, and all of a sudden it just stopped? No stack trace, no error message… It just got killed off. And I realized that “Oh, crap, the orchestrator–”, whatever, whether you’re using ECS, or Kubernetes, or whatever it is, that environment allocated a certain CPU and memory resources for that particular service, and whenever I would go above that threshold for a certain amount of time, my process would get killed. I didn’t realize my services were just being killed off by that environment. And the moment I realized that, I’m like “Oh, snap.” To Ian’s point, you kind of have to know ahead of time, and work out with - whether it’s either working with the ops team, or maybe you are the ops team if you’re doing SRE stuff, and trying to understand “What is it that I need to accomplish ahead of time?” Which I think feeds into sort of knowing what your application or services is supposed to be doing with whatever data it’s supposed to be doing it… Which kind of leads to, if you’re going to have to deal with a certain volume of data, and you think you have a problem with how you’re handling it, what’s the next step? How do you go about trying to figure out where your problem is?

Well, I already said profiling. I’m gonna come back to that.

Alright…

Is there another answer besides profiling? That’s what I’m trying to find out here. I guess Johnny made an interesting point - something I never do, that I have known people to do, is sit down and do a calculation, “I expect to have a throughput of this many bytes, take up this much space.” I kind of have always been the brute-force type, run as much sanity testing as possible, and see when it breaks. But that leads to the other problem, of – so the conceptual knowledge isn’t always there. And I think you don’t want CPU and memory issues. Well, they tend to be things you don’t need to worry about until you are locked out of your machine and you have to worry about them. Like, that’s how they end up being. And that’s when you’re like “Okay, great. Now what do I do?” Well, I guess I should set up a profiler, but you’ve already hit the wall. You’ve already run into that threshold; you’ve already crashed. Now it’s a little bit late to be thinking about the “Oh, crap. The stuff that I never have to worry about now is the only thing I care about.” So yeah, Bryan - I agree…

Yeah… It is really difficult to kind of formulate in advance what the resource usage should be. But I think it’s something that comes with experience, and also… There’s a concept called mechanical sympathy. Have you heard of that?

[14:13] That was on my list for bringing up on this podcast.

Let’s get into it. Yeah.

I mean, I think it came from a Formula 1 racing driver, talking about like if you understand how the car works inside, then you can drive it better. And computers are a little bit like that. So there are certain things, like your CPU can do a billion things in a second, where each one of those things is something like adding two numbers, or something like that. And without needing to know any great details – and even that number of a billion is off by a bit. I’m just saying this – like, this gross, gross simplification. If you’re sitting there waiting for the computer to come back to you, then that means it took like half a second, or something like that. So the computer, the CPU at the real core of your computer could do half a billion things. So what the heck did you write in your code that made it to half a billion things? That’s one of my starting points; what is it doing? How on earth did it take that long for what I asked it to do?

If you have to process a billion things - which, if you do, I’m sorry for you… That’s kind of a hard problem… [laughter]

Welcome to my life.

[laughs] If you have to - and actually, Miriah, as a data person, the kinds of volumes of data you deal with, and the ways you’re dealing with them, are going to be perhaps different from somebody who’s writing sort of networked applications. Not that you’re not working with some network stuff as well, but I think if you’re doing things in – if you have unpredictable workloads, I should maybe put it that way… If you have unpredictable workloads, it’s going to be different than if you have a predictable set of data you know you have to deal with. You know you have to deal with five gigabytes’ worth of text processing… Perhaps your approach to writing your code is going to be a little different than if you have to write a network service that is supposed to deal with streaming data. It’s going to be different; at any one time you’re working with a subset of a larger sort of pool of data.

So I think this is a very interesting topic to bring back to Go… So yeah, I was doing research for a course I’ve been workshopping, trying to do more – anyway, the topic is Go and Data Engineering. And going into it kind of with this idea - I’ve been doing a lot of stuff with pprof, which is Go’s profiling tool, which we should circle back to… I’ve been doing a lot of stuff with pprof, trying to understand goroutines… Basically, at the orchestration layer, is it better to use a goroutine and have it in one program, or to scale out your programs? Should you do horizontal autoscaling in Kubernetes, or should you try doing workers internally? Those kinds of things. And part of the problem with that comes to essentially basic API calls, is how I’ll put it. So it doesn’t matter what program you’re building. Go uses the IO reader and IO writer for most of its – like, that is the interface behind all its orchestration, whether you’re connecting to a database, whether you’re connecting to a streaming service, whether you’re connecting to an API… It all goes back to that layer. So it doesn’t matter – like, if you’re writing a service that’s going to be handling hundreds of thousands of API calls, it’s the same system as if you’re handling hundreds of thousands of database writes. Or if you’re processing hundreds of thousands of data points, the latency in Go doesn’t tend to – unless you have built your program incorrectly, the latency and the slowness doesn’t tend to come from the actual data munging, or the manipulation of the data point. It comes from those connections to and from file systems, or to or from API points.

[18:12] So when you’re designing your service and you’re trying to optimize things in Go, that is the point where you tend to have memory issues. The memory issues often come from those connection points or those API points, the point where you’re sending data from one function to another, or one system to another; it doesn’t come from the actual horizontal scaling of the service. It is from those design choices at the IO read/write level; like, the “Oh crap, I forgot to close my writer. Oh crap, I have 15 connections open when I really only need one.” Those are the different things… And I feel like that’s the same no matter what program you’re building, because all of us are manipulating bytes at the end of the day… And those are the kinds of things that I find people tend to turn to Kubernetes logs, or other kinds of things, and not use some of the stuff that Go has built in to help us track that kind of stuff.

I would argue there are some silly mistakes you can make in the code as well, which again, tying in the reading, the IO… For example, when I’m teaching Go, one of the first things I tell folks is that if you have to work with files on disk, even if it’s a predictable size file, know that if you use, say, io.ReadAll, you’re putting every single byte of that file into your program’s memory. So that’s an easy mistake to make, and thinking that “Oh, I’m just gonna read all the lines in the memory, and maybe I’m iterating and doing some sort of transformation, or counting, whatever it is, on every line. I’m reading the entire file in memory.” Whereby the moment I explain that “Well, you want to take more of a streaming approach, not a ‘get everything in memory’ approach”, then they’re like “Oh, I can do that? I can read one line at a time, I can process things one line at a time?” It’s the difference between “Oh, let me just get everything in memory and work on it”, versus “Let me work on pieces or chunks at a time” kind of thing. And the moment folks realize “Oh, you can do that?”, that’s the mind-blown kind of thing that goes on. But if I don’t know about these libraries, or if I don’t know about these easy ways of shooting myself in the foot, these things tend to happen very, very often.
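
To make the difference concrete, here is a minimal sketch of the two approaches Johnny contrasts; the file name and per-line work are hypothetical stand-ins:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	// Streaming approach: only the current line (plus a small buffer)
	// lives in memory at any point.
	f, err := os.Open("big.log") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	lines := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		_ = scanner.Text() // transform, count, filter, etc., one line at a time
		lines++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("lines:", lines)

	// The "everything in memory" alternative would be a single call like
	// data, err := io.ReadAll(f), which loads every byte of the file at once.
}
```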

So when you are faced with these situations, this is where I think we start to introduce more of a tooling aspect of things. Pprof has been dropped a couple times here… Let’s talk about pprof. What is it? Why is it?

So the basic idea with a tool like pprof is you’re going to run your program, you’re going to have it doing its thing, and the profiler is going to interrupt like 100 times a second. And every time it interrupts, it’s going to take a note of what was executing. And then statistically, over a few seconds, or however long you leave it running for, you could – that’s why we call it a profile; you build up the numbers, you say “Well, it ran for 10 seconds, I interrupted 100 times a second, so I have 1000 counts in total. Of my 1000 counts, half of them were in this one function. And 10% was in this other function, 10% was in this other function.” So that’s the profile. The mechanical step of interrupting 100 times a second and making a note, and then adding up all those counts, and then you draw it on the screen. And I like a particular visualization called a flame graph, where - it’s not very good for a podcast. I’m waving my hands around, but it’s not helping. Honestly, go find videos where people show these things off if you’ve never seen it before. But basically, you’ve got rectangular bars, and the bigger the bar, the more time it spent in that function. So you just - you bring this up, and you look for the big bars. That’s it.
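
For a long-running service, the usual way to get at these profiles is the net/http/pprof package; a minimal sketch, with the listen address purely illustrative:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// Expose the profiler endpoints on a side port (illustrative address).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the real service keeps doing its work here ...
	select {}
}
```

With that in place, a command along the lines of `go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=10` collects a 10-second CPU profile and opens the web UI, which includes the flame graph view Bryan describes.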

[22:01] So that’s the first place you look… Certainly, they’re obvious markers; that doesn’t necessarily mean that it’s where you’re gonna get the most bang for the buck. Maybe you are doing a function that is already highly optimized, and perhaps it’s not the function itself that you need to optimize. Perhaps it’s somewhat how much data you’re sending it that you need to streamline somewhere else. So it gives you an obvious place to start and look, right?

Yeah, I mean, where might you go after you’ve – I mean, the basic process is to come up with some idea about how it could go faster. What are you going to do? You’re going to spot that you’re calculating, or doing some operation you don’t really need to do; just skip it. Or you’re going to find a cleverer way to do that thing. Or you’re going to realize you do the same thing multiple times, and save it for later, like caching… You’re going to come up with one of those techniques. So you have to sort of plot your way… That’s kind of what you’re looking for. And what you said, if it’s already highly optimized, then that’s – if someone’s been in there and applied all the techniques… You may still be able to do more things in parallel. Go is a great language for that. If you have the CPUs available, you could split things up, run them on different goroutines concurrently… I never remember what – parallel versus concurrency, there’s…

I play it safe. I say concurrent. I’ll play it safe, man… [laughs]

That’s a different podcast, I believe… [laughter]

Okay, okay. So the pprof tool gives you different knobs. And boy, are there many. But the ones I usually find interesting are sort of CPU profiling, which is different from memory profiling… There’s also a trace that you can apply, that more readily shows you what’s happening across goroutines and things of that nature… I think, Bryan, to your point, you have a starting point, your function; now you’re trying to figure out “What are my options? What can I do?” Identifying the embarrassingly parallel problems, the opportunities for concurrency that perhaps you’re not taking advantage of. I mean, it could simply be that, right? If there’s no dependency between running the function one time, and then the next time you run it again, if there’s no dependency between our data, perhaps that is a great opportunity for concurrency, right? Just launch some goroutines… If you know how many you need ahead of time, maybe use a wait group; if you don’t, maybe use some channels for some communication… Then you start – basically, you’re peeling back the layers to figure out “Where to now? Where to from there?” But is that – I sort of wanna go back to the whole notion of the function perhaps being already optimized… How do we know it’s been optimized? What other tool could we use to bring into play here to know that “Okay, this thing is going to perform consistently based on the data that it provided”?
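
A minimal sketch of that fan-out pattern with a wait group; processItem is a hypothetical stand-in for the independent work:

```go
package main

import (
	"fmt"
	"sync"
)

// processItem stands in for whatever independent work each input needs.
func processItem(n int) int { return n * n }

func main() {
	items := []int{1, 2, 3, 4, 5}
	results := make([]int, len(items))

	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i, item int) { // pass loop variables explicitly (safe on pre-1.22 Go too)
			defer wg.Done()
			results[i] = processItem(item) // each goroutine writes only its own slot
		}(i, item)
	}
	wg.Wait()
	fmt.Println(results)
}
```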

You’re being really, really pointed towards the benchmarking tool, I think… [laughter] But before we go to the benchmarking tool, I do want to say… I run the local – well, I don’t run, I used to run a local meetup here, but I prepared a talk on pprof for it, and I made some videos… And me personally, I could never remember what the pprof tools do. I cannot, for the life of me, remember what all of the things mean, what all the graphs mean… Go makes its own flame graph in pprof. So they have the traditional flame graph, and then they have what’s called the new flame graph. One of the things that tells you is “Is Go’s compiler optimizing for you? Did it inline functions for you?” So you can tell right off the bat, “Did the compiler optimize for me?” by looking at this flame graph. So step one, put pprof in your program. Step two, look at the flame graph and see “Is anything already inlined for me? Is the compiler already optimizing things for me?” And then I would say “Great, now I notice that I have this function…” You’re looking at the web graph, which is their other thing - again, if you want to know what it means, I have a video on YouTube; go look for it. I go through it in detail. It’s visual, that’s why.

[26:10] And you can see “Oh, this is an expensive function. It’s taking a lot of CPU, it’s taking a lot of memory… Great. I want to see if I can try and optimize this. Let me write a set of benchmarking tests, turn on that memory flag, and see if you can start dialing knobs to go fix that high problem function.” So I don’t think it’s one or the other. I do think the tools have to go in tandem.

Again, I always find benchmarking to be that last step, that last “Great, I’ve got a program, I’ve got it working… I’m trying to get it optimized… Maybe it’s not the most optimized it can be, but I’m trying to get it 80% of the way there. I’ve identified some heavy CPU functions, or some heavy memory functions”, which you can identify with pprof very easily. “Now let me pick these functions, turn on the right flags, and start writing benchmark tests and see if I can do that.” And the way benchmarking works is instead of just running it once, it by default runs it as many times as possible within that window, and it gives you that basic average performance over that many runs. And so you can see “Great. On average, I’m allocating 3000 bytes to this one function.” That would be an exceedingly huge problem, and you should fix that. As opposed to “Can I get that to be lower? Can I get it to not be maybe this–” And then “Oh, it’s taking 300 milliseconds per run, but the function next to it is executing in 20 nanoseconds per run?” Maybe I can trade off things so that this function is not as much of a bottleneck for that whole system.
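
A sketch of what that looks like with the testing package; buildReport is a hypothetical hot function, and b.ReportAllocs (or the -benchmem flag) is presumably the memory switch Miriah refers to:

```go
package report

import "testing"

// buildReport is a hypothetical stand-in for a hot function found via pprof.
func buildReport(rows []string) string {
	out := ""
	for _, r := range rows {
		out += r + "\n" // string concatenation allocates on every iteration
	}
	return out
}

func BenchmarkBuildReport(b *testing.B) {
	rows := []string{"alpha", "beta", "gamma"}
	b.ReportAllocs() // adds B/op and allocs/op next to ns/op in the output
	for i := 0; i < b.N; i++ {
		buildReport(rows)
	}
}
```

Running `go test -bench=BuildReport -benchmem` then prints the average time per run alongside bytes and allocations per operation.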

So when I teach on benchmarking, those are the two things I always say: look at how long it takes to run your function, and then look at how many bytes you’re allocating per system, and those are the first knobs to start tweaking, in my opinion. I don’t do it that much professionally. I very rarely get to the point where I am needing to prove optimization at the benchmark. I think I’ve never worked at a company big enough where they start losing money based on the speed of my things… I always am at the company that’s like “We need to move our infrastructure from here to here”, and so I’ve always got the free budget to build new things. Anyway. But that’s still where I point people.

Aren’t you lucky?

I know. And then when they tell me to start fixing things, I’m like –

“I’m out of here.” [laughs]

“New job. Let’s go find a new greenfield project.” But again, that’s why I was so interested – maybe I didn’t use it professionally, but now I need to learn it well enough to teach it, because I do think it is that important. So there’s your benchmarking plug, Johnny.

Nice. Nice. Nice. Ian, anything to add to the whole benchmark discussion?

Sorry, I’ve kind of lost my train of thought… Not really –

You need to benchmark that.

Definitely… I like the idea that they go hand in hand. So you’ve found your problem, you used pprof, you’ve found this as allocating a lot, this is using a lot of CPU cycles… And the next step is writing that benchmark, so you can tell if you’ve actually made a difference. I like the idea of those going hand in hand. They have to, right?

Yeah, it makes it repeatable. We kind of started our fictitious example of measuring something in production, measuring something that really happened… But you may not be able to recreate that so easily, and you don’t really want to mess around in production too much… So recreating that particular thing as a little standalone program - that’s a benchmark. And then being able to run the thing, as Miriah said, again and again and again, so we can get a kind of average timing out of it… And the Go testing framework does that for you.

[30:00] I’ve done a little bit of teaching this stuff as well, and I think it’s kind of half and half. Some people have seen benchmarks, love them, do them all the time… And half the world has basically never touched them in Go. Maybe they’ve scrolled past them in a file once or twice, but they’ve never really looked at it.

So I would certainly encourage – it’s a really simple pattern. You just basically write a loop that will run the thing you’re interested in over and over again… And the more complicated part is setting up the test conditions. But that’s the same for any unit test. It’s just a unit test where you can run the same thing over and over again. And now you can really iterate, you can start playing; you can try something out, run the benchmark… Did it go faster, or did it not? Try something else.

Change one thing at a time. That’s another big tip. When you’re excited, you have all the ideas. “I’m going to code them all up. It’s gonna go way faster.” But do one at a time. Change one thing, measure again. Change one thing, measure again. That’s the way to actually figure out what’s going on.

And sometimes changing that one thing may mean taking it all the way into production to now try and get a hopefully different outcome, right?

Yeah, it could be. I mean, it depends how good your benchmark is. In some cases it’s really, really hard to emulate the true production conditions. There’s also a bunch of things to watch out for. You know how I was saying your processor can do a billion things in a second? As long as you don’t use more than like a few tens of kilobytes of memory. The minute you go up past your L1 cache, the whole thing is going to slow down by a factor of 10. When you go past your L2 cache, it’s going to slow down another factor of 10. So you need to be careful when you try and recreate the problem in your benchmark that you don’t make it too small - so small that it’s unnaturally fitting in the really, really tight cache of the processor…

This is one of these mechanical sympathy things. It’s a huge amount of knowledge to kind of learn about processor architectures, and different layers of caching, and so on. I don’t think everyone has to learn that. But a little bit of – certainly just the fact that you don’t want your… You want your benchmark to be at a realistic size. You don’t want it to be so big it takes a day to run, and you don’t want it to be so small that it’s unrealistically fast.

So speaking of caching and memory, the whole notion of optimizing for memory usage, that whole thing has its own sort of lingo. When I was first learning about heaps, and stack, and allocation, and these things, I was like “Okay, how much of this do I have to worry about if I’m just going about running my programs? Do I have to worry about declaring variables, keeping them around? Go is garbage collected… Isn’t it gonna just do its magical thing?” What are some of the primitives for having this discussion around memory optimization? Can we have some definitions?

No. [laughter] In Go’s docs it literally says “You don’t need to know what’s written to the stack versus the heap.” That’s straight off of Go’s website, go.dev. And I’ve literally pulled that out and used it in slides, and I’m like “It doesn’t matter technically, but I do think it does matter conceptually”, because it helps you choose things like “Oh, maybe I should use a pointer here. Maybe I shouldn’t use a pointer here.” It’s good to know what types Go is using a pointer behind the scenes for. Strings, for example, always are using pointers behind the scenes… So it’s a lot easier to share strings around than it might be to share a slice of bytes, or some weird things. But most of the time, it won’t matter if the garbage collector is going to clean it up. But when I do think it does matter, again, is when you do something stupid to stop the garbage collector… Which people do all the time. And then the other time it does matter is when I think you see people starting to bring in patterns from other languages.

[34:28] We joke about the Java Go developers all the time, and there are things that they bring in that may make the code work better on the JVM, but it doesn’t work with Go’s compiler, or Go’s typing system. Bill Kennedy’s – I bought his notebook, The Ultimate Go Notebook, and it’s like my favorite Go book, because it has all the weird tips I don’t want, but one of the things he has in bold is “Don’t use getters and setters.” And every time I say that, everybody who’s ever worked in Java is like “Why?! We need this!” I’m like, “There are times when you want to use them.” If you have something that is a private – you know, you have a method to access a private type. Yeah, that’s a good use case for getter and setter, but if it’s public, the Go compiler can inline any call you make to that, and optimize it for you, as opposed to if you had made a function that is then adding more bytes to your stack, that is doing this, that and the other… Like, every function is going to take more memory onto your stack, and it’s going to require another call through the interface to do all of these things… And the compiler is supposed to be fast; it’s only inlining so much before it’s like “This is breaking the threshold of speed in compilation.”

So we’re supposed to be building Go in a certain way according to these idioms that help the compiler make it faster. And so those are the kinds of things that I think “Yeah, it doesn’t matter”, but if we know those things, then we start to understand why the idioms are the way they are, and why this is good code versus what we say is bad code, or Java code. Not that Java is bad, it’s just that writing Java-like code in Go might make or does make a less efficient system, because it’s a different compiler, it’s a different system, different typing signature… So again, it doesn’t matter, but it can help us make better code.

I feel like we should call that “compiler sympathy”, or something like that…
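
One way to peek at those compiler decisions is the -m flag, which prints inlining and escape-analysis results; a small sketch, with the diagnostic output shown only as an illustration (exact messages vary by Go version):

```go
// Package counter is a tiny example for inspecting compiler decisions.
package counter

// Counter wraps a single value; Value is the kind of small accessor the
// compiler can typically inline at its call sites.
type Counter struct{ n int }

func (c *Counter) Value() int { return c.n }

// Building with the -m flag prints inlining and escape-analysis decisions:
//
//	go build -gcflags=-m ./...
//	./counter.go:8:6: can inline (*Counter).Value
//
// The exact wording and positions above are illustrative, not guaranteed.
```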

Yeah… I think it’s worth trying to understand that those two things, the stack and the heap, are fundamentally about lifetime. If you enter a piece of code, a function, and you have some data, the data only lives as long as that function. Then the Go system as a whole - the compiler and runtime work together for memory management. So if the lifetime of your data is within a function, the compiler can clean it up really quickly. And that’s the idea of a stack. Every time we call a function, data kind of piles on top of whatever we were using before, piles up in a stack. And when we leave that function, we can clear it all out. It’s just basically subtracting a number.

The heap, on the other hand, is where anything goes where we don’t know the lifetime. So what happens then is the things that you’re still using, and the things that you don’t need, are all on the heap. And the things you don’t need anymore, the things you actually no longer have a reference to - that’s the garbage. But the way the system works, it just lets it all pile up until a certain point, when it does garbage collection. And that is the performance thing, really.

[37:50] So what is garbage collection? When the Go runtime starts garbage collection, it starts from the places in your program that can access data. So that’s all your global variables, all the pointers on your local variables, things like that. It makes a list of those and it says “Okay, what does this pointer visit? Okay, that thing’s needed. I can still access that. Does it have any pointers? Okay, I’m going to visit every one of them. That data is needed. And when I got there, does it have any pointers?” It’s an enormously – in a big program, or any size of program really, but it’s a lot of work… It’s a lot of work to follow all of those pointers. And that’s the thing that’s going to slow you down. That’s why we talk about memory being important in Go for performance.

What drives the cost of garbage collection is two things. First of all, how many pointers have you got? And that’s basically a function of how big is the memory you actually need. So if your whole program runs in 16k, then that’s not very many pointers. My programs tend to run in a couple of gigabytes. So there’s hundreds of thousands, millions of pointers, and they take an appreciable amount of time to follow all the pointers.

So the number of pointers, which is basically a function of how big your heap is, is one factor. And then how fast are you leaving stuff lying around? How fast are you generating new garbage? Those two things multiplied together give you the cost of garbage collection. And both of those things are driven by how much memory are you using. The first one is how much memory do you need and you’re actually using, and the second one is how much garbage do you create and throw away?
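
Both of those factors are visible from the runtime itself, for example via runtime.MemStats (or by running with GODEBUG=gctrace=1); a rough sketch:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Printf("live heap:       %d KiB\n", m.HeapAlloc/1024)  // roughly the memory still in use
	fmt.Printf("total allocated: %d KiB\n", m.TotalAlloc/1024) // everything ever allocated, i.e. the churn
	fmt.Printf("GC cycles:       %d\n", m.NumGC)
	fmt.Printf("total GC pause:  %s\n", time.Duration(m.PauseTotalNs)) // cumulative stop-the-world time
}
```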

And every time that cleanup runs, your program is effectively stopped.

Garbage collection is a stop the world operation, yes. But I’ve never had it be noticeable. It doesn’t stop the runtime, right?

Not since Go 1.5. There are two phases to garbage collection. The mark operation – it’s called mark and sweep. Mark is the one I was going on about, where we follow all the pointers. That can carry on concurrently; let’s use that word again. In parallel with. I don’t know which one is which. Hopefully, we’ll get people tweeting at us about this.

Probably concurrently, because it would get mad if it hits a lock. So I agree. Concurrently. [laughter]

So the mark phase can and will proceed at the same time as all the rest of your program. When we’re done marking, when we know which memory we need and which memory we don’t need, is the sweep phase, where we basically take all the garbage and turn it into free memory. That’s a stop the world operation. But it’s really, really short. That’s microseconds. Whereas for a gigabyte heap, the mark operation can run into seconds.

I’m gonna steal that explanation now. I’ve always explained garbage collection as only that sweep phase… I always say, it’s the marking for garbage collection which happens in parallel, and then garbage collection. I’m now going to change the terms, whether that’s more or less confusing, I don’t know, but it is more correct. And that’s what matters. So thank you, Bryan, for teaching me today.

Yeah, my pleasure. So before – right about 1.5 the whole thing was [unintelligible 00:41:14.15] and people were quite upset about that… But it now runs in parallel. And you can actually see it on your profile. On your CPU profile you will see the garbage collector running. It’s got kind of funky names. It doesn’t just sort of say garbage collector in big letters, but there are functions like mallocgc. Usually they have gc in the name somewhere that you can look for in your CPU profile. But at the risk of getting really complicated, it’s not just the time to do the marking and sweeping that you need to worry about… Because this whole process of [unintelligible 00:41:55.23] through all of memory is going to kick things out of your CPU cache.

[42:03] I was saying the little bit right in the middle of your CPU is the only bit that goes at top speed… The process of marking, of scanning all of your data to figure out which bits you’re needing and which bits you don’t - that kicks things out of the cache, because it kind of goes and visits everything. So garbage collection is slowing down all of the rest of your program, beyond what shows up as garbage collection in the profile.

I mean, I’ll put it another way… If you look at your profile and you decide garbage collection is like 20% of your whole CPU, and then you halve the amount of garbage collection, I would expect your program to go 40% faster. Because it’s like a multiple of what it – what you can see in the measurement, take that and double it. That kind of factor.

So as a programmer – so I could take Miriah’s approach and basically say “You know what? It doesn’t matter if I’m using–” Well, I’m not sure if what – I’m gonna paraphrase; correct me if I’m wrong, Miriah. You can call me out if I get it wrong. But you’re saying “Don’t worry so much about whether you’re using a pointer here, or a value there, or something… Let the GC worry about those things. Or perhaps write your program, get it working, and then worry about whether you’ve got some escaping memory from calling functions, or anything like that.” Like, how much should you be paying attention to the lifetime, as Bryan puts it, the lifetime of your allocations? What is the impact of that? How should you be going about it? So I still get people who say “Well, when should I use a pointer? When should I pass by value? What do I do, when? Does it matter from a performance standpoint? Is it about the semantics? Do I wanna return nil instead of returning a zero value for something? How should I go about this?”

I always say, I guess, what is like best practices. I’m always like “Default to not using a pointer.” And these are the cases where it’s just the exception to that rule. One of them is “Oh, you’re using an interface, so you have to have a type that implements that interface.” Those are the kinds of things… Like, I don’t know, I start to get really picky about writing good software, versus writing good Go. If it starts to matter so much, if you get to the point where the types you use matter to the point that it’s affecting your garbage collection speed, maybe Go isn’t the right choice of language for you. Go use Rust, that doesn’t have a garbage collector and makes you think about that.

I don’t want to run into that problem. I use Go. And I say garbage collector helped me. I’m going to not use pointers till I hit that point where the pointer makes sense. I’m going to use slices until the point where the array makes sense. I feel like Go was built to abstract a lot of that low-level stuff away on purpose, and when we hit the point where we need that knowledge, I don’t know if Go is – I’m probably being a little bit contrarian here, but I just use Go for what it’s good at. And Go is good at being a very simple language, that does a lot for you. You should still know how it works. Still use pprof, still do your benchmarking, still know how things work and write good Go; good, performant Go. But once you get past that point, maybe you should look at Rust, I don’t know. Again, I’ve never hit that point; I still am living in Go land. So that’s just my thought.

But we went through that whole era – I’m sure you all remember the era we went through in the Go community where we were sort of… We were all anti-allocation. I can’t tell you how many benchmarks I saw around HTTP routers alone. About “Oh, this one is zero allocation this. This one has no –” I mean, we went through that phase…

But then why do we have a garbage collector? I don’t know… Just write good Go, and good Go uses the tools provided in the Go –

I’ve got an example…

Go ahead, Bryan. You’ve done more than me.

[46:11] …which is hopefully neutral on this. It’s a pattern a lot of people will –

I’m not trying to draw lines in the sand, I’m just saying that’s just my experience. Go ahead, Bryan.

So imagine you’ve decided in your program you’re going to make a slice, and you’re going to put some things in it. And it’s very nice in Go you can say append. So there’s basically two ways to do this. One is to start with a completely empty slice, and just append to it. Let’s say you’re going to put 100 things in the slice. So you go append, append, append, append, append… And under the covers, what’s going to happen is it starts off with no memory allocated. There’s nothing in the slice, there’s nothing allocated. And you put something in, so it allocates one spot. And then you put another one in and it says “Okay, not enough room. This time I’m going to make some more.” And I don’t want to get into the specifics, but let’s say next time it makes room for three. And then you fill those up, and it’s going to make room for eight. Don’t worry about the numbers. The point is, all of those smaller slices that we don’t want anymore are garbage. And if we knew we were going to put 100 in the slice, we can make one call at the beginning… Say make the slice with room for 100. And there’s gonna be zero garbage just in terms of this append operation.

So I think that’s hopefully easy enough to understand on an audio-only medium… But there are some really simple patterns, like pre-allocate your data structures to a decent size… So if there’s gonna be 100 things in it, make it 100. If you know it’s roughly 100, make it 120, something like that. Because the cost of even one throw-away memory allocation is going to be more than 10% or 20% of extra surplus space.

If you don’t know whether it’s 100 or a million, then sure, you have to go through some amount of wastage… But try and get close to the right size, and err on the high side. And that’s gonna save you some performance. There’s a bunch of things like that.
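
A minimal sketch of the two append strategies Bryan describes:

```go
package main

import "fmt"

func main() {
	const n = 100

	// Growing from empty: append re-allocates and copies as capacity runs out,
	// leaving the smaller backing arrays behind as garbage.
	var grown []int
	for i := 0; i < n; i++ {
		grown = append(grown, i)
	}

	// Pre-allocated: one allocation up front, no throw-away backing arrays.
	preallocated := make([]int, 0, n) // length 0, capacity 100
	for i := 0; i < n; i++ {
		preallocated = append(preallocated, i)
	}

	fmt.Println(len(grown), cap(preallocated))
}
```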

I agree. That’s an example of writing good Go. I think that that’s exactly how you should handle… If you can predict, if you’re gonna say “Well, I know this. I know how this is supposed to behave”, you should always be saying “Yeah, I want this to be this many allocated spaces”, because… Yeah, I agree. Slices are fun under the hood.

They come at a small cost, generally small, but there is a… Yeah, it’s a great piece of advice. And along those lines, as we start to wrap up, is there any other sort of obvious things that as a Go programmer, who’s not necessarily trying to do some premature optimization, but sort of your day to day, the pre-allocation, if you know the size of an array, of a slice you’re gonna need ahead of time, make it in that allocation of that size. That seems like an obvious thing that – you know, not necessarily in the spirit of optimization, but of writing good Go, as you say, Miriah. Are there other obvious things like that we can recommend as a best practice?

Well, maps work the same way. Slices and maps you can make them with a size. Maps are a lot more complicated, but same basic principle; if you know there’s going to be 1,000 things in there, tell the runtime when you make it to leave room for 1,000, and it’s all gonna work a lot better.
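
The same size-hint idea for maps, as a small sketch:

```go
package main

import "fmt"

func main() {
	const expected = 1000

	// The second argument to make is a size hint: the runtime reserves enough
	// space up front so the map doesn't have to grow (and rehash) as it fills.
	m := make(map[string]int, expected)

	for i := 0; i < expected; i++ {
		m[fmt.Sprintf("key-%d", i)] = i
	}
	fmt.Println(len(m))
}
```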

A lot of the tricks for saving memory allocation are obscure, unfortunately. Once you get beyond ones like that, like calling make at the top of your loop, designing interfaces – when you call into something that’s going to give you back a slice, it can be nice if you can pass the destination in, because you might have a slice the right size to pass in… And there’s a few APIs like that in the Go standard library. But it does complicate things…

[50:08] I was gonna say, make it elegant if you can before you start making it fast. And really only break that rule of make it elegant when you really, really need to. Usually, elegant code does go fast anyway. But I really don’t want people to kind of start bending their programs out of shape just because they think it’s gonna save a few bytes, or save a few nanoseconds. Look for the 80/20 rule. Usually, most of the time is in one place, or a few places. And yeah, you may need to get a little tricksy in those places.

I would say most of my tips are pretty self-explanatory… Follow the idioms, be careful with your pointers… I’d say don’t use them until you need them. The places you need them are obvious – I think Bryan gave a great point… If you’re going to be populating something, and basically you need to manipulate the data in something and change it, it’s much better to pass in the pointer and manipulate the data inside that object than it is to pass something in and return something else. So those are the kinds of things that I think are good Go. But I would say, especially if you’re new, I would get a really good linter, and all good linters have things like “Did you close the SQL connection? Did you close your HTTP connection? Did you close your file connection?” Because those are the things that I forget, that lead to memory problems… And they’re pretty simple. They don’t always get caught in code reviews either.

So linters, especially if you get really strict on them, can help you make that nice code, good Go code, that will prevent, I think, a lot of stupid memory stuff, or stupid CPU stuff… And then when it matters, it’s a real problem, and not just somebody forgetting to close their file. You know, those are the kinds of things that I think everybody should start with, and then move on from there. Anyway, it saved my bottom a couple times.
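
A sketch of the kind of leak those linters catch; the URL is hypothetical, and bodyclose is one linter that flags a missing Close on an HTTP response body:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func fetch(url string) (int64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	// Without this defer, the response body (and its connection) leaks;
	// linters such as bodyclose flag exactly this omission.
	defer resp.Body.Close()

	return io.Copy(io.Discard, resp.Body)
}

func main() {
	n, err := fetch("https://example.com/") // hypothetical URL
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", n, "bytes")
}
```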

Nice. Anything to add, Ian, before we go to unpop?

The way I think about optimization, at least early on, is just do less work. So the idea there is if you’re gonna use a regular expression, compile that once, reuse it. If you’re going to use a template, compile that once, reuse it. I see it all the time, where a regular expression is defined in a handler, and it gets compiled every single time. So even before you profile, or benchmark, if you see things where you’re doing a lot of work you don’t need, that’s the lowest-hanging fruit. Do less work.
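
A small sketch of Ian's regular-expression example: compiling once at package level versus recompiling on every call (the pattern and function names are made up for illustration):

```go
package main

import (
	"fmt"
	"regexp"
)

// Compiled once at package init and reused by every call.
var idPattern = regexp.MustCompile(`^[a-z]+-\d+$`)

func validate(id string) bool {
	return idPattern.MatchString(id)
}

// Wasteful variant: recompiles the same pattern on every request.
func validateSlow(id string) bool {
	re := regexp.MustCompile(`^[a-z]+-\d+$`)
	return re.MatchString(id)
}

func main() {
	fmt.Println(validate("user-42"), validateSlow("user-42"))
}
```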

Alright, let me hear it. Lay it on me. Who’s got an unpop?

I’ll go first.

I have two. The first one is “Chocolate is gross, because I don’t like chocolate.”

[laughs] Okay.

The second one - I may have hinted at this earlier. I think Python is a bad language for data engineering. I think it’s a great language for data analysts, like data analysis or data science… But for data engineering itself, Python is a slow, bloated language, that just wraps other languages. So why are we using Python and not the languages underneath? And people have already fought me on this, so I know it’s unpopular.

No way… [laughs] Wow, spicy.

I, like, want that to be true…

[54:00] Want? It is true. This was in unpopular facts, sorry.

I work in Go all day, every day… And then for the data stuff, I have to pop into Python, and remember how all that works… So I would love to not do that. But like the infrastructure, all of that - it’s just so much work to do it anywhere else.

And then the other thing where you see Python a lot is Python - sometimes the Python SDK isn’t actually doing code, it’s just setting up a different service. And I’m like, we’ve already solved this config thing… Isn’t that why we do everything in YAML, in your DevOps world, is because that’s much better for configs than Python? So anyway, Python just doesn’t make sense in data engineering, in my opinion.

Wow. We shall see how much flack you get for that one when we pull out the survey…

We’ll see how many data engineers listen to this podcast. That’s what we’ll see.

Yeah, we might actually get a whole new audience from this unpopular opinion… [laughs] How about you, Bryan? You’ve got something for me?

Yeah… It’s a lot more niche than chocolate. So my unpopular opinion is within the Prometheus query language, PromQL - so I don’t know whether any of you are familiar with that…

Oh, I am.

…there’s a couple of similar functions - rate and irate. So my opinion is never use irate.

Tell us why.

So the difference is – so you give it a window, right? How long are you looking over? And basically as you zoom out, you’re going to look over a bigger window. So you’re zoomed in, you’re looking at a rate over one minute, say, and then you zoom out, it’s five minutes, and you zoom out a lot, it’s an hour maybe… Irate only considers the last two points in the window. So that’s why you should never use it, because as you zoom out, it’s discarding more and more of your data. It gives the strong possibility you’re going to get artifacts. Suppose you’re looking at a five-minute window; you’re only looking at the last two points in every five minutes. Suppose you’ve got a big spike that happens every five minutes - it’s gonna look like the thing is just solid.

So it has its uses, but they are so few and far between you have to be so much of an expert in exactly what’s going on that I just say “Never use irate.”

It actually kind of makes sense to me, so I’m not sure how unpopular that will be… But I’m sure there are a couple out there who think it will be unpopular.

Yeah, it gets used a lot. I’ve come across it in all kinds of other people’s dashboards, and so on.

What does the i stand for?

I think it’s instantaneous.

That sounds like it’s a very specific use case kind of tool.

Yeah, I think people like it because it makes your charts – because rate kind of smooths things out as you zoom out… And irate leaves spikes in the data. It’s more kind of energetic, and there’s more going on if you use irate.

So I think this might be purely a result of how you consume and visualize the data…

Interesting. Cool. I have one to take us home… So not sure if y’all know, but recently, Apple came out with an interesting piece of open source, which - that’s not something I see very often; Apple open source. That’s not something you hear very often. But they recently came out with a piece of open source software that I actually am finding to be interesting. So they came up with this configuration programming language thing called Pkl (pronounced “pickle”)… And dare I say, Pkl is better than JSON and YAML put together.

How’s it compare to CUE?

That was gonna be my question.

That’s the first comparison I made in my mind. I’m like “Hmm… CUE lang.” So I’m going to compare those two and report back. As a matter of fact, I’m actually trying to put an episode together with the core contributors from CUE. Maybe I’ll also ask them when they come on the show, and say “Hey, now you’ve got seemingly some competition.” I think they’re trying to solve – maybe there’s some more nuance to CUE lang, but they could potentially be solving – they could have an overlap in the kind of problems they’re solving. So yeah, I need to dive a little deeper… But I watched a video on it, I read some of the documentation… I’m like “You know what? This actually makes sense. This actually makes sense.” In the same way when I was looking at CUE I’m like “You know what? This actually makes sense.” So we shall see where we land on that. But yeah, that’s my unpop. We’ll see how well that goes.

Awesome… So, Ian, do you have something?

No… [laughter] He’s just [unintelligible 00:58:57.16] today. Fine, fine, fine. Alright, well, let me take us home.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
