Conf42 JavaScript 2024 - Online

- premiere 5PM GMT

In Memory Of Travails


Abstract

Ride along on the rollercoaster of improving the performance of the Auction.com graph service, our coherent interface between all our backend services and our clients. Watch as an onslaught of heap profile diffs, flame graphs, and async_hooks lead to a better, more performant graph.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Gabriel Schulhof, and I'll be talking about how we improved the performance of resolvers in auction.com's subscription service. The problem we faced was that, in the course of operations, the Node.js process would reach its heap limit. The heap is a data structure that the JavaScript engine maintains, and the engine imposes a limit on it. That limit can be configured, but of course you cannot use an infinite amount of heap. We had it configured to two gigs or so, and the process was reaching that limit, and that was a problem. So this is the story of how we tried different things and managed to knock it down a bit.

Some background: at auction.com, we have a GraphQL service that serves pretty much 85 percent of the clients' needs. There are a few places where clients issue requests to services other than graph, but 85 percent of the time it goes to graph, and we have request-response-type operations with graph. We also have WebSockets, which serve subscriptions. The subscriptions themselves are mostly, not all of them, since some subscriptions have different content, but most of them are basically rebroadcast Kafka messages coming from the backend. For those messages we perform some transformations and then send them to Redis, and if Redis decides that we have subscribers, it sends them back to us. The reason for that is that we have multiple pods in production, and Kafka sends a given message to only one of those pods. So if that pod happens not to have any subscribers for that particular message, but some other pod does, we need Redis to regulate this.

To reproduce this out-of-memory condition, or rather to investigate ways of reducing memory consumption and improving performance in general, I set everything up locally on my Mac. Namely, I ran a broker locally, I ran graph locally, I connected 4,000 WebSockets locally, and then I used kcat to basically flood graph with Kafka messages. There was only one message, but I just sent it repeatedly to see where things get hung up and what uses all that memory.

This is what I measured at first. I broke the measurement down into three phases. The first one is just running graph: node and graph, no sockets, no activity, nothing. In the second phase, I connected 4,000 sockets, and you can see that memory consumption increased and then leveled off to accommodate 4,000 idle sockets; this is the overhead of maintaining 4,000 idle sockets. Then, two minutes later, I started sending messages for the next four minutes and measured how many messages graph was able to receive and how many it was able to send out. And these are the results. It's not very good: 16 Kafka messages in four minutes. That's 15 seconds to process a single message and send it out to 4,000 clients. We need to do better.
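As an aside on how curves like these can be collected: the sketch below (not the actual harness behind the talk's numbers) samples the process's memory while the phases run; the interval and the fields logged here are arbitrary choices.

```js
// Sample this process's memory every 5 seconds while the phases run
// (idle -> 4,000 idle sockets -> Kafka flood). The heap ceiling itself is a
// V8 setting and can be raised with `node --max-old-space-size=<MB>`.
const samples = [];

const timer = setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  samples.push({
    time: new Date().toISOString(),
    rssMB: Math.round(rss / 1024 / 1024),
    heapUsedMB: Math.round(heapUsed / 1024 / 1024),
    heapTotalMB: Math.round(heapTotal / 1024 / 1024),
  });
}, 5000);

// Later, once the run is over:
// clearInterval(timer);
// console.table(samples);
```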
And so, what did we do to make things better, both memory-wise and performance-wise? Memory-wise, the first thing we noticed was that, this being GraphQL, answering a subscription with a single response means performing a GraphQL execution. Now, every GraphQL execution has an associated context. The context is there to store the information that pertains to that particular execution, and in the case of subscriptions to the repeated executions, but either way it's particular to the one connection. One of the things GraphQL sometimes does is start out with a certain object received by some means, in this case Kafka, and if the subscription specifies that it needs certain fields that cannot be found on that object, but for which there is a resolver, meaning GraphQL knows a way of providing those fields even though they are not on the original backend object, then it will reach out to backends, which are preconfigured to provide that information. Then it massages that information and returns the result, and the client is happy.

We have 32 different backends at auction.com. Each backend has five different methods, HTTP verbs and so on, and we have these wrappers on the context for reaching out to the backends. Basically, it looks like this: we have a backends object, which encapsulates all our backends, and then for each backend we have it by name and then by method. We construct this whole structure whenever we construct the context. So every context has 160 objects on it. And that's a problem, because in our test we have 4,000 contexts coexisting, and each one has 160 objects. That's a lot of objects. And chances are that any one response isn't going to cause the executor to access every single method on every single backend. Quite the contrary: it's usually five or six accesses. If it's any more than that, then you need to worry: oh, is it N+1, am I causing too much traffic? Either way, you're not going to need 160 objects.

Okay, what to do? Let's make it lazy. Why not? JavaScript has this wonderful thing called a proxy object, which is basically an object that pretends to have any number of fields set. But it doesn't really have them set, because what it means to retrieve a field is up to a callback. Instead of just being a property access, it's actually a function invocation where the function is given the name of the property and can decide what to do about the fact that somebody wants that property. So you don't necessarily need to store all possible values. And that's exactly what we did: when somebody asks for, say, UAA, and then asks for post, we create the UAA backend on the fly as they are asking for it, and then we create the wrapper for the post method as they are asking for it. And it's not as if we're delaying the response by doing this, because we would have created these things anyway when the execution happened. So the response is not going to be slower. In fact, it's going to be faster, because we're not creating 160 of these; we're only creating maybe a total of four or five. We're saving time and we're saving memory.

Now let's see what this got us. It turns out that the improvements here in terms of throughput aren't that great, so there's probably another bottleneck somewhere; creating 160 objects was not that expensive in terms of execution time. But it was pretty expensive in terms of memory. You can see that we lowered the plateau here, because these contexts, which were causing this rise, now cause a lower rise: we're not creating any backends. In fact, this particular subscription doesn't take any backends at all. You'll see later that all it does is take the object that comes in over Kafka, transform the keys, and send it out. It doesn't need any backends.
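To make the lazy-backends idea concrete, here is a minimal sketch of the Proxy mechanics. The backend and method names are illustrative, and makeBackendMethod() is a hypothetical stand-in for the real wrapper factory; the production code differs in detail.

```js
// Build context.backends so that a backend (and each of its method wrappers)
// is only constructed on first access, then cached for the rest of the
// execution, instead of building 32 x 5 = 160 wrappers up front per context.
function createLazyBackends(context) {
  const backends = new Map();

  return new Proxy({}, {
    get(_target, backendName) {
      if (!backends.has(backendName)) {
        const methods = new Map();
        backends.set(backendName, new Proxy({}, {
          get(_inner, methodName) {
            if (!methods.has(methodName)) {
              methods.set(methodName, makeBackendMethod(context, backendName, methodName));
            }
            return methods.get(methodName);
          },
        }));
      }
      return backends.get(backendName);
    },
  });
}

// Hypothetical wrapper factory: whatever used to be built eagerly now gets
// built here, on demand.
function makeBackendMethod(context, backendName, methodName) {
  return (path, options = {}) =>
    context.httpClient.request({ backend: backendName, method: methodName, path, ...options });
}

// Usage: context.backends = createLazyBackends(context);
// context.backends.uaa.post('/tokens', {}) creates only the two wrappers it touches.
```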
So then, okay, this is maybe a little optimistic, because for this subscription the reduction is maximal, but it doesn't matter, because it's going to be reduced in pretty much all cases.

All right, so what else can we do? We've lowered memory consumption, sure, but can we do more, lower it in different ways? One of the things we explored was whether making the execution more efficient might also bring memory usage down, because more modern-style code that doesn't require so much support from the engine might be leaner. It might get optimized, whereas the old code was not, and so it doesn't need to create so many temporary objects during execution. In this particular case, we have this one function called convertObjToSnakeKeys. I told you earlier that we have objects that look like this when they come in from Kafka, and we need to convert them, basically because our convention is that GraphQL output fields are snake_case. So we have this thing that converts a deep object to snake case, and that function was using lodash a lot, because it's an old function that nobody had really touched, because it works just fine, and back in those days there was no map, no Object.entries, that kind of thing. Okay, so let's see if we can speed things up. Let's replace all these calls to lodash and use some of the stuff the engine has natively to accomplish the same thing. So we transformed the code like this. And the result was that performance was back in the positives relative to that 16-message baseline I showed you on one of the first slides. We retained a little bit of a lowered plateau here, because we're still not creating those contexts' backends, but because we're doing more work, memory consumption in the busy case has actually resumed where it was. It's a tiny bit lower, but not as much as we'd like.

All right, let's see what else we can do. I told you that we're using Redis because we have a situation where we have multiple graph pods, like this, and they're declared as a single consumer group to Kafka, meaning Kafka will partition messages addressed to the consumer group among the members of the group. So let's say G2 receives the green message; then G2 can satisfy C4 and C5, because they requested the green message. But how can C2 be satisfied? G1 never got the green message, and yet somehow it still has to send it to C2. The way we reconcile that is that we send all the messages to Redis, and then Redis asks: does G1 have a subscriber for this green message? Oh yeah, it does. Okay, then I'd better forward it. So Redis doesn't just blindly forward everything to everyone, because then we could get rid of Redis entirely and just declare every graph instance to be its own consumer group; it filters. So it reduces traffic versus the Christmas-tree case, while at the same time allowing the graph service to appear to be a unified service, rather than clients being able to tell: oh, I must be connected to pod number two, because I can't see this and that. So it ensures correctness, it reduces traffic, and that's why we use it.
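As a rough sketch of the fan-out just described, assuming the graphql-redis-subscriptions package the talk names (the trigger and field names are illustrative, and option names can vary between versions):

```js
const { RedisPubSub } = require('graphql-redis-subscriptions');

// One pub/sub engine per pod, all pointing at the same Redis.
const pubsub = new RedisPubSub({
  connection: { host: 'localhost', port: 6379 },
});

// Kafka side: whichever pod in the consumer group received the message
// publishes it, after the usual transformations.
async function onKafkaMessage(payload) {
  await pubsub.publish('PROPERTY_UPDATED', payload);
}

// GraphQL side: Redis only delivers the message back to pods that actually
// have a subscriber listening on this trigger.
const resolvers = {
  Subscription: {
    property_updated: {
      subscribe: () => pubsub.asyncIterator('PROPERTY_UPDATED'),
    },
  },
};

module.exports = { onKafkaMessage, resolvers };
```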
And I figured, okay: graphql-redis-subscriptions does a lot of serializing and deserializing, a lot of streaming. Can we improve it just by upgrading it? Maybe folks upstream, doing wonderful open source work, have realized that there are better ways of doing what they're already doing, without actually breaking compatibility. So we upgraded it, and no, not quite. It's okay, it was worth a shot. The plateau remains, memory consumption is pretty much the same, and performance improved a little bit, but not spectacularly. Okay, we tried it. At least now we have the latest and greatest graphql-redis-subscriptions, or we did at the time, and that can never hurt.

So then, let's see: can we do more to improve this wonderful convertObjToSnakeKeys function? Can we make it leaner? To be honest, I keep coming back to this function because I was running node clinic on the whole workload, just to see what is holding up the show and why it is performing so poorly, with the quote-unquote ulterior motive of actually reducing memory consumption. Node clinic is not a memory-consumption tool; it's an execution-time measuring tool. It's a flame graph for execution time. Where is your app spending most of its time? That's the question it answers. And time and again, when I ran it, it came back with convertObjToSnakeKeys. So I'm like, okay, fine, let's get that out of the way.

So I went back and was thinking to myself: we have objects like these, and we're converting them to snake keys. We're not touching the values at all, unless the value is an object or an array, because it's a deep thing and we recurse, but otherwise we're not touching the values at all. We're just touching the keys, all the time. Now, we are a company that does stuff with data, like every other company that does stuff with data, and the data in this case is key-value pairs. The keys can refer to all kinds of business-related things; you can see right here, for example, fallout history and is auction status change. But these are keywords. These are like the variable names that programmers use to get at the data in that particular location. These names are not generated automatically; we agree on them and then we use them. So, in terms of processing them into snake keys: once I've read daysOnMarket and turned it into snake case, and this is what it looks like, then the next time I encounter this daysOnMarket thing and need to turn it into snake case, why do I need to run that computation again? I already know how to do it; I've done it once. So why don't I just record the result and not do it the second time? This is called memoization, and that's exactly where this is going: just memoize everything. Then, no matter where you find a key named external identifiers, you've already done it, and that's it. You just look up the result and reuse it. So that's what we did here: we memoized it. It was a super simple change, literally practically two lines, one addition and one replacement, and the rest is just gravy. And, incredibly, performance went from up 20 percent versus the original baseline to up 331 percent. That was a huge improvement.
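Here is a minimal sketch of that memoized conversion, combining the native-method rewrite from earlier with a cache of converted keys; the camelCase-to-snake_case rule shown is illustrative, not the exact production implementation.

```js
// Cache each key's converted form: once "daysOnMarket" has been converted,
// every later occurrence is a Map lookup instead of a string computation.
const keyCache = new Map();

function toSnakeKey(key) {
  let converted = keyCache.get(key);
  if (converted === undefined) {
    converted = key.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase();
    keyCache.set(key, converted);
  }
  return converted;
}

function convertObjToSnakeKeys(value) {
  if (Array.isArray(value)) return value.map(convertObjToSnakeKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [toSnakeKey(k), convertObjToSnakeKeys(v)]),
    );
  }
  return value; // values themselves are left untouched
}

console.log(convertObjToSnakeKeys({ daysOnMarket: 12, falloutHistory: [{ isAuctionStatusChange: true }] }));
// -> { days_on_market: 12, fallout_history: [ { is_auction_status_change: true } ] }
```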
And you can see from the curve that it's not struggling so much to deliver these messages: the new curve is like the old curve, but flatter, more of a plateau. And a plateau is what we want to see, because if the height of the plateau goes down, then we've improved something across the board in terms of memory consumption. Whereas here it's very difficult to judge, because we don't have a plateau that we could systematically lower and observe being lowered.

All right, so what did we do next? We stepped back a little bit: okay, so these pods are dying. That's the fundamental problem; that's what got us started on this whole process. So let's see if we can spread out the load. We see that memory is rising precipitously because of some activity that is completely valid business activity, and the service is not able to keep up. Okay, let's quickly add more pods, and then maybe it's going to be able to keep up. And so we did: we keyed the upscaling, and equally the downscaling, on memory consumption. And it was okay. It handled things a little bit better, restarts went down a little bit, but nothing spectacular.

By the way, aside from this precipitous increase, we also had a slow memory leak that would accumulate over days on end, and we had to address that too. The way we did that is, drum roll please, we just restarted every night. We couldn't figure out where the leak was coming from, and it doesn't matter; this is a perfectly valid way of handling it, because nobody's watching their WebSockets at 2 AM, so why not restart the service? Besides, there are no auctions going on at 2 AM; everybody's practically asleep all over the US. And our clients, not the people, but the apps and the website they use, are designed to reconnect very quickly. So within a matter of seconds they would reconnect their WebSockets and continue to get data. So yeah, we quote-unquote addressed that particular memory leak.

Anyway, back to the quest of lowering memory consumption. One of the things I didn't include in my background slide is the way we deploy graph: we do a bunch of transformations on the source code. We have TypeScript in there, so of course we transpile that to JavaScript, but we also have this legacy bundling system that is basically designed for clients, but that we also use on the server, and it puts everything through Babel and generates one big file that is the entire server. The configuration of that process made no distinction between whether it creates a client bundle or a server bundle. With this optimization mindset, I went in there and thought: okay, let's introduce such a distinction. The client bundle has to address all kinds of browsers with all kinds of capabilities, legacy browsers and all that, so it needs to be bulletproof and able to function no matter what level of ECMAScript is running on that browser. But that's not true for the server. With Node.js, we know exactly what version we're going to have, 20 in this case, so the translation might as well target that version and leave a lot of the native stuff in place instead of unnecessarily replacing it with polyfills.
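A hedged sketch of the kind of split implied here, assuming @babel/preset-env and a BABEL_ENV=server build for the server bundle (the real build also involves TypeScript and the legacy bundler):

```js
// babel.config.js: browsers keep the conservative output; the server bundle
// targets the Node.js version it will actually run on, so native features
// stay native instead of being polyfilled.
module.exports = (api) => {
  const isServer = api.env('server'); // true when BABEL_ENV=server

  return {
    presets: [
      [
        '@babel/preset-env',
        isServer
          ? { targets: { node: '20' } }
          : { targets: '> 0.5%, not dead' },
      ],
    ],
  };
};
```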
So we did that, and you can see that the curve has gotten a lot flatter: if you look at this one, the peaks and the valleys are a lot smaller. We improved things a little bit, and performance went up by another 0.3x, so we're at 3.6x now versus the original. Okay, it's good. Memory consumption is a little more predictable, performance is slightly up, can't complain. Except, of course, that we haven't really lowered this plateau; it's still hovering just above one gig.

Okay, let's see, what else can we do? Back to optimizing this guy. If you remember the code, we had a bunch of Object.entries and Object.fromEntries and map in there, and that generates a lot of temporary objects, because Object.entries converts an object to an array, and then fromEntries converts that array back to an object, which creates a new object. Then you have the old object and the new object, plus this array, which then gets destroyed. Same thing with map: it creates a new array and discards the old one. Why do we do that? Why don't we just do things in place whenever we can? For objects it's difficult to do things in place, because you would have to replace the keys, and that's really awkward: you'd have to insert a new key, which is snake_case, delete the old key, which is camelCase, and so forth. It's much easier to just create an empty object, fill it in under the converted keys, and return that. So okay, yes, for objects we're still creating a new object. But in the case of an array, we can save ourselves that entirely, because we're not touching the keys; we're only touching the values. So just replace the values with the snake-case-ified versions of themselves, and that's it; you can keep using the same array. And when we deployed this, lo and behold, performance went from 3.6x to 6.3x, so we picked up almost another 3x of the original baseline. Memory consumption is very nice and flat now, but it's still hovering at around a gig.
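A sketch of that in-place variant; toSnakeKey() is the memoized helper from the earlier sketch, repeated here in compressed form so this block stands alone.

```js
const keyCache = new Map();
const toSnakeKey = (key) => {
  let s = keyCache.get(key);
  if (s === undefined) keyCache.set(key, (s = key.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase()));
  return s;
};

function convertObjToSnakeKeys(value) {
  if (Array.isArray(value)) {
    // Arrays have no keys to rename: convert each element in place instead of
    // allocating a replacement array with .map().
    for (let i = 0; i < value.length; i++) value[i] = convertObjToSnakeKeys(value[i]);
    return value;
  }
  if (value !== null && typeof value === 'object') {
    // Objects do get a fresh container, because the keys themselves change;
    // a plain loop avoids the temporary arrays of entries()/fromEntries().
    const out = {};
    for (const key of Object.keys(value)) out[toSnakeKey(key)] = convertObjToSnakeKeys(value[key]);
    return out;
  }
  return value;
}
```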
Okay, so then break out the big guns, so to speak. This is Chrome DevTools, the memory tab, as you can see, because we're looking at memory consumption, and it has this wonderful feature called a heap snapshot that you can use with Node.js. A heap snapshot will, as you might imagine, take a snapshot of the heap: it tells you all the objects that are on the heap at that time, by their type. And one of the best features of this tool is not just that it can take these snapshots, but that if you take two of them of the same process, earlier in the process and later in the process, you can actually get the difference: what have I created since my last snapshot, and what have I destroyed since my last snapshot? Of course you'd want either zero or lots of negatives, but unfortunately that's not what we were seeing. The first snapshot is where everything is idle: Node.js is just sitting there doing absolutely nothing. The second snapshot is after the 4,000 sockets were connected, but before there were any messages coming through.

And the difference is, among other more recognizable things, that there are 261,968 Location objects. Now, what is a Location object, and why do we have so many of them? It turns out the Location object is an artifact that is created whenever an incoming request, such as a subscription, is processed by GraphQL. GraphQL creates a syntax tree for it, an AST, and unless you ask otherwise, it will attach to each AST node the location of that node in the original source: row five, column 29, that's where some token starts. That token is stored in an object in the AST, and associated with that token is a Location object that would allow you to go to that location and do something interesting. But we're worried about answering these queries, about giving responses to them; we don't actually care where in the query string something is located. So we don't need these locations at all. And we configured the parser not to generate them. We kind of monkey-patched it at first, because we actually had to reach into subscriptions-transport-ws and basically just tell it, okay, please don't, but then we found better ways to pass that option through. Before this change, it didn't take very long to blow through the one-gig barrier, whereas now it barely makes it to one gig. So we got somewhere in terms of memory-consumption reduction. Great. And the execution performance hasn't really decreased; in fact, if I think about it, it has actually gone up a little. So, not bad.
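For reference, the parser option involved looks like this in graphql-js; the real change had to be threaded through the subscription server rather than calling parse() directly, and the subscription document shown is illustrative.

```js
const { parse } = require('graphql');

const query = 'subscription { property_updates { days_on_market } }';

const withLocations = parse(query);
const withoutLocations = parse(query, { noLocation: true });

console.log(withLocations.definitions[0].loc);    // Location object (start/end offsets)
console.log(withoutLocations.definitions[0].loc); // undefined: nothing extra retained per AST node
```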
But is there anything else we can do? Let's see how low we can go. So the next thing we did was go back to the context: can we lower this plateau a little more in the idle case? Because if it's low when idle, it's going to be proportionally lower during execution too, or at least most of the time. We realized that, in addition to backends, the other things on our context are what we call data loaders and primers. A data loader is basically a kind of backend wrapper where you ask it for a bunch of things, one at a time, and it pretends to answer right away. But really, one of its most important features is that it batches those requests and then issues a single backend call saying: give me all of these things, if the backend is capable of that. Then, when the backend responds, it doles out the various objects it was asked to provide as they come in. So it's a great tool for avoiding N+1 situations, and we use data loaders for lots of different things. We have 50 of them, plus primers, which prime the data loaders when the execution knows it's going to need that data later; of those we have even more. And we store them all like this: the name of the loader and then the function that creates the loader, the name of the next loader and the function that creates it, and so on. And the result of these function calls is always an object.

So again we were in the same situation as with the backends, where we were creating hundreds of these objects, if not thousands, between the data loaders and the primers, for every single context. But unlike the backends, this required a slightly subtler way of turning it into a lazy object, because it's a much bigger object and it's constructed statically. The backends are actually constructed in a loop, but this object is literally just one big giant object-literal declaration. We didn't want to turn it into complicated hand-written code, while at the same time we wanted that laziness so that we might save memory. And since, as I explained earlier, we run through these build steps that convert things and run Babel on them until it's just one big file, and Babel has these things called plugins, why not make a Babel plugin that takes one of these object literals and turns it into a lazy object at build time? Doing it by hand would get messy, especially since one of the properties in this object is actually primers, which is itself a lazy object, so it gets complicated quickly. Better to use the Babel plugin.

The Babel plugin works like this: it looks for instances of Proxy.lazy in the code. Proxy is a perfectly valid ECMAScript-specified global, but that global doesn't have a static function called lazy; that's fictitious, a function we invented. But with Babel, it can be turned into a quote-unquote real function, by virtue of the fact that we generate code to satisfy it and turn the object literal into something that is really a proxy. So finding the word lazy there is basically our hook. That turns the code into something like this: if you're asking for key one, you call function one; if you're asking for key two, you call function two; but you also memoize. So originally you have all the keys, because the object still has to look like it has all the keys, but each key has only this one symbol assigned to it, one symbol that is reused for all keys instead of one symbol per key. That's a big savings, while not affecting the look and feel of the object. Then, when you want one of these data loaders, you just ask for the key. The proxy checks: oh geez, the value is one of these symbols, I'd better create it. So it creates it and then replaces the symbol with the result of the function call, i.e. memoization. And that's it.
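Here is a hand-written approximation of how the plugin's generated output behaves; in reality the code is produced at build time, and the loader names and factory bodies below are illustrative.

```js
// Every key exists up front, so the object still "looks" complete, but each
// value starts out as one shared sentinel symbol. The first read through the
// proxy swaps the sentinel for the factory's result, i.e. memoization.
const UNINITIALIZED = Symbol('uninitialized');

function lazyObject(factories) {
  const target = {};
  for (const key of Object.keys(factories)) target[key] = UNINITIALIZED;

  return new Proxy(target, {
    get(obj, prop, receiver) {
      const current = Reflect.get(obj, prop, receiver);
      if (current === UNINITIALIZED) {
        obj[prop] = factories[prop]();
        return obj[prop];
      }
      return current;
    },
  });
}

// Illustrative usage: only the loaders a given execution touches get built.
const loaders = lazyObject({
  propertyById: () => { console.log('building propertyById'); return new Map(); },
  auctionById: () => { console.log('building auctionById'); return new Map(); },
});

loaders.propertyById.set('123', { id: '123' }); // logs "building propertyById" once
```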
Before looking at what happened when we did that: we did another thing too. We introduced a query cache, because we didn't want to have 4,000 copies of an AST if we could avoid it; why not just have one copy? And we also added JIT-ing, which basically creates a function for every document, every AST that comes in. It creates a function which, when called, has exactly the same effect as if you had called execute with that particular AST, only, this being a single JavaScript function created from a string of very carefully constructed, valid JavaScript, the engine can optimize it; it can actually turn it into native code. So it has a very good chance of running as native code rather than lots of interpretation and jumping around on the AST. So what we did was introduce a cache: when the query comes in and we parse it, we create this compilation artifact for it, and then, when it's time to execute the query, we grab the compilation artifact from the AST on which we stored it during parse. If, for some reason, the creation of the compilation artifact failed during parse, no big deal; we just go back and run the query the old-fashioned way. Otherwise we run the JIT-ed function and it's faster.

With these two things put together, this is what we achieved: we went from this guy to this guy. Again, genuine memory savings; this plateau genuinely got lower. Look at the difference between the original plateau and this one: it's substantially less. We went from 600 megs to 300 megs of idle consumption, with performance essentially unaffected and much lower memory. So we got some of the way towards our goal, and that's the scope of this presentation.

Things have happened since then. One of the future directions here was to upgrade packages, and believe it or not, that actually solved the problem for us, because it turns out we did have a package in our supply chain that was basically wasteful with memory. After switching to Yarn 4, whereupon we upgraded all our dependencies, that package either got shoved aside or a newer version of it got hoisted instead. Either way, Yarn 4 got rid of a lot of our duplicate dependencies, same package at different versions, somehow a newer version ended up in place, and it solved the problem. But that doesn't mean these future directions for memory-consumption reduction aren't a good idea. Using heap diffs to see what else we can shave off: just because the problem doesn't happen anymore and we don't have restarts doesn't mean you can't improve memory consumption further. It's never a bad idea.

Promise objects: I didn't really cover this in this talk, but basically we found that if you treat promises and primitives equally, you end up with a lot of extra code that gets executed, and a lot of memory, because all these promise objects need to live somewhere even when they don't need to exist at all. For example, if you have something like await functionCall(), and that call returns anything other than a promise, say the number five, then you don't want to await that call. You only want to use await, or .then, or whatever, if you can be absolutely one hundred percent sure the function is going to return a promise. If not, just call it directly; don't use await, because that creates something like three extra promises every time you use it. Avoiding that will save you memory and it will save you execution time.
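A small sketch of that last point; maybeCached() and format() are hypothetical stand-ins for a resolver's helpers.

```js
// `await` always schedules a microtask and wraps non-promise values, so a hot
// resolver path that usually returns a plain value pays for promises it never
// needed. Checking for a thenable first keeps the common case synchronous.
function resolveField(obj, cache) {
  const result = maybeCached(obj, cache);
  if (result && typeof result.then === 'function') {
    return result.then(format); // genuinely async: promises are unavoidable here
  }
  return format(result);        // plain value: stay synchronous, allocate nothing extra
}

// Stand-ins so the sketch runs:
function maybeCached(obj, cache) {
  return cache.has(obj.id) ? cache.get(obj.id) : Promise.resolve({ id: obj.id, fetched: true });
}
function format(value) {
  return { ...value, formatted: true };
}

console.log(resolveField({ id: 'a' }, new Map([['a', { id: 'a' }]]))); // synchronous result
```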
Upgrading packages, like I said, is what did it for us, and it's still a good idea. So is creating a realistic subscriptions load test: not just one message, but multiple messages, so you have some heterogeneity in your workload. We can still do these things, but of course the impetus now is a lot less than it used to be.

So these are some of the insights we gained. Choose very carefully what you attach to your context; keeping your context lean is never a bad idea. And by the way, some of these insights apply to request-response graph as well, not just to subscriptions. The fact that we made all these things lazy reduced memory consumption on the request-response side as well, not just the WebSocket side. So it's generally not a bad idea to have a lean context for graph execution. Then: execution efficiency and memory-usage efficiency are not entirely orthogonal. We saw a case where we improved execution efficiency and memory consumption actually went up instead of staying where it was. So they're not independent; they're related, and they can be related in a positive way or a negative way. Improving execution efficiency is never a bad idea, but if it hurts your memory consumption, you might want to think twice about it, or at least find out why. Node clinic, which I mentioned earlier, is great for improving execution efficiency, and as a result maybe memory-consumption efficiency too, but definitely execution efficiency. It does a great job of showing you: there's a huge plateau here, you keep calling this function, why? Stop calling it, or make it more efficient. It really points squarely at the culprit. And then heap snapshots came in very handy, especially the differential ones, because they pointed squarely at the thing we could remove: we don't need this, goodbye. It's a great tool for that.

All right, that concludes my presentation. If you have any questions, I think we have a means of answering them, but either way, thank you so much for your attention.
...

Gabriel Schulhof

Software Engineer @ Intel

Gabriel Schulhof's LinkedIn account


