Conf42 JavaScript 2024 - Online

- Premiere: 5PM GMT

Mission: Performance Impossible – Real time chat webapp’s performance with 100k active users

Abstract

Scaling Mattermost to support 100k active users meant tackling complex performance challenges. This talk provides a practical look at how we measured, validated, and improved Mattermost’s performance for a better user experience.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to the Conf42 JavaScript track. I'm really excited to be speaking about this topic, called Mission Performance Impossible: real-time chat web app performance. My name is Zubair, and I'm a software engineer at Mattermost. A little bit about Mattermost: it's an open source collaboration platform for communities, enterprises, and teams, and we have been very focused on making Mattermost an entire suite of platform tools for the whole software development cycle.

Now, before we begin with the topic, let's first understand the architecture of the web app. The Mattermost web app uses React as its front-end library. As everybody is already aware, React is efficient when we have a lot of updates, because it works out which components have actually changed and re-renders only those, instead of re-rendering the complete tree. The state manager we are using is Redux, together with the Redux Thunk library. As I've said, it serves as the state manager of the application, and I think everybody is aware of Redux: it's a very predictable state container for React, and with Thunk it also handles the asynchronous logic for interacting with the network and the other core features. The next technology, the one that actually makes the Mattermost web app real time, is the WebSocket API. WebSockets enable real-time communication between the client and the server, which is obviously crucial for a chat application. On top of that, the application is a single-page application, which enhances the user experience by avoiding frequent page loads. It has an entirely modular design, and since it's open source, feel free to check out the repository and the code behind it: everything is modularized into components, actions, selectors, and the UI component library. And as I've already mentioned, real-time updates come over a WebSocket that maintains a persistent connection to the server, allowing instant data pushes and updates without requiring user interaction.

With this, let's try to understand the challenges of a real-time application. When I say challenges, I'm talking purely about client performance; real-time on the server side is a completely different story that needs its own talk. This talk is concerned with the client, by which I mean the web application running in the browser. Now, when things are real time, there is a high volume of incoming WebSocket events, and most of the time these incoming events must be followed up by more network requests from the client. For example, when a post comes in, we usually don't send the complete information about the user who is sending the post. If the client already has that user ID stored in Redux, well and good; if not, it has to go back to the server to fetch the user's details. In a highly loaded environment, this can lead to a lot of network traffic, with your client constantly responding to new incoming WebSocket events with more network requests. This is one of the problems we have faced; we'll get to it in a bit.
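To make the pattern concrete, here is a minimal sketch of that "WebSocket event triggers a follow-up fetch" flow. The names here (profilesCache, fetchUserProfile, the event shape, the endpoint) are illustrative stand-ins, not Mattermost's actual implementation:

```typescript
// Sketch only: a new post arrives over the WebSocket; if the author isn't
// cached client-side, a follow-up HTTP request has to be made.

interface Post {
    id: string;
    user_id: string;
    message: string;
}

// Stand-in for the slice of the Redux store that caches user profiles.
const profilesCache = new Map<string, {id: string; username: string}>();

async function fetchUserProfile(userId: string) {
    // Illustrative endpoint; the real client would dispatch a thunk instead.
    const res = await fetch(`/api/v4/users/${userId}`);
    const profile = await res.json();
    profilesCache.set(userId, profile);
    return profile;
}

async function handlePostedEvent(post: Post) {
    // If the author is already cached, no extra round trip is needed.
    if (!profilesCache.has(post.user_id)) {
        // Otherwise the client goes back to the server. This is the
        // follow-up traffic that multiplies under heavy real-time load.
        await fetchUserProfile(post.user_id);
    }
    // ...then the post can be stored and rendered with full author info.
}
```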
Now, there's always a constant battle between what to fetch, what not to fetch, and what to fetch later on. With real-time data, you always have to balance overfetching against underfetching. And while keeping all of this in check, your UI has to stay responsive and interactive, because the user is concerned with using the application, and your app simply has to support that. There were a lot of performance optimizations we had to put in to raise the load-bearing capacity of the client; we'll get to some of the PRs, not all of them, in a while.

But before that, let's try to understand how we can simulate this. Because if you are testing in a local environment, you won't have this many users signing up, sending posts, joining channels, reacting to posts, commenting on posts, and all that. Mattermost has an in-house tool, the Mattermost load-test engine, which it uses to test the server environment. A little bit about the architecture of this particular tool; you're seeing it here at a very high level. There are two main components we should look out for. The first one is the user controller. This component is in charge of driving, or controlling, a user, by which I mean a simulated user. Its implementation is what defines exactly what the behavior of the user is. Its actions cover what a user might do when it opens Mattermost: typing a post, switching channels, switching teams, opening a thread, reacting to a post. All of this is implemented in the user controller module you can see over there. The other component we should be aware of is the coordinator. The coordinator is in charge of managing the cluster of load-test agents, and through them it tries to figure out the maximum number of concurrent users a particular server can support. If you look at the diagram, there is a loop going from the coordinator to Prometheus: a simple feedback loop. When the load test starts, the coordinator begins increasing the number of users, and at the same time it monitors, by querying the Prometheus server and collecting its server metrics, whether we are reaching our target limits or not. Based on that, it will start slowly decreasing the number of active users again. Once the equilibrium point is reached, meaning no more active users can be added, it stops, and that indicates the maximum number of supported users this particular Mattermost server can have (see the sketch below).

All of this happens on the server side. What we would like to do is have an extension of this for the client. So we connect our client to one of these app instances and see how the client performs while all of these user controllers are working in tandem, doing all sorts of expected user behavior, because that is how the application would be used eventually, right? For this, what we did is have an automated test running on the client, which opens the web application while the load test is running in the background.
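The gist of that coordinator feedback loop can be sketched like this. This is schematic only: queryPrometheus, addUsers, and removeUsers are hypothetical helpers, the metric name and thresholds are made up, and the real coordinator lives inside the load-test tool itself:

```typescript
// Hypothetical helpers: stand-ins for Prometheus queries and agent RPCs.
declare function queryPrometheus(metric: string): Promise<number>;
declare function addUsers(n: number): Promise<void>;
declare function removeUsers(n: number): Promise<void>;
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function coordinate(targetCpuPercent: number) {
    let activeUsers = 0;
    let step = 100; // how many simulated users to add or remove per tick

    while (step > 0) {
        const cpu = await queryPrometheus('avg_cpu_usage_percent');

        if (cpu < targetCpuPercent) {
            activeUsers += step; // headroom left: ramp users up
            await addUsers(step);
        } else {
            activeUsers -= step; // over the limit: back off
            await removeUsers(step);
            step = Math.floor(step / 2); // narrow in on the equilibrium
        }
        await sleep(60_000); // let the metrics settle before the next probe
    }

    console.log(`Equilibrium: ~${activeUsers} concurrent users supported`);
}
```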
Then a front-end load-testing tool starts interacting with the real-time application. Now, you might be confused: didn't I just say the user controller is doing the interaction? Yes, it is, but on the server side. We need client-side metrics, so we connect an actual client and run an automation tool against it to get the real browser metrics, what an actual user would be seeing. And for that, we need to drive a browser. As you can see, once we start doing that, it runs a whole suite of tests for some time, and we pipe all the data to Prometheus, which is already there in our load-testing instance. Grafana scrapes that Prometheus to give us the useful analytics we'd like to see. A sketch of such a probe follows below.

I mentioned analysis and metrics; let's try to understand what these metrics are. But before going to the metrics: how can you compare two versions of the Mattermost application? What we do is, for version A, we run the load test, wait for the equilibrium to set in, get the app node, run the client automation test for some time, look at the Grafana dashboard, and take the metrics from the browser. Then we run it again, with exactly the same environment variables as the previous test, but with the different client version. After running the same steps again, we compare: have we improved or not? That is how we analyze whether we are improving the performance of the web app or actually degrading it.

With this, let's try to understand a few of the metrics that we track on the client side. Now, this is important: even before going to the actual metrics, there are a few things we first need to understand. Client-side metrics are very much dependent on the hardware, and the hardware varies across clients, because each user has a different browser and different hardware, even though all of them are on the same Mattermost instance. That's the reason I've said comparison is the key over here, not the absolute values of these metrics.

So the first thing we do is use the Chrome DevTools to see whether there are any memory leaks or not; we do memory profiling. What we are basically trying to see is whether there is any memory leakage or any inefficient memory usage, and whether that excessive memory use will lead to crashes or slow performance. Once we do find things that need fixing, we follow up with more efficient data structures or more efficient function calls, so that memory allocation is optimized. We also monitor CPU usage. This identifies process bottlenecks for us, because high CPU usage can slow down the complete client, affecting response times and application performance. And not only does it degrade our application's performance, it also degrades the other applications open on the user's computer.
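As an illustration of what such a client-side probe can look like, here is a sketch that samples Chrome's heap and script-time counters from an automated browser session. It assumes Puppeteer; the pushMetric helper and the sampling cadence are placeholders, not the actual Mattermost harness:

```typescript
// Sketch of a client-side metrics probe using Puppeteer's page.metrics().
import puppeteer from 'puppeteer';

// Hypothetical helper that ships a sample to Prometheus (e.g. a pushgateway).
declare function pushMetric(name: string, value: number): Promise<void>;

async function sampleClientMetrics(appUrl: string, durationMs: number) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(appUrl, {waitUntil: 'networkidle2'});

    const deadline = Date.now() + durationMs;
    while (Date.now() < deadline) {
        // page.metrics() exposes Chrome's JS heap size, script duration, etc.
        const m = await page.metrics();
        await pushMetric('js_heap_used_bytes', m.JSHeapUsedSize ?? 0);
        await pushMetric('script_duration_seconds', m.ScriptDuration ?? 0);
        await new Promise((r) => setTimeout(r, 5_000)); // sample every 5s
    }

    await browser.close();
}
```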
So we profile the CPU, and if there are any spikes, we can go back and see what caused the additional computational overhead. Another metric we look at is frames per second while the user is interacting with the application. That's on the top left, as you can see, again available in the Chrome DevTools. It's a measure of how smoothly the animations and interactions in your application are taking place. Obviously, low FPS leads to choppy visuals, delayed responses, and a very bad user experience. So we identify which process is causing the low frame rate and concentrate on that, to see how it can be improved, how the re-renders can be improved, and all that.

Another useful artifact we take a snapshot of is the network HTTP Archive, or HAR. What this allows us to see is whether there are any long-pending network requests being made by the client that we might not be aware of, how the other web assets are loading, and whether there are any network bottlenecks. Because what happens is, although the browser doesn't put a hard limit on the number of requests being sent, when the responses come in, the client has to take that data, figure out where in the Redux store it has to be placed, place it, and do a lot of re-rendering. So although the network request itself doesn't cost us much memory, the processing that follows the response is the heavy part; optimized network loading is very much appreciated.

Next, we have the web-vitals library from Google, which helps us track various stable, industry-recognized metrics: for example FCP, the first contentful paint; LCP, the largest contentful paint; cumulative layout shift; and so on. It's an amazing library made by Google which you can incorporate, and it will track all these numbers for you. Measuring these vitals, and comparing them across runs, is one of the vital performance tools we have in our arsenal.

Running Lighthouse is another process we have. Google's Lighthouse tool gives us a holistic view: not only the metrics we talked about earlier, but also other very useful ones, for example accessibility, and it gives very helpful recommendations on how to solve performance bottlenecks. So we have FCP, the first contentful paint, which measures the time from when the user first navigates to the page until some part of the page's content is rendered on the screen. LCP, the largest contentful paint, is the time until the largest content element on the page appears. Then we also have Interaction to Next Paint, again a Core Web Vital for us, which measures the latency of the user's interactions over the course of their visit to the page. And we also have other things like total blocking time, speed index, all of that. Then we have a specific set of custom Mattermost metrics, for example channel switch time: how much time it takes, on average, for the user to change from one channel to another, or from one team to another. These are custom metrics which we have defined in Mattermost.
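Wiring up the web-vitals library is only a few lines. A minimal sketch, assuming a placeholder /client-metrics endpoint for wherever the numbers get shipped:

```typescript
// Minimal web-vitals wiring; the reporting endpoint is a placeholder.
import {onCLS, onFCP, onINP, onLCP, onTTFB, type Metric} from 'web-vitals';

function reportMetric(metric: Metric) {
    // sendBeacon survives page unloads, so late-arriving metrics like CLS
    // and INP still make it out.
    navigator.sendBeacon('/client-metrics', JSON.stringify({
        name: metric.name,   // e.g. 'LCP'
        value: metric.value, // milliseconds (or a unitless score for CLS)
        id: metric.id,
    }));
}

onFCP(reportMetric);
onLCP(reportMetric);
onCLS(reportMetric);
onINP(reportMetric);
onTTFB(reportMetric);
```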
And yes, we keep on defining and developing these custom metrics as part of performance analysis. Now with that, I think we have understood which metrics we can look out for, how we do the comparison, and the process we follow to load test the client. Let's quickly look at a few example PRs, just to get an understanding of how the actual work is done.

This is one of the things we saw when we ran the load test on the client: there was very high memory usage in one of the reducers, something like 34 percent of the memory usage in that particular interaction. We had to reduce that, but why was it even happening? It turned out we were creating objects which were not being recycled by Chrome: there were dangling object references that were not being garbage collected. And what usually happens is that when the browser runs the garbage collector, it pauses the whole application while it runs; a GC run by the browser is an intensive operation, and we need to avoid triggering it as much as possible, not avoid it completely, but as much as possible. So this was one of the things we had, and by changing how we created and reused those arrays we solved the issue and were able to bring the usage down significantly with that particular change.

There was another one causing issues: we were using a third-party library for the profile popovers, and it was causing a significant memory leak, which we were able to find. One option was to build it ourselves; the other was to replace the library with an existing one. We had the base library changed to a newer, supported library without these memory leaks, and yes, we did see a lot of improvement in performance when we changed the popover library.

Another investigation we did: we had a lot of network calls being sent to get statuses and user IDs whenever a new message was posted. Now, admittedly, we don't have a WebSocket event when someone changes their status, because of some performance bottlenecks on the server side; that is one of the limitations we have, and the reason we don't send out a WebSocket event when a user's status changes. So what we had to resort to is manual fetching: after some amount of time, fetching the user statuses again. But that was repeatedly hammering the Redux store with updates, and the whole application was pretty much freezing. We had to do something about it. What we ended up doing was creating a simple data manager for statuses (and similar data) which keeps a queue of the pending requests and then executes them after a certain amount of time, as sketched below. That solved the issue, and the interval after which this particular queue gets flushed is, again, configurable per application. And there was also another unwanted network request we noticed: we were unnecessarily fetching a complete thread when we were not even viewing that particular channel.
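The core of that status data manager is just request batching. A minimal sketch of the idea, assuming hypothetical names (fetchStatusesByIds, FLUSH_INTERVAL_MS); the real manager in the Mattermost client is more involved:

```typescript
// Hypothetical bulk-fetch action that updates the Redux store once.
declare function fetchStatusesByIds(userIds: string[]): Promise<void>;

const FLUSH_INTERVAL_MS = 1000; // configurable flush interval

const pendingUserIds = new Set<string>();
let flushTimer: ReturnType<typeof setTimeout> | null = null;

// Callers enqueue IDs instead of firing one request per incoming message.
export function requestUserStatus(userId: string) {
    pendingUserIds.add(userId);

    if (!flushTimer) {
        flushTimer = setTimeout(async () => {
            const batch = [...pendingUserIds];
            pendingUserIds.clear();
            flushTimer = null;

            // One deduplicated request (and one store update) per interval,
            // instead of hammering Redux on every incoming post.
            await fetchStatusesByIds(batch);
        }, FLUSH_INTERVAL_MS);
    }
}
```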
So as you can see, if you have noticed, all of these PRs, and not only these but any performance PR, have a very good backing of reasoning for why the particular performance improvement is important, with a sound comparison between the before and after effects of the PR, so that we can see, visually, that this has improved by this particular percentage or on this particular metric. And not only does that help the review, it also helps in the future, when we go back to the PR and ask why we did that: at that particular point in time, we saw, okay, this was the improvement.

Now, we have seen a few examples of the work which was done on the Mattermost client side, so let's get to the caveats, because obviously there are a few. No matter how much performance improvement you do, network latency can eat away all your hard work; that's just a hard reality. Even if you have shaved off, say, 30 milliseconds with your improvement, if your server has a large latency, all of that performance gain goes to waste. And, as I've also mentioned earlier, client-side performance has a lot of dependent factors: what kind of hardware are users on, what kind of browser do they have, what other applications do they have running alongside the Mattermost application. So that's the reason I've said comparison is the key in load testing the client. It also so happens that when the user loads the application for the first time and it is slow, the user's perception is that the application is bad; user perception regarding performance is very much its own thing. That's the reason you have to do your own load testing on the client and not rely solely on secondhand feedback like "channel switching is slow" or "team switching is slow". We cannot rely on that information alone; we have to replicate exactly what is happening with our load-testing tool to understand where and when the problem is arising. And not all your performance improvements will be linear: some might give you a lot of gain, some might give very little. That's always there.

Now, with this, I'd like to conclude my talk. If you have any questions, please reach out to me; I'm available at my website over there. Thank you so much for listening, and once again, it was lovely presenting to all of you. Thank you.
...

Mohammed Zubair Ahmed

Software Design Engineer @ Mattermost



