Conf42 Cloud Native 2021 - Online

Hundreds of Microservices without breaking your APIs

Abstract

Managing dozens or hundreds of microservices at scale can be very challenging. As developers, we often find ourselves blind to how our application actually behaves in production, what dependencies we should be aware of, and what we should check before deploying a new version.

In this talk, we will introduce a new method for collecting and analyzing microservices communication from production, and show how to leverage this data during the development and testing phases to improve our code.

By relying on production behavior, we can automatically generate tests that are more efficient, catch dependencies that are about to break in real life, and make our developers more product and production oriented.

Summary

  • Michael Haberman is the co-founder and CTO of Aspecto. He will talk about microservices, in particular a lot of microservices, and even more particularly, not breaking your API. Specifically, he'll talk about distributed tracing and how we can use it to overcome microservice issues.
  • I've been doing microservices for about five years, starting as an independent consultant. In the last two years I decided to do my own thing and started Aspecto, a product focused on helping developers with microservices.
  • Most companies are migrating to microservices rather than starting a brand new company or product with microservices. Understanding the bigger picture, understanding who is consuming who and how they're consuming are the main concerns. As your architectural picture grows, the risk of having production issues increases.
  • When people have a hard time understanding their architecture, they create an architectural document. I bet that each and every one of you has some kind of architectural chart in your organization, but it's never up to date; it's always lagging behind.
  • OpenAPI really helps you document your HTTP endpoints. As for the question of how services communicate with one another: maybe more documentation, maybe some service map solution. You can try to find a way, but it's also a challenge.
  • If your service is down, you may lose data. We need to move from synchronous to asynchronous communication. There are endless options out there; you just need to pick one and use it in the way that fits your needs.
  • The more micro you go, the more complexity you are going to face. There is something out there called distributed tracing that is getting more and more popular, and OpenTelemetry allows you to do distributed tracing. Being able to see it all together gives you the story of what happened to a particular API call.
  • Jaeger shows a diagram of dependencies between services. It doesn't answer the important question of when, in which endpoint, they start to communicate with one another. This is the resolution the developer is looking for.
  • Let's assume that you started to collect distributed tracing data; you can use it, for example, to generate mocks. His advice: get familiar with OpenTelemetry, get to know distributed tracing, and understand how to implement it.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody. My name is Michael Haberman. I am the co-founder and CTO of Aspecto. Today I would like to talk with you about one of the most exciting topics out there: microservices, in particular a lot of microservices, and even more particularly, not breaking your API. So if you run microservices today, or you did in a prior job or something like that, you know that microservices can be complex. In this talk I'm going to try and help with that, try to predict the challenges that you are going to face and how we can overcome those challenges. How are we going to do that? We are going to go through a journey: how a typical microservices project starts in a company, how it starts small and simple, and then gets more and more complex as you go; probably, by definition, microservices are going to be complex. Then we need to ask ourselves what tools we have in order to overcome these challenges, this complexity. Specifically, I'm going to talk about distributed tracing: what distributed tracing is, how we can use it to overcome microservice issues, and also how we can use it in a cool new way that is going to, I think, really help your development process.

So let's get started. But before that, let me tell you why I feel that I know a bit about microservices. I've been doing microservices for about five years. I started as an independent consultant, helping companies break their monolith into microservices, or companies whose microservices had scaled and who needed some help with that. At my last position I was the chief architect at a company, managing about 120 microservices with quite some scale. And in the last two years I decided that I'm going to do my own thing and started Aspecto, basically a product focused on helping developers with microservices.

So yeah, let's dive right into it, looking at your typical microservices journey. Most companies are migrating to microservices rather than starting a brand new company or product with microservices. You usually have some problems with your monolith: it's hard to deploy, it takes a lot of time to test, you have a lot of regression bugs. These are the kinds of complaints that you hear when starting the migration process. And then, when you start your process and you start to create microservices, usually it's something outside of the monolith: a new feature, a greenfield feature, something that doesn't involve a lot of monolith logic. And then you start to develop, and you develop those three services that you can see on the screen, and your architecture is really simple. You have service A communicating with B, which communicates with a database; service B also communicates with the database; then service B communicates with service C, which uses the database; and maybe even service A communicates with C. Something quite simple, quite straightforward, very easy to maintain, very easy to understand. And, important to say, we are talking about HTTP at this point, so it's synchronous communication between services. So what identifies that you are at the beginning of your microservices journey? It's very simple to run it locally: I can just spin up three processes, maybe with Docker, maybe without, maybe Docker Compose. It's very simple to spin it up in your local environment.
And if I were to ask you to go to a whiteboard and describe your architecture, you would do it quite easily. And maybe the best way to know that you're at the beginning of your journey is how easy it is for a developer to onboard: if it's easy, that means you're at the beginning. And looking at products in companies, successful projects continue. Now, most likely your first days in microservices are successful: you have a small number of services, you don't have high complexity, you manage to release this new feature fast. So the product people are happy, sales are happy because they can sell it, the business is happy because things are going fast, everybody is happy. And when everybody is happy, you start to get more requirements, more features, you start to take responsibility away from your monolith, and it starts to grow.

When it starts to grow, your architecture a few months later starts to look more like this, and there are a lot of components on the screen. But as you start to draw the relations between them, this is where it starts to get a bit scary, because there are a lot of relations between them. And I was being quite easy here; you could have tons of relations between different services depending on one another. And it's starting to get big, it's starting to get more and more complex. So if I were to ask you a few questions, I would ask: okay, can you please draw the architecture diagram of your project? That gets more and more complex as you go. And if I asked you, let's take this service, for instance: who is using it, which other services or clients are calling this service? And this service, does it call other services? If you know the answer, it's going to really help you. But as the picture gets bigger and the diagram gets bigger, it's harder to remember. And also, when you are looking at the communication, there is tons of communication. We chose at this point to stay with HTTP, but even with HTTP you need to remember the route that you are calling, the verb, and also the structure itself. What is the contract that this service expects to get? When I'm communicating with another service, I need to know what the contract between us is, and sometimes even that isn't that obvious. So I've raised three main concerns that I think, at this point in your microservices journey, would be the main ones: understanding the bigger picture, understanding who is consuming who, and how they're consuming.

But we are developers, we know how to solve problems, this is what we do. So let's go and get it fixed. But one thing to remember, and this is really important when talking about microservices or distributed applications as a whole: as the picture (and when I'm saying picture, I mean your architectural diagram) gets big and keeps growing over time, it also increases the risk of having production issues. And this is something that we need to remember, because it's going to guide us through this talk. So we agree that with microservices, as they grow, as we have more and more services, we are going to have production issues. And the reason is basically that when you have more to remember, more dependencies, more things that you need to take into account when you code, it just gets bigger and harder. And microservices, quite by nature, by their definition, keep on growing.
If I were speaking in front of you, I could ask you the question: how many services have you added in the past year? Most of you would say some number. But if I were to ask you, did you remove a service? That doesn't happen a lot. It does, but not at the same rate. So microservices kind of dictate that we need to have a lot of them, we need to have separation of concerns, and we have all kinds of reasons why we are separating. But microservices usually keep growing, and when they are growing, the picture gets bigger, and when the picture gets bigger, the risk of production issues increases. So this is kind of an equation that we need to take into account when thinking about microservices. What I want you to remember at this point is: when my microservices are growing, my risk of having production issues increases as well.

Okay, so we mentioned three issues: the picture is big, it's hard to understand who is consuming who, and it's hard to understand how. These are problems; let's try to solve them. Again, from being with a lot of customers on projects to migrate to microservices, I'm going to go over the main points that kept repeating across those projects. The first thing that I saw people doing when they're having a hard time understanding their architecture is an architectural document. So they go to any kind of solution where you can draw your diagram. And I bet that each and every one of you has some kind of architectural chart in your organization. And if I were to ask how confident you are that it is accurate, I think all of you would find the honesty to say, yeah, that's probably not 100% accurate; it's somewhere in that direction, but it's not 100%. And I think it makes sense, it's very natural. Just go and create your architectural document, that's fine, go ahead, do that. Do remember, though, it's never up to date; it's always lagging behind. So we have an architectural document that we did manually. Okay, that helps, but it doesn't solve it.

Let's go to the second issue, which is a bit more complex. I want to know, for a specific service, which services it depends on and which services it serves, which services consume it. And this, again, can be somewhat difficult. Maybe more documentation, maybe finding some service map solution, maybe finding some service catalog solution. You could try to find a way, but it's also a challenge. I think that in the early days I would probably go with docs, but maybe down the road I would take some vendor to help with that. And the question of how services communicate with one another? Well, we can just go with Swagger. Swagger is a great tool, or OpenAPI, which is just the name given to the standard behind Swagger. And OpenAPI really helps you document your HTTP endpoints. Again, you can do documentation, usually not one hundred percent accurate, but really helpful at the beginning.

Okay, so we did all of that and everything is good, everybody is happy, but time went by, and now you have your first downtime. When something is down, you start to realize that it's quite problematic, because looking at HTTP, if service A, as you can see in the diagram below, sends an API call to service B and service B is down, not available for some reason, that basically means that you lost data, and we don't want to lose data. The fact that we chose to build a distributed application doesn't mean that we need to lose data, right? So HTTP is great, I love working with HTTP, but do remember: if your service is down, you may lose data. Losing data is something that we should be afraid of, and your boss might get upset with you because you lost data. And then you sit outside and you just say, I hate HTTP, I just hate it.
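To make the data-loss point concrete, here is a minimal sketch of the kind of synchronous call being described. It assumes Node 18+ (for the built-in fetch) and uses hypothetical service names and a hypothetical URL; it is an illustration, not code from the talk. If service B is unreachable, the catch block is the last place the payload exists, and unless the caller persists or retries it, it is gone.

```typescript
// service-a.ts - a hypothetical synchronous call from service A to service B.
// If service B is down, the catch block is the last place this payload exists.
interface OrderCreated {
  orderId: string;
  amount: number;
}

export async function notifyServiceB(event: OrderCreated): Promise<void> {
  try {
    const res = await fetch("http://service-b.internal/v1/orders", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(event),
    });
    if (!res.ok) {
      throw new Error(`service B responded with ${res.status}`);
    }
  } catch (err) {
    // Without a retry queue or durable buffer, the event is lost here.
    console.error("failed to deliver event to service B", err);
  }
}
```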
So we can fix it, we know how to fix it: HTTP doesn't work for us, we need to move from synchronous to asynchronous communication. And there is just an endless number of options out there to choose the right one for you. It could be a pub/sub solution, it could be a queue solution, or, being more specific: Kafka, Redis Pub/Sub, RabbitMQ, AWS SQS. There is an endless amount; you just need to pick the one that fits your need, or the one that your company already works with. I'm going to refer to it from now on as Kafka, just because we use a lot of Kafka and it's very trendy these days.

So when I'm introducing Kafka, how is that going to help me? You can see here the diagram that I showed you at the very beginning of this talk: we have three services, A, B and C, and they communicate with one another. But now, if I take the example of how service A communicates with C, they are not communicating directly, they are communicating through Kafka. Service A is going to send the data to Kafka, and Kafka is going to receive it and persist it. And then service C is going to ask Kafka: hey, can you give me more messages? I'm ready to process more messages. This thing that Kafka is doing, persisting the data until service C finishes working on it, basically ensures that you don't lose data. And that is great, that is exactly what we wanted. Data is persisted, no data loss; if a service is down, we'll just spin up a new one and everybody is happy.

But we are experienced enough with distributed applications and microservices, and we know that usually there is a downside to architectural decisions. So think for a second: what problems are we going to introduce by introducing Kafka? The first thing is that our whole picture, our diagram, just got way more complicated. There is tons of stuff that happened just because we made this tiny change, and we need to take a look at that. Service A is calling service B, that's fine. Service A is calling Kafka. But then, as service A, who am I communicating with? I just lost my ability to understand, from service A's perspective, who is going to consume this data. I can only understand that if I go to the code bases of the other services and figure out which of them uses that Kafka topic. That works easily when you have three or ten services, but when you have 100 or 120 microservices, it starts to get really problematic: I don't know who produces this data, I don't know who is consuming this data. So by introducing Kafka into the solution, we didn't only solve the problems we had with synchronous communication, we also introduced some problems in terms of understanding who is communicating with who. Another thing that is part of this architecture, and you just need to use it in the way that fits your need: it's not two-way communication, you don't have a request and a response. It's a one-way thing: you just send your payload and somebody is going to get it. You're not going to get a response. That's fine.
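As a rough sketch of this one-way, produce-and-consume pattern, assuming the kafkajs client and made-up broker, topic and payload names (the talk does not prescribe a specific client library):

```typescript
// kafka-example.ts - service A produces, service C consumes, Kafka persists in between.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "service-a", brokers: ["kafka:9092"] });

// Service A side: fire-and-forget produce; there is no response from any consumer.
export async function publishOrderCreated(orderId: string): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "order-created", // hypothetical topic name
    messages: [{ key: orderId, value: JSON.stringify({ orderId }) }],
  });
  await producer.disconnect();
}

// Service C side: pulls messages whenever it is ready; Kafka keeps them until then.
export async function runOrderConsumer(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "service-c" });
  await consumer.connect();
  await consumer.subscribe({ topic: "order-created", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const payload = JSON.parse(message.value?.toString() ?? "{}");
      console.log("service C processing order", payload.orderId);
    },
  });
}
```

Note that nothing on the producer side tells you which services consume the order-created topic; that is exactly the visibility gap described above.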
Also, doing debugging locally got more complicated. You can't just spin up service A and service B and start sending data. You need to spin up service A, you need to spin up Kafka, you need to spin up service C, and you need to send data to service A so it populates Kafka, so service C can consume it. Everything is starting to just be more complex. And Swagger is not relevant, because Swagger is about HTTP communication rather than some arbitrary payload. So those are the downsides. And I wanted to emphasize how it looks, so I took the second diagram that I showed you, where we have twelve services, and I started to introduce Kafka in the middle. I think you kind of get the idea; I hope you look at this and say, nah, that's too crowded. And if you can see, I only added Kafka on the left side of the screen. So it gets really complicated, it gets messy, it's very hard to work with. And again, as the picture gets bigger, the risk increases. So if we are able to serve the developer with the big picture in a good, compact way, that is going to simplify the developer's work and it's going to reduce the risk. If I were a developer looking at the code base of service A, I knew it communicated with service B; there was no doubt in my mind. Looking at Kafka solutions, I will have a doubt. So if I had the ability to answer, for the developer, who is consuming this message, I would reduce the risk of having production issues.

So, kind of to summarize so far, this is what I was referring to when I'm saying that microservices are complex. Your name is micro, you are going to have a lot of them. I know a lot of companies are saying, we're not doing microservices, we have only like five or ten or something like that, we don't have the big setups with thousands of services. Any distributed application is going to have those issues; it just depends on how significant they are. The more micro you go, the more complexity you are going to face. But it's going to allow you to run faster, it's going to allow you to deploy really, really fast. So that's one thing that makes it more complex: it's very legitimate to create more and more services. Also, microservices allow us to choose the right tool for the job. Whether you need a special database or a special programming language, you just spin up another microservice, and that's absolutely fine. But you're going to have more of them. And also, it's not only HTTP: async communication is very popular, we see all kinds of async communication, and all of them have their own purpose. But you are going to face, at some point or another, non-HTTP communication, and it's going to get complex.

So I hope I kind of scared you just a bit, just the right amount, so you know that distributed applications are complicated, and we have tools to overcome that. Every time I spoke about the complexity, I spoke about the big picture: understanding the big picture, who is communicating with who, and how. And this is what I want to try and help you with. So there is something out there called distributed tracing, getting more and more popular. The CNCF, the same foundation responsible for Kubernetes, is also responsible for a project called OpenTelemetry, which allows you to do distributed tracing. The concept, and I'll show it to you in a second, is quite simple: every microservice reports what's happening.
When I'm saying what's happening, I mean: as a service, I got an API call, I performed a DB query, I set some data in Redis, I communicated with my cloud provider. All of those are being reported into one central place, and there is a link between them. If I see that some service set some data in Redis, I have a link to the HTTP call that initiated this DB statement or Redis statement. So this is the way it looks: you can see here we have two services, service A and service B, and within the application code we have OpenTelemetry. Service A sends an API call to service B; service B gets it and does something with it. And you can see that both of them are writing to a central traces solution. When service A got the API call, it is basically the root of this trace, the starting point, the entry point of this trace. It gets this data and says: okay, I'm trace number one, this is the first trace, and I'm also the first span. So this is the first action that took place under this context, and it will be more clear in a second. And then it sends the API call. Now, when this API call is being sent, it actually injects a unique header telling the next microservice in line: hey, you are not the first one, I was before you, I want you to link us together. So when service B got the API call, it took the reference that it got from service A, and as you can see, when it's reporting to the central trace place, it's using the same trace, but span number two. So now, if I ask you what's in trace one, the answer is span one and span two. Span one represents the API call of microservice A, and span two represents the API call of service B. Being able to see it all together gives you the story of what happened to a particular API call.

Let's even see that in action. Here you can see Jaeger. Jaeger is a very well known tool that allows you to visualize traces, OpenTelemetry traces and other types. And this is some flow in our backend: you can see here that we have the Aspecto API docs service, the Aspecto account service and a versions API lambda. You can see the process: we got an API call to OpenAPI packages, and then we sent an API call to the Aspecto account service to get the user, probably to authenticate the user. And once it was authenticated, we invoked a lambda called the versions API lambda, which ran a query on DynamoDB. I can see here the entire flow: I have three microservices involved, API docs, account and the versions microservice, and I can see the interaction between them. If I click on one of them, I can even see all the relevant data that I need in order to understand what this thing is doing. Basically, OpenTelemetry gives you the ability to take one particular request and visualize it all together.

If I try to give you a bit more detail on how it looks, looking at how to implement OpenTelemetry: it's usually kind of simple. You have this SDK; this one is our OpenTelemetry distribution, but there is plain open source OpenTelemetry you can implement. Basically it's an SDK within your code that is sending out to whatever destination you are going to send it to. A destination could be directly to something like Jaeger, so Jaeger would take it, persist it in some database, and then you're able to visualize it. You can send it to some vendor that is going to visualize it for you, or you can send it to your own database and then query this data in whichever fashion you want. For instance, Aspecto would take this Jaeger UI and present it in kind of a different way, I would say, and you can see here how it looks within Aspecto.
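For a Node.js service, a minimal setup along these lines might look like the sketch below. It assumes the current OpenTelemetry JavaScript packages (@opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node and the OTLP HTTP trace exporter) and a collector or Jaeger instance listening on the default OTLP port; package names and options have changed between releases, so treat this as an outline rather than the exact setup used at Aspecto.

```typescript
// tracing.ts - load this before the rest of the service starts.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "service-a", // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    // Assumed OTLP/HTTP endpoint of a local collector or Jaeger all-in-one.
    url: "http://localhost:4318/v1/traces",
  }),
  // Auto-instruments HTTP, Express, Kafka clients, etc., so incoming and outgoing
  // calls become spans without changing application code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

With auto-instrumentation in place, the trace and span linkage described above is handled for you: the HTTP instrumentation injects and reads the W3C traceparent header, so downstream spans join the same trace.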
So once you've implemented OpenTelemetry and you're sending the data, what you get out of it is two important things. The first one is the ability to debug whenever you have an issue: now you have the whole story, all the breadcrumbs together, to understand how you reached the situation where you have this bug, and now you can understand it. And it also helps you visualize: we took some actions and put them together on a graph, and now it's more visual for the developer, which is definitely better. I'm not sure it's answering the big picture question yet, though. One really, really important thing to do: in your logs, you ship all kinds of metadata, right? What you can do is this. Look at your flow when you have a bug in production today: you get some error, probably, maybe you try to reproduce it, maybe you go to your log solution and try to find the exception that caused it. And imagine that you found an exception, quite a generic one, but you also have the trace ID. The trace ID allows you to take it and throw it back into Jaeger. So you got an exception, you took the trace ID, as you can see here, that's the trace ID, you throw it in, and now you can see the whole process that caused this exception. It's a really, really cool trick, a simple one that you should definitely do. And I would urge you to start with OpenTelemetry and Jaeger: it's kind of easy to set up, it just works, and it's amazing.
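A small sketch of that trick, assuming the @opentelemetry/api package is available and using console output in place of whatever structured logger you already have:

```typescript
// log-with-trace-id.ts - attach the active trace ID to every error log so it can be
// pasted straight into Jaeger's search box.
import { trace } from "@opentelemetry/api";

export function logError(message: string, err: unknown): void {
  const activeSpan = trace.getActiveSpan();
  const traceId = activeSpan?.spanContext().traceId;

  // Replace console.error with your logger of choice.
  console.error(JSON.stringify({ level: "error", message, traceId, err: String(err) }));
}
```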
Now, I started to talk about visualization and said that I'm not sure it's answering the question. Let's say that you have some complex system, that you have a lot of services communicating with one another, and everything is being sent to Jaeger. Here you are able to see how Jaeger will show you the diagram of dependencies between services. Now, this is service level communication: it's just telling you, hey, the scraper service is communicating with the user service, and it also communicates with the Wikipedia service. It's just telling you that service A communicates with B, just telling you who is communicating with who. It doesn't answer the important question of when, in which endpoint, they start to communicate with one another. So let's give an example. On the left hand side you can see the service A, service B, service C communication. That's cool. And now I'm going to show you two different examples for service A: service A has an endpoint, v1/items, where it calls service C, but it calls service B only in v1/purchase. And this is the resolution that the developer is looking for: I want to know when, in which endpoint, services are communicating with one another. The fact that they are communicating with one another is important, but it doesn't tell me all the details just yet. Those are the kinds of things that Aspecto is really good at helping the developer with, because we are trying to look at it from the developer's perspective and answer what they are looking for.

We understood the problem, and now we know OpenTelemetry, at least briefly, and what it can help us with. Let's assume that you started to collect distributed tracing data. Let's talk about why it's extremely important. We already said that tracing as a whole helps us, but not a lot: it still doesn't help me with the big picture, it doesn't exactly help me with dependencies, and it doesn't help me narrow the gap between production and dev. Let me try to emphasize. Let's say that you want to replay traffic. I don't know how you're going to solve it today, but it's hard to solve; you need to start working on that, you need to start introducing some tools that allow you to do it. If you want to generate mocks, usually you do static mocks. If you want to do API documentation, you do it manually, but maybe you could auto generate that. Think of all of those things: all of those things are present in your tracing data. Take the ability to create docs: if you have the raw data, the raw network communication between the services, you can just take this data and create documentation based on it. So I think that you should use your OpenTelemetry data. And it's very simple: you have SDKs deployed in all kinds of different microservices, all of them reporting to a component called a collector in OpenTelemetry, basically something that knows how to receive all the spans and then send them to some database of your choosing, such as Elasticsearch. And on top of that you have Jaeger; Jaeger can communicate with Elasticsearch, I think it's their best practice. And then, whenever you have some question about how things are operating in the production environment, go and ask your database. It's already there.

Just to give you an idea: say you need to generate a mock, a mock for your unit tests, for instance. How are you doing it today? Looking at the code, probably, and saying, okay, I assume that I need this data to look one way or another, and I would just create the basic thing that I need for my mock, make it static, and never change it, right? Unless something really significant happens. But what happens if we use the database to fetch some mocks? Now I have real, relevant traces with real different usages, and I can really easily reproduce my production environment better in my tests. This is very easy to do and could really improve your tests. So I think we found some cool ways, and just to throw an idea out there, there are tons of things that you can do with distributed tracing data. If you're interested in that, go check out Aspecto and our blog; we talk about it quite often.

And yeah, we started from three microservices, very simple, easy to reproduce locally, easy to tell another developer about. Things started to get more and more complicated, so we introduced async communication using Kafka. Then we kind of lost sight of what's happening, so we had to introduce tracing. And then we found what we can do with tracing, and we can do a whole bunch with that. My suggestion to you is to get familiar with OpenTelemetry, get to know distributed tracing, and understand how to implement it in your microservices; it's going to be super helpful. Always log your trace ID: in any log in our system, we have the trace ID, and if something happens, we can always throw it into Jaeger and visualize it. And once you have all of that, go to your database once a week and have a look there; I'm pretty sure you're going to find some interesting stuff.
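To make the "ask your database" idea a little more tangible, here is a deliberately rough sketch. It assumes spans end up in an Elasticsearch index called spans, carry OpenTelemetry-style attributes such as http.target, and include a custom attribute with the recorded response body (capturing bodies is something you would have to add yourself; it is not recorded by default). The index name, field names and client are all assumptions, not anything prescribed by the talk.

```typescript
// mock-from-traces.ts - pull a real production response for an endpoint and use it as a test mock.
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

export async function fetchMockResponse(route: string): Promise<unknown> {
  const result = await es.search({
    index: "spans", // hypothetical index where your collector pipeline stores spans
    size: 1,
    query: {
      term: { "attributes.http.target": route },
    },
  });

  const hit = result.hits.hits[0];
  if (!hit) {
    throw new Error(`no recorded traffic found for ${route}`);
  }
  // "http.response.body" is a custom attribute this sketch assumes you record yourself.
  return JSON.parse((hit._source as any)["attributes"]["http.response.body"]);
}
```

A unit test could then stub its HTTP client with whatever fetchMockResponse returns, instead of relying on a hand-written static fixture.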
So thank you very much. I really enjoyed talking about it. And if you have any questions, feel free to shoot me an email, Twitter, whatever. Thank you. Hope to see you next time.

Michael Haberman

CTO @ Aspecto
