Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Adam Furmanek, and thank you for coming to this talk, in which we are going to talk a little bit about maintaining SDKs over many years. We are going to see some lessons learned and some of the experience we gained working with SDKs, and we are going to analyze a big case study of what happened at Metis over many, many years, across many languages. I am Adam Furmanek. I work at Metis as a DevRel. Feel free to take a look at our webpage and see what we do over there. And without further ado, let us jump straight to the point. I've been working as a software engineer for many years, and Metis develops software that makes extensive use of SDKs. So what we need to do first is understand what we tried to build over those years, how we structured our SDKs, how we built them, and how we evolved them over time. Then we are going to see what particularly interesting things we learned and what we would like to share. So let's go. The very first thing
is: what do we do at Metis? Metis is basically software that gives you the ability to build observability for your databases. The idea is that you have your SQL database, or NoSQL database, or a database of any kind, and you have your applications that talk to the database. Now,
in order to build the observability the right way, we need
to understand what happened in the database and in your application.
So we would like to understand, for instance, which REST API was called in your application and then what SQL query was executed as part of handling this particular REST call. And ultimately we
would like to get the execution plan. Why do we want to do that?
Well, the idea here is that developers, whenever they work with databases, very often don't notice problems that can later cause trouble in production.
And this applies no matter whether you're a small startup or a big enterprise company; all those places have to face the same issues. Why? Because many times,
whenever we test our applications, we only focus
on the correctness of the data, not on the performance of how things work. So we miss problems like N+1 queries from our ORM, or we miss cases where our queries do not use indexes.
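To make the N+1 problem concrete, here is a small self-contained sketch (a hypothetical two-table model, using SQLAlchemy on SQLite; not code from the talk): one query fetches the orders, and the loop then silently issues one extra query per order.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    customer = relationship(Customer)  # lazy-loaded by default

engine = create_engine("sqlite://", echo=True)  # echo=True prints every SQL statement
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Customer(id=i, name=f"customer-{i}") for i in range(3)])
    session.add_all([Order(id=i, customer_id=i) for i in range(3)])
    session.commit()

    # One query for the orders...
    for order in session.query(Order).all():
        # ...plus one hidden lazy-load query per order: the N+1 problem.
        print(order.customer.name)
```

With three local rows the extra queries are invisible; with a million production rows they are not.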
And when we test these things locally, or when we
play with those things locally, well, we typically have a
small database with what, five rows, ten rows, maybe a hundred rows. But we do not have a production-like database available locally.
So we do not know what the actual size of the data
is. So even if we have a slow query that
for instance, scans the whole table, then we don't know
that it is going to cause performance issues. And there
are no tools to prevent you from deploying such code to production. Yes, you can run load tests. The problem
with load tests though is they happen very late in the pipeline.
And then those load tests, when they show you issues,
you basically need to go back to your coding and
rewrite the solution, restructure it, and sometimes even
start from scratch. So this is way too late and very
expensive to be efficient. So what we want to do is we
would like to capture issues with your databases
as early as possible, ideally right when you
are typing your code. And to do that, Metis wants to understand which REST API was called, what SQL queries were executed, and what the execution plans are, in order to tell you: hey, this query was
fast locally because you have only 100 rows in your database.
But hey, you scanned the table, and if you deploy this to production without an index, it's going to kill your performance. So this is a critical issue and you need to change it. And we want to alert users right before they even commit the code, or at the latest during their CI/CD pipeline. So this is what we do. We have a couple of assumptions about what we deal with and how we want to tackle it. Generally, we need to extract those three things, and we are dealing with web APIs: applications that expose REST APIs or whatever else, and that are basically dealing with network traffic. Those applications can be running locally, or in the cloud, or on-prem, or wherever else. We don't necessarily constrain ourselves on what types of applications we support. They are generally modern,
meaning that we do not focus on technologies from, I don't know,
ten years ago or 15 years ago. We generally focus on
things that are modern in the sense that we want to
embrace the problems with microservices
or the problems with unclear interdependencies between
applications, or many applications talking to many
databases at once, or a single application talking to many databases at once.
Generally, this is the world we deal with. We are not focusing on monolithic applications talking
to a single database. No. Instead we want to support
a case when we have hundreds of microservices with hundreds
of databases of various kind and generally
support all of that. And ultimately, we want to support the users in their CI/CD pipelines as much as possible. So not only work with them in their local environments, but also in their CI/CD environments, showing them: hey, this is your CI/CD, you can feel safe and rely on it, so that when CI/CD tells you everything is good, it's not going to break in production. So this is the idea, and we have a couple of tenets
for how we wanted to build the SDKs. First, they must be easy to use. Meaning that we do not want to build a solution that is hard to set up and hard to use. Our users should ideally do next to nothing to use Metis; ideally, it should be one command and everything is up and running, right? Another thingy we want to focus on is a one-time integration of the Metis solution. Meaning that it's not the case that when you have a team of five developers, every single developer needs to do something to integrate Metis. No, nothing like that. We would like this to be a one-time action: you integrate Metis, you commit whatever you needed to do to the repository, and bang, the whole team can benefit from the integration you just did. The whole team, the whole company, basically everyone working with the product, no matter whether it is an in-house product, an open source product, or whatever else; you do it just once and everyone can use Metis.
The next thingy is ideally no code changes. Ideally, we want the integration to not touch your application at all, to not change your application at all, if that's possible, obviously. So you don't need to modify the application. But this also comes with a second thingy, which is that we do not want you to change the way you implement your application. Yes, maybe you will need to add one line of code triggering Metis or enabling Metis. But generally, we don't want you to change the way you run your tests, change the way you deal with your ORM, or change the way you write your business logic. No, we don't want to touch that. Ideally, your business code stays the same, your infrastructure code stays the same; the only thing you need to do is, well, enable Metis. And finally, we want to bring as
few dependencies as possible, ideally zero.
We don't want to force dependencies on you, where you need to install this library, that library, or whatever else. No, we only want to bring Metis and that's it. The fewer dependencies, the better. So this is where we are. We wanted to implement SDKs for web applications that are quite modern, dealing with microservices and many databases. At the same time, we want to get
things that can show you what happened in your application: API X has been called, and in turn this is the SQL query that was executed and this is how it performed. Then we can later tell you: this thingy is not going to work well in production. And all of that needs to happen automatically, should be as straightforward for the user as possible, and ideally should not change the user's code at all. So let's see what happened
and what we built over the years. Generally, we wanted to use OpenTelemetry to achieve all of that. When we were brainstorming and trying to figure out how to tackle this problem (what exactly happened, what query was executed, and what the execution plan was), we decided: yes, we want to use OpenTelemetry to capture the interactions.
Why OpenTelemetry, and what is OpenTelemetry? OpenTelemetry is basically a set of SDKs and open standards describing how to shape, send, and process data that captures signals from your application: signals like metrics, logs, or explanations of particular activity that happened, which are called traces and spans in the OpenTelemetry world. So OpenTelemetry can capture that, hey, this is the SQL query that was executed, or that was the interaction you had with some other microservice. OpenTelemetry is basically an open standard defining how to capture and describe those interactions. And OpenTelemetry also provides libraries and SDKs for capturing those signals. Just like in your application you have some logger, right? You have console.log, or just a logger, or System.out.println in Java, or whatever else; you just print messages. And there are many libraries that can take those messages and save them to a file, send them over the network, save them to a database, or add things like a date and time, a timestamp, a thread ID, and other stuff, right? These are libraries that you just use, and OpenTelemetry works the same way: it is basically a concept and a library for capturing metrics from your application. So you don't need to reinvent the wheel, you don't need to figure it out from scratch. No, you just take OpenTelemetry, you use it, and bang, all your metrics are captured.
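As a minimal sketch of what capturing such a signal looks like with the OpenTelemetry Python SDK (the span name and attribute here are made-up values, and the console exporter stands in for a real backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("GET /orders") as span:
    # Attributes describe what happened, e.g. which SQL statement ran.
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = 123")
```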
OpenTelemetry also provides additional tooling to later process the data, process these signals, and, for instance, visualize them. So we have the tools to capture the signals, and those tools know how to emit them, how to structure the JSON data or whatever else, how to send it over, how to process it, and finally how to visualize it. So this is what we wanted to use. Next, we want to get the
details from the REST endpoint and the SQL, meaning that we basically want to capture something like your REST path (so this is the API that was called: API X with such-and-such parameters) and the SQL statement that your application executed. And once we capture those two things, we can correlate them together, showing that, hey, this API has been called, and this is the SQL query that was executed as part of handling the workflow in this API. Once we have that, we can take the query, go to the database, and ask the database for the execution plan. You don't need to give us the execution plan; we can capture the query and get the execution plan by using the EXPLAIN keyword. So we basically go to Postgres or MySQL or whatever else and send EXPLAIN plus your query. And this gives us the execution plan explaining how the query was executed, whether it was using indexes, whether it was scanning tables, or whatever else.
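A sketch of that EXPLAIN step in Python (psycopg2 against Postgres; the connection string and the captured query are placeholder values):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")
with conn, conn.cursor() as cur:
    captured_query = "SELECT * FROM orders WHERE user_id = 123"
    # Prefix the captured query with EXPLAIN to get the plan, not the rows.
    cur.execute("EXPLAIN (FORMAT JSON) " + captured_query)
    plan = cur.fetchone()[0]  # the execution plan as a JSON document
    print(plan)  # reveals table scans, index usage, row estimates, ...
```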
And finally, once we get all of that, we want to send it to Metis.
Metis is software as a service, so we send those details to Metis, and we can show you: hey, this is the API, this is what happened, this is the SQL query, this is how slow it is, this is why it's slow, and most importantly, this is how you fix it. That's the idea. This is how we wanted to tackle this problem. And when solving it, we actually went through three stages of three different SDKs. We maintained our SDKs, we changed our approach, and we learned a lot over this time. The first approach was having an SDK per tech stack. So if your tech stack is Python with FastAPI and SQLAlchemy, that's one instance of a tech stack. If you are using JavaScript with the pg driver and Sequelize, that's another instance of the stack. If you are using Java with JDBC, Spring, and Hibernate, that's yet another instance of the stack. So we wanted to build an SDK per tech stack, and we wanted to support many languages: JavaScript, Python, Go, Java, Kotlin, C#, Ruby, et cetera. Many languages, many libraries, many ORMs, many tech stacks to support.
The second approach was that we wanted to reconfigure the database a bit, to read things from the database logs instead of instrumenting everything. And finally, in the third approach, we wanted to utilize OpenTelemetry much more. So let's see what we did, how we did it, and what we
learned. So, the first approach: an SDK per tech stack. The way we wanted it to work was: you take your application, which has some entry point, some web framework, some ORM library, et cetera, and we ask you to install the Metis SDK as a dependency of your application. And then this SDK does the following magic. Whenever there is a request from the user, from the browser, or from an external service coming to your web framework, then as part of handling the request you call your ORM library. This ORM library goes to the database to extract the data, does its SELECT * FROM table, the data is returned, and the data is ultimately returned to the user. But at the same time, a kind of hook fires and an event is sent to our SDK. So Metis configures hooks on the ORM library, on the web framework, and whatever else, to capture the event that such and such query has been executed on the database as part of this particular single flow. Metis captures this thanks to the hooking, then goes to the database to explain the query, gets the data, gets all the traces, IDs, identifiers, whatever else, and finally sends them to Metis.
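To give a flavor of that hooking, here is a minimal sketch using SQLAlchemy's event system; the forwarding part is only a comment, since the actual Metis SDK internals aren't shown in the talk:

```python
from sqlalchemy import event
from sqlalchemy.engine import Engine

# Fires for every statement any SQLAlchemy engine executes.
@event.listens_for(Engine, "after_cursor_execute")
def capture_query(conn, cursor, statement, parameters, context, executemany):
    # A real SDK would correlate this with the current REST request and
    # send it to the platform, e.g.:
    #   send_to_platform(trace_id=current_trace_id(), sql=statement)
    print("captured:", statement, parameters)
```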
So this is the idea. Now, how does it work? In essence, you take your application, you do a pip install of the Metis dependency, and at the entry point of your application you trigger one line of code, something like a Metis enable call, and then the magic we see here on the screen happens. So let's see how it actually worked and what wasn't working.
So generally, this approach was quite good. First, it was very easy to install. You just do pip install, npm install, mvn install, whatever else, and bang, you have all the magic, you have the libraries, and that's it. Second, it integrates with the language of your choice, meaning that if you're writing in Python, you get a Python API; if you are writing in JavaScript, you get a JavaScript API, right? We don't need to change anything beyond that. We don't need to change your database, and we don't need to change your application, because, well, we just need you to enable Metis, and then everything else happens thanks to hooks and whatnot. We just figure out how to plug into your web framework, plug into your ORM library, and so on. And it generally works everywhere: with automated tests, with the actual APIs. When you run the application, it captures the queries, and it can be easily disabled for production, because you control it and can just not enable Metis. So generally: very nice, very easy, and it should work well.
Now, the problems. The biggest problem is that we need to implement a new solution for every single new tech stack. Meaning that if you are using a different web framework, like Flask instead of FastAPI: bang, new tech stack. If you are using a different SQLAlchemy version: bang, new tech stack. If you are using a different driver behind SQLAlchemy: bang, new tech stack. If you switch languages, you go with JavaScript, TypeScript, Java, Kotlin, whatever else: new tech stack, new tech stack, over and over and over again.
So we would need to maintain many, many SDKs, and we can't reuse the code at all. We can't reuse the implementation. Yes, we can reuse bits of it, for instance the part that sends the data to Metis; this can be reused across all the Python SDKs, right? But if you try to reuse the integration with the ORM, or with your JavaScript library, or with your web framework: no way, you can't reuse that. So with every single tech stack we had to reimplement more and more things, and generally the maintenance of that was super hard. Not to mention that even a new version of the web framework or a new version of the ORM library could introduce breaking changes, so we would need to support older versions for a very long time. Another thingy that doesn't work
well in this approach is that we have differences between the dependencies we use. For instance, if we want to send, I don't know, JSON data, there are many different libraries we need to use: a different library in Python (different again between Python 2 and Python 3) and a different library in JavaScript, right? So first, integration with those libraries differs, and second, they have their own quirks in how they behave. Yes, even sending JSON and encoding stuff can be very tricky. But there were also other problems
that, apart from the burden of maintenance and implementation on our end, were also hard in terms of how it all worked. For instance, integrating with OpenTelemetry is not that straightforward. Sometimes, depending on your ORM library, we may not be able to get the parameter values, or extracting the parameter values may be harder because we need to scrape the logs. Or we can't correlate the REST API with the SQL query, because they are executed on completely different threads and don't share any unique identifier. So generally, there are many quirks and many issues around correlating the REST API with the SQL and whatnot. Not to mention testing frameworks.
Some testing frameworks, when you want to spin up your application in a testing environment and then test it, won't initialize the REST structures properly, so you don't know which API is being called. So generally, this approach worked well in essence, but it was very hard to maintain, very slow to develop, and had quirks we had to overcome over and over again. So we decided: okay, this is something we just can't do. We won't be able to support every combination of the web framework, the ORM, and the language. That's going to be too hard and too much of a burden for us. Let's figure out something else. So the second approach we
wanted to take was reading from the database. The idea now is: we have the application, and again there is the REST API, which calls the ORM library. Now, this ORM library conceptually doesn't go straight to the database, but goes through the SDK, which stamps the query. Metis now takes, for instance, the identifier of the REST call and puts it on the SQL query inside a comment. Then this query is sent to the database, the data is returned, and at the same time the Metis SDK sends this information to the Metis platform. So what happens at this point? Imagine that you call API orders, get order by ID, right? OpenTelemetry initializes: this is a new request with identifier, I don't know, 123; some GUID comes here, right? So we take this GUID, we put it on the SQL query, and at the same time we let Metis know: hey, there was a GUID 123, and that was the interaction with API orders, get order by identifier, or whatever else. Okay, so this is what we do.
And later, asynchronously, we have another piece, the Metis collector. That is a Docker container that runs on the side. Asynchronously, it goes to your database, reads the logs from the database, and looks for the execution plans of all the SQL queries. It finds the SQL query, reads the comment on the query which says this is the query for identifier 123, extracts the execution plan, and finally delivers it to Metis. So this is what we do. Basically, we install an SDK into your application that stamps the SQL query with the trace ID, the identifier of the interaction. And then we have another piece that reads the logs from the database, checks for the execution plans based on particular trace IDs, and sends all of that to Metis.
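A minimal sketch of the stamping idea, assuming an active OpenTelemetry span (the helper name and comment format are illustrative, not the actual SDK):

```python
from opentelemetry import trace

def stamp(sql: str) -> str:
    # Append the current trace ID as a SQL comment so the collector can
    # match the plan it finds in the database logs back to the REST call.
    trace_id = trace.get_current_span().get_span_context().trace_id
    return f"{sql} /* traceparent='{trace_id:032x}' */"

print(stamp("SELECT * FROM orders WHERE id = 123"))
# SELECT * FROM orders WHERE id = 123 /* traceparent='0000...' */
```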
Okay, so that was approach number two. What we now have is, again, quite easy to install: one command plus deploying the Docker container, and that's it. And it works; it still integrates with your language, so you still have an API that is language-specific and idiomatic. And we again make nearly no changes to your application code, right? Because you just need to enable Metis and that's it; we can capture everything. And we can disable it for production and whatnot.
Now, the problems. The database must be reconfigured, because you need to enable the database to log execution plans for every single query. That's quite a lot of work. You need to go to the database, change how it logs the data and what it logs, so that the logs you get are sufficient for us to understand what happened.
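On Postgres, that reconfiguration typically means something like the stock auto_explain module; a sketch, with placeholder connection details:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres host=localhost")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute("LOAD 'auto_explain'")  # make its settings known in this session
    # Load auto_explain in every new session and log a plan for every query.
    cur.execute("ALTER SYSTEM SET session_preload_libraries = 'auto_explain'")
    cur.execute("ALTER SYSTEM SET auto_explain.log_min_duration = 0")
    cur.execute("ALTER SYSTEM SET auto_explain.log_format = 'json'")
    cur.execute("SELECT pg_reload_conf()")
```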
And this is especially hard when we are dealing with ephemeral databases: databases that you just create for the duration of, I don't know, a unit test, and then you take them down, so with Testcontainers or whatever else. Why? Because those databases, when you just create them, won't be configured appropriately for our needs, right? So what we need to do is step in and change the configuration of the database. But then many times we need to restart the database, and it's very hard to restart the database in an ephemeral context when you are just spinning it up with Testcontainers or whatever else.
So this is hard. Sometimes we had to consider building a specific image of the database, for instance a specific Postgres image that would have this configuration enabled. Again, these are problems that are not easy to solve, and they are highly dependent on how you execute stuff. If you execute it in CI/CD, it gets trickier. If you execute it in GitHub Actions, it gets trickier. If you execute it locally, it gets even trickier. So generally, those things are hard to do. Again,
another issue around this reconfiguration of the database is that it costs money, because if you log everything and your logs explode, then you need to pay for storing those logs, processing them, and handling them. And this gets even harder because those logs take memory, they take space, and processing them costs a lot. So generally, it's not easy to do.
Yet another issue is difficult query stamping, because some ORM libraries are tricky. First, you may not be able to put a comment on the query at all. Second, putting a comment on the query may break some other integrations, for instance with your monitoring solutions. And third, with some libraries, when you send just one single query (hey, get me data from these two tables), the library will generate multiple SQL statements, but you can stamp only one of them, so you effectively miss some queries.
Not to mention the same issues we had with the previous approach, meaning it's hard to reuse the code between languages and libraries. Why? Because even though we do not need to integrate with the ORM to the same extent as before, we still need to integrate with your web framework, for instance, we still need to support many versions, and we still can't reuse the code between Java, JavaScript, Python, and other places. So there were still many issues with this approach. It worked pretty well and was very promising, but it still wasn't quite that easy, it was very hard to maintain, and it posed many challenges across the many languages and technologies we wanted to support. So we wanted to try yet another approach.
The idea of this yet another approach was moving the ownership. You can see that in the previous approaches, we were building a solution that we built, maintained, and owned. Now we want to shift this ownership somewhere else. We wanted to build something for our users that we wouldn't need to maintain, implement, and fix every single time a new library comes out or a new version of a web server comes out.
So what we built now is: we dropped the idea of an SDK altogether. What we do now is: hey, you have your application, and we don't really care what is inside it. This application goes to the database and returns data; no magic here. But now we want you to reconfigure this application slightly and use OpenTelemetry to just send us the logs and traces from your application, so we can capture them inside the thing we call the Metis collector. And the Metis collector goes to the database, extracts the execution plan, and sends it to Metis. How does this work now?
What happened here? We can change the approach now thanks to OpenTelemetry. OpenTelemetry is, as I mentioned, a set of SDKs and libraries that can be used to emit logs, traces, metrics, and other pieces of information. And OpenTelemetry has this fantastic mechanism called auto-instrumentation. Auto-instrumentation is a mechanism that can instrument the libraries automatically, enabling them to send metrics, traces, and logs. Meaning the only thing you need to do is kick OpenTelemetry and say, hey, instrument everything I have, and then it will do the magic.
Okay, so now we want you to instrument your libraries so they send data to the Metis collector, which is a Docker container that runs locally on the same host. But now the question is: okay, how do you trigger this OpenTelemetry? And the best is yet to come. OpenTelemetry can be enabled from outside of the process. You don't need to change the application code at all. All you need to do is set some environment variables and then run your application. And if you have OpenTelemetry in your dependencies, then it's going to work; it's going to trigger itself automatically.
So how do you do that? Well, previously we were asking you to install the Metis SDK, trigger the Metis SDK at your entry point, and then the Metis SDK would take care of hooking into your libraries, extracting queries, using OpenTelemetry to send the data, et cetera. Now we do it completely differently. The only thing you need to do is install OpenTelemetry, which most likely you already have in your applications, because our assumption is that your applications are modern; and if they are modern, then most likely you already have OpenTelemetry. Then you need to trigger this OpenTelemetry and that's it. And triggering OpenTelemetry is as simple as setting some environment variables, and then it goes.
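In Python, for example, that boils down to something like this (the service name and collector endpoint are assumed values; opentelemetry-instrument is the wrapper that comes with OpenTelemetry's Python auto-instrumentation packages):

```python
import os
import subprocess

# Configure OpenTelemetry purely through environment variables...
env = dict(
    os.environ,
    OTEL_SERVICE_NAME="orders-service",                   # assumed name
    OTEL_TRACES_EXPORTER="otlp",
    OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317",  # local collector
)
# ...and launch the app under the wrapper; app.py itself is untouched.
subprocess.run(["opentelemetry-instrument", "python", "app.py"], env=env)
```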
So that's it. Now you just install dependencies once (you do pip install, or add things to your pom file, or whatever else), you put a couple of environment variables in your start script or bootstrap script, and then all the traces are automatically sent to the Docker container that we provide, the Metis collector. You basically need to run this Metis collector somewhere locally, so you can spin it up with Testcontainers or wherever you wish, and that's it. The Metis collector gets those traces; it can extract the SQL queries from them, go to the database, run EXPLAIN, get the execution plan, and send it over to Metis. So this is what we can do. Now, the pros
of this approach: no changes to application code, literally no changes. There is an asterisk here, because it depends on the language you use. You get no changes in Python, no changes in JavaScript, no changes in Java, no changes in .NET, no changes in many languages. But for some languages you do need to make changes; you need to trigger OpenTelemetry manually, for instance in C++, right? So there is an asterisk, but most of the time you don't need to change your application code at all. You don't need to change your database at all either, meaning that we don't need to reconfigure your database anymore; we can just send the EXPLAIN and that's it. And since you don't need to change the database, we can support ephemeral databases, read-only databases, whatever you have; we don't need to touch it.
It integrates with the language, in the sense that the way you enable OpenTelemetry depends on your language and is well integrated with it. It can use things specific to your language: dynamic code execution, additional parameters to the Node runtime, additional parameters to Python, whatever else. So it basically works in an idiomatic way. It can be easily disabled for production, because you just don't let OpenTelemetry send things to us; not to mention you simply don't deploy the collector. And it works. And the best of all worlds is that we
don't own it, meaning that if there is a new version of the ORM library or a new version of the web framework, then it's on them to integrate with OpenTelemetry, because they want to integrate with OpenTelemetry. So if they break something or introduce breaking changes, they are the ones who fix it. From our perspective, nothing changes, because if it doesn't work, it's going to be them who fixes it. The only thing we need to do is maintain this collector that does the magic. However, there are some problems.
Sometimes we need to change the code, depending on the programming language, for instance in C++ or Go. Not all libraries support auto-instrumentation. More and more of them do, and it's obviously for their benefit: mature and popular libraries are integrated with OpenTelemetry and support auto-instrumentation, because it's for their greater good, right? But some of them are not integrated, and in that case we basically need to do some magic, for instance add a few lines of code to extract things with hooks, just like we did before. Sometimes those libraries are integrated with OpenTelemetry in a way we can't use easily, because, for instance, they do not emit parameter values for SQL queries. Whenever you do SELECT * FROM table WHERE column > 10, you don't get this value 10; you only get a placeholder like $1 saying there was a parameter in that place, but you don't get the parameter value. So we need to extract the logs, for instance, and parse those logs to reconstruct the actual query. Sometimes it's hard to correlate the REST with the SQL, because OpenTelemetry can't correlate them, so we can't show you that this was the SQL that was part of this REST API.
Sometimes it doesn't work well with testing frameworks. But generally, this is a pretty good approach, and the most important part is that we don't own it, meaning that if something breaks, the authors of the libraries need to fix it. So, based on those three approaches and on the history of how we evolved those SDKs across many languages, this is what we learned. We learned that uniform functionality is crucial, we learned that version management is crucial, and we learned that diverse languages, idiomatic approaches, and other stuff are hard to keep on track. So let's see what we actually learned. First: uniform functionality.
Whenever you deal with those SDKs (SDKs for different tech stacks, SDKs for different languages, SDKs depending on a particular version of a particular library), you learn that, hey, those languages are different. Some languages have static typing with compile-time type checks; others have dynamic typing, or can change the types of variables, or whatever else. Sometimes you have generics, sometimes you don't. Sometimes you have macros, sometimes you don't. Sometimes you have dependency injection, sometimes you have aspects, and sometimes you don't. Sometimes you can generate code on the fly or even execute code from a string, for instance with eval in JavaScript; sometimes you can't. So whenever you deal with many languages and you want to keep your SDK uniform across them, you need to decide: okay, do I want to embrace the additional features of the programming languages? Or maybe I don't want to do that, and what I want instead is to keep my SDK implementation as primitive as possible, so that all the features our users need can be implemented in every single language. So you don't use generics, you don't use macros, you don't use dynamic code execution, or whatever else, right? Those are things you need to take into account. But sometimes you can't do that.
Sometimes you really need to rely on the particular language, because you need to integrate with the ecosystem of that language. If you need to integrate with ORM hooks, then generally it's not easy, because you need to rely on the particular ORM implementation, right? So there are many things you need to consider when doing these SDKs for many languages. For instance, can you even represent your data structures the same way between languages? If you don't have generics somewhere, then you won't be able to represent those data structures, right? If you have class-based inheritance, like in Python or Java, versus prototype-based inheritance in JavaScript, then how do you implement your data structures the same way? Super hard. Another thing is: can you use the same protocols for communication? Can your SDK communicate over a network using the same protocol? Do you have JSON support in all the languages? Most likely yes. Do you have gRPC support in all the languages? It may get trickier. Do you have, I don't know, some specific proprietary protocol support in all the languages? Definitely not.
You need to think: are there any implementation differences between the languages that would affect how your SDK gets initialized or installed, or what things you can use? Can you use private data? Can you use public data? Et cetera, et cetera. How do you even deal with evolving the schema that you use to communicate between your SDK and your software-as-a-service platform? How do you introduce optional fields? Can you add optional fields as a dictionary, or can you put them in as a pass-through? How is this going to work with the various libraries that deal with that? You need to answer all those questions. Meaning that using an idiomatic approach is generally much harder and much more time consuming, and using specific things like generics, et cetera, is not reusable between languages. So yes, you can implement every SDK differently, but those things are not easy to translate between languages, so it increases the burden of maintenance.
And finally, the documentation. How do you write documentation for different SDKs in different languages? You would like to have the same documentation, right? The same parameters, the same APIs. And to do that, you need to have exactly the same functions across SDKs. So generally, in order to make the functionality uniform and easier to implement and maintain between tech stacks, you need to drop support for many language-specific things, language-specific extensions or constructs or whatever else. To keep it maintainable over time, you need to keep it as simple, as basic, as primitive as possible.
Another thingy is what protocols to use. JSON may sound like a great solution for everything, right? You just use JSON; any language can speak JSON, any language can use it. What could go wrong? Well, implementations of JSON libraries are different. If you need to deserialize data, sometimes you need metadata on your JSON, and sometimes you need to write the deserialization manually, over and over again, for each language. Sometimes JSON is handled poorly in different languages, depending on the escaping, on special characters, on character encoding, et cetera. Not to mention there are even differences in HTTP handling between languages. So JSON, while it sounds easy, is actually hard to maintain over time.
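A tiny Python illustration of that pain: even the standard library refuses common types out of the box, so every language ends up with its own hand-written serialization rules:

```python
import json
from datetime import datetime, timezone

payload = {"query": "SELECT 1", "executed_at": datetime.now(timezone.utc)}
try:
    json.dumps(payload)
except TypeError as err:
    print(err)  # Object of type datetime is not JSON serializable

# The usual fix: a custom encoder here, and a mirror-image decoder elsewhere.
print(json.dumps(payload, default=lambda value: value.isoformat()))
```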
gRPC, on the other hand, is way easier, if you can use it in your language. If you take the languages you want to support and you have gRPC in all of them, it's going to be easier. Why? Because in gRPC you define your schema just once and then you don't care: the gRPC library takes care of generating classes, generating structures, serializing the data, and even minimizing the network usage. So generally, when you need to pick between JSON, other standards, or gRPC, consider things that have bindings for all your languages, where one single entity maintains those bindings. It's way easier to deal with that. You don't need to go and look for a different JSON library in each language; just go with gRPC, or go with something that supports all your languages. And similarly,
the same goes for protocols. Should you
use an open protocol, an open standard, or a proprietary one? Do you want to implement your own protocol and your own data structures, or do you want to take some open standard, like OpenTelemetry for instance? With a proprietary protocol, the biggest advantage is that you can send anything you need, and only the things you need. You don't get noise, and you get the things you need in exactly the shape you need them, right? But the problem with proprietary protocols is that you don't have libraries to deal with them: you need to maintain your data structures, you need to maintain your communication, you need to implement it all. And if you want people to help you, say from the open source world, they won't be able to. So generally, go with open standards, because users will have libraries for them, users will know how to use them, and you don't need to own them. The downside of open standards is that sometimes you need to squeeze your structures into those open definitions so they can be delivered and handled between languages.
So generally, whatever you do, just don't try reinventing the wheel. Don't build your own stuff; keep it basic, use open standards, and that's it. This way you minimize the amount of stuff you need to maintain over time. The next thingy is
version management. We do have semantic versioning, right? We have the major version, the minor version, and the patch version, and we can use them to indicate what has changed. But now comes the problem. Okay, what if I have SDKs in different languages? Do I bump their versions consistently, or do I bump each version independently? How do I know whether the Python SDK in this version supports the same set of features as the JavaScript SDK in that version? How do I correlate all of that? How do I adopt new features from a language if I want to use them? Do I need to bump the version across all the SDKs or just one SDK? How do I keep track of my versions? How do I test the versions? How do I do all of that? There are many things you need to
consider here. How do you add a new feature to all SDKs at once? Do you keep features consistent across SDKs, or do you let them live independently? What's your release cadence? Do you release a new version for all languages at once, or can they go out independently? How do you deal with things like, I don't know, logging across technologies, or with language-specific options? Yeah, all those things are very hard to deal with.
But generally, what we learned is: first, whatever you do, you need to keep your environments tested as much as possible. You need to test the stuff in a reproducible manner, so you can take things and reproduce them locally, in the cloud, or in CI/CD. So generally: Docker, Testcontainers, Nix, and other tricks that maintain those versions for you. You want to run the tests across all languages for every single change. And if you find a bug in one implementation, in the Python implementation for instance, then most likely the same bug is there in the implementations for JavaScript, Java, and whatever else. So always look for the bug in those implementations everywhere; you need to try reproducing these things everywhere.
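As a sketch of what reproducible means here, Testcontainers lets a test pin the exact database version and spin it up the same way on a laptop and in CI/CD (the image tag and query are arbitrary):

```python
import sqlalchemy
from testcontainers.postgres import PostgresContainer

# The Postgres version is pinned explicitly, so every run is identical.
with PostgresContainer("postgres:16") as postgres:
    engine = sqlalchemy.create_engine(postgres.get_connection_url())
    with engine.connect() as conn:
        assert conn.execute(sqlalchemy.text("select 1")).scalar() == 1
```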
And ideally you would like to have a common test set for all the SDKs. So you don't have language-specific tests, and you don't have different sample data for testing. No, you want those tests running uniformly across all the technologies, so it's super easy to maintain, and whenever you need to introduce changes in one place, you know how to apply them in the other places as well. Another thing: consider using a monorepo to keep all the packages and libraries you want to use, and use tools for that; for instance, use Lerna in JavaScript to keep those things in place and under control.
And be explicit about your dependencies. Never use transitive dependencies that you do not control, or versions that can be bumped without you knowing about it, because then things may break accidentally and you'll have no idea why. Be explicit about dependencies. Have as few dependencies as possible, and control the versions of your databases and of your dependencies. Generally, use as few dependencies as you can, so you don't cause conflicts between different versions across your SDKs, or between your SDK and the user's code. So be very explicit about that. Run your tests constantly, and keep them uniform and as simple as possible. Basically, treat the SDKs for all the languages as one single SDK; that's the easiest way to keep it all in shape and maintain it over time. And finally,
the diversity of the languages. Languages are different. And it's not about the languages per se; it's much more about the ecosystems of the languages: dependency management, the quirks of the platform, the way you deploy stuff on the platform, the way things evolve, how .NET Framework changes into .NET Core, how things get dropped and support gets lost. Those things are hard for one person to grasp; one person won't be able to do it, one person can't understand all the ecosystems. That just doesn't work. So what worked for us is having a language champion. For every single language we wanted to support, we had a designated person, a language champion: the person who knew the ecosystem and had to stay up to date, on top of all the changes in the language and in the platform that could affect our SDKs. Whether it was a change in, let's say, ORM versions, or a change in SQL drivers, or a change in dependency management in a given language, or a change in other things like build systems, or a change in the features the language supported, right?
The language champion was supposed to stay on top of that, and they had to push this knowledge and those updates onto the rest of the team. So we had regular updates, weekly meetings, where we discussed: okay, what new things happened that could affect our SDKs, and what broke in those SDKs? What happened in the Python SDK recently that broke it, and that we think could break the other SDKs, or that may affect how we want to evolve our SDKs over time? Having this language champion was what let us actually deal with that stuff. So that was
basically it when it comes to what we did. So in summary: we went through this evolution of SDKs, and the lesson learned, always, is that the best thingy is the thingy you don't own. So minimize the set of things you need to own and maintain. Minimize the number of features, minimize the diversity between languages and the sample data you use; generally, keep it as small as possible and under your control, and test things constantly, in a reproducible manner, as much as possible. And have a language champion who can help you maintain those SDKs across the planet. So this is what we did, and thank you for listening.
Drop me a line if you have any questions. Join our Discord. Take a look at the webpage. My name is Adam Furmanek, and thank you for watching this session. I hope you enjoyed it. Thank you.