Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name's Lorna, I'm a developer advocate at Aiven.
We do interesting open source data platforms
in the cloud. The plan for today's talk
is I'm going to introduce Kafka to you. I'm going to talk
about how you could use Kafka from your Go applications.
We'll talk about data formats and schemas, and then I'll also,
just as a final teaser, introduce you to AsyncAPI.
So let's get started. From the website: Apache Kafka is an open source, distributed event streaming platform.
It's a massively scalable pub sub
mechanism, and that's typically the type of applications that
use Kafka. It's designed for data streaming.
I see it very widely used in real
time or near real time data applications,
for the finance sector, for manufacturing,
for Internet of things as well, things where the
data needs to be on time. Kafka is super
scalable, so it's designed to handle large volumes of data,
and perhaps this should say a very large volume
of small data sets. The one
thing which I think can be surprising about Kafka is that typically
the maximum payload size for a message will be one meg,
which is actually quite a lot of data. If you're sending files, you're doing it
wrong. But in terms of transmitting data from one place
to another, then it's pretty
good. And we'll talk about data formats as well. Kafka is widely
used in event sourcing applications, in data processing applications,
and you'll see it in your observability tools as well,
where it's often used for moving metrics data around, and for log shipping as well.
So now I've said all that, right,
about how modern it is, how well it
scales. Like modern scalable fast.
Remind you of anything?
Right? So Go is a really good fit with Kafka.
I see them used together pretty often, especially in
the more modern event driven systems. I think it's a really good fit.
So a quick theory primer or refresher
depending on your context for Kafka. What do we know about
Kafka? We know it isn't a database and
it isn't a queue, although it
can look a bit like both of those concepts. I think
it's a log. It's a distributed log for data.
And we know about logs. We know that we append
data to logs and that the data that we write there
is then immutable. So we
write it once, we never go back and update a line in a log file.
The storage is a topic, and perhaps
in a database this would be a table, or in a queue, it would be
a channel. The topic is what you write to as a producer.
The producer is what sends the data. It's the publisher,
if you like. And then the consumer, labeled here as C1, is the subscriber. It's what receives that data.
I've talked about topics, but I've oversimplified it
a little bit, because what we've really got is partitions within topics.
When you create a topic, you'll specify how many partitions it should have, and this defines how many shards your data can be spread across.
So usually the partition is defined by the key. You send a key and a value to Kafka, and usually we use the key to choose the partition. You don't have to; you can add some custom logic. But there are two reasons to do this. One is
to spread the data out across your lovely big cluster, right?
Because the partitions can be spread apart from one another.
The other one is that we only allow one consumer per consumer group (consumer groups in a minute) per topic partition. So if there's a lot of consumer work to do,
you need to spread your data out between the
partitions to allow more consumers to get to the work.
So that's one reason. Another reason is that within a partition, order is preserved.
So if you have messages that need to be processed in order,
the same item updating multiple times, then you
need to get them into the same partition. Messages with identical keys
will be routed to identical partitions, or you can have your own
logic for that. These messages could be anything,
really. They might be a login event or a click event from
a web application. They might be a sensor reading,
and we'll see some examples of that later.
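To make the key-to-partition idea concrete, here's a minimal sketch using the confluent-kafka-go client that appears later in the talk; the topic name and key here are invented for illustration:

```go
package factory

import "github.com/confluentinc/confluent-kafka-go/kafka"

// buildReading sketches the idea: messages that share a key always hash to the
// same partition, so readings from one machine stay in order relative to each other.
func buildReading(machineID string, payload []byte) *kafka.Message {
	topic := "machine_sensor" // hypothetical topic name
	return &kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Key:            []byte(machineID), // the partition is chosen by hashing this key
		Value:          payload,
	}
}
```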
So one consumer per consumer group
per topic partition. What's a consumer group?
Right. So we have one consumer per partition,
but we might have more than one application
consuming the same data. So, for example, if we have
a sensor beside a railway and it detects that
a train has gone past, that data might be used
by the train arrival times board.
I'm standing on the platform wondering how late my train is now,
and I get the update because we just found out where the train was.
The same data might be used by the application
that draws the map in the control room. So we know where the train got
to. So different apps can read the same data,
and each one of those applications will be in its own consumer group.
So we have one consumer per consumer group
per topic partition. And in the diagram here, you can see that there's
different consumer groups reading from different partitions here.
One more important thing to mention is the replication factors.
When you create the topic, you'll also configure its replication
factor, and that's how many copies of this data
should exist. Replication works at the partition level.
So you create a topic and you say, I need two copies of this
data at all times, and each of its partitions
will be stored on two different nodes. Usually we're working
with one and the other one's just replicating away. We don't need it. But if
something bad happens to the first node, then we've got that second
copy. The more precious your data is,
the higher the replication factor should be for
that data. If it's something that you don't really need, you don't really need to replicate it. I mean, let's replicate it anyway, because it's really unusual to run Kafka on a single node, so you might as well have two copies. But if you really can't afford to lose it, then set that number to the number of nodes that you have, to give you the best chance of never losing that data.
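As a rough sketch of those two settings (not code from the talk's repo; the broker address and topic name are placeholders), creating a topic with three partitions and a replication factor of two with the confluent-kafka-go admin client looks something like this:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	admin, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Three partitions to spread the consumer work out, two copies of every partition.
	results, err := admin.CreateTopics(ctx, []kafka.TopicSpecification{{
		Topic:             "machine_sensor",
		NumPartitions:     3,
		ReplicationFactor: 2,
	}})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range results {
		log.Printf("%s: %s", r.Topic, r.Error)
	}
}
```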
Let's talk about using Kafka with Go, with our Go applications. So I've got a QR code here and
it's just the hyperlink, but you can scan it quickly if you want. You can
pause the video if you're watching the replay. I'm going to
show you some code and I've sort of
pulled some snippets into the slides. But if you want to play with this and
see it working, it's all there in the GitHub repository.
There's also some python code in the GitHub repository, and probably
you all write more than one language anyway, but also, why not?
One of the points of Kafka is that it's an ideal decoupling
point, an ideal scalability point in our distributed
and highly scalable systems. So it is
tech agnostic, it should be tech agnostic, and we
should be using different producers and consumers with whatever
the right tech stack is for the application that's getting that data
or sending it, receiving it, whichever. So this makes complete
sense to me, but we're going to look at the Go examples specifically today. I'm using three libraries in particular,
and I want to shout out these three libraries, but also the fact that there are a few libraries around and they're actually all great, which doesn't help you choose.
I can't cover them all. I'm not sure I even
have favorites, but I have examples that use these libraries. I'll give a specific
shout out to Sarama from Shopify, which is also pretty great. Historically it was missing something, which means that I usually use the Confluent library, but I have forgotten what that something is and I'm not sure it's even still missing. So I'm using the Confluent library.
They maintain a bunch of SDKs for all the different tech stacks, and you'll see that in the first example. The later examples also use the srclient library for dealing with the schema registry, and also the gogen-avro library, which is going to help me get some Go structs to work with, so that I can more easily go into and out of the data formats that are flowing around the example. And you'll see this all through the example repo: it's an imaginary Internet of Things application for a factory, for an imaginary company called Thingam Industries. They make thingamajigs and thingamabobs.
Examples are hard, but you get the idea. The data
looks like this JSON example here. So each machine
has a bunch of sensors, and those sensors will identify
which machine, which sensor, the current reading, and also the units of that reading, because it makes a big difference.
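The exact slide isn't in this transcript, but a reading of that shape might look like this hypothetical Go struct and its JSON form (the field names are made up):

```go
// MachineSensorReading is a guess at the payload shape described above.
// As JSON it would look like:
// {"machine_id": "machine-1", "sensor": "temperature", "value": 184.2, "units": "C"}
type MachineSensorReading struct {
	MachineID string  `json:"machine_id"` // which machine
	Sensor    string  `json:"sensor"`     // which sensor on that machine
	Value     float64 `json:"value"`      // the current reading
	Units     string  `json:"units"`      // units of the reading, e.g. C or PSI
}
```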
So now you understand what the imaginary story is and the
format of the data that we're dealing with. Let's look at some code examples.
And I'm going to start with the producer that puts the data into the
topic on Kafka. So we create a new producer
and set some config. Line two has the broker URI.
I just have this in an environment variable so that I can easily
update it and not paste it into my slides. The rest
is SSL config. If you are running with the Aiven example, you'll need the SSL certs as shown here. I mean, everything we do at Aiven is open source, right? So you can always run it there or somewhere else; that's why we do it the way we do. Apache Kafka also has quite a cool Docker setup that's reasonably easy to get started with. So if you're running that, it's on localhost
and you don't need the SSL config. So this config will look a little bit different depending on your setup. Then we're handling some errors and deferring the close on that producer, because we're about to use it for something. That's part one of the code.
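Roughly, part one looks like this; it's a sketch rather than the exact slide, and the environment variable name and certificate paths are assumptions:

```go
package main

import (
	"log"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":        os.Getenv("KAFKA_SERVICE_URI"), // broker URI from an environment variable
		"security.protocol":        "ssl",                          // SSL settings for a hosted cluster;
		"ssl.ca.location":          "ca.pem",                       // drop these four for a localhost Docker setup
		"ssl.certificate.location": "service.cert",
		"ssl.key.location":         "service.key",
	})
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}
	defer p.Close()

	// ... part two continues below: build a message and produce it
}
```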
Here's part two. I'm putting some values into a machine sensor struct. We'll talk about the Avro package later. But anyway, this is my struct.
I'm putting some data in it and then I
am turning it into JSON and sending it
off to Kafka. I asked the producer object to produce, and off it goes.
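Part two, sketched as a continuation of the main function above, using the hypothetical MachineSensorReading struct from earlier rather than the repo's Avro-generated one (add encoding/json to the imports):

```go
reading := MachineSensorReading{
	MachineID: "machine-1",
	Sensor:    "temperature",
	Value:     184.2,
	Units:     "C",
}

payload, err := json.Marshal(reading) // turn the struct into JSON bytes
if err != nil {
	log.Fatal(err)
}

topic := "machine_sensor"
err = p.Produce(&kafka.Message{
	TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
	Key:            []byte(reading.MachineID), // same machine, same partition
	Value:          payload,
}, nil)
if err != nil {
	log.Fatal(err)
}
p.Flush(15 * 1000) // give delivery a chance to complete before exiting
```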
The consumer looks honestly quite similar. Here is the consumer.
I ripped out the SSL options to show you that on line three we use the group ID, and that's the consumer group concept that we talked about before. I also have this auto offset reset setting, and by default
when you start consuming from a Kafka
topic, you will get the data that's arriving now. But what
you can do is consume it from the beginning.
So if you need to just re-audit or re-tot up a bunch of transactions, then you can give it this earliest setting
and it'll read from the beginning. If you're doing vanishingly
simple demos for conference talks, then reading from the
beginning means you don't have to have lots of real time things actually working.
You can just produce some messages in one place and then consume them
in another place. The interesting stuff is happening on
line nine, believe it or not. It doesn't look like the most interesting line.
We read the message and it just outputs what we have. So consuming is also reasonably approachable, and you could imagine putting this into your application and wrapping it up in a way that would make sense.
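Pulling the consumer pieces together as another hedged sketch (SSL options omitted, and the topic and group names are invented):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": os.Getenv("KAFKA_SERVICE_URI"),
		"group.id":          "machine-dashboard", // the consumer group
		"auto.offset.reset": "earliest",          // read the topic from the beginning
	})
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"machine_sensor"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := c.ReadMessage(-1) // block until a message arrives
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}
		fmt.Printf("%s: %s\n", msg.TopicPartition, string(msg.Value))
	}
}
```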
So we've had a look at the Go code, but let's talk about some of the other tools in the Kafka space.
I mean, you can just do everything from Go, but there are some other tools that I find useful for diagnosis, or to quickly produce or consume messages, so let's give them a mention.
First up, if you download Kafka, it has useful scripts, like "list the topics now please", "run a consumer on the console now please". It has all of that built in, so that's quite useful and that's definitely a place to start.
I have two other open source tools that are my favorites that I
want to share. First up, I'm going to show you kafkacat, which is a single command line tool that does a bunch of different things. Here you can see it in consumer mode. It's just reading in all the data that the producer you just saw produced. So for a CLI tool (and you'll end up with a little text file with a load of little scripts or some aliases or something), this is one of my favorites.
If the command line is not so much your thing,
and really, who could blame you, then have a look at Kafdrop, which is a web UI. This one comes, again, as a Docker container, so it's super simple to get started: give it some configuration and off you go, and you can poke around at what's going on. Whether you have a localhost Kafka or a somewhere-in-the-cloud Kafka, you can poke at it and see what's in your topic and what's happening. If you are using a cloud hosted solution,
probably they have something. This is a screenshot of the Aiven topic browser that comes just with the online dashboard. So yes,
check out your friendly cloud hosting platform, which may have
options. Let's talk next about schemas,
and I'll open immediately by saying schemas are
not required. So a lot of applications
will use Kafka with no schema at all, and that's fine.
Kafka fundamentally doesn't care what
data you send to it, and it also doesn't care whether
your data is consistent or right or anything.
It's not going to do any validation for you, it's not going to do any
transformation for you. It's rubbish in, rubbish out as far as
Kafka is concerned. So schemas can really help us to get
that right. And I see it as particularly
useful where there are multiple people or teams collaborating on
a project. So schemas are great. I am a fan.
They allow us to describe and then enforce
our data format. Sometimes you will need a
schema. So for example, there are some cool compression formats such as Protobuf or Avro, and both of those require that there is a schema that describes the structure of the data for them to work. They solve the problem in different ways.
With Protobuf, you give it the schema and it generates some code, and you use that code. So the code generator supports whichever tech stacks it supports, and that's it. Avro is a bit more open minded: you send the schema to the schema registry and encode the data, and then the consumer gets the schema back from the schema registry and turns the data back into something you can use. Because Go is
a strongly typed language, schemas fit really naturally. I've always liked schemas, and I've worked with Kafka from a bunch of different tech stacks, but when you come to do it from Go, the schema becomes "oh, we should all do it this way". And it's really influenced the way that I do this with the other tech stacks that I also know and love. So our favorite strongly
typed language just really finds it useful to
be strict about exactly the data fields and
exactly the data types that will be included. I mentioned Avro.
Avro is today's example.
It's something that I've used quite a bit, again because it's tech stack agnostic
and I find the enforced structure really valuable.
Just that guarantee of payload format is good for my sanity. What Avro does is, instead
of including the whole payload verbatim
in every message, it removes the repeated parts
like the field names, and also applies some compression
to it. So your producer has some
Avro capability. You supply a message and your Avro
schema, and it first of all registers the
schema with the schema registry and then
works out which version of the schema we have. Then it creates a
payload which has a bit of information about which version
of the schema we're using, and the serialized form of the message, puts those two things together, and sends it into the topic. The consumer does all that in reverse,
right? Gets the payload, has a look which
version of the schema it needs, fetches that from the schema registry,
and gives you back the message as it was. And that ability
to compress really saves space with the
one meg typical limit, and can be a really efficient
way to transfer things around. So the schema registry
has multiple versions of a schema as things change
on a per topic basis. In my examples, the schema registry is Karapace. It's an open source schema registry. It's an Aiven project; we use it on our cloud hosted version, and I also use it for local stuff as well. There are a few around: Apicurio has one, Confluent has one, there's a bunch. I talked about the schema versions, so I just want to
do a small tangent and talk about
evolving those schemas. I mean, don't change your message format; that way lies madness.
Sometimes things happen. I live in the real world
and I know that sometimes our requirements change. When it happens,
then we need to approach it, and the best way is to do it in a backwards compatible way. So if you need to rename a field, instead add another field with the same value but the new name; we need to keep the old one. You can add an optional field, that's safe as
well, because if it's not there in a previous version, then that's fine.
Every time you make a change, even a backwards compatible change,
it is a new schema, and we'll need to register a new version
with the schema registry. Avro makes all of this easy because it has support for default values and it also supports aliases. Cool. So that's schemas. That's a little
sanity check on how to evolve them if things do happen.
Let's look at an example of the Avro schema.
How do we describe a payload?
Well, it's a record with a name and
a bunch of fields. Avro supports the name of
a field, the type of a field, and also this doc
string. So you can look at this and know, if the field name isn't quite self explanatory (and you know, we all have good intentions, but sometimes things happen), then the doc string can add just a little bit of extra context. Also remember it, because you're going to see it later.
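The slide itself isn't reproduced in this transcript, but an Avro schema of the kind being described looks roughly like this; the field names are illustrative, and the last field shows the sort of defaulted optional field the evolution advice was about:

```json
{
  "type": "record",
  "name": "MachineSensor",
  "namespace": "industries.thingam.factory",
  "fields": [
    {"name": "machine_id", "type": "string", "doc": "Which machine this reading came from"},
    {"name": "sensor", "type": "string", "doc": "Which sensor on that machine"},
    {"name": "value", "type": "double", "doc": "The current reading"},
    {"name": "units", "type": "string", "doc": "Units for the reading, e.g. C or PSI"},
    {"name": "location", "type": ["null", "string"], "default": null,
     "doc": "Optional field added later; the default keeps older messages valid"}
  ]
}
```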
And we can use this schema then to create
a struct using the gogen-avro library. So gogen-avro takes the Avro schema, I've asked
it to make an Avro package here, and it gives me this struct,
which is what you saw in the example code. I can just set values on
this. It knows how to serialize and deserialize itself.
Actually it generates a whole file. The struct comes with a bunch of functionality, so it can be serialized and deserialized. Amazing magic occurs. It's quite a long file, but from the user point of view, from userland, I just go ahead with this machine sensor and set my values or try and read them back. It's pretty cool.
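The generated names and method signatures vary between gogen-avro versions, so treat this as a rough shape of that userland usage rather than the repo's exact code:

```go
// Assuming gogen-avro has generated a MachineSensor type into a local "avro" package.
// The field names mirror the schema and may be cased differently in your generated file.
ms := avro.NewMachineSensor()
ms.Machine_id = "machine-1"
ms.Sensor = "temperature"
ms.Value = 184.2
ms.Units = "C"

var buf bytes.Buffer
if err := ms.Serialize(&buf); err != nil { // binary Avro encoding, no repeated field names
	log.Fatal(err)
}
// buf.Bytes() is what eventually goes into the body of the Kafka message.
```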
I do want to show you the producer with the Avro format and the schema registry piece.
So again, two slides, it didn't quite fit on one. We've got the producer already created. We now have to also connect to this schema registry client, and here you can see that srclient library that I mentioned.
We also get the latest schema for a topic.
I'm just assuming that what I'm building here is the latest registered schema.
You know where the struct came from this time. So we're filling in the values in the struct, and then we're just getting that ready as a bunch of bytes that we can add to the payload. The other piece of that payload is the schema ID. We know the schema ID, so we turn it into a bunch of bytes and assemble it with the main body of the message bytes; put it all together and send it off to Kafka.
So it is a little bit more than you saw in the first example,
but it's quite achievable and sort of fits into the same overall pattern.
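Sketching those two slides under some assumptions: the srclient call shapes below match recent versions of that library (older versions also take an isKey flag on GetLatestSchema), the subject name follows the common topic-name-value convention, and the five-byte header is the standard Confluent wire format. It continues from the producer p and the generated struct ms above, and needs bytes, encoding/binary, and the srclient import:

```go
// Connect to the schema registry and fetch the latest schema for this topic's values.
srClient := srclient.CreateSchemaRegistryClient(os.Getenv("SCHEMA_REGISTRY_URI"))
schema, err := srClient.GetLatestSchema("machine_sensor-value")
if err != nil {
	log.Fatal(err)
}

// Serialize the struct to Avro binary.
var body bytes.Buffer
if err := ms.Serialize(&body); err != nil {
	log.Fatal(err)
}

// Wire format: a zero magic byte, the schema ID as four big-endian bytes,
// then the Avro-encoded message body.
schemaIDBytes := make([]byte, 4)
binary.BigEndian.PutUint32(schemaIDBytes, uint32(schema.ID()))

var payload []byte
payload = append(payload, byte(0))
payload = append(payload, schemaIDBytes...)
payload = append(payload, body.Bytes()...)

topic := "machine_sensor"
err = p.Produce(&kafka.Message{
	TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
	Value:          payload,
}, nil)
if err != nil {
	log.Fatal(err)
}
```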
I've shown you how to describe the payloads for a machine,
but I want to talk a little bit more about building
on that idea of enforcing structure and
turning it into something that's a bit more
human friendly. We don't
so often publish our streaming integrations
publicly to third parties, as we do with more traditional
HTTP APIs. But in a large enough organization,
having even an internal team integrating with you is equivalent to a third party if they're a couple of hops away on the org chart. So what can make that
easier? I'd like to introduce you to AsyncAPI. It's an open standard for describing event driven architectures.
If you are already familiar with OpenAPI, then it's a sister specification to that. If you're not, OpenAPI is another open standard; that one's for describing just the HTTP APIs. AsyncAPI works for all the streaming type platforms, like messagey things, queue-ish things. If you've got WebSockets or MQTT or Kafka, then AsyncAPI is going
to help you with that. For Kafka, we can describe
the brokers, how to authenticate with those endpoints. We can also describe the name of the topics and whether we are publishing or subscribing to those. And we can also then outline the payloads, which is what we did before with the Avro schema.
And this is where it kind of gets interesting,
because AsyncAPI is very much part of the industry. It's something that integrates, and is intended to play nicely, with the other open standards that you're already using. So if you're describing your
payloads with Avro as you've just seen, or CloudEvents, if you're using that, then you can refer to those schemas within your AsyncAPI document.
Once you have that description, then you can go ahead
and do all sorts of things. You can generate code,
you can do automatic integrations, you can generate documentation as
well. Let's look at an example. Here's the interesting bit, basically, from an AsyncAPI document. So straight away you can tell, oh, there's a lot of YAML. But the magic here is in the last line. The $ref syntax is common across at least OpenAPI and AsyncAPI, and it means that you're referring
to another section in the file,
or another section in another file, or a whole other file,
as I am here. AsyncAPI is just as happy to process an Avro schema as it is to process a payload described in AsyncAPI format.
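As a hedged illustration (the channel name and file path here are invented, in AsyncAPI 2.x style), that last line looks like this:

```yaml
channels:
  machine.sensor:
    subscribe:
      message:
        name: machineSensorReading
        schemaFormat: 'application/vnd.apache.avro;version=1.9.0'
        payload:
          $ref: './machine_sensor.avsc'
```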
One thing you can also do once you have the payload in place is to add examples.
The AsyncAPI document encourages examples, and
they say a picture is worth a thousand words, but a good example
is worth at least that many. So that's a feature that
I really appreciate and enjoy. You can generate documentation,
you can see it here with the fields
documented, with the examples, the enum fields,
and then actual examples showing on the right hand side.
You can generate this for no further investment than just creating the
Async API description. And I think there's a lot here that can
really help us to work together. So I've talked today
about Kafka and what it means for us as Go developers, how it can fit alongside our super scalable and performant tech stack. I think if you're working in the Go space, Kafka is well worth your time.
If you have data flowing between components, especially if it's
eventish or there's a lot of it, Kafka can be a really good addition
to your setup if you don't have it already. It's most common
in the banking and manufacturing industries, but only because they are ahead
of most of the rest of the industries in terms of how much
data they need to collect, keep safe and transfer quickly.
It's got applications in a bunch of other industries, and I'd
be really interested to hear what your experiences are or how you get
on if you go and try it. And I hope that I've given you an
intro today that would let you understand what it is
and get started when you have a need for it.
I'll wrap up then by sharing with you some
resources in case you want them, and also how to reach me.
So there's the example repository again, the Thingam Industries one. Everything that you've seen today was copied and pasted out of that repo, so you can see it in its context. You can run the scripts yourself,
that kind of thing. Compulsory shout out for Aiven, who is supporting me to be here. Go to aiven.io. We have Kafka as a service, and whatever other databases you need. I mean, go ahead, we have a free trial. So if you are curious
about Kafka, then that's quite an easy onboarding
way to try it out. And of course if you have questions then I would
love to talk to you about those. I mentioned the schema registry, so that is Karapace; there's the link to the project. It's an open source project. It's one that we use ourselves at Aiven. You're very welcome to use it and also very welcome to contribute. Here's the link to AsyncAPI, asyncapi.com:
it's an open standard, which means it's a community driven
project. We work in the open. The community
meetings are open, the contributions are welcome.
Discussions are all held in the open. If you are working in this space or
you're interested, it's a very welcoming community and
there's plenty of room for more contributions. So if you have ideas
of how this should work, then I would strongly advocate that as a great place
to go and get more involved in this sort of thing. Finally, that's me: lornajane.net. You can find slide decks,
video recordings, blog posts and my
contact details. So if you need to get in touch with me or you're interested
in keeping up with what I'm doing, then that is a good place to
look. I am done. Thank you so much for your attention.
Stay in touch.