Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name's Lorna, I'm a developer advocate at Aiven.
We do interesting open source data platforms
in the cloud. The plan for today's talk
is I'm going to introduce Kafka to you. I'm going to talk
about how you could use Kafka from your Go applications.
We'll talk about data formats and schemas, and then I'll also,
just as a final teaser, introduce you to AsyncAPI.
So let's get started. From the website: Apache Kafka is an open source, distributed event streaming platform.
It's a massively scalable pub sub
mechanism, and that's typically the type of applications that
use Kafka. It's designed for data streaming.
I see it very widely used in real
time or near real time data applications,
for the finance sector, for manufacturing,
for Internet of things as well, things where the
data needs to be on time. Kafka is super
scalable, so it's designed to handle large volumes of data,
and perhaps this should say a very large volume
of small data sets. The one
thing which I think can be surprising about Kafka is that typically
the maximum payload size for a message will be one meg,
which is actually quite a lot of data. If you're sending files, you're doing it
wrong. But in terms of transmitting data from one place
to another, then it's pretty
good. And we'll talk about data formats as well. Kafka is widely
used in event sourcing applications, in data processing applications,
and you'll see it in your observability tools as well,
where it's often used for moving metrics data around, and for log shipping as well.
So now I've said all that, right,
about how modern it is, how well it
scales. Like modern scalable fast.
Remind you of anything?
Right? So Go is a really good fit with Kafka.
I see them used together pretty often, especially in
the more modern event driven systems. I think it's a really good fit.
So a quick theory primer or refresher
depending on your context for Kafka. What do we know about
Kafka? We know it isn't a database and
it isn't a queue, although it
can look a bit like both of those concepts. I think
it's a log. It's a distributed log for data.
And we know about logs. We know that we append
data to logs and that the data that we write there
is then immutable. So we
write it once, we never go back and update a line in a log file.
The storage is a topic, and perhaps
in a database this would be a table, or in a queue, it would be
a channel. The topic is what you write to as a producer.
The producer is what sends the data. It's the publisher,
if you like. And then the consumer, labeled here as C1, is the subscriber. It's what receives that data.
I've talked about topics, but I've oversimplified it
a little bit, because what we've really got is partitions within topics.
When you create a topic, you'll specify how many partitions it should have, and this defines how many shards your data can be spread across.
So usually the partition is defined by the key. You send a key and a value to Kafka, and usually we use the key to choose the partition. You don't have to; you can add some custom logic. But there are two reasons to do this. One is
to spread the data out across your lovely big cluster, right?
Because the partitions can be spread apart from one another.
The other one is that we only allow one consumer per consumer group (consumer groups in a minute) per topic partition. So if there's a lot of consumer work to do,
you need to spread your data out between the
partitions to allow more consumers to get to the work.
So that's one reason. Another reason is that within a partition, order is preserved.
So if you have messages that need to be processed in order,
the same item updating multiple times, then you
need to get them into the same partition. Messages with identical keys
will be routed to identical partitions, or you can have your own
logic for that. These messages could be anything,
really. They might be a login event or a click event from
a web application. They might be a sensor reading,
and we'll see some examples of that later.
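To make the key-to-partition idea concrete, here's a minimal sketch using the confluent-kafka-go client that appears later in the talk; the topic name and key here are invented for illustration:

```go
package factory

import "github.com/confluentinc/confluent-kafka-go/kafka"

// buildReading sketches the idea: messages that share a key always hash to the
// same partition, so readings from one machine stay in order relative to each other.
func buildReading(machineID string, payload []byte) *kafka.Message {
	topic := "machine_sensor" // hypothetical topic name
	return &kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Key:            []byte(machineID), // the partition is chosen by hashing this key
		Value:          payload,
	}
}
```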
So one consumer per consumer group
per topic partition. What's a consumer group?
Right. So we have one consumer per partition,
but we might have more than one application
consuming the same data. So, for example, if we have
a sensor beside a railway and it detects that
a train has gone past, that data might be used
by the train arrival times board.
I'm standing on the platform wondering how late my train is now,
and I get the update because we just found out where the train was.
The same data might be used by the application
that draws the map in the control room. So we know where the train got
to. So different apps can read the same data,
and each one of those applications will be in its own consumer group.
So we have one consumer per consumer group
per topic partition. And in the diagram here, you can see that there's
different consumer groups reading from different partitions here.
One more important thing to mention is the replication factors.
When you create the topic, you'll also configure its replication
factor, and that's how many copies of this data
should exist. Replication works at the partition level.
So you create a topic and you say, I need two copies of this
data at all times, and each of its partitions
will be stored on two different nodes. Usually we're working
with one and the other one's just replicating away. We don't need it. But if
something bad happens to the first node, then we've got that second
copy. The more precious your data is,
the higher the replication factor should be for
that data. If it's something that you don't really need, you don't really need to replicate it. I mean, let's replicate it anyway, because it's really unusual to run Kafka on a single node, so you might as well have two copies. But if you really can't afford to lose it, then set that number to the number of nodes that you have, to give you the best chance of never losing that data.
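As a rough sketch of those two settings (not code from the talk's repo; the broker address and topic name are placeholders), creating a topic with three partitions and a replication factor of two with the confluent-kafka-go admin client looks something like this:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	admin, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Three partitions to spread the consumer work out, two copies of every partition.
	results, err := admin.CreateTopics(ctx, []kafka.TopicSpecification{{
		Topic:             "machine_sensor",
		NumPartitions:     3,
		ReplicationFactor: 2,
	}})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range results {
		log.Printf("%s: %s", r.Topic, r.Error)
	}
}
```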
Let's talk about using Kafka with Go, with our Go applications. So I've got a QR code here and
it's just the hyperlink, but you can scan it quickly if you want. You can
pause the video if you're watching the replay. I'm going to
show you some code and I've sort of
pulled some snippets into the slides. But if you want to play with this and
see it working, it's all there in the GitHub repository.
There's also some python code in the GitHub repository, and probably
you all write more than one language anyway, but also, why not?
One of the points of Kafka is that it's an ideal decoupling
point, an ideal scalability point in our distributed
and highly scalable systems. So it is
tech agnostic, it should be tech agnostic, and we
should be using different producers and consumers with whatever
the right tech stack is for the application that's getting that data
or sending it, receiving it, whichever. So this makes complete
sense to me, but we're going to look at the Go examples specifically today. I'm using three libraries in particular,
and I want to shout out these three libraries, but also the fact that there are a few libraries around and they're actually all great, which doesn't help you choose.
I can't cover them all. I'm not sure I even
have favorites, but I have examples that use these libraries. I'll give a specific
shout out to Sarama from Shopify, which is also pretty great. Historically it was missing something, which means that I usually use the Confluent library, but I have forgotten what that something is and I'm not sure it's even still missing. So I'm using the Confluent library.
They maintain a bunch of SDKs for all the different tech stacks, and you'll see that in the first example. The later examples also use the srclient library for dealing with the schema registry, and also the gogen-avro library, which is going to help me get some Go structs to work with, so that I can more easily go into and out of the data formats that are flowing around the example. And you'll see this all through the example repo: it's an imaginary Internet of Things application for a factory, for an imaginary company called Thingam Industries. They make thingamajigs and thingamabobs.
Examples are hard, but you get the idea. The data
looks like this JSON example here. So each machine
has a bunch of sensors, and those sensors will identify
which machine, which sensor, the current reading, and also the units of that reading, because it makes a big difference.
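The exact slide isn't in this transcript, but a reading of that shape might look like this hypothetical Go struct and its JSON form (the field names are made up):

```go
// MachineSensorReading is a guess at the payload shape described above.
// As JSON it would look like:
// {"machine_id": "machine-1", "sensor": "temperature", "value": 184.2, "units": "C"}
type MachineSensorReading struct {
	MachineID string  `json:"machine_id"` // which machine
	Sensor    string  `json:"sensor"`     // which sensor on that machine
	Value     float64 `json:"value"`      // the current reading
	Units     string  `json:"units"`      // units of the reading, e.g. C or PSI
}
```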
So now you understand what the imaginary story is and the
format of the data that we're dealing with. Let's look at some code examples.
And I'm going to start with the producer that puts the data into the
topic on Kafka. So we create a new producer
and set some config. Line two has the broker URI.
I just have this in an environment variable so that I can easily
update it and not paste it into my slides. The rest
is SSL config. If you are running with the Aiven example, you'll need the SSL certs as shown here. I mean, everything we do at Aiven is open source, right? So you can always run it there or somewhere else; that's why we do it the way we do. Apache Kafka also has quite a cool Docker setup that's reasonably easy to get started with. So if you're running that, it's on localhost
and you don't need the SSL config. So this config will look a little bit different depending on your setup. Then we're handling some errors and deferring the close on that producer, because we're about to use it for something. That's part one of the code.
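Roughly, part one looks like this; it's a sketch rather than the exact slide, and the environment variable name and certificate paths are assumptions:

```go
package main

import (
	"log"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":        os.Getenv("KAFKA_SERVICE_URI"), // broker URI from an environment variable
		"security.protocol":        "ssl",                          // SSL settings for a hosted cluster;
		"ssl.ca.location":          "ca.pem",                       // drop these four for a localhost Docker setup
		"ssl.certificate.location": "service.cert",
		"ssl.key.location":         "service.key",
	})
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}
	defer p.Close()

	// ... part two continues below: build a message and produce it
}
```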
Here's part two. I'm putting some values into a machine sensor struct. We'll talk about the Avro package later. But anyway, this is my struct.
I'm putting some data in it and then I
am turning it into JSON and sending it
off to Kafka. I asked the producer object to produce, and off it goes.
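Part two, sketched as a continuation of the main function above, using the hypothetical MachineSensorReading struct from earlier rather than the repo's Avro-generated one (add encoding/json to the imports):

```go
reading := MachineSensorReading{
	MachineID: "machine-1",
	Sensor:    "temperature",
	Value:     184.2,
	Units:     "C",
}

payload, err := json.Marshal(reading) // turn the struct into JSON bytes
if err != nil {
	log.Fatal(err)
}

topic := "machine_sensor"
err = p.Produce(&kafka.Message{
	TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
	Key:            []byte(reading.MachineID), // same machine, same partition
	Value:          payload,
}, nil)
if err != nil {
	log.Fatal(err)
}
p.Flush(15 * 1000) // give delivery a chance to complete before exiting
```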
The consumer looks honestly quite similar. Here is the consumer.
I ripped out the SSL options to show you that on line three we use the group ID, and that's the consumer group concept that we talked about before. I also have this auto offset reset setting, and by default
when you start consuming from a Kafka
topic, you will get the data that's arriving now. But what
you can do is consume it from the beginning.
So if you need to just re-audit or re-tot up a bunch of transactions, then you can give it this earliest setting
and it'll read from the beginning. If you're doing vanishingly
simple demos for conference talks, then reading from the
beginning means you don't have to have lots of real time things actually working.
You can just produce some messages in one place and then consume them
in another place. The interesting stuff is happening on
line nine, believe it or not. It doesn't look like the most interesting line.
We read the message and it just outputs what we have. So consuming is also reasonably approachable, and you could imagine putting this into your application and wrapping it up in a way that would make sense.
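Pulling the consumer pieces together as another hedged sketch (SSL options omitted, and the topic and group names are invented):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": os.Getenv("KAFKA_SERVICE_URI"),
		"group.id":          "machine-dashboard", // the consumer group
		"auto.offset.reset": "earliest",          // read the topic from the beginning
	})
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"machine_sensor"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := c.ReadMessage(-1) // block until a message arrives
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}
		fmt.Printf("%s: %s\n", msg.TopicPartition, string(msg.Value))
	}
}
```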
So we've had a look at the Go code, but let's talk about some of the other tools in the Kafka space.
I mean, you can just do everything from Go, but there are some other tools that I find useful for diagnosis, or to quickly produce or consume messages, so let's give them a mention.
First up, if you download Kafka, it has useful scripts, like "list the topics now please", "run a consumer on the console now please". It has all of that built in, so that's quite useful and that's definitely a place to start.
I have two other open source tools that are my favorites that I
want to share. First up, I'm going to show you kafkacat, which is a single command line tool that does a bunch of different things. Here you can see it in consumer mode. It's just reading in all the data that the producer you just saw produced. So for a CLI tool (and you'll end up with a little text file with a load of little scripts or some aliases or something), this is one of my favorites.
If the command line is not so much your thing,
and really, who could blame you, then have a look at Kafdrop, which is a web UI. This one comes, again, as a Docker container, so it's super simple to get started: give it some configuration and off you go, and you can poke around at what's going on. Whether you have a localhost Kafka or a somewhere-in-the-cloud Kafka, you can poke at it and see what's in your topic and what's happening. If you are using a cloud hosted solution,
probably they have something. This is a screenshot of the Aiven topic browser that comes just with the online dashboard. So yes,
check out your friendly cloud hosting platform, which may have
options. Let's talk next about schemas,
and I'll open immediately by saying schemas are
not required. So a lot of applications
will use Kafka with no schema at all, and that's fine.
Kafka fundamentally doesn't care what
data you send to it, and it also doesn't care whether
your data is consistent or right or anything.
It's not going to do any validation for you, it's not going to do any
transformation for you. It's rubbish in, rubbish out as far as
Kafka is concerned. So schemas can really help us to get
that right. And I see it as particularly
useful where there are multiple people or teams collaborating on
a project. So schemas are great. I am a fan.
They allow us to describe and then enforce
our data format. Sometimes you will need a
schema. So for example, there are some cool compression formats such as Protobuf or Avro, and both of those require that there is a schema that describes the structure of the data for them to work. They solve the problem in different ways.
With Protobuf, you give it the schema and it generates some code, and you use that code. So the code generator supports whichever tech stacks it supports, and that's it. Avro is a bit more open minded: you send the schema to the schema registry and encode the data, and then the consumer gets the schema back from the schema registry and turns the data back into something you can use. Because Go is
a strongly typed language, schemas fit really naturally. I've always liked schemas, and I've worked with Kafka from a bunch of different tech stacks, but when you come to do it from Go, the schema becomes "oh, we should all do it this way". And it's really influenced the way that I do this with the other tech stacks that I also know and love. So our favorite strongly
typed language just really finds it useful to
be strict about exactly the data fields and
exactly the data types that will be included. I mentioned Avro.
Avro is today's example.
It's something that I've used quite a bit, again because it's tech stack agnostic
and I find the enforced structure really valuable.
Just that guarantee of payload format is good for my sanity. What Avro does is, instead
of including the whole payload verbatim
in every message, it removes the repeated parts
like the field names, and also applies some compression
to it. So your producer has some
Avro capability. You supply a message and your Avro
schema, and it first of all registers the
schema with the schema registry and then
works out which version of the schema we have. Then it creates a
payload which has a bit of information about which version
of the schema we're using, and the serialized form of the message, puts those two things together, and sends it into the topic. The consumer does all that in reverse,
right? Gets the payload, has a look which
version of the schema it needs, fetches that from the schema registry,
and gives you back the message as it was. And that ability
to compress really saves space with the
one meg typical limit, and can be a really efficient
way to transfer things around. So the schema registry
has multiple versions of a schema as things change
on a per topic basis. In my examples, the schema registry is Karapace. It's an open source schema registry. It's an Aiven project; we use it on our cloud hosted version, and I also use it for local stuff as well. There are a few around: Apicurio has one, Confluent has one, there's a bunch. I talked about the schema versions, so I just want to
do a small tangent and talk about
evolving those schemas. I mean, don't change your message format; that way lies madness.
Sometimes things happen. I live in the real world
and I know that sometimes our requirements change. When it happens,
then we need to approach it, and the best way is to do it in a backwards compatible way. So if you need to rename a field, instead add another field with the same value but the new name; we need to keep the old one. You can add an optional field, that's safe as
well, because if it's not there in a previous version, then that's fine.
Every time you make a change, even a backwards compatible change,
it is a new schema, and we'll need to register a new version
with the schema registry. Avro makes all of this easy because it has support for default values and it also supports aliases. Cool. So that's schemas. That's a little
sanity check on how to evolve them if things do happen.
Let's look at an example of the Avro schema.
How do we describe a payload?
Well, it's a record with a name and
a bunch of fields. Avro supports the name of
a field, the type of a field, and also this doc
string. So you can look at this and know, if the field name isn't quite self explanatory (and you know, we all have good intentions, but sometimes things happen), then the doc string can add just a little bit of extra context. Also remember it, because you're going to see it later.
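The slide itself isn't reproduced in this transcript, but an Avro schema of the kind being described looks roughly like this; the field names are illustrative, and the last field shows the sort of defaulted optional field the evolution advice was about:

```json
{
  "type": "record",
  "name": "MachineSensor",
  "namespace": "industries.thingam.factory",
  "fields": [
    {"name": "machine_id", "type": "string", "doc": "Which machine this reading came from"},
    {"name": "sensor", "type": "string", "doc": "Which sensor on that machine"},
    {"name": "value", "type": "double", "doc": "The current reading"},
    {"name": "units", "type": "string", "doc": "Units for the reading, e.g. C or PSI"},
    {"name": "location", "type": ["null", "string"], "default": null,
     "doc": "Optional field added later; the default keeps older messages valid"}
  ]
}
```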
And we can use this schema then to create
a struct using the gogen-avro library. So gogen-avro takes the Avro schema, I've asked
it to make an Avro package here, and it gives me this struct,
which is what you saw in the example code. I can just set values on
this. It knows how to serialize and deserialize itself.
Actually it generates a whole file. The struct comes with a bunch of functionality, so it can be serialized and deserialized. Amazing magic occurs. It's quite a long file, but from the user point of view, from userland, I just go ahead with this machine sensor and set my values or try and read them back. It's pretty cool.
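The generated names and method signatures vary between gogen-avro versions, so treat this as a rough shape of that userland usage rather than the repo's exact code:

```go
// Assuming gogen-avro has generated a MachineSensor type into a local "avro" package.
// The field names mirror the schema and may be cased differently in your generated file.
ms := avro.NewMachineSensor()
ms.Machine_id = "machine-1"
ms.Sensor = "temperature"
ms.Value = 184.2
ms.Units = "C"

var buf bytes.Buffer
if err := ms.Serialize(&buf); err != nil { // binary Avro encoding, no repeated field names
	log.Fatal(err)
}
// buf.Bytes() is what eventually goes into the body of the Kafka message.
```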
I do want to show you the producer with the Avro format and the schema registry piece.
So again, two slides, it didn't quite fit on one. We've got the producer already created. We now have to also connect to this schema registry client, and here you can see that srclient library that I mentioned.
We also get the latest schema for a topic.
I'm just assuming that what I'm building here is the latest registered schema.
You know where the struct came from this time. So we're filling in the values in the struct, and then we're just getting that ready as a bunch of bytes that we can add to the payload. The other piece of that payload is the schema ID. We know the schema ID, so we turn it into a bunch of bytes and assemble it with the main body of the message bytes; put it all together and send it off to Kafka.
So it is a little bit more than you saw in the first example,
but it's quite achievable and sort of fits into the same overall pattern.
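Sketching those two slides under some assumptions: the srclient call shapes below match recent versions of that library (older versions also take an isKey flag on GetLatestSchema), the subject name follows the common topic-name-value convention, and the five-byte header is the standard Confluent wire format. It continues from the producer p and the generated struct ms above, and needs bytes, encoding/binary, and the srclient import:

```go
// Connect to the schema registry and fetch the latest schema for this topic's values.
srClient := srclient.CreateSchemaRegistryClient(os.Getenv("SCHEMA_REGISTRY_URI"))
schema, err := srClient.GetLatestSchema("machine_sensor-value")
if err != nil {
	log.Fatal(err)
}

// Serialize the struct to Avro binary.
var body bytes.Buffer
if err := ms.Serialize(&body); err != nil {
	log.Fatal(err)
}

// Wire format: a zero magic byte, the schema ID as four big-endian bytes,
// then the Avro-encoded message body.
schemaIDBytes := make([]byte, 4)
binary.BigEndian.PutUint32(schemaIDBytes, uint32(schema.ID()))

var payload []byte
payload = append(payload, byte(0))
payload = append(payload, schemaIDBytes...)
payload = append(payload, body.Bytes()...)

topic := "machine_sensor"
err = p.Produce(&kafka.Message{
	TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
	Value:          payload,
}, nil)
if err != nil {
	log.Fatal(err)
}
```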
I've shown you how to describe the payloads for a machine,
but I want to talk a little bit more about building
on that idea of enforcing structure and
turning it into something that's a bit more
human friendly. We don't
so often publish our streaming integrations
publicly to third parties, as we do with more traditional
HTTP APIs. But in a large enough organization,
having even an internal team integrating with you is equivalent to a third party if they're a couple of hops away on the org chart. So what can make that
easier? I'd like to introduce you to AsyncAPI. It's an open standard for describing event driven architectures.
If you are already familiar with OpenAPI, then it's a sister specification to that. If you're not, OpenAPI is another open standard; that one's for describing just the HTTP APIs. AsyncAPI works for all the streaming type platforms, like messagey things, queue-ish things. If you've got WebSockets or MQTT or Kafka, then AsyncAPI is going
to help you with that. For Kafka, we can describe
the brokers, how to authenticate with those endpoints. We can also describe the name of the topics and whether we are publishing or subscribing to those. And we can also then outline the payloads, which is what we did before with the Avro schema.
And this is where it kind of gets interesting,
because AsyncAPI is very much part of the industry. It's something that integrates, and is intended to play nicely, with the other open standards that you're already using. So if you're describing your
payloads with Avro as you've just seen, or CloudEvents, if you're using that, then you can refer to those schemas within your AsyncAPI document.
Once you have that description, then you can go ahead
and do all sorts of things. You can generate code,
you can do automatic integrations, you can generate documentation as
well. Let's look at an example. Here's the interesting bit, basically, from an AsyncAPI document. So straight away you can tell, oh, there's a lot of YAML. But the magic here is in the last line. The $ref syntax is common across at least OpenAPI and AsyncAPI, and it means that you're referring
to another section in the file,
or another section in another file, or a whole other file,
as I am here. AsyncAPI is just as happy to process an Avro schema as it is to process a payload described in AsyncAPI format.
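As a hedged illustration (the channel name and file path here are invented, in AsyncAPI 2.x style), that last line looks like this:

```yaml
channels:
  machine.sensor:
    subscribe:
      message:
        name: machineSensorReading
        schemaFormat: 'application/vnd.apache.avro;version=1.9.0'
        payload:
          $ref: './machine_sensor.avsc'
```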
One thing you can also do once you have the payload in place is to add examples.
The AsyncAPI document encourages examples, and
they say a picture is worth a thousand words, but a good example
is worth at least that many. So that's a feature that
I really appreciate and enjoy. You can generate documentation,
you can see it here with the fields
documented, with the examples, the enum fields,
and then actual examples showing on the right hand side.
You can generate this for no further investment than just creating the
Async API description. And I think there's a lot here that can
really help us to work together. So I've talked today
about Kafka and what it means for us as Go developers, how it can fit alongside our super scalable and performant tech stack. I think if you're working in the Go space, Kafka is well worth your time.
If you have data flowing between components, especially if it's
eventish or there's a lot of it, Kafka can be a really good addition
to your setup if you don't have it already. It's most common
in the banking and manufacturing industries, but only because they are ahead
of most of the rest of the industries in terms of how much
data they need to collect, keep safe and transfer quickly.
It's got applications in a bunch of other industries, and I'd
be really interested to hear what your experiences are or how you get
on if you go and try it. And I hope that I've given you an
intro today that would let you understand what it is
and get started when you have a need for it.
I'll wrap up then by sharing with you some
resources in case you want them, and also how to reach me.
So there's the example repository again, the Thingam Industries one. Everything that you've seen today was copied and pasted out of that repo, so you can see it in its context. You can run the scripts yourself,
that kind of thing. Compulsory shout out for Aiven, who is supporting me to be here. Go to aiven.io. We have Kafka as a service, and whatever other databases you need. I mean, go ahead, we have a free trial. So if you are curious
about Kafka, then that's quite an easy onboarding
way to try it out. And of course if you have questions then I would
love to talk to you about those. I mentioned the schema registry, so that is Karapace; there's the link to the project. It's an open source project. It's one that we use ourselves at Aiven. You're very welcome to use it and also very welcome to contribute. Here's the link to AsyncAPI, asyncapi.com:
it's an open standard, which means it's a community driven
project. We work in the open. The community
meetings are open, the contributions are welcome.
Discussions are all held in the open. If you are working in this space or
you're interested, it's a very welcoming community and
there's plenty of room for more contributions. So if you have ideas
of how this should work, then I would strongly advocate that as a great place
to go and get more involved in this sort of thing. Finally, that's me: lornajane.net. You can find slide decks,
video recordings, blog posts and my
contact details. So if you need to get in touch with me or you're interested
in keeping up with what I'm doing, then that is a good place to
look. I am done. Thank you so much for your attention.
Stay in touch.