Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hi, and welcome to this talk. I'm Francesco Tisiot, developer advocate at Ive.
            
            
            
              In this session we will check how you can build event driven applications using
            
            
            
              Apache, Kafka and Python. If you are here at conference 42,
            
            
            
              I believe you are somehow familiar with Python, but on the other
            
            
            
              side you may start wondering, what is Kafka? Why should
            
            
            
              I care about Kafka? Well, the reality is that
            
            
            
              if you are into Python, you are somehow either creating
            
            
            
              a new shiny app or you inherited an old rusty
            
            
            
              app that you have to support, maybe extend,
            
            
            
              maybe take to the new world, no matter if it's new
            
            
            
              or old, if it's shiny or rusty.
            
            
            
              I've never seen an application, a Python program working in complete
            
            
            
              isolation. You will have companies within
            
            
            
              Python that needs to talk with each other, or you
            
            
            
              are exposing this old application into the word. So you have this
            
            
            
              old application talking with another application. Let me tell you a secret.
            
            
            
              Kafka is a tool that makes this communication
            
            
            
              easy and reliable at scale.
            
            
            
              Why should we use Kafka? Well, we used to
            
            
            
              have kind of what I call the old way of building
            
            
            
              applications, which were an application that at a certain
            
            
            
              point had to write the data somewhere. Where did
            
            
            
              the application write to? Well, usually it was a database,
            
            
            
              but the application wasn't existing to the database.
            
            
            
              Every single record, it was taking records,
            
            
            
              packaging them up, few of them, and then pushing them to the database.
            
            
            
              Or at the same time when it was reading from the database, it was
            
            
            
              reading a set of a batch of records and
            
            
            
              then existing a little bit before rereading the following batch.
            
            
            
              This means that basically every time we were using such
            
            
            
              a way of communicating, we were adding a
            
            
            
              custom delay which was called batch time, between when
            
            
            
              the event was available in the application and when it was pushed on
            
            
            
              the database, or when the events was available in the database
            
            
            
              and when it was read from the application.
            
            
            
              Now we are living in a fast word and we
            
            
            
              cannot wait batch time. You can imagine in batch time going between
            
            
            
              like few seconds or milliseconds to minutes or
            
            
            
              hours, depending on the application and the use case. Now we live
            
            
            
              in a fast world and we don't want to wait the batch time.
            
            
            
              We want to build event driven application. What are those
            
            
            
              application that as soon as an event happens in the real
            
            
            
              life, they want to know about it, they want to start
            
            
            
              parsing it in order to strike the relevant information. And probably
            
            
            
              they want to push the output of their basing to
            
            
            
              another application, which will be more likely another event driven
            
            
            
              application that will create a changing of those application. And we want to do it
            
            
            
              immediately. But let's do a step back.
            
            
            
              Let's try to understand what is an event. We are
            
            
            
              all used to for example mobile phones and we are all used to
            
            
            
              notifications. Notification tell us that an
            
            
            
              event happened. We receive a message. We made a
            
            
            
              payment with our credit card and we received the notification.
            
            
            
              Someone else stole our credit card details
            
            
            
              and made a payment. We receive a word notification.
            
            
            
              As you might understand, we cannot wait batch
            
            
            
              time of 2 hours, ten minutes, five minutes.
            
            
            
              We want to know immediately about someone stolen our credit
            
            
            
              card and we persons, we react as
            
            
            
              an event driven application by immediately phoning
            
            
            
              up our bank to block the credit card. This is why events
            
            
            
              and event event event driven applications important. But even without going into
            
            
            
              the digital world, we are used to events happening in the real life
            
            
            
              since long time. Just imagine where your alarm beeps in
            
            
            
              the morning, you wake up and you act as an event. Event driven applications,
            
            
            
              you are not only receiving passively events, you are creating events. Just think,
            
            
            
              when you change the time of your alarm, that will change your future,
            
            
            
              your actions in the future. Well, going back to mobile
            
            
            
              phones, especially in this time of pandemic, we have all been used to
            
            
            
              for example order food from an app.
            
            
            
              Well from the time that you open the app, you select the restaurant,
            
            
            
              you select which pizzas you want, then you create can order. This will
            
            
            
              create a chain of events because the order will be taken from the app
            
            
            
              and sent to the restaurant which will act as an event driven application
            
            
            
              and create the pizzas for you. And once the pizza is ready, boom. Another event
            
            
            
              for probably the delivery people to come and pick it up
            
            
            
              and take it to your place. So why event
            
            
            
              event event driven applications important? Because as I said, we live
            
            
            
              in a fast word and the value of the information is strictly
            
            
            
              relate to the time that it takes to be delivered. If we go back to
            
            
            
              the credit card example, I cannot wait half an hour
            
            
            
              or 5 hours before knowing that my cart has been stolen.
            
            
            
              I want to know immediately. But even if we talk about food and you know
            
            
            
              I'm italian so I'm a passionate about food. For examples,
            
            
            
              if we are waiting for our pizzas at home and we want to know
            
            
            
              where the pizza is, where the delivery person is, the information
            
            
            
              about the position of the delivery person is useful only if from the
            
            
            
              time that is taken from the person mobile phone to the time that
            
            
            
              it lands on my map on my mobile phone, the delay is minimal as
            
            
            
              10 seconds. I couldn't care lets what the position was ten minutes
            
            
            
              ago. The information has value only if
            
            
            
              delivered on time. So we need to have a way
            
            
            
              to deliver this information, to create this sort of
            
            
            
              communication between components, real time, in real
            
            
            
              time and highly available and at scale. How can
            
            
            
              we do that? Well, we can use Apache kafka. What is Apache
            
            
            
              Kafka? Well, the idea of Apache Kafka is really,
            
            
            
              really simple. The basic idea is the idea of a log file.
            
            
            
              A log file where as soon as an event is
            
            
            
              created, we store them. We store it. So event number zero
            
            
            
              happens. We store it as a message in the log file. Event number one happens.
            
            
            
              We store it after event zero, two, three and four even more.
            
            
            
              Kafka has concept of a log file which is append only
            
            
            
              and immutable. This means that once we store event zero
            
            
            
              in the log, we cannot change it. It's not like a record in a database
            
            
            
              that we can go and update it. Once the event is there,
            
            
            
              it's there. If something changes the reality that is
            
            
            
              represented by the event zero, we will store it
            
            
            
              as a new event in our log. But of course we know that events
            
            
            
              take multiple shapes in different types. Just think about
            
            
            
              I could have events regarding pizza orders and I could have
            
            
            
              other events regarding delivery position. And Kafka allows me to
            
            
            
              store them in different locks logs which in kafka
            
            
            
              terms are called toppings. Even more,
            
            
            
              kafka is not meant to run
            
            
            
              in a huge single server. It's meant to be
            
            
            
              distributed. So this means that when you create a Kafka instance,
            
            
            
              most of the times you will create a cluster of nodes
            
            
            
              which are called, in kafka terms, brokers.
            
            
            
              And the log information will be stored
            
            
            
              across the brokers in your cluster not only one
            
            
            
              time, but multiple times. The number of times that each log
            
            
            
              will be stored in your cluster is defined by a parameter called replication
            
            
            
              factor. In our case we have three companies of the sharp edges
            
            
            
              log. So applications factor of three and two copies of
            
            
            
              the runs edges log. So replication factor of two.
            
            
            
              Why do we store multiple copies of each log?
            
            
            
              Well, because we know that computers are not entirely reliable
            
            
            
              so we could lose a node. But still, as you can see, we are
            
            
            
              not going to lose any data. So let's have a deep look at
            
            
            
              what happens with Kafka. So we have our three
            
            
            
              nodes and let's go to a very simple version of it where we
            
            
            
              have just one topic, our sharp edges topic with two copies.
            
            
            
              So replication factor of two. So now let's assume that
            
            
            
              we have a building node. What happens now?
            
            
            
              Well, Kafka will detect this node whats been failing
            
            
            
              and will check well, which are the logs available in my cluster. Well,
            
            
            
              there is the sharp edges log with only one copy, but a applications
            
            
            
              factor of two. So Kafka at that point will take care of
            
            
            
              creating a second copy in order to keep the number of copies equal to
            
            
            
              the applications factor. So once we have also the second copy,
            
            
            
              even if now we lose a second node, we still are not
            
            
            
              losing any information. So as of now we understood how Kafka
            
            
            
              works and what Kafka is. But Kafka is something
            
            
            
              to store events. What is an event for
            
            
            
              Kafka? Well, for all that matters to Kafka, an event
            
            
            
              is just a key value pair. A key value pair where
            
            
            
              you can put whatever you want in key and value.
            
            
            
              Kafka doesn't care. For Kafka it's just a series of bytes.
            
            
            
              So you could go from very simple use cases where you
            
            
            
              put key, the max temperature label and 35
            
            
            
              three as the value itself. Or you could go wild and you could
            
            
            
              add both in the key and the value JSON formats, explaining the
            
            
            
              restaurant receiving your pizza order and the phone line
            
            
            
              used to make the call, and in the value, the order id,
            
            
            
              the name of the person calling and the list of pizzas. Usually the
            
            
            
              payload. The message size is around one meg and you can use
            
            
            
              like in this case JSON formats, which is really cool because I
            
            
            
              can read through it, but it's on the other side a little bit heavy when
            
            
            
              you push this on wire, because for every field contains both the
            
            
            
              field name and the fill value. If you want to have more compacted
            
            
            
              representation of the same information, you could use formats like Avro
            
            
            
              or protopath, which detach the schema from the payload
            
            
            
              and use a schema registry in order to tell to Kafka how
            
            
            
              I compacted my message. So when I
            
            
            
              read the message I can ask the schema to Kafka
            
            
            
              and I can recreate the message properly.
            
            
            
              So these are just methods for all that matters to Kafka
            
            
            
              you are sending just a series of types. So how can
            
            
            
              you send that series of bytes to Kafka? Well,
            
            
            
              you are trying to write to Kafka and probably in this case we will
            
            
            
              have a Python application existing to Kafka which is called a producer,
            
            
            
              and just remembering the things that we said earlier,
            
            
            
              it writes to a topic or multiple topics. In order to write
            
            
            
              to Kafka, all the producer has to know is where to
            
            
            
              find Kafka, list of hostname and ports, how to authenticate,
            
            
            
              do I use Sl? Do I use SASL? Do you use other methods?
            
            
            
              And then since I have, for example, my information available as
            
            
            
              JSON objects, I need to encode that in order to be
            
            
            
              the row series of bytes that Kafka understands. So I need to know how
            
            
            
              to encode the information on the other side. Once I have my
            
            
            
              data in Kafka, I want to read, and if I want to read from
            
            
            
              a topic with my python application that is called
            
            
            
              a consumer, how the consumer works is that it will read
            
            
            
              the message number zero and then communicate back to Kafka.
            
            
            
              Hey, number zero done. Let's move the offset to number one. It will
            
            
            
              read number one and move the offset to number two. Read number two,
            
            
            
              move the offset to number three. Why moving the
            
            
            
              offset? Communicating back the offset is important. Well,
            
            
            
              because we know, again, computers are not entirely
            
            
            
              reliable, so the consumer could go down. So the next
            
            
            
              time that the consumer pops up, Kafka still knows until
            
            
            
              what point that particular consumer read in that
            
            
            
              particular log. So the next time the consumer will pop up, will probably send
            
            
            
              the item the message number three, because it was the
            
            
            
              first not being read by the consumer. In order to consume
            
            
            
              data from Kafka, the consumer has to know kind of the similar information as
            
            
            
              the producers where to find Kafka Osnam import how to authenticate
            
            
            
              before we were encoding. Now we need to understand how to decode and
            
            
            
              we also need to understand, we need to know which topic or which topics we
            
            
            
              want to read from Kafka. So now it was a
            
            
            
              lot of content. Let's look at the demo.
            
            
            
              What we will look at here is a series of notebooks that I built
            
            
            
              in order to make our life easier. The first notebook that
            
            
            
              I created is actually a notebook that allows me to create automatically
            
            
            
              all the resources that I need for the follow up within
            
            
            
              Ivan, you can run this and you will access to this series of notebooks
            
            
            
              later on. But as of now, let me show you that I pre created
            
            
            
              two instances, one of Kafka and one of postgres that we
            
            
            
              will use later on. With Ivan, you can create
            
            
            
              your open source data platform across many clouds.
            
            
            
              In this case, we created a Kafka. Well, instead of showing you something that
            
            
            
              is already created, let me create a new instance. As you can see,
            
            
            
              you can create not only Kafka, but a lot of other open source data platforms.
            
            
            
              And once you select which data platform you want to create, you can select which
            
            
            
              cloud producers and within the cloud provider, the cloud rigid,
            
            
            
              so you can customize this per units. At the bottom you can also select the
            
            
            
              plan driving the amount of resources and the associated cost,
            
            
            
              which is all inclusive. Finally, you can give a name to the instance and after
            
            
            
              a few minutes, the instance will be up and running for you to have
            
            
            
              a look. The goodies about Ivan is not only that you can create open
            
            
            
              demand, but if you have an instance like this,
            
            
            
              you can upgrade it if a new version
            
            
            
              of Kafka comes up or you can changing the plan to
            
            
            
              upgrade, upscale or downscale. Or you can migrate
            
            
            
              while the service is online, the whole platform to a different region
            
            
            
              within the same cloud, or to a completely new cloud provider. So now
            
            
            
              instead of talking about Ivan, let's talk about how to create
            
            
            
              and produce messages to Kafka. So let's start a producers.
            
            
            
              The first thing that we will do is to install Kafka Python,
            
            
            
              which is the default basic library that allow us to connect to Kafka.
            
            
            
              And then we will create a producer. We create a producer by
            
            
            
              saying where to find Kafka, list of host, name and port,
            
            
            
              how to connect using SSL and three SSL certificates,
            
            
            
              and how to encode the information, how to serialize them. So both
            
            
            
              for key and value we will take the JSON and move it
            
            
            
              to a series of byte encoded in Ashi.
            
            
            
              So let's create this. Okay,
            
            
            
              and now we are ready to send our first message. We will send
            
            
            
              a pizza order again, I'm italian so pizza is key.
            
            
            
              And from myself, an order for myself ordering a
            
            
            
              pizza margarita. Okay, now the order, the message
            
            
            
              is sent to Kafka. How can we be sure about that? Well,
            
            
            
              let's create a consumer now let's
            
            
            
              move the consumer on the right and let's close the list of
            
            
            
              notebooks. We create a consumer. All we
            
            
            
              have to say is apart from the group id that we will check later.
            
            
            
              I'm calling it client one, I can call it whatever I want
            
            
            
              the same properties in order to connect osname,
            
            
            
              import SSL with the three certificates.
            
            
            
              And how do I deserialize now the data from
            
            
            
              the row series of bytes to JSON with the two formula
            
            
            
              CIA. Okay, so let me create the consumer.
            
            
            
              Now I can check which topics are available in Kafka and
            
            
            
              I can check there are some internal topics together with a nice Francesco
            
            
            
              pizza topic that I just created for this purpose.
            
            
            
              I can subscribe to it and now I can start
            
            
            
              reading. We can immediately see two things when
            
            
            
              reading. The first one being whats the
            
            
            
              consumer thread never ends. This is because we
            
            
            
              want to be there, ready as soon as an order comes in
            
            
            
              Kafka, we want to be there, ready to read it. And there is no
            
            
            
              end time in streaming there is no end date.
            
            
            
              We will always be there, ready to consume the data.
            
            
            
              The second thing that we can notice is that even if we send the
            
            
            
              first pizza order from Francesco, we are not receiving it
            
            
            
              in here. Why is that? Well, because by
            
            
            
              default, when a consumer attaches to Kafka, it starts
            
            
            
              consuming. From the time that it attached to Kafka, it doesn't
            
            
            
              go back in history. This is the default behavior and we will see how to
            
            
            
              change this later on. But just bear in mind this is the default.
            
            
            
              In order to show you that the wall pipeline producers consumer
            
            
            
              works, I'm going to send another couple of events I'm
            
            
            
              using to send an order for Adele with pizza y the pineapple
            
            
            
              pizza and an order for mark with pizza with chocolate.
            
            
            
              So I'm italian and both choices, there are not
            
            
            
              what I would call right choices for pizza. However, I respect
            
            
            
              your right to order whatever you want. Just try not to do that in
            
            
            
              Italy. Okay, so let's produce the two orders
            
            
            
              and if everything works, we should see them appearing on the consumer
            
            
            
              side immediately. There we are. We see that both Adele
            
            
            
              and mark orders are working in the consumer side.
            
            
            
              Our pipeline is working. So now let's go back
            
            
            
              to a little bit more slides. Let's talk about the log size.
            
            
            
              We want to send messages to a log. We want to send
            
            
            
              huge messages, huge number of messages to a
            
            
            
              log. But I told you that the log is stored in a broker.
            
            
            
              Is this meaning that we cannot have more messages
            
            
            
              than the bigger disk on the bigger
            
            
            
              server in our cluster? Well, this is not going
            
            
            
              to work well, if we want to send massive amounts of events,
            
            
            
              we don't want to have the trade off between disk space and
            
            
            
              amount of data. We don't want to need to purchase
            
            
            
              huge disk in order to store the wall topic in one disk
            
            
            
              on the other side. We don't want to limit ourselves and the number of events
            
            
            
              that we want to send to a particular topic because of disk space. We are
            
            
            
              lucky because Kafka doesn't impose that trade off on us.
            
            
            
              With Kafka we have the concept of partitions.
            
            
            
              Partition is just a way of taking events of the same time belonging
            
            
            
              to the same topic and divide them into subtopics,
            
            
            
              sub logs. For example, if I have my pizza orders, I could
            
            
            
              partition them based on the restaurant receiving the order because I want
            
            
            
              all the records, for example, for Luigi restaurant being
            
            
            
              the blue being together. But I don't care really what happens between the orders
            
            
            
              of Luigi restaurant and the orders of Mario the yellow one or
            
            
            
              Francesco the red ones. Now, why partitions are
            
            
            
              good for this kind of disk space trade off because the partition is
            
            
            
              what is actually stored on a node. So this means that if we
            
            
            
              want to have a huge amount of events landing
            
            
            
              in a topic, we just need more partition to fit
            
            
            
              the wall topic into smaller disks. And again,
            
            
            
              since it's distributed, we will have them stored across our
            
            
            
              cluster in a number of copies.
            
            
            
              In this case, the number of copies is equal to every partition because it's a
            
            
            
              topic level. And even if we lose a code again,
            
            
            
              we will not lose any data from any of the partitions because it will be
            
            
            
              available in the other copies in the other brokers. We said
            
            
            
              initially that we push data to Kafka and then it
            
            
            
              will be stored in the log forever. Well, this is not entirely
            
            
            
              true, because we can set what are called topic
            
            
            
              retention policies, so we can say for how long we want
            
            
            
              to keep the data in Kafka. We could say that based on time. So we
            
            
            
              can say, well, I want to keep the data on Kafka for two
            
            
            
              weeks, six hour, 30 minutes, or forever.
            
            
            
              Or we can say that based on log size. Basically,
            
            
            
              I want to keep the events in kafka until the
            
            
            
              kafka log reaches 10gb and then delete the oldest
            
            
            
              chunk. I can also use both, and the first threshold that will
            
            
            
              be hit between time and size will dictate when I will delete the
            
            
            
              oldest set of records. So we understood that partitions
            
            
            
              are good. How do you select a partition? Well, usually it's done with the key
            
            
            
              component of the message. And what
            
            
            
              Kafka does by default is that it ashes the key
            
            
            
              and takes the result of the ash in order to select one partition,
            
            
            
              ensuring that messages having the same key always land
            
            
            
              in the same partition. Why this is useful?
            
            
            
              Well, let me show you what happens when you start using partition.
            
            
            
              It's useful for ordering. Let me show you this little example. I have
            
            
            
              my producer which produces data to a topic with two partitions,
            
            
            
              and then I have a consumer. Let's assume a very simple use case
            
            
            
              where I have only three events, blue one happening first,
            
            
            
              yellow one happening second, red one happening third. Now,
            
            
            
              when pushing this data into our topic, the blue
            
            
            
              event will land in partition zero, the yellow event will land in partition
            
            
            
              one, and the red event will be in partition zero again.
            
            
            
              Now, when reading data from the topic, it could happen,
            
            
            
              it will not always be the case, but it could happen that I will read
            
            
            
              events in this order. Blue one first, red 1
            
            
            
              second, yellow one third. So if you check, the global ordering
            
            
            
              is not correct. Why is that? Well, because when we start using partition,
            
            
            
              we have to give up on global ordering. Kafka ensures the correct ordering
            
            
            
              only per partition. So this means whats we have to start
            
            
            
              thinking about for which events,
            
            
            
              for which subset of events the related altering
            
            
            
              is necessary and for which not. If we go back to our pizza order
            
            
            
              example, it makes sense to keep all the orders of
            
            
            
              the same restaurant together because we want to know which person ordered
            
            
            
              before or after the other. But we don't really care if an
            
            
            
              order for Luigi's pizza was done before another order for
            
            
            
              Mario's pizza. So we understood that partitions
            
            
            
              are good because they ease the trade off between disk
            
            
            
              space and log size. But partitions are bad because we
            
            
            
              have to ive up on global ordering. But if you think
            
            
            
              about partitions, and if you think about a log with a single partition,
            
            
            
              it's just one unique thread appending one event
            
            
            
              after the other. And you can think that the throughput is done by the single
            
            
            
              thread doing the work. Now, if we have more partition,
            
            
            
              we have multiple independent threads that can append
            
            
            
              data one after the other. So you can still think, I know that there
            
            
            
              are other components, but the throughput of those three processes
            
            
            
              is roughly three times the throughput of the original
            
            
            
              process. So this means that we can have much more producer producing
            
            
            
              data to Kafka, and also we can have much more threats consuming
            
            
            
              data from Kafka. But still we want to consume all the events of
            
            
            
              a certain topic, but we don't want to consume the same event twice.
            
            
            
              How does Kafka handles that? Well, it does by assigning
            
            
            
              a non overlapping subset of partitions to the consumer.
            
            
            
              If these last few words didn't make a lot of sense for you, well,
            
            
            
              let's check. In this demo we have two consumers and three partitions.
            
            
            
              What Kafka will do is assign the top, the blue
            
            
            
              partition to consumer one, and the yellow and red partition
            
            
            
              to consumer two, ensuring that everything works
            
            
            
              as expected. Even more. Let me just focus
            
            
            
              on this one. If consumer one now
            
            
            
              dies, Kafka will understand that after a timeout and redirect
            
            
            
              the blue arrow from consumer one which died,
            
            
            
              to the consumer which is still available consumer two. So now
            
            
            
              let me show you this behavior in a demo. Let me show you
            
            
            
              partitioning in a demo. And this time let's create
            
            
            
              a new producers which is similar
            
            
            
              to the above, nothing different. Apart from now we
            
            
            
              are using Kafka admin client to connect on the admin
            
            
            
              side of Kafka and create a new topic
            
            
            
              with two partitions. So we will force the number
            
            
            
              of partitions here. Okay, now what we
            
            
            
              want on the other side is we have a producer with
            
            
            
              a topic of two partitions. Let's create two consumers that
            
            
            
              are working one against the other to consume all
            
            
            
              the messages from that topic. So let's move consumer
            
            
            
              one here and consumer two
            
            
            
              there. So I'm saying to Kafka
            
            
            
              that im existing those two consumers to the same topic.
            
            
            
              Since I have two partitions and two consumers,
            
            
            
              what Kafka should do is assign one consumer
            
            
            
              to the top partition and one consumer to the bottom partition.
            
            
            
              If I go now, let me check that the
            
            
            
              top one is started. Let me start also the bottom consumer.
            
            
            
              So all the two consumers are started. Now let me
            
            
            
              go back to the producer here and let me send
            
            
            
              a couple of messages. If you remember what I
            
            
            
              told you before, the partition is selected with the key.
            
            
            
              So Kafka does a hash of the key and selects the partition.
            
            
            
              What I'm doing here, I'm using two records with
            
            
            
              slightly different key ed one ed zero. So I'm expecting them
            
            
            
              to land into two different partitions. Let me try this
            
            
            
              out exactly. So I receive one record
            
            
            
              on the top consumer, one record at the bottom consumer.
            
            
            
              If you wonder what those two flex means, this means that the top consumer is
            
            
            
              reading from partition zero, the offset zero. So the first record
            
            
            
              of partition zero, the bottom consumer is reading from partition one,
            
            
            
              offset zero, first record of partition one. If now I
            
            
            
              send another couple of messages from the
            
            
            
              same producer reusing the same keys since what I
            
            
            
              told you before, that messages with the same key
            
            
            
              will end up in the same partition. Im expecting mark order to
            
            
            
              land in the same partition as Frank because they share the same
            
            
            
              key and the same for Jan and Adele. So let's
            
            
            
              check this out. Runs this and as expected,
            
            
            
              mark is landing in the same partition as Frank and can in the
            
            
            
              same partition as Adele. So offset 1
            
            
            
              second record or partition zero offset 1 second record
            
            
            
              of partition one. So let's check also the latest bit.
            
            
            
              What happens if a consumer now fails?
            
            
            
              Kafka will somehow understand that after a timeout
            
            
            
              and should redirect also partition zero to consumers one.
            
            
            
              So let's check now what happens if I send another two events
            
            
            
              to the topic? I'm expecting to read both of them at the bottom consumer.
            
            
            
              Let's check this out exactly. So now I'm reading both from
            
            
            
              partition zero and partition one with the consumer
            
            
            
              which was left alive. So all as of
            
            
            
              now working as expected. Let's go back to a little bit
            
            
            
              more slides. So as of now we saw a pretty linear
            
            
            
              way of defining data pipelines. We had one or more
            
            
            
              threads of producer Kafka and one or more threads of consumers
            
            
            
              that were fighting one against the other in order to read all
            
            
            
              the messages from a Kafka topic. For example, if we go back to
            
            
            
              the pizza case we had, those two consumers could be the
            
            
            
              two pizza makers that are fighting one against the other in order to consume
            
            
            
              all the orders from the pizza order topic. But they
            
            
            
              don't want to make the same pizza twice. So they don't want to read the
            
            
            
              same pizza order twice. However, when they read a message,
            
            
            
              the message is not deleted from Kafka. This makes it available
            
            
            
              for other applications to read. So for example, I could have my
            
            
            
              billing person that wants to receive a copy of every order in order
            
            
            
              to make the bill and it wants to read from the topic at its
            
            
            
              own pace. How can I manage that? Well, with Kafka it's really simple.
            
            
            
              It's the concept of consumer groups. So I have to define the two
            
            
            
              pizza makers are part of the same consumer group. And then I will create a
            
            
            
              new application called like billing person and we'll
            
            
            
              set that as part of a new consumer group and Kafka
            
            
            
              will understand that it's a new application and we'll start sending a copy of
            
            
            
              the topic data to this new application that will read at its own pace that
            
            
            
              has nothing to do with the two pizza makers. Let's check this
            
            
            
              out again. Let's go back to the notebook. And now
            
            
            
              what we will see is we
            
            
            
              will create a new consumer part of a new consumer group.
            
            
            
              So if we go back to the original consumer we had
            
            
            
              this group id that I told you before. We will check that later. Well,
            
            
            
              it's time now. The top consumer was called pizza makers.
            
            
            
              This is how you event driven applications being part of a consumers group
            
            
            
              on can Android. The new consumer is called bill in person.
            
            
            
              So this is for Kafka. It's a completely new application reading
            
            
            
              data from the topic. If you remember when
            
            
            
              we had this original consumer we managed to attach to the topic
            
            
            
              but read not from the beginning of the topic, from the time
            
            
            
              when we attach to the topic. So we were missing
            
            
            
              the order number one. Now with things new application
            
            
            
              we also say out offset reset equal to earliest. So we
            
            
            
              say we're attaching to a topic in Kafka and we want to read from
            
            
            
              the beginning. So when we now start this new application
            
            
            
              we should receive the two messages above
            
            
            
              plus also the first message, the original message of Francesco
            
            
            
              order. There we are. We are receiving all the three messages since
            
            
            
              we started reading from the beginning. Now if we go to the original
            
            
            
              producer and we now send a new event to
            
            
            
              the original topic, we should receive it in both because those are
            
            
            
              two different application. Let's try this out exactly.
            
            
            
              We will receive down order both in the top and the bottom application.
            
            
            
              Everyone adding the data because there are different application reusing
            
            
            
              the same topic. Now we understood also different
            
            
            
              consumer groups. Let's check a little bit more.
            
            
            
              What we saw so far was us writing some code
            
            
            
              in order to produce data or to consume data. However, it's very
            
            
            
              hard that Kafka will be your first tool, your first data
            
            
            
              tool in your company. You will probably want to integrate
            
            
            
              Kafka with an existing set of data tools,
            
            
            
              databases, data stores, any kind of data tool.
            
            
            
              And believe me, you don't want to write your own code for each of those
            
            
            
              connectors. There is something that solves this problem for you
            
            
            
              and it's called Kafka Connect. Kafka Connect is a pre built framework
            
            
            
              that allows you to take data from existing data sources
            
            
            
              and put them in a Kafka topic, or take data from a
            
            
            
              Kafka topic and push them to a set of data syncs.
            
            
            
              And all is driven by just a config file. And one
            
            
            
              or more threads of Kafka Connect allows you to
            
            
            
              event event driven applications. If we go back to one of the initial slides
            
            
            
              where we had our producer producing data to a database, well,
            
            
            
              we now want to include Kafka in the picture, but still we don't want
            
            
            
              to change the original setup, which is still working. How can we include
            
            
            
              Kafka in the picture? Well, with Kafka Connect and with a change
            
            
            
              data capture solution, we can monitor all changes happening in
            
            
            
              a set of tables in the database and propagate those changes as events,
            
            
            
              as messages in Kafka. Very very easy. But also
            
            
            
              we can use Kafka Connect in order to distribute events.
            
            
            
              So for example, if we have our application that already writes to
            
            
            
              Kafka and we have the data in Kafka in a topic,
            
            
            
              well, is our team needing the data in a
            
            
            
              database? Any JDBC database? With Kafka Connect we just can
            
            
            
              ship the data to a JDBC database. They want another copy to a
            
            
            
              postgres database. There we are. They want a third copy to Bigquery.
            
            
            
              Really easy. They want a fourth copy into s
            
            
            
              three for long term storage. Just another Kafka connect thread.
            
            
            
              And again, what you need to do is just define a
            
            
            
              config file from which topic you want to take and
            
            
            
              where you want to bring them. And with Ivan,
            
            
            
              Kafka Connect is also managed service. So you just have to figure out
            
            
            
              how to write the config file. So let's check also this as a
            
            
            
              demo. If we go back to our notebook
            
            
            
              we can now check the Kafka
            
            
            
              connect one. Let me close a little bit of my notebooks.
            
            
            
              What we are doing here, we are creating a new producer. What we will
            
            
            
              create now is a different topic and within each message
            
            
            
              we are going to send both the schema of the
            
            
            
              key and the value and the payload of the key and the value. Why do
            
            
            
              we do that? Well, because we want to make sure that our Kafka
            
            
            
              connect connector understand the structure of the record and
            
            
            
              populates a downstream application that in this case it's postgres table
            
            
            
              properly. So we define the schema of the key and
            
            
            
              the value. And now we pass, we create three
            
            
            
              records. We push three messages to Kafka containing both the schema
            
            
            
              and the value itself. With Frank ordering a pizza margarita,
            
            
            
              Dan ordering a pizza with fries and Jan ordering a pizza with mushrooms.
            
            
            
              Okay so we push the data into a topic.
            
            
            
              Now we want to take the data from this topic and push that
            
            
            
              to postgres. In order to do that I have everything scripted
            
            
            
              but I want to show you how you can do that with Ivan web
            
            
            
              UI. If I go to config I can
            
            
            
              check that there is a Kafka connect set up waiting for me.
            
            
            
              So this is the config file that you have to write in order to send
            
            
            
              data from Kafka to any other data target. So let's check this out.
            
            
            
              What do we have to show? We have to create send the
            
            
            
              data to a postgres database using this connection using a
            
            
            
              very secure new PG user new password. One, two,
            
            
            
              three. Very secure. And what do we want to send to this
            
            
            
              postgres database? A topic called Francesco
            
            
            
              pizza schema and we are going to call things connector sync
            
            
            
              Kafka postgres. Additionally we want to tell that the
            
            
            
              value is adjacent and we are using a JDBC sync
            
            
            
              connector. We are pushing data to a JDBC and please if
            
            
            
              the target table doesn't exist, autocreate.
            
            
            
              Okay let me show you on the database side.
            
            
            
              Whats have postgres database called 42 whats
            
            
            
              I'm connecting to. It has a default Db there whats schemas?
            
            
            
              The public schema and set of tables which is empty.
            
            
            
              Okay now let me take the config file that
            
            
            
              I've been talking to you about earlier on and let me go
            
            
            
              to Ivan Web Ui. Im going into my kafka
            
            
            
              service. There is a connector tab where I can create a new connector.
            
            
            
              I select a JDB sync and syncing the data to a JDBC database
            
            
            
              and I could fill all the information here. Or since I have my config file
            
            
            
              I can copy and paste into the configuration section and
            
            
            
              things parses the information and fills all the details as
            
            
            
              shown before im creating a connector with name sync Kafka postgres
            
            
            
              classes JDBC sync connector and the value is a JSON database
            
            
            
              and I'm sending the Francesco pizza schema topic.
            
            
            
              So now we can create a new connector and the connector is
            
            
            
              running. So this means that now the data should be available in postgres.
            
            
            
              Let me check that out. We can see that now my database
            
            
            
              has still tables. Let's refresh the list of tables and I
            
            
            
              see my Francesco pizza schema which has the same name as the topic.
            
            
            
              If I double click on it and
            
            
            
              I select the data I can see the three orders from
            
            
            
              Frank can and Jan being populated correctly.
            
            
            
              If I go back to the notebook and let's
            
            
            
              go back to our Kafka connect and send
            
            
            
              another order for Giuseppe order in Pizza y this
            
            
            
              is sent to Kafka. Let's go back to the database and let's
            
            
            
              try to refresh and also the order for Giuseppe ordering pizza
            
            
            
              y is there. So Kafka connect managed to create the table,
            
            
            
              populate the table, and keeps populating the table as soon as a
            
            
            
              new row arrives in the Kafka topic.
            
            
            
              So going back to last few slides, I believe
            
            
            
              in less than an hour we saw a lot of things.
            
            
            
              We saw how to produce and consume data, how to use partitions,
            
            
            
              how to define multiple applications via multiple consumer groups
            
            
            
              and what Kafka Connect is and how to make it work.
            
            
            
              Now if you have more questions or if you have any
            
            
            
              questions, I will give you some extra resources. First of
            
            
            
              all, my twitter handle. You can find me at ftziot. My messages
            
            
            
              are open so if you have any question regarding any of the content that I've
            
            
            
              been talking to you so far, just shoes out. Second thing, if you want a
            
            
            
              copy of the notebooks that I've been showing you, you can find
            
            
            
              them as an open source GitHub repository.
            
            
            
              If you want to try Kafka but you don't have a nice streaming data
            
            
            
              source that you can use to push data to Kafka. Well, I created a
            
            
            
              fake pizza producer which creates even more complex
            
            
            
              fake pizza orders than the one that I've been showing you. The last one is
            
            
            
              if you want to try kafka but you don't have Kafka.
            
            
            
              Well check out Ivan IO because we offer that as managed
            
            
            
              service. So you can just start your instance.
            
            
            
              We will take care of it and you will have only to take care about
            
            
            
              your development. I hope this session was useful for you to
            
            
            
              understand what Kafka is and how you can start using it with Python.
            
            
            
              If you have any questions, I will be always there available for
            
            
            
              you to help. Just reach out on LinkedIn or Twitter. Thank you very much and
            
            
            
              Ciao from Francesco.