Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this talk is called Compiling Containers. I'm going to
talk about how container images are made. I'm going
to talk about this tool that I think is pretty cool, called BuildKit. I'm going
to teach you a little bit about compilers, do a code demo, and
talk a little bit about the future of containerization and cloud
native stuff. So I hope you enjoy it. I'm Adam Gordon Bell.
I am an open source software developer.
I am a Canadian. I am a podcaster.
I have a podcast about software engineering; you can see it there on the right.
If you want to check it out, just search for my name in your podcast player.
I'm a developer advocate at Earthly,
where we are trying to improve builds.
I'm not going to talk about Earthly too much today, but you should
check it out at earthly.dev. You've probably heard a lot
about container runtimes in recent history: Kubernetes stopped
using the dockershim and started talking to runtimes over CRI.
A lot of people freaked out, but really there are a lot of these runtimes.
You can see a bunch of them here: runc, containerd, et cetera.
The various functionality of each of these doesn't really matter for today.
For this talk I'm going to be talking about container
images, OCI images, to be specific. All of these
container runtimes can run OCI images. And what I want to talk about is
how you make this image, how you make the image that turns into a
running container. We're going to get a little bit down in the weeds about
how that is done. So just a little background information: you probably know
what a container is. It's like a virtual machine, except you share
the operating system kernel and some other stuff, so it has lower overhead.
That's the ten-second version. But do you know the difference between a
container and an image? So the way I like to think
about it is via analogy, if you think about an executable, an executable
is a program that you can run on your computer. And when you launch it,
it becomes a process, and you can launch as many processes
as you want from that single executable. Similarly,
an image is something that's not running, and when you run that
image, it becomes a container and you can run many containers from
a single image. So sort of an executable is to an image as
a process is to a container. You could imagine where a
virtual machine fits into this as well: an image is like a VM image,
and a container is like a running VM, except that when you make changes to
a running VM, you can save them back to the image. Docker
images are immutable, so any changes that
happen while a container is running won't actually affect the image.
So that executable turns into a process. But how do we get that executable?
To get the executable we use a compiler. So here is
an example. I have this hello world program written in C; you can see it
here on the left. I'm going to pass it to a compiler. I'm going to
use LLVM with the Clang frontend. This hello world
example comes from the book The C Programming Language by Brian Kernighan
and Dennis Ritchie, and we're still using it as an example today in
many languages. So I'm going to compile it to x86, and
it ends up being like this code we see
here on the right, which is x86 assembly. This is the
assembly code for what will be executed when you run this program.
Assembly code maps basically one to one to machine code,
which is binary, which we're not showing here because it's much harder
to read. But the machine code is what actually is executed on your cpu.
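As a concrete version of this step, here is the classic hello world and the Clang invocation that lowers it to assembly. A minimal sketch: the file names are mine, and the code is modernized slightly from the K&R original.

```shell
# Write out the classic hello world program.
cat > hello.c <<'EOF'
#include <stdio.h>

int main(void) {
    printf("hello, world\n");
    return 0;
}
EOF

# Lower it to assembly with the Clang frontend to LLVM.
# -S stops after code generation and emits human-readable assembly.
command -v clang >/dev/null 2>&1 && clang -S hello.c -o hello.s || true
```

On a machine without Clang installed, the last line simply does nothing; with Clang, you get hello.s, the assembly for your host architecture.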
It's what your CPU architecture understands. So the
Docker equivalent of this simple hello world program is something
like this: we start FROM alpine, we copy
in a readme file, and then we echo some results to
a file called build.txt. To compile
that we use docker build: docker build . -t test,
where the dot is our present working directory and -t tags the
image as test. So creating an image from a Dockerfile,
it sort of works the same way as our hello world example. The Dockerfile
is passed to BuildKit along with the build context,
which in our case was the present working directory, because we used the dot.
Generally each statement is turned into a layer as part of this
build step, and those layers are packaged up into an image.
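A reconstruction of the Dockerfile being described; the exact file names in the demo may differ, so README.md and build.txt are my guesses.

```shell
# Recreate the hello-world-style Dockerfile from the talk.
cat > Dockerfile <<'EOF'
FROM alpine
COPY README.md README.md
RUN echo "standard docker build" > build.txt
EOF

# "Compile" it where Docker is available; . is the build context,
# and -t tags the resulting image as "test":
#   docker build -t test .
```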
One thing that's different from a traditional compiler is that with the compiler you're
just taking in the source code, but with a Docker build you're also taking in the
file system context, and that's how you can do things like this COPY command:
you can take things from the file system that are referenced when
you pass it in. So in this case we pass in our present working directory
and we use it to copy in files. So comparing our
C compiling to our Docker building, we can
see they look quite similar. You have the same type of
steps: starting with your input, you have the compiler, which is
LLVM or BuildKit, and then you end up with your executable or image.
So what is BuildKit? BuildKit is kind of
the brains behind docker build, and it drives not just
docker build but a number of interesting cloud native projects. One is OpenFaaS,
open functions as a service, which as I understand it
is a project to run serverless functions on Kubernetes. There is Rancher's
Rio, which is an application deployment engine for the cloud.
And there are probably tens of other projects that use
BuildKit. BuildKit is what we're going to dive into a little bit today. I want
to give you an intuition for how these images are made,
which I think could be pretty useful. But first let's talk a little bit more
about compilers. So earlier I had this diagram again with the hello world,
and we generate this assembly. Saying compilers work this way is kind of
a lie; it's a simplification. On the left here is the
PDP-11. That is the first computer that a
C compiler existed for. That compiler was written by Dennis Ritchie.
And the simple diagram that I showed earlier, that actually is
exactly what would happen on the PDP-11: the C code would be taken by
the compiler and converted into PDP-11 assembly instructions.
But this direct mapping poses a problem. Whenever you have a new machine,
you need a new compiler. And what language do you write that compiler
in? The first C compiler, as I said, was written in PDP-11 assembly,
but you don't want to have to repeat that for every new machine architecture,
for every new CPU. What happens when the VAX-11 comes out,
or the new Apple M1 that you see here on the right? So this
problem was very quickly solved by compiler authors, who came up
with this structure. They said a compiler can be split into
stages: you have a front end, you have an optimizer, and
you have a back end. The front end takes in your source files,
tokenizes them, parses them, and builds an abstract syntax tree.
The middle is the optimizer; it can do performance optimizations
to make your code faster. At some point they called this middle the middle end,
and I'm glad that term didn't catch on. And then you have the back end,
which generates the actual assembly.
Now the beauty of this approach is you don't need to build a new compiler
for every machine architecture. You can just build a new backend.
So this will all relate to Docker images,
trust me. But the trick here is to get all the back ends
speaking the same language. For that you need an intermediate
representation. You need a language that's not C, but isn't assembly
either; you need something in between. With LLVM
it's called LLVM IR, the intermediate representation.
All the back ends then need to do is translate from that intermediate representation
to the specific machine architecture they target. Once you have that, you can
have multiple front ends as well. So this is modern LLVM.
This is sort of what it looks like. It started with the Clang frontend.
Now it has many front ends, right? There's a Rust front end, there's a Go
one, there's Julia. There are all kinds of different front ends.
There's been sort of a Cambrian explosion of ahead-of-time compiled programming languages
in the past ten years, and it's a result of how easy it is to add
a front end because of this architecture that LLVM has. The reason
is that the IR is just a common interface between the various layers;
it becomes sort of like a protocol. On the back end we
have not just x86 and ARM and RISC-V and PowerPC,
but we also have a WebAssembly back end, which means you could take your Julia
or your Rust and compile it to run in your web browser.
There are even GPU back ends, where you could compile your code to run
on pixel shaders. And you can build a front end as well, right? If
you can create a front end that translates to this LLVM IR,
you have a new programming language. Everything else, all the support for all
these various back ends, is taken care of. So this is super cool.
And the secret is just this intermediate representation: the front ends need to emit it
and the back ends need to consume it. So bringing it back to container images,
BuildKit works the same way. It has something called LLB,
its low-level builder intermediate representation, because containers
need to be able to run on lots of machine architectures as well. And once
you have the IR for the varying back ends, you can start having varying
front ends as well. Right now there aren't too many front ends;
probably the most commonly used are just the various versions of the Dockerfile syntax.
But as I mentioned, there's OpenFaaS,
there are buildpacks, there are some other things. It doesn't have to be that way, though.
So just to explain how it works, LLVM IR looks like this on the left.
It's sort of a verbose and explicit programming language.
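If you want to see LLVM IR for yourself, Clang will emit it for any C file. A minimal sketch; the file names are mine.

```shell
# Write a tiny C file so the example is self-contained.
cat > ir-demo.c <<'EOF'
int add(int a, int b) { return a + b; }
EOF

# -S -emit-llvm writes textual LLVM IR (a .ll file)
# instead of native assembly.
command -v clang >/dev/null 2>&1 && clang -S -emit-llvm ir-demo.c -o ir-demo.ll || true
```

The resulting ir-demo.ll is the same IR that every LLVM back end consumes, whatever front end produced it.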
LLB is much simpler, and it looks quite different. With LLVM,
the IR is basically a small programming language; with LLB,
it's sort of a language for creating these self-contained
cloud native packages. So let me show you how that works. Let's start with
building a simple Dockerfile, and then we'll go from there.
So this is the one we showed earlier: FROM alpine, COPY the readme,
then echo something into a file. I
can build it just like this;
I'm going to tag it test, build the file, and then we
can run it. We want it to be interactive,
we want to run test, and we're going to open a shell.
So now we're inside of it, and we can see the file
this build produced: "standard docker build". Exit.
So let's now take this file and build it with BuildKit directly.
I'm on a Mac, so to do that I need to do brew
install buildkit. I already have it installed,
but if you're on a Mac this is what you would need to do, and
it gives you this new command called buildctl.
You can kind of read through what it does there, but it lets you build
things with BuildKit. When it builds things, it needs to communicate with the backend:
it makes gRPC requests against buildkitd.
So we actually need to start that up. Let me
make sure I don't already have it running. Okay, I don't.
So this is how I would start the backend
part of BuildKit. It just runs as a container; the image is called
moby/buildkit, and I'm just going to name the container buildkitd.
You can see that here. Then I'm just
going to tell buildctl where it is by doing that.
So now it knows it is running as a Docker container, and it has
this name. Actually, that is incorrect.
It should say buildkitd, right?
So I actually want that I believe.
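The setup just described looks roughly like this. The container name buildkitd is my choice, and BUILDKIT_HOST is how buildctl is told where the daemon lives.

```shell
# Start the BuildKit daemon as a container (run where Docker is available):
#   docker run -d --name buildkitd --privileged moby/buildkit:latest

# Point buildctl at that container over the docker-container:// transport.
export BUILDKIT_HOST=docker-container://buildkitd

# Persist the setting so later shells can source it as well.
echo "export BUILDKIT_HOST=docker-container://buildkitd" > buildkit-env.sh
```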
Let's see how it goes. So once we have that running,
we can try to build this using BuildKit directly.
Using buildctl directly is a little bit more verbose.
So let me just clear this. You can use it like this: I'm going to use buildctl
and call build. I can specify which frontend I want
to use; I'm going to use the dockerfile.v0 syntax.
I specify the context, which is where I can copy in files from for my
COPY command, and I specify my Dockerfile, which is in my present working
directory. And then here we're giving it some output options: we want to output
as an image. This is specifying the back end that we showed earlier. There are a
couple of different back ends; I could just make a tar file or whatever. So here's the
frontend, here's the back end. I'm going to give it a name
and tell it to push it; I'm going to push it to Docker Hub
as agbell/test. So let's run that.
So we pushed it there. Let's test
it by pulling it.
Actually, you know what we should do is change this here:
let's make it say "buildkit built".
Let's run this again so that we can differentiate,
and then we'll pull it, and then we can run it
with docker run.
It was called agbell/test. Then we want to open
a shell, and if we look at our
build.txt: there we go, "buildkit built".
So we just built this image using BuildKit directly.
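Put together, the buildctl invocation I just walked through looks something like this; the image name agbell/test follows the demo, so adjust it for your own registry.

```shell
# Save the full command as a script so the pieces are easy to see.
cat > buildctl-build.sh <<'EOF'
#!/bin/sh
# Front end: the dockerfile.v0 syntax.
# Back end / output: an image, pushed to a registry.
buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=docker.io/agbell/test,push=true
EOF
chmod +x buildctl-build.sh
```

Run the script where buildctl and a running buildkitd are available; the two --local flags are the build context and the directory holding the Dockerfile.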
So now we're getting somewhere. Let's do something a little bit more advanced.
If you remember, the way BuildKit works is that you have your front
end, and it sends LLB to the BuildKit daemon,
which sends it on to a back end.
So let's inject ourselves here:
let's programmatically build a front end. To do
that, all we need to do is build up this LLB and then send
it. LLB is specified as a protobuf,
so you just have to make a gRPC request saying, here's my
protobuf. So I put together an example here.
You could do this in any language that has support
for protocol buffers, but what I've done is use Go,
and that's because BuildKit itself is written in Go and has a nice client
library that I can just pull in. That gives me an easy way
to put these operations together, see the various pieces,
and even dig in and see how it works if I'd like. So this
is just my translation of our original Dockerfile.
We're going to come from an image, docker.io/library/alpine;
you notice it's a little bit more explicit. I need to copy
in a file from my context, which I specify:
I copy my readme and name it readme there. Then I'm
going to do a run. The run, you can see, is a little bit more
explicit too. Back in the Dockerfile we just had an echo; here we have
to wrap it in a shell. That's because commands like echo
and redirecting to a file are shell features, so we need to run
them inside a shell. Okay. So this will generate all of
our LLB. And up here in our main program,
all we're doing is writing it to standard out; that's
because we're going to use buildctl to make this happen.
But first let's just take a look at what we get.
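My translation looks something like the following; I'll write it out as a file here so you can see the whole thing. This is a sketch assuming the github.com/moby/buildkit/client/llb package; the file name, image name, and echoed string are mine.

```shell
# A Go program that builds the LLB for our Dockerfile and writes it to stdout.
cat > main.go <<'EOF'
package main

import (
	"context"
	"os"

	"github.com/moby/buildkit/client/llb"
)

func main() {
	// FROM alpine -- note the fully explicit image reference.
	st := llb.Image("docker.io/library/alpine").
		// COPY README.md README.md, pulling from the local context named "context".
		File(llb.Copy(llb.Local("context"), "README.md", "README.md")).
		// RUN -- echo and redirection are shell features, so wrap them in a shell.
		Run(llb.Shlex(`/bin/sh -c "echo programmatically built > /build.txt"`)).
		Root()

	// Marshal the state into the LLB protobuf definition and emit it.
	def, err := st.Marshal(context.TODO(), llb.LinuxAmd64)
	if err != nil {
		panic(err)
	}
	llb.WriteTo(def, os.Stdout)
}
EOF
```

Building this requires a Go module with moby/buildkit as a dependency; the chained State methods mirror the Dockerfile statements one for one.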
Yeah. So we can just run this,
and we get all of the
protocol buffer contents on standard out. It's
a binary format, so not super readable.
What we can do is take that output and send it to buildctl, which
has a debug command called dump-llb, and then we'll pipe it through jq.
And we get something more like formatted JSON.
So now we can kind of see the raw commands that
we are sending through. This, you can see, is our run
command. Here's our copy, with source
and destination, and it should have our context. Yeah,
here's the local context that the copy will pull from. Here's our
from. So that's kind of what
the raw LLB, or I guess the formatted LLB, looks
like. So now let's try to build this.
So I'm going to do this.
No, we haven't built it yet; let's start with building it.
So we'll clear.
Same idea: we're going to pipe it to buildctl and
pass it this local context, but now we're telling it to output
an image, call it agbell/test, and
push it. If you wanted to build a more robust
solution, you might want this application to actually find
out whether buildkitd was running,
maybe start it up if it wasn't, maybe have some error
checking and whatever. But this is a nice way
to test it out; we just rely on the buildctl tool to
handle all of that for us. So this should be building our program,
and we can see it running right there. Then
we can pull it down just to make sure
that it actually pushed. And then same idea:
we can docker run agbell/test
and open a shell.
And then inside the shell, what does our file say?
There we go: "programmatically built".
So that image will run anywhere where you have a container runtime.
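The full pipeline for this run, reconstructed as a script; again, agbell/test is the demo's image name, so swap in your own.

```shell
cat > llb-build.sh <<'EOF'
#!/bin/sh
# Inspect the generated LLB as formatted JSON:
go run . | buildctl debug dump-llb | jq .

# Or build and push it: with no --frontend flag,
# buildctl build reads raw LLB on stdin.
go run . | buildctl build \
  --local context=. \
  --output type=image,name=docker.io/agbell/test,push=true
EOF
chmod +x llb-build.sh
```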
So I think this is cool, because by popping
into a programming language here rather than using a Dockerfile,
if you need to do something very complex, you have all
the control flow and structures and libraries
and things that a programming language brings you, right? You can
raise the abstraction up. You could use it to get rid of duplication.
I think it could be an interesting solution depending upon your use case.
But front ends are actually
usually more compiler-like than that, right? Instead of
just sending LLB, a front end usually
takes in source code, which for us would be the Dockerfile,
turns it into tokens,
parses those into an AST,
and then generates the LLB
and sends it on. So let's try something a
little bit more like that. That might take a little while to do live,
so I put together a little example front end ahead of time.
Did I share this already? If you
look in this repo, you'll find it in the ICfile
subfolder. So this is an ICfile. This is my own front end
that I put together for Docker. Docker has this
syntax directive: when you call docker build, if you specify
syntax and then the name of a publicly accessible
frontend container, it will pull it down and
use it as the front end for building things. It expects
it to take the source over gRPC,
tokenize it and then parse it,
generate the LLB, and then send
that on to BuildKit. That is pretty cool, because
you don't even have to have people install anything; Docker will
pull it down if you just specify this. So I built
my own front end, which I'm calling an ICfile.
It's more of a proof of concept, but there exists this language called
INTERCAL, which was kind of built as a joke, and I thought
it would be humorous to build a front end for Docker
that looks somewhat like INTERCAL. It mainly
does all the same things as a Dockerfile, it just uses different words.
So instead of FROM we have COME FROM; instead of COPY
you use STASH, which kind of stashes the files inside of your container;
instead of RUN, you use PLEASE, which is just more
polite. INTERCAL was known
for its PLEASE statements. In INTERCAL you could either say DO
or PLEASE DO, and if you didn't say PLEASE enough times, the compiler
would say, sorry, your program is not polite enough, I won't compile it.
I think if you put in too many PLEASE statements it would also say your program
is too polite. Anyways, it's just a proof of concept, just for fun.
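An ICfile along these lines would look something like this. The keywords come from the talk; the syntax image name agbell/ic-file and the echoed string are placeholders for wherever the front end is actually published.

```shell
# The INTERCAL-flavored equivalent of our Dockerfile.
cat > ICfile <<'EOF'
# syntax = agbell/ic-file
COME FROM alpine
STASH README.md README.md
PLEASE echo "custom built front end" > build.txt
EOF

# Plain docker build picks up the custom front end
# from the syntax directive on the first line:
#   docker build -f ICfile -t ic1 .
```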
But now that we are building a complete
front end and putting it in a Docker container, we can actually build
without using buildctl directly.
So let me show you what that might look like.
So first let's change into my ICfile folder,
and then it's just a standard docker build, but we
need to specify the file explicitly, because it's not used to having
something called an ICfile. Then we'll tag it ic1, so we can
build that with just standard docker build,
no buildctl required. And then
we can do the same thing, we'll pull it, what did I just
call it? Oh, I guess I didn't push it, so I think we can just
run it: we want interactive,
we want to run ic1, and same thing, let's get a shell.
And here we are inside of it; we can take a look, and if
we cat build.txt, we can see "custom
built front end". That is our image. So we have
just built our own syntactic front end for Dockerfiles.
I don't recommend you actually use this, but I think it is a nice proof
of concept. So if you wanted to build your own syntax for
a Dockerfile front end, this project could be a good place to start. What I did
is take one of the existing front ends for Docker and
yank it out. So you'll find in here all the parsers for
parsing the lines. You'll find the conversion steps for converting
from the Dockerfile syntax to LLB. You will find
the various commands for starting it up. But the main thing that I
did was look at the mapping from terminology to
features and replace the existing commands with
the INTERCAL-style keywords: STASH
for COPY, PLEASE for
RUN, and HEALTHCHECK is ARE YOU OKAY?
Anyways, it's a fun little example that just shows some of the things that you
might be able to do with these tools. And that concludes
the demos. So I think this is pretty neat, but you might be thinking:
who cares? We can create our own back ends and front ends for containerization,
but why would we? These examples are just proofs of concept, but you can do
real things with this. One thing you could do is build your own AWS
Lambda clone, right? Use the programmatic functionality that I was
just showing, where you build a service that
takes in JavaScript, maybe just a POST request
with some JavaScript, and then programmatically builds a Docker container with
that in it and ships it off to Kubernetes. That would actually not be that
hard; as a proof of concept, you could probably do it in a couple
hundred lines of code. You could also build a specific image format
for your own needs. One example is called a Mockerfile. The creator noted
that Dockerfiles in his organization were mainly just a list of a
whole bunch of things you need to install, like apt-get commands.
So you see these commands in Dockerfiles. One thing you can do
is put one per line, but then, just because of the way that Docker images
get built, you end up with a whole bunch of layers. So the thing that
happens more commonly is you build one giant line that has all
these installs on it, and that's kind of a mess, right? So with a Mockerfile,
which is specified in YAML, you just provide a list of packages, and the front end
will install each of those packages by issuing the corresponding LLB requests.
The Mockerfile being YAML, well, it depends: sometimes I like YAML,
sometimes I don't. But I think it really works for this use case. It nicely
constrains the problem down to exactly what needs to be done for
many Docker image creation tasks. The other
thing you could do with BuildKit, and this is probably my big message,
is: I don't know, let's figure it out. We can't know what we can create
until we do it. When LLVM was created, Rust didn't exist, Swift didn't exist,
and Julia didn't exist. There was no pixel shader back end or
WebAssembly back end, but these things were all enabled by LLVM existing.
And now we have this tool for creating cloud native workflows.
What should we do with it? I work on this project called Earthly;
we use some of these BuildKit features for doing pretty cool
reproducible builds, like CI pipelines.
But I want to see more projects using it, building on these foundations. On
a historical scale, I think we're really early in cloud computing,
so we get to kind of decide what the future looks like.
So: your name here, your project here. I mean, what can you build
with this? And one other thing: this kind of three-stage
compiler design is super useful.
Compiler problems are hidden everywhere, and when you spot one,
there's like a whole literature of how to solve compiler problems.
So if you recognize something you're facing is actually a compiler problem,
all of a sudden there are all kinds of tools you can use to solve
it. There's more stuff about BuildKit that I didn't get
a chance to cover in the demo. It's more than just
an old-fashioned compiler. I've shown you how you can add front ends for it,
but you might have noticed you can do that without requiring anybody to install anything;
you just have to reference the container it's in, because BuildKit, unsurprisingly,
uses Docker containers for the different parts of its structure. But that's not
all. A traditional compiler is usually a pure function,
where it takes in all the source, does some stuff, and produces an
output. But BuildKit does more than that. It has workers, which power
a feature called docker buildx, which lets you build a single Docker
image for many machine architectures at once; you can run multiple architecture
back ends at the same time. And it also has really advanced
caching. It's very cache efficient because it expects things
to be called repeatedly, so it can keep track of what steps can
be skipped and replay them. Think of it kind of like an incremental compiler.
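The multi-architecture feature is exposed through docker buildx; a one-liner like this builds one tag for several platforms at once. The image name is illustrative.

```shell
cat > multiarch.sh <<'EOF'
#!/bin/sh
# Build one tag for two architectures in a single invocation;
# BuildKit workers handle each platform's back end.
docker buildx build --platform linux/amd64,linux/arm64 -t agbell/test --push .
EOF
chmod +x multiarch.sh
```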
I worked in Scala for a long time. The compiler was super slow from a
cold start, but when you were incrementally building things, it was super fast.
So BuildKit is kind of built from the ground up to be aggressively incremental.
Another thing is, it's concurrent: anytime you build something, BuildKit can
look at the structure of what you're building, determine which steps depend on
others, and run them in parallel where it can. And the coolest
feature is probably that it can be distributed. Because
these layers talk gRPC to each other instead of just a
C API like a standard compiler might use, it's trivial
to have a cluster of these things. You could have a
whole bunch of workers, and they could be running on different machine architectures,
or they could all be in the cloud. So you can distribute your workload
out that way. So I think that is super
neat. So I am Adam Gordon Bell. This was my talk.
You can find me basically anywhere by searching for that name. I'll share
the links to the source for the examples. Build something cool using BuildKit
and let me know about it. If you're looking for inspiration, you can take a
look at Earthly; you can find it on GitHub or at earthly.dev. Thank you
so much.