Transcript
Hello everyone. How are you doing? I hope you're enjoying Conf42 Python
and all its good content. My name is Gaspar and I work on
the CI/CD of an autonomous driving project, and today I'm going
to talk about reproducible builds with Python and Bazel.
If you run two builds with the same source code and the same
commit, but on two different machines, do you expect
to get the same result? Well, in most cases you
will not. Today we will go through the sources of nondeterminism
in most build processes, and we will look at how Bazel
can be used to create reproducible, hermetic
builds. Then we will apply these concepts
to create a reproducible Python environment and a
Flask application that can be built with Bazel, so that
the Python interpreter and all dependencies are hermetic.
According to the Reproducible Builds project, a build
is reproducible if, given the same source code, build environment
and build instructions, any party can recreate bit-by-bit
identical copies of all specified artifacts.
This means that to achieve a reproducible build,
you must remove all sources of nondeterminism.
Although this can be difficult, there are several benefits.
Reproducible code is more secure and reduces
the attack surface. You can determine the origin
of a binary artifact, like what sources it was built
from, and it can drastically speed up the build time
thanks to caching of intermediate build artifacts in large
build graphs. This is not trivial for big projects, and if your build is
reproducible, you can safely reuse cached results across machines. To obtain a
reproducible build, you need to tackle sources of nondeterminism.
One of the most common causes of nondeterminism is the inputs to
the build. By that, I mean everything that is not
source code: compilers, build tools, third-party
libraries, and any other inputs that might influence the
build. All references must be unambiguous for
your build to be hermetic, either as fully resolved version
numbers or hashes. When you get to such a situation,
you can say you have a hermetic build: your build
is insensitive to the libraries and other software installed
on the build machine. To be hermetic, you can start by checking in
all the information needed by your build as part of the source code.
Hermetic builds also enable cherry-picking. Let's
say you want to fix a bug in an older release that's running in production.
If you have a hermetic build process, you can check out the old revision,
fix the bug, and then rebuild the code. Thanks to
hermeticity, all the build tools are versioned in the source
code repository, so a project built two months
ago will not use today's version of the compiler,
because it could be incompatible with the two-month-old source
code. This is very important. So you might now be thinking: why
be so strict with my build? Well, it may sound painful,
but knowing what depends on what will pay off in the long term,
and we will see that later on. Internal randomness is an
issue you have to tackle before you can achieve a reproducible build,
which can be a sneaky thing to fix. There are many sources
of internal randomness, but timestamps are a common
one. They are often used to keep track of when the
build was done. Get rid of them. With reproducible builds,
timestamps are irrelevant since you are already tracking your
build environment with source control. For the languages that
don't initialize values, you need to do it explicitly.
Avoid randomness in your build due to capturing random
bytes from memory; there's no easy way around it.
You must inspect your code. All this may sound a
bit overwhelming, I know, but it's actually not as complex as it sounds.
Bazel makes this process much easier.
Bazel is a fast, scalable, multi language and extensible
build system. As stated on the official Google website,
Bazel can help you to achieve a reproducible build,
providing off-the-shelf support for hermeticity.
One of the key concepts behind Bazel is sandboxing.
Bazel's file system sandbox runs processes in
a working directory that only contains known inputs,
so that compilers and other tools cannot
even see source files they should not access.
This means that you must specify all the inputs or your
build will fail. As a consequence of hermeticity,
Bazel allows you to encapsulate your build targets,
meaning that you can hide the internals and
be sure that no one can implicitly depend on your target.
Another great feature of Bazel is its caching system,
which can make your repeated build 50 times
faster. There are a few key concepts for Bazel that we
need to cover before jumping to the code. The directory that
contains the source files of the project is called the workspace,
and it must contain a text file called WORKSPACE as
well. The WORKSPACE file is where you define all the references
to the external dependencies required by the build. Here,
external dependencies can be anything: external libraries,
Git repositories, Bazel rules, or anything else you
may require in your build. A Bazel rule specifies the relationship
between inputs, outputs, and the steps needed
to build the outputs, and it is specific to the programming language you
use. In our case, we will use Python rules to
tell Bazel how to create an executable Python program
starting from some PY files. The code
is organized in packages and each package is
a collection of targets. A package is defined as a directory
containing a file named build build files describe
how source code can be built. Basically, when you want
to build your code, you can specify the package and
which target you want to build, like in the example here.
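For instance, with a hypothetical package and target name, such an invocation would look roughly like this:

    bazel build //my_package:my_target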
As I mentioned before, today we will apply the concept of reproducible builds
to Python. We will create a reproducible local environment using
Python 3.8.3, which we will build from scratch. We will write a test to make
sure we are using the right Python binary to build our code.
We'll be able to reuse the local environment as the foundation
to develop your next Python project. Later on we will create a
Flask application, and this will allow us to understand how
to manage dependencies in Python in a hermetic way.
So let's jump to the code. So this is what our
workspace looks like. You need to assign a name to it.
Here we just call it my_flask_app. We then define
a new variable, compile-Python-based-on-OS,
which holds the command we need to execute to
compile Python from scratch. We will use it later on.
Note that here we need to make a distinction when we are running this example
on macOS.
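As a rough sketch (the variable name, paths and build flags here are illustrative, not the exact code from the talk), the top of the WORKSPACE file could look like this:

    workspace(name = "my_flask_app")

    # Shell snippet that builds CPython from source inside the external
    # repository. The macOS branch points the build at Homebrew's OpenSSL.
    _COMPILE_PYTHON_BASED_ON_OS = """
    if [[ "$OSTYPE" == "darwin"* ]]; then
        ./configure --prefix=$(pwd)/bazel_install --with-openssl=$(brew --prefix openssl)
    else
        ./configure --prefix=$(pwd)/bazel_install
    fi
    make -j && make install
    """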
So here we can see our first Bazel rule. This is http_archive,
which is a basic rule that allows us to download a compressed archive
file, decompress it, and use it. In our project
we use http_archive to fetch
Python and build it from scratch. With this we
can be sure to have control over the Python binary and version.
Remember, you don't want to use the Python version installed
on the OS machine, or your build will not be reproducible.
The hermeticity here is ensured by the urls
field, which tells Bazel where to find the dependency, and
the sha256 field, which is the unique identifier
for it. Every build will use the same unambiguous Python version.
Another important field is patch_cmds, which we use to
define a sequence of Bash commands to execute. We use it
to run the build command for Python, using the
compile-Python-based-on-OS variable that we defined earlier.
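A sketch of that rule, with a placeholder where the real archive hash goes and an assumed repository name of python_interpreter, could look roughly like this:

    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "python_interpreter",
        urls = ["https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tar.xz"],
        sha256 = "<sha256 of the Python 3.8.3 source archive>",
        strip_prefix = "Python-3.8.3",
        # Build CPython right after extraction, using the command defined above.
        patch_cmds = [
            "mkdir $(pwd)/bazel_install",
            _COMPILE_PYTHON_BASED_ON_OS,
        ],
        # Expose the compiled interpreter and its files to the rest of the build.
        build_file_content = """
    exports_files(["bazel_install/bin/python3"])

    filegroup(
        name = "files",
        srcs = glob(["bazel_install/**"], exclude = ["**/* *"]),
        visibility = ["//visibility:public"],
    )
    """,
    )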
Once we run this http_archive rule,
it will fetch the pinned version
of Python and build it, and we will have our
3.8.3 version of Python to use in the next steps. Next,
we need the Python Bazel rules to create the build and
test targets. Since those rules don't come with Bazel,
we need to fetch them using http_archive like
this.
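That fetch is just another http_archive; the release version and hash below are placeholders for whichever rules_python release you pin:

    http_archive(
        name = "rules_python",
        urls = ["https://github.com/bazelbuild/rules_python/releases/download/<version>/rules_python-<version>.tar.gz"],
        sha256 = "<sha256 of the rules_python release archive>",
    )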
And here again we use the SHA-256 as the identifier for the version of the rules. So, we said that we want
to compile code written in Python using the Python binary we defined
early on. To do that, we need to define a new Bazel
toolchain. Bazel toolchains are defined in BUILD
files. Here, py_runtime is used from
the Python rules that we fetched before: we define a Python
3 runtime using the Python interpreter we built before,
and then we use py runtime pair,
which wraps up to two Python runtimes, one for Python three and
one for Python two. Since we only want to support the three eight
three version, we don't define any Py two runtime,
then use the pyramtime pair to define our toolchain,
the Py three toolchain that we can use in our project.
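Putting that together, a BUILD file at the workspace root might define the toolchain roughly like this (the @python_interpreter labels match the repository name assumed above; recent rules_python versions also export py_runtime_pair directly):

    load("@rules_python//python:defs.bzl", "py_runtime")
    load("@bazel_tools//tools/python:toolchain.bzl", "py_runtime_pair")

    py_runtime(
        name = "py3_runtime",
        files = ["@python_interpreter//:files"],
        interpreter = "@python_interpreter//:bazel_install/bin/python3",
        python_version = "PY3",
    )

    # No Python 2 runtime: we only support the hermetic 3.8.3 interpreter.
    py_runtime_pair(
        name = "py_runtime_pair",
        py2_runtime = None,
        py3_runtime = ":py3_runtime",
    )

    toolchain(
        name = "py_toolchain",
        toolchain = ":py_runtime_pair",
        toolchain_type = "@bazel_tools//tools/python:toolchain_type",
    )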
But to use toolchains in Bazel, you need to register them,
and you do that at the end of the WORKSPACE file with
this line. Remember, the registered toolchains
must always be at the end of the WORKSPACE file.
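Assuming the toolchain target above lives in the root BUILD file, that line is simply:

    register_toolchains("//:py_toolchain")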
Nice. You now have a hermetic Bazel build environment set up, but don't
just take my word for it. Let's write a test. For writing
tests in Python, we will need pytest. So let's add
a requirements.txt file like this,
and along with pytest we need all of its child dependencies.
This is a normal requirements file like the ones you use daily
in Python, but since we want to be hermetic, we need to pin
the versions and hashes as identifiers for hermeticity.
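For example, a pinned entry for pytest (with a placeholder hash, and one entry per child dependency in the same style) looks like this:

    pytest==5.4.1 \
        --hash=sha256:<hash of the pytest 5.4.1 wheel>
    # ...plus each child dependency (attrs, packaging, pluggy, py, ...),
    # pinned to an exact version with its own --hash lines.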
This means that when Bazel tries to build the test, it will
look for the exact version of the dependency we want to use.
In this example, if Bazel cannot find a library called pytest
with version 5.4.1 and this exact hash,
the build will fail due to a missing dependency. Okay,
now we can modify the WORKSPACE again. We add pip_install.
pip_install is a rule for Pip dependencies: it allows
importing pip dependencies from a requirements.txt file,
but by default it uses the Python interpreter that
is on the OS machine. We can override
this behavior by passing as the Python interpreter
target the interpreter that we just built from scratch before.
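In the WORKSPACE, that looks roughly like this (py_deps is an arbitrary name chosen here for the generated repository):

    load("@rules_python//python:pip.bzl", "pip_install")

    pip_install(
        name = "py_deps",
        requirements = "//:requirements.txt",
        # Resolve and install wheels with the hermetic interpreter we built,
        # not whatever Python happens to be on the host machine.
        python_interpreter_target = "@python_interpreter//:bazel_install/bin/python3",
    )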
Cool. Now everything is ready to write the test. Let's create a new
folder called test and a file called compiler_version_test.py.
This is a very simple test that will check that the Python
executable is present and that the version is correct.
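A minimal sketch of such a test, assuming the external repository for the interpreter is named python_interpreter as above, could be:

    # test/compiler_version_test.py
    import sys

    import pytest


    def test_python_version():
        # The interpreter running this test should be the one Bazel built,
        # not the system Python.
        assert "python_interpreter" in sys.executable
        assert sys.version_info[:3] == (3, 8, 3)


    if __name__ == "__main__":
        # Let the py_test entry point drive pytest on this file.
        sys.exit(pytest.main([__file__]))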
To include the test in the build process, we add a BUILD file
to the test folder. Here we define
a py_test target. py_test is just a
way to tell Bazel that we want to create a test that uses Python.
We need to specify a name for it, we use compiler_version_test,
and the source files needed to compile and execute the test;
in this case it's just compiler_version_test.py. We also
need to define the dependencies that are needed for the test.
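A sketch of that test/BUILD file, using the py_deps repository created by pip_install above:

    load("@rules_python//python:defs.bzl", "py_test")
    load("@py_deps//:requirements.bzl", "requirement")

    py_test(
        name = "compiler_version_test",
        srcs = ["compiler_version_test.py"],
        deps = [
            requirement("pytest"),
        ],
    )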
We load dependencies using the requirement function, which maps
a pip package name to a label and avoids hardcoding
dependency labels into the BUILD file. Dependency management
in Bazel is very straightforward: you don't need to create any virtual environment,
you don't need to run any pip install or anything like
that. Just list the dependencies under the deps field and you're
done. Note that up to this point everything
is explicit, so this will ensure reproducibility of
the build. Okay, we can now run our first Bazelized Python test.
So from the project root, this
is the way we use Bazel to run a test,
and we need to specify which test we want:
the package is test and the target is called
compiler_version_test.
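With the target names used above, the invocation would look like:

    bazel test //test:compiler_version_test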
Okay, so this runs the test.
So as you can see, the test is passing. This means that we are using
the right version of the Python executable.
You can notice here that it says cached. This is because I executed
this test before, so now it doesn't execute it again.
Since I didn't modify anything in the test, it's just reusing
the cached result. Up to this point
we went through the foundations of a Bazel environment using Python,
but let's see something more complex and close to a real use
case. We can create a new folder called src with a new file,
flask_app.py, and this is a simple Flask
application that will show the binary path and the
Python version of the OS machine along with the one used by Bazel.
We can then check that the two paths are different.
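A sketch of src/flask_app.py along those lines (the exact wording of the page is illustrative):

    # src/flask_app.py
    import shutil
    import sys

    from flask import Flask

    app = Flask(__name__)


    @app.route("/")
    def index():
        # Interpreter used by Bazel vs. the one installed on the OS machine:
        # the two paths should be different.
        bazel_python = sys.executable
        os_python = shutil.which("python3")
        return (
            f"Bazel Python: {bazel_python} ({sys.version})<br>"
            f"OS Python: {os_python}"
        )


    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=5000)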
To build it, we need a BUILD file. So let's add a BUILD
file under the src folder, and this time
we are creating a binary, so we use py_binary. Again,
we need to specify a name, flask_app, and the sources that are needed,
flask_app.py, and we load the dependencies here.
We need to extend the requirements.txt with the Flask dependency and all
of its child dependencies as well, and just load them using the
requirement function again.
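The src/BUILD file then mirrors the test one, just with py_binary:

    load("@rules_python//python:defs.bzl", "py_binary")
    load("@py_deps//:requirements.bzl", "requirement")

    py_binary(
        name = "flask_app",
        srcs = ["flask_app.py"],
        deps = [
            requirement("flask"),
        ],
    )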
Okay, so now we can run the application. This time we use bazel run;
bazel run first compiles and then executes the application. So let's do it.
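With the package and target defined above, that's roughly:

    bazel run //src:flask_app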
Okay, so this is compiling and executing the application.
And as you can see, it runs, and it's listening
on localhost. So if we open the browser and
navigate to localhost,
we can see, as expected, that Bazel is using Python version
3.8.3, which we compiled from scratch, and
not Python 3.8.5, which is the one I have
on my OS machine. Are we sure that the build is reproducible?
We can do a quick test. We run the build two times and check
the output binaries for any differences by comparing the MD5 hashes.
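Sketched with the output path of the py_binary defined above (your exact paths may differ), the check is roughly:

    md5sum bazel-bin/src/flask_app    # hash of the freshly built binary
    bazel clean --expunge             # wipe build artifacts and external dependencies
    bazel build //src:flask_app       # rebuild from scratch
    md5sum bazel-bin/src/flask_app    # the new hash should be identical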
Here we computed the hash of the binary that we just built, cleaned all the
build artifacts and dependencies with bazel clean,
and then ran the build again. The new binary is identical to the old
one. So we have a reproducible build, right?
Well, actually, it's not fully reproducible, and let me show you
why. If we go back to the WORKSPACE file, we are
trying to build Python inside Bazel to achieve full reproducibility.
However, using http_archive's patch_cmds means
that Python is built using the compiler of the OS machine that
runs the build. The Python interpreter, which is pinned to a precise
version, will depend on the machine's GCC and system libraries,
which are not pinned or controlled in any way. In other words,
the build is not fully reproducible,
but there are solutions for that. You can run the Bazel build from a
Docker container with a pinned GCC version and then check
in the Docker information within your project; this is a common approach
in CI systems. Instead of compiling Python
from scratch, you can use a precompiled binary executable,
check it into source control, and pin it in the build.
Or you can take a different approach and use a tool like Nix,
which allows importing external dependencies, like system packages,
hermetically, and you can find a link in
the presentation. To summarize the biggest takeaways:
from now on, don't take for granted that your build is reproducible,
since most probably it is not. Hermeticity enables
cherry-picking and can save you from uncomfortable situations.
Inputs to the build must be versioned with the source code, or
you will not have any control over them. Internal randomness
can be sneaky, but it must be removed. You now
have a working Python environment that is hermetic thanks to Bazel,
and that you can reuse for your next Python project.
You have seen how to compile a Flask application in a reproducible
way, and how to manage dependencies automatically. At the following
link you will find the code I presented today. Feel free
to reuse it. Thank you very much. If you want to connect here,
you can find my contacts. Reach out and let me know your thoughts. Enjoy the
rest of the conference and talk to you soon.