Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you very much for joining me. My name
is Marco Nicola. I'm yet another software developer.
I've been making software for more than 20 years. My main
focus has mostly been on machine learning and specifically natural
language processing applications. In more
recent years I've also tried to expand my skill set by working on
full stack web applications, and also a bit of software as
a service and cloud applications. I'm currently
employed at Exop, a German company whose main business
is mobility risk management. If you want to
get in touch with me, you can find me on GitHub,
or Twitter if you prefer, or LinkedIn. The references
are there on screen. In this presentation,
I'm going to show you how you can effectively deserialize Python objects
in Go with the help of a little library called GoPickle.
As you can see, this is going to be a sort of cross-language talk.
We will start by analyzing the Python pickle serialization
module. We'll see exactly what it is, how it works, and why it is
interesting. We'll have a quick look at the pickle serialization
format, and finally we'll reach our beloved
Go programming language and see how we can effectively
and easily read pickle-formatted data from Go without even
the need to run Python in the first place. First of all,
pickle is a Python built-in
module. In the Python programming language, modules
are something similar to Go packages. The pickle
module in particular implements binary protocols
for serializing and deserializing Python objects. With pickle,
it is all about data serialization and persistence.
Imagine you have a Python script which builds some
data structures. Maybe you have an object, and with the pickle
module you can serialize it to a file, for example
with the pickle.dump function. This process is also called
data pickling. You'll then have a binary representation
of your original data, and later on you can read
the data back from this file with a function called pickle.load.
This deserialization process is also called
data unpickling. In this context, I think it's
interesting to talk about the pickle module,
especially because, at least according to my own Python
programming experience, the pickle module seems to
be a very popular choice for data serialization in Python,
and it seems to be especially popular when
the appearance or format of the actual serialized data
is not a big concern. The popularity
of this choice also seems to be reflected by a high
number of particularly prominent Python
projects and libraries that you can find around. Just
to name a few, perhaps you've already heard about NumPy, a Python library
for scientific computation. Maybe you've heard about
PyTorch, a machine learning framework for Python, or
pandas, a library for data analysis and
statistics. These libraries, and many others as well,
provide high-level functions for saving and loading your custom
data, and under the hood, either by default
or through some option you can choose, they seem to
make use of the Python pickle module to actually
achieve data persistence. Now you might be wondering
why in the first place it is interesting for Python programmers
to use this weird and exotic pickle
module over more popular and traditional data representation
formats such as JSON, YAML, or XML.
Let's see this with a couple of simple examples.
Let's start with a very straightforward Python data structure.
In this case, we have a dict. Dicts in Python are
similar to Go maps. There are a bunch of keys and
values. The values have many different data types: there are
strings, there are numbers, and there is an array which
also contains a mix of data types, a number and a string.
It turns out to be straightforward, and it works
out of the box, to serialize this data to the JSON
format. The JSON representation even looks almost identical
to the original Python code. But then
let's see what happens if, instead of using simple built-in
data types, we define our own types and
classes. Let's take a moment to familiarize ourselves a
little bit with this Greeter class, since it will appear again
in later examples. Let's define in Python
this class called Greeter. It has a constructor,
the __init__ function, which accepts a name
string argument, and this name value is
saved by the constructor in an internal instance
variable, _name. Then let's add
a simple greet method to this class. All it
does is print to standard output the string
"Hi" followed by the name interpolated
from the _name instance variable. That should be
simple enough even for inexperienced Python programmers,
I hope. And of course you can instantiate an object.
You can create a new instance with Greeter and parentheses,
passing our name, let's say "gopher".
And sure enough, if you call object.greet(),
you will get on your console the message "Hi gopher".
So far everything is still fairly simple. But now, if
we try to represent our little Greeter object instance
in a format like JSON, we don't
get this feature for free anymore. You might try that
and get an error just like this one. This might be
fully expected behavior. You might think yourself about
super easy solutions for representing the humble
Greeter object in JSON, and then loading it back again.
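The two situations just described can be reproduced with a few lines of Python. This is only a sketch using the standard json module: the dict values are illustrative (the slide's data isn't captured in this transcript), and the Greeter class follows the description above.

```python
import json

# A plain dict of built-in types (illustrative values, not the slide's data).
data = {"name": "gopher", "answer": 42, "items": [1, "two"]}
print(json.dumps(data))  # works out of the box


class Greeter:
    def __init__(self, name):
        self._name = name

    def greet(self):
        print(f"Hi {self._name}")


obj = Greeter("gopher")
obj.greet()

# A custom object is not JSON serializable without extra work.
try:
    json.dumps(obj)
    json_error = None
except TypeError as err:
    json_error = err
    print("json.dumps failed:", err)
```

The exact wording of the error message may differ between Python versions, but the exception type is TypeError.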
But the whole point here is that in real-world applications
the complexity might escalate very quickly. For example,
when we talk about custom objects, we should think as well about
external libraries. Maybe your project is using third-party
libraries which don't provide out of the box the ability to
export to your preferred data representation format,
and in that case you might have to implement that yourself.
Also, think about object identity and shared objects.
Maybe you have an object instance which is referenced
twice from an array. When you serialize and later
deserialize this array, you might expect to
have a single object instance which is again
referenced twice from the array, and not, for example,
two different copies of the original object. Also,
think about recursive objects. Consider having a
list, an array, and then you append the very
same array to the array itself. This might be very
hard to represent in formats like JSON or YAML. In order
to elegantly solve this and other interesting situations,
the pickle module adopts a fairly interesting and
original approach. In fact, instead of more
traditionally mapping your original data almost one
to one to a certain data representation format,
and later on requiring a parsing step for reading
the format and rebuilding your objects,
the pickle module implements a fully fledged
virtual machine. So when you are serializing
data with the pickle module, it will create for you a
binary pickle program that you can store somewhere,
perhaps in a file. Later on this program can be given to
a so-called unpickling machine, which is in charge of
running the pickle program and rebuilding the original
objects. This approach is highly flexible.
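The shared-object and recursive-list situations mentioned a moment ago can be checked directly in Python. This is a small sketch with made-up values, using only the standard pickle and json modules:

```python
import json
import pickle

# The same dict object is referenced twice from one list.
shared = {"id": 1}
pair = [shared, shared]
restored_pair = pickle.loads(pickle.dumps(pair))
print(restored_pair[0] is restored_pair[1])  # identity survives the round trip

# A list that contains itself.
recursive = [1, 2]
recursive.append(recursive)
restored_rec = pickle.loads(pickle.dumps(recursive))
print(restored_rec[2] is restored_rec)

# JSON has no way to express the self-reference.
try:
    json.dumps(recursive)
    circular_error = None
except ValueError as err:
    circular_error = err
    print("json.dumps failed:", err)
```

After unpickling, both list elements point to one single dict, and the restored list again contains itself.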
Pickle programs can instruct the unpickling machine to
reconstruct arbitrarily complex data structures.
Moreover, the virtual machine itself doesn't need to know
anything really specific about custom classes,
so custom classes and data types just work out of
the box without further intervention. The only downside
is that the implementation of this virtual machine
is highly tied to Python-specific functions,
methods, and types. We can also have a quick,
high-level look at the virtual machine implementation. We saw
that serializing data with pickle produces pickle programs,
and a pickle program is really just a sequence of instructions
where each instruction is identified by a one-byte
opcode. Certain opcodes might be
followed by one or more additional byte values,
and these values correspond to instruction-specific operands.
They are just like instruction arguments.
The pickle module actually implements a stack-based virtual machine,
so there is a traditional stack structure,
and the virtual machine can push elements onto the stack and pop them from it.
There's also an additional data area
called the memo, which is just something that makes the
virtual machine implementation fairly simple. At the end
of the program interpretation, the stack
will contain just a single object, which will be the fully
deserialized object. Also, the virtual machine instructions
are not too many and not particularly complex either.
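To make the stack and the memo a little more concrete, here is a toy interpreter for a handful of opcodes. It is only an illustration, not the real unpickler: it leans on the standard pickletools.genops function to decode opcodes and operands, it supports just enough instructions for dicts, lists, strings, and numbers, and it assumes programs produced with protocol 4.

```python
import pickle
import pickletools

MARK = object()  # sentinel pushed by the MARK opcode


def pop_to_mark(stack):
    """Pop and return everything above the topmost MARK, in push order."""
    items = []
    while (item := stack.pop()) is not MARK:
        items.append(item)
    items.reverse()
    return items


def run(program):
    """Interpret a tiny subset of pickle opcodes (toy illustration only)."""
    stack, memo = [], {}
    # genops decodes each one-byte opcode together with its operand, if any.
    for op, arg, _pos in pickletools.genops(program):
        name = op.name
        if name in ("PROTO", "FRAME"):
            continue  # bookkeeping only, nothing to push
        elif name == "EMPTY_DICT":
            stack.append({})
        elif name == "EMPTY_LIST":
            stack.append([])
        elif name == "MARK":
            stack.append(MARK)
        elif name in ("SHORT_BINUNICODE", "BINUNICODE",
                      "BININT", "BININT1", "BININT2", "BINFLOAT"):
            stack.append(arg)  # literal operand goes straight on the stack
        elif name == "MEMOIZE":
            memo[len(memo)] = stack[-1]  # remember the top of the stack
        elif name == "BINGET":
            stack.append(memo[arg])  # push a previously memoized object
        elif name == "APPEND":
            value = stack.pop()
            stack[-1].append(value)
        elif name == "APPENDS":
            items = pop_to_mark(stack)
            stack[-1].extend(items)
        elif name == "SETITEM":
            value, key = stack.pop(), stack.pop()
            stack[-1][key] = value
        elif name == "SETITEMS":
            items = pop_to_mark(stack)
            stack[-1].update(zip(items[0::2], items[1::2]))
        elif name == "STOP":
            return stack.pop()  # the single remaining object is the result
        else:
            raise NotImplementedError(name)


data = {"name": "gopher", "numbers": [1, 2.5, "three"]}
result = run(pickle.dumps(data, protocol=4))
print(result)
```

The interpreter reads the program once from start to end; the memo is what lets a later instruction refer back to an object that was already built, which is how shared and recursive structures are rebuilt.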
There is no way to perform any sort of looping or
testing. There are no conditionals, there are no arithmetic instructions,
and no function calls. The structure of pickle programs
is really simple, and the virtual machine just
reads the pickle program once, from start to end,
to deserialize the data. Let's now see a
practical use case and example. Here we are
in Python. We are defining again the Greeter class.
We already saw it, nothing has changed here.
We can instantiate an object, and then let's say that we
want to serialize it. So let's import the pickle
module. Let's open a file, object.pickle,
in writing mode, and finally, let's simply invoke
pickle.dump, passing to it the object and
the file. This code will effectively write some
content to the object.pickle file, and in fact,
this file is now supposed to contain the pickle program,
which can later be used to deserialize our object. We can
try to have a look at the content of the file, for example with a
hexadecimal editor, and all we see is just a
bunch of bytes. Here and there you can see some
human-readable sequences, but it's still hard to get a good idea of what's
going on. However, if you are curious enough, you might go
on with your exploration, perhaps making use
of another built-in Python module called pickletools.
For example, from the command line you might want
to run a command just like this one, which allows you
to get the annotated representation of your pickle
program, that is, of the content of your file. It's very
likely that you'll get a highly dense output just like this one.
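The command and its output are not captured in this transcript. The standard pickletools module can be run from the command line as `python -m pickletools -a object.pickle`, and a comparable annotated listing can be produced in-process with pickletools.dis. The sketch below pickles a Greeter instance to a file and disassembles it; the temporary directory is an assumption made only to keep the example self-contained.

```python
import io
import os
import pickle
import pickletools
import tempfile


class Greeter:
    def __init__(self, name):
        self._name = name

    def greet(self):
        print(f"Hi {self._name}")


obj = Greeter("gopher")

# Write the pickle program to a file named object.pickle.
path = os.path.join(tempfile.mkdtemp(), "object.pickle")
with open(path, "wb") as f:
    pickle.dump(obj, f)

# Read the raw bytes back: this is the pickle program itself.
with open(path, "rb") as f:
    program = f.read()

# Disassemble it, with a short annotation next to each instruction.
listing = io.StringIO()
pickletools.dis(program, out=listing, annotate=1)
print(listing.getvalue())
```

Each row of the listing shows the byte position, the opcode, the instruction name, an optional operand, and the annotation.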
Don't worry, we are not going to explore every detail of this screen,
just a few things. On the very left you can see
the byte positions. Then, in yellow,
I highlighted for you the opcodes. They're just single
bytes. They are followed by the names of the instructions,
which are in turn sometimes followed by
the values of certain operands, and then on the right
you can see short annotations describing what
each instruction is supposed to do. But now let's
go back to some simpler Python code,
especially to see how to deserialize our data and objects.
First, let's make sure that our custom classes, functions,
and data types are defined in our current scope.
Here's again the Greeter class, just as a reference.
After that, let's simply import again the pickle module.
Let's open our object.pickle file for
reading, and let's give the file to the pickle.load
function. This will actually run the unpickling
machine, which will execute our pickle program, and we'll
get back our object, which is almost identical to the original
Greeter instance object. And of course we
can try to invoke the greet method on this
object, and we get, as expected, our "Hi
gopher" message. Yet another important
thing to say about the pickle module is that it comes with
different protocol versions. At present, there are six
different versions, numbered from zero to five. Simply put,
each protocol version identifies a set of instructions
that the underlying virtual machine can handle.
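As a quick illustration, the standard pickle module lets you pick the protocol when serializing, while loading needs no such hint. This is a minimal sketch with made-up data:

```python
import pickle

data = {"name": "gopher", "numbers": [1, 2.5, "three"]}

print("highest protocol:", pickle.HIGHEST_PROTOCOL)
print("default protocol:", pickle.DEFAULT_PROTOCOL)

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    program = pickle.dumps(data, protocol=protocol)
    # pickle.loads takes no protocol argument: the unpickling machine
    # understands the instructions of every version it supports.
    assert pickle.loads(program) == data
    sizes[protocol] = len(program)

# Size in bytes of the pickle program for each protocol version.
print(sizes)
```

The programs differ in size and in the instructions they use, but they all rebuild the same data.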
So from time to time, new protocol versions were introduced for
reasons such as providing better efficiency in the
virtual machine implementation. Perhaps new instructions were
added for better handling of specific Python types
coming with newer Python releases. An important thing
to know in general is that each protocol version is backward compatible
with all previous versions. So that's enough
Python stuff for now. If you are curious about
further details, you can visit the official Python documentation for the
pickle module, and also have a look at the pickletools module,
which provides even more extensive documentation and details
about the implementation of the unpickling virtual machine,
as well as analysis tools. Okay,
so everything was particularly cool and simple enough in the world of Python.
But what if I'm a Go developer, and maybe I have around
some files containing data serialized with the Python pickle
module, and I want to load that data from the Go
language? Some time ago I found myself in that exact situation.
I was working on a machine learning library for the Go language.
It's called spaGO. I recommend checking it out. We
wanted to load from Go pre-trained neural
network models exported from Python,
from the popular machine learning framework PyTorch.
While doing that, we discovered that, apart from
other technicalities under the hood, the PyTorch
serialization process heavily involves the pickle
module. And so the problem was: how do
we load pickle data from Go? A possible solution
might have been to simply write a Python script that
would read the initial data and then transform it to a data
representation format more suitable for being read from
Go. But instead of doing that, I decided to write
a little wish list, and with this I was wishing for
the existence of an easy-to-use Go library
that would allow me to unpickle data in Go,
possibly supporting all pickle protocols. It should
handle out of the box basic simple data types such as
numbers, both integers and floating points, strings,
booleans, et cetera. It should still be easy to extend
with custom data types or types coming from external libraries,
and it would be cool to do that without having to run
Python at any step of the deserialization process.
It would also be cool if such a library had
minimal dependencies, or maybe none at all, and possibly
also made no use of unsafe data types or cgo.
I tried to look around a little bit for existing projects,
but I couldn't really find exactly what I was looking
for, and so I just decided to try to do that by myself.
And here, finally, I introduce you to the GoPickle library,
a library for loading Python data serialized with the pickle module.
Here's the link to the project. This library
is focused on deserializing only, at least for now,
and it's actually a port of the Python
Unpickler class that you can find in
the CPython reference implementation source code.
It turned out that mapping the basic data
types from Python to Go was a fairly easy
process. I'm talking again about boolean values,
numbers, both floating points and integers, and strings.
Even the Python None value was easily mapped
to the Go nil value, and everything else that
was otherwise especially tied to the Python programming
language has been emulated in this library by
using structs and interfaces. Also,
when I was starting this little project, I was especially reassured by
the fact that the pickle library itself is not
particularly big. For example, in CPython version
3.9, you can find the Lib/pickle.py file,
which includes both the serialization and the deserialization
code, and in total it's less than 2000 lines
of code. So that was especially reassuring.
But without further ado, let's jump right in with
a basic usage example. It all starts once
again with some Python code. Let's start by defining
an object using just simple built-in data types.
We already saw this very data structure previously.
It's a dict containing a bunch of keys and values of different
data types, some strings and numbers. We also
already know by now how to serialize data with the pickle
module, so again, nothing new. Once this code is executed,
we'll get an object.pickle file containing our pickle
program. So now of course we can deserialize
our data back from this file. We already know how
to do that in Python by using the pickle module itself.
But here's something new: we can try to do that from Go
by first installing the GoPickle library.
Here's the typical go get command to install the library,
and then you can import the gopickle pickle package
and make use of the pickle.Load function,
which simply accepts the name of the file
containing the pickle program and gives back to you
the deserialized object, and also an error,
which in the positive case will be simply nil.
If everything goes as expected, the object variable will
eventually contain something just like this. Here on
the left I reported the original Python data
structure just for reference and comparison, and you can see very
well here how the GoPickle library transformed some of
the original Python data types into specific
Go types. For example, the original
Python dict is transformed to a Go type
which is also called Dict, of course, and it's implemented
as a series of DictEntry elements,
each DictEntry being just a simple key/value pair,
and you can see how the various dict keys and values are
mapped in Go. There's also the nested
dict here, and you can also see the additional
list value which contains both the number and the string.
These custom types come from the gopickle
types subpackage. You can have a look at it.
It just provides a limited number of
structs and interfaces to represent
and handle a limited set of Python structures and
data types, which are particularly useful for the implementation
of the unpickling machine. So, for example, you have ways to represent
and handle lists, dicts, tuples, and
so on and so forth. Please keep in mind that the implementation of
some of these types is not particularly clever, and especially
is not particularly optimized. When these
types were created, the main goal was to
quickly have a working implementation of the whole unpickling machine,
and some of these types still have a pretty unpolished
look. Now that the whole unpickling machine seems
to work fairly well, there's plenty of room for further
improvements here. Let's now do something a little bit more advanced,
and let's see how the GoPickle library behaves with
foreign custom classes. So here in Python we
have once again the Greeter class. We instantiate
an object, and we serialize it with the
pickle module to our object.pickle file.
If we now try to deserialize our object from Go
just like we did before, alas, this time we'll
get an error back from the pickle.Load function.
The message of this error might not be particularly
easy to understand. In fact, you might
need a little bit more familiarity with the GoPickle
project, and perhaps with the Python pickle module as well.
So this time, let me clarify what's going on here.
The first thing you have to know is that when the GoPickle unpickling
machine encounters unknown data types or classes,
for example the Greeter class in this case, it makes
use of a couple of structures available from the gopickle
types subpackage, which are the GenericObject
type and the GenericClass type. And of course,
Go is not, strictly speaking, an object-oriented language.
That's why we have this clear distinction between objects
and classes. Sometimes letting GoPickle
create these generic objects and classes is absolutely
enough in order to deserialize certain data structures.
However, here you can clearly see how even the humble
Greeter class apparently already has something too much
to be handled out of the box by the GoPickle
library, so we can give our library a little
help in order to better understand the data that it is going to deserialize.
Even the Python pickle module would need to
have the Greeter class defined in the context in order to properly
deserialize it. And so here the plan is to somehow emulate
the Greeter class and its objects here in Go.
A fairly natural way to port to
Go the original Python Greeter class is to
define a Greeter struct, also giving
it a name string field, which parallels the
original Python class's _name instance variable.
Later on, we can expect the unpickling
machine to handle Greeter struct values,
and it will eventually require them to satisfy the
PyDictSettable interface. This interface
is there in order to emulate the Python-specific behavior
of setting a key/value pair on a particular property
that almost every Python object has, which is
called __dict__.
With this assignment in Python, assuming that the
object is actually an instance of a certain class,
you are effectively assigning a value to
a specific instance variable inside that object,
and the name of the instance variable is
identified by the value of the key. We can
easily emulate this behavior in Go as well.
Let's then define this PyDictSet
method for the Greeter struct. It will be automatically
invoked by the unpickling machine, which will provide
a key and a value. They can both be of almost
any type, so they are both just generic
empty interfaces. We know that the original Python
class had an instance variable called _name,
so we might expect that this method will be invoked with
a key equal to "_name". When
we encounter this, we can just expect the value
to be a string, so we can convert the value to a string and assign
it to the name field of the struct. And of course we
can also provide a little bit of error handling here and there. Of course,
in Go we don't even have the object-oriented concept
of classes and of creating object
instances from them. Yet this is an important feature in
the context of the unpickling machine, so we somehow had to emulate that
as well. In Go, the Greeter struct that
we just defined already seems well suited for representing
Python Greeter object instances. But in
Go we have to take another step and also define
a higher-level Greeter class. The original Greeter
Python class was fairly simple. There were no class-level
variables or methods, and so we can keep it simple here
as well. We can define a GreeterClass type
implemented as an empty struct with no fields.
Again, sooner or later the unpickling machine
will have to handle a GreeterClass value,
and it will require it to satisfy the interface
called PyNewable. This time the interface
is there to simulate the creation of
new object instances. In particular, it represents the
Python-specific invocation of a special method
which almost any class has, which is called
__new__.
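On the Python side, the two behaviors that these interfaces emulate can be sketched by hand. The snippet below is roughly what the unpickling machine does for a plain class like Greeter, written out explicitly; it is an illustration, not the code of the pickle module, and greet returns the message instead of printing it only to make the result easy to check.

```python
init_calls = []


class Greeter:
    def __init__(self, name):
        init_calls.append(name)  # lets us see whether the constructor ran
        self._name = name

    def greet(self):
        return f"Hi {self._name}"


# Step 1: create a bare instance through __new__.
# The __init__ constructor is NOT executed here.
obj = Greeter.__new__(Greeter)
print(obj.__dict__, init_calls)

# Step 2: restore the saved state, one key/value pair at a time,
# by setting entries on the instance's __dict__.
state = {"_name": "gopher"}
for key, value in state.items():
    obj.__dict__[key] = value

print(obj.greet())
```

Step 1 corresponds to what PyNewable stands in for, and step 2 to what PyDictSettable stands in for.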
In Go we can define a PyNew method for
the GreeterClass struct. It should accept a variable number
of arguments and return a value representing
an object instance generated from this kind of
class, and also an error if something goes wrong. In our
case, emulating the creation of a Greeter object
instance is as simple as returning
a new Greeter struct value. Having done
this preparation, we are now almost ready to deserialize our
data. We can import again the pickle package,
and this time, instead of using the high-level
function pickle.Load, we can open
by ourselves a file for reading, our object.pickle
file containing the pickle program, and we can give
this file to the pickle.NewUnpickler function.
With this we'll get a customizable Unpickler
object, and after having provided our
desired customization, we can eventually call
unpickler.Load, and this will try to run
the pickle program. In our case, we can customize the Unpickler
object by providing a FindClass callback
function. With this function, we can finally tell the virtual
machine what this foreign
Greeter type is in the first place. This function will be invoked
with the module value equal to "__main__" and the name equal to
"Greeter", which is the location of the original Python
data type. And we can finally provide our
Go implementation of the Greeter class,
which happens to be just a GreeterClass struct
value. Without this function, the unpickling
machine would still fall back to the GenericObject and GenericClass
types that we saw earlier. We are finally
ready to deserialize our object. For doing that,
let's call the unpickler.Load function. Let's see
if there is an error; otherwise, let's just print
to the console the representation of this object. And
lo and behold, there are no errors this time, and
we get as a result a Greeter struct value.
The name field was populated with the value "gopher",
which is exactly the value that we were passing to the constructor
from Python. Having reached this point,
there's really just one more missing thing, and for
that you might want to go the extra mile and implement
the Greet method on the Greeter struct.
Everything should be already in place, so the implementation itself is
super simple. Once you have your deserialized object,
you can convert it, with a type assertion, to a pointer to the Greeter struct.
And finally you can call the function greeter.Greet.
And there you go, you have your message: "Hi gopher".
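The Go code from the slides is not captured in this transcript. As a Python-side analogue, the standard pickle.Unpickler class has a find_class method that is called with the same two pieces of information, the module and the name, and overriding it lets you substitute your own class. The sketch below uses a hand-assembled protocol 2 pickle program, which is an approximation of what Python could emit for this object, and a hypothetical ReplacementGreeter class standing in for the original.

```python
import io
import pickle

# Hand-assembled pickle program for an instance of __main__.Greeter
# whose state is {"_name": "gopher"} (an approximation, written by hand).
PROGRAM = (
    b"\x80\x02"                  # PROTO 2
    b"c__main__\nGreeter\n"      # GLOBAL: look up class __main__.Greeter
    b")"                         # EMPTY_TUPLE: no arguments for __new__
    b"\x81"                      # NEWOBJ: cls.__new__(cls)
    b"}"                         # EMPTY_DICT: start of the state dict
    b"X\x05\x00\x00\x00_name"    # BINUNICODE "_name"
    b"X\x06\x00\x00\x00gopher"   # BINUNICODE "gopher"
    b"s"                         # SETITEM: state["_name"] = "gopher"
    b"b"                         # BUILD: apply the state to the instance
    b"."                         # STOP
)


class ReplacementGreeter:
    """Local stand-in for the original class (hypothetical name)."""

    def greet(self):
        return f"Hi {self._name}"


class CustomUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called by the unpickling machine whenever a class is referenced.
        if (module, name) == ("__main__", "Greeter"):
            return ReplacementGreeter
        raise pickle.UnpicklingError(f"unknown class {module}.{name}")


obj = CustomUnpickler(io.BytesIO(PROGRAM)).load()
print(type(obj).__name__, obj.greet())
```

The pickle program never changes; only the answer given to the class lookup decides which type the rebuilt object will have.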
As a final reference, here is the full list of interface
types from the gopickle types package, which
replace or emulate Python-specific behaviors
or functions. They are especially vital for the
correct functioning of the whole unpickling machine. If you are curious,
you can have a look at the gopickle types documentation, and also
at the documentation of the corresponding Python functions.
Here's also a quick overview of the Unpickler object's
callbacks that you might want or need to customize
in order to provide some guidance for the unpickling process.
We already saw the FindClass callback in action.
There are other callbacks you can define as well, for example for resolving
objects by a persistent ID, or handling
custom pickle extensions, or handling particular
data types or specific instructions.
Also, keep in mind that some of these topics might be considered
particularly advanced and might involve a learning curve
and some time to get used to them. Sometimes a certain intimate
level of knowledge about the pickle module might be required
as well. However, don't worry too much. Most of the time,
even in real-world and more complex scenarios,
the required level of customization doesn't
differ much from what we just saw with our simple
Greeter class. As a bonus,
once the whole unpickling machine was in place,
implemented in Go, it turned out that the original intent
of deserializing neural network models
exported from the Python PyTorch machine learning framework was
a fairly simple job. The Go
code for doing that turned out to be particularly
compact in size, and for that reason we decided to
release it directly in the GoPickle library.
So there is a pytorch subpackage which exposes types
mapped from the original PyTorch Python implementation,
and there's also a high-level pytorch.Load function
to effectively load at least a subset of PyTorch
models, also called modules. This package is
effectively used by the spaGO project, which I already mentioned
before. spaGO is a machine learning framework for Go.
Here is the link to the project. Especially if you're not a machine
learning expert, spaGO comes with built-in tools
and configurations to help you solve traditional machine
learning problems. In particular, in the field of natural language
processing, you can easily make use of
state-of-the-art techniques to perform, for example,
text classification, question answering, automatic machine translation,
named entity recognition, and a lot of other cool
things. spaGO implements all the
functionalities, and then you can also easily obtain
ready-to-use pre-trained neural network models,
for example from the Hugging Face website.
Hugging Face is a fantastic company, which most prominently
started creating this sort of community where people can
freely share their own pre-trained models, and many of these
models are actually generated by using PyTorch.
Indeed, a subset of these models is compatible with
spaGO, which provides high-level functions and also command
line tools that can automatically download compatible
models, load them thanks to the GoPickle library,
and convert them to a spaGO-specific format.
Finally, your application can perform a lot of wonderful
things, and you don't even have to leave the terminal.
GoPickle is still a very young project, there's plenty of room
for improvements, and a lot of tasks are still left
to do. Among others, it's definitely desirable to
have more tests and better test coverage, and more and
better documentation. Maybe it would be cool to implement better
error messages and clearer ways to inspect
what's going on in the pickle programs. We should try
to support more and more Python standard classes as
well as PyTorch-specific classes, and performance
might also be an interesting point to work on. In conclusion,
here is my call to action for you. Please go visit
the GoPickle GitHub repository page. The easiest
way to contribute is to simply share the link and,
if you like, also give us a star. If you use the
GoPickle library in your own projects and experiments, let us
know how it goes. Feel free to come up with suggestions for
fixes or improvements. And also please go on with
your own pull requests. They are very, very welcome.
Get in touch with us for any reason that you want,
even just to say hi. And finally, you can also support us
via our fiscal sponsor at opencollective.com/nlpodyssey.
So that's it. It has been
a long journey, but I hope you enjoyed it. Thank you very much for your
attention, and until next time,