Abstract
We created Python API calls that let you make queries and manipulate data in our graph database. We asked ourselves: what would be best for Pythonistas? What would be the most Pythonic way to do it? (Is that even a thing?) Here is our journey in making WOQLpy, which we want to make useful to you.
A query language is an important part of a database system: it is how people manage their data and make it useful to them. Since the 1970s, the world has been full of relational databases, and SQL has been the way to query them. However, SQL is vulnerable to injection attacks. A lot of effort goes into stopping those attacks, and that makes workflows less efficient.
We don’t want to make the same mistake. That’s why a query language in Python is a good idea. With the Python community in mind, we created WOQLpy, an open-source query language that lets users build queries in Python instead of JSON-LD, the native query format of our TerminusDB database. Now users can store data in a knowledge graph and visualize graph data with Python.
In the first part of the talk, we will cover the challenges we faced when creating a query language in Python, the method we used, the ideas and theory behind it, and how WOQLpy works. This part will include a quick live demo of WOQLpy so the audience can get an impression of how to make a query and get the task done, that is, getting a meaningful graph visualization from the source CSVs. The whole process, creating a database and schema, loading the data from many CSVs, and making a query and visualization, will be demonstrated using just one Python script.
In the second part of the talk, we want to stimulate a discussion about what is good design in Python and what is not. This part will be more interactive, as we want to hear from you all what would be best for Pythonistas. After first suggesting some possible designs, we will use a live voting system to gather opinions. This part of the talk will extend into the Q&A session to allow further discussion.
This talk is for Pythonistas at all levels who are interested in starting to design a package in Python, whether or not they have published a Python library before. By attending this talk, the audience will learn how to design a Python package that is useful to Pythonistas, and will hopefully be encouraged to publish open-source packages online.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Chuck, and I'm going to talk about how to
be Pythonic and my journey of designing a query
language in Python. These are my contact details.
Feel free to follow me on social media. So, I'm Chuck.
Who am I? I love open source projects.
I have been involved in different open source projects before,
mainly in Python. Currently I work full time for TerminusDB,
which is an open source graph database. I also love organizing
community events, from conferences to,
well, meetups before the pandemic, and also sprints
where people just contribute to open source together. Right now,
because we are going nowhere, I also stream online
on Twitch. So if you follow me on Twitch, sometimes you will catch me
online doing some Python stuff. One question that I
always ask is: what is Pythonic?
You hear this many, many times,
people talking about it. So what does Pythonic
mean? I found this answer on Stack Overflow.
Obviously someone was asking the same question. It
says Pythonic means more than that the code
gets the syntax right. The code is correct,
it runs, but usually there's more than one way
of doing it right. If it's done in
a way that is accepted by the Python community,
that is easy to understand, and that follows how
the language is intended to be used, then it is Pythonic.
So in my mind, okay, it's
kind of an artistic thing, right? So it's
very subjective. Whether something is beautiful or
better is subjective.
How do I know what is Pythonic and what is
not? I think a lot of it comes down to
looking at how other people approach
it, or sometimes it's just whether
there are fewer lines of code, or checking whether you follow the
Zen of Python. In case you don't know what the Zen
of Python is, try typing "import this" in a Python terminal.
Then you will see the Zen of Python. So it's something that
you kind of learn by doing
it, maybe by contributing to open source,
or you learn by just reading other people's code.
One example is pandas. Pandas is basically
its own ecosystem. A lot of times when
I browse Stack Overflow, people ask questions about pandas,
how to do this, how to do that. When I first
learned how to use pandas, I was like, why? I can just use a for
loop, loop over the data frame row by row, and find the answers.
Instead I have to do aggregation, joining and all this stuff.
Why do I do it that way? Well, the answer is that
pandas is built on NumPy, and
NumPy is a library where, if you use the built-in functions, they
are optimized. It's a lot faster because it uses C extensions.
So it's a lot faster than doing a for loop.
That's a very practical advantage of doing it that way.
It's not just the style, it's not just that it's beautiful and people love reading
your code, it's also for performance. If you use Python in
the way it is intended to be used,
then there's a benefit to it. So for
example, the for loop again, we all love for loops. Well, this is not
Pythonic code. Obviously. I hope it's obvious to you.
It's JavaScript code, or Java,
something like that. I forgot where I copied this from.
It's actually something that I learned when
I was in school: use an index and then increment the
index, so you access all the items in an array one by one.
But in Python things are much simpler. You
just care about what's inside a
list, for example. You just go through all the items inside the
list and then do whatever, or sometimes it's even
more compact, right, like list comprehensions and all this stuff.
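
The three styles mentioned above can be put side by side. This is a minimal sketch, not the code from the slide; the station names are hypothetical sample data.

```python
# Hypothetical sample data, not taken from the talk.
stations = ["Central", "Harbour", "Museum"]

# Index-based loop, the style carried over from C-like languages.
upper_by_index = []
i = 0
while i < len(stations):
    upper_by_index.append(stations[i].upper())
    i += 1

# Iterating over the items directly: no index bookkeeping needed.
upper_by_item = []
for name in stations:
    upper_by_item.append(name.upper())

# List comprehension: the same result in a single expression.
upper_by_comprehension = [name.upper() for name in stations]

# All three approaches build the same list.
assert upper_by_index == upper_by_item == upper_by_comprehension
```

All three produce the same list; the later versions simply leave less room for off-by-one mistakes.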
So usually the Pythonic way is actually the
simpler way. Less code is simpler, and usually that's
a good indicator that things
are being done Pythonically. A lot of times,
especially when I was developing the TerminusDB
Python client, a lot of the work was translating code
from JavaScript to Python. Then I had to think,
okay, how should I do it in Python? It should not
be a for loop where I'm incrementing the
index. I should care about the items in it
instead, because that's Pythonic. Yes, I mentioned a little
bit about working on the Python client of TerminusDB.
So my journey of thinking
about how to design a Python client started when I
became a developer advocate at TerminusDB. But before I tell
you my journey, I have to give you
some idea of what TerminusDB is, right? It's a graph database.
Some of you may already have experience with graph databases.
For example, Neo4j is a very popular one that has much
more history and is more well known.
We are also a graph database, but of
course different. To make things short, imagine that you're not
storing things in a tabular format. This is how I used to store
data when I worked as a data scientist: lots of CSVs,
lots of SQL databases. Things are stored in tables
with an index, a key. On the left
hand side there you see the person ID is the key, the primary key,
and you have some information about this person, obviously name, date of
birth and, okay, mother and father. There are two columns there,
some of them are null and some of them have a number in them.
What are they? I hope at this point you have already figured it out.
It takes a while to make sense of it, and then you
can imagine that it's actually a family tree. If
you put the information not in a tabular format but in
a graph format, it looks like that. And it's very obvious
that it's a family tree because, well, we named the edges mother and father.
You see mother and father, your parents, so of course it is a
family tree. You can also obviously see who the grandparents are,
and which persons are in the same generation.
All of this is very obvious if we put it in the graph format.
In the table, instead, you have to think, okay, mother and father,
that's actually a key to join back to the person. Then who
is the mother, who is the father? You have to make some joins.
We do a lot of joins when working with SQL.
Sometimes it is more natural to present the data
in a graph format. That's why a graph database
is useful. Again, this is how we find
the maternal grandmother of John. With SQL, you have to
select from a table and then do some joins, maybe join
it back together. I'm not a big fan of doing that kind of awkward aggregation
and then joining tables together. I just find it quite
difficult to picture in my mind.
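
To make the self-join concrete, here is a small sketch using Python's built-in sqlite3 module. It is not the query from the slide: the table layout follows the description above (a person ID key with mother and father columns), while the three rows and their names are hypothetical.

```python
import sqlite3

# In-memory database with a person table shaped like the one described:
# a primary key plus mother/father columns that point back at person_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, "
    "mother INTEGER, father INTEGER)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "Mary", None, None),
        (2, "Jane", 1, None),
        (3, "John", 2, None),
    ],
)

# Two self-joins: person -> mother -> mother's mother.
row = conn.execute(
    """
    SELECT m.name, g.name
    FROM person AS p
    JOIN person AS m ON p.mother = m.person_id
    JOIN person AS g ON m.mother = g.person_id
    WHERE p.name = ?
    """,
    ("John",),
).fetchone()

mother_name, grandmother_name = row
print(mother_name, grandmother_name)
conn.close()
```

Every extra generation adds another join of the table onto itself, which is the part that gets hard to keep in your head.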
Instead, with TerminusDB we have a query language
called WOQL. WOQL is very similar to Prolog, if you
know what Prolog is. You're making statements,
logical statements, and then you find which variables
satisfy those logical statements. You can see here
that we have four triples, which are the relations:
two nodes and an edge make a triple. And then we have a person.
Say you want to find the grandmother of John.
Then you put John here, and you say, oh, this person
should satisfy this relationship, which is John's
mother. And the other relationship here
is that the mother of the mother is the grandmother,
the maternal grandmother. So here you find the variables that
satisfy these relations, and ask for the names of those people.
You get the name of the mother and of the maternal grandmother in this
query. Everything is done in one place instead
of making multiple select and join statements.
So here you can see that we need one and two, two
statements, to find the mother and the grandmother, right? And here we can find
all of them in one go. It takes some time to get used to
how you think in the query language, but once
it works, it's much more efficient.
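
The idea of stating triples and letting the engine find variables that satisfy all of them can be illustrated in plain Python. This is a toy matcher, not WOQL and not how TerminusDB evaluates queries; the data, the "v:" marker for variables, and the function names are assumptions made for this sketch.

```python
# Hypothetical family data as (subject, edge, object) triples.
TRIPLES = {
    ("person/john", "name", "John"),
    ("person/john", "mother", "person/jane"),
    ("person/jane", "name", "Jane"),
    ("person/jane", "mother", "person/mary"),
    ("person/mary", "name", "Mary"),
}


def is_var(term):
    # In this sketch, any string starting with "v:" is a variable.
    return isinstance(term, str) and term.startswith("v:")


def match(patterns, bindings=None):
    """Yield every set of variable bindings that satisfies all patterns."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        candidate = dict(bindings)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, candidate)


# Four statements, all solved together in one query.
query = [
    ("person/john", "mother", "v:Mother"),
    ("v:Mother", "name", "v:MotherName"),
    ("v:Mother", "mother", "v:Grandmother"),
    ("v:Grandmother", "name", "v:GrandmotherName"),
]
results = list(match(query))
print(results)
```

The mother and the maternal grandmother come back from one query, because the shared variables tie the four statements together.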
So this is WOQL, right? This is the query language of TerminusDB.
When I first joined, there wasn't any Python client.
WOQL is natively JSON-LD. This JSON
format is how the front end talks to the back end, how
the clients talk to the server itself. So
the native format of WOQL is JSON-LD.
At the point when I joined, we had a JavaScript client.
So I thought, okay, if we want to help
Pythonistas and data scientists use this awesome
graph database, we need a Python client. So we have WOQLpy, a
query language for Python users. Great.
I got support from my team, and everybody thought it was a good idea.
So we started working on WOQLpy. So what is WOQLpy?
Well, it comes with the Python client, obviously,
which you can pip install. It's released on PyPI
like all other Python libraries. It includes multiple
modules. It includes the Python client itself,
which is a wrapper for the API, so that
you can carry out different manipulations on
the server and database. There is also WOQLpy,
the query part, WOQLQuery. That's how you build your schema,
how you query your data, how you insert data into
the database. That's the second bit. The third bit is
a visualization tool that can give
you an interactive graph visualization of your result data.
I'll talk about that later. There's also the data frame
module, which is optional. If you install it, you can use
some of the functions inside to convert your result from
JSON format into a pandas data frame,
which is quite cool. So this is an example of how to use TerminusDB here.
In this example we are building a schema for this bike
data, and you can see that we have three objects
that we created in the schema graph. There's a station, which is a
document type. It has a label and a
description. You can also add properties; for example, on the journey
type you can add properties to it. You can imagine what all of this
describes. For example, if I have a station object, it will
have a name. It will have
a label and a description. If it's a journey object, on top of
those it will also have these properties, right? It will have the end
time, the start time, the journey bicycle, and all these properties.
So this is a schema. You can also add them all
together. I have created three different types of
object, and the schema consists of these three types of objects.
So I just add them together and then execute it with the Python client.
It's still not the optimal way of doing it. You still have to make
a lot of method calls on the WOQLQuery object, but I'll show you a
better schema building design that we have just come up with and that
is still under development. I show you this first because this is what is being
used currently. But originally, like I said before,
WOQL is natively a JSON format.
So yes, you can use the Python client, but if you don't
use the WOQLpy API, you have to write the query like this, right?
Obviously people don't want to write a query like this. They would
rather do this with all the Python code.
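
The contrast between hand-writing nested JSON and calling Python methods can be sketched with a tiny builder. This is not the terminusdb-client API and the JSON keys are invented; the DocType class below is a hypothetical stand-in whose method names only echo the label, description and property calls described in the talk.

```python
import json


class DocType:
    """Hypothetical builder; the emitted keys are invented for this sketch."""

    def __init__(self, name):
        self._doc = {"@type": "DocType", "name": name, "properties": []}

    def label(self, text):
        self._doc["label"] = text
        return self  # returning self is what makes chaining possible

    def description(self, text):
        self._doc["description"] = text
        return self

    def property(self, name, datatype):
        self._doc["properties"].append({"name": name, "type": datatype})
        return self

    def to_dict(self):
        return self._doc


# Chained method calls instead of a hand-written nested document.
journey = (
    DocType("Journey")
    .label("Journey")
    .description("A bike journey between two stations")
    .property("start_time", "dateTime")
    .property("end_time", "dateTime")
)

# The nested structure that would otherwise be typed by hand.
print(json.dumps(journey.to_dict(), indent=2))
```

The printed document is what the user would otherwise have to write and keep well-formed manually; the builder keeps that detail out of sight.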
Or even better, when I show you the newest Python schema building
regime, I don't know how to describe it,
the scheme for building a schema. Also, we
have some flexibility in our query objects:
you can design your document type by chaining the
extras, like label and description, or you can just pass them in
as optional arguments. You can do that because
these two are optional. So you can pass them in like
this or just chain them up, and both will work. There's some flexibility in
the design, but there were also challenges when I tried to translate the
JavaScript client into Python. For example, the method "and".
In Python, as some of you may know, "and"
is a keyword, so it can't be used as a method name, and
it can't be translated directly from the JavaScript.
So you have to use woql_and, which is not
a very good name. That's why we have now overloaded
the operator. You can see the plus sign that I showed you before.
Right now, for "or", we also have the pipe operator to
be used as "or". And "not" is also
a keyword. And "from": you can't
just use it as it is because, again, it's a keyword.
So for now you have to add the woql_ prefix to
it. But we may change the design in the future.
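
The keyword problem and the operator workaround can be shown in a few lines. The Query class here is a hypothetical stand-in, not the real WOQLpy query class; it only demonstrates how Python's `__add__` and `__or__` hooks let `+` and `|` stand in for "and" and "or".

```python
import keyword


class Query:
    """Hypothetical stand-in for a query object, for illustration only."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def __add__(self, other):
        # q1 + q2 is read as "q1 and q2".
        return Query("and", self, other)

    def __or__(self, other):
        # q1 | q2 is read as "q1 or q2".
        return Query("or", self, other)

    def to_dict(self):
        return {
            "op": self.op,
            "args": [a.to_dict() if isinstance(a, Query) else a for a in self.args],
        }


# These names are reserved words in Python, so none can be a method name.
reserved = [name for name in ("and", "or", "not", "from") if keyword.iskeyword(name)]

a = Query("triple", "v:Person", "mother", "v:Mother")
b = Query("triple", "v:Mother", "mother", "v:Grandmother")
c = Query("triple", "v:Person", "father", "v:Father")

# Reads as: (a and b) or c.
combined = (a + b) | c
print(combined.to_dict())
```

Overloading keeps the query readable without resorting to prefixed method names, at the cost of operators that mean something slightly different from their arithmetic use.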
You never know. This is about the extra things that we threw
in for the Python client. This one is the
integration with Jupyter Notebook. We have a few things that give data scientists
who use Jupyter Notebook a better experience. For example,
like I mentioned before, there's the data frame module
that lets you convert the result into a pandas data frame.
And this one is the interactive graph visualization
that I mentioned. You can see that all
of this is customizable. You can change the color of the nodes, you can
change the size of the graph. So
there are different ways of visualizing your result in Jupyter Notebook,
which is quite nice. And it's not just Jupyter Notebook: you can also output
it as an HTML file, because it just uses D3
to generate it. I have another talk about how
I made this part of the Python client happen,
but I don't have time for it in this talk. We're mainly focusing on the query language.
But I think you can find my previous talk on YouTube
or somewhere. It's recorded. This is something new.
This is the schema builder that I talked about. Remember,
before, to build a schema we had to
use the doc type thing, like this example here:
we have to use the doc type and then add the properties to the
doc type, right? We just think that it's not intuitive enough
for Pythonistas. So we now use more of
a data modeling approach in which you use classes.
Everything is a
class, of object or document or enum. Actually I'm still
working on the enum part. You can have all the properties as its
attributes, and then you can use annotations to
fix the type of each property, whether it's a data
type property like float or string, or
an object property. For example, for Country,
I can have the perimeter, where I'm using Coordinate,
which is an object type that I
just created. So this is what we
are moving towards. Right now it's under development.
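
A class-based schema of this kind might look roughly like the sketch below. It is not the schema builder itself: the Object and Document base classes are empty hypothetical stand-ins, the field names are assumptions, and describe() only shows what plain Python already exposes (class name, docstring, type annotations) for a builder to read.

```python
from typing import List, get_type_hints


class Object:
    """Hypothetical base class for sub-objects."""


class Document:
    """Hypothetical base class for documents."""


class Coordinate(Object):
    """A point given by x and y."""

    x: float
    y: float


class Country(Document):
    """A country and the outline of its border."""

    name: str
    perimeter: List[Coordinate]


def describe(cls):
    """Collect what a schema builder could read off a class definition."""
    return {
        "label": cls.__name__,
        "description": (cls.__doc__ or "").strip(),
        "properties": get_type_hints(cls),
    }


schema = [describe(cls) for cls in (Coordinate, Country)]
for entry in schema:
    print(entry["label"], entry["properties"])
```

Data type properties (str, float) and object properties (Coordinate) are declared the same way, as annotated attributes, which is what makes the class form feel like ordinary Python.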
Also, instead of setting the label and the description separately,
the label will just be the name of the class, and the description will be the docstring
if there is one. So this is what we are working on
right now. I hope that it is a good design. I hope you like
it. Let me know whether you find this a better way to build
a schema. Looking into the future, there is what we still want
to do but are not quite there yet. We want to
make life much easier for pandas users.
Now we can output the results in data frame format,
but the other way around is not there yet. Hopefully in the future, if
you have some CSVs, things will be more automated. If it's simple
enough, we could look at your CSV or data frame,
create a schema automatically, and then import
all your data from your CSVs or data frames automatically.
That's something that we are aiming for after the schema builder
is finished. Also network graph analysis:
I know that Neo4j has it, and we don't have it yet, but it's
something that I really want to do. We
don't know when we're going to have it. If you want to work
on this, please let me know. I'm happy to collaborate with you.
Also the schema checker. Since we have
the schema builder like this, when we add
objects in, all the data being added
will be objects of these classes, right? So we can
efficiently check whether the data follows the
schema correctly. Right now the challenge is that when people
get an error inserting data, they don't know what's going on,
they don't know what went wrong and why they can't insert the data.
But the schema checker is like a linter for the schema, so
you can see which part is wrong.
Maybe the type of your data doesn't match the type that is described in the
schema. It should raise a flag to tell you exactly
which point is wrong so that you can fix it rather than just
guessing. So yes, your suggestions are always welcome.
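
A schema checker in the linter sense described above could work along these lines. This is only a sketch of the idea, not the planned implementation: the Station class and its fields are hypothetical, and the check handles plain types such as str and int but not generics like lists.

```python
from typing import get_type_hints


class Station:
    """Hypothetical document class; the fields are invented for illustration."""

    name: str
    docks: int


def check(cls, data):
    """Return a list of readable problems; an empty list means the data fits."""
    problems = []
    hints = get_type_hints(cls)
    for field, expected in hints.items():
        if field not in data:
            problems.append(f"{cls.__name__}.{field}: missing")
        elif not isinstance(data[field], expected):
            problems.append(
                f"{cls.__name__}.{field}: expected {expected.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    # Flag anything in the data that the schema does not declare.
    for field in data:
        if field not in hints:
            problems.append(f"{cls.__name__}.{field}: not in schema")
    return problems


good = check(Station, {"name": "Central", "docks": 12})
bad = check(Station, {"name": "Central", "docks": "twelve"})
print(good)
print(bad)
```

Running the check before inserting means the user sees which field is wrong and why, instead of a bare failure from the database.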
If you have any questions or suggestions, feel free
to open an issue in our repo. It's under
TerminusDB, the TerminusDB Python client.
I will show you the link later. So that's basically everything that
I wanted to talk about in this talk. There's more to explore,
of course. If you're interested in how we created the Python
client, how we created WOQLpy, the query language,
or if you want to learn more about graph data modeling,
we have the TerminusDB Academy. We are planning to organize
more workshops for people to learn how to build models,
use the new model builder tools and all this. Also follow us
on Twitter, where you will get all the news. Check out our website or,
even better, join the Discord server. We have office hours every week
where you can talk with the tech team directly, ask questions,
give suggestions, feedback, whatever you like. Just hang out with us.
We want to hear from you. Our GitHub repo is
here, on GitHub under terminusdb-client-python,
so you can see it here. This is our repo, so feel free
to open an issue and suggest anything.
That's it for my talk. Thank you so much for
listening, and feel free to ask any questions. Join our Discord
server. I will see you there.