Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello
everyone, and welcome to my talk: "Lyra: Disrupting the Full-Text Search Industry with JavaScript". Before we start, I always like to introduce
myself very briefly. I am Michele. I work as a staff engineer at NearForm, and I'm a Google Developer Expert and a Microsoft MVP. I want to start by saying that I love Elasticsearch.
When I think of full text search engines, I always think
of my favorite open source project ever, which is Elastic.
In the past I had the opportunity to work a lot on, and with, Elasticsearch, on projects and products. And I gotta say, I also tried working with Solr, and I tried working directly with Lucene, the library underneath Elasticsearch itself, but also underneath Solr. But I gotta say, every single time I come back to Elasticsearch, because it's just an amazing open source project and I truly,
truly love it. In the past, of course, I've been contributing to
open source a lot, and I've been working a bit on Unomi, which is a CDP, a customer data platform from Apache, that uses Elasticsearch as the primary database in its overall architecture. And I was amazed by the fact that every day we could throw millions of records at Unomi, and Elasticsearch would work just fine; search performance wasn't much affected by the amount of data we inserted into the database. And that is truly amazing. And one question I always wondered about was
how is it possible that Lucene, and then of course Solr or Elasticsearch, can be that fast? And when we think about performance, when we think about how Elasticsearch and Lucene work, we have to make a distinction. We already anticipated that a couple of seconds ago, but yeah: Lucene is the actual full-text search library, which is written in Java. And Elasticsearch is not just a search engine itself, because it uses Lucene; it's also a RESTful server. It is a distributed system. It adds sharding on top of the overall distributed system, so that if you have a lot of data, you can shard it amongst multiple nodes. It takes care of data consistency, monitoring, cluster management, a lot of stuff.
And as I already said, I love Elasticsearch, and I
wanted to start recreating it from scratch. Not because
I didn't like it, but because I wanted to learn more.
And quoting one of the greatest of all time, Richard Feynman: "What I cannot create, I do not understand."
So that's my life motto. This is something that truly explains the way I learn stuff. So yeah,
I wanted to create something in order to understand how that works,
of course. Again, I love Elastic, but I also had some problems with it. It's not the easiest software to deploy and set up, let's be honest about that.
It's quite hard to upgrade, it has a big memory footprint, CPU consumption is not great, and it's really, really costly. If you go with the cloud version, it has a cost; if you want to maintain it on your own, on your cloud provider of choice, it also has a cost, of course. So at the end of the day, I find it to be a very good product, with no competition right now in my opinion, but it has some problems. Let's be honest about that.
Before I continue, I want to say that all the problems I found may have been my own fault; maybe I was too inexperienced, and Elasticsearch was a bit too much for me. We all know that making simple software is hard, but we can give it a try. I want this to be clear from the start: I love Elasticsearch, and I started recreating something similar just because I wanted to learn more, not because I wanted to replace the whole system in the first place.
So I set myself a goal. I wanted to give a talk on how
full-text search engines work. That's also for another nice reason.
If I don't have a goal, I'm not able to understand how stuff works and actually study; I need a motivation. So that was my motivation. And yeah, I started learning more about full-text search algorithms, data structures, and so on. And yeah,
I started going down the rabbit hole. So that was me in the first few hours, reading the theory behind full-text search. It's not easy, let's be honest about that. The hard truth is that I needed to study a lot of algorithms and data structures. I have no degree, so I didn't have any place in my mind where I could say, okay, I remember discussing this during a university lecture, I can reach out to that professor, for example, to learn something and ask questions. I started literally from scratch. And that can be kind of a problem for us self-taught developers, as I am, but it was very interesting anyway.
And of course, after you start learning how stuff works, you need to implement it. And when you start implementing something, you have to choose a programming language for your algorithms and your data structures. And of course I wanted to be a cool guy, and cool guys use cool programming languages. So, of course: Haskell? No. I started working with Rust, and I gotta say, having worked with Haskell for quite a long time, I thought that Rust could have been a nice option, the easiest option, possibly. I was terribly wrong.
It's not an easy programming language. It's super cool, of course, and cool guys use Rust all the time; it just wasn't for me. So I decided to implement everything from scratch in Golang. And Golang is not super easy either, in my opinion. I mean, compared to other programming languages such as TypeScript or Ruby, it's another kind of programming language. So I started feeling a bit frustrated, because I wanted to get stuff done, but I didn't have enough knowledge of those programming languages to get stuff done, of course. And then I remembered a quote from
Jeff Atwood, the co-founder of Stack Overflow, also known as Atwood's Law, which says: any application that can be written in JavaScript will eventually be written in JavaScript.
And yeah, why not? JavaScript is the king
of programming languages, right? So I decided to start implementing stuff
with JavaScript, and the result was kind of surprising to me. I had started implementing stuff with Rust: all the data structures required for a search engine started in Rust. The Golang reimplementation was actually quite optimized too, because I had spent a lot of time on Stack Overflow asking more expert people for code reviews, so I was pretty confident in the performance. But I gotta say, even though JavaScript couldn't outperform those languages, it was very close. And we will see how close in the next slides. That brings me to the next point: there are no slow programming languages, just bad algorithm and data-structure design. Which basically means that maybe my algorithms (of course, I asked for code reviews, I asked for a lot of help) just weren't that well optimized. Rust cannot make your shitty code better. But I have more experience in JavaScript, so my JavaScript code is written better than my Rust code, and while it doesn't perform better, it's pretty close. And that's a good point to understand, in my opinion; that's a lesson I had to take home.
So basically, after spending a couple of months working on that search engine, I gave my talk at WeAreDevelopers in Berlin, in August, I guess. And yeah, this is how Lyra was born. So Lyra nowadays is an open source project; it's a full-text search engine written in TypeScript.
And one nice thing about Lyra that I'd like to highlight is that it targets every single JavaScript runtime. So it's not a problem if you want to run it on, let's say, Node.js, on Deno, on Bun, on Cloudflare Workers, on React Native; it's not a problem, we don't have any kind of dependency. We test everything on every single runtime, and we implemented everything from scratch with backward compatibility across every single runtime in mind, as well as performance, of course.
So we implemented everything from scratch. We implemented prefix tries, inverted indexes, B-trees, tree maps; we implemented the stemming algorithm, stopwords, support for custom stopwords; we introduced support for multiple languages, everything from scratch, so that you can use Lyra wherever you want, on your favorite runtime. And when I say runtimes, I refer to the fact that you can run Lyra on, let's say, Cloudflare Workers or Netlify Functions, so you can target edge computing; you can run it in browsers, on AWS Lambda functions, on AWS Lambda@Edge, on Deno, React Native, Node.js. You can literally run it wherever you want.
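To give a feel for one of those data structures, here's a toy sketch of a prefix trie, the structure that makes prefix searches fast. This is an illustration of the general technique, not Lyra's actual implementation:

```typescript
// A toy prefix trie: illustration only, not Lyra's actual implementation.
class TrieNode {
  children = new Map<string, TrieNode>();
  docIds: number[] = []; // documents whose indexed field contains this word
}

class PrefixTrie {
  private root = new TrieNode();

  insert(word: string, docId: number): void {
    let node = this.root;
    for (const char of word) {
      if (!node.children.has(char)) node.children.set(char, new TrieNode());
      node = node.children.get(char)!;
    }
    node.docIds.push(docId);
  }

  // Collect the doc ids of every indexed word starting with `prefix`.
  findByPrefix(prefix: string): number[] {
    let node = this.root;
    for (const char of prefix) {
      const next = node.children.get(char);
      if (!next) return [];
      node = next;
    }
    const results: number[] = [];
    const stack = [node];
    while (stack.length > 0) {
      const current = stack.pop()!;
      results.push(...current.docIds);
      stack.push(...current.children.values());
    }
    return results;
  }
}
```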
And talking specifically about edge computing, it wasn't super easy for us to get there. I remember I was at a conference in Berlin a couple of months ago, and I was there with a colleague of mine, and I said, you know what, it would be cool to run Lyra on the edge, right? He told me, okay, yeah, hold my beer; that's all he said. And the day after, we were able to ship a very basic version of the very first full-text search engine capable of running on edge networks. And let me show you how we did it.
So, talking about Lyra and how it works: we basically have a collection of functions, for example create, insert, remove, and search. Create creates a new Lyra instance, insert inserts new data into an existing instance, remove removes data, and search, of course, performs a search operation. But let's start from the beginning. It's not schemaless, so it's not really like Elasticsearch in that sense, which is totally schemaless; you have to define the types for the data you are going to put into the database itself.
So in this example we just create a schema containing author, which is a string, and quote, which is another string. Then we want to insert stuff. As you can see there, we basically pass the db, so we mutate the original db instance by inserting new data. This is a synchronous operation, as you can see there, so we also provide an insertBatch method, which is asynchronous and prevents the event loop from blocking. That's pretty important, just for you to know. And once we insert stuff,
we can eventually start searching for data. In this example, we pass the db, so inside that specific database (because we might have multiple databases, why not), we search for the term "if" on all the properties; so we search on both quote and author. But you can also choose to search on quote only, or on author only.
And you will see that elapsed is 99; we'll get back to that later. Count is two, and there are two results. So these are the results; this is the API for searching data. But now you may be wondering: what is 99? Is it seconds? Milliseconds? No, it's actually microseconds.
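Putting that walkthrough into code, it looks roughly like this. I'm reconstructing the Lyra 0.x API from the talk here, so treat the package name and signatures as approximate:

```typescript
// Approximate Lyra usage, reconstructed from the talk; names may differ.
import { create, insert, insertBatch, search } from '@lyrasearch/lyra';

// Lyra is not schemaless: you declare the shape of your documents up front.
const db = create({
  schema: {
    author: 'string',
    quote: 'string',
  },
});

// Synchronous insert: mutates the db instance directly.
insert(db, { author: 'Rumi', quote: 'If light is in your heart, you will find your way home.' });

// Asynchronous batch insert: avoids blocking the event loop on large datasets.
await insertBatch(db, [
  { author: 'Seneca', quote: 'If one does not know to which port one is sailing, no wind is favorable.' },
]);

// Search for the term "if" across all properties (or restrict via `properties`).
const results = search(db, { term: 'if', properties: '*' });
console.log(results.elapsed, results.count, results.hits);
```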
No, it's actually microseconds. And you might be thinking,
wow, okay, you only inserted like four documents.
Of course, it's fast. Yes, this is true. But we also made
some benchmarks. So we took 1
million entries, which are actually 1 million movie titles
from the international movie database. We inserted everything inside
lira and we performed different searches. So for example, we searched
for believe inside of all the indexes and
on average it took 41 microseconds,
so millionth of a second to
get all the results. And if you go like in criminal
minds, for example, it takes 187 microseconds. But as you can see,
criminal minds has two different terms, so it performs two different
searches, then intersects all the documents containing both terms.
So there is a lot to do in that case. But it's so
damn fast. And again, there's no slow programming language out
there. There is just bad algorithm and data structure design. So that's something
to keep in mind after we get there. We also wanted to give support from
multiple languages because of course English is default,
but I'm italian, so I might want to index italian data.
And when indexing data, we also want to make our search engine smart. Let me give you an example. We perform stemming operations and stopword removal. For example, if we have sentences containing commonly used words, such as articles like "a", "the", et cetera, and other similar words, we just remove them, because they carry no meaning. And if we have words like "lucky", we stem them to "luck", so that if you search for "luck", or "lucky", or "luckiest", you will find all the documents containing any of those forms. Of course, you can also say: okay, give me the exact result, so "luckiest", not "luck" or "lucky"; you can filter out results that you don't want.
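Continuing the earlier API sketch, that filter is just a flag on the search call; the `exact` option name is how I remember Lyra's API, so double-check it against your version:

```typescript
// Only exact matches: "luckiest" will not match documents that merely
// contain "luck" or "lucky". (`exact` per my recollection of the Lyra API.)
const exactResults = search(db, { term: 'luckiest', exact: true });
```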
But we also give you an interface for smart searches, and we do that using a stemming algorithm that currently supports 23 different languages. So here's an example of how it works: we basically expose a tree-shakeable Italian stemmer, in the example you can see on the screen, and you choose the language and the tokenizer. The stemmer, which is part of the tokenization process, will in this case be the Italian stemmer.
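In code, that configuration looks roughly like this. The import path for the tree-shakeable stemmer and the option names are my best recollection and may not match the real package layout:

```typescript
// Sketch: configuring Lyra with the tree-shakeable Italian stemmer.
// The import path and option names are approximate; they may differ.
import { create } from '@lyrasearch/lyra';
import { stemmer } from '@lyrasearch/lyra/stemmer/it'; // hypothetical path

const db = create({
  schema: { autore: 'string', citazione: 'string' },
  defaultLanguage: 'italian',
  components: {
    tokenizer: {
      stemmingFn: stemmer, // the stemmer runs as part of tokenization
    },
  },
});
```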
Of course, writing stemmers is not easy, because every single language has different rules. You can't stem Italian words like English words, for example; we will see more examples as we go. So "lucky" gets stemmed to "luck", but the same rules can't apply to Italian, for example, or German, or Russian, or, I don't know, Swedish, Turkish, et cetera. So we relied on Snowball. Snowball is a project created by Martin Porter, the author of the Porter stemmer, which is possibly one of the most beautiful stemming algorithms out there. Super brilliant, and it's totally open source.
Not only is the stemming algorithm itself, which is written in C, able to be compiled down to JavaScript or ported to Golang, Rust, wherever, but Snowball also gives you an idea of how to create your own stemmer. In the example you can see here, we have, say, step zero: search for the longest among a set of suffixes, and if we find that suffix, we just remove it. Then the next step: search for the longest amongst the following suffixes, and whenever we find one of those suffixes, we perform a given operation, following the algorithm description. So it's really easy and convenient to follow these instructions to create more and more stemmers.
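Those step-by-step rules translate almost directly into code. Here's a toy Snowball-style step, illustrative only, not a real Snowball stemmer: find the longest matching suffix from a list, then apply the operation the rule prescribes.

```typescript
// A toy Snowball-style stemming step: find the longest matching suffix,
// then apply the operation the algorithm prescribes for it.
// Illustrative only: real Snowball stemmers have many more rules and regions.
const rules: Array<[suffix: string, replacement: string]> = [
  ['ational', 'ate'], // relational -> relate
  ['iveness', 'ive'], // effectiveness -> effective
  ['fulness', 'ful'], // hopefulness -> hopeful
  ['ing', ''],        // consigning -> consign
  ['ed', ''],         // consigned -> consign
];

function applyStep(word: string): string {
  // Longest suffix first, as the algorithm description requires.
  const sorted = [...rules].sort((a, b) => b[0].length - a[0].length);
  for (const [suffix, replacement] of sorted) {
    if (word.endsWith(suffix)) {
      return word.slice(0, word.length - suffix.length) + replacement;
    }
  }
  return word; // no suffix matched: leave the word untouched
}

console.log(applyStep('consigning')); // "consign"
```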
They are battle tested, accurate, and also widely used inside other projects. I'm not sure whether Elasticsearch, for example, uses this exact stemming algorithm, but I wouldn't be surprised; let's say I know that a lot of search engines out there are using the exact same algorithm, so the results are pretty accurate in every competitor, let's say, that we can find for Lyra. Just to give an example, this is how the stemming algorithm works in English: we have "consign", which gets stemmed to "consign"; then we have the past form, so "consigned" gets stemmed to "consign" again; and "consigning" gets stemmed, again, to "consign". So this is how it works. And of course I'm Italian, so I could write an Italian stemmer. And these are the tests,
for example, that I had to run. I know nothing about German (I'd love to, but I don't know how to speak or write it), yet following the Porter stemmer algorithm description, it was possible to do it. And of course, if you want to have fun, you can
create your own stemming algorithm. In this example, a stemming function is nothing more than a function that is given a word and is expected to return a new word. Here we are just appending "ish" to the word. So if you really want to have fun, or you don't like the stemming algorithm, you can bring your own, which is really, really convenient, because there are many out there, and you can just import a library and use it as you prefer.
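Here's what that looks like, reconstructed from the slide; the `stemmingFn` hook name is my recollection of the API:

```typescript
// Bring-your-own stemmer: any (word: string) => string function will do.
// Option names reconstructed from the talk; check the docs for your version.
import { create } from '@lyrasearch/lyra';

const db = create({
  schema: { quote: 'string' },
  components: {
    tokenizer: {
      // A joke stemmer that appends "ish" to every word.
      stemmingFn: (word: string): string => `${word}ish`,
    },
  },
});
```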
We also did the same for custom stopwords. For example, common stopwords are words like "a", "the", "me", "you"; these are all stopwords that don't carry a lot of meaning for the search, if we think of the overall meaning of the search query and its results. So basically, given the language that you use (in this case we don't specify one, so it's English by default), we give you the list of English stopwords, and you can filter them out, you can append new ones, or you can bring your own stopwords. It's all up to you, so it's highly customizable.
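A sketch of that customization; I don't remember the exact option shape, so treat `customStopWords` as an approximation. The idea is that you receive the default English list and return the list you actually want:

```typescript
// Customizing stopwords: extend, filter, or replace the default list.
// The option name/shape is approximate; check your Lyra version's docs.
import { create } from '@lyrasearch/lyra';

const db = create({
  schema: { quote: 'string' },
  components: {
    tokenizer: {
      // Receive the default English stopwords, return the list you want.
      customStopWords: (defaults: string[]) => [...defaults, 'foo', 'bar'],
    },
  },
});
```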
Lyra had some project goals, of course: we wanted it to run on any JavaScript runtime; we wanted it to be as small as possible, as fast as possible, and easy to maintain and deploy. And I gotta say, we've had some great achievements. It works on every single JavaScript runtime. It has a small, modular core; as you can see, you can always interact with the core, so that, for example, through our hook system you can plug into all the processes and customize the experience as you wish. It's pretty fast, it can be deployed literally everywhere, it serializes data in different formats (we will see what that means in just a second), and it has a powerful plugin system. Speaking of which: at a certain point you have the data, and you may want to persist it somewhere, so that you don't have to index everything from scratch every single time.
So we created a plugin called plugin-data-persistence. It's an official plugin, and it basically allows you to export the index from one Lyra instance and re-import it somewhere else. And this is pretty important; let me show you why. Let's say we have this Lyra instance, we have a schema, and we insert data into the original instance. Then we import the persistToFile function, which is runtime-specific: as of now, this only works on Bun and Node.js, and if anyone wants to implement it for Deno, I'd be super happy to help, of course. persistToFile will return the file path where the data has been persisted; it's an absolute path. You pass the original instance we just created (this one, for reference), you choose the format (binary, so MessagePack, by default, but you can also choose DPack or JSON serialization), and an output file, in this case quotes.msp, which is MessagePack, of course. So we are basically taking the Lyra index, serializing it, and saving it to disk in a binary format. Then we can use restoreFromFile from a totally different machine or service, and we read it into memory. The restoredInstance constant, in this case, will be an in-memory index for Lyra. As you can see, we choose the file quotes.msp, the one we just created, and we can now immediately perform searches on the restored database.
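Reconstructed from the slides, the round trip looks roughly like this (plugin and function names as I recall them; signatures may differ by version):

```typescript
// Machine A: build the index and persist it to disk as MessagePack.
// Names reconstructed from the talk; signatures may differ by version.
import { create, insert, search } from '@lyrasearch/lyra';
import { persistToFile, restoreFromFile } from '@lyrasearch/plugin-data-persistence';

const db = create({ schema: { author: 'string', quote: 'string' } });
insert(db, { author: 'Seneca', quote: 'Luck is what happens when preparation meets opportunity.' });

// 'binary' (MessagePack) is the default; 'dpack' and 'json' also work.
const path = persistToFile(db, 'binary', 'quotes.msp'); // returns an absolute path

// Machine B: restore the serialized index straight into memory and search it.
const restoredInstance = restoreFromFile('binary', 'quotes.msp');
const results = search(restoredInstance, { term: 'luck' });
```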
That brings us to a lot of possible target architectures; let me give you a couple of examples. We have the offline search architecture, for example. Let's say we have a mobile app in React Native, which is fully supported by Lyra. We perform searches, and whenever the connection to the server cannot be established, we can rely on a local backup of the data. So we fall back to the local database which, let's say, every five minutes asks the server for new data; if there's new data, we serialize it and send it to the local database in our application. And whenever the Internet connection fails, we will always be able to fall back to the in-memory database, backed by, I don't know, SQLite or any database you like to use in your mobile applications.
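A minimal sketch of that fallback logic, with a hypothetical server URL and endpoint:

```typescript
// Offline-first search: try the server, fall back to a local Lyra index.
// The endpoint and URL are hypothetical placeholders for this sketch.
import { search } from '@lyrasearch/lyra';

const SERVER_URL = 'https://api.example.com'; // hypothetical

type LyraDb = Parameters<typeof search>[0];

async function searchWithFallback(localDb: LyraDb, term: string) {
  try {
    const response = await fetch(`${SERVER_URL}/search?term=${encodeURIComponent(term)}`);
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json();
  } catch {
    // No connection (or a server error): serve results from the local index.
    return search(localDb, { term });
  }
}
```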
But that's not all. We may also want to have a kind of CI process for Lyra, so that you rebuild your database every five minutes, let's say, or every minute, or every three minutes, you choose, and you deposit your serialized index on S3, for example. That triggers a Simple Notification Service (SNS) message, which will deploy some Lambdas containing the in-memory index, so that you can query the data directly on AWS Lambda. And every time you put a new index into S3, you will basically be able to redeploy the Lambdas and perform search operations on the new data. With that target architecture, for example, you can forget about cluster management, deployments, and data consistency, because it's all managed by AWS in that case; you only have to take care of performing search operations.
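A sketch of what such a Lambda could look like, assuming the same data-persistence plugin as before (restoring from bytes fetched from S3; bucket and key names are hypothetical):

```typescript
// Sketch of an AWS Lambda serving search from a Lyra index stored on S3.
// Bucket/key names are hypothetical; `restore` mirrors the data-persistence
// plugin API as I recall it.
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { search } from '@lyrasearch/lyra';
import { restore } from '@lyrasearch/plugin-data-persistence';

const s3 = new S3Client({});
let db: ReturnType<typeof restore> | undefined; // cached across warm invocations

export async function handler(event: { term: string }) {
  if (!db) {
    const object = await s3.send(
      new GetObjectCommand({ Bucket: 'my-search-indexes', Key: 'index.msp' })
    );
    const bytes = await object.Body!.transformToByteArray();
    db = restore('binary', Buffer.from(bytes));
  }
  return search(db, { term: event.term });
}
```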
Or, if you're lazy like me, you can deploy everywhere using Nebula. I couldn't prepare a real live demo, but let me show you what that means. So, this is an example. We installed Nebula, which is the build system for Lyra. It's the official one; it's still in beta, but it's working pretty well. As you can see, we basically install it from npm. And when we look inside our folder right now, in this demo, we have two files: data.json and lyra.yml.
Let's see what's inside those files. If we cat lyra.yml, we will see that we have a schema, which is a normal schema definition for Lyra (we already saw that); a sharding option, automatic sharding or no sharding at all, that's up to you; and an output file, in this case bundle.js, so it will generate a Lyra application containing the data inside the bundle.js file. The kind of data that we have in this case is of type JSON, and it comes from the data.json file.
You could also use type JavaScript and, as a source, use, let's say, foo.js, which exports a default asynchronous function, so that you can call a database and get the data from there; you can interact with a database, but that's up to you. Let's use JSON, it's easier. And the target in this case is Cloudflare Workers.
We can also configure the Cloudflare Worker name, in this case "pokedex", because we are going to deploy a Pokédex, and whether we want to run tests, true or false. In this case, we want to run tests.
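Putting those fields together, the file we just walked through would look something like this. This is a reconstruction from what's on screen, and the exact keys of the Nebula config may differ:

```yaml
# Reconstructed lyra.yml from the demo; exact key names may differ.
schema:
  name: string
  description: string
sharding: auto          # or disable sharding entirely
output: bundle.js       # the generated Lyra application bundle
data:
  type: json            # could also be "javascript" with a source like foo.js
  source: ./data.json
target:
  platform: cloudflare-workers
  name: pokedex         # the Cloudflare Worker name
runTests: true
```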
If we look at the data we have inside data.json, as you can see here, it follows the schema definition. So we have a lot of Pokémon, and that's really it. We can now run nebula
bundle, or nebula d, which stands for deploy, so it will bundle and deploy. And as you can see, in just five seconds we were able to deploy everything to Cloudflare Workers. So if we make a search with cURL for "pika", for example, we get a response, and we are running on an edge network, in this case Cloudflare Workers. So congratulations: in about five seconds, we just deployed the very first full-text search engine capable of running on an edge network. Lyra is free and open source, and I will be there if any one of you needs help setting it up or
creating a target architecture; that's something I can help you with. So if you need anything, I'd like to hear from you. I'd like to hear your feedback on Lyra. If you have any kind of question, please feel free to reach out to me directly at @MicheleRivaCode on Twitter. But we also have a Slack channel where you can find help from the community, from me, and from my colleagues working on Lyra; so please join lyrasearch.slack.com. This is where we make Lyra happen. And before I end
my talk, I'd like to thank NearForm. We are a professional services company specializing in Node.js, DevOps, and React Native. We maintain a lot of open source software; we are responsible for the maintenance of almost 8% of all the npm modules used globally, which get downloaded around 1 billion times per month, which is totally crazy. And we are hiring, worldwide and remote, so if you are interested, please feel free to reach out. I'd like to thank NearForm for letting me work on Lyra and present it to you today. So thank you so much, thank you all for following this talk. This is where you can find me, mainly on Twitter, because that's where I live the most. Thanks again, it has been a pleasure. I hope to see you all very, very soon.