Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, my name is Jesus Espino. I'm a software engineer at Mattermost, and I'm going to talk about struct embedding, instrumentation and code generation.
Well, what is Mattermost? Mattermost is a communication platform. We write the backend in Go, we write the frontend in TypeScript and React, and we are focused on security and performance. We are an open source project with an open core model: we have a self-hosted version, we provide features to deploy on the cloud, like Kubernetes operators and things like that, and of course we have our own SaaS service.
Well, what are the main pieces, and why am I talking to you about Mattermost? Because I'm going to explain something that we did here at Mattermost. This is the Mattermost architecture. We have the client, a React/TypeScript application, that calls the API and the WebSocket API. The API and the WebSocket API call the app layer, which is where our business logic lives. Our app layer leverages a set of services to provide the final functionality. One of these services, for example, is the file service, which allows us to store files in S3 or in the local file system, or the email service, which allows us to send email notifications and all that stuff.
The important piece here is the store. The store service is an abstraction that provides all the storage mechanisms: database access, database storage, database queries, all that stuff is inside the store. The app layer doesn't know anything about SQL, doesn't know anything about how the data is actually stored. The app layer only knows that the store is going to take care of the entities, store them, and return them whenever they are needed.
Well, what does our store look like? Our store is an interface, a huge interface that has a lot of superstores. Each superstore has a single responsibility over a certain part of the data. For example, the team store takes care of the team model: how it's stored in the database, how we query the teams, all that stuff. Then we have the user store that does the same for the users, the bot store for the bots, and so on. This is how it looks in the code. We have the store interface, which has a set of methods that each return a superstore interface. In this case, the team superstore interface has a set of methods related to the team model.
If we want to implement this interface, we have to implement each of these methods. So our SQL store, the store implementation that we have, implements each of these methods by accessing the database using SQL. But we would be able to build a completely different store using MongoDB or any other database.
What's the problem that we are trying to solve? We want to add caches to our system, to our store, but we don't want to mix responsibilities in the same code. We want a very clear separation of concerns, and we decided to build something completely separated. We don't want to see cache checking, cache invalidation and cache insertion in our SQL-related code. We want our SQL code to generate the queries and query the database, and we want the cache logic in another place: where you insert things in the cache, retrieve things from the cache, invalidate the cache. All that logic should be separated.
Well, our initial approach was to use a well-known pattern, the middleware pattern. We created a new set of interfaces and structs to implement this pattern and, well, the result wasn't easy to understand. This is how it looked. We had the SQL store that implements the store interface that we already saw. The SQL store also had to implement a new layered store supplier interface. The layered store supplier has a SetChainNext method that sets the next element in the chain of middlewares, and a Next method that provides the next middleware, which is responsible for the rest of the logic. The cache layer implements only the layered store supplier. And then we have the layered store, another struct that holds the database store (the SQL store), the cache layer, and any other layer that you want to add. Then we have a set of superstores that are either overridden and delegated to the layered store, or delegated directly to the SQL store. This approach worked, but it's not easy to understand and not easy to reason about. Well, this is how it looks in the code. I don't want to explain it much; it's more or less the same as what we saw in the previous slide.
What went well and what didn't work? What went well is the middleware pattern itself. It's something well known, something that is kind of easy to think about, because you already know the concept of a middleware and how it's expected to work. So from the concept perspective, it was really easy to understand. Also, we had the opportunity to provide extra information without affecting the layers beneath. For example, we were able to add hints to the cache layer, allowing the app layer to pass certain extra information, certain context, to decide if we want to cache something or not, or if we want to invalidate the cache or not. This is because in the app layer we have way more context and we have the big picture of what we want to do with the data. In the store we only know that we are adding a new team, we are removing a team, we are adding people to certain teams, or something like that. But we don't have the big picture.
We don't know why we are adding that, and we don't know if, right before that, we added something else and we don't need to cache anything, or we don't need to invalidate the cache, or whatever. So that was an interesting thing to have, but we weren't really using it, so it was a great feature that we weren't taking advantage of. What didn't work well: it was a bit hard to understand and follow all the code there, and at the same time it was a bit hard to add new caches, because you had to modify different places. We had to modify the cache layer, we had to modify the SQL layer, and we had to modify other parts. The layered store was complicated to add things to, and there was a lot of code in a lot of different places, so it was really error prone and wasn't the best approach in terms of maintenance.
Well, our current approach: what we use now is struct embedding. Instead of creating all this middleware logic with all these layers and all that stuff, we take advantage of the struct embedding feature of Go to create these layers: a store is going to be embedded in another store. So we can create a layer that embeds the store, and it is automatically a store because it's embedding a store. This is a great feature of the language, and you can build these kinds of layers really easily. Well, we rely on the existing interface, this store interface. We removed the layered store, the layered suppliers, all that stuff is gone, and we rely only on the existing store interface.
We created this local cache store that embeds the other store, the SQL store, in it, and we override the methods that we need; everything else is transparent. Well, this is how it looks, way simpler, right? The SQL store implements the store. We write all the methods, we write all the SQL code, we write a lot of stuff for the SQL store, but that is going to be needed anyway. For the cache layer, we embed the store, so automatically the cache layer, without any method in it, implements the store. We only need to override the places where the cache needs to take some action. For example, whenever I add a new post, I'm going to cache certain information. Or whenever I add a new user, I'm going to cache certain information. Whenever I get some user, I'm going to cache that information. Whenever I modify the user, I have to invalidate that cache. So we only need to find the places where we need to modify and update our cache, and override them. And that is what we do.
We create this local cache store that embeds a store. Any store can be embedded, but let's think about it as the SQL store. So we have the local cache store and we embed the SQL store in it. Then we have a set of superstores that are the specific cache implementations. In this case it's the team superstore. We override the method that gets the team store; in this case it returns its own implementation of the team superstore. And how do we implement this superstore? The superstore embeds, again, a team store, any team store, but we are going to think about it as the SQL team store. Then we add the root store, a private attribute that we are going to use to share some data and some methods, and it's going to be just the local cache store instance.
Finally, we need to add methods to this superstore. This method, for example, is the Get method. The Get method checks the cache and returns the cached value if there's a hit. If there's not a hit, it takes the embedded store and uses whatever is there; it uses the store underneath, so the SQL store. It gets the data from the SQL store, checks if there's any error, and if there's no error, it caches that information and returns the result.
Well, what went well and what didn't work? One of the things that went really well was the simplicity of the solution. This is very simple, very straightforward, a very clear pattern. It's really easy to understand, really easy to think about, and really easy to think about other things we can build with it. Also, adding new caches was super straightforward: it was overriding methods and just delegating everything else to the store underneath. You don't need to think about adding code in three different places. You only have to add code in the cache layer, and that's all you need to do.
It's a really general approach, so it can easily be reused for other things, as we are going to see soon. What didn't work well: there are some subtleties around struct embedding. Struct embedding is struct embedding; it is not inheritance. You are embedding a struct in another struct. Think about it as something you could do manually: you have a struct and you embed something inside that struct. Go provides some syntax sugar to make struct embedding more comfortable, and allows you to call methods from the embedded struct on the parent struct. The parent struct can override those methods, can define those methods, and if the methods are defined, they are called on the parent; if they are not defined, they are called on the embedded struct.
But there is the problem. Whenever you call a struct that embeds another struct, if the method is not defined, it's going to call the underlying method, the embedded struct's method. And once you call the embedded struct's method, it doesn't know anything about the parent. The context of that method is the embedded object, so there's no information about the parent at all. So if that method calls another method of the struct, it doesn't matter if you override that method on the parent, because you are in the context of the embedded struct; you are always going to call the methods of the embedded struct. That means this can lead to some subtle errors that are really hard to track down and really hard to find. But if you really know what struct embedding is, it's really easy to avoid them. So one of the problems is these subtle errors that can happen. You have to be sure that your team knows what struct embedding means, and for sure knows that struct embedding is not inheritance.
Okay. Another problem is that the interface has to be homogeneous. That means some flexibility is removed: it doesn't allow you to add these kinds of hints, or specific parameters for certain layers. All that is the price that you have to pay to have a homogeneous interface that you can wrap in layers.
But this was just the first solution. We built this for the cache, and it went really well, actually. But we started thinking: well, we have this new layers architecture that we can leverage for other things. For example, we can leverage it for instrumentation, to add instrumentation to our store without modifying anything in the store, just having the instrumentation in a well-defined layer and separating all that logic from the rest of the store. It's a great separation of concerns. You can have logging, if you want to log all the actions that you are doing, or log specific actions in specific places. Auditing, for example, if you want to audit when something gets accessed, removed, modified or something like that.
Well, something that is really interesting is storage and query delegation. For example, you have your SQL store that stores things in SQL. SQL is great, but it's not the best option for every single problem out there. Sometimes you want to store unstructured data; sometimes you want to store data where losing it over time is not so important, or where it's not necessary to be 100% sure that the data is stored and 100% consistent. For example, some temporary data related to the status of the user, whether the user is typing something, or the last channel that the user viewed, things like that. That information is important for our users, but it's not critical. So you can leverage some in-memory database, you can leverage a search-specific engine like Elasticsearch or Bleve, or you can store unstructured data in CouchDB or MongoDB, for example, which can give you certain performance improvements or certain extra features for certain usage patterns. Well, we can also add extra validation: if we want to be sure that certain things stay consistent in the database, we can add extra validation in a layer. We can add extra error handling. For example, if you have an unreliable network connection to your database, there can be some timeouts or some network connection problems, and maybe you want to handle that at the store level and abstract the app layer from all the logic needed to retry on a database timeout, or retry certain situations under certain errors; or you want to track certain kinds of errors and store that information in Sentry or something like that.
We started with instrumentation.
We added the timer layer. The timer layer is just a layer that wraps every single method in the store, adds a timer, and calculates how much time it takes to execute the query in the store. Yeah, it wraps everything with an almost identical method. So this is a lot of code, and a very annoying kind of code to write, and then the maintenance of that is really boring, error prone and complicated. So: generators to the rescue. We were writing the same thing, one method after another, a lot of times, without a good reason. Well, Go provides generators, and we are going to use them to build this timer layer.
This is an example of a method wrapped in the timer layer, in this case the Save method of the audit store. We start the timer, we execute the underlying store call, and we calculate the elapsed time, the time that has been spent in that method. Then, if metrics are enabled, we check whether the query succeeded, and whether it succeeded or not, we store that information in Prometheus.
This is really great because it helps a lot to investigate bottlenecks. We have all the information on how much time it takes to execute every method in our store. This is a histogram, so we have the average time, and we have information like how many times each method has been called, so we know how much time it takes in a cumulative way. So we can decide: okay, this method is called just a few times, but it's taking a lot of time each time; that is something that we have to handle. But at the same time you can think: oh, this method is really fast, but it's getting called millions of times, so if you are able to improve the performance there, you are getting a very important performance improvement. Sometimes the time taken by each call of a method is not that important, and what matters is the total time, not the time per call. All this information is in Grafana and we can explore it, and we can set alerts on it. So we can decide, for example, that if a method's execution time increases by 10% within a certain period, we trigger an alarm and send an email saying: okay, this method got degraded on that date, maybe because you upgraded to a new version. Maybe that degradation is acceptable, or is explained by some necessary changes in the code, but you don't degrade it without noticing.
The other thing that we did is adding OpenTracing. OpenTracing is great and gives you a lot of information about what is going on in your system. But adding OpenTracing means that you have to add a lot of small details here and there in your code, and that was something we didn't want to do, because we don't want to contaminate all our methods with OpenTracing information. So what did we do? We created a layer that is almost the same as the timer layer, but for OpenTracing. We also replicated that in other places. We use OpenTracing in the API, using the middleware of the API, so that was already covered. And then we had to add OpenTracing to the app layer. The app layer is a big structure that has a lot of methods; well, that is the way we organize those methods. So what we did is automatically generate an interface that matches that structure, and with that interface we created the layer for the app, using code generation again. So now, whenever we change something in the app layer, or whenever we change something in the store, we only have to execute the code generation and it generates all the OpenTracing code for us. We don't have any OpenTracing-related code in the app layer, and we don't have any OpenTracing-related code in the rest of the store. We only have that information in the specific set of auto-generated code. Well, this is how it looks in the code: in the OpenTracing layer method, we set the OpenTracing information, we execute the underlying method in the store, we add more information to the OpenTracing span, and that's it.
Okay, the retries: the retry layer in the database.
We want to use the serializable isolation level in the database, and that has a problem. When you use read committed, basically you try to execute the queries and most of the time it's going to work pretty well; when the load is pretty low, it's really hard to even see a transaction failing there. But when you are using the serializable isolation level, the isolation is so high that whenever you run two transactions, and one of them modifies certain data while the other one is querying some part of that modified data, one of them is going to fail. But it's not going to fail in the sense that the query is broken or something like that. The database is just saying: okay, I'm not able to execute this transaction because something was modified in the meantime, so you need to execute the transaction again. That is what a retryable error means in the database. So whenever the database returns a retryable error, it means: retry, it's probably going to work, you only need to retry it. But a transaction is not something the database can re-execute automatically by itself, because you are able to do things between the queries and do calculations inside the transaction, so it's not easy for the database to infer that the transaction is repeatable on its own. You need to repeat the transaction from the outside. Well, because we need to repeat the transactions whenever we receive a retryable error, that was pretty easy to do with a layer. We just automatically generate a layer that catches any retryable error and tries again. Well, this also helps us with deadlocks. Whenever a deadlock happens in the database, one of the transactions is going to succeed and the other is going to get killed with a retryable error. That is something that happens really rarely, but it can happen in very loaded environments. And what was happening before is, well, the store returned an error to the app layer, which returned an error to the API, and probably the client would retry again. Now we retry directly in the SQL store.
This is how we did that. For example, in this case we have the Get method. We enter a loop and try to execute the query. If that works, great. If it doesn't work and it is not a retryable error, I'm going to return the error. But if it is a retryable error, I'm going to try again; I'm going to repeat and repeat until it succeeds or it fails three times. After three times we give up and return an error.
What is really interesting here is that we have the timer layer, we have the OpenTracing layer, we have the retry layer, and all those layers are auto-generated. Everything that we change in the store is going to be automatically up to date with just a make generate. That is awesome. So if you have this kind of code, it's really great to have generators.
And how do we do that? We use the AST to analyze this store interface and all the sub-interfaces, and we build a data struct where we have all the superstores that are defined, all the methods of the superstores, all the parameters of the methods, all the return values of the methods. All that information goes into a new struct, we pass that information to a template, and that template generates the code. We have different templates: we have the same AST code that analyzes the store, and then we use that same structure to populate three different templates, one for the timer layer, one for the OpenTracing layer, and one for the retry layer. Those templates generate a certain amount of code, and on top of that we use the go/format package to reformat it. Why do we use the go/format package? Because we don't want to be super correct when we generate the code. Generating the code is already a complicated task, and generating code that gofmt likes is even harder. So we just delegate that to go/format: we generate the code and reformat it with the go/format package, so the developers are happy and the Go compiler is happy.
This is an example of the timer layer template. As you can see, we range over the superstores, we range over the methods of the superstores, and we generate the functions there. We generate the start := time.Now() and all that stuff. We are generating all the code there.
It's not easy to understand, but once you write it, it keeps working: this has been working really well for a long time with almost no maintenance. Okay, but not everything can be automatically generated, so we had to build something else. I already talked about the storage and query delegation pattern, and in this case we used it to build the search layer. For the searches in Mattermost we use full-text search in the database, but we also support other search mechanisms like Elasticsearch or Bleve. If you want to use Elasticsearch or Bleve, what we do is just add a search layer on top of our SQL database layer, and every search in the store is delegated to Elasticsearch or Bleve. Every action on the store that needs to update the indexes executes an update of the index in Elasticsearch or Bleve. And any time you try to search for something, it's going to hit Elasticsearch or Bleve, but it's not going to hit the database. So you are probably going to have better performance, because you are searching in a search-specific backend; actually we have more features and a better search using these engines than with the database one. And you are going to free some database cycles for other stuff. So that is another interesting thing.
Well, we want to make this transparent to any store user, like the app layer. If the app layer is using the store, it doesn't need to know whether it is using Elasticsearch or Bleve or something like that. It only needs to know that it is searching for users, and if Elasticsearch is enabled, the search is handled by Elasticsearch.
But the app layer doesn't need to know anything. Well, this time we created the layer writing the code by hand, and here is an example. In this case we are talking about the post store. We override the Save method of the post store, and we just save the post using the SQL store underneath. If there is an error, I do nothing; but if there's no error, I'm going to index that post, I'm going to update the index of that post in Elasticsearch or Bleve. If I search for a set of posts, I'm going to check which engines are enabled and try to search in those engines. If one of the engines fails, I'm going to try the next one until I find an engine that works. If none of our Elasticsearch or Bleve engines works, we fall back to the database search. We can disable this fallback, and if we disable it, the search returns an empty list. But if we don't disable the fallback, we just call the underlying SQL store to return the results.
This works well: if you have, for example, downtime in Elasticsearch, you can just use the database search as a fallback.
And this is the final onion. This is how it looks in our system. We have the app calling the store, passing through all the layers down to the SQL store, and going back through the layers again to the app. The SQL store is at the bottom, taking care of all the SQL queries and all that stuff. The retry layer takes care of the retryable errors. The cache layer caches things, invalidates the caches, and takes care of maintaining and using the cache. The search layer takes care of maintaining and using the search indexes in Elasticsearch or Bleve. The timer layer takes care of all the timing, collecting information about the times and sending it to Prometheus. And it's not here, but optionally you can have the OpenTracing layer. The OpenTracing layer is optional because it has an important performance impact, so we can enable and disable it, and usually it's disabled. But if you enable it, it wraps the store entirely and provides that information to the OpenTracing service.
This is how we build the onion. We instantiate the SQL store, we wrap that in the retry layer, we wrap that in the local cache layer, we wrap that in the search layer, and we wrap that in the timer layer. And finally we return that final store. Because everything there implements the store interface, we can just say that they are all stores. The SQL store is a store, the retry layer wrapping a SQL store is a store, and the SQL store wrapped by a retry layer and wrapped by a local cache layer is a store.
We can reorganize all this and change where the layers are. For example, I can move the timer layer right after the SQL store, and that way we measure only the time that the SQL store is taking. If you consider that the cache layer is contaminating the data, because you are interested in how much time the database is taking and you don't care about how much time the store is taking in general, only about the database, you can move the timer layer there. You can even create another timer layer and have both kinds of information, the SQL store timing and the whole store timing. You can play with this concept of everything being a store, to move the layers around and make decisions about how we set up the layers. And disabling any of these layers is just not adding the wrapping. So if you want to enable or disable the search layer, you can just decide by a config setting whether you want to have a search layer or not, and if not, you just don't wrap the store with that layer, and that's it.
Well, there are some drawbacks. As I said already, all the layers have to share the same interface. That is a problem because you don't have enough flexibility to add certain things, like the hints for the cache, without modifying the whole store. You have to modify the whole store interface if you want to add these hints for the cache; and if you want to add other kinds of hints for the search, or other kinds of extra information for OpenTracing or for timing, you have to add more and more information to the store interface. And that is something that doesn't scale well.
So I think this is the price to pay: you have to accept that you have to use the same interface if you want to build this layer-based approach. Probably there are some tricks that you can try, but it's not something that, by design, is going to fit well.
Then the other problem is that embedding is not inheritance. It is not a problem per se, but it's something that can generate problems if people don't understand it well. The team that is touching the store needs to understand that embedding is not inheritance; embedding is struct embedding. So you need to understand well how embedding works, to not end up with weird bugs that are really hard to debug.
Well, some references. If you want to see how we implemented the store, the store layers, the generators and all that stuff, it's publicly available in our mattermost-server repo, in the store directory. If you want to see our old version, with the middlewares and all that stuff, you can check version 5.0; it's a bit old already, but it can be interesting. If you want to know more about struct embedding, there's a really interesting talk from GopherCon UK, and if you want to know more about code generation, there's another really interesting talk from GopherCon UK too. So thank you.