Transcript
This transcript was autogenerated. To make changes, submit a PR.
Greetings everyone. My name is Jayanta, and along with Sapna, both of us will be talking about EpiRust today. So let's briefly understand what this talk is about and what we can expect from it. EpiRust is a scientific computing case study. We'll be speaking about building an epidemic simulator using the Rust language. We'll be speaking about our journey, about how we went ahead and built a simulator which is optimized for performance and can support large scale. We'll also be sharing some of our understanding and insights on how Rust as a language helped us build this specific simulator. What is out of scope for this specific talk: we will not be speaking in much detail about our COVID-19 epidemic model. It deserves its own talk, and this talk is purely going to be technical in nature. That being said, we need to cover some of the basic understanding so as to understand the problem statement. In the agenda, we'll talk about EpiRust in the beginning, then about memory optimization, performance optimization, results, and, towards the end, the insights which we gathered from building this specific tool.
So let's briefly understand what EpiRust is. As it has been described, it is an agent-based, large-scale, open-source epidemic simulator. Let's try and understand each and every term of this. When we say it's an agent-based simulator, what it means is this: we understand that the spread of a disease is a complex phenomenon. All of us have witnessed in the past two years how COVID-19 has been spreading, and it is difficult to understand, as there are numerous factors which have an impact on the spread. What an agent-based model does is take a much more bottom-up approach: we start by creating every individual in the society, or every individual in the system who is of interest to us. So we create every individual person, then we create their homes, then we create their workplaces, and we assign a certain schedule to each of these individuals. We create a virtual life, a virtual society, where these individuals are performing their day-to-day routine. They are interconnecting, exchanging information, meeting each other in the physical world. And all of these interactions can potentially cause the spread of a disease. So that is the basic fundamental idea behind agent-based simulation. On the right we can see a grid which depicts the minimalist model we are using for EpiRust.
Every blue dot in this grid is an agent. This is one of the smaller-scale models, which might be for a few hundred individuals. All of these individuals stay in a certain home area, where they interact with their family members. While going to the office, some of these individuals will take up space in transport areas, where they will be accompanied by other individuals who may be strangers. And just by sitting next to each other, there's a chance that a disease can spread; that is why we have a transport section. Similarly, in the work area people will come to work and interact with their colleagues, and again such interactions might lead to the spread of disease. And once these individuals come back home, it might happen that someone got infected at work; the person is then in contact with the family members and might spread the disease at home as well. So it's a minimalist model which depicts the spread of a disease, in this case specifically COVID-19.
Now, this sounds pretty complex, right? But if you want to take the most simple view of EpiRust, I can easily explain it as two for loops. The outer loop is a clock tick: there is a clock that simulates hours in this virtual society. So what we are doing is, at every clock tick, which is every hour, we iterate over every agent and ask them to perform their corresponding routine. Once this routine is performed, at the end of each action we check if there is a possibility of the spread of a disease. And once we have iterated through all the agents, towards the end we do bookkeeping: we see how many folks got infected in the last hour. Also, if the government wants to intervene, if there are certain threshold conditions which require the government to put a lockdown in place, ask people to wear masks, or start a vaccination drive, such interventions can be modeled as well. So that is the most simplified view of EpiRust.
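As a rough illustration of those two loops, here is a minimal sketch in Rust. The type and method names (`Agent`, `Simulation`, `perform_routine`, and so on) are simplified stand-ins for illustration, not EpiRust's actual API.

```rust
// A minimal sketch of the two-loop structure, with hypothetical types.
struct Agent {
    // disease state, home/work locations, schedule, ...
}

struct Simulation {
    agents: Vec<Agent>,
    hours: u32, // e.g. 45 days * 24 = 1080 ticks
}

impl Simulation {
    fn run(&mut self) {
        // Outer loop: one tick per simulated hour.
        for hour in 0..self.hours {
            // Inner loop: every agent performs its routine for this hour,
            // and we check for possible disease spread after each action.
            for agent in self.agents.iter_mut() {
                agent.perform_routine(hour);
                agent.check_exposure();
            }
            // Bookkeeping at the end of the tick: count new infections and
            // apply interventions (lockdown, masks, vaccination) if needed.
            self.record_statistics(hour);
            self.apply_interventions(hour);
        }
    }

    fn record_statistics(&mut self, _hour: u32) { /* ... */ }
    fn apply_interventions(&mut self, _hour: u32) { /* ... */ }
}

impl Agent {
    fn perform_routine(&mut self, _hour: u32) { /* move between home, transport, work */ }
    fn check_exposure(&mut self) { /* possibly get infected by nearby agents */ }
}
```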
So this is our journey. We started pretty simple. Initially we started with a population of one thousand, where every individual was the same as every other person. Then slowly we moved to 10,000 individuals and started bringing in heterogeneity: heterogeneity in terms of whether these individuals go to work. Not everyone in the society is working; there might be younger kids or elders who stay at home, so there is stratification based on age. Then there are certain individuals who are essential workers, as in, these individuals will still work even if a lockdown is imposed. So we slowly started capturing the complexity from the domain and making our model richer in domain terms. Our next target after finishing 10k was to model Pune city, which is the city in India both of us are from. Since the scale was 3.2 million individuals, scale and performance became an issue pretty quickly for us.
Then once we modeled Pune, we went ahead and started modeling Mumbai. Mumbai is one of the metro cities in India, with a population of 12 million, so roughly four times that of Pune. One of the challenges which we faced in the Mumbai context is modeling commute, because many folks in Mumbai travel a large distance to reach their workplaces and then come back home. The next goal, which is work in progress, is that we want to reach a scale of 100 million. With 100 million we can model our state, Maharashtra, which is one of the largest states in India, and it will as well help us model multiple cities at the same time.
So, having said that, let's briefly understand the technical complexities in the context of EpiRust. EpiRust is compute intensive. What that means is, if we talk about a simulation which runs for 45 days, the outer loop which we saw, the time loop, runs for 1080 ticks, because each tick is one hour: 45 times 24 comes to 1080 ticks. If we start with a population of a thousand individuals, then even for such a small population there would be 7 million behaviors which need to be executed over the course of these 45 days. If we increase the population to 1 million, that number rises from 7 million to 7 billion, and moving one step ahead from 1 million to 100 million, the number of behaviors goes up to 700 billion.
Also, when we talk about scale, if we look carefully at the matrix on the right, the matrix is sparse: most of the places are empty. Which means we will as well need to take care of the memory footprint when we build such a model in memory. And then there are domain complexities. We need to understand the different things which go with COVID-19 specifically, because EpiRust was initially built specifically for COVID-19. We need to understand the way the disease works, we need to understand government interventions and how these interventions impact individuals' lives. And then, adding on to this, there is one more difficulty. The inner for loop which we saw earlier, which is per agent, executes agents one after the other. But when we speak about agent interactions in real life, all of us move at the same time; we do not move one after the other. So such a loop causes a path dependency, where the result depends on the order in which agents are processed. To solve that, we are using a double-buffering algorithm to remove that path dependency from the picture.
So how did we start? As I said earlier, we started with a very simple mechanism. We started with two for loops, and the code was pretty minimal. The grid which we saw was implemented as a 2D matrix. The population of Pune city is 3.2 million, which means that if we look at the number of cells, the number of elements this 2D matrix had to hold was close to 32 million cells. Which means the memory consumption went up to approximately somewhere between 5 to 10 GB, which was pretty big.
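As a rough sketch, assuming a straightforward layout (the actual structs in EpiRust may differ), that dense representation looks something like this:

```rust
// Dense 2D grid: every cell exists whether or not an agent occupies it.
// At Pune scale (~32 million cells) this is what pushed memory usage
// into the 5-10 GB range mentioned above.
struct Citizen {
    id: u64,
    // disease state, schedule, home/work locations, ...
}

struct DenseGrid {
    size: usize,
    cells: Vec<Vec<Option<Citizen>>>, // size x size cells, mostly None
}

impl DenseGrid {
    fn new(size: usize) -> Self {
        let cells: Vec<Vec<Option<Citizen>>> = (0..size)
            .map(|_| (0..size).map(|_| None).collect())
            .collect();
        DenseGrid { size, cells }
    }
}
```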
Keeping our goals in mind, we wanted to model not just Pune city but Mumbai as well, and eventually reach a scale of 100 million, so we faced a challenge in modeling this grid. One solution which we came up with is to represent this 2D grid as a hash map. We created a structure where the key of the hash map is a point and the value is an object of the citizen struct. A point is nothing but an x, y location on the plane, on the grid, and the citizen is the individual agent who occupies that specific cell. Essentially, the number of agents became the number of entries in this hash map. Since it is a hash map, retrieval became much easier: instead of being on the order of n squared, it is now just plain order one. And the memory, which was earlier in the range of 5 to 10 GB, right away came down to a few hundred MB.
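A minimal sketch of that sparse representation, with `Point` and `Citizen` trimmed down for illustration (the real structs carry much more state):

```rust
use std::collections::HashMap;

// Grid cell coordinates used as the hash map key.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Point {
    x: u32,
    y: u32,
}

struct Citizen {
    id: u64,
    // disease state, schedule, home/work locations, ...
}

// Only occupied cells are stored, so the number of entries equals the
// number of agents rather than the number of cells in a dense matrix.
type AgentLocationMap = HashMap<Point, Citizen>;

fn is_cell_occupied(map: &AgentLocationMap, cell: Point) -> bool {
    map.contains_key(&cell) // O(1) lookup instead of scanning a huge matrix
}
```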
We went one step ahead and started looking at which hashing algorithm could be more efficient and give us better mileage. We experimented with hashbrown, we experimented with FxHash, and we experimented with FNV hash; a non-cryptographic hash is fine here because the hash map is not exposed outside. FNV hash gave us the best possible outcome, so we went ahead with FNV hash.
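Swapping the hasher is a small change. A minimal sketch using the `fnv` crate, with a tuple standing in for the grid `Point`:

```rust
// FnvHashMap is std's HashMap with an FNV hasher instead of the default
// SipHash. FNV is non-cryptographic, which is acceptable since the keys
// are never attacker-controlled and the map stays internal.
use fnv::FnvHashMap;

fn main() {
    let mut grid: FnvHashMap<(u32, u32), u64> = FnvHashMap::default();
    grid.insert((10, 20), 42); // cell (10, 20) occupied by citizen id 42
    assert_eq!(grid.get(&(10, 20)), Some(&42));
}
```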
Having said that, I'll hand it over to Sapna to talk about how we optimized for throughput.
Thanks, Jayanta. Now we will see how we optimized for throughput. The first thing we tried was moving to a parallel implementation. As you saw, we are using two buffers. One is the read buffer: when the inner for loop starts running, an agent sees its neighboring agents' positions from the read buffer and updates its own position, according to its routine, into the write buffer. The update of each agent is independent of the other agents' status, so we can easily use data parallelism here. One of the methods was a map-reduce style: we update the agents' statuses individually in parallel, store the results into some temporary data structure, and later serially update the statuses into the write buffer. That is just to avoid collisions in the map, because we are using the buffer as a map. In this case the parallelization wasn't complete: one of the steps we were doing in parallel, and the other was serial. It did give us some performance improvement.
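Here is a sketch of that map-reduce style tick using the Rayon library, with simplified stand-in types; the real EpiRust code organizes this differently.

```rust
use rayon::prelude::*;
use std::collections::HashMap;

type Point = (u32, u32);

#[derive(Clone)]
struct Citizen {
    id: u64,
}

// "Map" step in parallel: each agent decides its next cell from the
// immutable read buffer. "Reduce" step is serial: results are written into
// the new buffer, so the HashMap is never mutated concurrently.
fn simulate_tick(read_buffer: &HashMap<Point, Citizen>) -> HashMap<Point, Citizen> {
    let moves: Vec<(Point, Citizen)> = read_buffer
        .par_iter()
        .map(|(&pos, citizen)| (next_position(pos, citizen), citizen.clone()))
        .collect();

    let mut write_buffer = HashMap::new();
    for (pos, citizen) in moves {
        // Resolve collisions serially: keep the first occupant of a cell.
        write_buffer.entry(pos).or_insert(citizen);
    }
    write_buffer
}

fn next_position(pos: Point, _citizen: &Citizen) -> Point {
    pos // placeholder: the real routine follows the agent's hourly schedule
}
```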
Another way was doing it using parallel iterators, where we use the Rayon library. What we are doing there is using the Rust concurrent data structure DashMap, which allows us to do both stages atomically: check whether a particular entry in the DashMap is empty or not, and if it is empty, update the agent's status there.
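A sketch of this fully parallel variant; it assumes the `dashmap` crate with its `rayon` feature enabled, and again uses simplified stand-in types:

```rust
use dashmap::DashMap;
use rayon::prelude::*;

type Point = (u32, u32);

#[derive(Clone)]
struct Citizen {
    id: u64,
}

// Both stages run in parallel: DashMap's entry API checks whether the
// target cell is free and claims it in one atomic step per entry, so the
// serial reduce step from the map-reduce version disappears.
fn simulate_tick(read_buffer: &DashMap<Point, Citizen>) -> DashMap<Point, Citizen> {
    let write_buffer = DashMap::new();

    read_buffer.par_iter().for_each(|entry| {
        let citizen = entry.value().clone();
        let target = next_position(*entry.key(), &citizen);
        write_buffer.entry(target).or_insert(citizen);
    });

    write_buffer
}

fn next_position(pos: Point, _citizen: &Citizen) -> Point {
    pos // placeholder: the real routine follows the agent's hourly schedule
}
```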
As you can see in this graph, we measured against a 5 million population using the map-reduce and parallel methods. The serial implementation has a throughput of 0.5; that roughly doubles with map-reduce, and the parallel version comes close to it. If we go further, with a higher number of cores, that is 64, the parallel and map-reduce versions perform about the same.
That was about scaling up, where we scaled up the population and did the parallel implementation. But we had another use case: scaling out. As you saw, we used 64 threads and that many CPU cores. If we want to scale further, we won't always have that many resources available on one machine. And from the domain's perspective, the disease specifications or the geography specifications can sometimes be different for smaller units of area; one city might have them differently from another city. So what we did is split those smaller areas into different engines. Each engine has its own specifications and runs its own simulation, and that way you can run it in a distributed mode, which allows us to use compute resources efficiently and solves our domain problem as well. Another thing is that these engines are not completely isolated from each other: agents will be traveling from one city to another, or one ward to another.
We achieve that using Kafka: the data of such commuters is sent over Kafka, and the other engines read it from Kafka. And this has to happen at the right hour. If somebody is traveling at hour 24, they should be arriving at the 24th or 25th hour depending on the condition. For that purpose we have the orchestrator, which does nothing but synchronize the engines to particular points. And this is our distributed architecture.
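As an illustration only, here is roughly what publishing one hour's outgoing commuters from an engine could look like. EpiRust exchanges commuters over Kafka, but the crate (`rdkafka`), topic name, and message shape below are assumptions for the sketch, not the project's actual wire format:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use serde::Serialize;
use std::time::Duration;

// Hypothetical message describing a citizen moving to another engine.
#[derive(Serialize)]
struct Commuter {
    citizen_id: u64,
    destination_engine: String,
}

fn make_producer(brokers: &str) -> FutureProducer {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .create()
        .expect("failed to create Kafka producer")
}

// Publish the commuters leaving this engine at the given simulation hour.
async fn publish_commuters(
    producer: &FutureProducer,
    hour: u32,
    commuters: &[Commuter],
) -> Result<(), Box<dyn std::error::Error>> {
    let key = hour.to_string();
    let payload = serde_json::to_string(commuters)?;
    let record = FutureRecord::to("commuters").key(&key).payload(&payload);
    producer
        .send(record, Duration::from_secs(5))
        .await
        .map_err(|(err, _msg)| err)?;
    Ok(())
}
```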
Now we will see how this helps us optimize throughput further. In the first graph you can see the numbers for Mumbai and Pune against the serial, parallel, and distributed setups. You can see how Mumbai and Pune do with the serial implementation; when we move to parallel, the throughput increases to almost twice that, and if we go to the distributed mode it is again about twice the parallel implementation.
That is one way of scaling out. Another aspect is, as we already saw, Mumbai has a population of 12 million, and the way we divided it in the distributed mode was a ward-wise distribution, where each engine had a population of around half a million and there were 24 wards. But if we go a little further, reduce the population per engine, and increase the number of engines, which is what we tried with 100k population per engine and 100 engines, that again sums up to 10 million, which is near the 12 million. The throughput is actually six times better compared to the Mumbai ward-wise throughput. That's where this distributed setup helps us a lot.
Now, as we saw, we changed from 24 engines to 100, but it is not possible to manually start up a hundred engines, then start up Kafka, and do all of this by hand. We need some automation here and some infrastructure which can be used at this scale. So for that we added the cloud: we migrated to the cloud and added cloud support to our application. First we containerized EpiRust, and then we used Kubernetes for container management at scale. Basically we are using Kubernetes jobs, which run the engines and the orchestrator, and Kafka is again deployed on the cloud. We are using a Helm chart to package the application, so anybody can change the config values in the Helm chart according to their use case: which way they want to run it, serial, parallel, or distributed, how many engines, et cetera. After the configuration, with a single command they will be able to start our application on the cloud and get the output data. Again, if you have an application running at scale, there are many issues you need to debug for development purposes and to improve performance, and metrics are needed for that. We again take advantage of some open-source tools like the ELK stack, Prometheus, and Grafana for logging and monitoring, which helped us with debugging, logging, and getting some automated alerts.
Now we will move to the next part, that is, Rust features. As you know, our base language is Rust, and all the code we saw till now has been written in Rust. Now we will see how Rust helped us achieve our goals. As you already know, Rust is close to the metal, and it gives performance pretty much similar to the C language, so that helped us a lot. There is no runtime and no garbage collection, which gives us the performance. The performance is similar to C, but C and C++ have many memory issues that Rust already takes care of, and that's why it saved us quite a lot of development time. With fearless concurrency, we could easily go with the parallel implementation, and most of the errors surfaced at compilation time. Whatever issues we got later on, none of them were related to memory or anything language-specific. Apart from that, as we already saw, the performance and memory management help our productivity. The overall ecosystem also helps us: the documentation and the way it's written, the various crates for different use cases, and Cargo as the build tool with its simplicity. All of this helps productivity. Most of these features, like converting to a parallel implementation using the Rayon library, multi-threading using Tokio, et cetera, all of this helps us.
This is the team which worked on EpiRust, and that's it.