Transcript
Welcome to my session on accelerating the AI lifecycle.
Today, I will talk about how we at Meta work towards making the AI lifecycle,
essentially the research-to-production cycle, faster for scientists and
engineers to iterate and deliver their work.
But first, who am I?
I'm Pari Gupta, and I'm a Senior Production Engineer at Meta.
I work on the Python Language Foundation team, which manages the entire
Python ecosystem within the company.
For the longest time, I worked in the interactive computing team
focused on Bento, Meta's internal Jupyter notebook distribution.
Outside of Meta, I like to paint and dance.
Before diving into all the exciting and creative improvements that
we've made so far, let's go back to basics and start with the AI lifecycle.
AI is an iterative process.
When I first started out, I thought it was linear.
You collect the data, you train a fancy AI model, and do the testing.
Then you deploy that model in production, and then you're magically done
and can reap the fruits of the AI system.
Easy and dreamy, right?
What a noob I was.
I realized very soon that the AI lifecycle is more of a cycle with a lot
of back and forth than a linear graph.
Here's an oversimplified version of what an AI lifecycle looks like.
It contains different stages such as product scoping, data engineering,
actual model development, deployment of that developed model,
monitoring of how it is performing,
and finally, business analysis and insights.
You'll notice a lot of back and forth arrows between different stages.
These indicate that the lifecycle is not a steady loop, but
involves a lot of revisions and iterations for better results.
Out of all of these stages, the data engineering and model development
components together are called ML prototyping.
This phase is very core to the cycle, and its effectiveness can really
speed up the overall delivery of AI.
The most commonly used ML prototyping tools are Jupyter
Notebooks and Conda environments.
Let's dive into them.
Jupyter Notebook is an open source, web based, interactive computing platform.
Notebooks allow you to create and share documents that contain live
code, equations, visualizations, narrative text, and images.
It's widely used in academia and industry alike.
One of its main benefits can be seen in data science and engineering.
Notebooks can help you massage the data into the format that's
expected by your model, develop data features, do scientific
computations, and even visualize your data to derive insights from it.
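To make that concrete, here's a minimal sketch of the kind of cell a notebook might hold for this, using the open source pandas and matplotlib libraries on a hypothetical events.csv file (none of this is Meta-internal code):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load raw data; events.csv is a made-up file used only for illustration
    df = pd.read_csv("events.csv", parse_dates=["timestamp"])

    # Massage the data into the shape a model expects:
    # drop incomplete rows and derive a simple feature
    df = df.dropna(subset=["user_id", "duration_ms"])
    df["duration_s"] = df["duration_ms"] / 1000.0

    # Visualize it right in the notebook to derive insights
    df.groupby(df["timestamp"].dt.date)["duration_s"].mean().plot(kind="line")
    plt.title("Average session duration per day")
    plt.show()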
Another is ML modeling and testing.
To prove a hypothesis, a scientist would develop an experiment, code a hacky
prototype, and run it on their notebook iteratively to train and test their model.
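A hacky prototype of that kind, purely as an illustration with the open source scikit-learn library on synthetic data rather than any real experiment, might look like this; you tweak a cell like it and re-run it over and over:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for features produced by the data engineering stage
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Train a quick baseline, test it, then adjust and re-run the cell to iterate
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))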
Jupyter notebooks have several other benefits.
They're language agnostic, which means that you can easily run not just
Python, but any other language such as R, Hack, JavaScript, etc.
Lastly, they are backend agnostic.
This means that you can run your code locally or on a remote server or a
GPU or a CPU cluster fairly easily.
Conda.
Conda is a popular package manager used for managing software environments
and dependencies in the data science and scientific computing world.
It was developed as part of the Anaconda distribution and supports Python
and other programming languages used for major scientific computing tooling.
Conda allows its users to create isolated environments where different versions
of packages can be installed and used without interfering with one another
or with the base system environment.
This makes it very easy to manage dependencies and avoid conflicts
between different packages and versions.
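As a rough sketch of what that looks like with open source Conda (the environment and package names here are just examples):

    # Create an isolated environment with specific Python and package versions
    conda create -n experiment-a python=3.10 numpy=1.24 pandas

    # Switch into it without touching the base system environment
    conda activate experiment-a

    # A second environment can hold conflicting versions side by side
    conda create -n experiment-b python=3.11 numpy=1.26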
Now I want to share some of the ideas that we at Meta have explored
and executed to make ML prototyping faster within the company, so that our
researchers can deploy their models
and make them work for the fast growing world of artificial intelligence.
Bento.
Bento is the internal distribution of Jupyter notebooks.
It is particularly optimized for Meta's needs to make it compatible
with the internal codebase.
It allows users to execute their code on different servers within Meta.
Now, let's see how we've improved this interactive platform.
Have you ever wanted to rerun and visualize
data on a regular basis?
Maybe make reports or see how the model is performing to make informed decisions?
Inspired by Papermill's approach, Bento introduced powerful
tooling to schedule notebooks to run code at any cadence, on any
server, in a privacy aware manner.
This helps us ensure that all the permissions stay intact, for users' ease.
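For reference, the open source Papermill pattern that inspired this looks roughly like the following; Bento's scheduler and its privacy checks are separate internal systems, so this is only an analogy, and the notebook and parameter names are made up:

    import papermill as pm

    # Execute a notebook with parameters; an external scheduler can invoke this at any cadence
    pm.execute_notebook(
        "daily_report.ipynb",          # input notebook
        "daily_report_output.ipynb",   # executed copy with outputs saved
        parameters={"report_date": "2024-01-01"},
    )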
Next is persistent sessions.
Ever come across a situation where you wanted to run code remotely
and lost internet, the wifi got disconnected, or something as simple
as the laptop going to sleep? You lose all the long running executions and
you have to start all over again.
It's sad, right?
We implemented persistent sessions and sophisticated notebook recovery
mechanisms where users can access their data and computations in any Jupyter
session across different servers.
You can simply kick off a long running execution in your notebook and come
back to the Jupyter session at a later point in time, essentially resuming
the session with all the data and computations already available.
Another idea that made a difference in developer productivity and improved
ML prototyping was cross platform debugging between VS Code at
Meta and Bento. Here, one click on any Python file would push you into
a debugging session in a notebook, to iterate on your code with
all the dependencies that you may need.
Not only this, we also introduced notebooks within VS Code itself,
which makes it very easy to import and export in and out of the IDE at any time.
There are several other innovations that we've made so far that have
influenced ML development and really delivered some quantifiable speedups:
Integrations to enable data privacy and security reviews, support for serverless
servers, enhanced SQL cell support, ability to use multi language kernels,
auto suggestions using internally developed Gen AI agents, and so much more.
Conda is a relatively newer introduction to our developer tooling.
We want to bring most of the open source experience internally
in a controlled manner.
This enables faster onboarding of researchers to the ML development cycle,
lowering the learning curve and hence making the product delivery faster.
Researchers can play with their GitHub projects more freely while they are
experimenting with their ML models.
One thing to note here is that Meta's codebase is a large monorepo.
This means that there is one version of each dependency across all the
projects that exist within Meta.
This introduces a lot of difficulties when it comes to quicker
experimentation and version switching.
Conda, on the other end of the spectrum, provides full version
control of the dependencies that a user may want to experiment with before
marrying themselves to one specific version.
Conda at Meta is packaged into a portable, self contained virtual environment,
which can be accessed from anywhere the user desires.
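The closest open source analogue to this packaging step is conda-pack; the commands below are only a sketch of the idea, not Meta's internal tooling, and the environment name is hypothetical:

    # Requires the conda-pack package (conda install conda-pack)
    # Archive the environment into a single relocatable file
    conda pack -n experiment-a -o experiment-a.tar.gz

    # On any other machine, unpack and activate it without reinstalling packages
    mkdir -p ~/envs/experiment-a
    tar -xzf experiment-a.tar.gz -C ~/envs/experiment-a
    source ~/envs/experiment-a/bin/activate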
And finally, as we plug Conda environments into specific MLE use cases, we also
enable better telemetry, accountability, and tracking in terms of errors,
package usage, and vulnerability checks, to ensure that researchers can work
without worrying about any breaches.
The last interesting idea that I want to share today is the power
of combining these two tools.
Basically, I mean running Conda as a Jupyter kernel in the
back end of a notebook.
This in particular has accelerated the development of AI models and
experimentation, bringing research and production development
much closer to one another.
It has genuinely reduced the cost of moving between the two.
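In open source terms, wiring a Conda environment up as a Jupyter kernel looks roughly like this; the Bento integration itself is internal, and the environment name below is just an example:

    # Register an existing Conda environment as a kernel that notebooks can select
    conda activate experiment-a
    conda install ipykernel
    python -m ipykernel install --user --name experiment-a --display-name "Python (experiment-a)"

    # Any notebook that picks this kernel now runs its cells against that environment's dependencies
    jupyter notebook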
Thank you.
Here's the link to my LinkedIn profile.
I'll make the slides public in case anyone wants to go through these resources.
Thank you.