Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. Welcome to my session on deep learning for protein structure prediction. My name is Yaroslav. Let's start with who I am. I spent four years doing machine learning research and machine-learning-related software development. I have a master's degree in computational biology, and I also worked for a year on antibody structure prediction and machine learning for drug discovery at a biopharmaceutical company. Today I want to talk about proteins, why we need them, and how they can help solve different problems.
I want to talk about evolutionary information, which we can use to learn more about proteins and to predict their behavior and structure. I want to talk about structure prediction methods and a little bit of the history behind them. And we will discuss different kinds of structure prediction methods: physics-based, statistical, and finally deep learning. So first, let's
talk about what proteins are. Proteins are composed of amino acids; amino acids are the building blocks of proteins. They are often shown as letters, or as those green boxes here, which are the same things as the letters in the image below. Proteins are essentially chains of amino acids.
And proteins have structure, described at several levels. The first level, the primary structure, is the amino acid chain: the sequence of amino acids that defines the protein. The secondary structure is the structure that local parts of the protein take on: they can form spirals, or they can align against each other in different ways. The third level is the tertiary structure: the whole 3D structure of one protein chain. And finally, the quaternary structure describes a complex of proteins, multiple chains interacting together.
And why do we need to describe this structure, and why do we need to predict it and know it? For different kinds of things. We may want to estimate a protein's function based on its amino acid sequence alone, because we don't know the structure, and the structure is very hard to obtain through experimental methods; it's much easier to obtain the amino acid sequence. But to run experiments on computers with the actual 3D structure, we first need to obtain it. We can then use this information about the structure to try to understand the protein's function, to understand how we can modify the protein, and to see what use cases there are for that particular protein.
So, a couple of examples of why we need proteins. The first one is plastic degradation: we can have genetically modified bacteria that produce a protein acting as an enzyme, which speeds up the breakdown of plastic and can help us get rid of different kinds of waste. The other thing we can use proteins for is vaccines and drugs. For example, on the right there is a coronavirus displayed, with proteins sticking out of its shell. Those are the spike proteins that the virus uses to enter the cell, and they are the proteins our immune system reacts to. The immune system produces antibodies, which are also proteins, that can bind to the spike proteins of the coronavirus and alert your immune system to destroy the virus. How can we
get more information about a specific protein without running other experimental methods on that particular protein? We can have a look at similar proteins. The idea is that if we have a similar protein, we likely have a similar structure, and maybe a similar function. If there is a change in the protein sequence at some position, there may be a change at another position, one that is far away in the sequence but actually close in 3D space. If, say, one position changes its charge, the other position has to change its charge as well to preserve the structure. So we can search a database for similar proteins and align them together, so that the same structural positions sit on top of each other. That can help us learn how variable a concrete position is, or which positions it interacts with. Okay.
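This per-position variability can be sketched in a few lines of Python. It's a toy illustration on an invented four-sequence alignment, not a real tool or dataset; per-column Shannon entropy, where 0 bits means fully conserved, is one common way to quantify how variable an alignment column is.

```python
from collections import Counter
import math

def column_entropy(msa, col):
    """Shannon entropy (bits) of one alignment column: 0 = fully conserved."""
    counts = Counter(seq[col] for seq in msa)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Invented alignment: columns 0, 1, 3 are conserved, column 2 is variable.
msa = ["MKVL", "MKIL", "MKAL", "MKGL"]
print([round(column_entropy(msa, c), 2) for c in range(4)])
```

Column 2, with four different residues across four sequences, gets the maximum entropy of 2 bits; the conserved columns get 0.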
So structure prediction methods can differ. First of all, we have natural protein folding in the upper left corner. It's really, really accurate, and it doesn't need to obtain a lot of sequence information or information from a multiple sequence alignment; it only has to do its natural job. But if we are talking about protein structure prediction methods, they can be classified by the amount of information they use and by their accuracy. On the bottom left, we have physics-based methods. They are not very accurate, and they need a lot of compute to actually produce a result. The next thing is methods using a PSSM (position-specific scoring matrix). A PSSM is derived from a multiple sequence alignment, and it is essentially a statistic about each and every position of the alignment.
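As a rough sketch of what such a per-position statistic looks like, here is a minimal PSSM built from a toy alignment: per-column log-odds scores against a uniform background, with pseudocounts. Real tools use more careful background frequencies and sequence weighting; everything here is a simplified assumption.

```python
import math
from collections import Counter

def pssm(msa, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudo=1.0):
    """Per-column log-odds scores vs. a uniform background, with pseudocounts."""
    bg = 1.0 / len(alphabet)
    matrix = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa)
        total = len(msa) + pseudo * len(alphabet)
        row = {aa: math.log2(((counts.get(aa, 0) + pseudo) / total) / bg)
               for aa in alphabet}
        matrix.append(row)
    return matrix

m = pssm(["MKVL", "MKIL", "MKAL", "MKGL"])
# In the conserved first column, 'M' scores positive; unseen residues score negative.
print(m[0]["M"] > 0, m[0]["A"] < 0)
```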
Second-order methods use coevolution information: they encode information about pairwise interactions in the multiple sequence alignment and use different kinds of methods to produce the result. And finally, full multiple sequence alignment methods use the entire alignment and deep learning to process all the data available from it.
For some classes of proteins, the information from a multiple sequence alignment is not very useful. For example, for highly variable proteins such as antibodies, a multiple sequence alignment may not tell you much more about the protein. The other thing is that end-to-end deep learning methods are usually faster than physics-based methods, and we will talk about why in a moment. So, on average, the more information you use, the more accuracy you get: richer kinds of information take you higher in prediction accuracy.
So why is it difficult to get a result with physics-based methods, and why do they have to use so much compute? Because this is a problem with a lot of interacting particles. If you have even three particles interacting with each other, and you know the forces acting on them, that system cannot be solved in closed form. And any change to the initial state can change the end state very drastically, because it's a chaotic system. The only method we have for solving that problem is iterative numerical methods, which require a lot of compute.
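That iteration can be sketched as a numerical integrator: given the forces, you advance positions and velocities through many small time steps. This toy example uses the velocity Verlet scheme on two 1D particles joined by an invented harmonic "bond"; real molecular dynamics engines run the same kind of loop, just with far richer force fields and millions of steps.

```python
def spring_force(x1, x2, k=1.0, rest=1.0):
    """Force on particle 1 from a harmonic bond; particle 2 feels the opposite."""
    d = x2 - x1
    return k * (d - rest)

def simulate(steps=1000, dt=0.01, m=1.0):
    x1, x2, v1, v2 = 0.0, 2.0, 0.0, 0.0   # bond starts stretched past rest length
    f = spring_force(x1, x2)
    for _ in range(steps):
        # velocity Verlet: half-kick, drift, recompute forces, half-kick
        v1 += 0.5 * dt * f / m; v2 -= 0.5 * dt * f / m
        x1 += dt * v1;          x2 += dt * v2
        f = spring_force(x1, x2)
        v1 += 0.5 * dt * f / m; v2 -= 0.5 * dt * f / m
    return x1, x2

x1, x2 = simulate()
print(x2 - x1)  # the separation oscillates around the rest length of 1.0
```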
So, molecular dynamics methods use step-by-step simulation and high-performance computing systems to see how a protein folds and how the parts of the protein move under the forces acting on it from inside and outside. Those methods usually run on really expensive hardware, such as supercomputers. But they also have benefits: for example, trajectory analysis can be performed on the whole simulation, so in some cases you can learn the dynamic behavior of a protein. Those methods work with forces, and there are many different forces acting on the particles in a protein; some of them are described here on the right. Those forces are potential forces, which means they don't depend on the particles' velocities; they depend only on the particles' coordinates and properties.
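A minimal sketch of what "potential force" means: the force is the negative gradient of a potential energy that depends only on coordinates. The Lennard-Jones potential between two nonbonded atoms is a standard example; the parameter values here are arbitrary, and the derivative is taken numerically just for illustration.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Pairwise LJ potential: steeply repulsive up close, weakly attractive far away."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def force(r, h=1e-6):
    """Force along r as the negative numerical derivative of the potential."""
    return -(lennard_jones(r + h) - lennard_jones(r - h)) / (2 * h)

r_min = 2 ** (1 / 6)            # analytic minimum of the LJ potential
print(force(0.9) > 0)           # inside the minimum: atoms pushed apart (True)
print(force(2.0) < 0)           # outside the minimum: atoms pulled together (True)
print(abs(force(r_min)) < 1e-4) # at the minimum: force vanishes (True)
```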
With physics-based simulation methods, we struggle to obtain a good first structure, because reaching a low-energy state takes a lot of iterations. So maybe we can do something to get a good first structure and then take it from there, to speed up the whole process. For that, we can use homology modeling. Homology modeling is based on the same idea as multiple sequence alignment: similar sequences have similar structures. If you have a database of structures and their sequences, you can look for sequences similar to the one you want to fold, find fragments of those similar sequences, and combine them to create a first model. Then you can evaluate multiple such models, or fine-tune them using physics-based methods.
The other problem with physics-based methods is that we don't know how likely the current conformation is for the molecule. If we have a lot of statistics about which conformations we observe in real proteins, we can use that information to effectively forbid some states in the molecular dynamics process: if we know a conformation is unlikely, we apply forces to push the molecule out of it, because we assume it's an optimization dead end. But for that, we need to know the likelihood of different conformations of the molecular structure. If we use statistics, we just gather a lot of data and estimate the likelihood of every conformation. But that only works within specific protein families, because the statistics in one family can be different from the statistics in another.
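One common way to turn such observation statistics into something force-like is a knowledge-based (statistical) potential via Boltzmann inversion: E = -kT ln(P_observed / P_reference), so rarely observed states get high pseudo-energy. The distance-bin counts below are invented purely for illustration; real statistical potentials use carefully chosen reference states.

```python
import math

def statistical_potential(observed, reference, kT=1.0):
    """Boltzmann inversion: rarely observed states get high pseudo-energy."""
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    return {state: -kT * math.log((observed[state] / total_obs) /
                                  (reference[state] / total_ref))
            for state in observed}

# Invented distance-bin counts: real-protein observations vs. a uniform reference.
observed  = {"0-4 A": 10, "4-8 A": 800, "8-12 A": 190}
reference = {"0-4 A": 333, "4-8 A": 333, "8-12 A": 334}
energies = statistical_potential(observed, reference)
print(energies["0-4 A"] > energies["4-8 A"])  # True: the rare bin is penalized
```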
And that's where deep learning comes in. What we can do is estimate that conformation likelihood using machine learning. That is what a model called AlphaFold 1 tried to do: it predicted the likelihood of different distances for pairs of residues. The matrix in the middle can be treated as a distribution over distances between residue pairs. You can see the diagonal is green, which means those residues are close together; but some other residues are close together as well, even though they are not adjacent in the sequence. To produce this distribution, we use sequence and MSA features, which we can encode like a picture in 2D space, where each position tells us how the amino acids at two different positions, i and j, interact. And then, once we produce this likelihood map, we can use physics-based methods to fold the protein really quickly, because we know which conformations it is likely to take, and that really speeds up the whole physics-based process.
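That per-pair likelihood map is often called a distogram: for each residue pair, a probability distribution over distance bins. A minimal sketch of collapsing one pair's distribution into an expected distance usable as a folding restraint; the logits here are made up, standing in for network output.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over distance bins."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_distance(logits, bin_centers):
    """Collapse a per-pair distribution over distance bins into one distance."""
    return sum(p * c for p, c in zip(softmax(logits), bin_centers))

# Invented logits for one residue pair (i, j): mass concentrated in the 6 A bin.
bin_centers = [2.0, 4.0, 6.0, 8.0, 10.0]
logits = [-2.0, 0.5, 3.0, 0.5, -2.0]
print(round(expected_distance(logits, bin_centers), 2))  # 6.0: the distribution
# is symmetric around the middle bin, so the expectation sits exactly there
```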
The next evolutionary step in protein structure prediction is the AlphaFold 2 model, which uses the multiple sequence alignment directly. It produces the whole structure end to end with machine learning, without any physics-based iterative methods, which is a lot faster. It can be divided into three steps. The first step is obtaining the input: using the input sequence, you find a lot of similar sequences to produce an MSA, and you can also find their structures. As in homology modeling, you can find templates for your protein: pieces of other known structures that are likely similar to yours. After that, the deep learning magic happens: in the middle, we encode the information we got into the model. And the final step is structure prediction. For this, a new kind of structure prediction module was created, which predicts and updates angles and distances between amino acids to produce the final result; it works end to end with geometrical features. The result can also be fine-tuned with physics-based methods, because sometimes it is not locally accurate, since the model doesn't know physics. A few iterations of a physics-based method can relax the model, pushing some atoms apart or bringing them together so the whole structure looks more natural.
The other method for encoding protein information using a lot of data is language models. Proteins consist of amino acids, just like text consists of words, so we can use similar techniques from text and language processing to encode a lot of sequences into a large language model. Then we can use this large language model to encode our input sequence into some representation, and from that representation, using the same idea as AlphaFold 2, we can predict geometrical features for the structure and predict the structure end to end. This lets us use a lot more data for language model pre-training (protein language model pre-training), and then use a smaller model to predict the geometric features.
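The "encode the sequence into a representation" step can be sketched as: map each amino acid to an integer token, then look up one embedding vector per residue. Here the embedding table is random noise standing in for weights a real protein language model would learn from millions of sequences; a downstream structure head would consume these per-residue vectors.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
token_id = {aa: i for i, aa in enumerate(AA)}

random.seed(0)
dim = 8
# Stand-in for learned weights: one embedding vector per amino acid type.
embedding = {i: [random.gauss(0, 1) for _ in range(dim)] for i in range(len(AA))}

def encode(seq):
    """Sequence -> list of per-residue vectors for a downstream structure head."""
    return [embedding[token_id[aa]] for aa in seq]

reps = encode("MKVLAT")
print(len(reps), len(reps[0]))  # 6 residues, 8 dimensions each
```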
And in the same way as before, we can use refinement steps to fine-tune the result with physics-based methods.
And the final model, which was only released this month, AlphaFold 3, expands on the ideas of AlphaFold 2: it uses templates, it uses the multiple sequence alignment, and it also uses the other things that bind to proteins, to get a better result in structure prediction. This model can work not only on proteins; it was changed a little so it can take in information about the non-protein things that proteins bind to or interact with, which come from different origins, for example protein-DNA interactions. Essentially, it can be split into three stages as well. The first is input building. The second is deep learning processing. And the third one was updated too, so it can predict not only proteins but other molecular structures, such as DNA. In this model, they used a diffusion module to produce protein structures and other molecular structures from noise, similar to generative AI for images and videos.
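The diffusion idea can be sketched in one dimension: start from pure noise and repeatedly nudge samples toward high-probability configurations. Below, the "denoiser" is the analytic score of a Gaussian target standing in for a trained network; this is not AlphaFold 3's actual module, just a minimal noise-to-sample loop (a Langevin-style sampler).

```python
import math
import random

random.seed(0)
target_mean, target_std = 5.0, 0.5   # invented "data distribution" of a coordinate

def score(x):
    """Gradient of the target's log-density (the trained network's job)."""
    return (target_mean - x) / target_std ** 2

def sample(steps=500, step_size=0.01):
    x = random.gauss(0, 10)          # start from pure noise
    for _ in range(steps):
        noise = random.gauss(0, math.sqrt(2 * step_size))
        x = x + step_size * score(x) + noise   # drift toward the data, plus noise
    return x

xs = [sample() for _ in range(200)]
mean = sum(xs) / len(xs)
print(round(mean, 1))  # samples concentrate near the target mean of 5.0
```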
And we can see that many of the technologies used in image and text processing, such as diffusion models, large language models, transformers, and convolutional models, have all trickled down into biology. People found ways to use these technologies for biological applications, which are quite far from image processing and also somewhat far from language processing. But people keep finding new ways to use technologies, not only in the spaces where they were created, but also in biology and many other applications. So today you
learned about physics-based methods, statistical methods, and deep learning methods for protein structure prediction. You learned that physics-based methods require a lot of compute, and there is a lot of research on how to speed them up: there are heuristics, such as statistical potentials and other statistical tricks, for speeding up protein folding. The next logical step is to replace hand-crafted statistics with deep learning, and in a sense automate statistical feature recovery from data.
The challenge with that is getting more data into a machine learning model, whether a single model or multiple models. And as time goes on, more and more methods can unify information from multiple sources and encode it together to get a better result for protein structure prediction. You learned that end-to-end methods allow deep learning to be used for every step of structure prediction except obtaining the input. These end-to-end methods are really important because they can save a lot of time: they have really good properties for parallelization, they can run on efficient hardware, and they don't require as much compute, or as demanding compute, as the iterations in physics-based methods.
Physics-based methods are still not dead: there are still use cases where only physics-based methods will give you good performance and accuracy. For example, if you want to analyze trajectories, or if you want to refine structures that were produced by deep learning models without really knowing the physics. So they are still useful for post-processing and other applications where accuracy is really important, but they use a lot of compute. New deep learning methods such as transformers, diffusion models, and convolutional networks have trickled down into biology, and with time I hope we will see more methods appear in text processing and image processing that can be applied to biology and structure prediction. Thank you for joining me. If you have any questions, you can leave me a message on LinkedIn, and I'll be happy to answer them.