Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Bolu and I am
an independent AI researcher. And today I'm going
to talk about using Python for large language model research. Specifically, I'll be speaking on the nnsight Python library,
which enables mechanistic interpretability research.
So a bit of a primer on this field
of inquiry. What is mechanistic interpretability?
Some things we can hopefully all agree on. First is that neural networks
solve an increasing number of important tasks.
And the second is that it would be at least interesting, and probably important,
to understand how they do that. Interesting in the sense
that, if you feel any sense of curiosity, you get to
look inside this whole world that is currently a black box to most people,
because these models arrive at solutions
that no person could write a program for. So, out of curiosity,
it'd be interesting to know what algorithms are being implemented and
hopefully describe them in a human-understandable way,
and important in the sense that any sufficiently powerful
system that is being put in strategic places
of great importance in society has
to have a certain level of transparency and understanding before we
as a society can trust it to be deployed. Any effort to
understand how these models work will definitely continue
to be increasingly important in the future. Now,
mechanistic interpretability, or mechinterp, as I'll call it going
forward, because as you can imagine, it's a bit of a mouthful, is a
field of research that tackles this problem starting at a very granular
level of the models. And what do I mean by granular
level? The typical mechanistic interpretability
result provides a mechanistic model that basically means
a causal model describing how different discrete components
in a very large model arrive at some
observed behavior. So we have some observed behavior, and the question is, can you,
through experimental processes, arrive at
an explanation for how these observations come to
be? That is the mechanistic approach to it. Again, this is situating
mechinterp within the much larger field of interpretability,
which can have different flavors to it. But mechanistic
interpretability is unique in taking this granular, causal
model of trying to drill as deep as possible and hoping
to build on larger and larger abstractions,
but starting from the very granular level. And for
today's talk, we're going to be picking one item out of the mechinterp
toolkit, which is that of causal interventions.
So basically the idea is if we abstract the entire network
to be a computational graph, that is, again, we forget that
this is machine learning and just imagine it is any abstract computational
graph, and the current state of
not understanding simply means we don't know
what computation each of the nodes is running and how they interact with each other.
So from that perspective, suppose we're curious about knowing what
one component, whether it is an attention head,
an MLP layer, a linear layer, or an embedding unit, is doing. Again, you don't have to worry
about what any of these mean; you can just abstract each of them as being a node in
some compute graph. But if you do know them, it will help to paint the picture better.
So if we're curious to know what any of these nodes contributes,
even knowing whether it contributes anything to start with,
one way of doing that is simply taking the node,
observing some behavior that we find interesting,
and then changing the input to that node to see if the downstream
impact on the observed behavior is noteworthy.
That is, if this node, in this example the node D,
is very vital to some observed behavior downstream,
and we mess with it a bit, that is, we perturb the node,
then the observed result should change. That means, okay, this node
is on the critical path from input to output for this observed
behavior. Of course, we expect some part
of the model to change if you mess with anything. But the whole point
of this is that we have to first of all settle on some observed behavior,
and then we tweak the value of some node of interest and
then we observe downstream. If, however, it doesn't have any impact,
then that means this node is not that important and then we
can ignore it. But if it is, then we know that we can drill deeper.
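To make that concrete, here is a minimal sketch, not from the talk, of what such a causal intervention looks like on a toy network: we zero-ablate one node's output during the forward pass and check how much the downstream output moves.

```python
# Minimal sketch (not from the talk): zero-ablating one node in a tiny
# compute graph to see whether it sits on the critical path for an output.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(4, 8),   # an early node
    nn.ReLU(),
    nn.Linear(8, 8),   # node "d": the component we will perturb
    nn.ReLU(),
    nn.Linear(8, 2),   # output node
)

x = torch.randn(1, 4)
clean_out = model(x)

# Intervention: replace node d's output with zeros during the forward pass.
def ablate(module, inputs, output):
    return torch.zeros_like(output)

handle = model[2].register_forward_hook(ablate)
ablated_out = model(x)
handle.remove()

# If the outputs barely differ, node d likely doesn't matter for this input;
# a large difference suggests it is on the critical path.
print("effect of ablating node d:", (clean_out - ablated_out).abs().max().item())
```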
So I think in the rest of the talk, I'm going to speak on a
practical example in very recent research that
uses this kind of intervention to try to understand how
a model achieves some outcomes. The topic
of interest today is that of function vectors. So this is a very
recent paper, I think just published last year,
October, from a group from Northeastern University,
from the Khoury College of Computer Sciences.
Basically, it is a mechanistic interpretability research effort
that tries to observe some behavior in large language
models, and that behavior is
described thus. The question, the hypothesis,
is: is it possible to have some
functional components of large language models?
That is, looking at the top-left section here, if I gave a string
of input like arrive: depart, small: big, and so on, ending in a colon,
to something like, say, ChatGPT, I think we can
all agree that it will figure out, okay, this is a simple word-and-opposite
game. That is the first example at the top.
And the second example, I believe, is converting to Spanish.
I think we can all agree that something like ChatGPT and
similar large language models are able to do such a thing.
Is it possible for me to take some kernel
of this function of opposites, again taking the first example,
and transfer it to a completely different context and
have that same behavior operate on
a token in this new context? What that means is on
the right you see the direction of the arrow. The counterpart example on the
right simply says: the word fast means.
Now, under the normal operation of a large language
model, trying to predict the next token, it can say something like: the word fast
means quick, or it means going quickly, or any reasonable
thing to follow. However, if this hypothesis
of portability of functions holds, we should
be able to move something from this context on the left,
which clearly is about word and opposite, into a completely new
context that has no conception of word and opposite
as an objective, and achieve the result of
flipping the word fast into slow. I know
it seems almost crazy to
expect this is true, but let's just take this as the leading hypothesis.
And of course we're going to discuss what exactly this thing we'll be exporting is.
We see there in the figure the average layer activation.
What the hypothesis says is that this thing, in quotes, that we
plan to port over is simply the
average activation over a series of
prompts for a given task. Again, I'm going to break that down a bit.
So again, let's say our task is word and opposite. So we have three
different examples: old: young, vanish:
appear, dark: , and I
guess something like bright, or light, will follow. And the
second example, the same thing: awake: asleep, future: past,
joy: . At the very
end of all these contexts, these query
inputs, the neural network is right on the
verge of doing the thing called flip the opposite of the
last thing I saw before my colon. So the hypothesis is
if we can take that activation state and, as in
section b you see there, simply add it
to a completely unrelated context,
would it be possible to observe the same behavior? Because again, on
the right we see there simple: . In the absence of this
intervention, we have no reason to expect the model will
say anything other than something
like simple, or easy, or whatever the model finds appropriate to
follow simple. But if indeed our intervention matters,
we expect to observe something like simple: complex,
or at the bottom there, encode becomes decode,
just magically, by intervening with this average
activation state. Again, I will explain what we mean by activation state
shortly, but I hope you just get the general thesis of what this is
meant to be. That is, the question is: is there a portable
component for operations
and functions inside of neural networks,
and more specifically large language models? All right, so I
guess to give a bit of shape to what I mean
by activation vector and what is being ported left and right:
So here, this is just like a typical one-layer example
of an LLM decoder. What
we have here is, at the very bottom, we input
a sequence of tokens, right? That is, something like
on: off, wet: dry, old: .
And as we see, the expectation is that as this
input passes through subsequent layers,
there's one single set of vectors that keeps being updated,
changed, and added onto. And again, due to some specifics
of the neural network architecture, more specifically the
skip connections, which I won't get too much into right now, each subsequent
layer adds additional context that is literally just added
on top of the last. But anyway, that's not really important for now.
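As an illustrative sketch of that skip-connection picture (a simplified stand-in, not the talk's slides): each sub-layer just adds its output onto one running stream of per-token vectors.

```python
# Minimal sketch (illustrative): each sub-layer adds its contribution onto one
# running "residual stream" of per-token vectors, so the vector at a given
# position accumulates context layer after layer.
import torch
import torch.nn as nn

d_model, seq_len = 16, 5
x = torch.randn(seq_len, d_model)      # one vector per token position

attn = nn.Linear(d_model, d_model)     # stand-in for self-attention
mlp = nn.Linear(d_model, d_model)      # stand-in for the MLP block

residual = x
residual = residual + attn(residual)   # attention output added on top
residual = residual + mlp(residual)    # MLP output added on top

# The vector at the last position (the colon in the talk's example) is the
# kind of hidden state the function-vector hypothesis is about.
h_last = residual[-1]
print(h_last.shape)
```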
So let's just think of it, for example, by looking at the
journey of the colon, the very last token.
When it goes through the embedding layer, it has some vector
that represents, okay, cool, this is how the
neural network's embedding layer represents the colon token.
And again, we can kind of anthropomorphize,
pretend it's almost self-aware, to say
'I am a colon.' Because technically, if you took that embedding vector
and put it through the unembedding layer on the other
side, it would come out saying a colon is really likely to come next,
right? So we might as well just see this vector as
whatever information the model has for that position in
the sequence. So somewhere between starting from 'I am
a colon' at the beginning and ending at
'the thing that follows me is the word new,'
the model has learned some interesting things, right? By definition, how
else would it know? Again, because it's still
that same colon vector that has been
updated for the sequence position of the colon token. So the
conjecture here for the hypothesis of portable functions is
that somewhere in between, contained in that vector, is information
on 'I am a colon,' of course, which it had before. And it
also has 'my next is new,' that is, my next token
is the word new, which again is just what
we would observe from ChatGPT. So the additional thing the hypothesis
is asking is: is there a component that
encodes the operation it must do, the function it
must apply, to arrive at new? Perhaps before it came
to the conclusion that the next is new, is there a component
that says 'I am to do, or to call, the function
opposite'? Surely there must be something of the sort, because how
else would it know to come up with new?
But the question is whether there is linearity to this representation.
By linearity I just mean what allows us to do things like
this: literally take a thing, average it,
and add it somewhere else and have it do things, right? This assumes a lot
of linear behavior. So this is kind of the underlying
implicit assumption that is guiding this hypothesis.
To start with, many of the different
research inquiries that lead to very interesting results
often start with this assumption: can we assume there's linearity? And again,
due to details of the architecture of most
transformer neural networks,
there are reasons to expect there to be linearity. But just to see it
happen for real is always interesting. And I think this is the first time we're
seeing it in the context of operations, as against just representations,
which I think other research has demonstrated before,
such as for example, the relationship between the word
car and cars, that is, the relationship between a word
and its plural. There have been some regularities observed
in that regard. But this, however, is trying to take it a step further to
say, okay, are there encodings also for functions?
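As a toy illustration of what such a linear regularity would mean (the vectors below are made up for the example, not real embeddings), the word/plural relationship amounts to two difference vectors pointing in roughly the same direction:

```python
# Toy illustration (synthetic vectors, not real embeddings): the kind of
# linear regularity the talk alludes to, e.g. vec("cars") - vec("car") being
# roughly the same direction as vec("dogs") - vec("dog").
import numpy as np

rng = np.random.default_rng(0)
plural_direction = rng.normal(size=16)   # a shared "pluralize" offset, by construction

vec = {}
for word in ["car", "dog"]:
    vec[word] = rng.normal(size=16)
    vec[word + "s"] = vec[word] + plural_direction + 0.05 * rng.normal(size=16)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the regularity holds, the two difference vectors point the same way.
print(cosine(vec["cars"] - vec["car"], vec["dogs"] - vec["dog"]))
```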
Okay, so we have a rough idea of what this h is. It's simply just some vector;
at the very end of the network, right before it
goes into the penultimate layer, or at the penultimate
layer, we could run our model three different times,
snatch that vector across all of them, and read
literally what that vector is saying. Because again,
the information on what is to come next is embedded
in the colon token, right? It's the thing that is saying,
okay, dark: ... so all the information
for what comes after the dot, dot, dot is in the colon. Cool. So we take
that vector for different runs, we average it out, and we try to add it elsewhere.
So that gives you an idea, just to draw
a bit of a picture of it.
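A minimal sketch of that averaging step, with stand-in hidden states rather than a real model, looks something like this:

```python
# Minimal sketch (illustrative): average the last-token hidden state across
# several prompts that all demonstrate the same task, to get one candidate
# "function vector" h. The hidden states here are stand-ins, not a real model.
import torch

d_model = 16
prompts = ["old: young, vanish: appear, dark:",
           "awake: asleep, future: past, joy:",
           "on: off, wet: dry, old:"]

# Pretend these are the hidden states at the final colon position, one vector
# per prompt (in the real experiment they come out of the model).
last_token_states = torch.randn(len(prompts), d_model)

h = last_token_states.mean(dim=0)   # average over the "batch" of prompts
print(h.shape)                      # one d_model-sized vector to port around
```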
And of course, this is just restating the same thing now that we have an
idea of what h means and what that vector is.
So for each of the different runs in
a series of prompts that are basically doing the same task,
we literally take all the values of the vectors,
average them position-wise, adding them up and dividing, to get this
unified, averaged-out mean vector,
and we take it into a different environment, into a different context,
and literally just add it to something else.
The question is, will we be able to get effects
like those seen below? So I think here in the example
you see the representation for encode, so encode: .
There you can see
how we can presume that without this intervention at the end,
after this token goes through the
entire model, it might say something like: the
thing to come after the colon is base64, I guess, because maybe
encoding and base64 show up together
often, right? Remember, the base function
of a large language model is just to predict
the next most likely thing in human-generated
text. However, with the addition of our
supposed, our hypothetical, averaged-out opposite
function, would we be able to steer
it towards saying something like: actually, instead of saying encode: base64,
I all of a sudden feel the urge to say the opposite of
encode, and say encode: decode? This is
the hypothesis. So again, it would be super interesting and kind
of weird if we can indeed prove this.
The representation in this example just has about six different dimensions
encoding it. Of course, as we know, actual large language models
are much bigger than this, in
the billions and billions of parameters. So how
exactly do we plan to do this, the multiple runs
and extracting values and averaging them and intervening and adding
them, in the real world, not for a toy model?
And that brings us to our trusted interpretability libraries
and packages. These are packages that are designed solely for
this purpose of staring very deep
into what large
language models of different sizes are up to in
a way that is practical to enable this kind of research.
So we have nnsight, which is particularly popular
for working with models on the larger side, and I'll discuss
the details of its architecture that afford this
kind of behavior. Then we have TransformerLens,
which is also a very great open-source
library for doing this. But for today's work, we're going to focus on nnsight.
So what is it about nnsight that
makes it work? What is the context around nnsight? Where did it come from?
From my understanding, the nnsight package came
along with an effort called the
NDIF initiative, the
National Deep Inference Fabric.
This is basically a compute cluster that is available to researchers
who cannot afford the financial burden
of actually running these very large models, because they're very costly.
Forget training, even just running inference on them is quite expensive. So basically you
have this remote cluster of
compute that has been made available to researchers,
and the nnsight package was basically made
as an interface
to this compute cluster. So the typical workflow, as is seen here
in this schematic, is that you have the
researcher working locally, basically writing interventions
for how they want to run their experiments and intervene with networks,
which we are going to see. And this is basically translated into
a compute graph, or more specifically an intervention graph, as in:
this is how I want the running of this very large model
to be tweaked.
And this is then sent over the network to this cluster
to say, okay, cool, please run this
70 billion parameter model that I definitely cannot run on
my M1 MacBook, but run it with
these different interventions, in a way that makes it no different from if
I could actually run it locally. And as you
can see, the thing that crosses this boundary between the local environment
and the NDIF infrastructure
is simply this compute graph, and this compute graph is the output
of the nnsight library, and we'll
see how it does that. Cool. So that is the motivational
setup for why nnsight exists. It's basically a
counterpart to the NDIF project, which is super interesting,
by the way. Again, I think they just released their paper last November announcing
the launch of the NDIF facility; it is live right now, I believe, so
yeah, a really exciting project. I encourage anyone that's looking for
compute resources, for inference in particular, to check it out.
And again, this has nothing to do with training. It's just that if you want
to run a big model several times and do different
interventions, or read stuff from it to learn more, as we do
with our hypothesis in question, then it
works great. But of course, the library also offers
the option of just running the model locally if you happen to have
several gigabytes of RAM to spare.
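As a rough sketch of what that looks like in code (method names here follow my reading of the nnsight documentation and may differ between versions, so treat them as assumptions): the same tracing code can run a small model locally, or, with a remote flag, ship the intervention graph off to NDIF.

```python
# Rough sketch (nnsight-style; exact method names are assumptions and may
# differ by version, so check the current docs before relying on them).
from nnsight import LanguageModel

# A small model: runs locally if you have the RAM for it.
model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("old: young, dark:"):
    # Mark the final logits to be returned to us after execution.
    logits = model.lm_head.output.save()

print(logits.shape)   # (older nnsight versions require logits.value.shape)

# For a model you cannot host yourself, the idea is the same code path with
# remote execution enabled, e.g. something like:
# big = LanguageModel("meta-llama/Llama-2-70b-hf")
# with big.trace("old: young, dark:", remote=True):
#     ...
```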
Okay, so let's jump into the code. What does
it look like to do an intervention? By intervention, we simply mean
anything that either writes or reads the execution
state of our model. That is,
again, you have a model, we put in a token sequence, and then
stage after stage, the output of one stage is passed to
the next, and that is added to the residual stream, which is just, again,
think of it as like this ever accumulating output
of each component in the model that eventually leads
to a probability distribution or output that we
observe. So if we ever want to poke into it either like
use our binoculars or microscope to
look in, that is one type of intervention, as you can see here on line
five. Again, you can ignore the stuff above, I will explain that
later. But just to dive straight into what exactly the interventions
are, again, what are the things that make up these arrows of this
intervention graph that is being sent over, which is the whole point of this package?
On line five, we have something that is reading:
you see we take some layer's input, assign it to a variable,
and we save it. Again, I'll explain why we're saving
that. This giant model is not running on
my Colab notebook or in my
local environment, right? So it is interesting
that I can indeed read what is happening inside it.
And on line six we have the opposite,
which is me changing something in some other component
of my model: on layer eleven,
on a component called the MLP, I want to change its output to
become zero.
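The slide itself isn't reproduced here, but a hedged reconstruction of that kind of read-and-write pair, in nnsight style (the GPT-2 module paths and exact method names are assumptions that vary by model and library version), looks roughly like this:

```python
# Hedged reconstruction, not the talk's exact slide: one read and one write
# inside a tracing context. Module paths (transformer.h[...]) assume a GPT-2
# style model; nnsight method names may differ between versions.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in"):
    # Read: grab a layer's hidden states and mark them to be returned to us.
    layer0_hidden = model.transformer.h[0].output[0].save()

    # Write: zero out the MLP output of layer 11 for this run.
    model.transformer.h[11].mlp.output[:] = 0

    logits = model.lm_head.output.save()

print(layer0_hidden.shape, logits.shape)   # (older versions: use .value)
```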
And again, just to remind us what all this is for and how it relates to our hypothesis:
first we want to get the average over a bunch of runs for the task
in question, which is opposites. And then we want to add that
averaged-out value to some other examples
that are in a different context, the one-shot or zero-shot
examples. That is, we're not giving the model any idea of what we're trying to achieve.
We just want it to feel the urge to do the thing
we want it to do because we have added the vector. So literally, the first
thing, the mean, is just a read: we want to run the model several
times, read the value of this vector each time,
and then average it out. And then line six shows us changing
the value of some component. Then we want to
add this average value to the run in the
different context and see what happens. Okay, cool. So this basically is
the scaffolding for what we need for
our project. But before we go into the code
for our experiment and research
in question, just to decode a bit,
what exactly is happening here? So one thing is that
the nnsight library loves Python contexts, which is one
of the reasons why I guess Python
might be a language of choice. Context managers are great in
Python, as we know, and nnsight takes great advantage of
them. And the general structure of it is
that the code
might look like the model is running locally: I do things like save,
I do edits, and I do reads. But the whole point is that
the model actually isn't running right
now. When the context closes, that is, when code execution
reaches the point where we exit the uppermost context, which here
is the runner on line three, the intervention graph gets all the I/O
we planned. That is, all the reading and writing we're doing to the model is basically
just recorded while inside the context, and when the context
is exited, this is then sent over,
right? So the model does not run until the context at the highest level,
which in this case is the runner context, is closed.
As for the invoker: again, I would encourage anyone to read the documentation, but the invoker is
basically what does the writing of the graph.
The invoker and the runner are both coordinating; I think
the runner definitely does some high-level management, but one
of the initialization inputs to
the invoker, implicitly, is something called a tracer. And again
you can think of the tracer as just being the graph
in question. As we're going to see, you can actually construct multiple
graphs inside of one runner,
which we'll see shortly. That is, you can say, okay,
I want to plan different experiments. And again, this fits perfectly for our
use case, since first we want to run one set of operations that
runs our task inputs and takes
the average, and then another set of operations that
takes that average and adds it to the state of
the different-context examples, the ones that should have no
idea about the task, and then see what happens when this average
vector is added to them.
So the runner is the high level context manager, and then each
basic subgraph experiment that we want
to run is contained in the
invoker context. Every read
and write intervention, all the I/Os, are the nodes in
there, of type tracing node, which again are
what our entire graph is made of to start with.
And I said I was going to speak on why we need save. So again,
remember that because this isn't running locally,
we have to explicitly tell the model to save any
value that we want to read outside of the context, because the standard
behavior is when the context is exited, the model actually runs
with all our interventions, but because these values are so
large, we have to explicitly say: okay, please, I would like you to return
these several hundreds of thousands of vector values
to me, because they are important. So that is the only reason we would
access a hidden state outside the context; otherwise we wouldn't need to save.
So perhaps this was just a temporary variable that
we needed for our computation, which is fine;
if we have no intention of accessing it after the
context closes, we wouldn't put save. It's only because we want to hold
on to the value. So this is just one of the examples where we have
to remind ourselves of the difference between running the model locally
and simply building an intervention
graph for a remote resource that is going to run
as soon as we leave the context. And again,
this is from the documentation, basically showing how
each line of intervention maps onto the graph.
So the first green arrow on the left blue
box is a write; that is, we're setting some layer's
output to zero. The next is a read, and
the third is also a read, but you
see here we use the .save because we do want this
value to be sent back over the network once the model has run. And you see
the output of this is the intervention graph in the middle,
and this is basically what is sent over the network in one direction.
And then the results for the things that we ask
the graph to save
are sent back in the other direction when the execution is done.
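To make the save behavior concrete, here is a rough sketch in the same assumed nnsight style: only the value we call .save() on is meant to come back to us after the context exits.

```python
# Rough sketch (nnsight-style; attribute names may differ by version).
# Only values we explicitly .save() survive after the tracing context exits;
# everything else stays on the execution side.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("on: off, wet: dry, old:"):
    # Used only inside the context: no .save() needed.
    hidden = model.transformer.h[8].output[0]

    # We want to inspect this one after execution, so we save it.
    last_token = hidden[:, -1, :].save()

print(last_token.shape)   # available here because of .save()
                          # (older nnsight versions: last_token.value.shape)
# The unsaved `hidden` proxy is not meant to be read out here.
```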
Again, just to remind ourselves of what we're trying to do, now that we have an
idea of what our library looks like and
how we use it: we want to pass
in some context and run it.
Remember, we're only interested in what happens at the colon, so we
will be indexing to get only the
vector at the very end, because the
idea is that that is the token that will
contain information on what is to come next, right? Again,
just as a result of how transformer architectures work, the next-token
prediction is contained in the last token. To what end do we want to
do that? To this end: we do two sets of runs.
The first run is to pass in a bunch
of examples doing the task we want. Again,
this is exactly like how you would tell ChatGPT something
like: I want you to give me words and opposites, like this
example, old, young, separated by a colon; then it
does the thing, right? So this is basically
similar to prompting it with the format of output you want. But in this case we're actually
going to look at the very last token,
right before it's on the verge
of predicting. It must first have done the computation
to know, okay, this is a word-and-opposite game we're playing, and I
am to predict the opposite of the last thing I saw before this token.
So, right when it has supposedly figured all that out, we want
to snatch that vector, and average a bunch of them
out, to hopefully get a vector that represents in
some pure form the very essence of the task it
has figured out, which is the opposite function. That's the first part of the
experiment. Then the second part of the experiment
is to take this pure vector and then add it to a different context,
a different series of examples that supposedly should have no idea
what is going on, right? Because again, if you just told ChatGPT encode: ,
it has no idea what you want; it can't read your mind
yet. So this is
called the zero-shot intervention, where zero-shot
basically just means zero examples of what you're looking
for. Except now we're going to add this
hopefully averaged-out function vector and have
it just feel compelled to do the thing that
vector was obtained from. Right, so how do
we do the first part? That is the part where we just run a
bunch of prompts, extract the value
at the very last colon for all of them, and average them out.
Cool. So again we have our trusted layout.
Of course, first of all we have to determine what component we
want to look at. Remember, mechinterp is all about having
some interesting observed behavior and trying to
find the contribution of some discrete component to it, right? So in this case we
narrow down by saying, okay, we want to look at layer eight.
In the actual experiment we run this for all the different layers, for all the
different components of the model, and then we have like a plot of
which of them happen to be most interesting. And then we drill down.
But this is just showing an example of suppose we wanted to see what layer
eight was doing as far as the task is concerned.
Cool. So just imagine this done for a bunch of tens of
other components. Cool. So we have a runner and an invoker.
Then here on line six we simply do our trusted
read. Notice here that I don't do save, because this
value of hidden states, this variable, is only needed
for computation inside of my
context, right? So I do not need to export
it at this point in time. I just need to take
the variable, hold on to it, and use it for other computation on line
ten. And as you can see, what line six is
holding onto is simply the rightmost column, that is, sorry,
beg your pardon, the colon.
Right. So between lines six and seven we
simply take an example and choose a layer, and
on line eight we say what the sequence position should be; I think on
line one, you see, we define that as minus one. So we simply want
to take the very last value, which is the colon token. So again, all the dark gray
bars are what line eight is holding
onto. Then on line ten we simply do the average.
So we take that variable and we do the .mean on the batch
dimension, that's the 0th dimension. Again, that's the dimension of all the stacked examples
on the right there; I think I just put a clip out to show
you what the vectors and
matrices will look like. So each of those stacks is along the batch dimension.
So each of the examples, old: young,
awake: asleep, is represented by one of these slices.
And we simply want to average across that to get some hopefully
pure vector that encodes the essence of opposites,
and that one we do want to save, because that is what we want sent back. So it's
kind of meant to be an efficient thing, in that we don't want to send
everything over the network, we don't want to send
all the full matrices. Granted, we could decide to save everything and
then compute locally, but again, this is just one of the considerations
you make when you remind yourself that actually there is
some throughput cost and efficiency cost.
So let's just do as much as we can in that
environment and then just send back the most condensed
version we want. Again, this should
be familiar from using any remote resource, or
whenever you have trade-offs between remote and local resources to contend
with. Cool. So that is how we
do the first part; that's how we get the averages. Literally, this is all it takes
to do the averages for one layer, and just imagine putting this in a for loop if
you want to iterate over several layers.
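Here is a rough sketch of that first step (a reconstruction, not the talk's exact slide; module paths and method names are assumptions): run the task prompts, grab the layer-eight hidden state at the final colon position, and average over the batch.

```python
# Rough sketch of part one (nnsight-style reconstruction; module paths assume
# a GPT-2 style model and method names may differ between library versions).
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

LAYER = 8
SEQ_POS = -1  # the final token, i.e. the colon

prompts = [
    "old: young, vanish: appear, dark:",
    "awake: asleep, future: past, joy:",
    "on: off, wet: dry, old:",
]

# Passing a list of prompts as one batch (check your version's docs).
with model.trace(prompts):
    hidden_states = model.transformer.h[LAYER].output[0]   # no .save(): only used here
    last_token = hidden_states[:, SEQ_POS, :]               # [batch, d_model]
    h_mean = last_token.mean(dim=0).save()                  # only this comes back

print(h_mean.shape)   # (older nnsight versions: h_mean.value.shape)
```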
Then for the second part: having possessed this averaged,
pure vector, which we called h, we want
to put h into our zero-shot
examples. Again, these are the examples that have no context on the task.
They're just doing their own thing, supposedly
oblivious to the task we find interesting, opposites;
but hopefully, if we add this averaged vector
state to them, out of nowhere they will just feel
the urge to do opposites. And here is the example
I mentioned where we're running two invoker
contexts inside of the runner context. So basically the
first is, again, as with
any experiment, we have to have our control example,
our reference or our baseline, to say that,
cool, without adding this average vector, what does
the model feel compelled to predict? So for simple,
does it feel compelled to predict simple?
Simpler? Maybe it just says, cool, simpler should be the thing to follow simple.
Or given encode, does it feel compelled to predict base64?
Or perhaps it does feel naturally compelled to say decode,
who knows? So the first run, on lines four and
five, is just again simply running the model and
saving the output for the very last token.
And the second is where we do the interesting stuff of running the
model and basically intervening. So on line eleven we literally
just do a plus-equals, which reads much like how
we were taking the mean before, but for this context we
do a plus-equals to add this value to the existing one.
And on line 13, just
like line five, we save, so we can see, okay, cool, what
the predictions are and how similar they are, or to
what extent this vector has changed
the opinions of the model.
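And here is a rough sketch of that second step (again a reconstruction under the same assumptions): one baseline invoke and one invoke where the averaged vector is added to the hidden state of the zero-shot prompt.

```python
# Rough sketch of part two (nnsight-style reconstruction; method names and
# module paths are assumptions that may differ between versions).
import torch
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")
LAYER, SEQ_POS = 8, -1
zero_shot = "encode:"

# Stand-in: in the real experiment, use the averaged vector from part one.
h_mean = torch.zeros(768)

with model.trace() as tracer:
    # Control: what does the model want to say with no intervention?
    with tracer.invoke(zero_shot):
        clean_logits = model.lm_head.output[:, SEQ_POS, :].save()

    # Intervention: add the averaged "opposite" vector at the last position.
    with tracer.invoke(zero_shot):
        model.transformer.h[LAYER].output[0][:, SEQ_POS, :] += h_mean
        steered_logits = model.lm_head.output[:, SEQ_POS, :].save()

# Compare the top predictions with and without the added vector.
print(clean_logits.argmax(dim=-1), steered_logits.argmax(dim=-1))
```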
Now, the results. I mean,
depending on your standards for impressive
or not, this is what it looks like; this is what run one looks like.
Just by doing that, we can see that, indeed, in the third
column here, adding that h vector does
move the needle a bit, does
have the effect of the opposite function, right? So in
the second column, we just see the thing the model tries to do on its own:
if you tell the model minimum: , it just repeats
a lot of stuff, right? So it just says minimum is minimum, arrogant is
arrogant, inside is inside. Although sometimes it does interesting things, like the
fifth example from the bottom. If you say on, it says I.
If you say answer, it says yes. Again,
this is what the model feels compelled to just say if it has no context.
But on the third column we see that in some examples
we do manage to tilt its final judgment
in a different direction. Now, I will mention though, that this
is technically not where the paper stops.
The paper goes beyond just averaging h.
Remember, this is taking the value of
the output of the entire layer; remember, we just used layer eight.
The paper takes it further by saying, okay, instead of just looking at layer eight,
can we drill specifically into which component in layer
eight is contributing? So, back to our reference architecture:
a transformer block has different things.
Our transformer block, rather, excuse me, has the masked self-attention,
the feed-forward, the layer norms,
and they decide to drill into the contributions of
the attention heads. Again, the distinction isn't that important, but the experimental
method is precisely the same. So they just find a way to drill into studying
the contribution of the top x
attention heads. And instead of looking for the average of
all the components' contributions, which again supposedly
will have more noise, they basically try to denoise by narrowing
in on just a few, and with that the effect is way more obvious.
But I don't include that for the purpose of this talk.
And that was the talk. So if you are interested in
looking at the NDIF project, and the nnsight library
as well, which is its companion, please visit their site.
And if you're interested in learning more about mechinterp:
many of the code snippets and
concepts introduced here were introduced to me
on the ARENA education platform. It is an awesome program;
you should check it out if you're interested in learning to do
mechanistic interpretability. I hope you've had as much fun
going through this as I have, and do enjoy the rest of
the conference.