Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi there. My name is Bolu, and today I'm going to talk about superposition in neural network representations. To motivate that a bit, I'll share some context about where this hypothesis comes from and about the field of neural network research it belongs to.
This field is called mechanistic interpretability, and it basically follows from this reasoning: we all understand that neural networks solve an increasing number of important tasks really well, and it would be at least interesting and probably important to understand how they do that. So mech interp is a subfield that tackles this problem by seeking granular, mechanistic explanations for different observed behaviors in neural networks. It's basically pushing back on the idea that neural networks are just these black boxes that are completely inscrutable and just do magic with linear algebra. It peers into a given network at a granular level to investigate some very isolated behavior.
At the same time, it also has very broad hypotheses and theories about how neural networks do things, and one of these is about representation learning. That is, how do neural networks learn which representations to use for inputs, and how are these representations passed around in the computation? This is basically about understanding what a model sees and how it sees it: what information has the model found important to look for, and how is that information represented and propagated internally in the network? To paint a picture of what we mean by representations and propagation:
On the bottom left here we have, let's just say, a simple tokenized version of an input you might pass to a transformer: "on: off, wet: dry, old:". And as any of us would attest from using something like ChatGPT, these neural networks are definitely able to predict that the next thing after "old:" is going to be "new", since they can figure out that we're playing a word-and-opposite game. So the idea is that somewhere between the entry of our text on one side and the prediction on the other side, the network has to have encoded certain information and done computations along the way to produce the output "the next thing to come after this colon is new". So what we're trying to ask is: what do we know, and what can we investigate, about how this information is encoded? Because in the beginning all the network knows at that last position is "I'm a colon", after going through the embedding layer, which just maps the colon character to an ordered collection of numbers, a vector, as you can see there in the column. So the question is: what happens somewhere between entering as "I'm a colon" and exiting as "my next is new", which is the result of this vector going through an unembedding layer and a softmax, with the highest probability weight being assigned to "new"? Just for simplicity, let's assume the word "new" is its own token, because, as we know, prediction is done on a token basis. So somewhere between starting with "I'm a colon" and ending with "my next is new" is a bunch of stuff. What do we know about what these representations look like internally?
All right, so here are a couple of qualities of these representations that a certain school of mechanistic interpretability posits. The first is decomposability: it says there are discrete features that a model has learned to look for in an input, and these discrete features compose into any given representation. So if we looked at any layer, or at any component in the architecture, all the information the model has at that stage is going to be some composition of discrete things. The second is linearity, which takes the decomposability statement a bit further to say that not only are these discrete components composed together, they're composed linearly; we'll discuss a bit later what exactly that means. And the third says we can think of these discrete qualities as things called features; a more precise definition of what a feature is comes later. So maybe to summarize, the single line or tagline that sums up this school of thought is: language model representations (and you can substitute any neural network with a similar architecture) are linearly decomposable into features. We're going to pick apart each one of those items in the course of this talk.
The first one is actually kind of a weak statement; it's not that strong, and I'll explain why. In isolation, decomposability basically just means that we assume neural networks learn different things. That is, a given neural network doesn't simply memorize every single potential input; it learns to abstract certain features, like blueness or redness, or perhaps something even more general like color, if it has a general abstraction for color. In this simple case, we have a neural network that might be trained to identify colors and shapes, some classification network, and the idea is that somewhere in the network the weights are transformations that are able to extract certain discrete qualities, such as the center shape or the background color; here, to simplify, we just look for blueness or redness on the left. But the interesting thing is that this is really just saying that sometimes neural networks don't overfit, which is why I call it a weak statement: it's pretty obvious, or at least anyone who has trained a neural network can demonstrate with a test set, that yes, these networks can generalize, and not everything is overfitting or memorizing.
So on the right there, if we have something like a purple triangle that this network has supposedly not seen in training before, it can depend on its previously learned features: even though it doesn't quite have a conception of purple as a distinct thing, it can compose the color purple, perhaps by reading the RGB values, as being roughly equal parts red and blue. A tiny concrete sketch of that decomposition is below.
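As a toy illustration (my own sketch, not something from the talk), here is that "purple is just red plus blue" decomposition written out with plain RGB vectors:

```python
import numpy as np

# Hypothetical learned feature directions in raw RGB pixel space.
red_direction = np.array([1.0, 0.0, 0.0])
blue_direction = np.array([0.0, 0.0, 1.0])

# A purple input the model has never seen labeled as "purple".
purple_pixel = np.array([0.5, 0.0, 0.5])

# Decomposability: the novel input is expressible as a combination of
# features the model already knows about.
red_amount = purple_pixel @ red_direction    # 0.5
blue_amount = purple_pixel @ blue_direction  # 0.5
reconstruction = red_amount * red_direction + blue_amount * blue_direction
assert np.allclose(reconstruction, purple_pixel)
```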
Right, so at this stage, that's all decomposability is saying. There are certain things in this diagram that are quite strong assumptions, like this whole idea that there's one thing called a "blue neuron"; as we'll see, that's a pretty strong thing to say, and it's not obvious at all that this is how things play out in reality. But at this stage, decomposability just says a representation is composed of a bunch of little pieces, because a neural network demonstrably does not just memorize all the time; it generalizes, at least for a sufficiently small problem set.
The second is linearity. This again takes the decomposability property a bit further. It says: okay, cool, not only are these different properties distinct, they combine as linear sums, quite simply. Which basically says you can imagine a vector, a certain direction, representing some feature. And this isn't as contrived as it might sound: remember, looking at this diagram, the inputs are already ordered collections of numbers, so everything stored for the colon is already inherently in this vector format. I should mention, though, that just because a thing is an ordered collection of numbers doesn't mean it has to behave linearly. It's a bit confusing, because if something has this vector formatting, an ordered collection of numbers relating to one entity, then surely it's obviously a meaningful direction, right? That is not obvious, and I'll show an example of what it looks like when it's not the case. Okay, so linearity says these decomposable sub-vectors literally just add together to give you the representation for something. So here we have some other neural network that cares about size and redness in the abstract. Given that it only has two features or qualities it cares about, it can dedicate two different directions to them, and these directions can simply combine to represent any one given input; a minimal sketch of that kind of linear composition is below.
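Here is that sketch, with made-up feature directions and coefficients (nothing here comes from a real model):

```python
import numpy as np

# Two hypothetical feature directions in a 2-dimensional representation space.
size_direction = np.array([1.0, 0.0])     # "how big the shape is"
redness_direction = np.array([0.0, 1.0])  # "how red the background is"

# A big shape on a plain background, and a plain red background with no shape.
big_plain = 0.9 * size_direction
plain_red = 0.8 * redness_direction

# Linearity: the representation of "a big red thing" is literally the sum
# of the sub-vectors for its parts.
big_red = big_plain + plain_red
print(big_red)  # [0.9 0.8]
```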
And do we have any evidence for this in practice? Yes. This is a fairly popular example by now, but there was a paper that came out some years ago that showed regularities in the differences between pairs of vectors. The difference between the "man" and "woman" word representations in certain language model architectures was consistent. So if you did something as simple as subtracting the vector (again, just the ordered collection of numbers) for "uncle" from the vector for "aunt", and you imposed that difference on some other pair, say added it to "man", you would end up with pretty much exactly the vector for "woman". And you have a bit of this vector algebra on the right with "car". Let's say this is another relationship, plurals: if you take the vector representation for "cars", subtract the singular representation for "car", and add the result to "apple", you get something like "apples". This kind of behavior, literal element-wise subtraction and addition of values, is what you would see in a linear system: a system where the abstract feature "this refers to a masculine entity" is encoded alongside all the other stuff that has to do with royalty in "king", or with siblings of your parents in "uncle" and "aunt", or, in the most literal case, "man" and "woman". Effectively, if these multiple things are composed in a linear fashion, then you can do things like this vector subtraction and arithmetic, as we're seeing here.
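Here is a sketch of that arithmetic using a toy dictionary of hand-made embedding vectors rather than real learned ones (the actual experiments use pretrained word vectors, for example word2vec):

```python
import numpy as np

# Toy "embeddings": dimension 0 encodes gender, dimension 1 plurality,
# dimension 2 a crude word-family identity. Real embeddings are learned.
emb = {
    "man":    np.array([+1.0, 0.0, 0.3]),
    "woman":  np.array([-1.0, 0.0, 0.3]),
    "uncle":  np.array([+1.0, 0.0, 0.7]),
    "aunt":   np.array([-1.0, 0.0, 0.7]),
    "car":    np.array([0.0, 0.0, 1.5]),
    "cars":   np.array([0.0, 1.0, 1.5]),
    "apple":  np.array([0.0, 0.0, 2.0]),
    "apples": np.array([0.0, 1.0, 2.0]),
}

def nearest(vector):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    return min(emb, key=lambda word: np.linalg.norm(emb[word] - vector))

# "aunt" - "uncle" captures a masculine-to-feminine direction; add it to "man".
print(nearest(emb["man"] + (emb["aunt"] - emb["uncle"])))    # woman
# "cars" - "car" captures a singular-to-plural direction; add it to "apple".
print(nearest(emb["apple"] + (emb["cars"] - emb["car"])))    # apples
```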
But again, this doesn't mean everything is definitely linear. This is just the embedding layer; to remind us what the embedding layer is, it's the very bottom of this diagram. There's still a lot of uncertainty as to whether, sure, for simple things like embedding a word you get this vector algebra, but does that mean that for everything, at all the layers in the network, all the information it has to encode is in fact composed in this linear fashion? So that is why there's still a mystery, even though we have seen some evidence. And as I mentioned, it's worth noting that just because a thing is an ordered collection of numbers, which is how neural networks tend to be represented (hence this meme about how neural networks are just linear algebra scaled up), that doesn't necessarily mean it is linear. Linearity is a very particular statement about how different entities interact.
So here's an example. We can imagine a different regime where a neural network was again able to extract a discrete component for redness and another for blueness, but then joined them together differently: maybe it exploited the decimal places of a fixed-precision number. The way it does this "special addition" is by taking the first value and storing it in the first decimal place, then taking the other value and storing it in the next decimal place; you have an algorithm here to do exactly that. This is an example of a nonlinear composition, and the component that makes it nonlinear is that it relies on the floor operation (think math.floor in Python), which does the rounding needed to exploit precision and placement and squish the two different values into one number. So this is just one dummy example showing that, yes, ordered collections of numbers can act in ways that are not quite vector-like and don't simply add; other compression schemes exist. A sketch of such a packing scheme follows.
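Here is a minimal sketch of one such packing scheme (my own illustrative version, not the exact algorithm from the slide):

```python
import math

def pack(red, blue):
    """Nonlinearly squish two values in [0, 1) into a single number by
    storing `red` in the first decimal place and `blue` in the second,
    e.g. red=0.3, blue=0.7 -> 0.37 (both coarsened to one digit)."""
    red_digit = math.floor(red * 10 + 1e-9)   # epsilon guards against float rounding
    blue_digit = math.floor(blue * 10 + 1e-9)
    return red_digit / 10 + blue_digit / 100

def unpack(packed):
    """Recover the coarsened red and blue values from the packed number."""
    red_digit = math.floor(packed * 10 + 1e-9)
    blue_digit = math.floor(packed * 100 + 1e-9) - 10 * red_digit
    return red_digit / 10, blue_digit / 10

print(pack(0.3, 0.7))          # ~0.37
print(unpack(pack(0.3, 0.7)))  # (0.3, 0.7)

# The scheme is NOT linear: the sum of two packed values is not the packing
# of the summed values, so these "features" do not compose by vector addition.
print(pack(0.05, 0.05) + pack(0.05, 0.05))  # 0.0
print(pack(0.10, 0.10))                     # ~0.11
```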
And what the linear representation hypothesis is saying is that on the journey from the input "I am a colon", which is what the embedding gives us, to the output, all the information the model has at that point, all the information this single colon position has accumulated as it went through the layers, was simply added together. There's one vector that represents "there's a bit of a word-and-opposite game going on here". And there was an interesting paper showing that, yes, not just nouns or discrete facts about inputs can be encoded, but also abstract things like functions: this whole idea of "a word-and-opposite game is being played here between wet and dry, old and new" is itself one vector, and then there's yet another vector for the specific word. This is something attention can give us, the ability to look at previous inputs: the colon token can look immediately behind it and see that the thing that came before it is "old", and it can also look further back at the other pairs to glean the pattern of words and opposites. All these different bits of information are literally just different vectors, different directions, that compose as simple additions to end with the conclusion: okay, surely my next token is "new".
Now, this representation question isn't really concerned with how the network does the combination, that is, what mechanics inside it know how to take this vector for "word-opposite game" and this vector for "old" and compute with them. There's another body of work that explores that, basically algorithmic interpretability. Here we're just asking: the variables being used in these computations, what do they look like, and how are those variables composed? By "how" I mean: in the space of all potential transformations that take redness and blueness together to get purpleness, is it some unknown arbitrary thing, which would be messy and hard to study, or is it literally simple vector addition? That's what makes representational interpretability distinct from algorithmic interpretability.
Right. So again, as we see here, think of linear composition as just a compression scheme for how all this information is packed together. Linearity is great because it helps us narrow things down to one compression algorithm in a very large function space. There are many things these giant networks could be doing in their typical inscrutable fashion, so linearity is pretty helpful in that it narrows it down to one well-known, well-studied compression scheme, namely the whole machinery of linear algebra, if it does happen to be linear. The other thing this gives us is that it aids diagnostics and helps improve our understanding of the models in ways that would be hard if, for example, every single representation at every single layer used a different type of arbitrary algorithm. So, yeah, it would basically be very convenient if this were the case, and again, we have seen some evidence for it.
So this is just an important point to make: this is a combination of having some evidence, but also a bit of motivated inquiry. If this weren't something we cared about, well, there are many things about neural networks that seem interesting but that people just haven't really dug into; the fact that they seem to have this linear behavior has drawn a very large community of researchers to study exactly why, because it makes the problem a lot more tractable than if it weren't the case. And yes, I put "effectively mind control" here because, as these tools become more mature and we understand what's happening, we get to do different things, everything from mind reading to mind control. That is, if you get to run something brain-like on a computer, and you have access to all the numbers, and you understand both the algorithms for how the information is represented and how it is transformed, then you can eventually intervene, or at the very least have a log stream of what's going on.
Cool. So that's the motivation for why linear decomposability would be great: it gives us a tractable, well-understood scheme for making sense of these transformers. Okay, here is a bit of the downside: linearity is kind of demanding, in that a lossless compression scheme that composes linearly requires as many dimensions, that is, as many of those distinct ordered slots in the vector, as you have qualities you want to encode. As you can see here, we have something for redness, for blueness, for squareness, for triangleness, and you get this one-hot-vector kind of situation, where the thing that makes one of these directions encode the property of redness is that only the first cell activates for it. Now, I do want to point out a subtle distinction: the requirement is on the number of dimensions, and it doesn't necessarily mean you would always have this perfect picture of one cell corresponding to one thing. There are infinitely many orthogonal bases that can achieve this; all you really need is as many orthogonal directions as you have features. Just for the sake of easier understanding, we focus on the member of this infinite set of orthogonal bases that happens to be one-hot. So whenever I talk about a neuron or a dimension as if it were literally one neuron, keep in mind that's a simplification; it isn't necessarily the case.
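To make the "as many orthogonal directions as features" point concrete, here is a small sketch (my own, with made-up bases): a one-hot basis and a rotated basis are equally valid in four dimensions, but you cannot find a fifth direction orthogonal to all of them.

```python
import numpy as np

# The one-hot basis in 4 dimensions: one cell per feature.
one_hot = np.eye(4)

# Any rotation of it is an equally valid orthogonal basis -- each feature is
# then "smeared" across cells, but the directions still don't interfere.
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
rotated = rotation @ one_hot

for basis in (one_hot, rotated):
    # Off-diagonal dot products are all zero: no interference between features.
    print(np.round(basis @ basis.T, 6))

# But 4 dimensions can hold at most 4 mutually orthogonal directions: any
# fifth unit vector necessarily overlaps with the existing basis somewhere.
fifth = rng.normal(size=4)
fifth /= np.linalg.norm(fifth)
print(np.round(one_hot @ fifth, 3))  # at least one entry is nonzero
```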
Cool. So why is this a problem? Let's remind ourselves where we're at right now. This particular hypothesis about how representation is done says that large language model representations are linearly decomposable into features. Okay, and this brings us to the linear representation puzzle. Why is it a puzzle? In a couple of steps: first, we have some evidence that LLMs do indeed represent stuff linearly, meaning this claim has some basis in reality, and there are several other arguments in the research suggesting this behavior is likely, for example from looking at the number of FLOPs that are dedicated to linear transformations versus not. Cool, so linear stuff is happening. Second, linear combinations require as many dimensions, or neurons (again, the one-hot special case of an orthogonal basis), as there are features. If you want to encode redness, blueness, triangleness and squareness distinctly, as different things, you need literally four different directions.
However, and this is where the puzzle comes in: in practice, it seems that these networks are able to represent way more stuff than they have neurons for. I have a bit of a back-of-the-envelope calculation here: GPT-2 has on the order of hundreds of thousands of neurons. The exact number varies with the architecture, but if you look at the number of attention heads, MLP layers, and the dimensions these architectural components operate with, you're in the range of a couple hundred thousand neurons. And the assumption is that these models encode a lot more than that. If you're finding it hard to wrap your head around why a couple hundred thousand features would not be enough, remember that this is encoding all of the English language, or really all of language. If you recall how well GPT-2 was able to perform, it's plausible that it encodes a lot more than a couple hundred thousand features, because, again, this is all of language itself.
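As a rough sketch of that neuron count (using the published GPT-2 XL configuration of 48 layers, a 1600-dimensional residual stream, and MLP blocks four times that width; these are my numbers for illustration, not figures quoted in the talk):

```python
# Back-of-the-envelope neuron count for a GPT-2-XL-sized transformer.
n_layers = 48        # transformer blocks in GPT-2 XL
d_model = 1600       # residual stream width
d_mlp = 4 * d_model  # hidden width of each MLP block (6400)

mlp_neurons = n_layers * d_mlp
print(f"MLP neurons: {mlp_neurons:,}")  # 307,200 -- "a couple hundred thousand"

# Even counting every residual-stream dimension at every layer only adds
# 48 * 1600 = 76,800 more, still far short of the number of distinct
# concepts you would expect "all of language" to require.
print(f"Residual dimensions across layers: {n_layers * d_model:,}")
```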
The question is: how is that possible? We seem to have conflicting, or at least contradicting, evidence. On the one hand, we have evidence of linearity, and linearity needs as many neurons as there are features. On the other hand, we have models in production, models out in the world, that seem to do quite well without having nearly as many neurons as they seem to have features. To appreciate why this is a puzzle, you just have to lean on the gut feeling that there are probably more than a couple hundred thousand things you could need to look for in any given piece of text; remember how open-ended all of language is. And this is one of the difficulties in describing what exactly a feature is. One helpful definition is that a feature is a thing that a neuron would be dedicated to in a sufficiently large language model, but we'll get to that shortly. Okay, cool. So this is our puzzle: how is this happening?
There's a great paper that came from a team at Anthropic that, building off previous work exploring this puzzle, suggests a way forward on figuring out what exactly is going on, and on disentangling all the mess that seems to be happening, because, again, something strange is going on. Before this paper came out, the same team introduced the idea of superposition, and superposition is a hypothesis that attempts to answer the riddle of how a model can represent more features than it has neurons.
It effectively says that neural networks are able to do this because they exploit feature sparsity and relative feature importance. The model does not, in fact, do perfectly lossless compression; it trades that off in exchange for representing more features, thanks to a property called feature sparsity. That means that even though the English language, or any arbitrary text in the set of all possible coherent (or even incoherent) English sentences, contains a very large number of features, it turns out not all those features are active at the same time. And this provides an opportunity for a trade-off, where we can say: okay, what if I choose a not perfectly orthogonal set of vectors to represent my features, perfect orthogonality being the requirement for lossless compression? What if, instead, I chose a few more directions than this ideal set of perfectly orthogonal vectors, accepting that each additional feature direction I add introduces some noise? In the compression analogy, I trade off a little bit of noise in exchange for a much wider set of features I can represent. And this only works if all features are not present together, because if all features are present all the time, if you have no sparsity, you will have noise in all your outputs.
And that is what this figure is showing. In the top three boxes, all the different dots are meant to be different features your model cares about: maybe one of them is redness, one of them is blueness, one of them is squareness, and so on, and you have a two-dimensional representation surface. In the case where there's no sparsity, where every feature is as likely to be present as any other, you effectively get what we'd expect: the network only has two directions available, so it represents only the two most important things it cares about. And if they're all equally important, it just picks two features at random, say it keeps a distinction involving redness and gives up on distinguishing squareness from triangleness; it chooses whichever pair gives it the best classification loss on the problem set. But it turns out that as you increase sparsity, say you make redness and circleness not co-occur as often, for example sometimes the image has no shape in the middle and only the color matters, and sometimes the input is completely colorless and only the shape matters, the picture changes. That's what sparsity means here: more than two different properties that don't always coincide. In the 80% sparsity example, as you see here, the network actually chooses to represent more features than it should be able to, which is consistent with what we observe in practice, and in the 90% sparsity case on the right it packs in even more directions, at the price of what's called interference.
Because the whole point of having an orthogonal basis is this: if you have some vector and, say, only two potential orthogonal feature directions, then taking the dot product of the vector against each of them tells you how much the given vector is composed of each of those directions. So if you have some representation of a red square, the dot product against the redness direction tells you "this is really red" or "not that red", and the dot product against the squareness direction tells you how much it looks like a square. That's why these directions need to be orthogonal: they shouldn't interfere with each other; the quality of redness should not bleed into the quality of squareness. So that's why orthogonality is important, and a small sketch of this readout is below.
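Here is a minimal sketch of that dot-product readout with a perfectly orthogonal pair of directions (toy numbers, my own example):

```python
import numpy as np

# Two orthogonal feature directions: reading one out never disturbs the other.
redness = np.array([1.0, 0.0])
squareness = np.array([0.0, 1.0])

# A representation of "a fairly red, fairly square input".
red_square = 0.9 * redness + 0.6 * squareness

print(red_square @ redness)     # 0.9 -- exactly the redness that went in
print(red_square @ squareness)  # 0.6 -- exactly the squareness that went in
print(redness @ squareness)     # 0.0 -- the two readouts don't interfere at all
```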
However, if sometimes you just get a colorless square, or a shapeless color (any potential color with no shape), that is, if there's enough sparsity that these two don't always show up together, and for your problem set you only need to care about them in isolation (you only need to be really good at detecting color sometimes, or detecting shape sometimes), it turns out you would actually be fine choosing directions for squareness and redness that interfere with each other. That is what the bottom example is trying to show. Okay, so in the bottom-left square, let's say our orange vector is the thing we actually want to observe; it's the true input vector. And you see that we have five different directions, five different features, and you take the dot product of this input against all five of them to see how much it has in common with each. You see it lies along one direction exactly; let's say that direction means how red it is. However, because there's interference, the readout also picks up tiny little components along the other features that the input actually doesn't have. Let's imagine that in this case we have a simple input that is just very red, a red blob, just a red image with nothing else in it, so it does align perfectly with one of the features. But because this representation has only two pure dimensions and you're trying to squish five different things inside, it picks up a little bit of a component along other features it actually doesn't have. So that's what interference is.
However, the reason this works anyway is that neural networks have nonlinearities. The activation functions are nonlinearities that are able to turn off these tiny bits of noise. If they didn't have that, the bits of noise would become quite annoying and would count more toward your errors. But in a case where there really is very little contribution coming from the other values in the dot product, the noise can effectively be tuned out. A toy sketch of this is below.
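To make the interference-plus-nonlinearity story concrete, here is a toy sketch (my own construction, not the paper's toy model): five feature directions squeezed into two dimensions, with a ReLU-style threshold cleaning up the interference whenever only one feature is active.

```python
import numpy as np

# Five "feature" directions packed into a 2-dimensional space: they cannot
# all be orthogonal, so every pair overlaps a little (interference).
angles = 2 * np.pi * np.arange(5) / 5
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (5, 2)

def encode(intensities):
    """Linear superposition: a weighted sum of the five directions."""
    return intensities @ features  # shape (2,)

def decode(representation, threshold=0.4):
    """Dot-product readout followed by a ReLU-like threshold that
    silences the small interference terms."""
    raw = features @ representation  # one readout per feature
    return np.where(raw > threshold, raw, 0.0)

# Sparse case: only feature 0 is active. The other readouts pick up
# interference (+0.31 or -0.81 here), which the threshold zeroes out.
x = encode(np.array([1.0, 0, 0, 0, 0]))
print(decode(x))  # ~[1, 0, 0, 0, 0] -- recovered despite 5 features in 2 dims

# Dense case: features 1 and 3 are active at once. Their interference piles
# up and the readout wrongly reports only feature 2, which was never active.
y = encode(np.array([0, 1.0, 0, 1.0, 0]))
print(decode(y))  # ~[0, 0, 0.62, 0, 0] -- sparsity was the whole trick
```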
Right, so now let's imagine the bottom right. In this case, the true input, the thing we care about, is the sum of the two blue vectors: something that is both a really big square and has a really yellow background. So you have squareness and you have yellowness, and since we're operating in a linear-combination regime, the observed vector is the addition of these two. This is why interference matters: for it to have no impact, there should be only a very small number of features, in this case basically one, truly trying to be detected at a time. But when the two blue directions are both truly active, it can look to our neural network as if we were in a different case from the one on the left, and, as we've seen, it's going to end up chipping those two readouts away to nothing with its nonlinearities. It's going to think that instead of seeing a yellow square, it's seeing, say, a circle (maybe that's what the third direction represents), which is complete noise, completely wrong. It ends up ignoring the components of this vector along those two directions as if they were just noise, which is really bad. That is why, in the case where there's no sparsity, where all the features are likely to be active together, the neural network doesn't even bother trying anything funny; as in the top-left square, it simply represents an arbitrary two features.
Or, in a case where it has a sense of the relative importance of features, maybe one feature is way more important for getting the prediction right. To give you an example: one feature of language is what language the text is in, is it English or Spanish, English or Chinese. Another feature is whether the sentence is in the past tense or the present tense. The past-or-present-tense feature helps you avoid grammatical mistakes, but it's fair to assume that at least knowing what language the question or query is in is way more important, in terms of avoiding errors, than detecting whether it's past or present tense. So that's just one abstract illustration: if, on the margin, the model has room to represent one more thing and has to choose between language detection and tense detection, it will most likely use that one extra direction to represent the language (is this English or French or Chinese), as that probably has way more predictive power; it will incur far less error that way than if it predicted the correct tense but in the wrong language. Again, a bit of a toy example. Okay, so how do we solve this?
Right. Given that we suspect this is what models are doing, or rather why they're able to get away with it (they exploit sparsity to compress more features in), the paper I mentioned tries to dig into this by tackling a small model. They take a one-layer transformer and pick out one component of the architecture. As I explained, in a typical large transformer every single component is doing some version of this, since the information flowing through the network passes through each discrete component. Here they focus on the MLP layer, which is what comes after the attention heads, and in their model the dimension of that layer, how many neurons it has, is 512. And what they do, as seen here on the right, is use something called a sparse overcomplete autoencoder. I'll describe what that means, starting from the right.
Okay, so what does that mean? An autoencoder is a neural network whose primary purpose is reconstruction. You have some input, you have something in the middle (your network), and the job of that network is to try to reproduce the input as its output. That seems kind of silly: why bother with this identity transform? Because in some cases you might want to do something along the way, like compress dimensions. Say you have this very large input and you want to find its most important, most critical features by compressing it in the middle and seeing how well you can still reconstruct it: if the input has five dimensions, what are the two most important dimensions, or the two most important combinations of those five dimensions, I can keep and still do well at reconstructing the input? So autoencoders have this property of feature discovery by compression.
"Overcomplete", continuing from right to left, does roughly the opposite of that: instead of compressing, you're trying to expand. You make the middle of the autoencoder, between the input and your reconstruction of the input, much larger than the input, as if to ask: if this neural network representation had way more room to represent stuff, what would it look like? And remember, the whole point of superposition is that we're assuming the model we see is actually trying to simulate a much larger model. So by using this overcomplete autoencoder, we're saying: cool, whatever representation this MLP layer has for some input, what if we gave it way more neurons to work with; what would it do with them? That's the overcomplete part. Then the sparse part of the description addresses the following worry: sure, but if we just go from a five-dimensional inscrutable, compressed, complex thing to a hundred-dimensional inscrutable complex thing, we're not much better off than when we started, and neural networks don't really have any incentive to make things explainable to us.
So the sparsity component says: okay, in addition to giving the network more room to work with, room to expand and show what it has learned, we want to force it to narrow its learnings, its features, down to being active in one node at a time. I explained before (jumping back to this example for a second) that just because linearity says you must have as many dimensions as you have features, it doesn't mean the solution will always be one-hot. There are infinitely many sets of, say, four vectors that form an orthogonal basis without being one-hot, where each feature is kind of smeared across all the different values. But for our convenience, so we can say "that neuron fires a lot when it sees redness", we impose an extra constraint on the autoencoder: don't just find representations with more nodes to work with; when you do this, narrow down your learnings, try to isolate each learned feature to one node at a time. That's basically just for our interpretability benefit.
And that is what a sparse overcomplete autoencoder is. Usually people just drop the "overcomplete" part and call it a sparse autoencoder. It basically says: cool, we want to give the network the opportunity to extract what it has learned by reconstructing its representations using more dimensions, and we want this new extraction to be sparse, in such a way that only one node activates at a time for a given feature.
Effectively, that looks like this. They ran this training process for the sparse autoencoder on the MLP layer of their one-layer network. If you see here, on the left there is this "act 512", which is the activation of the MLP layer; it has 512 dimensions, which just means that instead of the four blocks drawn here there would be 512; that's how big the vector is. So they went from 512 and expanded; they ran different versions, but the largest went up to about 131k. That means that while the input and output of the autoencoder on the right are 512 neurons each, in the middle you have this giant layer of roughly 131,000 nodes trying to reconstruct the input at the output.
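As a rough sketch of what that setup could look like in code (a generic sparse autoencoder with an L1 sparsity penalty; the paper's exact architecture, losses, and hyperparameters are not reproduced here):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: 512 MLP activations -> 131,072 features -> 512."""
    def __init__(self, d_act=512, d_features=131_072):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_features)
        self.decoder = nn.Linear(d_features, d_act)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # nonnegative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coefficient = 1e-3  # strength of the sparsity pressure (made-up value)

# Stand-in for a batch of activations collected from the transformer's
# 512-neuron MLP layer; random data here just so the sketch runs.
mlp_activations = torch.randn(64, 512)

optimizer.zero_grad()
reconstruction, features = sae(mlp_activations)
reconstruction_loss = (reconstruction - mlp_activations).pow(2).mean()
sparsity_loss = features.abs().mean()  # L1 penalty pushes most features to zero
loss = reconstruction_loss + l1_coefficient * sparsity_loss
loss.backward()
optimizer.step()
```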
And they learned a bunch of stuff. They have a very nice interactive application, which I encourage you all to check out, that shows the model learning really interesting things. One of the neurons they discovered (and "neuron" here just means that, because of the sparsity constraint, the model learns to isolate some abstract feature to one of these roughly 131k nodes, so that whenever the feature is present in an input that node fires and screams "this is really here") detects Arabic characters in the input. Another, as you see here, detects whether a sequence of text is probably a DNA sequence, which I thought was pretty wild, because this could easily pass for gibberish, but there are particular patterns, and I guess the particular letters, used for DNA encoding. That seems like such an arbitrary thing for a model to learn, but it did learn it, and you can check out the other esoteric features it learned through this reconstruction. And again, each feature was present in the 512-dimensional MLP activations all along, but because it was all cooped up together in superposition, it was hard to discern; the whole point of the autoencoder is to extract these features out, so that each one becomes one isolated thing.
Which kind of brings us full circle to the definition of what a feature is. I've been throwing around the idea of a feature as just a distinct thing the model finds interesting. One perhaps more narrow definition, I don't want to say formal, just more particular, based on the paradigm we've described, is: a feature is a property that a model would dedicate an entire neuron to, would encode using one neuron, if it had enough neurons. So if there's a thing such that a sufficiently large model would give it its own neuron, that thing is a feature. But if there's a thing that, no matter how many neurons the model had, would never get its own neuron and would only ever be part of some other neuron, then that thing is not a feature. It sounds kind of circular, and it turns out the precise definition of features can be kind of gnarly, but for all practical purposes, think of features in the colloquial sense: a thing the model finds interesting, like squareness or blueness or whatever.
But the interesting thing, I guess, is that part of what makes a more powerful model more powerful is that it can indeed encode more stuff than a small one can, and, as superposition suggests, the smaller ones actually encode a lot more than their size alone might suggest. Because, again, if there were no superposition, if nothing weird were happening, then this MLP layer would only have 512 features, but they were able to extract well over 100,000 reasonable ones. So something of this sort is definitely happening, and these features were consistent with experimental validation: they had different evaluation methods, which you can check out in the paper, to show how much confidence they have in each feature, in terms of coherence, or at least explainability to a real human being. The number of explainable, high-quality features definitely exceeds 512, so this compression is indeed happening, and this is the evidence for it. And, yeah, the future of this work could look like scaling up these autoencoders to work on much larger models and uncovering more useful features going forward.
Awesome, and that is the talk. Thanks for joining, and I encourage you to read more of the papers out there; I think the Anthropic blog posts and papers, both the informal and the formal ones, are a great place to start, as they represent where the frontier is right now. Awesome, thank you for your time, and see you later.