Transcript
This transcript was autogenerated. To make changes, submit a PR.
So before I jump directly into the two terms, Greenfield and Brownfield, I want to give you a very short intro into data labeling. Most likely every one of you has already heard of it, but here are just the very basic concepts. Of course, if you want to build a classifier or an information extractor, you don't only need raw data, you need labeled data, right? So for instance, if we have some email, we want to mark or categorize the different pieces of information that we want to predict using an AI model. This is typically done by manually labeling the data: you have a person labeling the data, and you can then use this labeled data to build your model.
Now, I want to draw an analogy for this kind of data labeling. We talk a lot about Greenfield and Brownfield in IT projects, right? Greenfield is typically when you can start from scratch and build in a completely new environment. On the contrary, in Brownfield you build on an existing system: you have to work with legacy code, with integrations and things like that. So it's not about designing something new, but about improving something continuously. And we can transfer those concepts to data labeling, because we have something quite similar there. In Greenfield labeling, we want to start from scratch. We only have the raw data, but we want to build a proof of concept really fast and go from basically zero to 90, or something like that. And on the contrary, in Brownfield labeling we already have existing training data, but we are fairly sure that there is potential for quality improvements, so that we can continuously work on the data and by doing so improve the performance of our model.
That's basically the idea behind Greenfield and Brownfield data labeling. Put another way: in Greenfield you mostly focus on training data quantity, and you really create a lot of training data. You know that the quality won't be perfect, but you can then improve on it continuously during Brownfield labeling.
So I'm going to talk first about Greenfield, give you some ideas about that and about some really cool technologies. Then we'll move on to Brownfield, and then we'll draw our conclusions. When we talk about Greenfield labeling, we first talk about how you label from scratch and what kind of options you have, and we also want to understand what the real problems are there. So the first two options you have if you want to label data are that you either go via crowd labeling,
which is globally scalable, so you can have lots of people working on
that. But typically you have issues when it comes to
very difficult tasks where you need a lot of domain knowledge,
like for instance at insurance companies. The alternative to crowd labeling is in-house labeling, where you let your in-house experts label the data. But of course, then you don't have the global scale, right?
So that becomes a lot more expensive and also
oftentimes a bottleneck in your projects. So it
really is difficult to create a large training set
that you can start with, that you can use to prove
your concept, basically. And that's why a lot of people
think about how you can automate your labeling such that you can create large
training sets easily. And one of
those ideas is weak supervision, which is basically a machine learning perspective on how to integrate labeling information,
right? The basic concept is quite easy, but you can build really cool applications using weak supervision. The idea is that you come up with heuristics. A heuristic, and we'll go into this in a bit more detail, can be something like a labeling function that is not perfect. It doesn't predict the right label 100% of the time, because then you would already have your classifier, but it gives you the right label in maybe 80% or 70% of the cases, and not for all records, just for some subset. You want to come up with several of those heuristics and compute them for each of your records, so that you end up with a matrix of noisy labels from all of your heuristics. The task of weak supervision is to combine them,
right? And again, weak supervision is not one algorithm but a family of algorithms that you can use to look at your noisy label matrix and come up with the best possible synthesized labels for your data. So most of the time you don't get a discrete label, but a probabilistic label.
One algorithm could be, for instance, majority vote, where you just look at the counts of the heuristics that vote for each record. But you can also go to more sophisticated algorithms that analyze precision, coverage, conflicts and so on, so that you can create really well synthesized, weakly supervised labels. And if we now talk about heuristics, I just want to show you what types of heuristics there can be.
One of the simplest is labeling functions, which can be just a very simple Python function, a few lines of code that takes a record as input and returns the label that you want to predict. Other heuristics can be, for instance, distant supervision, which is basically looking up values that are strongly associated with some label, or active learning modules that continuously learn on the data that you have already labeled manually. It can be zero-shot classifiers, for example from Hugging Face, which are very similar to labeling functions, except that you don't write the code of the labeling function yourself but instead just provide the label names, which is really cool.
It can be something like inexperienced labelers, so still manual labeling, like crowd labelers or interns. And it can of course be anything else that you can integrate: third-party systems, legacy systems. In general you really have a generic interface, and the idea is that you collect noisy labels, the relevance of each heuristic is determined, and you can combine them into weakly supervised labels; a minimal sketch of this idea follows below.
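To make this idea more concrete, here is a minimal sketch in Python, assuming a made-up clickbait detection task. The labeling functions, label names, and example texts are all invented for illustration, and the majority vote here is the simplest possible combiner, not a full weak supervision framework.

```python
import numpy as np

# Three hypothetical heuristics for a made-up clickbait-detection task.
# Each heuristic returns a label for some records and None (abstains) otherwise.
def lf_starts_with_digit(text):
    return "clickbait" if text[:1].isdigit() else None

def lf_contains_question_mark(text):
    return "clickbait" if "?" in text else None

def lf_long_formal_sentence(text):
    return "not_clickbait" if len(text.split()) > 25 else None

HEURISTICS = [lf_starts_with_digit, lf_contains_question_mark, lf_long_formal_sentence]

def noisy_label_matrix(texts):
    """Apply every heuristic to every record -> matrix of noisy labels."""
    return [[lf(text) for lf in HEURISTICS] for text in texts]

def majority_vote(noisy_row):
    """Simplest weak-supervision combiner: count the non-abstaining votes
    and return a probabilistic label (winning class, share of votes)."""
    votes = [label for label in noisy_row if label is not None]
    if not votes:
        return None  # no heuristic covers this record
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)], counts.max() / counts.sum()

texts = [
    "7 tricks insurance companies don't want you to know",
    "Quarterly overview of claim settlement times in the retail segment",
]
for row in noisy_label_matrix(texts):
    print(majority_vote(row))
```

In practice, weak supervision algorithms also weight each heuristic by its estimated precision, coverage, and conflicts rather than counting every vote equally.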
If you can now automate the labeling, we have to think about why we even want to train a classifier at all. And for that, we can do a short
comparison. The main idea is that you label in order to build, whereas you build a classifier for real-time inference. For labeling, the runtime doesn't really matter; you can just run a program overnight, whereas for inference you oftentimes only have milliseconds.
Also, in labeling you can have access to data that potentially is not available at runtime. And you don't aim for something like 100% coverage; instead you want really high confidence. So there is a trade-off, and in data labeling you decide for confidence. Also, in labeling you produce an artifact, something that you can build software with, whereas on the other side that software is what you then use for inference.
So there is this comparison. And what is oftentimes seen, and there are great studies about it, is that if you take the labeled training data from something like weak supervision, the classifier that you train on it oftentimes generalizes very well, so you still gain a lot of value from just training your model on that data.
But as we talk about automating labeling, we also have to talk about the fact that even in Greenfield labeling, manual labeling still matters a lot, because there are different problems here that you want to tackle. You not only want to automate, you also want to explore your data: you want to see what kinds of patterns there are and get familiar with them. For automatic labeling you also oftentimes need some reference data, so that you know how good your automation actually is. You want to measure how good human performance is: is there any subjectivity in my data, such that two people who label the same data might disagree every now and then? And of course, manually labeling data helps a lot when you want to come up with techniques for automation,
right? And also there are different strategies
that you can follow if you want to manually label your data.
For exploring, for instance, you can make great use of neural search, which uses so-called embeddings, vector representations of, for instance, your texts or emails. You can then use that information, or other metadata, to navigate through your data. I'm going to give you an example in a second. For reference data, you can just use random sampling. If you want to understand the performance of people on your data, you can filter for subsets that have already been labeled by other people, so that you can easily calculate something like an inter-annotator agreement. I'm also going to show you this in a second. And of course you also want to come up with new ideas and validate your heuristics, and you can use filters for that as well. I'm also going to show this in a second.
I'm first jumping into neural search. As I just mentioned, you can compute embeddings for your data using, for instance, pretrained transformer models that turn your textual data into numeric vector representations. And if you have those vectors, you can make very good use of them.
So for instance, if you want to find outliers in your data, one very sophisticated approach is diversity sampling. It basically means that you start by grabbing one random sample, like for instance this one, and once you have it labeled, you calculate the record that is most distant to this reference point, which will be, for instance, this green record. If you do this continuously and always pick the vector that is most distant to your current pool of labeled records, you will find the outliers of your data set, which is super interesting because you can then really analyze what kinds of obstacles you will face once your model is deployed and running and you want to infer new predictions with it. So it is always super interesting to understand what kinds of outliers there are in your texts or images or whatever kind of data you have; a minimal sketch of this greedy selection follows below.
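As a rough illustration of this greedy selection, here is a sketch that computes embeddings with the sentence-transformers library; the model name, the example texts, and the helper function are assumptions for illustration, not the exact setup from the talk.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

# A few made-up example texts; in practice these are your raw, unlabeled records.
texts = [
    "7 tricks insurance companies don't want you to know",
    "How to file a claim for water damage",
    "Claim processing times improved in Q3",
    "Win a free vacation by clicking this link now",
    "Annual report of the underwriting department",
]

# Pretrained transformer that turns texts into numeric vectors (example model name).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

def diversity_sample(embeddings, n_samples, seed=0):
    """Greedy farthest-point sampling: repeatedly pick the record that is most
    distant from the pool selected so far -> likely outliers."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(embeddings)))]  # start with one random record
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < n_samples:
        candidate = int(np.argmax(dists))            # most distant from the current pool
        selected.append(candidate)
        # Distance to the pool = distance to the nearest already-selected record.
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[candidate], axis=1))
    return selected

print([texts[i] for i in diversity_sample(embeddings, n_samples=3)])
```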
On the other hand, neural search can also be used to find very representative data, and not just within one cluster. For instance, if you have your embeddings clustered, you can pick a certain number of data points per cluster that really are representative. If you do so, you will very easily explore the data that really helps your model to learn, right? So for instance, if you have these three clusters, you can decide that you want, say, two or three data points per cluster. It really depends on how complicated your text or image data is, but it can really help you to make progress and label efficiently; a small sketch of this cluster-based selection follows below.
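And here is a similarly hedged sketch for picking a few representative records per cluster, reusing the embeddings and texts from the previous snippet and plain k-means from scikit-learn; the cluster count and the number of picks per cluster are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed to be installed

def representative_sample(embeddings, n_clusters=3, per_cluster=2):
    """Cluster the embeddings and pick the records closest to each cluster
    centroid, i.e. a few very 'typical' examples per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for cluster in range(n_clusters):
        members = np.where(km.labels_ == cluster)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[cluster], axis=1)
        picks.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return picks

# `embeddings` and `texts` come from the diversity-sampling sketch above.
print([texts[i] for i in representative_sample(embeddings, n_clusters=2, per_cluster=2)])
```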
So now we have talked about Greenfield labeling, and we saw that it is really worthwhile to automate your labeling and that there are great technologies you can use. I'm going to show some more in a second that are very tightly bound to Brownfield labeling. But you really see that you can automate labeling to some extent using technologies like weak supervision, so that you can achieve the goal of Greenfield labeling much faster, which is that you want to prove concepts, you want to show that some task might be automated using machine learning models. So it is really helpful for that case. And now we are jumping
into the second case, which is: we have now seen that our use case actually works, but we have not yet achieved the performance that we want. We need to continuously improve the performance, and only on the rarest occasions do we really have a super clean data set. In the real world, we mostly have messy data, and that is why in Brownfield we want to improve the data quality, right? For instance, we have these examples from very well-known data sets, and you can see that they have been labeled incorrectly. By correcting them, you would improve the data quality and thus also improve the performance of your model. The problem here is that you don't know which records are the ones that have been mislabeled, and of course you don't want to label everything again. So how do you make the best use of your time and money?
And this is where technologies like confident learning come into play, right? Confident learning basically uses already trained models, like for instance your production model or the model that you use for inference, and it takes the outcomes of the predictions as probabilities, so that you have not just one discrete prediction but one probability per label. Using that, you can compare the predictions to your noisy labels, where a noisy label doesn't necessarily have to be a weakly supervised label; it can be any kind of training data in which there might be some flaws. In confident learning, you try to estimate the joint distribution of your true labels and your noisy labels, so that you can then compute, or at least estimate, the number of errors, right?
So for instance, if we sum up the true cases, we get a probability of 75%, and if we sum up the false cases, we get a probability of 25%. In general, what we could say here is that we most likely have roughly 25% potential errors in our data, which is quite a large number, right? Of course, this is not a perfect computation, but it is an estimate.
What you can now do is use the model outcomes to compute a confidence score per prediction and then sort so that you have the lowest scores first. If you know that you have an error rate of 25%, chances are high that in that first quarter of your data there are a lot of label errors, and by looping over them and investigating them, you will most likely find some labeling issues.
So confident learning essentially helps to estimate how large the error rate is, which ultimately helps you to assess the quality and to find those records which are potentially still mislabeled; a simplified sketch of this idea follows below.
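Here is a strongly simplified sketch of that idea, not the full confident learning algorithm, just the two steps described above: estimating the error rate from the mismatch between given labels and model predictions, and sorting records by how much confidence the model has in their given label. All values below are toy numbers.

```python
import numpy as np

def estimate_label_issues(pred_probs, noisy_labels):
    """pred_probs: (n_records, n_classes) probabilities from an already trained model.
    noisy_labels: the (possibly flawed) labels currently in the training set."""
    n_classes = pred_probs.shape[1]
    predicted = pred_probs.argmax(axis=1)

    # Rough stand-in for the joint distribution of given vs. predicted labels:
    # the off-diagonal mass serves as an estimate of the share of label errors.
    joint = np.zeros((n_classes, n_classes))
    for given, pred in zip(noisy_labels, predicted):
        joint[given, pred] += 1
    joint /= joint.sum()
    error_rate = 1.0 - np.trace(joint)

    # Self-confidence: probability the model assigns to the *given* label.
    # Records with the lowest self-confidence are reviewed first.
    self_confidence = pred_probs[np.arange(len(noisy_labels)), noisy_labels]
    review_order = np.argsort(self_confidence)
    n_review = int(np.ceil(error_rate * len(noisy_labels)))
    return error_rate, review_order[:n_review]

# Toy example: 3 records, 2 classes.
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
noisy_labels = np.array([0, 0, 1])
print(estimate_label_issues(pred_probs, noisy_labels))
```

Libraries such as cleanlab implement the actual confident learning estimators; this sketch is only meant to convey the intuition.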
Then again, if we use technologies like weak supervision, which makes use of lots of heuristics, you can in many cases also debug those heuristics. And this is what I meant when I said that manual labeling is also very important for Brownfield labeling. For instance, say you have set up a labeling function that looks at whether a text starts with a digit and then predicts that it is clickbait. What you can do in Brownfield labeling is filter for the records that are hit by this labeling function and then try to investigate in which cases the heuristic is wrong. For instance, we might see that clickbait is only the case if the text starts with a digit and the sentiment is also rather positive, so that we can narrow down our heuristics, they become better over time, and we can work on them and debug them, as sketched below. Of course, this only works when you use weak supervision.
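A small, hypothetical example of that debugging loop, reusing the digit-based clickbait labeling function from the earlier sketch; the manually labeled records are invented.

```python
# Labeling function from the earlier weak-supervision sketch.
def lf_starts_with_digit(text):
    return "clickbait" if text[:1].isdigit() else None

# Records that have since been labeled manually (made-up examples).
manually_labeled = [
    ("7 tricks insurance companies don't want you to know", "clickbait"),
    ("3 changes to the claims process effective next quarter", "not_clickbait"),
    ("Why your premium went up this year", "clickbait"),
]

# Filter for the records this heuristic actually hits, then inspect where it is wrong.
hits = [(text, gold) for text, gold in manually_labeled if lf_starts_with_digit(text) is not None]
wrong = [(text, gold) for text, gold in hits if lf_starts_with_digit(text) != gold]

print(f"precision on hits: {1 - len(wrong) / len(hits):.2f}")
for text, gold in wrong:
    print("heuristic disagrees with manual label:", text, "->", gold)
```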
Then again, if you also label with multiple users, and you use strategies to label data that has also been labeled by other people, you can calculate the inter-annotator agreement to see where potential disagreements are and how subjective labeling might be in your task; a minimal sketch of this follows below.
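A minimal sketch of such an agreement check, using Cohen's kappa from scikit-learn on a made-up subset that two annotators both labeled.

```python
from sklearn.metrics import cohen_kappa_score  # assumed to be installed

# Labels from two annotators on the same records (invented values).
annotator_a = ["clickbait", "clickbait", "not_clickbait", "clickbait", "not_clickbait"]
annotator_b = ["clickbait", "not_clickbait", "not_clickbait", "clickbait", "not_clickbait"]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values well below 1.0 hint at subjectivity or unclear labeling guidelines.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
```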
But you can also, again, if you use something like weak supervision, use your existing heuristics to estimate where there might be some explicit bias, right? For instance, if my coworker Zimon and I both labeled data covered by the same heuristic, and he takes a very different position than I do, it might be that we need to talk about this heuristic, because we have a different understanding of it. So it really helps us understand the bias that we potentially have.
So you see, for both Greenfield and Brownfield, we have a
lot of very interesting technologies that we can use to
really help us create the training data that we need, both for
creating prototypes quickly,
but also to continuously improve our models.
So if we now think about how data labeling can change over time, we can also ask: if training data is an integral part of machine learning applications, what will this look like with respect to maintenance and documentation? Because if it is an integral part of the software, it should, at least arguably, also be documented. And that documentation would look essentially different from a software artifact, right? You don't have a docstring, so what could it look like? Again, this is just something I want to offer as food for thought, not a definitive answer, but something to think about because it's quite interesting. And also,
if we talk about labeling, we see that technologies like neural search or weak supervision now create lots of metadata. Will this potentially shift our idea of labeling, which is where we have focused so far, towards a more holistic idea of data enrichment? For instance, if we're able to quickly automate data labeling, this might lead to us creating more classifiers in a shorter time, which ultimately helps us iterate on product reviews, product feedback and things like that. So what will the future of this look like, and what part will data labeling play in building machine learning applications?
So, as promised, if you're interested in those
techniques or want to research a bit about it,
I have some great resources that
you can look into. And if this is not enough, you can also
just reach out to me and I will show you some further cool resources
because those are very interesting and state of the art technologies.
And we also have the technologies I just described integrated into an open source application that you can try out if you want to. We're going to publish it very soon, and if you're interested, you can just register for our newsletter on our website. And we will reach out to you once we
have published it. So thank you
so much for your attention. I'm super happy to be able to
talk here at Conf 42. If you have any further questions,
please don't hesitate at all to contact me. And thank you so much
for your attention.