Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm really happy to be here today, and thank you all
for joining. I'll introduce myself. My name is Naomi.
I'm a software developer with previous experience in risk and data analysis.
And this lecture today, string comparison in real life, is inspired by various projects that I've conducted at my workplace. One of my main goals today is to provide you with tools and ideas that you can apply in your day-to-day work. Now, before we dive in and start talking about all
the different methods that we will discuss today, when they're useful and
why, I want to start by sharing with you a story. This is a
story about me in the world of string comparison.
But most of all, this is a story about why what we will discuss today
is really relevant for you. So my story starts about a
year and a half ago. I was at work, and I noticed that one
of my colleagues was struggling quite a bit with comparing two large sets of data, or of strings, to be precise. So this colleague
of mine, he knew that the two sets of data he was trying to compare
were very similar to one another, but not quite identical.
And he wanted, for each pair of strings that he was comparing
to be able to determine the similarity level, or difference
level, if you'd like, by producing some sort of a similarity score.
So he decided to come up with his own set of comparison logic, aiming to tackle the different scenarios that we face when we compare strings. Now, of course, trying to come up with your own comparison logic is difficult, it's complicated, it is not very intuitive. But looking back at what he did, I have to say that he had a point. Because when we talk about string comparison, many things can actually go wrong.
I mean, we have typos and abbreviations. Maybe the
words have been reordered in the sentence or repeated with
no added value to the meaning of the sentence. Or even if we talk about
punctuation, we have dots, commas, and dashes, and many things that could affect the result of the comparison. But most of all, these bring us to one important conclusion: there is no single definition for the similarity, or difference, of strings.
So my goal is that by the end of this lecture, you will
not only be familiar with the different obstacles that you will face when you go to work, but you will also have a wide enough set of tools to make your next data-oriented project efficient, elegant, and useful.
Because at the end of the day, when we talk about string comparison, we talk
about something that has a lot of real-world applications, if we only know how to use it correctly. For example, let's talk about fraud detection. Fraudsters often steal other people's identities in order to obtain a credit line and make financial purchases.
Now, if we are a company working in financial services, we would probably
like to know if the person on the other side of the screen is a
fraudster or suspected to be one. This way we would
be able to prevent a purchase from being made or take additional security
measures in order to verify the identity of the person on
the other side of the screen. So let's make this a bit more practical. Let's say that we are indeed a company working in
financial services, and we have access to a list of identities that
we know are commonly used by fraudsters, either stolen identities
or made up identities. And in this list of identities
we have access to the people's full names, addresses, and other such relevant identifiers. So let's look at George Forge.
George is a legitimate person whose identity has been stolen and
is now being used by a fraudster. The fraudster wants to open an
account in our service by using the info of George. Now as
you can see, the fraudster decided to tweak a few minor details when opening
the account. If we were trying to compare the info that was provided
to us with the info that we already have by using only exact
match, the comparison would fail. But if we were able to use a more flexible string comparison algorithm, we would also be able to say: this looks like usage of George's info. Let's dig in a bit deeper. Another usage
that I want to discuss is flexibility for typos. So let's
say that we are developing a guessing game, and in our game the user needs to guess a word or a sentence, which is practically a string,
by typing in free text. So for long enough strings, we would like to have some flexibility, because let's say that the user missed a character, maybe added one, removed one, or replaced one character with another one. In case a human being would look at the pair of strings and say, this practically looks like the same thing to me, we, as the developers of the game, would like to say: let's not reject the input.
Another usage that I want to discuss is in the medtech industry. Did you know that comparing sequences of DNA is done by comparing sets of strings? We will look at this DNA example in more depth later on in this lecture.
And of course the list could go on and on. There are many different
usages for string comparison. We just need to be familiar with the tools,
and Python gives us some built-in operations for string comparison.
But I want to briefly discuss them and see why they're not
exactly what we're looking for today. The first built-in operation is exact match. When using exact match, even a single change of a character, replacing it with another one or changing it to a capital letter, is already enough to have the exact match return the value false. Now, this is great for many applications, but if we're talking about edge cases and typos, a boolean answer is not good enough for us. Another built-in operation is the ability to check whether one string is contained in the other. If one string is fully contained in the other one, this is great. But if we're talking about partial containment, typos, and other such edge cases, again, this is not flexible enough for our needs. And that is when the comparison methods that we will discuss today get into action.
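To make those two built-in checks concrete, here is a minimal sketch in plain Python (my own illustration, not from the talk's slides; the strings are made up):

    # Exact match: a single changed or capitalized character returns False
    print("hello" == "hello")   # True
    print("hello" == "Hello")   # False

    # Containment: works only for full containment, not partial
    print("ring" in "string")   # True
    print("ring" in "strign")   # False: a typo breaks containment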
So today we will learn about two Python libraries. The first is called jellyfish and the second is called FuzzyWuzzy.
Yeah, don't worry about the bear, we will
get to it later. So, jellyfish, let's begin. Jellyfish is a Python library that has two main sections of functions. The first is edit distance functions. Here, each such function receives two strings and produces a distance score by counting the number of steps required in order to convert string A to string B. The second section is phonetic encoding. This is not about comparison, and we will not discuss it today, but I will briefly mention that here, each such function receives a string and produces a code which indicates the phonetic pronunciation of the string. So under edit distance, we will learn about the Levenshtein distance and the Damerau-Levenshtein distance.
But before doing so, I want to introduce a term, a definition of a string metric. A string metric receives two strings and produces a distance score by applying some sort of magic. As you can see, both the Levenshtein distance and the Damerau-Levenshtein distance are different types of string metrics, each of them with a slightly different logic for calculating the distance score. The Levenshtein distance is a very fundamental and important string metric in the string comparison world. Many other string metrics and string comparison algorithms are based on it, including all the ones that we will discuss today. So the Levenshtein distance counts the minimal number of steps required in order to convert string A to string B. A step is considered an addition of a character, a deletion of a character, or a replacement of one character with another one. The higher the score is, the bigger the difference.
So let's look at a few examples to make this much clearer. In our first example, we can see a comparison between exit and exist. Here, in order to convert string A to string B, all that we would have to do would be adding the letter s. Now, going the other way around, if we would like to convert string B to string A, all that we would have to do would be removing the letter s. So in this case, no matter which direction we choose to take, the minimal number of steps required in order to make the conversion would be one. Now, in our second example, we can see a comparison between great and grate. In this case, in order to convert string A to string B, all that we would have to do would be removing the letter e from the first string and then adding the letter e again at the end of that string. In this case, the minimal number of steps required in order to convert string A to string B would be two. And in our third
example, we can see a comparison between look and lock. In this case, technically, in order to convert string A to string B, we could have removed the second o from the word look and then added the letter c instead, and this would result in two steps. But we also have a step of replacement. So we can replace the second o with c, and then the minimal number of steps required in order to make the conversion would be one.
We will note that replacement is when we are looking at a single index and replacing, at that same index, one character with another one. That is why replacement only takes place in the third example and not in the second example: in the second example, the letter e was relocated from one index to another.
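To see these three examples in code, here is a minimal sketch using the jellyfish library (my own illustration, not a slide from the talk):

    import jellyfish  # pip install jellyfish

    print(jellyfish.levenshtein_distance("exit", "exist"))   # 1: add 's'
    print(jellyfish.levenshtein_distance("great", "grate"))  # 2: remove 'e', add 'e' at the end
    print(jellyfish.levenshtein_distance("look", "lock"))    # 1: replace 'o' with 'c'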
Now we are moving on to the Damerau-Levenshtein distance. The Damerau-Levenshtein distance is very similar to the Levenshtein distance, but it also has an added value: the Damerau-Levenshtein distance counts the swap of two adjacent characters as a single step instead of two steps. So let's look at our next example to make this much clearer. Here we can see a comparison between swap and sawp. Now, the Damerau-Levenshtein distance recognizes that the two adjacent characters a and w have been swapped, and therefore it counts this as a single step. But for the Levenshtein distance, we can see that this would be two steps, because for the Levenshtein distance, in order to make the conversion, we would have to remove the letter w and then add it again, or remove the letter a and then add it again. So this would be two steps.
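Here is the same comparison in code, again a minimal sketch using jellyfish (my own illustration):

    import jellyfish  # pip install jellyfish

    # A swap of two adjacent characters counts as one step...
    print(jellyfish.damerau_levenshtein_distance("swap", "sawp"))  # 1
    # ...but as two steps (remove + add) for the plain Levenshtein distance
    print(jellyfish.levenshtein_distance("swap", "sawp"))          # 2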
Now we are getting to our most important question: when is it useful?
When can we make use of that? So let's look at our next example. Here we can see DNA sequence comparison: taking two sequences of DNA and comparing them in order to determine the distance between them. This is commonly done by using the Levenshtein distance, which is something that, by the way, I find pretty cool.
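As a tiny illustration of the idea (the sequences below are made up, not from the talk):

    import jellyfish  # pip install jellyfish

    # Two short, made-up DNA fragments that differ in two positions
    print(jellyfish.levenshtein_distance("GATTACA", "GACTATA"))  # 2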
And now, moving on to our next example, we can see two comparisons, where the first is Mr. Bean versus Mr Bean, and the second is Johnny Depp versus the same name with two adjacent characters swapped. In the first example, we can see that the only difference is a dot, which, by the way, we know indicates an abbreviation. And in the second example, the only difference is a swap of two adjacent characters. Now, we could say that if we know our expected input is a person's full name, it is very probable that small enough changes, maybe one or two steps of difference, would be small enough for us to say: this is practically the same thing for me, let's not reject the input.
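In code, this might look like the following sketch (the misspelling Jhonny is my own hypothetical example of an adjacent swap):

    import jellyfish  # pip install jellyfish

    print(jellyfish.levenshtein_distance("Mr. Bean", "Mr Bean"))  # 1: the dot
    print(jellyfish.damerau_levenshtein_distance("Johnny Depp",
                                                 "Jhonny Depp"))  # 1: adjacent swap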
Now we're getting to our next question: when is it not useful? So I want to show you the next example. Here we can see I love comparing strings versus comparing strings I love. Now, if we check the distance score, we can see that it is 14, which is quite high. Without any additional information, we would not be able to determine whether the distance score is so high because the strings are indeed unrelated to one another, or whether this is a case where a person, a human being, would look at the pair of strings and say: this is practically the same thing for me, let's not reject the input. And that is when FuzzyWuzzy gets into action.
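A small sketch of the problem (my own illustration; the token-sorting function shown here is one of the FuzzyWuzzy functions we will meet below):

    import jellyfish             # pip install jellyfish
    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    a, b = "I love comparing strings", "comparing strings I love"
    print(jellyfish.levenshtein_distance(a, b))  # 14: high, despite the same meaning
    print(fuzz.token_sort_ratio(a, b))           # 100: same words, different order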
Now, before we talk about FuzzyWuzzy, there's one thing that we have to talk about. I mean, did you know that Fuzzy Wuzzy was a bear that had no hair? So Fuzzy Wuzzy wasn't very fuzzy, was he? Now, apparently Fuzzy Wuzzy is a rhyme, or a tongue twister, and I was not familiar with it at all until I started working with the FuzzyWuzzy Python library. So I guess there's always something new to learn, and today we will learn more about this library. So let's begin.
FuzzyWuzzy is a library that has a few different functions. Each such function receives two strings and produces a similarity score on a scale of zero to 100, where zero indicates no match, the strings are completely unrelated to one another, and 100 indicates either an exact match or something very close to that. Now, today, under FuzzyWuzzy, we will learn about three functions: fuzz ratio, fuzz token sort ratio, and fuzz token set ratio.
So let's start with our first function, fuzz ratio. Fuzz ratio is a function that receives two strings and produces a similarity score based on the following equation. Given two strings, where T is the total number of elements, which is practically characters, and M is the total number of matches, where matches are defined as characters that appear in both sequences and in the same order, the similarity score is given by the equation (2M / T) * 100. Now, to give you a bit more sense, a bit more
intuition about this equation, I want to show you what happens when we are comparing
two identical strings. So here, when we are comparing same and same, we can see that M, the number of matches, is four, and T, the total length, is eight. Then, using our equation, we would see that the similarity score is 100. And if we were comparing two strings that are completely unrelated to one another, we would see that M, the number of matches, is zero, which is also our numerator. So in this case, the similarity
score would also be zero. Now, moving on to a more realistic example, something in between, we can see that the comparison between great and green is a comparison where M, the number of matches, is three, because g, r, and e appear in both sequences and in the same order, and T, the total length, is ten. Therefore, the equation gives us twice three, divided by ten, all multiplied by 100, which is 60.
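In code, a minimal sketch of this function (my own illustration):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    print(fuzz.ratio("same", "same"))    # 100: M = 4, T = 8
    print(fuzz.ratio("great", "green"))  # 60:  M = 3 ('g', 'r', 'e'), T = 10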
Now I want to discuss this example, because I think it's a pretty interesting one.
Here we can see two comparisons, where the first is of two strings that are very similar to one another, apart from a few minor changes, and the second comparison is a versus abcde. Now, if we check the distance score, we would see that the distance score for both comparisons is the same: it is four. But if we check the similarity score, we would see that, using the fuzz ratio function, the similarity score of the first comparison is 96, which, on a scale of zero to 100, is relatively high, and the similarity score of the second comparison is 33, which, on a scale of zero to 100, is relatively low. So this is interesting, because now we can see the added value that fuzz ratio gives us, which the jellyfish functions that we saw before could not give us.
Another thing that we want to take into account when we're using the fuzz ratio function is that it is case sensitive. As you can see here, capital A and non-capital a are considered two different strings. So if we do not care about capitalization, we would want to apply lower on our strings before making the comparison. This case sensitivity is true for the fuzz ratio function, but it is not true for all FuzzyWuzzy functions.
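For example (a minimal sketch, my own illustration):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    print(fuzz.ratio("Apple", "apple"))                  # 80: case-sensitive, 'A' != 'a'
    print(fuzz.ratio("Apple".lower(), "apple".lower()))  # 100 after applying lower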
Now that we are talking about FuzzyWuzzy, and we know that all of the FuzzyWuzzy functions produce similarity scores on a scale of zero to 100, we also want to be familiar with an important concept, which is called a threshold score. So, let's say that we have a project, and in our project we made three comparisons. The first comparison resulted in a score of 90, the second comparison resulted in a score of 95, and the third comparison resulted in a score of 96 or 97 or 98. So we would have to ask ourselves: for which score are two given strings considered identical, or similar enough to one another? And for which score are two given strings considered different? The answer to that is a threshold score. So let's say that in our project we decided that all comparisons with scores of 80 and above would be considered similar, and all comparisons with scores of 30 and below would be considered different. And all comparisons with resulting scores in the range in between would be considered undecided, inconclusive.
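A minimal sketch of this idea in code, using the example thresholds of 80 and 30 from above (my own illustration):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    SIMILAR_THRESHOLD = 80    # scores at or above: considered similar
    DIFFERENT_THRESHOLD = 30  # scores at or below: considered different

    def classify(a: str, b: str) -> str:
        score = fuzz.ratio(a, b)
        if score >= SIMILAR_THRESHOLD:
            return "similar"
        if score <= DIFFERENT_THRESHOLD:
            return "different"
        return "inconclusive"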
So our next question now would be: how do we choose the threshold score? Now, the
full answer for that is a bit more lengthy than the scope of this lecture,
but I will briefly mention that it depends on the requirements and the needs
of our project. For example, it is possible
that in different projects we would have a higher or a lower tolerance
level for what we would call errors, maybe false positives or false negatives. So we would want to adjust our threshold scores accordingly. Also, another thing to
take into account is the characteristics of the data that we
are working with. For example, let's say that we have two different projects.
In both projects we would compare people's full names, but in
one project these would be people coming from the US and in
the second project these would be people coming from Europe. So it
is possible that in different continents, people's names would be longer,
shorter, or more similar to one another. So it is also possible that the same scores in different projects could indicate completely different things. Therefore, we will need to take these considerations into account when we are choosing our threshold scores. Now let's move on to our next function.
Fuzz token sort ratio. Fuzz token sort ratio sorts the tokens, which are practically substrings, before making the comparison. So here we can see that sort and compare versus compare and sort would produce a similarity score of 100. Now, moving on to a more realistic example, we can see Emma walked home with Lisa versus Lisa walked with Emma home. In this case, the similarity score would again be 100. It also teaches us that fuzz token sort ratio is not case sensitive.
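In code (a minimal sketch, my own illustration):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    # Tokens are sorted before comparing, so word order doesn't matter
    print(fuzz.token_sort_ratio("sort and compare", "compare and sort"))  # 100
    print(fuzz.token_sort_ratio("Emma walked home with Lisa",
                                "Lisa walked with Emma home"))            # 100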
Now, moving on to our last function for today: fuzz token set ratio. Fuzz token
set ratio is very similar to the previous function that we saw.
Namely, fuzz token sort ratio, in the sense that it is not case sensitive and it sorts the tokens before making a comparison. But it also has the added value of a set, which means it does not count duplicates. So, as you can see in our next example, sort lower and no repeats versus lower no repeats sort and sort produces a similarity score of 100. Now, if we took the same comparison but used the previous function, we would see that the similarity score is lower, 91 in this case, because the previous function, as we know, does not remove duplicates before making a comparison.
Now moving on to a more realistic example, we can see I love
chocolate and ice cream compared to I love ice cream
and I love chocolate. True story, by the way.
So in this case we can see that the similarity score is 100.
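Here are both examples as a minimal code sketch (my own illustration; the exact wording of the first pair of strings is reconstructed from the talk):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    a, b = "sort lower and no repeats", "lower no repeats sort and sort"
    print(fuzz.token_set_ratio(a, b))   # 100: duplicates are dropped
    print(fuzz.token_sort_ratio(a, b))  # 91:  the repeated token lowers the score

    print(fuzz.token_set_ratio("I love chocolate and ice cream",
                               "I love ice cream and I love chocolate"))  # 100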
Now I want to discuss the usages and advantages of FuzzyWuzzy. FuzzyWuzzy produces similarity scores on a scale of zero to 100 for pairs of strings. It also has tolerance for what we would call typos, which are minor changes in the strings. So it is possible for two given strings that are very similar to one another, yet not quite identical, to produce a similarity score of 100. Because, eventually, FuzzyWuzzy is here to answer the question: would a human being who looks at a pair of strings say, this practically looks like the same thing to me? And if not, to what extent? FuzzyWuzzy also has tolerance for changes in the order of substrings, depending on the function that we choose to use, and it also has tolerance for repetitions, again depending on the function that we choose to use. Another thing
that we want to know about FuzzyWuzzy is that it simplifies the data preprocessing step for us, because when we approach a data-oriented project, we would want to go through the data preprocessing step, which means cleaning the data before making a comparison, and thereby increase the accuracy of our results. So if our data preprocessing step consists of removing duplicates, sorting substrings, and applying lower, we know that FuzzyWuzzy can do all of these things for us.
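As a sketch of what that saves us, here is the manual preprocessing next to the one-liner (my own illustration; the strings and the clean helper are hypothetical):

    from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

    a, b = "Lower SORT and sort", "sort and lower"

    # Manual preprocessing: lowercase, split, dedupe, sort, rejoin
    def clean(s: str) -> str:
        return " ".join(sorted(set(s.lower().split())))

    print(clean(a) == clean(b))        # True
    # token_set_ratio does the equivalent work internally
    print(fuzz.token_set_ratio(a, b))  # 100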
Another thing that we want to know about FuzzyWuzzy, which is also true of the jellyfish functions that we saw before,
is that it is really easy to use. So in case
we have a feature, or maybe a planned feature, that is really nice to have but is not top priority and not very urgent, we would want to be familiar with FuzzyWuzzy and jellyfish, because in this case it would not waste the expensive time of our software developers, data scientists, and data analysts: it is really easy to learn, easy to use, and easy to change whenever we need. Now, if you want to know more about FuzzyWuzzy,
I invite you to check out the articles that I wrote about it and posted on Medium. The first article covers the functions that we learned today, and also covers another function that we did not discuss today. And the second article talks more about the process: it gets into the before and after of making a comparison, goes more in depth into the data preprocessing step, and explains what to do after we have already made a comparison and have scores in hand.
And now let's see what we learned today. Today we learned about two Python libraries: the first is called jellyfish and the second is called FuzzyWuzzy. Under jellyfish, we learned about the Levenshtein distance and the Damerau-Levenshtein distance. And under FuzzyWuzzy, we learned about fuzz ratio, fuzz token sort ratio, and fuzz token set ratio. One last thing before we
finish. If you have any questions, comments,
or simply want to keep in touch, I invite you to reach out to me
using my LinkedIn profile. Also, if you want to get more of the content that I write, I invite you to follow me on Medium. That's it, that's all for today. Thank you all for listening, and I hope you enjoyed the lecture. This was Naomi Kriger, explaining string comparison in real life.