Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Akshay Jain, an engineering manager at Innovate UK. Innovate UK is a government-backed fund which helps organizations by providing funding to execute their projects. At present I work on data migration projects, where my primary responsibilities include executing data migrations, data integrations, and implementing NLP techniques to help resolve some machine learning related use cases.
The topic for today's talk is topic modeling for text documents using NLP techniques. In this seminar I'm going to share my journey of how we are trying to solve some of the use cases we have around text document analysis, and how we are solving them using NLP techniques. Let me walk you through some of the use cases and challenges we have, and alongside that I will share some of the solutions and the things we are implementing to meet those challenges. So let me go back to
the use cases. On the use cases side, we primarily have four kinds of use cases to solve right now. The first use case is how we can identify entity names within the documents. Generally what happens is that, as it is a kind of venture capital organization, lots of people submit their applications with the intention of raising funds. So a lot of documentation comes in where people provide information such as the description of their work, its purpose, and things like that. All of that arrives in the form of large documents, and what we try to identify in a document is all the different entities that are involved, just to make sure we are not putting government money into any sanctioned companies or sanctioned persons or something like that.
The second kind of use case we have is that, for whatever applications we are getting, we may want to identify whether the documents have a certain similarity or not. We have two purposes for this: one is to identify the different segments or industry sectors from which we are getting applications, and the second is to identify cases where people are submitting their own applications multiple times with wording changes and things like that, so we don't spend effort assessing the same application again and again. That's the second use case, where the purpose is to clean the application data and identify the similarity between the documents.
The third use case we have is to understand the sectors in which people are submitting their applications, how those applications relate to each other from an industry sector perspective, and where the funding is coming from, just to understand the market conditions. And the fourth use case is to support a kind of ecosystem where we can say in which particular subcategory, under the industry codes, more and more money is getting funded, or more and more applications are coming into the market.
From all those aspects, we try to identify and analyze all the applications we are getting, and for that purpose we are trying to build a system which can help us resolve all of these kinds of problems and give us a concrete solution around them. So let me walk you through the journey where we have solved some of these use cases; some of the use cases are still a work in progress. I'll be walking you through that journey in that particular sense. The first
use case is to identify the entities in the documents. We generally have lots of textual information in our documents: what the purpose of the fundraising is, how it is going to help them, what kind of work they are building, what kind of partnerships they have, and what different people and entities are involved in those things. So we generally get a lot of information from applicants in the form of text and documentation on those kinds of things. What we try to identify is which entities are involved: which country the funding is being requested from, who the people asking for funding are, and some further details we want to extract from those documents to ensure we are handling the applications as per the government guidelines, and that no compliance issues arise there.
In those cases, one of the things we have done is work with whatever documentation and text we are getting. Here is one example of a text which I took randomly from the Internet, out of a news article. It specifically mentions something about Andy Murray and how things are going on the tennis side; I just took an extract out of it. On this particular extract, if I want to identify the entities that occur, in terms of people, countries, dates and other factors, I can use some of the available NLP libraries to identify that kind of information. One of the libraries that supports identification of all these details with a very minimal amount of coding is the spaCy library.
In spaCy, what we can do is load whichever model we like to use. spaCy provides multiple models; in this example I'm showing the usage of the en_core_web_lg model. You can use any other language model as well, whatever works for you. There are specific models that have been built to analyze news-related articles, web-related content and things like that. So you can choose which kind of model works best for you and load it into spaCy. After you load that model, you can very easily analyze the textual information using this library, and it will tell you which entities and entity types are involved in that particular document, based on the supported entity types.
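For reference, here is a minimal sketch of that spaCy flow. The sample sentence is only illustrative (it is not the exact article from the slide), and it assumes the en_core_web_lg model has already been downloaded.

```python
import spacy

# Load a pretrained English pipeline.
# Assumes: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Illustrative text, standing in for the news extract on the slide.
text = ("Andy Murray beat the world number one in Dubai on Saturday, "
        "two weeks before the Qatar Open.")

doc = nlp(text)

# Print every detected entity with its label (PERSON, GPE, DATE, EVENT, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```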
So if I run this code on the article I showed you earlier, it is going to give me an output which looks something like this. It is going to show me that in this particular article these are the different persons involved, and these are the geopolitical locations that have been identified; in this article it has identified the location as Dubai, as you can see. It is also able to identify events, for example the Qatar Open is one of the events it has identified. It also provides details about the various dates, like whether it's a particular day or some other date-related value. It is also able to identify the cardinals, the numerical information we have within the text, like two or six or any other numbers that appear there. So those kinds of information are something we can very easily identify, and once we identify them, this information can be stored for querying purposes, just to check whether a particular kind of entity, person, organization name or something like that is involved in a particular application or not.
spaCy has inbuilt support for all these different entity types, like person, organization, geopolitical location, product, law, date, time, etcetera. You are able to categorize whatever textual information you have into these categories, and based on that you can store the data and use it further for querying purposes, to perform some of the compliance-related checks and things like that. This is the way we have started at Innovate UK, and we are progressing further with using the information in this particular manner.
Now, the next use case we have is to identify similar documents. As I mentioned in the scenario, what happens is that an application gets submitted and generally goes through a cycle where it is reviewed by multiple subject matter experts, depending on which field people are looking for funding in. It gets analyzed in that particular way, and based on that a decision is generally taken on whether to give funding or not.
The general scenario we see is that people submit applications, and if an application gets rejected, they just go and make wording changes here and there: they change some words, move a paragraph around, add some more details, and then resubmit the application. So what we basically try to identify here is how similar two applications are to each other, and if applications are similar, or are submitted across multiple categories or things like that, then we want to identify those applications so we can arrange and manage them properly.
In order to identify those kinds of textual similarities, we basically use two kinds of methodology. The first methodology is to use the inbuilt functionality of the spaCy library, where again we are going to use some kind of language model. When we provide a particular document as an input, it processes the document and applies its inbuilt algorithm, which is primarily a TF-IDF-style, vectorization-based approach, and as an output it tells us how similar the two documents, or the raw information mentioned in those text documents, are to each other. If that similarity crosses a certain threshold, then we flag that these two applications could be similar across categories. Or if, let's say, an application that was rejected in the past has a similarity to the current application, then we process it in a different way and deep dive into whether this application has been submitted again from the same source with some changes, or is a new application altogether. Those are the kinds of use cases we can solve with this functionality, and because it comes built into the spaCy library, you can solve this particular problem and arrive at a solution with a very minimal amount of code.
Like here in this example, you can see that the first sentence is "I like salty fries and hamburgers" and the second one is "Fast food tastes very good." spaCy basically takes those words, applies lemmatization and related normalization techniques on them to bring the words back to their base forms, and then calculates the similarity score. Based on that, you can see the similarity score it has generated, and that score is something that can be used further to identify similar documents at a particular level.
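A minimal sketch of that built-in similarity check, using the same two example sentences; it assumes a spaCy model with word vectors such as en_core_web_lg.

```python
import spacy

# A model with word vectors is needed for meaningful similarity scores.
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# similarity() returns a score between 0 and 1, based on the
# averaged word vectors of each document.
print(doc1.similarity(doc2))
```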
Generally, when we apply any of these kinds of processes with the spaCy library or something like that, it is fine to apply them directly on the raw data to a certain extent, because the library uses its own internal algorithms to implement tokenization and lemmatization before applying the respective algorithms, so things work out smoothly there. But in more advanced use cases, where let's say we are trying to implement some additional logic to compare text documents based on cosine-similarity-related algorithms and vectorization techniques like that, the first thing we generally do is preprocess the text information. On that text we apply some data cleansing: we make sure everything is in lowercase, we strip the punctuation and the commonly used stopwords, and we apply tokenization and lemmatization. What lemmatization generally does is bring the words within a sentence back to their root form, so that when we are comparing words they are on the same level playing field. Then, if we apply any kind of algorithm on top, for example TF-IDF or some kind of n-gram technique, those calculations become much more effective for computing the scores and for further usage. There are a number of techniques that can be used for this, and NLTK is a general-purpose library which provides a lot of functionality to implement those kinds of steps with a very minimal amount of coding.
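As an illustration, here is a minimal preprocessing sketch with NLTK covering the steps just described (lowercasing, tokenization, stopword and punctuation removal, lemmatization); the pipeline itself is only an example, not the exact one used on the project.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, drop punctuation/stopwords, and lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok)
            for tok in tokens
            if tok not in stop_words and tok not in string.punctuation]

print(preprocess("The companies were building innovative sensors for farms."))
```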
Now, once we have cleaned the data, we go to the next use case. At the next level, what we try to identify is, for whatever applications, details and textual information we are getting, how that information is segregated across the different industry sectors and things like that. That way we understand from which particular sectors more and more funding requirements are coming, or what kind of growth we are seeing. So (a) we can understand the industry trend, and (b) we can manage our capacity to assess those applications in a particular way.
In order to do that, we are building techniques using clustering algorithms, which basically help us identify which clusters the textual information we are getting belongs to. To build this, as I mentioned earlier, we first clean the data. After that we apply some kind of vectorization on it, and once we have the vectors we go for clustering algorithms, to see what kind of clustering works best for us. We eventually started with K-means clustering. This is one of the outputs from K-means on some sample data, where we can see very clearly that with K-means clustering we are able to segment the data and tell which clusters those applications belong to. Based on that, the segmentation can be used further and can help us understand the applications in a proper way.
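A minimal sketch of that vectorize-then-cluster step; scikit-learn's TfidfVectorizer and KMeans are my assumption here (the talk does not name a specific library at this point), and the toy application summaries are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy application summaries standing in for the real (cleaned) documents.
applications = [
    "battery storage for renewable energy grids",
    "machine learning for crop yield prediction",
    "novel electrode materials for lithium batteries",
    "computer vision to monitor livestock health",
]

# TF-IDF vectorization of the cleaned text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(applications)

# Cluster the vectors; the number of clusters is a choice made per dataset.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

for doc, label in zip(applications, labels):
    print(label, doc)
```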
The other reason why we need this is because there are chances that a company is working in sector A, but when they submit their application and its details, that application may belong to sector B. Those kinds of things are very common, because companies generally innovate either in their own field or in other fields as well. So we try to capture those cases, and we also try to understand what kind of overlap we are seeing between industries in terms of innovation and the kind of work they are doing. This kind of clustering technique basically helps us identify those things and provide answers on that side. So here, K-means is one of the clustering techniques we have used, and it has worked very well for us to segregate our data in a certain way.
The other thing we have tried out is fuzzy c-means clustering, which has generated a comparatively better output for our data sets, because of the characteristics of the data and how the words in the textual information are connected to each other in the vector space. Based on those particular inputs, we were able to generate clusters out of it and understand those things in a more appropriate manner.
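A hedged sketch of fuzzy c-means on the same kind of TF-IDF vectors, assuming the open-source scikit-fuzzy package (the talk does not name a specific implementation); unlike K-means, each document gets a degree of membership in every cluster.

```python
import numpy as np
import skfuzzy as fuzz
from sklearn.feature_extraction.text import TfidfVectorizer

applications = [
    "battery storage for renewable energy grids",
    "machine learning for crop yield prediction",
    "novel electrode materials for lithium batteries",
    "computer vision to monitor livestock health",
]

# Dense TF-IDF matrix; scikit-fuzzy expects shape (features, samples).
X = TfidfVectorizer(stop_words="english").fit_transform(applications).toarray()

cntr, u, *_ = fuzz.cluster.cmeans(
    X.T, c=2, m=2.0, error=0.005, maxiter=1000, seed=42
)

# u holds the membership of each document in each cluster.
print(np.argmax(u, axis=0))  # hard assignment per document
print(u.round(2))            # soft (fuzzy) memberships
```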
Now, based on this information, we are able to categorize the data, and we can say that these are the kinds of industry categories that these applications belong to, and that can be used further on that side. All of these things are implemented generally using Python libraries, and with that we can build the pipelines on top of AWS using SageMaker notebooks or something like that, and in that way the whole pipeline can be run in a particular manner along with this.
The other thing we also want to identify is whether the items that end up in the same cluster really are similar to each other, and how they help us understand things in a particular way. In order to identify those kinds of details and some of the scenarios around them, what we generally do is, after clustering the applications, we also generate bigrams and trigrams on top of them to identify the frequently used keywords and terms that are present in those particular clusters. Using the libraries and techniques around that, we are able to pull out the common keywords and terms we are seeing, and there is a certain set of keywords which we take out of it, just to understand that these are the most highly used keywords in these applications.
We can implement this using these bigram and trigram techniques, and around that we can also get some figures, like the numbers you can see here for some dummy data showing how those particular n-grams are being used and what kinds of occurrences we are seeing, generally in terms of term frequencies and inverse document frequencies. We then look at how those terms are being used across applications, to understand application similarity, dependency and the sectors' influence on each other. In that way it helps us get that information and process it further on that side.
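A small sketch of that bigram/trigram frequency count; it reuses the preprocessed tokens from earlier, and the token list here is invented for illustration.

```python
from collections import Counter
from nltk.util import ngrams

# Tokens from one cluster's documents, after the preprocessing step above.
tokens = ["battery", "storage", "renewable", "energy",
          "battery", "storage", "grid", "energy", "storage"]

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Most frequently used word pairs and triples within the cluster.
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```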
Once we get that information, the next level is something which is still in progress; we are still working on it. What we're trying to identify is this: okay, we have identified what clusters we have, and we have identified what kinds of topics we have in those particular clusters, but the missing part is what kind of hierarchy exists between those topics or clusters. For that purpose, we are experimenting with some algorithms and techniques to identify those things. One of the techniques we have used so far is agglomerative clustering. With this technique we are trying to understand how these topics are related to each other, and whether we are able to build some kind of a graph around them, where at the root node we can see a cluster as a category, at a sub-level we can see the different hierarchies, and at the leaf level we can see the different tags or topics that have been identified within that cluster, so we can use it in that particular manner. This is again in progress, and once we crack it, in a future talk related to machine learning or something like that, I will be happy to share those kinds of insights in terms of what worked out and what did not. Right now the results we are seeing with this agglomerative clustering technique are not that satisfactory, and we are trying to improve them with other approaches.
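For context, a minimal sketch of agglomerative clustering over topic vectors; scikit-learn and SciPy are my assumption here, and the topic strings are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage

topics = [
    "battery storage grid",
    "lithium electrode materials",
    "crop yield prediction",
    "livestock health monitoring",
]

# Dense TF-IDF vectors for each topic label.
X = TfidfVectorizer().fit_transform(topics).toarray()

# Flat cut of the hierarchy into two groups.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(model.fit_predict(X))

# Full merge tree, which could back the root/sub-level/leaf view described above.
Z = linkage(X, method="ward")
print(Z)  # each row: the two merged clusters, their distance, and the new size
```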
So yeah, that's how we are basically trying to solve the use cases around machine learning using topic modeling. In order to build these things we are generally using open source Python libraries. While performing these techniques, the clustering and the bigram and trigram analysis, we also got some exposure to different libraries like KeyBERT and other similar libraries, and those libraries do generate a reasonable amount of results. But one of the things we identified is that, rather than going for KeyBERT and other such libraries, in our experience, for our particular domain, we got more positive results by using basic Python techniques around n-grams and the term-frequency and TF-IDF related techniques.
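For completeness, a hedged sketch of the KeyBERT-style keyword extraction mentioned above, assuming the open-source keybert package and its default sentence-transformers model; as noted, in our case the simpler n-gram and TF-IDF counts served us better.

```python
from keybert import KeyBERT

doc = ("The project develops battery storage systems that help "
       "renewable energy grids balance supply and demand.")

kw_model = KeyBERT()  # uses a default sentence-transformers model

# Extract the top keyphrases of one or two words.
keywords = kw_model.extract_keywords(
    doc, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=5
)
print(keywords)  # list of (phrase, relevance score) pairs
```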
So yeah, that's all for this particular presentation. My goal was just to give you some detail in terms of what kinds of use cases exist and what kinds of approaches generally work out in the industry in that particular manner. That's all from me in today's presentation. If you want to learn more about it or just want to be in touch, please see my contact details. Feel free to connect with me on LinkedIn, or feel free to reach out over email to discuss any of the possible challenges or topics you would like to explore around the use of topic modeling and analyzing the data around us, and I would be happy to connect and share more details there. So thank you, thank you for your time, and thank you for listening to this session.