Abstract
Natural Language Processing (NLP) is an interesting and challenging field. It becomes even more interesting and challenging when we take into consideration more than one human language. When we perform NLP on a single language, there is a possibility that interesting insights from another human language might be missed. Interesting and valuable information may be available in other human languages such as Spanish, Chinese, French, Hindi, and other major languages of the world. Also, the information may be available in various formats such as text, images, audio, and video.
In this talk, I will discuss techniques and methods that help perform NLP tasks on multi-source and multilingual information. The talk begins with an introduction to natural language processing and its concepts. Then it addresses the challenges of multilingual and multi-source NLP. Next, I will discuss various techniques and tools to extract information from audio, video, images, and other types of files using the PyScreenshot, SpeechRecognition, Beautiful Soup, and PIL packages, as well as extracting information from web pages and source code using pytesseract. Next, I will discuss concepts such as translation and transliteration that help bring the information into a common language format; once the information is in a common language format, it becomes easy to perform NLP tasks. Finally, I will explain, with the help of a code walkthrough, generating a summary from multi-source and multilingual information in a specific language using the spaCy and Stanza packages.
Outline
1. Introduction to NLP and concepts (05 Minutes)
2. Challenges in multi-source multilingual NLP (02 Minutes)
3. Tools for extracting information from various file formats (04 Minutes)
4. Extract information from web pages and source code (04 Minutes)
5. Methods to convert information into common language format (05 Minutes)
6. Code walkthrough for multi-source and multilingual summary generation (10 Minutes)
7. Conclusion and Questions (05 Minutes)
Transcript
This transcript was autogenerated.
Hello everyone. My name is Gajendra Deshpande, and today I will be presenting a talk on multilingual natural language processing using Python. In today's talk we will briefly discuss natural language processing and its concepts, then the challenges in multi-source multilingual natural language processing, tools for extracting information from various file formats, extracting information from web pages and source code, and finally methods to convert information into a common language format. Let us first look at a few of the basic concepts of natural language processing. The first
one is tokenization. In tokenization, we split a paragraph into words and sentences. You will typically be given a huge text, and it is not possible to perform computation or processing on the entire text at once, so we need to tokenize it. For example, we may have to compute word frequency and sentence frequency, and we may also need to perform n-gram analysis. The next is
word embeddings, where we represent words as vectors, that is, we convert words into a numeric format for computation. Then text completion, where we try to predict the next few words in a sentence. Then sentence similarity, where we try to find the similarity score between two sentences, on a scale from zero to one. Then normalization, where we transform the text into its canonical form. Then transliteration: writing the text of language A using the script of language B; for example, you can write Korean using the English script. Then translation: converting the text in language A to language B, that is, directly converting Korean text into English. So there is a difference between transliteration and translation; the two are totally different. Then phonetic
analysis, where we try to determine how characters sound when we speak them. The next is syllabification: converting text into syllables. Then lemmatization, where we convert words into their root form; for example, if there is a word "running", its root form will be just "run". The next concept is stemming. It is a bit similar to lemmatization, but in stemming we just remove the last few characters from the word. Its results may sometimes resemble lemmatization, but they are not as accurate: you will not always obtain the true root form with stemming. Then language detection: detecting the language of the text or words.
Then dependency parsing: analyzing the grammatical structure of a sentence. Then named entity recognition: recognizing the entities in the text, for example names, places, et cetera. Then part-of-speech tagging: tagging the parts of speech in a text.
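To make these concepts concrete, here is a minimal sketch using spaCy (this example is not from the talk itself); it assumes the small English model has been downloaded with python -m spacy download en_core_web_sm, and the example sentence is invented:

    # Tokenization, lemmatization, POS tagging, and NER with spaCy
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The tortoise was running slowly while the hare slept in Paris.")

    for token in doc:
        print(token.text, token.lemma_, token.pos_)   # token, root form, part of speech

    for ent in doc.ents:
        print(ent.text, ent.label_)                   # named entities, e.g. Paris -> GPE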
Now, the challenges in multilingual NLP. The first challenge is that language is ambiguous: the same sentence may mean different things in different languages, and even within a single language we need to identify the context of the words. Then, languages have
different structure, grammar, and word order. For example, we have Western languages, Indian languages, and other languages, and each language has its own grammar and syntax. Some languages are read left to right and some are read right to left. Then, it is hard to deal with mixed-language information. We know that due to globalization people speak multiple languages, and when they do, there is a general tendency to mix words from different languages. When we have data of this kind and need to process it, it may create problems, because our libraries may not be able to detect or identify the words of the other languages. The next challenge: translation from one language
to another language is not accurate. Translation is not always word-to-word; the meaning needs to be taken into account. Then, language semantics need to be taken into account. That is what I was just discussing: translation is not word-to-word, so we need to take the context into account for a more accurate translation. Then, lack of libraries and features: of course there are many libraries in Python, but these libraries may not support all languages and may be limited in their features. In that case we may have to use multiple libraries or hard-code many features. Now let us consider a scenario of multi-
source, multilingual information processing. Say, for example, we need to generate a summary. In that case, these are the steps: the information source, which may be in different formats; extract the text; identify the language; translate to a source language; process the text; and finally translate to the target language. Let us discuss these steps in detail. First, the information source:
our information may be present in various formats: it may be present in text, in audio, in video, or in an image. But for processing we need the information in textual format. If the information is available in text format, then it is not a problem. But if it is available as audio, video, or an image, then we need to extract the text from the audio, video, or image. And we know that there are various formats for audio, video, and images, so we need to try to extract the information from as many formats as possible.
In the second step, we extract the text using the libraries available in Python. Next, identify the language: here we try to detect the language of the text. This is a bit challenging because all of the text may not be in a single source language, and some words may not be identified because of feature limitations; in that case we may have to hard-code some features. The next step is to translate to a source language. Here there is one important step we need to perform: translate the entire text into one single language so that we can perform processing over the text. This is a very important step. Once we translate the information into one language, we can process the text. For example, if we have to generate a summary, we can perform tokenization and lemmatization, calculate word frequency and sentence frequency, and perform n-gram analysis, and based on these steps we can pick the top n sentences for the summary. Finally, translate to the target language: you can generate the summary in the source language, or you can generate it in a specified destination language.
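As a rough illustration of the frequency-based summarization idea described above (the talk's own walkthrough uses spaCy and Stanza; this simplified sketch, including the example text and the tiny stop-word list, is my own assumption):

    import re
    from collections import Counter

    def summarize(text, n=2):
        # Split into sentences and words (very simplified tokenization)
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        words = re.findall(r'\w+', text.lower())

        # Word frequencies, ignoring a tiny illustrative stop-word list
        stop_words = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}
        freq = Counter(w for w in words if w not in stop_words)

        # Score each sentence by the frequencies of its words
        def score(sentence):
            return sum(freq[w] for w in re.findall(r'\w+', sentence.lower()))

        # Pick the top n sentences, keeping their original order
        top = sorted(sentences, key=score, reverse=True)[:n]
        return " ".join(s for s in sentences if s in top)

    print(summarize("The tortoise kept walking. The hare slept under a tree. "
                    "In the end the slow and steady tortoise won the race. "
                    "The hare woke up too late.", n=2))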
Now let us look at a few Python packages that will help us achieve our task. The first one is googletrans 3.0.0. It is a free and unlimited Python library that implements the Google Translate API. It uses the Google Translate Ajax API to make calls to methods such as detect and translate. It is compatible with Python 3.6 and higher versions. It is fast and reliable because it uses the same servers that translate.google.com uses. Auto language detection is supported, bulk translations are possible, a customizable service URL is supported, and it also supports HTTP/2. You can install it using the pip command, pip install googletrans, and it will be installed on your machine.
If the source language is not given, Google Translate attempts to detect the source language. You can see the source code here: first we import Translator from googletrans, then we use the translate function. Note that in the first example we have only supplied Korean text but have not specified the source language, so it detects that it is Korean, and since the destination is not specified, it is translated to English by default. In the next case we have not specified the source language, but we have specified the destination language, so the source language is detected as Korean and the destination language is Japanese; that is, the Korean text will be converted to Japanese text. In the next example we have specified some text and specified that the source language is Latin; since we have not specified the destination language, the Latin text is converted to English.
You can also use another Google Translate domain for translation: if multiple service URLs are provided, it randomly chooses a domain. So you can specify either one domain or multiple domains; if you specify multiple domains, it selects one randomly. Then the detect method, as its name implies, identifies the language used in the given sentence, so you can use it to identify the language of the given text. An important point here is that this library is unofficial and unstable, and the maximum character limit on a single text is 15k characters. The solution is to use Google's official Translate API for serious requirements.
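A minimal sketch of the googletrans usage described above; the example texts follow the library's documentation, and the exact behavior may differ between googletrans releases:

    from googletrans import Translator

    # Optionally pass one or more service URLs; with several, one is chosen at random
    translator = Translator(service_urls=["translate.google.com", "translate.google.co.kr"])

    # Source language auto-detected (Korean), destination defaults to English
    print(translator.translate("안녕하세요.").text)

    # Source auto-detected, destination explicitly Japanese
    print(translator.translate("안녕하세요.", dest="ja").text)

    # Source explicitly Latin, destination defaults to English
    print(translator.translate("veritas lux mea", src="la").text)

    # Detect the language of a piece of text
    detection = translator.detect("이 문장은 한글로 쓰여졌습니다.")
    print(detection.lang, detection.confidence)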
The next is SpeechRecognition. Here we try to extract text from an audio file. It is a library for performing speech recognition with support for several engines and APIs, available online as well as offline. The speech recognition engines and APIs supported by the SpeechRecognition package are CMU Sphinx (works offline), Snowboy hotword detection (also works offline), Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, Houndify API, and IBM Speech to Text. All of these engines
are supported. This is how we write the code: first we import the speech_recognition package, then we specify the file name from which we want to extract the text. Then we use the Recognizer class to initialize the recognizer. Then we use the record function, specifying the source file. Finally, we use the recognize_google function, which converts the speech to text and stores it in a text variable, and we can print the text or keep it in a variable for further processing.
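A minimal sketch of that flow; the file name audio.wav is a placeholder:

    import speech_recognition as sr

    AUDIO_FILE = "audio.wav"                 # placeholder path to the source audio file

    recognizer = sr.Recognizer()             # initialize the recognizer
    with sr.AudioFile(AUDIO_FILE) as source:
        audio = recognizer.record(source)    # read the entire audio file

    # Use the Google Web Speech API to convert speech to text
    text = recognizer.recognize_google(audio)
    print(text)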
The next is the pytesseract package, which is used to extract text from an image file. Pytesseract is an optical character recognition (OCR) tool for Python; that is, it recognizes and reads the text embedded in images.
Pytesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a standalone invocation script to Tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. Additionally, if used as a script, pytesseract will print the recognized text instead of writing it to a file. Just three lines are enough to extract the text from an image file: first import the pytesseract package, then initialize the tesseract command, and then specify the image path.
We then use the image_to_string method, which reads the image and converts the image data to a string.
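A minimal sketch, assuming Tesseract itself is installed on the system; the image path is a placeholder:

    import pytesseract
    from PIL import Image

    # On Windows you may need to point pytesseract at the tesseract executable, e.g.:
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    text = pytesseract.image_to_string(Image.open("page.png"))  # placeholder image path
    print(text)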
The next is Beautiful Soup 4, which is used to extract information from a web page. If you have done web scraping, then you are already familiar with this package.
It is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for extracting, searching, and modifying the parse tree. This is how we write the code: we first import the requests package, then we import the Beautiful Soup package. Then we specify the URL from which we want to extract the information, and next we specify the parser; here the HTML parser has been specified. One thing to note is that when you use Beautiful Soup it also extracts the source code, not the server-side source code but the code that is rendered by the web browser. So the next task will be to remove all the unwanted markup and to navigate to the appropriate location in the web page. If you are working with XML, you can use XPath queries to navigate to a particular location on the page and extract its content.
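A minimal sketch of the scraping flow described above; the URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com"                          # placeholder URL
    response = requests.get(url)

    soup = BeautifulSoup(response.text, "html.parser")   # parse the rendered HTML

    # Drop unwanted code such as scripts and styles, then pull out the visible text
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    print(text[:500])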
The next library we will see is Stanza. It is a Python NLP package for many human languages. Stanza is from Stanford; earlier it was known as StanfordNLP, but they have since changed its name, so now it is known as Stanza. It is a collection
of accurate and efficient tools for many human languages in one place. Starting from raw text through syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to the languages of your choosing. It is a native Python implementation requiring minimal effort to set up, with a full neural network pipeline for robust text analytics, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. It provides pretrained neural models supporting 66 human languages, and it is a stable, officially maintained Python interface to CoreNLP. You can refer to the GitHub repo for more information, and you can also visit the stanza.run website for a live demo.
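A minimal sketch of a Stanza pipeline; the example sentence is invented, and the English model must be downloaded once:

    import stanza

    stanza.download("en")                                    # one-time model download
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")

    doc = nlp("The two runners, a tortoise and a hare, raced near Paris.")

    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos)          # token, lemma, part of speech

    for ent in doc.ents:
        print(ent.text, ent.type)                            # named entities, e.g. "two" -> CARDINAL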
On this slide you can see the output of Stanza. Here I pasted one story you all know, the race between the tortoise and the rabbit. You can see that it shows the part-of-speech output, the lemmas, and the named entity recognition; for example, in named entity recognition it identifies "two" as a cardinal. It also shows the universal dependencies between the different words in a sentence. Next, the same text, the same story, was converted to another language, Hindi. You can see that part of speech is working fine, it has successfully identified the parts of speech, and lemmas are also working fine. But if you look at named entity recognition, this feature is not yet supported, so those who want to contribute can think of contributing in this particular area for the Hindi language. Likewise, if you consider some other languages, features are lacking. This is what I was saying earlier: libraries are limited by their features, and we may have to hard-code some features. It also shows universal dependencies for Hindi, so that is not a problem. Next is iNLTK.
It is a natural language toolkit for Indic languages, created by Gaurav Arora. It aims to provide out-of-the-box support for various NLP tasks that an application developer might need for Indic languages. Indic languages are the languages used in India; India is very rich in terms of languages, with around 22 official languages. iNLTK supports native languages and code-mixed languages. Native language means the text is in a single language, maybe Kannada, Hindi, Marathi, Tamil, Telugu, or some other language. Code-mixed means words from two or more languages are mixed; for example Hinglish, which is a combination of Hindi and English, or Kanglish, which is a combination of Kannada and English, meaning the script is Kannada but some English words are used in between. iNLTK is currently supported only on Linux and Windows 10 with Python version greater than or equal to 3.6.
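A minimal sketch of iNLTK usage, assuming the relevant models have been set up; the Hindi example text is invented:

    from inltk.inltk import setup, tokenize, identify_language

    setup('hi')                                   # one-time download of the Hindi model

    text = "प्राचीन काल में एक कछुआ और एक खरगोश थे"
    print(tokenize(text, 'hi'))                   # subword tokens for the Hindi text
    print(identify_language(text))                # detected language code, e.g. 'hi'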
The next library is the Indic NLP Library. You can see here that, for language support, there are different classifications, that is Indo-Aryan, Dravidian, and others.
It also shows the features supported for the various languages of India. If you look at the Dravidian languages, most of the features are supported; in the Indo-Aryan category, Hindi, Bengali, Gujarati, Marathi, and Konkani support all the features, and even Punjabi supports features like script information, normalization, tokenization, word segmentation, romanization, and so on. Then there are some languages which support bilingual features: script conversion is possible among the above-mentioned languages, except for Urdu and English, where it is not possible. Transliteration is possible, and translation is also possible. This library was created by Anoop Kunchukuttan. The goal of the Indic NLP Library is to build Python-based libraries for common text processing and multilingual natural language processing for Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, et cetera, and this library is an attempt to provide a general solution to the commonly required toolsets for Indian language text.
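A minimal sketch of tokenization and script conversion with the Indic NLP Library; the Hindi sample sentence is invented, and some features additionally require the separate indic_nlp_resources download to be configured:

    from indicnlp.tokenize import indic_tokenize
    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    hindi_text = "भारत एक बहुभाषी देश है।"

    # Word-level tokenization for Hindi
    tokens = indic_tokenize.trivial_tokenize(hindi_text, lang='hi')
    print(tokens)

    # Script conversion from Devanagari (Hindi) to Kannada script
    print(UnicodeIndicTransliterator.transliterate(hindi_text, "hi", "kn"))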
Then polyglot is another interesting library. It is a very vast library that supports most of the human languages in the world; it is really a massive library, developed by Rami Al-Rfou. It supports various features, and you can see in brackets how many languages each feature supports: tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, part-of-speech tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and similarly transliteration for 69 languages.
Again, note that the features are limited for some of the languages, so again there is scope for contribution here.
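A minimal polyglot sketch; the French example sentence is invented, and the per-language models must be downloaded first with polyglot's downloader:

    from polyglot.text import Text

    # e.g. download models once with:
    #   polyglot download embeddings2.fr pos2.fr ner2.fr
    blob = Text("Bonjour tout le monde. Paris est une belle ville.")

    print(blob.language.code)    # language detection, e.g. 'fr'
    print(blob.words)            # tokenization
    print(blob.pos_tags)         # part-of-speech tags
    print(blob.entities)         # named entities, e.g. Paris -> I-LOC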
Then finally, the summary. Performing NLP tasks on multiple human languages at a time is hard, especially when the text includes mixed languages. The information needs to be extracted from multiple sources and multiple languages and should be converted to a common language. Multilingual NLP helps to generate output in a target language: one thing we are doing here is converting the information to a source language, and then converting it to a specific target language. There are various libraries offering different features, but no single library offers all features, which means there is a lot of scope for contribution, and also a lot of features need to be hard-coded. Thank you everyone for attending my talk.