Abstract
            
Natural Language Processing (NLP) is an interesting and challenging field, and it becomes even more interesting and challenging when we take more than one human language into consideration. When we perform NLP on a single language, interesting insights available in other human languages may be missed. Valuable information may be available in Spanish, Chinese, French, Hindi, and other major languages of the world, and it may come in various formats such as text, images, audio, and video.
In this talk, I will discuss techniques and methods that help perform NLP tasks on multi-source and multilingual information. The talk begins with an introduction to natural language processing and its concepts, then addresses the challenges of multilingual and multi-source NLP. Next, I will discuss techniques and tools to extract information from audio, video, images, and other types of files using the PyScreenshot, SpeechRecognition, Beautiful Soup, and PIL packages, as well as extracting information from web pages and source code using pytesseract. I will then discuss translation and transliteration, concepts that help bring the information into a common language format; once the text is in a common language, it becomes easy to perform NLP tasks on it. Finally, I will explain, with the help of a code walkthrough, how to generate a summary from multi-source and multilingual information in a specific language using the spaCy and Stanza packages.
Outline
1. Introduction to NLP and concepts (05 Minutes)
2. Challenges in multi-source multilingual NLP (02 Minutes)
3. Tools for extracting information from various file formats (04 Minutes)
4. Extracting information from web pages and source code (04 Minutes)
5. Methods to convert information into a common language format (05 Minutes)
6. Code walkthrough for multi-source and multilingual summary generation (10 Minutes)
7. Conclusion and Questions (05 Minutes)
           
          
          
          
            
              Transcript
            
            
            
            
            
            
Hello everyone. My name is Gajendra Deshpande, and today I will be presenting a talk on multilingual natural language processing using Python. In today's talk we will discuss, in brief, natural language processing and its concepts; challenges in multi-source multilingual natural language processing; tools for extracting information from various file formats; extracting information from web pages and source code; and finally, methods to convert information into a common language format. Let us first look at a few of the basic concepts of natural language processing.
            
            
            
The first concept is tokenization. In tokenization, we split a paragraph into words and sentences. You will typically be given a huge text, and it is not possible to perform computation or processing on the entire text at once, so we need to tokenize it. For example, we may have to compute word frequency and sentence frequency, and we may also need to perform n-gram analysis.

Next is word embeddings, where we represent words as vectors, that is, we convert words into a numeric format for computation. Then text completion, where we try to predict the next few words in a sentence. Then sentence similarity, where we try to find the similarity score between two sentences, on a scale from zero to one. Then normalization, where we transform text into its canonical form. Then transliteration: writing text in language A using the script of language B; for example, you can write Korean using the English script. Then translation: converting text in language A into language B; that is, directly converting Korean text into English. So there is a difference between transliteration and translation; the two are totally different.

Then phonetic analysis, where we try to determine how characters sound when we speak them. Next is syllabification: converting text into syllables. Then lemmatization, where we convert words into their root form; for example, the root form of the word "running" is just "run". The next concept is stemming. It is a bit similar to lemmatization, but in stemming we just remove the last few characters from a word. Sometimes the result is the same as lemmatization, but stemming is not always accurate, so you will not always get the true root form. Then language detection: detecting the language of a text or of individual words. Then dependency parsing: analyzing the grammatical structure of a sentence. Then named entity recognition: recognizing the entities in a text, for example names, places, et cetera. And then part-of-speech tagging: tagging the parts of speech in a text.
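To make a few of these concepts concrete, here is a minimal sketch using spaCy, one of the packages used later in this talk. It assumes the small English model has been installed with python -m spacy download en_core_web_sm; the sample sentence is my own illustration, not from the talk.

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The two children were running in Seoul.")

    # Tokenization, lemmatization, and part-of-speech tagging
    for token in doc:
        print(token.text, token.lemma_, token.pos_)   # e.g. running -> run, VERB

    # Named entity recognition
    for ent in doc.ents:
        print(ent.text, ent.label_)                   # e.g. "two" -> CARDINAL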
            
            
            
Next, the challenges in multilingual NLP. The first challenge is that language is ambiguous: the same sentence may mean different things in different languages, and even within a single language we need to identify the context of the words. Second, languages have different structure, grammar, and word order. For example, we have Western languages, Indian languages, and others, and each language has its own grammar and syntax; some languages are read left to right and some are read right to left. Third, it is hard to deal with mixed-language information. We know that due to globalization people speak multiple languages, and when they do, there is a general tendency to mix words from different languages. When we have data of this kind and need to process it, problems arise, because our libraries may not be able to detect or identify the words of the other languages. Next, translation from one language to another is not accurate: translation is not always word-for-word, so the meaning needs to be taken into account. Language semantics must also be taken into account; that is what I was just saying, translation is not word-for-word, so we need to consider context for a more accurate translation. Finally, there is a lack of libraries and features. Of course there are many libraries in Python, but these libraries may not support all languages, and they may be limited in features; in that case we may have to use multiple libraries, or hard-code many features ourselves.
            
            
            
Now let us consider a scenario of multi-source, multilingual information processing; say, for example, we need to generate a summary. These are the steps: first, the information source, which is in different formats; then extract the text; then identify the language; then translate to a source language; then process the text; and finally translate to the target language.

Let us discuss these steps in detail. First, the information source: our information may be present in various formats, such as text, audio, video, or images. For processing, we need the information in textual format. If the information is already available as text, there is no problem; but if it is available as audio, video, or an image, we need to extract the text from it. And we know there are many formats for audio, video, and images, so we should try to extract information from as many formats as possible. So in the second step, we extract the text using the libraries available in Python. Next, identify the language: here we try to detect the language of the text. This is a bit challenging, because not all of the text may be in a single source language, and some words may not be identified because of feature limitations; in that case we may have to hard-code some features. The next step is to translate to a source language. Here there is one very important thing to do: we translate the entire text into one single language so that we can perform processing over it. Once we have translated the information into one language, we can process the text. For example, if we have to generate a summary, we can perform tokenization and lemmatization, calculate word frequency and sentence frequency, and also perform n-gram analysis; based on these steps we can pick the top n sentences for the summary. Then, finally, translate to the target language: you can generate the summary in the source language, or in a specified destination language.
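As a rough illustration of the processing step, here is a minimal sketch of a frequency-based extractive summarizer, assuming the text has already been extracted and translated into English; the function name and the scoring scheme are my own choices, not taken from the talk.

    import spacy
    from collections import Counter
    from heapq import nlargest

    nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

    def summarize(text, n=3):
        # Pick the top-n sentences scored by summed word frequency
        doc = nlp(text)
        # Frequencies over content words only (skip stop words and punctuation)
        freq = Counter(t.text.lower() for t in doc
                       if t.is_alpha and not t.is_stop)
        # Score each sentence by the frequencies of its words
        scores = {sent: sum(freq.get(t.text.lower(), 0) for t in sent)
                  for sent in doc.sents}
        # Keep the original order of the top-n sentences
        top = sorted(nlargest(n, scores, key=scores.get),
                     key=lambda s: s.start)
        return " ".join(s.text for s in top)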
            
            
            
Now let us look at a few Python packages that will help us achieve our task. The first one is googletrans 3.0.0. It is a free and unlimited Python library that implements the Google Translate API; it uses the Google Translate Ajax API to make calls to methods such as detect and translate. It is compatible with Python 3.6 and higher versions. It is fast and reliable, because it uses the same servers that translate.google.com uses. Auto language detection is supported, bulk translations are possible, customizable service URLs are supported, and it also supports HTTP/2. You can install it using pip: pip install googletrans.
            
            
            
If the source language is not given, googletrans attempts to detect it. You can see this in the source code: first we import Translator from googletrans, then we use the translate function. Note that in the first case we have only passed the text, which is Korean, but we have not mentioned the source language; the library detects that it is Korean, and since no destination is specified, it is translated to English by default. In the next case we have not specified the source language, but we have specified the destination language as Japanese; the source language is detected as Korean, so the Korean text is converted to Japanese. In the next example we have specified some text and declared that the source language is Latin; since we have not specified the destination language, the Latin text is translated to English.
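Here is a short sketch of those three cases, following the examples in the googletrans README; the sample strings come from the README.

    from googletrans import Translator

    translator = Translator()

    # No source or destination: Korean is detected, English is the default target
    result = translator.translate('안녕하세요.')
    print(result.src, result.dest, result.text)   # e.g. ko en <translation>

    # Destination only: Korean is detected and translated to Japanese
    print(translator.translate('안녕하세요.', dest='ja').text)

    # Source only: Latin text, translated to English by default
    print(translator.translate('veritas lux mea', src='la').text)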
            
            
            
You can also use another Google Translate domain for translation. You can specify either a single domain or multiple domains; if multiple URLs are provided, the library randomly chooses one for each request. Then the detect method, as its name implies, identifies the language used in the given sentence; you can use it to identify the language of any text. An important point here is that this library is unofficial and unstable, and the maximum length of a single text is 15k characters. The solution is to use Google's official Translate API for serious requirements.
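A short sketch of the custom service URLs and the detect method, again following the googletrans README; the Korean sample sentence is from the README.

    from googletrans import Translator

    # With multiple service URLs, one domain is chosen at random per request
    translator = Translator(service_urls=[
        'translate.google.com',
        'translate.google.co.kr',
    ])

    # detect() returns the detected language code and a confidence score
    detected = translator.detect('이 문장은 한글로 쓰여졌습니다.')
    print(detected.lang, detected.confidence)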
            
            
            
The next package is SpeechRecognition, which we use to extract text from an audio file. It is a library for performing speech recognition with support for several engines and APIs, and it works online as well as offline. The speech recognition engines and APIs supported by the SpeechRecognition package are CMU Sphinx (works offline), Snowboy hotword detection (also works offline), and, apart from those, Google Speech Recognition, the Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, the Houndify API, and IBM Speech to Text.
            
            
            
This is how we write the code. First we import the SpeechRecognition package, then we specify the file name from which we want to extract the text. Then we initialize a recognizer, use the record function with the source file, and finally call the recognize_google function, which converts the speech to text and stores it in a text variable. Finally we can print the text, or keep it in a variable for further processing.
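Here is a minimal sketch of that flow; "speech.wav" is a placeholder for your own audio file, and recognize_google uses the free Google Speech Recognition web API.

    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Load the whole audio file into an AudioData object
    with sr.AudioFile("speech.wav") as source:
        audio = recognizer.record(source)

    # Send the audio to Google Speech Recognition and get the transcript
    text = recognizer.recognize_google(audio)
    print(text)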
            
            
            
The next is the pytesseract package, which is used to extract text from an image file. pytesseract is an optical character recognition (OCR) tool for Python: it recognizes and reads text embedded in images. It is a wrapper for Google's Tesseract OCR engine. It is also useful as a standalone invocation script, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. Additionally, if used as a script, pytesseract prints the recognized text instead of writing it to a file. Just three lines are enough to extract the text from an image: first import the pytesseract package, then initialize the path to the Tesseract command, and then specify the image's path. We use the image_to_string method, which reads the image and converts the data in the image to a string.
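Here is what those three lines look like in a minimal sketch; the image name and the Tesseract path are placeholders for your own setup.

    import pytesseract
    from PIL import Image

    # Only needed if the tesseract binary is not on your PATH (example Windows path)
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    # Read the image and convert the embedded text to a string
    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)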
            
            
            
The next is Beautiful Soup 4, which is used to extract information from a web page. If you have done web scraping, then you are familiar with this package. It is a library that makes it easy to scrape information from web pages: it sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. This is how we write the code: first we import the requests package, then we import the Beautiful Soup package. Then we specify the URL from which we want to extract information, and we specify the parser; here the HTML parser has been specified. One thing to note is that Beautiful Soup extracts the source code as well; it is not the server-side source code, but the code as rendered by the web browser. So the next task is to remove all the unwanted code and navigate to the appropriate location in the web page. If you are using XML, you can use XPath queries to navigate to a particular location in the document and extract its content.
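A minimal sketch of that flow, with a placeholder URL; stripping script and style tags is one common way to remove the unwanted code mentioned above.

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com"   # placeholder URL
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Remove script and style tags, then keep only the visible text
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    print(text)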
            
            
            
The next library we will see is Stanza, a Python NLP package for many human languages. Stanza is by Stanford; earlier it was known as StanfordNLP, but they have since changed its name, so now it is known as Stanza. It is a collection of accurate and efficient tools for many human languages in one place: from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to the languages of your choosing. It is a native Python implementation requiring minimal effort to set up. A full neural network pipeline for robust text analytics is supported, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. It ships pretrained neural models supporting 66 human languages, and it is a stable, officially maintained Python interface to CoreNLP. You can refer to the GitHub repo for more information, and you can also visit the stanza.run website for a live demo.
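Here is a minimal Stanza sketch covering the pipeline stages mentioned above; the sample sentence echoes the story used on the next slide.

    import stanza

    # One-time download of the English models (requires network access)
    stanza.download("en")

    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")
    doc = nlp("The tortoise and the rabbit ran two races.")

    # Tokenization, lemmatization, and part-of-speech tagging
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos)

    # Named entity recognition
    for entity in doc.ents:
        print(entity.text, entity.type)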
            
            
            
On this slide you can see the output of Stanza. Here I pasted one story, which you all know: the race between the tortoise and the rabbit. You can see that it shows the part-of-speech output, then the lemmas, and then the named entity recognition; for example, in named entity recognition it says that "two" is a cardinal. It also shows the universal dependencies between the different words in a sentence. Next, I converted the same text, the same story, into another language: Hindi. You can see that part of speech works fine; it has successfully identified the parts of speech. Lemmas also work fine. But if you look at the named entity recognition, this feature is not yet supported; those who want to contribute can think of contributing in this particular area for the Hindi language. Likewise, if you consider some other languages, features are lacking. This is what I was saying earlier: libraries are limited by features, and we may have to hard-code some features. It also shows universal dependencies, which work fine; that is not a problem. The next package is iNLTK.
            
            
            
It is a natural language toolkit for Indic languages, created by Gaurav Arora. It aims to provide out-of-the-box support for various NLP tasks that an application developer might need for Indic languages, that is, the languages used in India. India is very rich in terms of languages; it has around 22 official languages. iNLTK supports both native-language and code-mixed text. Native-language text means text in a single language, maybe Kannada, Hindi, Marathi, Tamil, Telugu, or some other language. Code-mixed text means words from two or more languages are mixed; for example, Hinglish, which is a combination of Hindi and English, or a mix of Kannada and English, where the script is Kannada but some English words are used in between. iNLTK is currently supported only on Linux and Windows 10, with Python version 3.6 or higher.
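A hedged sketch of iNLTK for Hindi, based on its documented setup, tokenize, and identify_language helpers; the sample sentence is my own.

    from inltk.inltk import setup, tokenize, identify_language

    # One-time download of the Hindi model (requires network access)
    setup('hi')

    text = 'मुझे किताबें पढ़ना पसंद है'   # "I like reading books"
    print(tokenize(text, 'hi'))        # subword tokens
    print(identify_language(text))     # detected language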
            
            
            
The next library is the Indic NLP Library. In its language support there are different classifications, namely Indo-Aryan, Dravidian, and others, and it shows which features are supported for the various languages of India. If you look at the Dravidian languages, most of the features are supported. In the Indo-Aryan category, Hindi, Bengali, Gujarati, Marathi, and Konkani support all the features, and even Punjabi supports features like script information, normalization, tokenization, word segmentation, romanization, and so on. Then there are some languages which support bilingual features: script conversion is possible among the above-mentioned languages, except for Urdu and English. Transliteration is possible, and translation is possible as well. This library was created by Anoop Kunchukuttan. The goal of the Indic NLP Library is to build Python-based libraries for common text processing and multilingual natural language processing for Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, et cetera, and this library is an attempt to provide a general solution to the commonly required toolsets for Indian-language text.
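A hedged sketch of the Indic NLP Library showing tokenization and script conversion; the API names follow the library's documentation as I recall it, and the sample text is my own.

    from indicnlp.tokenize import indic_tokenize
    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    text = 'मुझे किताबें पढ़ना पसंद है'

    # Tokenize Hindi text
    print(indic_tokenize.trivial_tokenize(text))

    # Script conversion: Devanagari (hi) to Kannada (kn)
    print(UnicodeIndicTransliterator.transliterate(text, 'hi', 'kn'))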
            
            
            
Polyglot is another interesting library. It is a very vast library, and it supports most of the human languages in the world; it is really massive. It was developed by Rami Al-Rfou. It supports various features, and you can see in brackets how many languages each feature supports: tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, part-of-speech tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and similarly transliteration for 69 languages. Again, note here that the features are limited for some of the languages, so once again there is scope for contribution here.
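A hedged sketch of Polyglot; it assumes the relevant models have been fetched first with the polyglot downloader, and the sample sentence is my own.

    from polyglot.text import Text

    # Assumes the models were downloaded first, e.g.:
    #   polyglot download embeddings2.en ner2.en
    blob = Text("Gajendra presented a talk on Python in Bangalore.")

    print(blob.language.code)   # language detection
    print(blob.words)           # tokenization
    print(blob.entities)        # named entity recognition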
            
            
            
Finally, the summary. Performing NLP tasks on multiple human languages at a time is hard, especially when the text includes mixed languages. The information needs to be extracted from multiple sources and multiple languages, and should be converted to a common language. Multilingual NLP helps to generate output in a target language: what we are doing here is converting the information to a source language, processing it, and then converting it to a specific target language. There are various libraries offering different features, but no single library offers all of them, which means there is a lot of scope for contribution, and also that many features need to be hard-coded. Thank you everyone for attending my talk.