Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, my name is Gajendra Deshpande. Today I will
be presenting a talk on building your first cyber forensics application
using Python. In today's talk, we are going to
discuss an introduction to digital crimes and digital forensics,
the process of investigation and the collection of evidence, then setting
up Python for forensic application development, built-in
functions and modules for forensic tasks, forensic indexing
and searching, hash functions for forensics, forensic evidence extraction,
metadata forensics, then using natural language tools in forensics. And finally,
the summary. Let us first look at some cybercrime statistics.
So, the Internet Crime Report for 2019, released by the
Internet Crime Complaint Center of the USA's Federal Bureau
of Investigation, has revealed the top four countries that are victims of Internet crimes.
You can see here that the USA has more than four lakh (400,000) reports.
Then UK has more than 93,000, Canada has more than 33,000
and India has more than 27,000. Of course, these numbers are only reported
numbers, so unreported numbers are much, much higher. So you can consider
at least three times higher. According to the RSA report of
2015, mobile transactions are rapidly growing and cybercrimes
are migrating to less protected soft channels. These less protected soft
channels are mostly our mobile devices. And most
of the time, mobile devices are operated by people who are not
well educated and not well versed with the mobile
device and its different settings. According to a report by Norton in
2015, an estimated 103 million Indians lost about
Rs. 16,000 on average to cybercrime.
That amounts to around US $200
plus. According to an article published in the Indian Express
on 19 November 2016, over 55% of millennials in
India have been hit by cybercrime. A recent study by Check Point Research has
recorded over 150,000 cyberattacks every week during the Covid-19
pandemic, an increase of 30% in cyberattacks
compared to previous weeks. That's because many people have lost jobs
and many people are suffering, and there may be many insiders who
are taking advantage of the situation. Now let us look at the definition of
digital forensics. So, forensics science is the use of scientific
methods or expertise to investigate crimes or
examine evidence that might be presented in a court of law. Cyber forensics
is the investigation of various crimes happening in cyberspace.
Examples of attacks include phishing, ransomware, fake news,
fake medicine, extortion and insider frauds.
According to DFRWS, that is, the Digital Forensics Research Workshop,
digital forensics can be defined as the use of scientifically derived
and proven methods toward the preservation, collection, validation,
identification, analysis, interpretation, documentation and
presentation of the digital evidence derived from digital
sources for the purpose of facilitating or furthering the
reconstruction of events found to be criminal, or helping
to anticipate unauthorized functions shown to be disruptive
to planned operations. So the digital forensic investigation
process has the following steps. It starts with
identification, then collection, validation, examination,
preservation and presentation. In the identification step,
whenever an investigating officer, usually a police officer,
visits the scene, his job is to first identify
all the objects so that he can seize those objects,
which helps in the investigation of the case. This
identification of objects helps in collecting the evidence.
Basically, he has to collect all the electronic
gadgets, including smartphones,
laptops, storage devices,
et cetera. One important thing to note here is that
there may be some devices, say, for example,
a USB drive disguised as a toy, that are very difficult to identify.
He has to identify even such objects and take them into his custody.
Once the objects are identified, the next
step is the collection. In the collection of evidence,
the investigating officer has to note down the state of
the system. If it is on, then he has to perform live forensics. If it
is off, he should not turn on the system. The present state of the
system has to be maintained and photographs need to be taken.
In some cases, although that situation is rare, the
police officer is not in a position to perform live forensics;
then he just needs to pull the plug so that the present
state of the system can be maintained. If he turns
on the system or turns off the system, then it will definitely change the state
of the system and it will alter the evidence. One more important thing
whenever the investigating officer is collecting the evidence is that
he should collect the most volatile evidence first and the least
volatile evidence last. There is a particular order, known as the
order of volatility.
As per the volatility order, the investigating officer needs to collect
the evidence. Once that is done, the next
step is to validate the evidence. Note here
that the investigating officer usually takes a snapshot
or image of the system, and this image needs to be validated.
One algorithm which can be used for validation is a hashing algorithm.
I will demonstrate how this is done in later slides.
The next is examination. Once the system
image has been captured, the investigating officer
needs to examine it. Note here that this data will be huge, so without
a computer it will be very difficult to examine the data and get
useful insights. The next step is preservation.
Note here that investigating officers collect different objects, such as hard
disks and several other pieces of evidence. These need to be stored at
the proper room temperature, with proper security,
in proper lockers, and they also need to be stored in special
bags, such as antistatic bags or Faraday bags.
This is very important, because if the procedure is not followed, then the evidence may
be altered. If the evidence is altered by any means, then it cannot be
presented in the court of law. The next is presentation.
The whole idea behind all these steps is
to extract the evidence and present it in the court of law,
right? So if these steps are not performed properly, then the court will not accept it.
So every step needs to be performed carefully, and finally,
it has to be presented in the court of law. Now,
there is one important standard known as the Daubert standard.
Let us discuss how the Daubert standard is useful
and how Python adheres to the Daubert standard. In United
States federal law, the Daubert standard is a rule of evidence regarding
the admissibility of expert witness testimony. A party may
raise a Daubert motion, a special motion in limine raised
before or during trial, to exclude the presentation
of unqualified evidence to the jury. There are some illustrative
factors. The court defined scientific methodology as the
process of formulating hypotheses and then conducting experiments to
prove or falsify them, and provided a set
of illustrative factors. So, pursuant to Rule
104(a), in Daubert, the US Supreme Court suggested that
the following factors be considered. So, has the technique been
tested in actual field conditions and not just in
a laboratory? Has the technique been subject to peer review and
publication? What is the known or potential rate
of error? Do standards exist for the control of the technique's operation?
Has the technique been generally accepted within the relevant
scientific community? Now let's see how Python measures up.
In 2003, Brian Carrier published a paper that examined
rules of evidence standards, including Daubert,
and compared and contrasted open source and closed source
forensic tools. One of his key conclusions was: "Using the guidelines
of the Daubert tests, we have shown that open source tools may
more clearly and comprehensively meet the guideline requirements than closed
source tools." This statement clearly states that Python
has an advantage because Python is open source and free software.
So we can say that Python adheres to the Daubert standard, and code
written in the Python language for a
cybercrime application can be presented in the court of law.
The results are not automatic, of course, just because the source is open.
Rather, specific steps must be followed regarding design, development and validation.
Can the program or algorithm be explained? This explanation
should be given in words, not only in code. Has enough information
been provided such that thorough tests can be developed
to test the program? Have error rates been
calculated and validated independently? Has the program been studied
and peer reviewed? Has the program been generally accepted
by the community? Now you can see these five points correlate to
the Daubert standard's illustrative factors. That's why we can
say that, since Python is open source, you can address
all the points mentioned on this slide using Python.
So it adheres to the Daubert standard, and hence the evidence
can be presented in the court of law. This is very, very important:
whenever you are using a tool, you should ensure that it
adheres to the Daubert standard.
Next is setting up Python for forensics application development. Now,
there are some factors which need to be considered whenever you
are setting up the environment. The first one is your background and the
organization's support: what is your qualification, how skilled
are you in Python, and what support does the organization offer?
For example, does your organization fund the development of new software,
is it capable of purchasing new software,
or is it interested in investing in open source
tools? The next is choosing third-party libraries. Choosing
third-party libraries is also very, very important, because there are
dependency issues and you may sometimes have to write
wrappers in order to get the functions you need. The next is
the IDEs and their features, that is, integrated development environments.
What do you prefer? Are you okay with writing
command-line programs, or do you need sophisticated IDEs
that can help you code faster using features
such as IntelliSense and debugging?
The next is installation. Where do you want to install it?
On which operating system are you interested in installing:
Windows, Linux or Macintosh? Of course, if it
is just a simple analysis, then you can use any operating system.
But if you are performing some system-specific
analysis, say for example Windows forensics,
Linux forensics or Macintosh forensics, then you need to install Python
on those specific operating systems. Then, the right version of Python.
This is also very, very important. You can't use the most recent version of Python
just because it is recent; some libraries may support it, some may not.
Getting the task done is what matters,
so you need to use the appropriate version of Python. Then,
how do you want to execute your programs? Are you
interested in using graphical tools or just shell commands?
Many times shell commands will do the job and you can get the
job done very quickly, and many times it is important
to use graphical tools as well.
Now let's see how Python supports the development
of cyber forensics applications. First, built-in
functions and modules. Note here that Python has many
built-in functions and modules. You can list all of them using
the dir() built-in function, and
you can see that there are several built-in modules
and functions listed. If you are a Python developer, you are already
aware of these functions. The only thing we need to see is how we
are going to use them differently when we are developing a cyber
forensic application. This is a simple code example which demonstrates
the use of the range function. You might have used the range function along
with loops, whenever you wanted to generate a list of numbers, or whenever
you wanted to work with lists or
array-like data structures. Here, a base address has been defined and we are
generating ten local IP addresses. Similarly,
you can generate any number of IP addresses, of any kind.
You can even generate IPv6 addresses.
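A minimal sketch of that slide's idea; the base address 192.168.1. and the count of ten are illustrative assumptions, since the slide itself is not reproduced in this transcript:

```python
# Generate ten local IPv4 addresses from a base address using range().
# The base "192.168.1." and the count of 10 are illustrative assumptions.
base = "192.168.1."
addresses = [base + str(host) for host in range(1, 11)]
print(addresses)  # ['192.168.1.1', '192.168.1.2', ..., '192.168.1.10']
```

The same pattern extends to IPv6 by formatting hexadecimal groups instead of decimal host numbers.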
Similarly, you can use the range function to generate any
kind of information. The next application is
to list the files in a directory. In this
case we are using the os module. Again, it's built in.
Here we are getting the current working directory,
then using it to print the
files and folders in the present working directory. Note
here that in this case also we have not used any additional library.
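The listing just described, using only the built-in os module:

```python
import os

# Get the current working directory, then list the files and
# folders inside it, using only the standard library.
cwd = os.getcwd()
print("Current working directory:", cwd)
for entry in os.listdir(cwd):
    print(entry)
```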
The next concept is forensic indexing and searching.
You are already aware of these concepts. Whenever you have worked with
list data structures, arrays, matrices, two-dimensional arrays
or multi-dimensional arrays, you have dealt with the index
concept. Similarly, in the case of Google, you may
have come across the PageRank algorithm.
The next is searching. Searching is just a
simple operation which is used to find relevant information.
You can write your own search
functionality, or you can use the search functions
available in the Python core library.
These two are very simple methods.
Now note here that many times our evidence may be present in
files. In that case we need to search for particular keywords.
These keywords are nothing but clues for the evidence. You need
to search for those keywords, for particular information, and
you can do it using very simple code. You just need
to use the file object: open the file,
read the information line by line, process it,
then check for the keywords. If the keywords are found, you can
print that they are found; if not, you can print that they
are not found. If they are found, it means you have found some clues.
Then you can use some additional tools to index
them. You can even perform simple indexing using a
dictionary, or you can just put the hits in a list. When you put them
in a list you will be indexing them by default.
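A minimal sketch of that search loop; the file path and the keyword list are placeholders you would replace with real case data:

```python
# Search a text file line by line for keywords that may be clues.
def search_keywords(path, keywords):
    """Return {keyword: [line numbers where the keyword appears]}."""
    hits = {kw: [] for kw in keywords}
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for lineno, line in enumerate(fh, start=1):
            for kw in keywords:
                if kw.lower() in line.lower():
                    hits[kw].append(lineno)
    return hits

# hits = search_keywords("evidence.txt", ["password", "transfer"])  # placeholder inputs
```

Keeping the matching line numbers per keyword gives you the simple dictionary-based index mentioned above for free.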
Now there is a library called Whoosh. It is an advanced
library and it can be used for forensic indexing
and searching. Whoosh was created and is maintained by
Matt Chaput. It was originally created for use
in the online help system of Side Effects Software's
3D animation package, Houdini. Since 2016
we have not seen any updates, but it still
works with the present version of Python. You don't face any problem
with the present version; it's still compatible and works fine.
It's a pure Python library, and
it supports fielded indexing and search, fast indexing and retrieval,
pluggable scoring algorithms,
text analysis, storage, various posting formats,
et cetera. You can also query it: it
supports a powerful query language and a pure Python
spell checker. Now, this is code actually written
using Whoosh. What we are doing here
is first importing the required functions, such
as create_in. Then we are defining a
schema with title, path and content.
Then we are creating a directory, indexdir, with
the schema. Then we are writing the files and the content
to indexdir. Note here that once the content has been
written, you also need to write the query parser. This query
parser will help you to extract the information from
the index. Whoosh can also be used to create
your own custom search engine, since it supports both indexing and
searching features. Next, hash functions for forensics. Hash functions
are very, very important. They are used basically for validation purposes.
Whenever you take a snapshot,
you record the entire
image of the system. When I say image of the system: we use
tools like Norton Ghost to make an image
of the entire system. Once your image is ready,
you can start analyzing it. But note here that you cannot perform forensic
analysis on the original data; you need to perform forensic analysis on
a copy of the data. After performing the analysis on the copy of
the data, you need to check the hash
of the original image and the copied
image; they should be the same. If there's a difference,
then that means that something has been altered in the copied image.
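The validation check just described can be sketched with hashlib; the image file paths are placeholders:

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# original = sha256_of_file("original_image.dd")  # placeholder paths
# working = sha256_of_file("working_copy.dd")
# print(original == working)  # True means the copy has not been altered
```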
That's why, in the simple application I'm
demonstrating here, we are importing the hashlib library.
Then we are using the SHA-256 hash
function. Of course, hashlib also supports other algorithms, such as MD5.
Note here that two messages have been
written: one is "Python is", and the second is "a great programming language". They have
been combined and fed into a hash object called m.
Then we are calculating the digest of m.
Then we are defining another variable, x.
Here again we are using the same method,
that is SHA-256, and in this case we are using the single sentence
"Python is a great programming language". At the end we
are comparing the two using the statement print(x.digest() == m.digest()).
That means we are checking whether the digests of
x and m are the same. In this case, you can see the output
is the same, and the digest comparison prints True.
That means the hash is the same and the information has
not been altered. Now, here is the same example again. What
I have done is I have just added one white space at the
end of the x message, right after the dot. In
this case, again, the digest has been calculated.
Now it is showing False; that means the hashes are not the
same, and the information has been altered. So hash functions
are very, very important. And note here that the
use of hash algorithms is recognized in the court
of law. I'm not aware of other countries,
but at least in India, it has been recognized by the Information
Technology Act, 2000. Next up is
forensic evidence extraction. For this we can use the
library called Pillow. Pillow is the friendly PIL fork
by Alex Clark and contributors. PIL is the Python
Imaging Library by Fredrik Lundh and contributors.
Basically, it is used for image processing tasks. The Python Imaging
Library adds image processing capabilities to your Python interpreter.
This library provides extensive file format support, an efficient
internal representation, and fairly powerful image processing capabilities. The core
image library is designed for fast access to data stored in a few basic
pixel formats, so it provides a solid foundation for a general
image processing tool. Now, for forensic evidence
extraction, we are again using the PIL library. Note here that we
can extract the EXIF tags, and we can extract GPS information
using GPS tags. We can use both.
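A minimal sketch of that extraction, assuming a reasonably recent Pillow; the photo path is a placeholder:

```python
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def extract_metadata(path):
    """Return (exif, gps) dicts with numeric tag ids mapped to readable names."""
    exif = Image.open(path).getexif()
    named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
    # 0x8825 is the GPSInfo IFD, where latitude and longitude live;
    # Exif.get_ifd needs a recent Pillow version.
    gps_ifd = exif.get_ifd(0x8825)
    gps = {GPSTAGS.get(tag_id, tag_id): value for tag_id, value in gps_ifd.items()}
    return named, gps

# named, gps = extract_metadata("photo_from_phone.jpg")  # placeholder path
```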
Let's assume that there's a picture which was taken with a mobile
phone and is stored on the phone. The investigating
officer will take that photo and run the third script,
that is, the script shown at the bottom, and he will extract the
GPS information from that photo. He will
also extract other properties of the image, such as size and
image description. The GPS tags include the longitude and
latitude, basically the location information: where that
photo was taken. All this information can be extracted
using the simple PIL library and
modules such as TAGS and GPSTAGS. Of course this library supports
various other modules which are also useful for
extracting evidence. The next is the pyscreenshot
module. It tries to allow taking screenshots without installing third-party
libraries. Note here that it has been written
as a wrapper around Pillow, but pyscreenshot also supports other
back-end libraries. Note that performance is not the target
of this library; in fact, in cyber forensics activities generally,
performance is not the target. The importance is given to
the evidence and its protection.
Basically, it has to be ensured that the information has not
been altered. This is simple code which
actually takes a screenshot of the entire screen.
That can be done by importing the pyscreenshot module
and using the grab method. Once you grab it, you just need to save
the image using the save method. This particular code takes
a screenshot of the entire screen. Similarly, you can take a screenshot of
part of the screen by passing the coordinates
in the bbox parameter of the grab method. You
can also check the performance of the pyscreenshot module if you're
interested. You can see here there are different back ends, such as
PIL, MSS, PyQt, et cetera, and n
equal to ten means this is the time taken to take ten
screenshots, so you can choose the one which takes the least time.
You can also force the back end: if you force the back end
to scrot, or if you force the back end to MSS,
and if you set childprocess to False, then of course
it will help you improve performance significantly. But as
I have said, performance is not the target here. Extracting the evidence is
the target. The next is metadata forensics. Note here that metadata
is associated with every kind of file. Mutagen is
a Python module to handle audio metadata. Many times you may
get audio evidence or video evidence, so
you may have to extract the metadata of an audio file or sometimes
even a video file. In that case, mutagen will help you.
Again, mutagen is a pure Python library; that means no additional modules
are required and it doesn't have
any additional dependencies. You
can install mutagen with python3 -m pip install mutagen.
What mutagen does is take any audio file,
try to guess its type, and return a file type instance or None.
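The idea mutagen relies on, identifying a file by its content rather than its extension, can be sketched dependency-free by checking magic bytes; the signature list here is a small illustrative subset:

```python
# Identify a file's real type from its leading bytes, not its extension.
# Only a few audio-relevant signatures are shown; real tools such as
# mutagen recognise many more formats and parse much deeper.
SIGNATURES = [
    (b"fLaC", "FLAC audio"),
    (b"ID3", "MP3 audio (ID3 tag)"),
    (b"OggS", "Ogg container"),
    (b"RIFF", "RIFF container (e.g. WAV)"),
]

def guess_type(path):
    with open(path, "rb") as fh:
        header = fh.read(8)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None  # unknown, like mutagen returning None

# guess_type("song.mp3")  # placeholder path; reports the real type even if renamed
```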
Many times it happens that people may change the extension, but even though
you change the extension, the internal
structure remains the same. In that case it becomes important to
guess, or get, the
original type of the file. You can see
here that the same mutagen library is able to get the information
about a FLAC file and also an MP3 file: it can get
the bitrate and the length of an audio
file. Similarly, as I have said, since you are dealing with
files, metadata is associated with every kind of file. There is a library called
PyPDF. Using this you can extract the metadata
of a PDF file. Again, it is a pure Python library,
and it is capable of extracting document information, splitting documents page
by page, merging documents page by page, cropping pages, merging multiple
pages, encrypting and decrypting PDF files,
and so on. The next is pefile. PE
file is a multiplatform Python module to parse and work
with portable executable (PE) files. PE files are usually found on
the Windows operating system. Most of the information contained in the PE file headers
is accessible, as well as the sections' details and data.
The structures defined in the Windows header files are accessible as attributes
of the PE instance. The naming of fields and attributes
tries to adhere to the naming scheme in those headers; only shortcuts
added for convenience depart from that convention. pefile
requires some basic understanding of the layout of a PE file;
with it, it's possible to explore nearly every single feature of the PE
file format. Some of the tasks which are possible
with pefile are inspecting headers, analyzing sections' data,
retrieving embedded data, reading strings from resources,
warning of suspicious and malformed values,
overwriting fields, packer detection with PEiD signatures,
PEiD signature generation, et cetera.
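To make that header layout concrete, here is a dependency-free sketch of the very first step pefile performs, locating the PE signature via the DOS header; pefile itself does far more than this:

```python
import struct

def pe_signature_offset(data):
    """Return the offset of the 'PE\\0\\0' signature in raw file bytes,
    or None if this is not a PE file. The DOS header starts with the
    'MZ' magic; the 4-byte little-endian field at offset 0x3C
    (e_lfanew) points at the PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    return e_lfanew

# with open("sample.exe", "rb") as fh:  # placeholder path
#     print(pe_signature_offset(fh.read()))
```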
The next important concept is using natural
language tools, or NLP packages, in Python. Note here that when
you extract the information, you are actually taking an image of the entire computer.
There may be a lot of textual information present in it,
and a lot of system files, so it is not possible
to examine each and every file manually. In that case, to extract
useful information and keywords,
NLP packages can be used. We know that NLP packages support
features such as tokenization, lemmatization, word frequency,
n-gram analysis, and so on. You can also generate
a summary, and generate the frequency of co-occurring words
using n-gram analysis. They also support grammatical
tools such as part-of-speech tagging and named entity recognition.
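A dependency-free sketch of two of these features, word frequency and n-gram analysis, which NLTK and similar packages provide in far more robust form:

```python
from collections import Counter

def word_frequencies(text):
    """Lowercased word counts; a crude stand-in for a real NLP tokenizer."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Frequency of co-occurring word sequences (n-grams)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

text = "transfer the funds transfer the money"
print(word_frequencies(text).most_common(2))  # [('transfer', 2), ('the', 2)]
print(ngrams(text, 2).most_common(1))         # [(('transfer', 'the'), 2)]
```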
Several features are supported, and all these features are really important
in forensic analysis. You can get the required
information, or try to get some insights,
using NLP packages. These NLP
packages can be classified into three categories. First, single-
language libraries: most of the time, libraries such as NLTK and spaCy
work with English, but some of them also support other languages.
Then there are libraries for multiple human languages,
such as Stanza and Polyglot:
Stanza supports at least 60-plus languages, and Polyglot supports around 140 to 150
languages. Then there are libraries such as iNLTK
and Indic NLP; these are for Indian languages, which have
an altogether different structure. Of course, Stanza and
Polyglot support some of the Indian languages, but iNLTK
and Indic NLP are much more advanced. In today's
talk, we have seen how we can create small cyber forensic applications.
We have not created any extensive application,
but you can see that we have created very
small applications using concepts known to us.
Of course, we have also seen some advanced libraries.
In cyber forensic application development, it is very
important to follow the standard procedures of the law enforcement agencies
during the investigation process; otherwise the evidence will not be admissible in the
court of law. There are many open source as well as commercial tools for
digital forensics. Learning to develop your own tool is always advantageous
because it can save time and money. Many tools
written in Python are pure Python implementations. And most importantly,
Python and open source tools comply with the Daubert standard.
Thank you everyone.