Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Kosman, I'm a Python developer and I'm from Spain. This talk is called Tips and Tricks for Data Science Projects with Python. Basically, Python has become the most widely used language for machine learning and data science projects due to its simplicity and versatility. For this purpose, Python provides access to great libraries and frameworks for artificial intelligence and machine learning, flexibility and platform independence.
This year I have written and published a book titled Big Data, Machine Learning and Data Science with Python. The book is published in Spanish. In this book you can basically find practical examples with pandas, PySpark, scikit-learn, TensorFlow, Hadoop, Jupyter Notebook and Apache Flink. These are the main talking points: I will start by introducing Python as a programming language for machine learning projects, I will comment on the main stages of a machine learning project, I will also comment on the main Python libraries for your project, and finally I will comment on the main Python tools for deep learning in data science projects.
Well, Python's simplicity allows developers to write reliable systems, and developers get to put all their effort into solving a machine learning problem instead of focusing on the technical nuances of the language. Since Python is a general-purpose language, it can do a set of complex machine learning tasks and enables you to build prototypes quickly, which allows you to test your product for machine learning purposes. There are some fields where artificial intelligence and machine learning techniques are applied: for example, spam filters, recommendation systems, search engines, personal assistants and fraud detection systems.
In this table we can see the main libraries, the main modules we have in Python for each domain. For machine learning we have Keras and TensorFlow; for high performance in scientific computing we have NumPy and SciPy; for computer vision we have OpenCV; for data analysis we have NumPy and pandas; and for natural language processing we have spaCy. Now we are going to review some of these libraries.
For example, NumPy is the fundamental package required for high performance scientific computing and data analysis in the Python ecosystem. The main data structure in NumPy is the ndarray, which is a shorthand name for n-dimensional array. When working with NumPy, data in an ndarray is simply referred to as an array; you can create, for example, one-dimensional, two-dimensional and three-dimensional arrays. The main advantage of NumPy is its speed, mainly due to the fact that it is developed in the C programming language. For data science and machine learning tasks, it provides a lot of advantages.
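As a minimal sketch of the arrays just mentioned (the shapes and values here are illustrative, not from the slides):

```python
import numpy as np

a1 = np.array([1, 2, 3])                 # one-dimensional array
a2 = np.array([[1, 2, 3], [4, 5, 6]])    # two-dimensional array
a3 = np.zeros((2, 3, 4))                 # three-dimensional array of zeros

print(a1.ndim, a2.shape, a3.shape)       # 1 (2, 3) (2, 3, 4)
```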
Another module we have in Python, one of the most popular libraries for scientific Python, is pandas, which is built on top of NumPy arrays, thereby preserving fast execution speed and offering many data engineering features, including reading and writing data in many formats, selecting subsets of data, calculating across rows and columns, finding and filling missing data, applying operations to independent groups within the data, and other tasks related, for example, with combining multiple data sets together.
One of the structures that is very useful in pandas is the DataFrame, which is the most widely used data structure. You can imagine it as a SQL table in a database or a spreadsheet with rows and columns. Basically, a DataFrame is a two-dimensional data structure with potentially heterogeneous data. Its main feature is that it is a size-mutable structure, which means data can be added to or deleted from it in a simple way.
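A minimal sketch of that idea, with hypothetical example data (the column names and values are purely illustrative):

```python
import pandas as pd

# Rows and columns, like a table in a database or a spreadsheet
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Marta"],
    "age": [34, 29, 41],
})

df["city"] = ["Madrid", "Sevilla", "Bilbao"]   # columns can be added...
df = df.drop(columns=["age"])                  # ...or deleted in a simple way
print(df.head())
```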
Pandas provides another interesting project that is called pandas-profiling, an open source Python module with which we can quickly perform an exploratory data analysis with just a few lines of code. In addition, it can generate interactive reports in web format that can be presented to anyone. In short, what pandas-profiling does is save us all the work of visualizing and understanding the distribution of each variable in our data set, generating a report with all the information easily visible.
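A minimal sketch of that workflow; the input file name is hypothetical, and note that the package has since been renamed ydata-profiling, so the import may vary with your version:

```python
import pandas as pd
from pandas_profiling import ProfileReport   # "ydata_profiling" in newer releases

df = pd.read_csv("dataset.csv")              # hypothetical input file
profile = ProfileReport(df, title="Exploratory Data Analysis")
profile.to_file("report.html")               # interactive report in web format
```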
Now I'm going to comment on the main stages of a machine learning project. Machine learning is the study of certain algorithms and statistical techniques that allow computers to perform complex tasks without receiving instructions beforehand. Instead of using pre-programming that directs certain behavior under a certain set of circumstances, machine learning relies on pattern recognition and associated inferences.
In this diagram we can see the main stages of a machine learning project. We start with labeled observations, and in stage two what we do is split these labeled observations into training and test data sets. In step three, our model is built using the training data, and for validating the model we use the test data set. In the last step, basically, the model is then evaluated on the degree to which it arrives at the correct output.
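A minimal sketch of those stages with scikit-learn; the data set and the choice of estimator are illustrative, not part of the talk:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                       # labeled observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                             # build with training data
print(model.score(X_test, y_test))                      # evaluate on the test set
```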
In this diagram we can see these stages in a more generic way. The machine learning lifecycle is basically the cyclical process that data science projects follow. It defines each step that an organization should follow to take advantage of machine learning and artificial intelligence to derive practical business value. These are the five major steps in the machine learning lifecycle, all of which have equal importance and go in a specific order. We start with getting data from various sources. In step two, we try to clean the data to have homogeneity. In step three, we try to build our model, selecting the right machine learning algorithm depending on our data. In step four, we try to gain insights from the model's results, and in step five we have basically data visualization, transforming the results into visual graphs.
In a more detailed way, in this diagram we can see specific tasks for each stage. The first step is related to defining the project objectives: the first step of the lifecycle is to define these objectives. In the second step, we try to acquire and explore data, where we try to collect and prepare all of the relevant data for use in machine learning algorithms. In the third step, we try to build our model. In order to gain insights from your data with machine learning, you must determine your target variable, which is the factor on which you wish to gain deeper understanding. In the fourth step, we try to interpret and communicate the results of the model. Basically, the more interpretable your model is, the easier it will be to meet regulatory requirements and communicate its value to management and other key stakeholders. And finally, the final step is to implement, document and maintain the data science project so that the project can continue to leverage and improve upon its models.
Now we are going to comment on the main libraries, the main modules we have in Python for these tasks. We start with scikit-learn. Scikit-learn is an open source tool for data mining and data analysis and provides a consistent and easy to use API for doing tasks related to preprocessing, training and predicting data. Scikit-learn's main features include classification, regression, clustering, dimensionality reduction, model selection and preprocessing. It provides a range of supervised and unsupervised learning algorithms via a consistent interface, and it provides a lot of algorithms for classification, regression and clustering; for example, for clustering the k-means and DBSCAN algorithms are very useful. It is also designed to work with the Python numerical and scientific libraries like NumPy and SciPy.
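A minimal clustering sketch with the two algorithms just mentioned; the synthetic data and parameter values are illustrative:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
```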
Scikit-learn is a great library to master for machine learning beginners and professionals. However, even experienced machine learning practitioners may not be aware of all the hidden gems of this package which can aid in their tasks significantly. I am going to comment on the most relevant features that we can find in this library. For example, pipelines are very useful to chain multiple estimators: if we have multiple estimators in our pipeline, we can use this feature to chain them. This is useful when we need to fix a sequence of steps in processing the data, for example feature selection, normalization and classification. At this point, the utility function make_pipeline is a shorthand for constructing pipelines: it takes a variable number of estimators and produces a pipeline with those steps. The use of pipelines to split, train, test and select the models and hyperparameters makes it so much easier to keep track of the outputs at all stages, as well as reporting why you chose specific hyperparameters.
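A minimal sketch of make_pipeline chaining feature selection, normalization and classification; the specific estimators and the value of k are illustrative choices:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature selection -> normalization -> classification, in one object
pipe = make_pipeline(SelectKBest(k=5), StandardScaler(), SVC())
# pipe.fit(X_train, y_train); pipe.predict(X_test)
```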
Hyperparameters are basically parameters that are not learned directly within estimators; instead they are passed as an argument to the constructor of the estimator classes. At this point, it's possible to search the hyperparameter space for the best cross-validation score: any hyperparameter provided when constructing an estimator may be optimized in this way. Specifically, we can use the get_params method to find the names and current values of all parameters for a given estimator.
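For example, a minimal sketch of listing an estimator's hyperparameters with get_params and then searching the space with GridSearchCV; the estimator and the parameter grid are illustrative:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

estimator = SVC()
print(estimator.get_params())        # names and current values of all parameters

grid = GridSearchCV(
    estimator,
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
# grid.fit(X_train, y_train); grid.best_params_
```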
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets, and the variance of an estimator indicates how sensitive it is to varying training sets. At this point, it can be helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
At this point, the function validation_curve can help us in this case, returning training and validation scores. If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting, and otherwise we suppose that the estimator is working well.
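A minimal sketch of validation_curve for a single hyperparameter (the SVC gamma range and the iris data set are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_range = np.logspace(-6, -1, 5)
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5
)
# Compare the mean training and validation scores per gamma value
print(train_scores.mean(axis=1), valid_scores.mean(axis=1))
```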
Another interesting feature is one-hot encoding, which is a very common data preprocessing task to transform input categorical features into binary encodings for use in classification or prediction tasks. For example, let us assume that we have two categorical values. In this table we can see that we have one column with yes and no values, and this column is transformed into new columns, one for each category. For example, for the yes value we have the values one and zero, and for the no value we have zero and one in the two new columns created. In an easy way, we can use the OneHotEncoder for applying these transformations.
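A minimal sketch of that yes/no example with OneHotEncoder:

```python
from sklearn.preprocessing import OneHotEncoder

values = [["yes"], ["no"], ["yes"]]
encoder = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn
encoded = encoder.fit_transform(values)

print(encoder.categories_)   # the categories found: 'no' and 'yes'
print(encoded)               # one binary column per category
```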
Scikit-learn also includes random sample generators that can be used to build artificial data sets of controlled size and complexity. It has functions for classification, clustering, regression, matrix decomposition and manifold testing.
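For example, a minimal sketch of generating an artificial classification data set of controlled size and complexity (the sizes are illustrative):

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=2, random_state=0
)
print(X.shape, y.shape)   # (500, 10) (500,)
```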
Another technique that can be useful when we have a large data set and we need to reduce the dimensionality of the data is principal component analysis (PCA). Basically, PCA works by finding the directions of maximum variance in the data and projecting the data onto those directions. The amount of variance explained by each direction is called the explained variance. Explained variance can be used to choose the number of dimensions to keep in a reduced data set. It can also be used to assess the quality of a machine learning model. In general, a model with high explained variance will have good predictive power, while a model with low explained variance may not be as accurate.
In this diagram we can see that we have two independent principal components, PC1 and PC2. PC1 represents the vector which explains most of the information (variance) and PC2 represents less of the information.
In this example, we are following the classical machine learning pipeline, where we will first import libraries and the data set, perform exploratory data analysis and preprocessing, and finally train our models, make predictions and evaluate accuracy. At this point we can use PCA to find the optimal number of features before we train our models. Performing PCA is as easy as following this two-step process: first, we initialize the PCA class by passing the number of components to the constructor, and in the second step we call the fit and transform methods, passing the feature set to these methods. The transform method returns the specified number of principal components that we have in this data set.
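A minimal sketch of that two-step process; the iris data set and the choice of two components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)            # step 1: pass the number of components
X_reduced = pca.fit_transform(X)     # step 2: fit and transform the feature set

print(X_reduced.shape)               # (150, 2)
print(pca.explained_variance_ratio_) # explained variance per component
```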
Another interesting library we can find in Python for tasks related to obtaining statistical data for data exploration is statsmodels. Statsmodels is another great library which focuses on statistical models and can be used for predictive and exploratory analysis. If you want, for example, to fit linear models, do statistical analysis, or maybe a bit of predictive modeling, then statsmodels is great.
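A minimal sketch of fitting a linear model with statsmodels on synthetic, illustrative data:

```python
import numpy as np
import statsmodels.api as sm

x = np.random.rand(100)
y = 2 * x + 1 + 0.1 * np.random.randn(100)   # noisy linear relationship

X = sm.add_constant(x)                       # add the intercept term
model = sm.OLS(y, X).fit()
print(model.summary())                       # coefficients, p-values, R-squared
```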
We continue by commenting on the libraries we have in Python for deep learning. We start with TensorFlow, which is an open source library that is based on a neural network system. This means that it can relate several network data simultaneously, in the same way that the human brain does. For example, it can recognize several words of the alphabet because it relates letters and phonemes. Another case is that of image and text, which can be related to each other thanks to the association capacity of the neural network system. Internally, what TensorFlow does is use tensors for building the neural network. A tensor is basically a mathematical object represented as an array of higher dimensions, and these arrays of data with different sizes and ranks get fed as input to the neural network. TensorFlow has become an entire machine learning ecosystem for all kinds of artificial intelligence technology. For example, here are some features the community has added to the original TensorFlow package: we have TensorFlow Lite for working with smartphone operating systems and IoT devices.
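As a small illustration of the tensors of different ranks just described (the values are illustrative):

```python
import tensorflow as tf

scalar = tf.constant(3.0)                       # rank 0
vector = tf.constant([1.0, 2.0, 3.0])           # rank 1
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank 2

print(scalar.shape, vector.shape, matrix.shape)
```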
Since the TensorFlow 2.0 version, Keras has been adopted as the main API to interact with TensorFlow. The main difference is that TensorFlow works at a low level and Keras at a high level for building neural networks, interacting with TensorFlow through the API that TensorFlow provides. This is maybe the best choice for any beginner in machine learning. It offers an easy way to express neural networks compared to other libraries. Basically, it provides an interface for interacting with TensorFlow in an easy way.
In this code, we can see a little Keras code. Keras, as I commented before, is ideal for rapid experimentation, and the most common way to define your model is by building a graph of layers, which corresponds to the mental model we normally use when we think about deep learning. The simplest type of model is a stack of layers, and you can define such a model using the Sequential API, like we can see in this code. The advantage of this approach is that it's easy to visualize the blocks when building a deep learning model using the different methods and classes that the Keras API provides.
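A minimal sketch of that kind of Sequential stack of layers; the layer sizes and activations are illustrative, not the exact code shown on the slide:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```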
Another interesting library is PyTorch. PyTorch is similar to TensorFlow, and we can use, for example, TensorFlow or PyTorch in all the stages of a machine learning project. We can use PyTorch for getting the data ready, building or picking a pretrained model, fitting the model to the data and making a prediction, evaluating the model, improving it through experimentation, and saving and reloading your trained model. All these stages can be executed with PyTorch.
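A minimal PyTorch sketch of those stages; the data, layer sizes and file name are illustrative assumptions:

```python
import torch
from torch import nn

X = torch.randn(100, 3)                       # get the data ready (synthetic)
y = torch.randint(0, 2, (100, 1)).float()

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))  # build a model
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):                          # fit the model to the data
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "model.pt")    # save and reload the trained model
model.load_state_dict(torch.load("model.pt"))
```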
If we compare the three libraries, as I commented, we can see that Keras works at a high level, normally in conjunction with TensorFlow, while PyTorch works at a low level. At the architecture level, PyTorch and TensorFlow are more complex to use, and Keras is simpler and more readable.
Regarding speed, Keras offers lower performance compared with the others, while TensorFlow and PyTorch offer fast, high-end performance. And regarding trained models, all three libraries offer this feature.
Finally, commenting on Theano: it is a Python library that allows you to define, optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano's main features include tight integration with NumPy, transparent use of GPUs, efficient symbolic differentiation, speed and stability optimizations, dynamic C code generation, and extensive unit testing and self-verification. It provides many tools to define, optimize and evaluate mathematical expressions, and numerous other libraries can be built upon Theano that exploit its data structures. Theano is one of the most mature machine learning libraries, since it provides nice data structures like tensors, like the structure we have in TensorFlow, to represent layers of neural networks, and they are efficient in terms of linear algebra, similar to NumPy arrays. There are a lot of libraries built on top of Theano exploiting its data structures, and as I commented before, it has support for GPU programming out of the box as well.
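A minimal sketch of defining and evaluating a symbolic expression with Theano's classic API; note the library is no longer actively developed, so this is only illustrative of the idea:

```python
import theano
import theano.tensor as T

x = T.dmatrix("x")
y = T.dmatrix("y")
z = x + y                          # define a symbolic expression
f = theano.function([x, y], z)     # compile it (optionally generating C/GPU code)

print(f([[1, 2]], [[3, 4]]))       # [[4. 6.]]
```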
And that's all. Thank you very much for attending this presentation. In this slide you can see my contact details if you want to contact me on social networks like Twitter and LinkedIn, and if you have any question or any doubt, you can use these channels to resolve your questions. Thank you very much.