Transcript
Hi and welcome to my talk. My name is Ivan Popov and
today I will be talking about balancing speed and accuracy in
model development. In the beginning,
a couple of words about myself. I'm a data scientist at a fintech company based in London.
I have three years of machine learning experience in the fintech
and computer vision sectors. I also have extensive experience
as a data engineer and data analyst, and I have also completed projects
as a DevOps engineer. Earlier in my career I
created an online service and grew it to 80,000 users.
So in today's talk, I will be talking about the
two main factors when you develop a machine learning model,
speed and accuracy, and how we can balance them.
I will talk to you about how you can identify which
things to focus on in your model and how
to optimize your model once you have identified them.
So when we talk about model performance, we usually think of
model accuracy, how well it can make predictions.
However, there is another angle: the speed with which the model
produces its predictions. The main factors that impact the
model accuracy and speed are the complexity of the model
architecture, the amount and quality of input data,
and the hardware. Please note that when I say
accuracy, I don't just mean the percentage of correct predictions.
I use it as an umbrella term for all metrics
such as F1 score, ROC AUC, IoU, et cetera.
In an ideal world, we want a model that has 100% accuracy
and can return the result in a nanosecond,
but in reality we have to balance accuracy and speed to achieve the best
value for the business. In today's talk, I'm going to give real world
examples of this balance and provide step-by-step instructions
on how to identify your model's needs, as well as ways to optimize
it. In some situations, speed is not
the most important factor, for example, in academic research. In that
case, the priority is finding a state-of-the-art model that
can push the boundaries of science and advance the field of
machine learning. However, when creating a model for commercial
purposes, it is important to consider the experience of
the end user and their satisfaction. In today's
fast paced world, people have shorter attention spans and are not
willing to wait for more than a few seconds for a page
to load. So your model must be able to quickly return
results to keep the customer engaged on your web page or in your
app. Accuracy should not be compromised entirely for
speed, as reliable and trustworthy predictions
are essential to gain customers' trust in your product or service.
So let me provide some real-world examples to give you some context.
In the loan industry, when a person looks for a loan on
an aggregator website, the loan providers must return a quote within a few
seconds; otherwise, their offer won't be shown. In this case, the speed
is prioritized because this is usually not the final offer,
and the underwriters can later review the case in more detail to make a
final decision. But when it comes to ecommerce,
instant recommendations require
a stricter balance between speed and accuracy. A system that recommends products
too slowly may cause customers to lose interest or seek recommendations elsewhere,
while a system that recommends irrelevant products may result in
poor customer experiences and lost sales.
Imagine a heavy metal fan getting Taylor Swift tickets as a recommendation.
That would be hilarious, but not for the ticket website.
Medical diagnosis models are an excellent example where
accuracy is more crucial than speed. Doctors usually spend
considerable time examining and analyzing the outcomes
before making a diagnosis. Therefore, the model can take
more time to provide results as long as the accuracy is not compromised.
As with many other things in life, the problem of balancing accuracy
and speed can be solved with money. Investing in better
hardware, such as CPUs and GPUs, can improve the
inference speed without sacrificing accuracy. However, it is important
to carefully weigh the cost and benefit of each component before making a decision.
Sometimes investing a large amount of money in hardware may
only yield a small speed improvement. Additionally, as budgets
are typically limited, there are only a few options for hardware upgrades.
And again, like with many other things in life, not every problem can be
solved with money. Upgrading hardware can certainly improve
the performance of a model, but it won't fix issues that
stem from poor data quality or feature selection.
The accuracy of a model is heavily reliant on the quality of the
data it's trained on. Furthermore, a model's architecture can also
impact its accuracy and efficiency. If the architecture is
too complex or simple, the model can suffer from overfitting or underfitting,
respectively. This can result in slow inference times and poor
accuracy, even with high-end hardware. The choice
of algorithm or learning method used can also
impact a model's efficiency. Some algorithms
may be inherently slow or may perform better only on certain types of
data. For example, using a fully connected network for image segmentation
may not be the best choice. It can be impractical
due to the large number of parameters involved. In an image,
every pixel is a feature, and in a fully connected network, each neuron
in one layer is connected to every neuron in the next layer,
leading to a very high number of connections and parameters. For example, a 224x224 RGB image
has 150,528 input features, so a single fully connected layer with 1,000 neurons
already needs over 150 million weights.
This can result in a computationally expensive and memory-intensive
model, making it difficult to train and prone
to overfitting. So mastering the balance between
model speed and accuracy can serve as a significant competitive
advantage for your company. By determining which aspect
is more crucial in your case and investing wisely in
optimization technologies and techniques, you can fine tune your model to deliver
the best output for the end user. This will give the business
the flexibility to succeed in a fiercely competitive market.
How do you identify your model's needs? You need to understand your business
objectives. The first step in understanding how to optimize your
model is to align model performance with business goals
and objectives. So you need to answer the questions: What is
the purpose of the model? Is it for internal users or
is it customer facing? What are the desired outcomes
of the model? Is it to increase revenue, reduce costs,
improve customer satisfaction, or something else?
What are the key performance indicators, or KPIs,
that the business is tracking? How does the model fit into those KPIs
and who are the end users of the model? What are their
expectations and needs? Let me give you the two
main scenarios for using ML models: customer facing
and internal. In customer facing
applications, speed is often more critical than
accuracy. For example, in an ecommerce application,
a recommendation engine that takes too long to recommend products can lead to
customers losing interest and seeking recommendations elsewhere.
Similarly for online chatbots, which have become more and more popular
due to ChatGPT and similar models.
For online chatbots, speed is critical, as customers expect
quick responses and don't want to wait too long for a chatbot
to answer. For internal analytics, on the other hand,
accuracy is often more critical than speed.
Financial forecasting accuracy is crucial for making
informed business decisions and in supply chain management,
accurate predictions are necessary to optimize inventory management.
In those scenarios, you can spend a longer time waiting for
the result because the model can run overnight and
you have a lot more time to get the correct answer.
So let's go
from more general things to the actual things
you can do. First and foremost,
get yourself a good data set with quality data and good labels. Of course,
this mainly applies to supervised learning, but commercial models
are usually supervised. The more data you get,
the better, as long as you can ensure its quality.
Let's say you're working on a model that classifies handwritten
digits. A good data set is a data set of handwritten
digits that includes samples from multiple writers
and different writing styles. It should also have a balanced distribution of
digits, meaning that each digit occurs roughly the same number of
times, and all images should have a clear label associated with them.
A bad data set in this case would be one that only includes
handwritten digits from a single writer, because then the model would be
biased towards the writing style of that particular writer and would not be
able to generalize well to other handwriting styles,
or one that is missing certain digits or labels for
the images. It is always better to have a smaller good-quality
data set than a larger bad-quality one, because you
can always use data augmentation to generate more data
from the data you already have.
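As a quick illustration, here is a minimal augmentation sketch for handwritten digit images, assuming torchvision is available; the specific transforms and parameters are just an example, not a recommendation.

```python
from torchvision import datasets, transforms

# Small random rotations and shifts produce new, realistic-looking digit
# images from the samples you already have.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
])

train_set = datasets.MNIST(root="data", train=True, download=True, transform=augment)
```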
Good raw data alone is not enough to ensure a good model. The data needs to be processed
to fit your model. This step includes data
cleaning, such as removing redundant data and null values; data normalization,
such as tokenization, stopword removal, and embedding in
NLP; and feature generation, such as aggregations,
one-hot encoding, and finding trends like recurring transactions
in financial data.
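To make this concrete, here is a minimal pandas sketch of the kind of steps I mean; the column names and data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [12.5, None, 7.0, 12.5, 30.0],
    "city":   ["London", "Paris", "Berlin", "London", "Paris"],
})

df = df.drop_duplicates()                   # data cleaning: remove redundant rows
df = df.dropna()                            # data cleaning: drop rows with null values
df["amount_norm"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()  # normalization
df = pd.get_dummies(df, columns=["city"])   # feature generation: one-hot encoding
```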
Data preprocessing is a part of model development where a lot of code is
written, and it is also one of the biggest sources
of inefficiencies in the model. Of course, when you
preprocess data for training, it won't impact the
model's speed, but remember that the data used for inference
must also undergo the same preprocessing steps.
So how do we find inefficiencies in data preprocessing?
Well, the simplest way is to use the time module in Python. You just
surround parts of your code with timing calls and see how quickly they run.
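For instance, a minimal sketch of that manual approach, with a toy preprocessing function standing in for your own code:

```python
import time

def preprocess(rows):
    # stand-in for your own preprocessing code
    return [r.strip().lower() for r in rows]

raw_data = ["  Foo ", " BAR ", " baz "] * 100_000

start = time.perf_counter()          # start the timer
cleaned = preprocess(raw_data)
elapsed = time.perf_counter() - start
print(f"preprocess took {elapsed:.3f} seconds")
```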
But what if you have a large code base and your data preprocessing
is spread across multiple classes and files? You can't surround
all functions with timing calls; it would be very tedious and messy.
Luckily, there are out-of-the-box solutions such as Python's
built-in cProfile and Yappi.
Yappi is a profiler that is written in C,
it's super fast, and most importantly, it lets you profile asynchronous
code. It is my profiler of choice. Here is an example
of the basic usage of Yappi,
where foo is a function you want to profile.
It can be a class method or anything more
sophisticated. The best thing is that
you will see all of the functions in all
of the files that are called when this function is executed.
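A minimal sketch of that basic usage, with foo as a placeholder for whatever you actually want to profile:

```python
import yappi  # pip install yappi

def foo():
    # placeholder for the code you want to profile
    return sum(i * i for i in range(1_000_000))

yappi.start()                          # start collecting profiling data
foo()
yappi.stop()                           # stop collecting
yappi.get_func_stats().print_all()     # per-function stats: name, ncall, tsub, ttot, tavg
yappi.get_thread_stats().print_all()   # per-thread stats
```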
Let's go over some basics when it comes to
Yappi. First of all, let's understand the difference between
the clock types you can use. Clock types can
be CPU time or wall time. CPU time, or process
time, is the amount of time for which the CPU
has been used for processing instructions of a computer program
or operating system, or in our case a function, as
opposed to elapsed time, which includes, for example,
waiting for input/output operations or entering
low-power (idle) mode. Wall time is the actual time taken from the start
of a computer program to the end. In other words, it is the difference
between the time at which a task finishes and the time at which the
task started. When you're profiling asynchronous
code, you should use the wall time.
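For example, here is a small sketch of profiling a coroutine with the wall clock; the fetch coroutine is a made-up stand-in that just sleeps instead of doing real I/O:

```python
import asyncio
import yappi

async def fetch():
    await asyncio.sleep(0.2)   # stands in for waiting on I/O

yappi.set_clock_type("wall")   # wall time, so the time spent awaiting I/O is counted
yappi.start()
asyncio.run(fetch())
yappi.stop()
yappi.get_func_stats().print_all()
```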
Then at the bottom, in green, you can see the simple
output of Yappi. It includes the function
name with the function's file.
For more sophisticated programs, it will
have a lot more functions
and files in there. Then you will see ncall,
the number of function calls, that is, how many
times this function has been called. It's a great way to see if
some function has been called a lot more times than you would expect.
Then you would know that maybe there is a way to optimize it.
tsub is the time spent in the function excluding
subcalls. If the function calls
other functions inside of it, tsub
will not count that time. So if tsub is
big, either you have a problem there, or it means that the
function simply doesn't call other functions. And then ttot is
the total time spent in the function,
including subcalls. Obviously, if it's a function like main,
it will have a very large total time. But then you need to
go and look at all of the
functions that are inside main to see which
one takes the most time. By using a profiler you can
get a complete overview of how your code is running and which parts of it
are the slowest. This is the quickest and simplest way of finding bottlenecks
in your code. Let's go and see some examples of inefficiencies
that often happen in data preprocessing.
It's no secret we all use pandas. It's great for
data analysis and data preprocessing, and one
of the most useful functions of
pandas is apply.
However, it's not the most efficient. While pandas
itself is a great package, apply does not take
advantage of vectorization. Also, apply returns
a new Series or DataFrame object, so with a very large DataFrame
you get considerable input/output overhead.
One way to solve this is, instead
of using apply, to use NumPy vectorization,
especially if you are just
performing an operation on a single column
of the DataFrame. Alternatively, if
the operation is simple, like multiplying a column by
two, it can be done with the built-in vectorized operators.
Also, if you want to apply a function to multiple columns of
a pandas DataFrame, try to avoid the axis=1
form of apply; instead, write
a standalone function that takes multiple NumPy
arrays as inputs and use it directly on the
values attribute
of the pandas Series.
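Here is a small sketch of the difference on a made-up DataFrame with two numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty":   np.random.randint(1, 10, 1_000_000)})

# Slow: apply with axis=1 calls a Python function for every single row
df["total_slow"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: a standalone function working directly on the underlying NumPy arrays
def total(price, qty):
    return price * qty

df["total_fast"] = total(df["price"].values, df["qty"].values)

# For something simple, like multiplying a column by two,
# the built-in vectorized operator is all you need
df["price_doubled"] = df["price"] * 2
```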
Sometimes you perform calculations more times than needed. For example, you may have metadata in your data
set, like gender, city, or car type, and you
may be performing a calculation for
every single data point while you only need to
perform it once per group.
So you should consider using groupby and
filtering in pandas and performing the calculation only once per group.
This could significantly improve the speed of your data preprocessing.
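A short sketch of that idea, with a made-up per-city statistic:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["London", "Paris", "London", "Berlin", "Paris"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Compute the statistic once per group and broadcast it back to the rows,
# instead of recomputing it for every single data point.
df["city_mean_amount"] = df.groupby("city")["amount"].transform("mean")
```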
Finally, wherever possible, it's best
to use NumPy instead of pandas. While pandas is
very user friendly and intuitive, NumPy is written in C
and is the champion when it comes to efficiency. Feature selection
is essentially the final step of data pre processing and it has
a very large impact on the accuracy and speed of a model.
As the name suggests, it's the process of
determining which features in a data set are most relevant to
the output. Your first instinct may be to
take all of the data and throw it in the model because just a minute
ago I told you the more data the better,
but I was talking about the number of data points, not the data
from each particular data point. By selecting
the most important features and removing irrelevant
ones, you can simplify your model and reduce the risk of
overfitting. This not only improves the accuracy of the model, but also
makes it more efficient and less complex, which can
be critical for real world applications where time and
resources are limited. So one of the
methods I use for feature selection is SHapley
Additive exPlanations, or SHAP values. They are
a way to explain the output of any machine learning model.
They use a game-theoretic approach that measures each player's,
or in machine learning each feature's, contribution to the outcome.
Each feature is assigned an importance value representing
its contribution to the model's output. Features with
positive SHAP values positively impact the prediction,
while those with negative values have a negative impact.
The magnitude is a measure of how strong the effect is.
When I say positive or negative, I don't mean good or bad, I just mean
plus or minus. SHAP values are model agnostic,
which means they can be used to interpret any machine learning model,
including linear regression, decision trees, random forests,
gradient boosting models, and neural networks, so they are universal.
Obviously, for more complex architectures it is
harder to calculate them, as the number of
calculations increases. So even though they can be used
for neural networks, for example, they work
best for simpler models like gradient-boosted trees.
SHAP values are particularly useful for
feature selection when dealing with high-dimensional, complex data
sets. By prioritizing features with high SHAP
values, both positive and negative (we are
looking at magnitude here), you can streamline the model by removing less impactful
features and highlighting the most influential ones.
You can make the model simpler and faster without sacrificing the accuracy.
This method not only enhances model performance, but also
helps to improve the explainability of a model. It also
helps you understand the driving forces behind predictions,
making the model more transparent and trustworthy.
You can say that using SHAP values for feature selection
is a form of regularization, and you will not be
wrong. That's pretty much it.
What's best, SHAP values do not change when the model
changes unless the contribution of the feature changes.
This means that SHAP values provide a
consistent interpretation of the model's behavior even when
the model architecture or parameters change.
You do not need to study game theory to calculate SHAP values.
All the necessary tools can be found in the shap package in Python,
and using it you can calculate SHAP values and visualize feature importance,
feature dependence, force plots, and decision plots.
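As an illustration, here is a minimal sketch with a small XGBoost model on made-up data; the feature set and model choice are just assumptions for the example.

```python
import numpy as np
import shap
import xgboost as xgb

# Made-up data: five features, where the last two are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=1_000)

model = xgb.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast SHAP values for tree-based models
shap_values = explainer.shap_values(X)

# Rank features by mean absolute SHAP value (magnitude, not sign).
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1])     # feature indices, most impactful first

shap.summary_plot(shap_values, X)       # visualize feature importance
```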
And for example, the visualization you see on the slide right now is produced
directly with it. When we talk about machine learning
models today, we usually talk about LLMs
and transformers, and while they are great at many tasks,
most businesses don't need such sophisticated architectures for
their purposes, especially because LLMs are
very expensive to train and maintain. Most
tasks, even today, can be easily handled
using much simpler models such as gradient-boosted trees
like XGBoost and LightGBM.
XGBoost and LightGBM are two of the most popular gradient
boosting frameworks used in machine learning. Both are
designed to improve the speed and accuracy of machine learning models.
XGBoost is known for its scalability and flexibility, while LightGBM
is known for its high-speed performance. XGBoost is a well-established
framework with a large user base, and LightGBM is
relatively new but has gained popularity due
to its impressive performance.
XGBoost has been widely used since its release in 2014.
It is flexible because it can handle multiple input data types
and works well with sparse data. XGBoost has
an internal data structure called DMatrix that
is optimized for both memory efficiency and training speed.
You can construct a DMatrix from multiple different sources of data.
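For example, a quick sketch of building a DMatrix from a few common sources, on made-up data:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy import sparse

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

dtrain_np = xgb.DMatrix(X, label=y)                      # from a NumPy array
dtrain_df = xgb.DMatrix(pd.DataFrame(X), label=y)        # from a pandas DataFrame
dtrain_sp = xgb.DMatrix(sparse.csr_matrix(X), label=y)   # from a SciPy sparse matrix
```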
XGBoost also has a regularization feature that helps prevent
overfitting, a common problem in machine learning.
However, XGBoost can be slower than other models when dealing
with larger data sets. This is because when training
a gradient-boosted tree model, XGBoost uses level-wise
tree growth, where it grows the tree by one level
for each branch, in what is known as depth-wise
order. This will usually result in more leaves and
therefore more splits, and thus more computational overhead.
As you can see on the diagram, for each
leaf on a level, it will grow the tree further, even
if it's not needed there. On the other hand, LightGBM is
known for its lightning-fast performance. This is because when training a gradient-boosted
tree model, LightGBM uses leaf-wise tree
growth, where it grows the tree in
best-first order, constructing the splits for
each branch until the best split is reached.
LightGBM is designed to handle large data sets efficiently and
in certain cases can be much faster than XGBoost.
LightGBM also has a feature that allows it to handle categorical
data efficiently, which is a significant advantage over XGBoost,
and it also has built-in regularization support to prevent overfitting.
However, LightGBM can be more memory intensive,
which can be a problem when dealing with larger data sets with
limited memory resources. There is no obvious way to choose
one model over the other, so you'll have to experiment
and decide based on the results. Fortunately, both models can be
set up and trained very quickly, so you can get the testing swiftly
out of the way and move on to optimizing the model.
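As a rough sketch of how quickly you can put the two head to head, using their scikit-learn-style APIs on a made-up data set:

```python
import time
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(50_000, 20))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Train both models on the same split and compare wall time and accuracy.
for model in (XGBClassifier(n_estimators=200), LGBMClassifier(n_estimators=200)):
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: trained in {elapsed:.1f}s, accuracy {acc:.3f}")
```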
So let's make a quick recap of today's talk. Balancing speed
and accuracy often depends on the context or the
field of application. We may need quick results in user-focused
applications and, on the other hand, require the highest degree
of accuracy in fields like medicine. To identify your model's needs,
align its purpose with business goals, define desired outcomes,
and consider KPIs. Tailor the model to meet
user expectations, prioritizing speed for customer-facing applications
and accuracy for internal analytics. But look at your
particular case: sometimes in internal applications
you also need speed, and in customer-facing applications you need accuracy.
Make sure you acquire a robust data set with quality
data and accurate labels. The quantity of data is important,
but the emphasis should always be on maintaining its quality.
Consider experimenting with simpler models like XGBoost or LightGBM
instead of complex architectures like LLMs and transformers.
These frameworks, known for enhancing both speed and
accuracy, can be suitable for quite a variety of use cases.
And when you have a simple model that makes accurate
predictions but works too slowly, look for the bottlenecks
in your code using profilers such as cProfile or Yappi;
the most frequent place for bottlenecks is the data preprocessing step.
Thank you for joining me today and I hope you find this
talk useful. Hopefully see you in the future.