Transcript
In this presentation, we'll be deep diving into probabilistic programming,
which is a powerful modeling approach that addresses
critical challenges in traditional machine learning and AI techniques.
We'll explore how probabilistic programming enables us
to embrace uncertainty, incorporate expert knowledge, and enhance transparency in decision making. Finally, I will present the implementation of
probabilistic models in Python.
Why do we need probabilistic programming, and what does it offer over other machine learning and AI techniques? So the fundamental challenge
in conventional machine learning and AI techniques is the lack of
uncertainty quantification. These models
typically provide point estimates without accounting for
the uncertainty surrounding their predictions. This limitation
hampers our ability to assess the reliability of the model and undermines our confidence in the
decision making process. The second challenge we
face is that machine learning models are data hungry and often
require correctly labeled data, and these
models tend to struggle with problems where
data is limited. Conventional machine learning and AI techniques also lack a framework to encode expert domain knowledge or prior beliefs into the model. So without the ability to leverage domain-specific insights, the model might overlook crucial nuances in the data and fail to perform to its full potential.
Lastly, machine learning models are becoming more and more complex
and opaque, while the public demands more transparency and accountability for decisions derived from data and AI. So all of this presents a need for a modeling framework that can encode expert knowledge, work with limited data, provide predictions along with their associated uncertainty, and enable models which offer more transparency and explainability. Probabilistic programming emerges as a game changer. To understand probabilistic programming, it is essential to grasp Bayesian statistics and how it differs from the classical frequentist approach. In frequentist
statistics, model parameters are treated as fixed
quantities, and uncertainty in the
parameter estimation is typically addressed through techniques
such as confidence intervals.
However, frequentist methods do not assign probability distributions to parameters, and their interpretation of uncertainty is rooted in the long-run frequency properties of the estimators rather than explicit probabilistic statements about the parameter values. In contrast, in Bayesian statistics, unknown model parameters are treated as random variables and are modeled using probability distributions.
So this approach inherently captures uncertainty within the
parameters themselves, and hence this framework
offers a more intuitive and flexible approach to quantifying
uncertainty. How does Bayesian statistics
work? Bayesian statistical methods use Bayes' theorem to compute and update probabilities as you obtain new data. This is a simple but powerful equation. We start with the prior belief, the prior distribution for the unknown parameter. The likelihood represents the information contained in the observed data. The posterior represents our updated belief about the unknown parameter, which incorporates both prior knowledge and observed evidence. The term in the denominator, the marginal likelihood, is more of a normalizing constant, making sure that the posterior also represents a probability distribution.
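For reference, written out with theta denoting the unknown parameter and y the observed data (notation added here, it is not shown in the transcript), Bayes' theorem is:

```latex
% posterior = likelihood * prior / marginal likelihood (normalizing constant)
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
```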
Now let's look at how inference happens with Bayesian versus non-Bayesian models. We'll start with non-Bayesian inference, and then we'll go to Bayesian inference. In the case of non-Bayesian inference, what we do is determine a point estimate of the unknown parameter which maximizes the likelihood of the data. The likelihood is the probability of the data given the unknown parameter, so we find the parameter which maximizes the likelihood of the evidence, and it comes out as a single point estimate. For a new instance, we then predict using only that point estimate. In the case of Bayesian inference, on the other hand, we start with our prior belief about the unknown parameter, represented here as p(theta), and then we compute the posterior distribution, p(theta given evidence). It is an updated distribution over the unknown parameter, starting from our prior and given the new data set. Now, for a new instance, you compute the probability of the new instance considering the entire posterior distribution rather than a single point estimate.
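In symbols (again, notation added here for reference), the contrast between the two approaches is roughly:

```latex
% Non-Bayesian: a single maximum-likelihood point estimate, plugged in for prediction
\hat{\theta} = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta), \qquad
p(y_{\mathrm{new}}) \approx p(y_{\mathrm{new}} \mid \hat{\theta})

% Bayesian: predictions average over the entire posterior distribution
p(y_{\mathrm{new}} \mid \mathcal{D}) = \int p(y_{\mathrm{new}} \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta
```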
So while this looks simple, in practice it is a lot more complex. The integral here tends to be intractable, especially when we work with a higher-dimensional parameter space, and there is no closed-form solution for the posterior distribution. So what do we do in that scenario? If we can't get a closed-form solution, can we at least get samples from the posterior distribution? If we are able to sample from this posterior, we effectively have the posterior distribution, and we can use those samples to get inference for a new instance along with the associated uncertainty. As we touched on earlier, p(y), the normalizing constant, involves an integral that is generally intractable, so we don't really have a closed-form solution.
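Concretely (notation added here), the troublesome term is the marginal likelihood integral over the parameter space, and the payoff of sampling is that posterior draws let us approximate the predictive integral by a simple average:

```latex
% Marginal likelihood: an integral over a possibly high-dimensional parameter space
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta

% Given S posterior samples \theta^{(1)}, \dots, \theta^{(S)} from p(\theta \mid y):
p(y_{\mathrm{new}} \mid y) \approx \frac{1}{S} \sum_{s=1}^{S} p(y_{\mathrm{new}} \mid \theta^{(s)})
```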
Numerical integration techniques also tend to be too computationally intensive here. So how do we sample from the posterior?
So for this, we rely on a special class of
algorithms called Markov chain Monte Carlo methods,
through which we are able to sample from a probability distribution.
So if we're able to construct a Markov chain that has the desired distribution as its equilibrium distribution, we can obtain samples from the desired distribution by recording states from this Markov chain. There are different MCMC samplers here, such as Metropolis-Hastings, Gibbs sampling and so on, which can help generate samples from this distribution.
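As a rough illustration (not code from the talk), a minimal random-walk Metropolis sampler in Python might look like this, where log_post is assumed to be any function returning the unnormalized log posterior:

```python
import numpy as np

def metropolis_sample(log_post, theta0, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis: draws samples from the distribution whose
    unnormalized log density is given by log_post."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_p = log_post(theta)
    samples = []
    for _ in range(n_samples):
        # Propose a small random move around the current state
        proposal = theta + step * rng.normal(size=theta.shape)
        log_p_prop = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(current));
        # the unknown normalizing constant cancels in this ratio
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
        samples.append(theta.copy())
    return np.array(samples)

# Toy usage: sample from a standard normal via its unnormalized log density
draws = metropolis_sample(lambda t: -0.5 * np.sum(t**2), theta0=np.zeros(1))
print(draws.mean(), draws.std())  # roughly 0 and 1
```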
So now, going back: what is probabilistic programming, or probabilistic modeling?
Probabilistic programming is simply a programming framework for Bayesian statistics. It inherently captures uncertainty within its parameters, so it tends to thrive in a world of uncertainty. And as you define your priors, you build your prior beliefs or expert domain knowledge into the model, so it tends to work well with little data as well. The distributions can be updated as you get more and more new information. And the whole model architecture offers transparency and more explainable models.
Now, a bit about the workflow of probabilistic programming. The first step is to identify all the unknown parameters. We then define the prior distributions, and while defining the priors, we encode our prior beliefs or our expert domain knowledge about the model parameters. Then we specify the likelihood, which is the probability distribution of the observed data as a function of the unknown quantities. And then we can run a suitable MCMC sampler to get the posterior distribution for all of these unknown parameters. Now, for any new instance, instead of point estimates we have the entire distribution for the unknown parameters, and we can utilize that distribution to compute the estimate along with its uncertainty for the new instance.
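Before the demo, here is a minimal sketch of this workflow on a toy coin-flip problem (an illustrative example, not part of the talk's demo). The conjugate Beta prior lets us write the posterior directly; in general, this is exactly the step where an MCMC sampler comes in:

```python
import numpy as np
from scipy import stats

# Steps 1-2: unknown parameter = coin bias theta, prior belief Beta(2, 2)
prior_a, prior_b = 2.0, 2.0

# Step 3: likelihood = Bernoulli; observed data, e.g. 7 heads in 10 flips
flips = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Step 4: posterior. For a Beta-Bernoulli model it is available in closed form,
# Beta(a + heads, b + tails); for non-conjugate models, run MCMC instead.
post = stats.beta(prior_a + flips.sum(), prior_b + (len(flips) - flips.sum()))

# Step 5: use the whole posterior, not a point estimate, for a new instance
theta_draws = post.rvs(size=10_000, random_state=0)
p_next_heads = theta_draws.mean()                    # posterior predictive P(next flip = heads)
interval = np.percentile(theta_draws, [2.5, 97.5])   # uncertainty about theta
print(p_next_heads, interval)
```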
So now, a quick demo on implementing Bayesian models, or probabilistic models, in Python. For my demo, I'm using a data set from a sample of the population in which I have height, weight and gender. Gender here is in binary form, whether the candidate is female or not, and then you have the height and weight of that
candidate. So in a non-probabilistic world, we would try to fit a logistic regression model here, and for the Bayesian model I'll be using Stan. We start with a simple logistic model with the is-female flag as the target, and then we try to find the coefficients of height and weight which best fit our problem.
Right? So in this case, you run a logistic regression model, and then you get the coefficients of that model.
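A minimal, non-Bayesian version of this step might look like the following sketch, assuming scikit-learn and a pandas DataFrame with columns named height, weight and is_female (the file and column names are assumptions, not from the talk):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumed file and column names; the talk's actual data set is not shown here
df = pd.read_csv("heights_weights.csv")
X = df[["height", "weight"]]
y = df["is_female"]

# Plain logistic regression: returns point estimates of the coefficients only
clf = LogisticRegression().fit(X, y)
print("intercept:", clf.intercept_)
print("coefficients (height, weight):", clf.coef_)
```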
And again, you're only getting point estimates; you're not getting any estimate of the range of each parameter or the underlying uncertainty associated with the model coefficients. So the next thing I'm going to do is move on to running a Bayesian model. For the Bayesian model, as we
discussed earlier, we start with defining the unknown parameters.
We define the prior distributions, we define the likelihood, and then we run an MCMC sampler. Stan is its own language, so the first thing I have done is build a Stan model. I'll just give a quick glimpse of that model. Stan has a handful of blocks: data, transformed data, parameters, transformed parameters, model, and generated quantities. Since this is a simple model, I'm just using the data, parameters and model blocks here.
The data block is where you define the structure of your data and the data types; the parameters block is where you define the data types of your unknown parameters. And then we come to the model block. In the model block, the first thing I do is start with my priors: what prior beliefs do I hold about the three coefficients, the intercept, the coefficient for weight, and the coefficient for height. And then here is the part where I have defined the likelihood. Again, the target here is binary, so we fit a Bernoulli logit likelihood. That's it; that's how you define your Stan model.
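The presenter's exact Stan file isn't reproduced in the transcript, but a model along the lines described (data, parameters and model blocks, priors on the three coefficients, and a Bernoulli-logit likelihood) might look roughly like this, written here as a Python string so it can be saved and compiled below. The variable names and prior scales are assumptions:

```python
from pathlib import Path

stan_code = """
data {
  int<lower=0> N;                              // number of people in the sample
  vector[N] height;
  vector[N] weight;
  array[N] int<lower=0, upper=1> is_female;    // binary target
}
parameters {
  real alpha;                                  // intercept
  real beta_height;
  real beta_weight;
}
model {
  // Priors: weakly informative beliefs about the coefficients (assumed scales)
  alpha ~ normal(0, 5);
  beta_height ~ normal(0, 5);
  beta_weight ~ normal(0, 5);
  // Bernoulli likelihood with a logit link
  is_female ~ bernoulli_logit(alpha + beta_height * height + beta_weight * weight);
}
"""

Path("logistic.stan").write_text(stan_code)
```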
Then I can take this model into Python and run the compiler for it. The data which needs to be fed to Stan has to be in the form of a dictionary, so the data is just being transformed accordingly. We run the MCMC sampler, and then you get a series of estimates, or samples, for each of the unknown parameters. You can then look at the mean, median, mode and other centrality measures for your parameters.
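One way to do this compile-and-sample step from Python is sketched below using CmdStanPy (rather than whatever exact interface was shown on screen); the DataFrame and the logistic.stan file from the sketch above are assumed:

```python
import pandas as pd
from cmdstanpy import CmdStanModel

df = pd.read_csv("heights_weights.csv")   # assumed file name, as before

# Stan expects the data as a dictionary keyed by the names in the data block
stan_data = {
    "N": len(df),
    "height": df["height"].to_numpy(),
    "weight": df["weight"].to_numpy(),
    "is_female": df["is_female"].astype(int).to_numpy(),
}

model = CmdStanModel(stan_file="logistic.stan")   # compiles the model
fit = model.sample(data=stan_data, chains=4, iter_warmup=1000, iter_sampling=1000)

# Posterior summaries: mean, median, standard deviation, intervals, R-hat, etc.
print(fit.summary())
```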
And along with that you can also understand, okay, what's the standard deviation,
what's the range? So this gives you a lot more information about
your coefficients rather than just point estimates. And then subsequently, as I perform predictions, instead of considering just a point estimate, I can consider the entire distribution of my parameters, my coefficients here, which can help us get a better prediction and, along with that, the associated uncertainty for the prediction as well.
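The prediction step might look like this sketch: instead of plugging in one coefficient value, push a new candidate's height and weight through every posterior draw and look at the spread of the resulting probabilities (continuing from the CmdStanPy fit above; the example height and weight are made up):

```python
import numpy as np
from scipy.special import expit  # inverse logit

# One value of each coefficient per posterior draw
alpha = fit.stan_variable("alpha")
b_h = fit.stan_variable("beta_height")
b_w = fit.stan_variable("beta_weight")

# A hypothetical new candidate (values made up for illustration)
new_height, new_weight = 170.0, 65.0

# Posterior predictive probability of is_female = 1: one value per draw
p_female = expit(alpha + b_h * new_height + b_w * new_weight)

print("mean prediction:", p_female.mean())
print("95% interval:", np.percentile(p_female, [2.5, 97.5]))
```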
If you go to the Stan website, it has a lot of detailed documentation, and you can find out how to build more complex Stan models as well. So that's about it in terms of my presentation.
Thank you everyone.