# A start to Bayesian Inference


A Frequentist looks at the likelihood and chooses the model that maximizes the probability of the observed data given the model, i.e. P(Data | Model). This approach does not take into account what has happened in the past or what constraints the system should obey. It works well when the current data is a true representation of the domain.

Bayesians are more reserved when it comes to the current data. They don't think there is any universally good model, and consider it best to average over the whole belief space, i.e. to build a model based on both the current data and what has been learned from past data. They either take an average model over the posterior or take the parameter that maximizes the posterior distribution.

Posterior = P(Model | Data) = P(Data | Model) * Prior(Model) / P(Data)

Prior: this is the knowledge you already have about the domain or its parameters. You don't want to blindly believe the current data. The prior is expressed as a probability distribution. Today's posterior can be tomorrow's prior, so the system can keep evolving and eventually reach a steady state.
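As a minimal sketch of "today's posterior can be tomorrow's prior", the snippet below applies Bayes' rule twice over a small discrete belief space. The three candidate models, the uniform starting prior, and the split of the data into two batches are all assumptions made only for illustration.

```python
# Hypothetical discrete belief space: three candidate values for P(heads).
models = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform prior over the candidates


def update(prior, models, heads, tails):
    """Return the posterior over `models` after seeing heads/tails counts."""
    # Unnormalized posterior: likelihood of the data under each model * prior.
    unnorm = [p * m**heads * (1 - m) ** tails for p, m in zip(prior, models)]
    evidence = sum(unnorm)  # P(Data), the normalizing constant
    return [u / evidence for u in unnorm]


# Update on two batches in sequence: the first posterior is the next prior.
after_batch1 = update(prior, models, heads=3, tails=0)
after_batch2 = update(after_batch1, models, heads=2, tails=1)

# Updating once on all the data should give the same beliefs.
all_at_once = update(prior, models, heads=5, tails=1)
print(after_batch2)
print(all_at_once)
```

Both printed lists should agree up to floating-point rounding, which is why a posterior can safely be carried forward as the next prior.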

**A Coin Problem to Put Things into Perspective**

We look at a coin toss problem in which, in 6 independent throws, heads comes up 5 times and tails once. We now look at what the estimate for the probability of heads would be in the scenarios below:

a) You are a Frequentist: your estimate would be 5/6, based on the maximum likelihood method that you do in your head. You don't have any prior beliefs.

b) You are a Bayesian

A Bayesian always looks at the current data but also holds a prior belief about the event, and adjusts that prior belief to an extent based on the current data.

So as a Bayesian, the estimate would be not as high as 5/6 and not as low as 1/2 but somewhere in the middle, because the prior belief of probability 1/2 has been adjusted using the current data to give something between 1/2 and 5/6.

Now let's do the maths for the Frequentist as well as for the Bayesian:

Frequentist:

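Each coin toss can be treated as an independent Bernoulli trial with probability θ of heads. We estimate θ by maximizing the likelihood, i.e. we take the derivative of the likelihood with respect to θ and set it to zero.

$$L(\theta) = P(\text{Data} \mid \theta) = \theta^5 (1 - \theta)$$

$$\frac{dL}{d\theta} = 5\theta^4 - 6\theta^5 = \theta^4 (5 - 6\theta) = 0 \;\Rightarrow\; \theta = \frac{5}{6}$$

This is how the estimate of 5/6 ≈ 0.833 arises in the Frequentist approach.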
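The maximum likelihood estimate can also be checked numerically. The sketch below simply evaluates the likelihood of 5 heads and 1 tail on a grid of candidate values; the grid size and the name `theta_mle` are illustrative choices, not part of the original derivation.

```python
def likelihood(theta, heads=5, tails=1):
    """Bernoulli likelihood of the observed tosses for a given P(heads)."""
    return theta**heads * (1 - theta) ** tails


# Evaluate on a fine grid over [0, 1] and pick the maximizer.
grid = [i / 6000 for i in range(6001)]
theta_mle = max(grid, key=likelihood)

print(theta_mle)  # grid maximizer
print(5 / 6)      # closed-form estimate: heads / (heads + tails)
```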

Bayesian :

For a Bayesian the method is not so straightforward, since along with the likelihood one has to take the prior probability distribution for θ into consideration. The selection of the prior is subjective and depends on the individual's domain knowledge. Several kinds of priors can be chosen. However, if one deals with the real world and believes in the symmetry of the coin, then a good prior would place the maximum probability of the parameter θ at θ = 1/2. A nice prior distribution for θ in this case is the Beta distribution with α = 2 and β = 2. So the prior is P(θ) = Beta(2, 2), i.e. P(θ) = 6 θ (1 − θ).

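So the Bayesian posterior would look as follows:

Posterior P(θ | Data) = Likelihood × Prior / Evidence = P(Data | θ) * P(θ) / P(Data). The evidence, i.e. the probability of the data, does not depend on θ, so for the estimation problem it can be treated as a constant, and hence

$$P(\theta \mid \text{Data}) \propto \theta^5 (1 - \theta) \cdot \theta (1 - \theta) = \theta^6 (1 - \theta)^2$$

On normalizing the posterior to have a total probability of 1 we get a constant of 252, and the posterior becomes

$$P(\theta \mid \text{Data}) = 252\, \theta^6 (1 - \theta)^2$$

which is a Beta(7, 3) distribution.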
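The normalizing constants can be verified with exact arithmetic. This is a small sketch that uses the standard identity B(a, b) = (a−1)! (b−1)! / (a+b−1)! for positive integers; the helper name `beta_function` is made up for this example.

```python
from fractions import Fraction
from math import factorial


def beta_function(a, b):
    """B(a, b) for positive integers a, b: (a-1)! (b-1)! / (a+b-1)!"""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


# The integral of theta^6 (1 - theta)^2 over [0, 1] is B(7, 3).
posterior_constant = 1 / beta_function(7, 3)

# The same identity gives the constant of the Beta(2, 2) prior.
prior_constant = 1 / beta_function(2, 2)

print(posterior_constant)  # 252
print(prior_constant)      # 6
```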

Now we can estimate θ through two Bayesian approaches:

a) Determine the θ that maximizes the posterior, commonly referred to as the MAP (maximum a posteriori) estimate. For that we differentiate the posterior with respect to θ and set the derivative to zero,

i.e. d[P(θ | Data)]/dθ = 0 => θ = 3/4 = 0.75
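For reference, the differentiation step can be written out as follows, working with the unnormalized posterior for 5 heads and 1 tail under a Beta(2, 2) prior (the normalizing constant does not change where the maximum lies):

$$\frac{d}{d\theta}\left[\theta^6 (1 - \theta)^2\right] = 6\theta^5 (1 - \theta)^2 - 2\theta^6 (1 - \theta) = 2\theta^5 (1 - \theta)(3 - 4\theta)$$

Inside the interval (0, 1) this vanishes only when 3 − 4θ = 0, which gives θ = 3/4.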

b) The other approach is to average over the uncertainty and take the mean value of θ, i.e. the expectation of θ given the data:

$$E[\theta \mid \text{Data}] = \int_0^1 \theta \, P(\theta \mid \text{Data}) \, d\theta$$

Computing the integral gives the estimate θ = 0.7.
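The integral can be approximated numerically as a sanity check. The sketch below assumes the normalized coin posterior 252 θ^6 (1 − θ)^2 and uses a plain midpoint rule; the helper `integrate` and the number of slices are illustrative choices.

```python
def posterior(theta):
    """Normalized posterior density 252 * theta^6 * (1 - theta)^2."""
    return 252 * theta**6 * (1 - theta) ** 2


def integrate(f, n=10_000):
    """Midpoint-rule integral of f over [0, 1] with n equal slices."""
    h = 1 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h


total_mass = integrate(posterior)                  # should be close to 1
post_mean = integrate(lambda t: t * posterior(t))  # E[theta | Data]
print(total_mass, post_mean)
```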

Generally, taking the mean parameter value under the posterior is preferred over the MAP method, since the MAP estimate may be misleading if the posterior is bimodal.

A Frequentist might say "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule :)" But in reality it works, since decisions based on prior beliefs combined with the current data give a relatively better estimate than just blindly believing the data. If the observed data is scarce and not representative of the overall population, the Frequentist estimate can go haywire.
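To make the bimodal case concrete, here is a small sketch with an entirely hypothetical posterior: a tall narrow spike that carries about 20% of the probability mass plus a broad bump that carries the rest. The mixture weights, locations, and widths are invented for illustration and have nothing to do with the coin problem.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def density(theta):
    """Hypothetical bimodal posterior: a narrow spike plus a broad bump."""
    return 0.2 * normal_pdf(theta, 0.2, 0.01) + 0.8 * normal_pdf(theta, 0.7, 0.1)


n = 10_000
grid = [(i + 0.5) / n for i in range(n)]
weights = [density(t) for t in grid]
total = sum(weights)

theta_map = grid[weights.index(max(weights))]  # location of the tallest point
theta_mean = sum(t * w for t, w in zip(grid, weights)) / total
# Share of the probability mass that lies close to the MAP estimate.
mass_near_map = sum(w for t, w in zip(grid, weights) if abs(t - theta_map) < 0.05) / total
print(theta_map, theta_mean, mass_near_map)
```

In this made-up case the MAP estimate sits on the spike even though most of the mass lies elsewhere, which is the sense in which it can mislead.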

Below are the graphs for the likelihood, prior and Posterior for the same coin problem for better illustration -

The diagrams are in keeping with the estimates that we have derived before.
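The three curves can be recomputed with a few lines of code. The sketch below evaluates the Beta(2, 2) prior, the likelihood of 5 heads and 1 tail, and the Beta(7, 3) posterior on a grid and reports where each one peaks; the grid size is arbitrary, and the tabulated values could be handed to any plotting library.

```python
def prior(theta):
    return 6 * theta * (1 - theta)            # Beta(2, 2) density


def likelihood(theta):
    return theta**5 * (1 - theta)             # 5 heads, 1 tail


def posterior(theta):
    return 252 * theta**6 * (1 - theta) ** 2  # Beta(7, 3) density


grid = [i / 1200 for i in range(1201)]
curves = {"prior": prior, "likelihood": likelihood, "posterior": posterior}

# Location of the maximum of each curve on the grid.
peaks = {name: max(grid, key=f) for name, f in curves.items()}
for name, peak in peaks.items():
    print(f"{name:10s} peaks at theta = {peak:.3f}")
```

The posterior peak should land between the prior peak at 1/2 and the likelihood peak at 5/6.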

__Conjugate Priors -__

Let's talk a little about conjugate priors:

- Conjugate priors are priors that, when combined with the likelihood, produce a posterior in the same family of probability distributions as the prior. In such cases the prior and the posterior are called conjugates, and the prior is called the conjugate prior for the likelihood.
- In the coin problem the posterior and the prior are conjugates. The conjugate prior for a likelihood based on the Binomial distribution or Bernoulli trials is a Beta distribution.
- If you relate this to your mathematics, conjugate priors act like eigenfunctions of the likelihood operator 🙂
- A conjugate prior makes life easy, as it's easy to compute the mean of the parameter and other statistics when the posterior has the form of a known distribution.
- We can't choose a conjugate prior just for mathematical convenience if it's not a good enough approximation to our prior beliefs.
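For the Beta prior with a Bernoulli or Binomial likelihood, the conjugate update reduces to adding counts: Beta(α, β) becomes Beta(α + heads, β + tails). A minimal sketch follows; the function names are made up for this example, and the mode formula assumes both parameters exceed 1.

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin tosses."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); valid when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


# Coin problem: Beta(2, 2) prior, 5 heads and 1 tail observed.
a_post, b_post = beta_binomial_update(2, 2, heads=5, tails=1)
print(a_post, b_post)             # 7 3
print(beta_mean(a_post, b_post))  # 0.7
print(beta_mode(a_post, b_post))  # 0.75
```

No integration is needed here, which is the convenience the bullets above describe.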

**Sampling from the Posterior -**

It's not always possible to find a conjugate prior for a particular likelihood. Also, as discussed earlier, even if a conjugate prior exists it might not reflect the Bayesian's belief, and a more meaningful prior may exist. In such scenarios we need to sample from the posterior distribution to compute the mean: we generate random samples from the posterior and take their average to get the mean parameter estimate.

Sampling from complicated posteriors that don't follow any standard probability distribution is hard to implement through traditional methods. High dimensionality of the model parameters makes it even harder (the curse of dimensionality, along with correlation between parameters).

Markov Chain Monte Carlo (MCMC) methods and their variants are among the advanced methods that can be used to sample from complicated probability distributions. A few MCMC techniques are listed below:
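The idea of estimating the mean from random draws can be sketched with the coin posterior. Beta(7, 3) is used here only because its true mean of 0.7 is known, so the result can be checked; the seed and sample size are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw samples from the coin posterior Beta(7, 3) and average them.
samples = [rng.betavariate(7, 3) for _ in range(50_000)]
estimate = sum(samples) / len(samples)
print(estimate)
```

For a posterior without a standard form there is no ready-made sampler like `betavariate`, which is where more general sampling methods come in.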

- Metropolis-Hastings algorithm

- Gibbs Sampler

- Hamiltonian Monte Carlo (a state-of-the-art method, inspired by the classical mechanics of motion in phase space)

In my next article I'll discuss Metropolis-Hastings and Hamiltonian Monte Carlo in detail.
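As a rough preview, here is a minimal random-walk Metropolis sketch (the special case of Metropolis-Hastings with a symmetric proposal) applied to the unnormalized coin posterior θ^6 (1 − θ)^2. The step size, starting point, burn-in length, and seed are untuned assumptions chosen only to keep the example short.

```python
import math
import random


def log_target(theta):
    """Log of the unnormalized coin posterior theta^6 * (1 - theta)^2."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return 6 * math.log(theta) + 2 * math.log(1 - theta)


def metropolis(n_samples, step=0.2, start=0.5, seed=0):
    rng = random.Random(seed)
    theta, samples = start, []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)  # a rejected proposal repeats the current value
    return samples


chain = metropolis(50_000)
kept = chain[1_000:]  # discard an assumed burn-in period
print(sum(kept) / len(kept))
```

Only ratios of the target appear in the acceptance step, so the normalizing constant of the posterior is never needed. The printed average should land near the exact posterior mean of 0.7.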