Markov Chain Monte Carlo

A note on interpreting MCMC.


1. Goal of MCMC

1) Basic Probability and Inference

  • In machine learning, the problems we care about most are training and prediction, and both involve statistical inference. Suppose we want to predict $X_{i}$ from $X_{i-1}$, i.e. $P(X_{i}|X_{i-1})$. The first step is to estimate the parameters $\theta$ from the previous data $X_{i-1}$, i.e. $P(\theta|X_{i-1})$, and the second step is to estimate $X_i$ given $\theta$, i.e. $P(X_{i}|\theta)$. Formally, we call $P(\theta|X_{i-1})$ the posterior and $P(\theta)$ the prior; by Bayes' rule, $P(\theta|X_{i-1}) \propto P(X_{i-1}|\theta)P(\theta)$. Our goal is to estimate the prior and the posterior. But wait, it is often impossible to obtain these conditional probabilities analytically. Here comes a method: instead of computing them, we draw samples from them.

2) What is sampling?

  • Sampling means generating i.i.d. data from a given distribution. The simplest case is to generate $X_i \sim Uniform([0,1])$: in Python, np.random.uniform(0,1,1000) automatically generates 1000 such values, and similarly np.random.randn(1000) generates 1000 standard normal values. But how does the computer work behind these functions? A general method is Inverse Transform Sampling. Suppose $X_i \sim p(x)$ with c.d.f. $F(x)=Pr(X \le x)$, and the inverse function $F^{-1}$ of the c.d.f. exists. Then we do the following: i) generate a uniform random variable $u \sim Unif([0,1])$; ii) calculate $X = F^{-1}(u)$; iii) repeat $n$ times to obtain $n$ i.i.d. samples $X_i$ that follow the distribution $p(x)$ (a code sketch is given after the next bullet).

  • However, this method has several defects: the data may be discrete, the c.d.f. may have no closed-form inverse, and in high dimensions it may be impossible to integrate $p(x)$ at all. Hence we develop other sampling methods.
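  • As a concrete illustration of the inverse transform recipe above, here is a minimal Python sketch. The exponential target and the function name sample_exponential are my own illustrative choices (not from this note), picked because the c.d.f. $F(x) = 1 - e^{-\lambda x}$ has a closed-form inverse.

    import numpy as np

    def sample_exponential(n, lam=1.0, seed=0):
        # Inverse transform sampling for an Exponential(lam) target.
        rng = np.random.default_rng(seed)
        u = rng.uniform(0.0, 1.0, size=n)   # step i: draw u ~ Uniform([0, 1])
        x = -np.log(1.0 - u) / lam          # step ii: x = F^{-1}(u)
        return x                            # step iii: n iid draws from p(x)

    samples = sample_exponential(1000, lam=2.0)
    print(samples.mean())   # should be close to 1/lam = 0.5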

3) Sampling Methods

  • Sampling methods other than MCMC: Uniform Sampling, Proposal Sampling, Importance Sampling, etc.

  • Sampling methods based on MCMC: the Metropolis-Hastings Algorithm, Gibbs Sampling, Monte Carlo EM, etc.

2. A Glimpse of MCMC

1) What kind of Markov Chain is in MCMC?

  • A first-order Markov chain assumes that the next state depends only on the current state; given the present, the future is conditionally independent of the past. Each state of a Markov chain has its own distribution, but for a chain that converges to a stationary distribution (an irreducible and aperiodic chain), the distribution becomes invariant in the long run. Hence the problem becomes simple: find such a Markov chain and start sampling after it reaches stationarity (a small numerical illustration of this convergence follows below). Question: how do we find such a Markov chain?
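  • To make the "invariant in the long run" claim concrete, here is a small numerical sketch; the 3-state transition matrix is a made-up example, not from this note. Iterating the chain from two different starting distributions converges to the same stationary distribution.

    import numpy as np

    # A hypothetical irreducible, aperiodic 3-state transition matrix (rows sum to 1).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])

    dist = np.array([1.0, 0.0, 0.0])   # start entirely in state 0
    for _ in range(50):                # iterate the chain: dist_{t+1} = dist_t @ P
        dist = dist @ P
    print(dist)                        # the stationary distribution

    # A different starting distribution gives the same limit.
    print(np.array([0.0, 0.0, 1.0]) @ np.linalg.matrix_power(P, 50))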

2) How to use Monte Carlo here?

  • Monte Carlo is essentially repeated random iteration. For example, if we watch a person doing a random walk, the places with more footprints are the places he or she is more likely to visit. Similarly here: no matter where we start, after many steps (iterations) we obtain a collection of samples whose empirical distribution approximates the density of our variable.

3) Combine Markov Chain and Monte Carlo

  • As stated above, a stationary Markov chain has the nice property that after many iterations its distribution stays the same. So we first draw an initial sample from the given data, feed it to the Markov chain, and then iterate the chain many, many times; the samples collected after convergence follow the target distribution.

3. Inside MCMC

1) How to find stationary Markov Chain?

  • To apply MCMC, the first thing is to find a suitable transition matrix. The chain it defines should be both irreducible (every state can be reached from every other state, so the chain does not split into separate closed pieces) and aperiodic (the chain does not cycle deterministically among a fixed set of states).

2) Intuition of MCMC

  • The key of MCMC is to find the transition kernel of the Markov chain, $p(x_{i+1}|x_i)$, where $x_i$ is the previous sample and we condition on it to draw the next sample $x_{i+1}$. In math, $p(x_{i+1})=\sum_{x_i} p(x_i)\,p(x_{i+1}|x_i)$ in the discrete case, or $p(x_{i+1})=\int p(x_i)\,p(x_{i+1}|x_i)\,dx_i$ in the continuous case. Once we have the conditional density $p(x_{i+1}|x_i)$ it is easy to sample the next variable $x_{i+1}$. But the only thing we know is the target density $p(x)$, so how do we find a conditional density $p(x_{i+1}|x_{i})$ that leaves it invariant? The next section gives two ways; both are built so that the detailed balance condition sketched below holds.
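  • Detailed balance (a standard supporting step, not spelled out elsewhere in this note): a transition kernel $p(x_{i+1}|x_i)$ leaves $p$ invariant whenever $p(x_i)\,p(x_{i+1}|x_i) = p(x_{i+1})\,p(x_i|x_{i+1})$, because summing both sides over $x_i$ gives $\sum_{x_i} p(x_i)\,p(x_{i+1}|x_i) = p(x_{i+1})\sum_{x_i} p(x_i|x_{i+1}) = p(x_{i+1})$. Both algorithms below construct their kernels so that this condition holds.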

4. MCMC Algorithms

1) Metropolis-Hastings Algorithm

  • Intuition: Since we don't know $p(x_{i+1}|x_{i})$, we construct a known density $q(x_{i+1}|x_{i})$ and use $q$ to simulate draws from $p$. The idea is to sample a candidate from $q$ and then decide whether to accept or reject it. Workflow: given the current sample $x_{i}$, we draw a candidate $x_{i+1}$ from the known density $q(x_{i+1}|x_{i})$, and we feed the pair into a decision function $A(x_{i},x_{i+1})$, accepting the candidate with probability $A$. If it is accepted, we keep $x_{i+1}$ and repeat the procedure; if it is rejected, we stay at $x_i$ and repeat the procedure from there. So how do we define the decision function?

  • Math: Let $A(x_{i},x_{i+1})=\min\left[1,\frac{p(x_{i+1})q(x_{i}|x_{i+1})}{p(x_i)q(x_{i+1}|x_i)}\right]$. Then, for $x_{i+1}\neq x_{i}$, the resulting transition kernel is $p(x_{i+1}|x_{i}) = q(x_{i+1}|x_{i})\,A(x_{i},x_{i+1})$, and one can check that it satisfies detailed balance $p(x_i)\,p(x_{i+1}|x_i) = p(x_{i+1})\,p(x_i|x_{i+1})$, so $p$ is its stationary distribution.

  • Interpretation: Let's have a closer look at $A$. When the ratio inside the min is greater than $1$, $A$ equals $1$. That is, if $p(x_{i+1})q(x_{i}|x_{i+1}) > p(x_i)q(x_{i+1}|x_i)$, we have $A = 1$ and the transition reduces to $p(x_{i+1}|x_{i}) = q(x_{i+1}|x_{i})$: we accept $x_{i+1}$ for certain and generate a new sample $x_{i+1}$. In the other case, if $p(x_{i+1})q(x_{i}|x_{i+1}) < p(x_i)q(x_{i+1}|x_i)$, then $A$ is a value between $0$ and $1$ that is proportional to $\frac{p(x_{i+1})}{p(x_i)}$ (for a fixed proposal ratio). This means the larger the density $p(x_{i+1})$ of the proposed sample, the larger the ratio, the larger $A$, and hence the higher the probability that we accept the new sample. As for how to choose the proposal density $q$, a Gaussian centered at the current state is the usual choice.

    • Check MCMC: How do we know the MH algorithm is an MCMC algorithm? Check that the resulting transition kernel $p(x_{i+1}|x_i)$ defines a chain that is both irreducible and aperiodic.

    • Pseudo-Code:

    1. Randomly generate an initial state x_0 and set x_old = x_0
    2. for i in range(N):
           generate u from Uniform([0, 1])
           generate x_new from q(x_new | x_old)
           if u < A(x_old, x_new):
               # x_new accepted
               x_(i+1) = x_new and x_old = x_new
           else:
               # x_new rejected, stay where we are
               x_(i+1) = x_old
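
    • A minimal runnable Python sketch of the pseudocode above, as random-walk Metropolis-Hastings. The standard-normal target, the proposal width step, and the function name metropolis_hastings are illustrative assumptions, not from this note; because the Gaussian proposal is symmetric, the $q$ terms in $A$ cancel.

    import numpy as np

    def metropolis_hastings(n_samples=5000, step=1.0, seed=0):
        rng = np.random.default_rng(seed)
        log_p = lambda x: -0.5 * x * x           # log of the unnormalized target density
        x_old = 0.0                              # 1. initial state x_0
        samples = []
        for _ in range(n_samples):               # 2. iterate the chain
            x_new = x_old + step * rng.normal()  # propose from q(x_new | x_old)
            # acceptance A(x_old, x_new); symmetric q, so the q terms cancel
            accept_prob = min(1.0, np.exp(log_p(x_new) - log_p(x_old)))
            if rng.uniform() < accept_prob:      # accept
                x_old = x_new
            samples.append(x_old)                # on rejection we stay at x_old
        return np.array(samples)

    chain = metropolis_hastings()
    print(chain.mean(), chain.std())   # roughly 0 and 1 after enough iterations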

2) Gibbs Sampling

  • Intuition: Gibbs sampling is a special case of the Metropolis-Hastings algorithm. As above, we want to use a known density $q$ to simulate $p(x_{i+1}|x_i)$ and decide whether to accept or reject. In Gibbs sampling, however, we choose a special $q$ for which $A$ always equals $1$, so every proposed sample is accepted.
  • Math: Choose the proposal to be a full conditional of the target: $q(x_{i+1}|x_{i}) = p(x_{i+1,j}|x_{-j})$, where only the $j^{th}$ coordinate is updated and all other coordinates are kept fixed, i.e. $x_{i+1,-j}=x_{i,-j}$; here $x_{-j}$ denotes all coordinates except the $j^{th}$. Plugging this proposal into the Metropolis-Hastings acceptance function gives $A=1$.

  • Interpretation: The negative subscript means "all coordinates except" the indicated one. Each sample $x_i$ is a $d$-dimensional vector $x_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$, and in one sweep we update its coordinates one at a time. When updating the first coordinate $x_{i,1}$, the conditioning information is the previous sample with its first coordinate dropped: $(x_{i-1,2}, x_{i-1,3}, x_{i-1,4}, \dots, x_{i-1,d})$. When updating a middle coordinate $x_{i,j}$, the conditioning information is the coordinates already updated in the current sweep plus the not-yet-updated coordinates of the previous sample: $(x_{i,1}, x_{i,2}, \dots, x_{i,j-1}, x_{i-1,j+1}, x_{i-1,j+2}, \dots, x_{i-1,d})$. If we build our proposal density from these full conditionals, the decision function always equals $1$.

    • Note: We update $x_{i,j}$ immediately once it is calculated and plug it into the calculation of the next coordinate (see the sketch below).
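
  • Example: a minimal Gibbs sampling sketch for a bivariate standard normal with correlation $\rho$; the target and the function name gibbs_bivariate_normal are my own illustrative choices, picked because both full conditionals are Gaussian in closed form: $x_1|x_2 \sim N(\rho x_2, 1-\rho^2)$ and $x_2|x_1 \sim N(\rho x_1, 1-\rho^2)$.

    import numpy as np

    def gibbs_bivariate_normal(n_samples=5000, rho=0.8, seed=0):
        rng = np.random.default_rng(seed)
        x1, x2 = 0.0, 0.0                   # arbitrary starting point
        sd = np.sqrt(1.0 - rho * rho)
        samples = []
        for _ in range(n_samples):
            # update each coordinate from its full conditional, immediately
            # reusing the freshly updated value of the other coordinate
            x1 = rng.normal(rho * x2, sd)
            x2 = rng.normal(rho * x1, sd)
            samples.append((x1, x2))
        return np.array(samples)

    chain = gibbs_bivariate_normal()
    print(np.corrcoef(chain.T)[0, 1])   # should be close to rho = 0.8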