ML - Intro to Models

Basic Machine Learning Models.


Markov Chains

1. Markov Chains

  • Properties

    • Irreducible: every state can be reached from every other state, i.e. for any states $s,k$ there is some $n$ with $Pr(x_{i+n}=s|x_i=k)>0$.

    • Aperiodic: the chain has no fixed cycle length; a sufficient condition is a transition matrix with a non-zero diagonal, $Pr(x_i=s|x_{i-1}=s)>0$.

    • Stationary (time-homogeneous): the transition probability $Pr(x_{n+1}=t|x_n=s)$ does not depend on $n$.

    • Conditionally independent: $Pr(x,y|z)=Pr(x|z)Pr(y|z)$.
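
  A minimal numpy sketch of the properties above (the 3-state transition matrix is made up for illustration); it also computes the chain's stationary distribution $\pi$ as the left eigenvector of the transition matrix with eigenvalue 1:

import numpy as np

# Hypothetical 3-state transition matrix P; rows sum to 1.
# Every entry is positive, so the chain is irreducible, and the diagonal
# is non-zero, so it is aperiodic.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# Stationary distribution pi solves pi P = pi: take the left eigenvector
# of P for eigenvalue 1 and normalize it to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.isclose(vals, 1)][:, 0])
pi = pi / pi.sum()
print(pi, pi @ P)   # the two printed vectors should match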

2. Hidden Markov Model

  • Structure

    Hidden Variable -(Transition Matrix)-> Hidden Variable -…-> Hidden Variable

    Emission probability: $p(y|x)$, the probability of observing $y$ given hidden state $x$.

  • Filtering Problem: estimate the current hidden state given the observations seen so far.

    With the transition and emission probabilities known, Bayes' rule gives the filtering recursion:

    $$Pr(x_n=k|y_{1:n}) = \frac {Pr(y_n|x_n=k)\,Pr(x_n=k|y_{1:n-1})}{ \sum_{k'} Pr(y_n|x_n=k')\,Pr(x_n=k'|y_{1:n-1})}$$

    where $Pr(x_n=k|y_{1:n-1}) = \sum_j Pr(x_n=k|x_{n-1}=j)\,Pr(x_{n-1}=j|y_{1:n-1})$ is the one-step prediction from the transition matrix.
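
  A minimal numpy sketch of this recursion (the transition matrix A, emission matrix B, and observation sequence below are made up for illustration):

import numpy as np

def filter_step(belief, obs, A, B):
    """One forward-filtering update.
    belief[j] = Pr(x_{n-1}=j | y_{1:n-1})
    A[j, k]   = Pr(x_n=k | x_{n-1}=j)   (transition matrix)
    B[k, y]   = Pr(y_n=y | x_n=k)       (emission matrix)
    Returns Pr(x_n=k | y_{1:n}) for every hidden state k."""
    predict = belief @ A                 # one-step prediction Pr(x_n=k | y_{1:n-1})
    posterior = B[:, obs] * predict      # multiply by the emission probability of obs
    return posterior / posterior.sum()   # normalize (the denominator in the formula)

# Hypothetical 2-state, 2-symbol HMM and a made-up observation sequence.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
belief = np.array([0.5, 0.5])
for y in [0, 0, 1]:
    belief = filter_step(belief, y, A, B)
print(belief)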

Clustering

3. K-means

  • Algorithm

    • Pick $K$ and randomly initialize $K$ distinct centroids ($c_j,\ j=1..K$).

    • Repeat until convergence: compute the distance between each point ($x_i,\ i=1..n$) and each centroid $c_j$, and assign each point to the cluster $j$ of its closest centroid. After all points have been assigned, recompute each cluster's centroid as the mean of its points. (A runnable sketch follows at the end of this item.)

  • Pseudo-Code:
# Use Euclidean distance for the distance measure
Input: K, X = [x_1, x_2, ..., x_n]
1. Pick random locations for the centroids c = [c_1, c_2, ..., c_K]
2. While any c_j changed between iterations:
       for i in [1 ... n]:
           nearest_j = argmin_j Distance(x_i, c_j)
           assign x_i to cluster nearest_j
       for j in [1 ... K]:
           c_j = mean(x_i assigned to cluster j)
  • Feature

    • Can get stuck in a local minimum.

    • Different runs may give different results, so initialization is important.
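
  A runnable numpy version of the pseudocode above (a sketch: centroids are initialized by sampling K data points, and an empty cluster keeps its old centroid):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centroids as K distinct points sampled from the data.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(K)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.default_rng(1).normal(size=(300, 2)), K=3)  # made-up data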

4. K-medoids

  • Algorithm

    • Randomly assign each point to a cluster from $1$ to $K$.

    • Repeat until convergence: for each cluster $j$, choose as its medoid the member $x_i$ that minimizes the total distance to the other members, $z_j = \text{argmin}_{x_i} \sum_{i^*} D(x_i, x_{i^*})$. Then reassign every point to the cluster $j$ of its closest medoid according to $D(x_i, z_j)$. (A runnable sketch follows at the end of this item.)

  • Pseudo-Code:

Input: K, X = [x_1, x_2, ..., x_n]
1. Assign each point in X to a random cluster in [1 ... K]
2. While any c_j changed between iterations:
       for j in [1 ... K]:
           X_j = [x_i assigned to cluster j]
           c_j = argmin over x_i in X_j of sum of Distance(x_i, x_i*) for x_i* in X_j
       for i in [1 ... n]:
           nearest_j = argmin_j Distance(x_i, c_j)
           assign x_i to cluster nearest_j
  • Feature

    • Same as K-means, except that the cluster center is required to be one of the observations (the medoid).
    • Can handle categorical variables (or any distance measure), since it only needs pairwise distances.
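
  A numpy sketch of this procedure (made-up data; no handling of clusters that become empty):

import numpy as np

def kmedoids(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))       # random initial cluster for each point
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    for _ in range(n_iter):
        # Medoid of cluster j = the member minimizing total distance to the other members.
        medoids = np.array([np.flatnonzero(labels == j)[
            D[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()] for j in range(K)])
        new_labels = D[:, medoids].argmin(axis=1)  # reassign each point to its nearest medoid
        if np.array_equal(new_labels, labels):     # converged
            break
        labels = new_labels
    return labels, medoids

labels, medoids = kmedoids(np.random.default_rng(1).normal(size=(200, 2)), K=3)  # made-up data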

Classification

5. K-NN

  • Algorithm

    • Find the $K$ nearest neighbors of a new point and assign it the majority class among those $K$ neighbors. (A runnable sketch follows at the end of this item.)
  • Pseudo-Code:

Input: K, (X_train, Y_train) of size n, X_predict
k_nn = sort(Distance(X_predict, X_train), "ascending")[1:K]
Y_predict = majority_class(Y_train[k_nn])
  • Feature

    • Can fail in high dimensions, because it becomes difficult to gather K observations close to a target point.
    • Larger $K$ gives a smoother decision boundary: variance $\downarrow$, bias $\uparrow$ (bias-variance tradeoff).
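
  A numpy sketch of the pseudocode above (Euclidean distance; ties broken arbitrarily):

import numpy as np

def knn_predict(X_train, y_train, x_new, K=5):
    """Classify x_new by majority vote among its K nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)      # distance to every training point
    k_nn = np.argsort(dists)[:K]                         # indices of the K closest points
    labels, counts = np.unique(y_train[k_nn], return_counts=True)
    return labels[counts.argmax()]                       # majority class among the neighbors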

6. Bayesian Classifier

  • Algorithm

    • We want a function of $x$ that predicts $y$ as reliably as possible, i.e. $y=f(x) = \text{argmax}_y\, P(y|x)$.
    • By Bayes' rule, $P(y|x) \propto P(y)P(x|y)$; the denominator $P(x)$ does not depend on $y$ and can be ignored.
    • Direct way: use the class proportions $P(y)=\frac{N_y}{N_{total}}$ from the historical data as the prior, then estimate $P(x|y)$ by splitting the data by class and fitting each class-conditional distribution by maximum likelihood. (A sketch follows at the end of this item.)
  • Feature

    • Logistic regression and LDA are also Bayesian classifiers; here we only list the most direct way to apply one, i.e., using class proportions as the prior.
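
  A minimal sketch of the direct approach for categorical features, additionally assuming (naive-Bayes style) that the features are conditionally independent given the class; the data layout and function name are made up:

import numpy as np

def direct_bayes_predict(X_train, y_train, x_new):
    """argmax_y P(y) P(x|y), with P(y) from class proportions and
    P(x_j = v | y) from per-class feature frequencies."""
    classes = np.unique(y_train)
    scores = {}
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)            # P(y = c) from class proportions
        likelihood = 1.0
        for j, v in enumerate(x_new):             # P(x_j = v | y = c), features independent given y
            likelihood *= (Xc[:, j] == v).mean()
        scores[c] = prior * likelihood
    return max(scores, key=scores.get)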

7. Logit

  • Algorithm

    • Use maximum likelihood to find the $\beta_i$:

$$ \log [\frac{P(Y=1|X)}{P(Y=0|X)}]=\beta_0+\beta_1x_1+\beta_2x_2+…+\beta_px_p $$

$$\prod_{i=1}^{n} P(Y=y_i|X=x_i)=\prod_{y_i=1} P(Y=1|X=x_i)\prod_{y_i=0} P(Y=0|X=x_i)$$

$$=\prod_{y_i=1}\frac{e^{\beta_0+\beta_1x_{i1}+..+\beta_px_{ip}}}{1+e^{\beta_0+\beta_1x_{i1}+..+\beta_px_{ip}}}\prod_{y_i=0}\frac{1}{1+e^{\beta_0+\beta_1x_{i1}+..+\beta_px_{ip}}}$$

  • Feature

    • Assumes $Y$ is binary, so it is difficult to extend to more than 2 categories; with more than 2 categories, use one category as a baseline.

    • The coefficients become unstable when there is collinearity, which also affects the convergence of the fitting algorithm.

    • When the classes are well separated, the coefficients become unstable (the estimates tend to infinity). This is always the case when $p \ge n - 1$.
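
  A short scikit-learn sketch of this maximum-likelihood fit (made-up data; `penalty=None` requests the unpenalized fit in recent scikit-learn versions, older versions use `penalty="none"`):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Made-up data: two predictors and a binary response generated from a logit model.
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1])))
y = (rng.random(500) < p).astype(int)

clf = LogisticRegression(penalty=None).fit(X, y)   # plain maximum-likelihood fit
print(clf.intercept_, clf.coef_)                   # estimates of beta_0 and (beta_1, beta_2)
print(clf.predict_proba(X[:2]))                    # columns are P(Y=0|x) and P(Y=1|x)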

8. Ridge & Lasso Regression

(Figure: Ridge & Lasso)

$$\min_{ \beta } \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j})^2 + \lambda \left( \frac{1-\alpha}{2}\sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \right)$$

  • Algorithm

    • When $\alpha = 0$ it is Ridge; when $\alpha = 1$ it is Lasso; values in between give the elastic net. (A scikit-learn sketch follows at the end of this item.)
  • Feature

    • Scale the data before training an elastic-net model, since the penalty is not scale-invariant.

    • The parameter $\lambda$ is a tuning parameter. It modulates the importance of fit vs. shrinkage.

    • The Lasso shrinks some of the coefficients all the way to zero while Ridge cannot.
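
  A short scikit-learn sketch of the two special cases above (made-up data; note that scikit-learn's `alpha` argument plays the role of $\lambda$ in the formula, not of $\alpha$):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
# Made-up data: only 2 of the 6 predictors actually matter.
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200)
Xs = StandardScaler().fit_transform(X)        # scale the features before fitting

print(Ridge(alpha=1.0).fit(Xs, y).coef_)      # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(Xs, y).coef_)      # several coefficients driven exactly to zero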

9. LDA-Linear Discriminant Analysis

(Figure: LDA)

  • Algorithm

    • Still applies the Bayesian idea $P(y|x) \propto P(y)P(x|y)$, so we need a model for $P(x|y)$. Split the data by class and model each class-conditional density $P(x|y=k)$ as a multivariate Normal distribution, with all $k$ classes sharing the same covariance (left panel of the LDA figure). The parameters of these densities, $\mu_k, \Sigma, \pi_k$, are fitted by maximum likelihood.

    • $\mu_k$ is the center (mean) of class $k$; $\Sigma$ is the common covariance matrix, estimated from the deviation vectors $(x_1-\mu_{y_1}),(x_2-\mu_{y_2}),…,(x_n-\mu_{y_n})$; $\pi_k$ is the fraction of training samples in class $k$. (A fitting sketch follows at the end of this item.)

  • Feature

    • LDA assumes a common covariance for all categories ($\Sigma$ is the same for all $k$ classes).

    • LDA has a linear decision boundary (right panel of the LDA figure).
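
  A scikit-learn sketch showing the fitted parameters (made-up two-class Gaussian data with a shared covariance):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Made-up data: two Gaussian classes sharing the same (identity) covariance.
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.means_)        # estimated class centers mu_k
print(lda.priors_)       # class fractions pi_k
print(lda.covariance_)   # pooled covariance Sigma shared by both classes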

10. QDA-Quadratic Discriminant Analysis

(Figure: QDA)

  • Algorithm

    • Similar to LDA, except we no longer assume a shared covariance: each category has its own covariance matrix $\Sigma_k$. (A sketch follows at the end of this item.)
  • Feature

    • The decision boundary is quadratic.
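
  A scikit-learn sketch (made-up data in which the two classes have clearly different covariances, which is the situation QDA is built for):

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Made-up data: a tight class 0 and a spread-out class 1.
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 2.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(qda.covariance_)   # one covariance matrix Sigma_k per class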

11. SVM

(Figures: SVM1, SVM2, SVM3, SVM4, SVM5)

  • Algorithm

    • Consider the convex hull of each class (the smallest convex set containing its points); every point $x$ in a convex set can be written as a convex combination of its extreme points. The distance from a point to a set is $d(x,A)=\min_{y \in A}\| x-y \|$. Similarly, we define the margin between two sets as the minimum distance between their convex hulls.

    • Primal problem: $\min \| v_H \|$, subject to $y_i(\left< v_H, x_i\right>-c)\ge 1$, where $y_i \in \{-1,+1\}$ indicates the class. Since the distance from $x_i$ to the hyperplane is $\frac{|\left< v_H, x_i\right>-c|}{\| v_H \|}$ and the constraint fixes the numerator to be at least 1, minimizing $\| v_H \|$ maximizes the margin between the hyperplane and the convex hulls (figure SVM1).

    • Solving this optimization problem directly is difficult, so we use convex duality: finding the hyperplane that maximizes the margin is equivalent to finding the hyperplane orthogonal to the shortest line segment between the two convex hulls (figure SVM2).

    • Dual problem: starting from $\min \| u-v \|^2$ over the two convex hulls, duality gives the equivalent problem $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \left< x_i, x_j \right>$, subject to $\sum_i y_i\alpha_i = 0$ and $\alpha_i \ge 0$.

    • Soft margin and slack variables: permit training points to cross the margin, but impose a cost that grows the further beyond the margin they are. If the cost parameter $\gamma$ is very small, many points may end up beyond the margin boundary; if it is large, we recover the original hard-margin SVM (figure SVM3).

    • Up to this point the SVM is still a linear classifier. We use the kernel trick to replace $\left< x_i, x_j \right>$ with $k(x_i, x_j)=\exp(-\frac{\| x_i-x_j \|^2}{2\sigma^2})$, called an RBF kernel, which makes the decision boundary nonlinear in the original space (figures SVM4 & SVM5). (A scikit-learn sketch follows at the end of this item.)

  • Feature

    • Robustness against outliers

    • Nonlinear decision boundary
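
  A scikit-learn sketch of a soft-margin RBF-kernel SVM (made-up ring-shaped data; note that scikit-learn's `C` plays the role of the cost parameter called $\gamma$ in these notes, while its `gamma` is the RBF kernel width $\frac{1}{2\sigma^2}$):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up data that no linear classifier can separate: label = outside the unit circle.
X = rng.normal(size=(300, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)   # soft margin + RBF kernel
print(clf.score(X, y), len(clf.support_))             # training accuracy, number of support vectors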

Decision Tree

Random Forest

AdaBoost

EM