Progress (4/25): Learned the EM algorithm

@(進捗ブログ) I hadn't learned it until now. Bayesian Methods for Machine Learning is really interesting. It isn't covering anything especially advanced, but I love topics that involve probability theory.


  • Coursera

    Latent Variable Model

The problem with the bottom equation in the image above is that computing the normalization constant Z is too complicated. By introducing latent variables, the model can be simplified.

Then, the following equation holds.

Pros:

  • Simpler models
  • Fewer parameters
  • Latent variables are sometimes meaningful

Cons:

  • Harder to work with

Probabilistic clustering

Consider clustering in a probabilistic way. It enables us to tune hyperparameters, and it gives us a generative model of the data.

GMM(Gaussian Mixture Model)

Using multiple Gaussian distributions, our model can fit more complicated datasets.

However, training a GMM is a little difficult. For MAP estimation we can derive the following equation. In the image above, we need the constraint $$\sum^{3}_{c=1}\pi_c=1$$ because the mixture weights are probabilities and must sum to 1. Also, the covariance matrices can't be arbitrary: if a covariance matrix is all zeros, we divide by zero when evaluating the Gaussian density, so the matrices must stay positive definite.

Due to these constraints, we can't easily apply SGD to this model.

  • Gaussian Mixture Model is a flexible probability distribution.
  • It is hard to fit with SGD.
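As a sketch of what such a mixture density looks like in code, here is a minimal 1-D GMM evaluated with NumPy. All weights, means, and scales are made-up illustration values, not parameters from the course:

```python
import numpy as np

# Hypothetical 1-D GMM with 3 components (illustration values only).
pi = np.array([0.5, 0.3, 0.2])     # mixture weights: must sum to 1
mu = np.array([-2.0, 0.0, 3.0])    # component means
sigma = np.array([0.5, 1.0, 0.8])  # component scales: must stay positive

def gmm_pdf(x, pi, mu, sigma):
    """sum_c pi_c * N(x | mu_c, sigma_c^2), evaluated pointwise."""
    x = np.asarray(x)[:, None]     # broadcast points against the 3 components
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ pi

# The constraints from the text: weights form a distribution, scales are valid.
assert np.isclose(pi.sum(), 1.0) and np.all(sigma > 0)

# Riemann sum over a wide grid: the mixture integrates to ~1, a valid density.
xs = np.linspace(-12.0, 12.0, 24001)
integral = gmm_pdf(xs, pi, mu, sigma).sum() * (xs[1] - xs[0])
print(integral)
```

The constraints are exactly why plain SGD is awkward here: an unconstrained gradient step can push the weights off the simplex or make a scale non-positive.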

EM Algorithm Derivation

The EM algorithm:

  • Method for training latent variable models
  • Handles missing data
  • Sequence of simple tasks instead of one hard one
  • Guaranteed to converge
  • Helps with complicated parameter constraints
  • Numerous extensions:
    • Variational E-step: restrict the set of possible q
    • Sampling on the M-step

Cons:

  • Only local maximum
  • Requires math

For a mixture of 3 components, the likelihood of each point decomposes over the latent variable: $$p(x_i|\theta)=\sum^{3}_{c=1}p(x_i|t_i=c,\theta)p(t_i=c|\theta)$$

$$\log{p(X|\theta)}=\sum^{N}_{i=1}\log{p(x_i|\theta)}$$ $$=\sum^{N}_{i=1}\log{\sum^{3}_{c=1}\frac{q(t_i=c)}{q(t_i=c)}p(x_i,t_i=c|\theta)}$$ $$\geq \sum^{N}_{i=1}\sum^{3}_{c=1}q(t_i=c)\log\frac{p(x_i,t_i=c|\theta)}{q(t_i=c)}\;(\text{Jensen's inequality})$$
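This Jensen lower bound can be sanity-checked numerically. Below, the joint probabilities p(x_i, t_i=c | θ) are just a random made-up table, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint probabilities p(x_i, t_i=c | theta): N=5 points, 3 components.
# Each row sums to p(x_i | theta), a value in (0, 1].
joint = rng.dirichlet(np.ones(3), size=5) * rng.uniform(0.1, 1.0, size=(5, 1))

# Marginal log-likelihood: sum_i log p(x_i | theta).
log_lik = np.log(joint.sum(axis=1)).sum()

# Any distribution q(t_i) over the 3 components gives a lower bound (Jensen).
q = rng.dirichlet(np.ones(3), size=5)
lower_bound = (q * np.log(joint / q)).sum()

print(lower_bound, log_lik)
```

Whatever q is drawn, `lower_bound` never exceeds `log_lik`; equality would require q to be the exact posterior.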

E Step

The E step minimizes the gap between $\log p(X|\theta)$ and $L(\theta,q)$: $$\log p(X|\theta)-L(\theta,q)=\sum^{N}_{i=1}\log p(x_i|\theta)-\sum^{N}_{i=1}\sum^{3}_{c=1}q(t_i=c)\log\frac{p(x_i,t_i=c|\theta)}{q(t_i=c)}$$

$$=\sum^{N}_{i=1}\sum^{3}_{c=1}q(t_i=c)\left(\log p(x_i|\theta)-\log \frac{p(x_i,t_i=c|\theta)}{q(t_i=c)}\right)$$


The summand is exactly a KL divergence: since $p(x_i,t_i=c|\theta)=p(t_i=c|x_i,\theta)\,p(x_i|\theta)$, the term in parentheses equals $\log\frac{q(t_i=c)}{p(t_i=c|x_i,\theta)}$. Then the following holds.

$$ \log p(X|\theta)-L(\theta,q)=\sum^{N}_{i=1}KL(q(t_i) || p(t_i|x_i,\theta))\geq 0 $$
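This identity (the gap equals a sum of KL divergences, hence is nonnegative) can also be verified numerically on a toy table of joint probabilities (random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up joint probabilities p(x_i, t_i=c | theta): N=4 points, 3 components.
joint = rng.dirichlet(np.ones(3), size=4) * rng.uniform(0.1, 1.0, size=(4, 1))

p_x = joint.sum(axis=1)        # marginals p(x_i | theta)
post = joint / p_x[:, None]    # posteriors p(t_i | x_i, theta)

# An arbitrary q, the Jensen lower bound L, and the resulting gap.
q = rng.dirichlet(np.ones(3), size=4)
L = (q * np.log(joint / q)).sum()
gap = np.log(p_x).sum() - L

# Sum over i of KL(q(t_i) || p(t_i | x_i, theta)).
kl = (q * np.log(q / post)).sum()

print(gap, kl)
```

The two printed numbers agree to floating-point precision, and both are nonnegative, which is why setting q to the posterior closes the gap completely.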

M step

To summarize: the E step sets $$q^{k+1}(t_i)=\arg \max_{q(t_i)} L(\theta^{k},q)=p(t_i|x_i,\theta^{k}),$$ and the M step then updates $$\theta^{k+1}=\arg \max_{\theta} L(\theta,q^{k+1}).$$
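Putting the two steps together, here is a minimal from-scratch EM loop for a 1-D, 3-component GMM (not the course's code; the synthetic data and initial values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 1-D data drawn from three Gaussians (made-up true parameters).
x = np.concatenate([rng.normal(-3, 0.7, 200),
                    rng.normal(0, 1.0, 300),
                    rng.normal(4, 0.5, 150)])

# Arbitrary initialization of the mixture parameters.
pi = np.full(3, 1 / 3)
mu = np.array([-1.0, 0.0, 1.0])
sigma = np.ones(3)

def log_lik(x, pi, mu, sigma):
    comp = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
              / (sigma * np.sqrt(2 * np.pi))
    return np.log(comp.sum(axis=1)).sum()

prev = -np.inf
for _ in range(50):
    # E step: q(t_i=c) = p(t_i=c | x_i, theta), the posterior responsibilities.
    comp = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
              / (sigma * np.sqrt(2 * np.pi))
    q = comp / comp.sum(axis=1, keepdims=True)
    # M step: closed-form updates that maximize L(theta, q).
    Nc = q.sum(axis=0)
    pi = Nc / len(x)
    mu = (q * x[:, None]).sum(axis=0) / Nc
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nc)
    cur = log_lik(x, pi, mu, sigma)
    assert cur >= prev - 1e-9  # EM never decreases the log-likelihood
    prev = cur

print(np.sort(mu))
```

Note the closed-form M step automatically respects the constraints that blocked SGD: the new weights sum to 1 and the scales stay positive by construction.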