Project of Introduction to Probabilistic Graphical Models course (Umut Şimşekli)
Author: Guillaume Fradet
The following cell enables the `cancel` $\LaTeX$ package, used for crossing out terms in the derivations below.
%%javascript
MathJax.Extension["TeX/cancel"]={version:"2.4.0",ALLOWED:{color:1,mathcolor:1,background:1,mathbackground:1,padding:1,thickness:1}};MathJax.Hub.Register.StartupHook("TeX Jax Ready",function(){var c=MathJax.InputJax.TeX,a=MathJax.ElementJax.mml,b=MathJax.Extension["TeX/cancel"];b.setAttributes=function(h,e){if(e!==""){e=e.replace(/ /g,"").split(/,/);for(var g=0,d=e.length;g<d;g++){var f=e[g].split(/[:=]/);if(b.ALLOWED[f[0]]){if(f[1]==="true"){f[1]=true}if(f[1]==="false"){f[1]=false}h[f[0]]=f[1]}}}return h};c.Definitions.Add({macros:{cancel:["Cancel",a.NOTATION.UPDIAGONALSTRIKE],bcancel:["Cancel",a.NOTATION.DOWNDIAGONALSTRIKE],xcancel:["Cancel",a.NOTATION.UPDIAGONALSTRIKE+" "+a.NOTATION.DOWNDIAGONALSTRIKE],cancelto:"CancelTo"}},null,true);c.Parse.Augment({Cancel:function(e,g){var d=this.GetBrackets(e,""),f=this.ParseArg(e);var h=b.setAttributes({notation:g},d);this.Push(a.menclose(f).With(h))},CancelTo:function(e,g){var i=this.ParseArg(e),d=this.GetBrackets(e,""),f=this.ParseArg(e);var h=b.setAttributes({notation:a.NOTATION.UPDIAGONALSTRIKE+" "+a.NOTATION.UPDIAGONALARROW},d);i=a.mpadded(i).With({depth:"-.1em",height:"+.1em",voffset:".1em"});this.Push(a.msup(a.menclose(f).With(h),i))}});MathJax.Hub.Startup.signal.Post("TeX cancel Ready")});MathJax.Ajax.loadComplete("[MathJax]/extensions/TeX/cancel.js");
Consider the following probabilistic non-negative matrix factorization (NMF) model (for $f = 1,\dots,F$, $n = 1,\dots,N$, $k = 1,\dots,K$):
$$ w_{fk} \sim \mathcal{G}(w_{fk}; \alpha_w, \beta_w), \qquad h_{kn} \sim \mathcal{G}(h_{kn}; \alpha_h, \beta_h), \qquad v_{fn} \sim \mathcal{PO}\Big(v_{fn}; \sum_{k=1}^K w_{fk} h_{kn}\Big), $$
where $\mathcal{G}$ and $\mathcal{PO}$ denote the gamma and the Poisson distributions, respectively.
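As a quick illustration, here is a minimal sketch of ancestral sampling from this generative model (the dimensions $F$, $N$, $K$, the hyperparameter values and the seed are arbitrary). Note that NumPy's gamma sampler uses a shape/scale parametrization, so we pass scale $=1/\beta$ to match the rate parametrization used above.

import numpy as np

rng = np.random.default_rng(0)
F, N, K = 6, 8, 3                       # arbitrary toy dimensions
alpha_w, beta_w, alpha_h, beta_h = 1.0, 1.0, 1.0, 1.0

W = rng.gamma(shape=alpha_w, scale=1/beta_w, size=(F, K))  # w_fk ~ G(alpha_w, beta_w)
H = rng.gamma(shape=alpha_h, scale=1/beta_h, size=(K, N))  # h_kn ~ G(alpha_h, beta_h)
V = rng.poisson(W @ H)                  # v_fn ~ PO(sum_k w_fk h_kn)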
We introduce a set of latent random variables that we define as follows:
$$ s_{fkn} \sim \mathcal{PO}(s_{fkn}; w_{fk} h_{kn}) $$

$$ v_{fn} = \sum_{k=1}^K s_{fkn} \;\Rightarrow\; v_{fn} \sim \mathcal{PO}\Big(v_{fn}; \sum_{k=1}^K w_{fk} h_{kn}\Big) $$
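The superposition property invoked above (a sum of independent Poisson variables is Poisson with the summed rate) is easy to verify by simulation; a quick Monte Carlo check with arbitrary toy rates:

import numpy as np

rng = np.random.default_rng(0)
rates = np.array([0.5, 1.2, 2.3])          # toy values of w_fk * h_kn, k = 1..K
s = rng.poisson(rates, size=(100_000, 3))  # s_fkn ~ PO(w_fk h_kn)
v = s.sum(axis=1)                          # v_fn = sum_k s_fkn
print(v.mean(), v.var(), rates.sum())      # mean ≈ variance ≈ 4.0, as for PO(4.0)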
We want to compute the EM functional:
$$\mathcal{L}_t(W,H) = \mathbb{E}_{p(S|V, W^{(t)}, H^{(t)})}\big[\log p(V,S,W,H)\big]$$
$ \begin{align*} p(V,S,W,H) & = p(V,S|W,H)p(W,H)\\ & = p(V|S)p(S|W,H)p(W)p(H) \end{align*} $
$ \begin{align} \boxed{\mathcal{L}_t(W,H) =^+ \mathbb{E}[\log p(S|W,H)] + \log p(W) + \log p(H)} \end{align} $
where $=^+$ denotes equality up to an additive constant: the dropped term $\mathbb{E}[\log p(V|S)]$ does not depend on $W$ or $H$.
First, we calculate $p(S|V,W,H)$.
$ \begin{align*} p(S|V,W,H) & = \frac{p(S,V|W,H)}{p(V|W,H)}\\ \end{align*} $
$ \begin{align} \log p(S|V,W,H) & = \log p(S,V|W,H) - \log p(V|W,H)\\ \end{align} $
$ \begin{align*} \log p(S,V|W,H) & = \sum_f\sum_n\left[\sum_k\left(-w_{fk}h_{kn}+s_{fkn}\log (w_{fk}h_{kn})-\log \Gamma(s_{fkn}+1)\right) + \log \delta(v_{fn}-\sum_k s_{fkn}) \right] \end{align*} $
$ \begin{align*} \log p(V|W,H) & = \log \sum_S p(V|S)p(S|W,H)\\ & = \log \prod_{f,n}\mathcal{PO}\left(v_{fn};\sum_k w_{fk} h_{kn} \right)\\ & = \sum_f\sum_n\left[v_{fn} \log \left(\sum_k w_{fk} h_{kn}\right) - \sum_k w_{fk} h_{kn} - \log \Gamma(v_{fn}+1) \right] \end{align*} $
From $(2)$, we now have:
$ \begin{align*} \log p(S|V,W,H) = & \sum_f\sum_n \left[\sum_k\left( \cancel{-w_{fk}h_{kn}} +s_{fkn} \log (w_{fk}h_{kn})-\log \Gamma(s_{fkn}+1)\right) \right. \\ &\left. + \log \delta(v_{fn}-\sum_k s_{fkn}) - v_{fn} \log \sum_k w_{fk}h_{kn} + \cancel{\sum_k w_{fk}h_{kn}} + \log \Gamma(v_{fn}+1) \right]\\ = & \sum_f\sum_n\left[\sum_k\left(s_{fkn} \log (w_{fk}h_{kn})-\log \Gamma(s_{fkn}+1) \right)+ \log \delta(v_{fn}-\sum_k s_{fkn})+ \log \Gamma(v_{fn}+1) - v_{fn} \log \sum_k w_{fk}h_{kn} \right]\\ = & \sum_f\sum_n\left[\sum_k\left(s_{fkn} \log (w_{fk}h_{kn})-\log \Gamma(s_{fkn}+1) \right)+ \log \delta(v_{fn}-\sum_k s_{fkn})+ \log \Gamma(v_{fn}+1) - \sum_k\left(s_{fkn}\log \sum_i w_{fi}h_{in} \right) \right]\\ = & \sum_f\sum_n\left[\sum_k\left(s_{fkn} \log (w_{fk}h_{kn}) -s_{fkn}\log \sum_i w_{fi}h_{in} -\log \Gamma(s_{fkn}+1) \right)+ \log \delta(v_{fn}-\sum_k s_{fkn})+ \log \Gamma(v_{fn}+1) \right]\\ = & \sum_f\sum_n\left[\sum_k\left(s_{fkn} \log \frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}} -\log \Gamma(s_{fkn}+1) \right)+ \log \delta(v_{fn}-\sum_k s_{fkn})+ \log \Gamma(v_{fn}+1) \right]\\ = & \sum_f\sum_n\left[\sum_k\left(s_{fkn} \log \frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}} -\log (s_{fkn}!) \right)+ \log \delta(v_{fn}-\sum_k s_{fkn})+ \log (v_{fn}!) \right]\\ = & \sum_f\sum_n\left[\sum_k \log \left( \Big(\frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}}\Big)^{s_{fkn}} \frac{1}{s_{fkn}!} \right)+ \log \left(\delta\Big(v_{fn}-\sum_k s_{fkn}\Big)v_{fn}!\right) \right]\\ = & \sum_f\sum_n\left[\log \left(\prod_k \bigg( \Big(\frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}}\Big)^{s_{fkn}} \frac{1}{s_{fkn}!} \bigg) \right)+ \log \left(\delta\Big(v_{fn}-\sum_k s_{fkn}\Big)v_{fn}!\right) \right]\\ = & \sum_f\sum_n\log \left(\frac{v_{fn}!}{\prod_k s_{fkn}!} \prod_k \bigg( \Big(\frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}}\Big)^{s_{fkn}} \bigg) \delta\Big(v_{fn}-\sum_k s_{fkn}\Big)\right)\\ = & \sum_f\sum_n\log \left(\mathcal{Mult}\left(s_{f1n},...,s_{fKn} | p_{f1n},...,p_{fKn}; v_{fn} \right) \right) \\ \end{align*} $
With $p_{fkn}= \large \frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}}$
$ \begin{align*} p(S|V,W,H) & = \exp \left(\sum_f\sum_n\log \left(\mathcal{Mult}\left(s_{f1n},...,s_{fKn} | p_{f1n},...,p_{fKn}; v_{fn} \right) \right) \right)\\ & = \exp \left(\log\Big( \prod_f\prod_n\mathcal{Mult}\left(s_{f1n},...,s_{fKn} | p_{f1n},...,p_{fKn}; v_{fn} \right) \Big) \right)\\ & = \prod_f\prod_n\mathcal{Mult}\left(s_{f1n},...,s_{fKn} | p_{f1n},...,p_{fKn}; v_{fn} \right) \end{align*} $
From $(1)$, we obtain the following expression for the EM functional:
$ \begin{align*} \mathcal{L}_t(W,H) & =^+ \mathbb{E}[\log p(S|W,H)] + \log p(W) +\log p(H)\\ & =^+ \sum_f\sum_n\sum_k\left( \mathbb{E}\big[s_{fkn}\big]\log (w_{fk}h_{kn})-w_{fk}h_{kn} \right)\\ & + \sum_f\sum_k\big((\alpha_w-1)\log w_{fk}-\beta_w w_{fk}\big)+ \sum_k\sum_n\big((\alpha_h-1)\log h_{kn}-\beta_h h_{kn}\big) \end{align*} $
Since the posterior of $(s_{f1n},\dots,s_{fKn})$ given $v_{fn}$ is multinomial, its mean gives: $ \begin{align*} \boxed{\mathbb{E}[s_{fkn}] = v_{fn}\frac{w_{fk}h_{kn}}{\sum_i w_{fi}h_{in}} } \end{align*} $
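Both boxed results can be checked numerically: conditioning independent Poisson draws on their sum should yield multinomial samples whose empirical mean matches $v_{fn}\,p_{fkn}$. A small rejection-sampling sketch with arbitrary toy rates:

import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 3.0])            # toy rates w_fk * h_kn
s = rng.poisson(lam, size=(200_000, 3))
v = 5
cond = s[s.sum(axis=1) == v]               # keep samples satisfying sum_k s_k = v
print(cond.mean(axis=0))                   # ≈ v * lam / lam.sum() = [0.833, 1.667, 2.5]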
Setting the derivative of $\mathcal{L}_t$ with respect to $w_{fk}$ to zero, we get:
$ \begin{align*} & \frac{\sum_n \mathbb{E}[s_{fkn}]^{(t)}}{w_{fk}^{(t+1)}}-\sum_n h_{kn}^{(t)}+\frac{\alpha_w-1}{w_{fk}^{(t+1)}}-\beta_w = 0\\[5pt] & \Rightarrow \boxed{w_{fk}^{(t+1)}=\frac{\alpha_w-1 +\sum_n \mathbb{E}[s_{fkn}]^{(t)}}{\beta_w + \sum_n h_{kn}^{(t)}}} \end{align*} $
Setting the derivative of $\mathcal{L}_t$ with respect to $h_{kn}$ to zero, we get:
$ \begin{align*} & \frac{\sum_f \mathbb{E}[s_{fkn}]^{(t)}}{h_{kn}^{(t+1)}}-\sum_f w_{fk}^{(t)}+\frac{\alpha_h-1}{h_{kn}^{(t+1)}}-\beta_h = 0\\[5pt] & \Rightarrow \boxed{h_{kn}^{(t+1)}=\frac{\alpha_h-1 +\sum_f \mathbb{E}[s_{fkn}]^{(t)}}{\beta_h + \sum_f w_{fk}^{(t)}}} \end{align*} $
To implement the EM algorithm efficiently, we combine the three boxed formulas (the E-step expectation and the two M-step updates) into two matrix-valued updates, which avoids explicit loops over $f$, $n$ and $k$ and reduces computation time.
The updates are:
$ \begin{align*} W^{(t+1)} & = \bigg( W^{(t)} * \Big( \big(V / W^{(t)}H^{(t)}\big)H^{(t)^T} \Big)+(\alpha_w-1)J_{F,K}\bigg)/ \Big(\beta_wJ_{F,K}+J_{F,N}H^{(t)^T} \Big)\\ H^{(t+1)} & = \bigg( H^{(t)} * \Big( W^{(t)^T}\big(V / W^{(t)}H^{(t)}\big) \Big)+(\alpha_h-1)J_{K,N}\bigg)/ \Big(\beta_hJ_{K,N}+W^{(t)^T}J_{F,N} \Big) \end{align*} $
where $*$ and $/$ denote element-wise multiplication and division, and $J_{F,N}$ denotes the all-ones matrix of shape $F\times N$.
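Before implementing the full algorithm, we can check on random toy matrices that the vectorized $W$-update agrees entry-wise with the boxed scalar formula (a sketch; dimensions and hyperparameters are arbitrary, and the $H$-update can be checked the same way):

import numpy as np

rng = np.random.default_rng(0)
F, N, K = 4, 5, 3
alpha_w, beta_w = 2.0, 1.0
V = rng.poisson(5.0, size=(F, N)).astype(float)
W = rng.gamma(2.0, size=(F, K))
H = rng.gamma(2.0, size=(K, N))

E_s = W[:, :, None] * H[None, :, :] * (V / (W @ H))[:, None, :]  # E[s_fkn]
W_scalar = (alpha_w - 1 + E_s.sum(axis=2)) / (beta_w + H.sum(axis=1))
W_vector = (W * ((V / (W @ H)) @ H.T) + (alpha_w - 1)) / (beta_w + np.ones((F, N)) @ H.T)
print(np.allclose(W_scalar, W_vector))     # True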
import numpy as np
import scipy.io as sio
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy import ndimage
# custom style for matplotlib
usetex = True
fontsize = 16
params = {'axes.labelsize': fontsize + 2,
          'font.size': fontsize + 2,
          'legend.fontsize': fontsize + 2,
          'xtick.labelsize': fontsize,
          'ytick.labelsize': fontsize,
          'text.usetex': usetex}
plt.style.use('seaborn')
plt.rcParams.update(params)
Load the dataset
vectors = np.array(sio.loadmat('attfaces.mat')['V'])
vectors.shape
Visualize some images
# reshape images
images = []
for i in range(vectors.shape[1]):
    images.append(np.reshape(vectors[:, i], (92, 112)))
# plot the 10 first faces
plt.figure(figsize=(15, 10))
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(ndimage.rotate(images[i], -90), cmap="gray")
    plt.axis('off')
EM algorithm
def nmf_em(V, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=50, compute_KL=False):
    F, N = V.shape
    div_KL_list = []
    # initialization with samples drawn from the Gamma priors
    W = np.random.gamma(shape=alpha_w, scale=(1 / beta_w), size=(F, K))
    H = np.random.gamma(shape=alpha_h, scale=(1 / beta_h), size=(K, N))
    for t in range(max_iter):
        # M-step matrix updates (the E-step expectation is folded in via V / WH)
        W = (W * ((V / W.dot(H)).dot(H.T)) + np.full((F, K), (alpha_w - 1))
             ) / (np.full((F, K), beta_w) + np.ones((F, N)).dot(H.T))
        H = (H * (W.T.dot(V / W.dot(H))) + np.full((K, N), (alpha_h - 1))
             ) / (np.full((K, N), beta_h) + W.T.dot(np.ones((F, N))))
        if compute_KL:
            div_KL_list.append(kl_divergence(V, W.dot(H)))
    return W, H, div_KL_list
def kl_divergence(P, Q):
    # generalized (I-divergence) KL: sum(P log(P/Q) - P + Q)
    eps = 1e-10
    # add a tiny epsilon to P and Q to avoid division by 0
    P = P + eps
    Q = Q + eps
    return -np.sum(P * np.log(Q / P) - Q + P)
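A quick sanity check with arbitrary toy matrices: the divergence vanishes when the two arguments coincide and is positive otherwise.

P = np.abs(np.random.randn(3, 3)) + 1.0
print(kl_divergence(P, P))                 # ≈ 0
print(kl_divergence(P, P + 0.5))           # > 0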
def plot_faces(vects, *args):
    plt.figure(figsize=(16, 8), facecolor='#ececec')
    plt.suptitle(r'$K={}\\[4pt]\alpha_w={},\alpha_h={}\\[4pt]\beta_w={},\beta_h={}$'.format(
        args[0], args[1], args[2], args[3], args[4]), fontsize=20)
    # plot the first 21 columns of the matrix as images
    for i in range(21):
        plt.subplot(3, 7, i + 1)
        plt.imshow(ndimage.rotate(np.reshape(vects[:, i], (92, 112)), -90),
                   cmap="gray", vmin=0, vmax=1)
        plt.axis('off')
    plt.subplots_adjust(top=0.8)
    plt.show()
Visualize the columns of $W$ as images
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 1
K = 25
W, H, div_KL = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h,
                      max_iter=300, compute_KL=True)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
Kullback-Leibler divergence
plt.figure(figsize=(12,7))
plt.plot(div_KL)
plt.title('Kullback-Leibler divergence', fontsize=20)
plt.xlabel('Iterations')
Comments:
We notice that 300 iterations are enough for the algorithm to stabilize, so we will use the same number of iterations for the subsequent runs.
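If we wanted to avoid hand-tuning the iteration budget, one option (a sketch, not part of the original algorithm; `tol` and `max_iter` are arbitrary choices) is to stop when the improvement in KL divergence becomes negligible:

def nmf_em_early_stop(V, K, alpha_w, alpha_h, beta_w, beta_h,
                      max_iter=500, tol=1e-5):
    # same updates as nmf_em above, with a relative-improvement stopping rule
    F, N = V.shape
    W = np.random.gamma(shape=alpha_w, scale=(1 / beta_w), size=(F, K))
    H = np.random.gamma(shape=alpha_h, scale=(1 / beta_h), size=(K, N))
    prev_kl = np.inf
    for t in range(max_iter):
        W = ((W * ((V / W.dot(H)).dot(H.T)) + (alpha_w - 1))
             / (beta_w + np.ones((F, N)).dot(H.T)))
        H = ((H * (W.T.dot(V / W.dot(H))) + (alpha_h - 1))
             / (beta_h + W.T.dot(np.ones((F, N)))))
        kl = kl_divergence(V, W.dot(H))
        if prev_kl - kl < tol * max(abs(kl), 1.0):  # negligible improvement
            break
        prev_kl = kl
    return W, H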
Run the algorithm on the face dataset. Set $K = 25, \alpha_w=\alpha_h = 1$. Try different values for $\beta_w$ and $\beta_h$. Visualize the columns of estimated $W$ matrices. What do you observe when you change the parameters?
alpha_w = 1
alpha_h = 1
beta_w = 0.1
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 1
beta_w = 10
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 0.1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 10
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
Run the algorithm with $K = 25, \beta_w = \beta_h = 1$. Try different values for $\alpha_w$ and $\alpha_h$ . Visualize the columns of estimated $W$ matrices. What do you observe when you change the parameters?
alpha_w = 0.1
alpha_h = 1
beta_w = 1
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 10
alpha_h = 1
beta_w = 1
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 0.1
beta_w = 1
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 10
beta_w = 1
beta_h = 1
K = 25
W, H, _ = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h, max_iter=300)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
Now try changing the number of components K. What do you observe?
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 1
K = 2
W, H, KL_K1 = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h,
                     max_iter=300, compute_KL=True)
plt.figure(figsize=(6, 4), facecolor='#ececec')
plt.suptitle(r'$K={}\\[4pt]\alpha_w={},\alpha_h={}\\[4pt]\beta_w={},\beta_h={}$'.format(
    K, alpha_w, alpha_h, beta_w, beta_h), fontsize=20)
for i in range(2):
    plt.subplot(1, 2, i + 1)
    plt.imshow(ndimage.rotate(np.reshape(W[:, i], (92, 112)), -90), cmap="gray")
    plt.axis('off')
plt.subplots_adjust(top=0.7)
plt.show()
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 1
K = 50
W, H, KL_K2 = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h,
                     max_iter=300, compute_KL=True)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
alpha_w = 1
alpha_h = 1
beta_w = 1
beta_h = 1
K = 200
W, H, KL_K3 = nmf_em(vectors, K, alpha_w, alpha_h, beta_w, beta_h,
                     max_iter=300, compute_KL=True)
plot_faces(W, K, alpha_w, alpha_h, beta_w, beta_h)
plt.figure(figsize=(12,7))
plt.plot(KL_K1, label='K=2')
plt.plot(KL_K2, label='K=50')
plt.plot(KL_K3, label='K=200')
plt.title('Kullback-Leibler divergence', fontsize=20)
plt.xlabel('Iterations')
plt.legend()
The choice of the Gamma hyperparameters $\alpha_w$, $\alpha_h$, $\beta_w$ and $\beta_h$ is essential for the factorization of $V$: badly chosen values can produce completely white or dark images in $W$, in which case the factorization fails. In practice we should try several values and keep those that give the best results.
As for the number of components $K$, the choice is largely arbitrary: it controls how strongly we compress the original matrix. Larger values of $K$ lower the final KL divergence (better reconstruction), as seen in the plot above, while smaller values force a more compact, coarser representation.
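To make this trade-off concrete, one could compare the final KL divergence across several (arbitrarily chosen) values of $K$, reusing nmf_em above:

final_kl = []
K_values = [2, 10, 25, 50, 100]
for K in K_values:
    _, _, div = nmf_em(vectors, K, 1, 1, 1, 1, max_iter=300, compute_KL=True)
    final_kl.append(div[-1])

plt.figure(figsize=(8, 5))
plt.plot(K_values, final_kl, marker='o')
plt.xlabel('$K$')
plt.ylabel('Final KL divergence')
plt.show()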