Dictionary Learning
CS 754
1
Problem statement
2
Index for image/image-patch = i
Problem statement
3
Problem statement
4
Dictionary learning algorithms
5
PCA (Principal Components Analysis)
6
Principal Components Analysis (PCA)
7
PCA
8
PCA: Algorithm
9
PCA: algorithm
4. Find the eigenvectors of C:
5. Extract the k eigenvectors corresponding to the k largest eigenvalues. This is called the extracted eigenspace:
10
There is an implicit assumption here that the first k indices indeed correspond to the k largest eigenvalues. If that is not true, you would need to pick the appropriate indices.
PCA: algorithm
6. Project each point onto the eigenspace, giving a vector of k eigen-coefficients for that point.
11
We are representing each face as a linear combination of the k eigenvectors corresponding to the k largest eigenvalues. The coefficients of the linear combination are the eigen-coefficients.
Note that αik is a vector of the eigencoefficients of the i-th sample point, and it has k elements. The j-th element of this vector is denoted as αik (j).
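The PCA steps above can be sketched in a few lines (a minimal NumPy sketch; the function and variable names are illustrative, not from the course code):

```python
import numpy as np

def pca(X, k):
    """PCA on a data matrix X (d x N, one sample per column).
    Returns the mean, the top-k eigenvectors (d x k) and the
    eigen-coefficients of every sample (k x N)."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                          # mean-subtract each sample
    C = Xc @ Xc.T / Xc.shape[1]          # covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(C)       # eigenvectors of C
    idx = np.argsort(vals)[::-1][:k]     # pick the k largest eigenvalues
    V = vecs[:, idx]                     # extracted eigenspace (d x k)
    alpha = V.T @ Xc                     # eigen-coefficients per sample
    return mu, V, alpha
```

Each sample is then approximated as mu + V @ alpha[:, i], i.e. the mean plus a linear combination of the top k eigenvectors.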
PCA and Face Recognition: Eigen-faces
12
Example 1
13
A face database
14
Top 25 Eigen-faces for this database!
PCA and Face recognition: Eigenfaces
15
PCA and Face recognition: Eigenfaces
16
Eigen-coefficients of the probe image zp.
PCA and Face recognition: Eigenfaces
17
Eigen-coefficients of the probe image zp.
Eigen-coefficients of the l-th gallery image xl.
Note: other distance measures (different from sum of squared differences) may also be employed. One example is sum of absolute differences, given as follows:
Another could be normalized dot product (and this distance measure should be maximized!):
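The three distance measures mentioned here, applied to two eigen-coefficient vectors, can be sketched as (NumPy; the function names are illustrative):

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences (to be minimized)."""
    return np.sum((a - b) ** 2)

def sad(a, b):
    """Sum of absolute differences (to be minimized)."""
    return np.sum(np.abs(a - b))

def ndp(a, b):
    """Normalized dot product (to be MAXIMIZED, unlike the other two)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

The probe image is assigned the identity of the gallery image whose eigen-coefficient vector is closest under the chosen measure.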
PCA and Face recognition: eigenfaces
M. Turk and A. Pentland (1991). Eigenfaces for recognition, Journal of Cognitive Neuroscience, 3(1): 71–86.
18
PCA and Face recognition: eigenfaces
19
One word of caution: Eigen-faces
20
Eigen-faces: reducing computational complexity.
21
Eigen-faces: reducing computational complexity.
22
Back to Eigen-faces: reducing computational complexity.
23
Xw is an eigenvector of C = XX^T! Computing all eigenvectors of C now has a complexity of only O(N^3) for the eigenvectors of X^T X, plus O(dN) for computing Xw from each of the N vectors w, i.e. a total of O(N^3 + dN^2), which is much less than O(d^3). Note that C has at most min(N-1, d) eigenvectors corresponding to non-zero eigenvalues (why?).
Eigenfaces: Algorithm (N << d case)
24
Eigen-faces: Algorithm (N << d case)
4. Find the eigenvectors of L:
5. Obtain the eigenvectors of C from those of L:
6. Unit-normalize the columns of V.
7. C will have at most only N eigenvectors corresponding to non-zero eigen-values*. Out of these you pick the top k (k < N) corresponding to the largest eigen-values.
* Actually this number is at most N-1 – this is due to the mean subtraction, else it would have been at most N.
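The N &lt;&lt; d algorithm above can be sketched as follows (NumPy; X is assumed already mean-subtracted, and the names are illustrative):

```python
import numpy as np

def eigenfaces_small_N(X, k):
    """Eigenfaces when N << d. X is d x N, already mean-subtracted.
    Works with the small N x N matrix L = X^T X instead of the
    d x d covariance matrix C = X X^T."""
    L = X.T @ X                        # N x N, cheap to eigendecompose
    vals, W = np.linalg.eigh(L)
    idx = np.argsort(vals)[::-1][:k]   # top k (k < N) eigenvalues
    V = X @ W[:, idx]                  # eigenvectors of C = X X^T
    V /= np.linalg.norm(V, axis=0)     # unit-normalize the columns
    return V
```

Each column of V is an eigenvector of C with the same eigenvalue as the corresponding eigenvector of L.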
25
Example 1
26
A face database
27
Top 25 Eigen-faces for this database!
Example 2
28
The Yale Face database
29
Top 25 eigenfaces from the previous database
Reconstruction of a face image using the top 1,8,16,32,…,104 eigenfaces (i.e. k varied from 1 to 104 in steps of 8)
What if both N and d are large?
30
PCA: A closer look
31
PCA: what does it do?
32
PCA
33
Look at this scatter-plot of points in 2D. The points are highly spread out in the direction of the light blue line.
PCA
34
This is how the data would look if they were rotated in such a way that the major axis of the ellipse (the light blue line) now coincided with the Y axis. As the spread of the X coordinates is now relatively insignificant (observe the axes!), we can approximate the rotated data points by their projections onto the Y-axis (i.e. their Y coordinates alone!). This was not possible prior to rotation!
PCA
35
PCA
Find the coefficient ai such that the projection ai e of each point xi onto the unit vector e most accurately approximates it.
36
Note: Here e is a unit vector.
PCA
37
PCA
38
This term is proportional to the variance of the data points when projected onto the direction e.
PCA
39
Independent of the direction e
See appendix for details
PCA
40
This term is proportional to the variance of the data when projected along e.
PCA
41
PCA
42
PCA
43
Some observations about PCA for face images
44
This error is small on an average for a well-aligned group of face images – we will see why on the next slide.
45
The eigenvalues of the covariance matrix typically decay fast in value (if the faces were properly normalized). Note that the j-th eigenvalue is proportional to the variance of the j-th eigencoefficient, i.e.
What this means is that the data have low variance when projected along most of the eigenvectors, i.e. effectively the data are concentrated in a lower-dimensional subspace of the d-dimensional space.
PCA: Compression of a set of images
46
PCA: Compression of a set of images
47
PCA: Compression of a set of images
48
PCA: Compression of a set of images
49
Which is the best orthonormal basis? Relationship between PCA and DCT
50
Which is the best orthonormal basis? Relationship between PCA and DCT
51
Which is the best orthonormal basis? Relationship between PCA and DCT
52
PCA: separable 2D version
53
But PCA is not used in JPEG, because…
The DCT is used instead!
54
DCT and PCA
55
Experiment
56
See code here.
57
0.3536 0.4904 0.4619 0.4157 0.3536 0.2778 0.1913 0.0975
0.3536 0.4157 0.1913 -0.0975 -0.3536 -0.4904 -0.4619 -0.2778
0.3536 0.2778 -0.1913 -0.4904 -0.3536 0.0975 0.4619 0.4157
0.3536 0.0975 -0.4619 -0.2778 0.3536 0.4157 -0.1913 -0.4904
0.3536 -0.0975 -0.4619 0.2778 0.3536 -0.4157 -0.1913 0.4904
0.3536 -0.2778 -0.1913 0.4904 -0.3536 -0.0975 0.4619 -0.4157
0.3536 -0.4157 0.1913 0.0975 -0.3536 0.4904 -0.4619 0.2778
0.3536 -0.4904 0.4619 -0.4157 0.3536 -0.2778 0.1913 -0.0975
DCT matrix: dctmtx command from MATLAB (see code on website)
0.3517 -0.4493 -0.4278 0.4230 0.3754 0.3247 -0.2250 -0.1245
0.3534 -0.4366 -0.2276 -0.0110 -0.3078 -0.4746 0.4732 0.2975
0.3543 -0.3101 0.1728 -0.4830 -0.3989 0.0498 -0.4299 -0.4109
0.3546 -0.1115 0.4799 -0.3005 0.3342 0.4102 0.1856 0.4761
0.3547 0.1141 0.4823 0.2944 0.3301 -0.4182 0.1745 -0.4771
0.3543 0.3104 0.1771 0.4834 -0.3977 -0.0322 -0.4308 0.4103
0.3535 0.4357 -0.2319 0.0143 -0.3009 0.4656 0.4851 -0.2975
0.3520 0.4468 -0.4328 -0.4204 0.3686 -0.3253 -0.2342 0.1261
0.3520 -0.4461 -0.4305 0.4224 0.3696 0.3247 0.2342 0.1283
0.3537 -0.4338 -0.2345 -0.0114 -0.3000 -0.4671 -0.4814 -0.3028
0.3545 -0.3086 0.1662 -0.4896 -0.4007 0.0359 0.4261 0.4102
0.3548 -0.1145 0.4763 -0.3031 0.3339 0.4198 -0.1800 -0.4713
0.3548 0.1056 0.4839 0.2926 0.3349 -0.4194 -0.1766 0.4733
0.3543 0.3043 0.1863 0.4833 -0.4028 -0.0354 0.4269 -0.4097
0.3532 0.4389 -0.2269 0.0180 -0.3008 0.4654 -0.4811 0.3037
0.3512 0.4562 -0.4300 -0.4126 0.3694 -0.3242 0.2335 -0.1319
VC: Eigenvectors of column-column correlation matrix
VR: Eigenvectors of row-row correlation matrix
1.0000 0.0007 0.0032 0.0002 0.0013 0.0001 0.0005 0.0000
0.0007 0.9970 0.0097 0.0689 0.0009 0.0322 0.0003 0.0110
0.0033 0.0106 0.9968 0.0118 0.0713 0.0004 0.0334 0.0025
0.0002 0.0718 0.0124 0.9926 0.0007 0.0927 0.0017 0.0276
0.0010 0.0001 0.0737 0.0004 0.9942 0.0008 0.0780 0.0010
0.0000 0.0261 0.0015 0.0962 0.0005 0.9934 0.0011 0.0569
0.0003 0.0007 0.0276 0.0021 0.0802 0.0010 0.9964 0.0013
0.0000 0.0076 0.0026 0.0227 0.0012 0.0596 0.0015 0.9979
1.0000 0.0002 0.0029 0.0001 0.0010 0.0000 0.0004 0.0000
0.0002 0.9965 0.0028 0.0766 0.0005 0.0314 0.0009 0.0107
0.0029 0.0025 0.9969 0.0046 0.0728 0.0017 0.0304 0.0013
0.0001 0.0795 0.0044 0.9923 0.0029 0.0916 0.0015 0.0243
0.0008 0.0003 0.0747 0.0026 0.9948 0.0061 0.0696 0.0004
0.0000 0.0246 0.0021 0.0949 0.0069 0.9940 0.0131 0.0452
0.0003 0.0004 0.0252 0.0003 0.0715 0.0137 0.9970 0.0002
0.0000 0.0076 0.0013 0.0207 0.0001 0.0476 0.0009 0.9986
Absolute value of dot products between the columns of DCT matrix and columns of VR (left) and VC (right)
58
DCT bases
64 columns of V – each reshaped to form an 8 x 8 image, and rescaled to fit in the 0-1 range. Notice the similarity between the DCT bases and the columns of V. Again, V is the Kronecker product of VR and VC.
DCT and PCA
59
DCT and PCA
60
61
1.0000 0.9902 0.9795 0.9733 0.9682 0.9639 0.9604 0.9570
0.9902 1.0005 0.9908 0.9795 0.9734 0.9684 0.9643 0.9605
0.9795 0.9908 1.0010 0.9908 0.9796 0.9735 0.9689 0.9646
0.9733 0.9795 0.9908 1.0005 0.9904 0.9793 0.9735 0.9686
0.9682 0.9734 0.9796 0.9904 1.0004 0.9903 0.9794 0.9734
0.9639 0.9684 0.9735 0.9793 0.9903 1.0001 0.9903 0.9793
0.9604 0.9643 0.9689 0.9735 0.9794 0.9903 1.0004 0.9904
0.9570 0.9605 0.9646 0.9686 0.9734 0.9793 0.9904 1.0002
1.0000 0.9888 0.9770 0.9704 0.9648 0.9599 0.9554 0.9510
0.9888 1.0004 0.9891 0.9768 0.9703 0.9646 0.9596 0.9548
0.9770 0.9891 1.0004 0.9886 0.9764 0.9698 0.9640 0.9587
0.9704 0.9768 0.9886 0.9994 0.9878 0.9755 0.9687 0.9627
0.9648 0.9703 0.9764 0.9878 0.9986 0.9870 0.9746 0.9676
0.9599 0.9646 0.9698 0.9755 0.9870 0.9978 0.9861 0.9734
0.9554 0.9596 0.9640 0.9687 0.9746 0.9861 0.9967 0.9847
0.9510 0.9548 0.9587 0.9627 0.9676 0.9734 0.9847 0.9951
CR/CR(1,1): notice it can be approximated by the form shown two slides before, with ρ ≈ 0.99.
CC/CC(1,1): notice it can be approximated by the form shown two slides before, with ρ ≈ 0.9888.
More results from the previous experiment. See code here.
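The experiment can be reproduced in a few lines: build the orthonormal DCT matrix (the NumPy analogue of MATLAB's dctmtx) and compare its rows with the eigenvectors of a first-order Markov correlation matrix with ρ ≈ 0.99. This is a sketch under that model assumption; the actual image statistics in the slides give slightly different numbers.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, like MATLAB's dctmtx(n)."""
    D = np.zeros((n, n))
    for i in range(n):
        c = np.sqrt(1.0 / n) if i == 0 else np.sqrt(2.0 / n)
        for j in range(n):
            D[i, j] = c * np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    return D

n, rho = 8, 0.99
D = dct_matrix(n)
# First-order Markov (AR(1)) correlation matrix: C[i, j] = rho^|i-j|
C = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
_, V = np.linalg.eigh(C)
# |dot products| between DCT basis vectors and eigenvectors of C:
# each DCT vector has one near-unity match, as in the tables above
M = np.abs(D @ V)
```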
Non-negative matrix factorization (NMF) and non-negative sparse coding (NNSC)
62
NMF
63
NMF
64
PCA versus NMF bases
65
Source:
Lee and Seung, “Learning the parts of objects by non-negative matrix factorization”,
http://lsa.colorado.edu/LexicalSemantics/seung-nonneg-matrix.pdf
NMF bases are non-negative, unlike eigenfaces. In that sense, each basis image from NMF looks more like (a part of) a face than an eigenface does.
In practice they do not appear as sparse as seen in this figure, but more like what you will see on the next slide.
66
Source:
Wang et al, NMF framework for face recognition,
NMF: Objective function
67
NMF: Multiplicative updates
68
NMF: Multiplicative updates
69
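The Lee-Seung multiplicative updates for the Frobenius-norm NMF objective can be sketched as follows (NumPy; the small constant eps, an implementation detail of this sketch, guards against division by zero):

```python
import numpy as np

def nmf(V, k, iters=1000, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2.
    V: non-negative m x n matrix; returns W (m x k) and H (k x n),
    both kept non-negative by the multiplicative form of the updates."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, H fixed
    return W, H
```

Because every factor in the updates is non-negative, W and H stay non-negative without any explicit projection step.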
Non-negative sparse coding
70
Non-negative sparse coding
71
Results
72
NNSC and Poisson Data
73
X is a Poisson random variable, i is a non-negative integer, λ is the mean of the Poisson distribution
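The Poisson probability mass function referred to above is the standard one:

```latex
P(X = i) = \frac{e^{-\lambda}\,\lambda^{i}}{i!}, \qquad i = 0, 1, 2, \ldots
```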
NNSC and Poisson Data
74
75
Image quality at 9 different average rates of photon emission.
NNSC and Poisson denoising
76
Non-negative Poisson likelihood where ykl is a noisy pixel
Sparsity on coefficients
77
78
A comment on image processing under Poisson noise
79
Learning Overcomplete Dictionaries
80
Introduction: Complete and over-complete dictionaries
81
Introduction: Complete and over-complete bases
82
Introduction: Construction of over-complete bases
83
Introduction: uniqueness? ☹
84
Introduction: compactness! ☺
85
Introduction: compactness! ☺
86
More problems ☹
87
More problems ☹
88
Solving (?) those problems
89
Learning the dictionaries
90
Learning the dictionary: analogy with K-means
91
Learning the bases: analogy with K-means
92
Learning the bases: Method 1 (Olshausen and Field, 1996)
93
Uniform prior on A
Gaussian likelihood, i.e. assuming Gaussian noise
Laplacian prior on sk
(given any A)
Two-step iterative procedure
94
Toy Experiment 1
95
Result of basis learning (dictionary with 144 elements) with sparsity constraints on the codes. Training performed on 12 x 12 patches extracted from natural images.
It is conjectured that the simple cells of the V1 area of the visual cortex in the brain have responses similar in shape to these learned dictionary elements.
Ref: Olshausen and Field, “Natural image statistics and efficient coding”
96
These are the results one obtains upon performing PCA on the 12 x 12 patches.
Ref: Olshausen and Field, “Natural image statistics and efficient coding”
Toy Experiment 2
97
Ref: Lewicki and Sejnowski, “Learning overcomplete representations”
98
Ref: Lewicki and Sejnowski, “Learning overcomplete representations”
99
Ref: Lewicki and Sejnowski, “Learning overcomplete representations”
Learning the Bases: Method of Optimal Directions (MOD) - Method 2
100
Ref: Engan et al, “Method of optimal directions for frame design”
Learning the Bases: Method of Optimal Directions (MOD) - Method 2
101
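The MOD dictionary update, for fixed sparse codes S, is the least-squares minimizer A = Y pinv(S). A minimal NumPy sketch (the sparse-coding step, e.g. OMP, which alternates with this update, is omitted; in practice the columns of A are also renormalized to unit norm):

```python
import numpy as np

def mod_update(Y, S):
    """MOD dictionary update: the least-squares minimizer of
    ||Y - A S||_F over A, for fixed sparse codes S."""
    return Y @ np.linalg.pinv(S)
```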
Learning the Bases: Method 3- Union of Orthonormal Bases
102
Learning the Bases: Method 3- Union of Ortho-normal Bases
103
Learning the Bases: Method 3- Union of Ortho-normal Bases
104
There is a quick way of performing this optimization given an ortho-normal basis – SOFT THRESHOLDING (which could be replaced by hard thresholding if you had a stronger sparseness prior than an L1 norm).
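Soft thresholding is the elementwise closed-form minimizer of 0.5(x − s)² + t|s|; a one-line NumPy sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft thresholding with threshold t >= 0: shrink each entry of x
    toward zero by t, setting entries with |x| <= t exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```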
Learning the Bases: Method 3- Union of Ortho-normal Bases
105
Why are we doing this? It is related to the so-called orthogonal Procrustes problem - a well-known application of SVD. We will see this on the next slide.
The specific problem we are solving is given below. Note that it cannot be solved using a pseudo-inverse, as that would not impose the orthonormality constraint if there is noise in the data or if the coefficients are perturbed or thresholded.
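The orthogonal Procrustes solution via SVD can be sketched in two lines (NumPy; finding the orthonormal Q minimizing ||Y − QS||_F):

```python
import numpy as np

def procrustes(Y, S):
    """Orthogonal Procrustes: the orthonormal Q minimizing ||Y - Q S||_F
    is Q = U V^T, where U diag(.) V^T is the SVD of Y S^T."""
    U, _, Vt = np.linalg.svd(Y @ S.T)
    return U @ Vt
```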
106
Learning the Bases: Method 3- Union of Ortho-normal Bases
107
Learning the Bases: Method 4 – K-SVD
108
109
Row ‘k’ (NOT column) of matrix S
Does NOT depend on the k-th dictionary column
How to find ak, given the above expression? We have decomposed the original error matrix, i.e. Y-AS, into a sum of rank-1 matrices, out of which only the last term depends on ak. So we are trying to find a rank-1 approximation for Ek, and this can be done by computing the SVD of Ek, and using the singular vectors corresponding to the largest singular value.
110
Problem! The dictionary codes may no longer be sparse! SVD does not have any in-built sparsity constraint! So, we proceed as follows:
Consider the error matrix defined as follows:
Considers only those columns of Y (i.e. only those data-points) that actually USE the k-th dictionary atom, effectively yielding a smaller matrix, of size p by |ωk|
This update affects the sparse codes of only those data-points that used the k-th dictionary element, i.e. those that are part of ωk.
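The K-SVD single-atom update described above can be sketched as follows (NumPy; restricting the error matrix to the columns in ωk keeps the sparsity pattern of S intact):

```python
import numpy as np

def ksvd_atom_update(Y, A, S, k):
    """K-SVD update of the k-th dictionary atom and the k-th row of S.
    Only the data points that actually use atom k (the set omega_k)
    are touched, so the sparsity pattern of S is preserved."""
    omega = np.nonzero(S[k, :])[0]           # data points using atom k
    if omega.size == 0:
        return A, S
    # Error matrix with the contribution of atom k removed
    E = Y - A @ S + np.outer(A[:, k], S[k, :])
    Ek = E[:, omega]                         # restricted error matrix
    U, sig, Vt = np.linalg.svd(Ek)           # best rank-1 approximation
    A[:, k] = U[:, 0]                        # new (unit-norm) atom
    S[k, omega] = sig[0] * Vt[0, :]          # new coefficients
    return A, S
```

Since the previous (atom, row) pair is one feasible rank-1 choice, this update can never increase the fitting error.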
KSVD and K-means
112
Implementation issues
113
(Many) Applications of KSVD
114
Application: Image Compression
115
Application: Image Compression (Training Phase)
116
Application: Image Compression (Testing Phase)
118
Application: Image Compression (Testing Phase)
119
Application: Image Compression (Testing Phase)
120
Number of bits required to store the number of coefficients for each patch
Number of bits required to store the dictionary index for each coefficient
Number of bits required to code each coefficient (quantization level). The quantization makes this a lossy compression algorithm.
Sum total of the number of coefficients representing each patch
Huffman encoding
121
Even if the error 'e' for the OMP is fixed, we need to compute the total MSE between the true and the compressed image. This is due to the effects of quantization while storing the sparse coefficient values for each patch.
123
Y = noisy image
X = underlying clean image (to be estimated)
αij = sparse dictionary coefficients for patch at location (i,j)
D = dictionary
Rij = matrix that extracts the patch xij from image X, i.e.
xij = Rij X
Application: Image Denoising
124
KSVD Algorithm for Denoising (Dictionary learned on the noisy image)
125
KSVD Algorithm for Denoising (Dictionary learned on the noisy image)
126
This equation shows, in a mathematically rigorous way, how X is reconstructed by averaging the overlapping denoised patches together with the noisy image itself.
127
Baseline for comparison: method by Portilla et al, “Image denoising using scale mixture of Gaussians in the wavelet domain”, IEEE TIP 2003.
128
130
Application of KSVD: Filling in Missing Pixels
131
Application of KSVD: Filling in Missing Pixels
132
Application of KSVD: Filling in Missing Pixels
133
Application of KSVD: Filling in Missing Pixels
134
NOTE: Dictionary elements without masking!!
KSVD Application: Coded Exposure Image
135
It is a coded superposition (i.e. summation) of T sub-frames within a unit integration time of the video camera.
Conventional capture
(simple integration across time, without modulation by binary codes)
Coded exposure image (captured in one unit integration time of the camera)
Binary code at time instant t
Sub-frame at time instant t
KSVD Application: Sparse Coding in Video CS
136
y is a vectorized form of the snapshot image I of n pixels.
f is a vectorized form of the underlying video with T frames and each frame having n pixels, hence f has nT pixels in total.
Ψ is a 3D DCT basis – the basis can also be “learned offline on a representative set of videos”
Φ is a matrix of size n x nT containing values from the binary code S. See below:
Diag(St) represents an n x n diagonal matrix whose diagonal contains the elements of the mask St.
Dictionary Learning
137
KSVD is a dictionary learning technique. Given a set of N input patches, each with n pixels (an n x N matrix), it learns a set of K vectors (an n x K matrix) such that a sparse linear combination of these vectors approximates each patch as closely as possible. There are K coefficients per patch, out of which most are encouraged to be 0 (i.e. sparse).
139
Video reconstruction results on real data available below:
Blind compressed sensing
140
Blind compressed sensing
141
Blind compressed sensing
142
Blind compressed sensing
143
Blind compressed sensing
144
This matrix will be invertible with probability 1 only if N is such that Nm >= n. Notice that it is important for every point to have a different sensing matrix, otherwise the dictionary cannot be inferred.
Blind CS: Pseudocode
145
Blind compressed sensing: experiments
146
Blind compressed sensing: experiments
147
148
149
150
Transform Learning for Classification
151
Transform learning
152
Transform learning
153
154
Good w
Not so good w
Fisher’s linear discriminant (FLD)
155
Fisher’s linear discriminant (FLD)
156
Criterion for a good w:
The transformed mean vectors should be as far apart as possible.
But that is not enough. We want the difference between the transformed mean vectors to be large RELATIVE to the variances of the transformed points of each class. Otherwise the points of the two classes may overlap due to a large spread, even if their means are far apart.
Fisher’s linear discriminant (FLD)
157
Between-class scatter matrix
158
Between-class scatter matrix
Within-class scatter matrix
Find w which maximizes this criterion – often called the Fisher criterion
Fisher’s linear discriminant
159
Called a Generalized eigenvalue problem.
Assumes that SW is not singular.
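For two classes, the generalized eigenvalue problem has the well-known closed form w ∝ S_W⁻¹(m1 − m2); a minimal NumPy sketch (illustrative names, samples stored as rows):

```python
import numpy as np

def fld_two_class(X1, X2):
    """Fisher's linear discriminant for two classes (one sample per row).
    Maximizing the Fisher criterion gives w proportional to
    S_W^{-1} (m1 - m2), assuming the within-class scatter S_W
    is non-singular."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```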
Fisher’s linear discriminant
160
Fisher’s linear discriminant: classification
161
Fisher’s linear discriminant: case of c classes
162
Fisher’s linear discriminant: case of c classes
163
Fisher’s linear discriminant: case of c classes
164
Fisher’s linear discriminant: case of c classes
165
Fisher’s criterion: case of c classes
166
The determinant of the scatter matrix is a scalar measure of the scatter. As the determinant is the product of eigenvalues, it is a product of variances in the principal directions and thus the square of the scattering volume of the hyperellipsoid.
Fisher’s criterion: case of c classes
167
Fisher’s linear discriminant: uniqueness
168
Fisherfaces: applications in face recognition
169
Fisherfaces: applications in face recognition
170
Fisherfaces: applications in face recognition
171
d x (N-c) matrix
Results with Fisherfaces
172
173
Fisherfaces produces better results!
174
Fisherfaces produces better results!
Classification using Mutual Information
175
Mutual information: definition
176
Mutual information: definition
177
MI for classification
178
MI for classification
179
Probability expressions
180
Segue: Kernel density estimation
181
Segue: Kernel density estimation
182
Segue: Kernel density estimation
183
Probability expressions
184
Quadratic mutual information
185
QMI expressions
186
QMI expressions
187
Within-class differences
Differences of all pairs of samples (regardless of class)
QMI expressions for derivatives
188
QMI optimization
189
QMI optimization: orthogonalization
190
191
This forces u2 to be orthogonal to u1.
Gram Schmidt Orthogonalization Procedure
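The Gram-Schmidt procedure used here can be sketched as follows (NumPy; classical Gram-Schmidt on the columns of U, assumed linearly independent):

```python
import numpy as np

def gram_schmidt(U):
    """Orthonormalize the columns of U (assumed linearly independent):
    subtract from each column its components along the previously
    computed orthonormal directions, then normalize."""
    Q = np.zeros_like(U, dtype=float)
    for j in range(U.shape[1]):
        v = U[:, j].astype(float)
        for i in range(j):
            v -= (Q[:, i] @ U[:, j]) * Q[:, i]  # remove component along q_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q
```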
Results
192
Compressed Sensing with Overcomplete Dictionaries
194
Towards CS with overcomplete dictionaries
195
Towards CS with overcomplete dictionaries
196
Towards CS with overcomplete dictionaries
197
Note that Σs is the image under D of all s-sparse vectors. The only difference is that D here has more columns than rows unlike the case of orthonormal matrices before. Also random Gaussian and Bernoulli matrices obey D-RIP with high probability given O(s log n) rows, just like they obey RIP.
CS with tight frames
198
Candes et al, Compressed sensing with coherent and redundant dictionaries, Applied and Computational Harmonic Analysis, 2010
References
199
Appendix: Covariance matrix
200
Sample values of the random variable
Probability density function of x
Appendix: Covariance matrix
201
k-th sample value of xi
Appendix: Covariance matrix
202
Appendix: A word about Lagrange Multipliers
203
Appendix: A word about Lagrange Multipliers
204
205
Suppose we are walking from (x1,y1) to (x2,y2) along the constraint curve h(x,y) = 0. Initially, the tangent vector along h(x,y) = 0 has a component along -grad(f), and hence the value of f(x,y) decreases as we move along h(x,y) = 0. If the point (x*,y*) is a local minimum of f(x,y), a small motion along h(x,y) = 0 from (x*,y*) will have no component along grad(f), else it would cause an increase in the value of f. Hence the tangent to h(x,y) = 0 at (x*,y*) is perpendicular to grad(f). As we move further, motion along h(x,y) = 0 has a component along +grad(f), leading to an increase in the value of f(x,y). At the minimum point, grad(f) is perpendicular to the tangent, i.e. grad(f) and grad(h) are collinear. Hence, there exists some value λ (the Lagrange multiplier) such that we have:
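In symbols, the collinearity condition at the constrained minimum reads (up to the sign convention chosen for λ):

```latex
\nabla f(x^*, y^*) + \lambda \,\nabla h(x^*, y^*) = 0
```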
206