1 of 206

Dictionary Learning

CS 754

1

2 of 206

Problem statement

  • Images or small patches from images can often be expressed as a linear combination of columns (or “atoms”) of a matrix.

  • This matrix is called a dictionary and the coefficients of the linear combination are called dictionary coefficients.


Index for image/image-patch = i

3 of 206

Problem statement

  • We have seen previously that these columns could belong to the DCT or wavelet basis matrix – these are called universal bases.

  • But we can go one step beyond – we can infer (or learn) the dictionary from the data. This is called dictionary learning.


4 of 206

Problem statement

  • The dictionary can be forced to obey a variety of constraints:
  • it may be orthonormal
  • it may be overcomplete (i.e. the number of columns exceeds the number of rows)
  • it may be non-negative.

  • The learned dictionary can be useful in a variety of applications – compression, denoising, deblurring, inpainting as well as compressive recovery.


5 of 206

Dictionary learning algorithms

  • We will study the following dictionary learning algorithms:
  • Principal Components Analysis (PCA)
  • Non-negative sparse coding
  • Method of optimal directions (MOD)
  • Union of orthonormal bases
  • K-SVD


6 of 206

PCA (Principal Components Analysis)


7 of 206

Principal Components Analysis (PCA)

  • Consider N vectors (or points) xi, each containing d elements, each represented as a column vector.
  • We say xi ∈ R^d, i.e. each xi has d real-valued elements. d could be very large – like 250,000 or more.
  • Our aim is to extract some k features from each xi, k << d.
  • Effectively we are projecting the original vectors from a d-dimensional space to a k-dimensional space. This is called dimensionality reduction. PCA is one method of dimensionality reduction.


8 of 206

PCA

  • How do we pick the ‘k’ appropriate features?

  • We look into a notion of “compressibility” – how much can the data be compressed, allowing for some small errors.


9 of 206

PCA: Algorithm

  1. Compute the mean of the given points: μ = (1/N) Σi xi

  2. Deduct the mean from each point: x̄i = xi − μ

  3. Compute the covariance matrix of these mean-deducted points: C = X Xᵀ, where X = [x̄1 | x̄2 | … | x̄N] is the d x N matrix of mean-deducted points.


10 of 206

PCA: algorithm

4. Find the eigenvectors of C: C vj = λj vj, j = 1, …, d.

5. Extract the k eigenvectors corresponding to the k largest eigenvalues. This is called the extracted eigenspace: Vk = [v1 | v2 | … | vk] (a d x k matrix).


There is an implicit assumption here that the first k indices indeed correspond to the k largest eigenvalues. If that is not true, you would need to pick the appropriate indices.

11 of 206

PCA: algorithm

6. Project each point onto the eigenspace, giving a vector of k eigen-coefficients for that point: αik = Vkᵀ x̄i.


We are representing each face as a linear combination of the k eigenvectors corresponding to the k largest eigenvalues. The coefficients of the linear combination are the eigen-coefficients.

Note that αik is a vector of the eigencoefficients of the i-th sample point, and it has k elements. The j-th element of this vector is denoted as αik (j).
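The six steps above can be sketched in a few lines of code. This is an illustrative Python/NumPy sketch on synthetic data (the course's own code is in MATLAB; names like `Vk` and `alpha` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 20, 50, 3
pts = rng.standard_normal((d, N))       # N points in d dimensions, one per column

mu = pts.mean(axis=1, keepdims=True)    # step 1: mean of the given points
Xbar = pts - mu                         # step 2: deduct the mean from each point
C = Xbar @ Xbar.T                       # step 3: covariance matrix C = X X^T

evals, evecs = np.linalg.eigh(C)        # step 4: eigenvectors of C (ascending eigenvalues)
order = np.argsort(evals)[::-1]         # pick indices of the k largest eigenvalues
Vk = evecs[:, order[:k]]                # step 5: extracted eigenspace (d x k)

alpha = Vk.T @ Xbar                     # step 6: k eigen-coefficients per point
recon = mu + Vk @ alpha                 # rank-k approximation of the data
```

Note the explicit sort: `eigh` returns eigenvalues in ascending order, which is exactly the "implicit assumption" caveat above.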

12 of 206

PCA and Face Recognition: Eigen-faces

  • Consider a database of cropped, frontal face images (which we will assume are aligned and under the same illumination). These are the gallery images.

  • We will reshape each such image (a 2D array of size H x W after cropping) to form a column vector of d = HW elements. Each image will be a vector xi, as per the notation on the previous two slides.

  • And then carry out the six steps mentioned before.

  • The eigenvectors that we get in this case are called eigenfaces. Each eigenvector has d elements. If you reshape those eigenvectors to form images of size H x W, those images look like (filtered!) faces.


13 of 206

Example 1


A face database

14 of 206


Top 25 Eigen-faces for this database!

15 of 206

PCA and Face Recognition: Eigenfaces

  • For each gallery image, you compute the eigen-coefficients. You then store the eigen-coefficients and the identity of the person in a database.
  • You also store the mean μ and the eigenspace matrix Vk in the database.
  • During the testing phase, you are given a probe image (say) zp in the form of a column vector of HW elements.
  • You deduct the mean image from zp: z̄p = zp − μ.


16 of 206

PCA and Face Recognition: Eigenfaces

  • You then project the mean-deducted face image onto the eigen-space: αp = Vkᵀ z̄p.

  • Now, compare αp with all the αik (eigen-coefficients of the gallery images) in the database.

  • Find the closest match in terms of the squared distance between the eigen-coefficients: l* = argmin over l of ||αp − αlk||². That gives you the identity (see next slide).


Eigen-coefficients of the probe image zp.

17 of 206

PCA and Face Recognition: Eigenfaces


Eigen-coefficients of the probe image zp.

Eigen-coefficients of the l-th gallery image xl.

Note: other distance measures (different from the sum of squared differences) may also be employed. One example is the sum of absolute differences: Σj |αp(j) − αlk(j)|.

Another could be the normalized dot product (and this distance measure should be maximized!): (αpᵀ αlk) / (||αp|| ||αlk||).

18 of 206

PCA and Face recognition: eigenfaces

  • The eigen-face images contain more and more high frequency information as the corresponding eigen-values decrease.
  • Although PCA is a technique that has been known for a long time, its application in face recognition was pioneered by Turk and Pentland in a classic paper in 1991.

M. Turk and A. Pentland (1991). Eigenfaces for recognition, Journal of Cognitive Neuroscience, 3(1): 71–86.


19 of 206

PCA and Face recognition: eigenfaces

  • We can regard the k eigenfaces as key signatures.
  • We express each face image as a linear combination of these eigenfaces, i.e. the average face + (say) 3 times eigenface 1 + (say) 5 times eigenface 2 + (say) -1 times eigenface 3 and so on. (note: 3,5,-1 here are the eigen-coefficients, and some of them can be negative).


20 of 206

One word of caution: Eigen-faces

  • The algorithm described earlier is computationally infeasible for face images (or eigenfaces), as it requires storage of a d x d covariance matrix (d – the number of image pixels – could be more than 10,000). And the computation of the eigenvectors of such a matrix is an O(d³) operation!

  • We will study a modification to this that will bring down the computational cost drastically.


21 of 206

Eigen-faces: reducing computational complexity.

  • Consider the covariance matrix: C = X Xᵀ (size d x d), where X is the d x N matrix of mean-deducted points.

  • It will require too much memory if d is large, and computing its eigenvectors will be a horrendous task!

  • Consider the case when N is much less than d. This is very common in face recognition applications. The number of training images is usually much smaller than the size of the image.


22 of 206

Eigen-faces: reducing computational complexity.

  • In such a case, the rank of C is at most N−1. So C will have at most N−1 non-zero eigen-values.

  • We can write C in the following way: C = X Xᵀ = Σi x̄i x̄iᵀ, where X = [x̄1 | x̄2 | … | x̄N].


23 of 206

Back to Eigen-faces: reducing computational complexity.

  • Consider the matrix XᵀX (size N x N) instead of XXᵀ (size d x d). If w is an eigenvector of XᵀX, i.e. XᵀX w = λw, then pre-multiplying both sides by X gives XXᵀ(Xw) = λ(Xw).


Xw is an eigenvector of C = XXᵀ! Computing all eigenvectors of C now has a complexity of only O(N³) for the eigenvectors of XᵀX, plus O(dN) for computing Xw from each of the N vectors w – a total of O(N³ + dN²), which is much less than O(d³). Note that C has at most min(N−1, d) eigenvectors corresponding to non-zero eigen-values (why?).
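The trick is easy to verify numerically. A sketch assuming NumPy, with synthetic data standing in for face images:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 500, 10                          # far fewer images than pixels (N << d)
X = rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)   # mean-deducted data matrix

L = X.T @ X                             # small N x N matrix instead of d x d
lam, W = np.linalg.eigh(L)              # eigenvectors w of X^T X: only O(N^3)
V = X @ W                               # each column Xw is an eigenvector of C = X X^T
V = V / np.linalg.norm(V, axis=0)       # unit-normalize the columns
```

The eigenvalues `lam` of XᵀX are the non-zero eigenvalues of XXᵀ, and mean subtraction forces one of them to (numerically) zero.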

24 of 206

Eigenfaces: Algorithm (N << d case)

  1. Compute the mean of the given points: μ = (1/N) Σi xi

  2. Deduct the mean from each point: x̄i = xi − μ

  3. Compute the following matrix: L = XᵀX (size N x N), where X = [x̄1 | x̄2 | … | x̄N].


25 of 206

Eigen-faces: Algorithm (N << d case)

4. Find the eigenvectors of L: L wj = λj wj.

5. Obtain the eigenvectors of C from those of L: V = XW, where W = [w1 | w2 | … | wN].

6. Unit-normalize the columns of V.

7. C will have at most only N eigenvectors corresponding to non-zero eigen-values*. Out of these you pick the top k (k < N) corresponding to the largest eigen-values.

* Actually this number is at most N-1 – this is due to the mean subtraction, else it would have been at most N.


26 of 206

Example 1


A face database

27 of 206


Top 25 Eigen-faces for this database!

28 of 206

Example 2


The Yale Face database

29 of 206


Top 25 eigenfaces from the previous database

Reconstruction of a face image using the top 1,8,16,32,…,104 eigenfaces (i.e. k varied from 1 to 104 in steps of 8)

30 of 206

What if both N and d are large?

  • This can happen, for example, if you wanted to build an eigenspace for face images of all people in Mumbai.
  • Divide people into coherent groups based on some visual attributes (eg: gender, age group etc) and build separate eigenspaces for each group.


31 of 206

PCA: A closer look

  • PCA has many applications – apart from face/object recognition – in image processing/computer vision, statistics, econometrics, finance, agriculture, and you name it!

  • Why PCA? What’s special about PCA? See the next slides!


32 of 206

PCA: what does it do?

  • It finds ‘k’ perpendicular directions (all passing through the mean vector) such that the original data are approximated as accurately as possible when projected onto these ‘k’ directions.

  • We will see soon why these ‘k’ directions are eigenvectors of the covariance matrix of the data!


33 of 206

PCA


Look at this scatter-plot of points in 2D. The points are highly spread out in the direction of the light blue line.

34 of 206

PCA


This is how the data would look if they were rotated in such a way that the major axis of the ellipse (the light blue line) now coincided with the Y axis. As the spread of the X coordinates is now relatively insignificant (observe the axes!), we can approximate the rotated data points by their projections onto the Y-axis (i.e. their Y coordinates alone!). This was not possible prior to rotation!

35 of 206

PCA

  • As we could ignore the X-coordinates of the points post rotation and represent them just by the Y-coordinates, we have performed some sort of lossy data compression – or dimensionality reduction.

  • The job of PCA is to perform such a rotation as shown on the previous two slides!


36 of 206

PCA

  • Aim of PCA: Find the line passing through the sample mean μ, along a unit direction e, such that the projection ai e of any mean-deducted point x̄i = xi − μ onto e most accurately approximates it. Here ai = eᵀ x̄i, and the per-point error is ||x̄i − ai e||².

Note: Here e is a unit vector.

37 of 206

PCA

  • Summing up over all points, we get the total error: E(e) = Σi ||x̄i − (eᵀx̄i) e||² = Σi ||x̄i||² − Σi (eᵀx̄i)².


38 of 206

PCA


The second term, Σi (eᵀx̄i)² = eᵀCe, is proportional to the variance of the data points when projected onto the direction e.

39 of 206

PCA


The first term, Σi ||x̄i||², is independent of the direction e. Hence minimizing E(e) is equivalent to maximizing eᵀCe subject to eᵀe = 1 – see appendix for details.

40 of 206

PCA

  • PCA thus projects the data onto the direction that minimizes the total squared difference between the data-points and their respective projections along that direction.

  • This equivalently yields the direction along which the spread (or variance) will be maximum.

  • Why? Note that the eigenvalue of a covariance matrix tells you the variance of the data when projected along that particular eigenvector: if Ce = λe with eᵀe = 1, then eᵀCe = λ, and this term is proportional to the variance of the data when projected along e.

41 of 206

PCA

  • But for most applications (including face recognition), just a single direction is absolutely insufficient!

  • We will need to project the data (from the high-dimensional, i.e. d-dimensional space) onto k (k << d) different mutually perpendicular directions.

  • What is the criterion for deriving these directions?

  • We seek those k directions for which the total reconstruction error of all the N images when projected on those directions is minimized.


42 of 206

PCA

  • We seek those k directions for which the total reconstruction error of all the N images when projected on those directions is minimized.

  • One can prove that these k directions will be the eigenvectors of the S matrix (equivalently covariance matrix of the data) corresponding to the k-largest eigenvalues. These k directions form the eigen-space.

  • If the eigenvalues of S are distinct, these k directions are defined uniquely (up to a sign factor)


43 of 206

PCA

  • One can prove that these k directions will be the eigenvectors of the S matrix (equivalently covariance matrix of the data) corresponding to the k-largest eigenvalues. These k directions form the eigen-space.

  • Sketch of the proof:
  • Assume we have found e1 and are looking for e2 (where e2 is perpendicular to e1 and e2 has unit magnitude).
  • Write out the objective function with the two constraints.
  • Take the derivative of the objective function, set it to 0, and do some algebra to see that e2 is the eigenvector of S with the second largest eigenvalue.
  • Proceed similarly for other directions.


44 of 206

Some observations about PCA for face images

  • The matrix V (with all columns) is orthonormal. Hence the squared error between any image and its approximation using just the top k eigenvectors is given by: ||x̄i − Vk αik||² = Σ over j from k+1 to d of (vjᵀ x̄i)².


This error is small on an average for a well-aligned group of face images – we will see why on the next slide.

45 of 206


The eigenvalues of the covariance matrix typically decay fast in value (if the faces were properly normalized). Note that the j-th eigenvalue is proportional to the variance of the j-th eigencoefficient, i.e. λj ∝ Σi αik(j)².

What this means is that the data have low variance when projected along most of the eigenvectors, i.e. effectively the data are concentrated in a lower-dimensional subspace of the d-dimensional space.

46 of 206

PCA: Compression of a set of images

  • Consider a database of N images that are “similar” (eg: all are face images, all are car images, etc.)
  • Build an eigen-space from some subset of these images (could be all images, as well)
  • We know that these images can often be reconstructed very well (i.e. with low error) using just a few eigenvectors.


47 of 206

PCA: Compression of a set of images

  • Use this fact for image compression.
  • Original data storage = d pixels x N images = Nd bytes (assume one byte per pixel intensity) = 8Nd bits.
  • After PCA: Nk x number of bits to store each eigen-coefficient = 32Nk bits (remember k << d, example: d ~ 250,000 and k ~ 100).
  • Plus storage of eigenvectors = 32dk bits (remember k << d as well).
  • Plus mean image = 8d bits.
  • Total: 32(N+d)k + 8d bits


48 of 206

PCA: Compression of a set of images

  • Example: N= 5000, d = 250000, k = 100
  • Original size/(size after PCA compression) ~ 12.2.
  • Note: we allocated 32 bits for every element of the eigen-vector. This is actually very conservative and you can have further savings using several tricks.
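The ~12.2 figure follows directly from the bit counts on the previous slide; a quick check:

```python
# Storage arithmetic from the slides: 8 bits per original pixel, 32 bits per
# stored eigen-coefficient and per eigenvector element, 8 bits per mean-image pixel.
N, d, k = 5000, 250000, 100

original_bits = 8 * N * d                            # 8Nd
compressed_bits = 32 * N * k + 32 * d * k + 8 * d    # 32(N + d)k + 8d
ratio = original_bits / compressed_bits
print(round(ratio, 1))                               # -> 12.2
```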


49 of 206

PCA: Compression of a set of images

  • This differs a lot from JPEG compression.
  • JPEG uses discrete cosine transform (DCT) as the basis. PCA tunes the basis to the underlying training data! If you change the training data, the basis changes!
  • The performance of the PCA compression algorithm will depend on how compactly the eigen-space can represent the data.
  • We will study image compression using JPEG and other standards later on in the course.


50 of 206

Which is the best orthonormal basis? Relationship between PCA and DCT

  • Consider a set of M data-points x1, …, xM (e.g. image patches in a vectorized form), each represented as a linear combination of the column vectors of an ortho-normal basis matrix U: xi = U θi.

  • Suppose we reconstruct each patch using only a subset Si of some k coefficients as follows: x̂i = Σ over j in Si of θi(j) uj, with |Si| = k.


51 of 206

Which is the best orthonormal basis? Relationship between PCA and DCT

  • For which orthonormal basis U is the following error the lowest: E = Σi ||xi − x̂i||² ?


52 of 206

Which is the best orthonormal basis? Relationship between PCA and DCT

  • The answer is the PCA basis, i.e. the set of k eigenvectors of the correlation matrix C corresponding to the k largest eigen-values. Here C is defined as: C = Σi xi xiᵀ.


53 of 206

PCA: separable 2D version

  • Find the correlation matrix CR of row vectors from the patches.
  • Find the correlation matrix CC of column vectors from the patches.
  • The final PCA basis is the Kronecker product of the individual bases: V = VR ⊗ VC.


54 of 206

But PCA is not used in JPEG, because…

  • It is image-dependent, and the basis matrix would need to be computed afresh for each image.
  • The basis matrix would need to be stored for each image.
  • It is expensive to compute – O(n³) for a vector with n elements.

The DCT is used instead!


55 of 206

DCT and PCA

  • The DCT can be computed very fast using the FFT.
  • It is universal – no need to store the DCT bases explicitly.
  • DCT has very good energy compaction properties, only slightly worse than PCA.


56 of 206

Experiment

  • Suppose you extract M ~ 100,000 small-sized (8 x 8) patches from a set of images.
  • Compute the column-column and row-row correlation matrices.

  • Compute their eigenvectors VR and VC.
  • The eigenvectors will be very similar to the columns of the 1D-DCT matrix! (as evidenced by dot product values).
  • Now compute the Kronecker product of VR and VC and call it V. Reshape each column of V to form an image. These images will appear very similar to the DCT bases.


See code here.

57 of 206


0.3536 0.4904 0.4619 0.4157 0.3536 0.2778 0.1913 0.0975

0.3536 0.4157 0.1913 -0.0975 -0.3536 -0.4904 -0.4619 -0.2778

0.3536 0.2778 -0.1913 -0.4904 -0.3536 0.0975 0.4619 0.4157

0.3536 0.0975 -0.4619 -0.2778 0.3536 0.4157 -0.1913 -0.4904

0.3536 -0.0975 -0.4619 0.2778 0.3536 -0.4157 -0.1913 0.4904

0.3536 -0.2778 -0.1913 0.4904 -0.3536 -0.0975 0.4619 -0.4157

0.3536 -0.4157 0.1913 0.0975 -0.3536 0.4904 -0.4619 0.2778

0.3536 -0.4904 0.4619 -0.4157 0.3536 -0.2778 0.1913 -0.0975

DCT matrix: dctmtx command from MATLAB (see code on website)

0.3517 -0.4493 -0.4278 0.4230 0.3754 0.3247 -0.2250 -0.1245

0.3534 -0.4366 -0.2276 -0.0110 -0.3078 -0.4746 0.4732 0.2975

0.3543 -0.3101 0.1728 -0.4830 -0.3989 0.0498 -0.4299 -0.4109

0.3546 -0.1115 0.4799 -0.3005 0.3342 0.4102 0.1856 0.4761

0.3547 0.1141 0.4823 0.2944 0.3301 -0.4182 0.1745 -0.4771

0.3543 0.3104 0.1771 0.4834 -0.3977 -0.0322 -0.4308 0.4103

0.3535 0.4357 -0.2319 0.0143 -0.3009 0.4656 0.4851 -0.2975

0.3520 0.4468 -0.4328 -0.4204 0.3686 -0.3253 -0.2342 0.1261

0.3520 -0.4461 -0.4305 0.4224 0.3696 0.3247 0.2342 0.1283

0.3537 -0.4338 -0.2345 -0.0114 -0.3000 -0.4671 -0.4814 -0.3028

0.3545 -0.3086 0.1662 -0.4896 -0.4007 0.0359 0.4261 0.4102

0.3548 -0.1145 0.4763 -0.3031 0.3339 0.4198 -0.1800 -0.4713

0.3548 0.1056 0.4839 0.2926 0.3349 -0.4194 -0.1766 0.4733

0.3543 0.3043 0.1863 0.4833 -0.4028 -0.0354 0.4269 -0.4097

0.3532 0.4389 -0.2269 0.0180 -0.3008 0.4654 -0.4811 0.3037

0.3512 0.4562 -0.4300 -0.4126 0.3694 -0.3242 0.2335 -0.1319

VC: Eigenvectors of column-column correlation matrix

VR: Eigenvectors of row-row correlation matrix

1.0000 0.0007 0.0032 0.0002 0.0013 0.0001 0.0005 0.0000

0.0007 0.9970 0.0097 0.0689 0.0009 0.0322 0.0003 0.0110

0.0033 0.0106 0.9968 0.0118 0.0713 0.0004 0.0334 0.0025

0.0002 0.0718 0.0124 0.9926 0.0007 0.0927 0.0017 0.0276

0.0010 0.0001 0.0737 0.0004 0.9942 0.0008 0.0780 0.0010

0.0000 0.0261 0.0015 0.0962 0.0005 0.9934 0.0011 0.0569

0.0003 0.0007 0.0276 0.0021 0.0802 0.0010 0.9964 0.0013

0.0000 0.0076 0.0026 0.0227 0.0012 0.0596 0.0015 0.9979

1.0000 0.0002 0.0029 0.0001 0.0010 0.0000 0.0004 0.0000

0.0002 0.9965 0.0028 0.0766 0.0005 0.0314 0.0009 0.0107

0.0029 0.0025 0.9969 0.0046 0.0728 0.0017 0.0304 0.0013

0.0001 0.0795 0.0044 0.9923 0.0029 0.0916 0.0015 0.0243

0.0008 0.0003 0.0747 0.0026 0.9948 0.0061 0.0696 0.0004

0.0000 0.0246 0.0021 0.0949 0.0069 0.9940 0.0131 0.0452

0.0003 0.0004 0.0252 0.0003 0.0715 0.0137 0.9970 0.0002

0.0000 0.0076 0.0013 0.0207 0.0001 0.0476 0.0009 0.9986

Absolute value of dot products between the columns of DCT matrix and columns of VR (left) and VC (right)

58 of 206


DCT bases

64 columns of V – each reshaped to form an 8 x 8 image, and rescaled to fit in the 0-1 range. Notice the similarity between the DCT bases and the columns of V. Again, V is the Kronecker product of VR and VC.

59 of 206

DCT and PCA

  • DCT is very close to PCA when the patches come from what is called a stationary first-order Markov process, i.e. one whose (normalized) correlation matrix has entries of the form C(i, j) = ρ^|i−j|, with 0 < ρ < 1.


60 of 206

DCT and PCA

  • One can show that the eigenvectors of the covariance matrix of the form seen on the previous slide are the DCT basis vectors!

  • Natural images approximate this first order Markov model, and hence DCT is almost as good as PCA for compression of a large ensemble of image patches.

  • DCT has the advantage of being a universal basis and also the DCT coefficients are more efficiently computable than PCA coefficients (because DCT computation uses FFT).
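This claim is easy to check numerically: build the Markov-model correlation matrix C(i, j) = ρ^|i−j|, and compare its eigenvectors with the DCT basis vectors. A sketch assuming NumPy (n and ρ are illustrative; the DCT matrix is built by hand in the same convention as MATLAB's dctmtx):

```python
import numpy as np

n, rho = 8, 0.95
# Correlation matrix of a stationary first-order Markov process: C_ij = rho^|i-j|
C = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Orthonormal DCT-II matrix (rows are the DCT basis vectors, as in dctmtx)
i, j = np.arange(n)[:, None], np.arange(n)[None, :]
D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * i / (2 * n))
D[0, :] = 1.0 / np.sqrt(n)

evals, V = np.linalg.eigh(C)
V = V[:, np.argsort(evals)[::-1]]       # eigenvectors, largest eigenvalue first

M = np.abs(D @ V)                       # |dot products| between DCT rows and eigenvectors
```

The diagonal of `M` is close to 1: each eigenvector nearly coincides (up to sign) with one DCT basis vector, mirroring the dot-product tables shown earlier.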


61 of 206


1.0000 0.9902 0.9795 0.9733 0.9682 0.9639 0.9604 0.9570

0.9902 1.0005 0.9908 0.9795 0.9734 0.9684 0.9643 0.9605

0.9795 0.9908 1.0010 0.9908 0.9796 0.9735 0.9689 0.9646

0.9733 0.9795 0.9908 1.0005 0.9904 0.9793 0.9735 0.9686

0.9682 0.9734 0.9796 0.9904 1.0004 0.9903 0.9794 0.9734

0.9639 0.9684 0.9735 0.9793 0.9903 1.0001 0.9903 0.9793

0.9604 0.9643 0.9689 0.9735 0.9794 0.9903 1.0004 0.9904

0.9570 0.9605 0.9646 0.9686 0.9734 0.9793 0.9904 1.0002

1.0000 0.9888 0.9770 0.9704 0.9648 0.9599 0.9554 0.9510

0.9888 1.0004 0.9891 0.9768 0.9703 0.9646 0.9596 0.9548

0.9770 0.9891 1.0004 0.9886 0.9764 0.9698 0.9640 0.9587

0.9704 0.9768 0.9886 0.9994 0.9878 0.9755 0.9687 0.9627

0.9648 0.9703 0.9764 0.9878 0.9986 0.9870 0.9746 0.9676

0.9599 0.9646 0.9698 0.9755 0.9870 0.9978 0.9861 0.9734

0.9554 0.9596 0.9640 0.9687 0.9746 0.9861 0.9967 0.9847

0.9510 0.9548 0.9587 0.9627 0.9676 0.9734 0.9847 0.9951

CR/CR(1,1) is the first matrix above. Notice it can be approximated by the form shown two slides before, with ρ ~ 0.99.

CC/CC(1,1) is the second matrix above. Notice it can be approximated by the form shown two slides before, with ρ ~ 0.9888.

More results from the previous experiment. See code here.

62 of 206

Non-negative matrix factorization (NMF) and non-negative sparse coding (NNSC)


63 of 206

NMF

  • Consider a matrix Y ∈ R^(n x N), each column of which is a vectorised patch of n elements, i.e. Y = [y1 | y2 | … | yN].

  • Then NMF seeks to factorize Y into the product of matrices W (size n x r) and H (size r x N) which are both non-negative, such that Y ≈ WH.

  • Note: non-negativity means that every element of W, H must be non-negative.


64 of 206

NMF

  • The columns of W are basis vectors and the values in the i-th column of H are coefficients.

  • We seek to approximate each vector yi with a linear combination of the columns of W.

  • Usually, r is chosen to be smaller than n or N, which gives good model compression.

  • The speciality of NMF is that it allows only additive combinations of the columns of W to approximate Y.

  • This is in tune with the intuition of combining different parts to form a whole.


65 of 206

PCA versus NMF bases


Source:

Lee and Seung, “Learning the parts of objects by non-negative matrix factorization”,

http://lsa.colorado.edu/LexicalSemantics/seung-nonneg-matrix.pdf

NMF bases are non-negative unlike eigenfaces. In that sense, each basis image from NMF is more like a model face than from the eigenfaces.

In practice they do not appear as sparse as seen in this figure, but more like what you will see on the next slide.

66 of 206


Source:

Wang et al, NMF framework for face recognition,

http://www.cs.ucsb.edu/~mturk/pubs/IJPRAI05.pdf

67 of 206

NMF: Objective function

  • The objective function to be minimized is: E(W, H) = ||Y − WH||²F, subject to W ≥ 0, H ≥ 0.

  • Method of alternating optimization using projected gradient descent with adaptive step size: alternately update W ← [W − ηW ∇W E]+ and H ← [H − ηH ∇H E]+, where [·]+ sets negative entries to zero, and the step sizes ηW, ηH are adapted to ensure decrease of E.


68 of 206

NMF: Multiplicative updates

  • The objective function to be minimized is: E(W, H) = ||Y − WH||²F, subject to W ≥ 0, H ≥ 0.

  • However there are also multiplicative updates of the following form: H ← H ⊙ (WᵀY) ⊘ (WᵀWH), W ← W ⊙ (YHᵀ) ⊘ (WHHᵀ), where ⊙ and ⊘ denote element-wise multiplication and division.


69 of 206

NMF: Multiplicative updates

  • The multiplicative updates maintain the non-negativity of W and H and converge to a stationary point of E(W,H).

  • The multiplicative updates also guarantee decrease of the cost function. They are based on majorization-minimization.
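A minimal NumPy sketch of the multiplicative updates on synthetic data (the small `eps` guard against division by zero is our addition; dimensions are illustrative). Non-negativity is preserved automatically because every factor in the updates is non-negative, and the squared error is non-increasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, r = 10, 30, 4
Y = rng.random((n, N))                  # non-negative data, one patch per column

W = rng.random((n, r)) + 0.1            # strictly positive initialization
H = rng.random((r, N)) + 0.1
eps = 1e-12                             # guard against division by zero

errs = []
for _ in range(200):
    H *= (W.T @ Y) / (W.T @ W @ H + eps)    # multiplicative update for H
    W *= (Y @ H.T) / (W @ H @ H.T + eps)    # multiplicative update for W
    errs.append(np.linalg.norm(Y - W @ H) ** 2)
```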


70 of 206

Non-negative sparse coding

  • An extension to NMF with sparseness penalty imposed on the coefficients.
  • The objective function to be minimized is: E(W, H) = ||Y − WH||²F + λ Σ over r,l of Hrl, subject to W ≥ 0, H ≥ 0.

  • Need to be careful to avoid multiple (scaling-related) solutions – also impose a unit norm constraint on all columns of W.


71 of 206

Non-negative sparse coding

  • Algorithm:


72 of 206

Results


73 of 206

NNSC and Poisson Data

  • The noise affecting real-world images is known to follow the Poisson distribution.
  • The Poisson distribution is given as follows: P(X = i) = e^(−λ) λ^i / i!

  • For such a distribution, the mean is equal to the variance.


X is a Poisson random variable, i is a non-negative integer, λ is the mean of the Poisson distribution

74 of 206

NNSC and Poisson Data


  • Consider an image I. A Poisson-corrupted version of this image is obtained as Ji ~ Poisson(Ii), where Ii is the average intensity (photon flux) at pixel i.
  • Note that the noisy intensity at any pixel location i is always a non-negative integer, drawn from a Poisson distribution having mean (hence variance) Ii.

75 of 206


Image quality at 9 different average rates of photon emission.

76 of 206

NNSC and Poisson denoising

  • NNSC lends itself naturally to Poisson denoising due to the non-negativity constraints – on measurements as well as data (mean of Poisson = photon rate).
  • Consider Y ~ Poisson(WH), i.e. ykl ~ Poisson([WH]kl), 1 <= k <= n, 1 <= l <= N. We have [WH]kl = [Whl]k.
  • The objective function to be minimized is as follows: E(W, H) = Σ over k,l of ([Whl]k − ykl log [Whl]k) + λ Σ over r,l of Hrl, subject to W ≥ 0, H ≥ 0.

The first term is the negative log of the non-negative Poisson likelihood, where ykl is a noisy pixel; the second term imposes sparsity on the coefficients.

77 of 206


78 of 206


79 of 206

A comment on image processing under Poisson noise

  • The relative error due to Poisson noise is equal to standard deviation divided by mean = (λ)-0.5.

  • At low photon counts, i.e. low λ, the noise variance is lower, but its relative effect is larger, and hence the problem is more challenging!

  • This is unlike the signal independent additive iid Gaussian noise!
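The λ^(−0.5) relative-error law can be seen empirically (a NumPy sketch; the λ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for lam in [4.0, 100.0, 10000.0]:
    x = rng.poisson(lam, size=200000)   # simulated photon counts at rate lam
    rel = x.std() / x.mean()            # empirical relative error = std / mean
    print(lam, round(rel, 3), round(lam ** -0.5, 3))
```

As λ grows by a factor of 100, the relative error shrinks by a factor of 10, exactly as λ^(−0.5) predicts.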


80 of 206

Learning Overcomplete Dictionaries


81 of 206

Introduction: Complete and over-complete dictionaries

  • Signals are often represented as a linear combination of basis functions (e.g. Fourier or wavelet representation).

  • The basis functions always have the same dimensionality as the (discrete) signals they represent.

  • The number of basis vectors is traditionally the same as the dimensionality of the signals they represent.

  • These bases may be orthonormal (Fourier, wavelet, PCA) or may not be orthonormal (in fact any full rank matrix can be used to represent signals – albeit not compactly).


82 of 206

Introduction: Complete and over-complete bases

  • A more general representation for signals uses so called “over-complete dictionaries”, where the number of basis functions is MORE than the dimensionality of the signals.
  • Complete and over-complete bases:


83 of 206

Introduction: Construction of over-complete bases

  • Over-complete bases can be created by taking a union of multiple complete bases.
  • Example 1: A signal with n values can be represented using a union of the n x n Fourier and n x n Haar wavelet bases, yielding an n x 2n basis matrix.


84 of 206

Introduction: uniqueness? ☹

  • With complete bases, the representation of the signal is always unique.
  • Example: Signals are uniquely defined by their wavelet or Fourier transforms, the eigen-coefficients of any signal (given PCA basis) are uniquely defined.
  • This uniqueness is LOST with an over-complete basis.


85 of 206

Introduction: compactness! ☺

  • Advantage: over-complete bases afford much greater compactness in signal representation.
  • Example: Consider two types of audio-signals – whistles and claps. Signals of either type can be represented in a complete Fourier or wavelet basis (power-law of compressibility will apply).
  • BUT: imagine two complete bases respectively learned for whistles and claps – B1 and B2.


86 of 206

Introduction: compactness! ☺

  • Suppose B1 and B2 are such that a whistle (resp. clap) signal will likely have a compact representation in the whistle (resp. clap) basis and not in the other one.
  • A whistle+clap signal will NOT have a compact representation in either basis - B1 or B2 !
  • But the whistle+clap signal WILL have a compact representation in the overcomplete basis B = [B1 B2].
  • Another (and simpler) example: cosines and spikes!


87 of 206

More problems ☹

  • Since a signal can have many representations in an over-complete basis, which one do we pick?
  • Pick the sparsest one, i.e. the one with least number of non-zero elements which either perfectly reconstructs the signal, or reconstructs the signal up to some error.
  • Finding the sparsest representation for a signal in an over-complete basis is an NP-hard problem (i.e. the best known algorithm for this task has exponential time complexity ☹)


88 of 206

More problems ☹

  • In other words, the following problem is NP-hard: minimize ||s||0 subject to ||y − As||2 ≤ ε (or, in the noiseless case, y = As).


89 of 206

Solving (?) those problems

  • The NP-hard problem has several methods for approximation – basis pursuit (BP), matching pursuit (MP), orthogonal matching pursuit (OMP) and many others.
  • None of them will give you the sparsest solution always – but they will yield the sparsest solutions under certain conditions on the overcomplete dictionary A.
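As an illustration, here is a bare-bones OMP sketch in NumPy (our own toy implementation, not the course code): greedily pick the atom most correlated with the residual, then refit the coefficients on the chosen support by least squares. On a well-conditioned random dictionary it usually recovers a sparse code, but, as the slide says, there is no guarantee in general:

```python
import numpy as np

def omp(A, y, k):
    """Greedy sketch of orthogonal matching pursuit with k iterations."""
    residual, support = y.astype(float).copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # best-matching atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef          # refit, update residual
    s = np.zeros(A.shape[1])
    s[support] = coef
    return s

rng = np.random.default_rng(0)
n, m = 20, 40                             # overcomplete: more atoms than dimensions
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)            # unit-norm atoms
s_true = np.zeros(m); s_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = A @ s_true                            # a 3-sparse signal
s_hat = omp(A, y, 3)
```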


90 of 206

Learning the dictionaries

  • So far we assumed that the basis (i.e. A) was fixed, and optimized for the sparse representation.
  • Now, we need to learn A as well – along with the sparse codes!
  • We’ve learned about PCA – but it gives an orthonormal basis.


91 of 206

Learning the dictionary: analogy with K-means

  • In K-means, we start with a bunch of K cluster centers and assign each point in the dataset to the nearest cluster center.
  • The cluster centers are re-computed by taking the mean of all points assigned to a cluster.
  • The assignment and cluster-center computation problems are iterated until a convergence criterion is met.


92 of 206

Learning the bases: analogy with K-means

  • K-means is a special sparse coding problem where each point is represented by strictly one of K dictionary elements.
  • Our dictionary (or basis) learning problem is more complex: we are trying to express each point as a linear combination of a subset of dictionary elements (or a sparse linear combination of dictionary elements).


93 of 206

Learning the bases: Method 1 (Olshausen and Field, 1996)


The dictionary and the sparse codes are estimated by MAP inference: maximize the posterior P(A, {sk} | {yk}) ∝ Πk P(yk | A, sk) P(sk) P(A), with a uniform prior on A, a Gaussian likelihood P(yk | A, sk) ∝ exp(−||yk − A sk||² / (2σ²)) (i.e. assuming Gaussian noise), and a Laplacian prior on sk, P(sk) ∝ exp(−β ||sk||1), which promotes sparse codes (given any A).

94 of 206

Two-step iterative procedure

  • Fix the basis A and obtain the sparse coefficients for each signal using MP, OMP or BP (some papers – like the one by Olshausen and Field - use gradient descent for this step!).
  • Now fix the coefficients, and update the basis vectors (using various techniques, one of which was described on the previous slide).
  • Normalize each basis vector to unit norm (so we may use projected gradient descent).
  • Repeat the previous two steps until some error criterion is met.


95 of 206

Toy Experiment 1


Result of basis learning (dictionary with 144 elements) with sparsity constraints on the codes. Training performed on 12 x 12 patches extracted from natural images.

It is conjectured that the simple cells of the V1 area of the visual cortex in the brain have responses similar in shape to these learned dictionary elements.

Ref: Olshausen and Field, “Natural image statistics and efficient coding”

96 of 206


These are the results one obtains upon performing PCA on the 12 x 12 patches.

Ref: Olshausen and Field, “Natural image statistics and efficient coding”

97 of 206

Toy Experiment 2

  • Data-points are generated as a random linear combination of some arbitrarily chosen basis vectors, with the coefficients drawn from a (super-)Laplacian distribution.
  • In the no-noise situation, the aim is to extract the basis vectors and the coefficients of the linear combination.
  • The fitting results are shown on the next slide. The true and estimated directions agree quite well.

97

Ref: Lewicki and Sejnowski, “Learning overcomplete representations”

98 of 206

98

Ref: Lewicki and Sejnowski, “Learning overcomplete representations”

99 of 206

99

Ref: Lewicki and Sejnowski, “Learning overcomplete representations”

100 of 206

Learning the Bases: Method of Optimal Directions (MOD) - Method 2

  • Given a fixed dictionary A, assume sparse codes for every signal are computed using OMP, MP etc.
  • The overall error for dictionary update is now given as

  • We want to find dictionary A that minimizes this error (assuming fixed sparse codes).

100

Ref: Engan et al, “Method of optimal directions for frame design”

101 of 206

Learning the Bases: Method of Optimal Directions (MOD) - Method 2

  • Take the derivative of E(A) w.r.t. A and set it to 0. This gives us the following update:

  • Following the update of A, each column in A is independently rescaled to unit norm.
  • The updates of A and S alternate with each other till some convergence criterion is reached.
  • This method is more efficient than the one by Olshausen and Field.
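One MOD update can be sketched directly from the closed form A = Y S^T (S S^T)^(-1) followed by column rescaling. A minimal numpy sketch with illustrative toy data:

```python
import numpy as np

def mod_update(Y, S, eps=1e-10):
    """MOD dictionary update: least-squares fit A = Y S^T (S S^T)^{-1},
    followed by rescaling each atom (column) to unit norm."""
    A = Y @ S.T @ np.linalg.pinv(S @ S.T)
    A /= (np.linalg.norm(A, axis=0) + eps)
    return A

# sanity check: with the true codes and no noise, one MOD step
# recovers the generating dictionary exactly
rng = np.random.default_rng(1)
A_true = rng.standard_normal((15, 10)); A_true /= np.linalg.norm(A_true, axis=0)
S = rng.standard_normal((10, 100)) * (rng.random((10, 100)) < 0.3)
Y = A_true @ S
A_est = mod_update(Y, S)
```

In the alternating scheme, this update would be interleaved with a sparse coding step (OMP, MP, etc.) until convergence.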

101

102 of 206

Learning the Bases: Method 3- Union of Orthonormal Bases

  • Like before, we represent a signal in the following way:

  • A is an over-complete dictionary, but let us assume that it is a union of ortho-normal bases, in the form

102

103 of 206

Learning the Bases: Method 3- Union of Ortho-normal Bases

  • The coefficient matrix S can now be written as follows (M subsets, each corresponding to a single orthonormal basis):

  • Assuming we have fixed bases stored in A, the coefficients in S can be estimated using block coordinate descent, described on the following slide.

103

104 of 206

Learning the Bases: Method 3- Union of Ortho-normal Bases

104

There is a quick way of performing this optimization given an orthonormal basis – SOFT THRESHOLDING (which could be replaced by hard thresholding if you had a stronger sparseness prior than an L1 norm).
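A sketch of why this is quick: with an orthonormal basis the L1-regularized least-squares problem decouples coordinate-wise, and each coordinate is solved by soft thresholding. The 0.5 weighting of the squared error below is one common convention (with other weightings the threshold rescales accordingly):

```python
import numpy as np

def soft(z, t):
    """Soft thresholding: the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coeffs_orthonormal(Q, x, lam):
    """For an orthonormal Q, argmin_s 0.5*||x - Q s||^2 + lam*||s||_1
    decouples coordinate-wise, giving s = soft(Q^T x, lam)."""
    return soft(Q.T @ x, lam)

# toy usage with a random orthonormal basis
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
x = rng.standard_normal(8)
s = coeffs_orthonormal(Q, x, lam=0.3)
z = Q.T @ x          # the transform coefficients being thresholded
```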

105 of 206

Learning the Bases: Method 3- Union of Ortho-normal Bases

  • Given the coefficients, we now want to update the dictionary which is done as follows:

105

Why are we doing this? It is related to the so-called orthogonal Procrustes problem - a well-known application of SVD. We will see this on the next slide.

The specific problem we are solving is given below. Note that it cannot be solved using a pseudo-inverse, as that would not impose the orthonormality constraint when there is noise in the data or when the coefficients have been perturbed or thresholded.
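A numpy sketch of the orthogonal Procrustes solution via SVD, assuming the problem is min_Q ||Y - Q S||_F subject to Q^T Q = I; the noisy toy data is illustrative:

```python
import numpy as np

def procrustes(Y, S):
    """Orthogonal Procrustes: argmin_Q ||Y - Q S||_F over orthonormal Q
    is Q = U V^T, where U Sigma V^T is the SVD of Y S^T."""
    U, _, Vt = np.linalg.svd(Y @ S.T)
    return U @ Vt

rng = np.random.default_rng(3)
Q_true, _ = np.linalg.qr(rng.standard_normal((6, 6)))
S = rng.standard_normal((6, 50))
Y = Q_true @ S + 0.01 * rng.standard_normal((6, 50))   # noisy observations
Q_est = procrustes(Y, S)
```

Note that the estimate is exactly orthonormal by construction, which a pseudo-inverse fit would not guarantee.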

106 of 206

106

107 of 206

Learning the Bases: Method 3- Union of Ortho-normal Bases

  • Keeping all bases in A fixed, update the coefficients in S using a known sparse coding technique.
  • Keeping the coefficients in S fixed, update the bases in A using the aforementioned SVD-based method.
  • Repeat the above two steps until a convergence criterion is reached.

107

108 of 206

Learning the Bases: Method 4 – K-SVD

  • Recall: we want to learn a dictionary and sparse codes on that dictionary given some data-points:

  • Starting with a fixed dictionary, sparse coding follows as usual – OMP, BP, MP etc. The criterion could be based on reconstruction error or L0-norm of the sparse codes.
  • The dictionary is updated one column at a time.

108

109 of 206

109

Row ‘k’ (NOT column) of matrix S

Does NOT depend on the k-th dictionary column

How to find ak, given the above expression? We have decomposed the original error matrix, i.e. Y-AS, into a sum of rank-1 matrices, out of which only the last term depends on ak. So we are trying to find a rank-1 approximation for Ek, and this can be done by computing the SVD of Ek, and using the singular vectors corresponding to the largest singular value.

110 of 206

110

Problem! The dictionary codes may no longer be sparse! SVD has no built-in sparsity constraint! So, we proceed as follows:

Consider the error matrix defined as follows:

Considers only those columns of Y (i.e. only those data-points) that actually USE the k-th dictionary atom, effectively yielding a smaller matrix, of size p by |ωk|

This update affects the sparse codes of only those data-points that used the k-th dictionary element, i.e. those that are part of ωk.
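The restricted update of a single atom can be sketched as follows. This is a toy numpy sketch of the K-SVD step described above (the helper name ksvd_atom_update is mine): it updates atom k and the nonzero entries of row k of S from the rank-1 SVD of the restricted error matrix, so the supports of the sparse codes are preserved.

```python
import numpy as np

def ksvd_atom_update(Y, A, S, k):
    """K-SVD update for atom k: form the restricted error matrix E_k over
    the signals that actually use atom k, take its best rank-1 approximation
    via SVD, and update both the atom and the nonzero entries of row k of S."""
    omega = np.nonzero(S[k, :])[0]          # signals that use atom k
    if omega.size == 0:
        return A, S
    S_k = S.copy(); S_k[k, :] = 0.0         # contribution of all other atoms
    E = Y[:, omega] - A @ S_k[:, omega]     # error without atom k, restricted
    U, sig, Vt = np.linalg.svd(E, full_matrices=False)
    A = A.copy(); S = S.copy()
    A[:, k] = U[:, 0]                       # new unit-norm atom
    S[k, omega] = sig[0] * Vt[0, :]         # updated codes, same support
    return A, S

# toy check: the update never increases the residual ||Y - A S||_F
rng = np.random.default_rng(4)
A = rng.standard_normal((12, 8)); A /= np.linalg.norm(A, axis=0)
S = rng.standard_normal((8, 40)) * (rng.random((8, 40)) < 0.25)
Y = A @ S + 0.1 * rng.standard_normal((12, 40))
err_before = np.linalg.norm(Y - A @ S)
A2, S2 = ksvd_atom_update(Y, A, S, k=3)
err_after = np.linalg.norm(Y - A2 @ S2)
```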

111 of 206

111

112 of 206

KSVD and K-means

  • Limit the sparsity factor T0 = 1.

  • Enforce all the sparse codes to be either 0 or 1.

  • Then you get the K-means algorithm! ☺

112

113 of 206

Implementation issues

  • KSVD is a popular and effective method. But some implementation issues haunt K-SVD (life is never easy ☺).

  • KSVD is susceptible to local minima and over-fitting if K is too large (just as you can get meaningless clusters if your number of clusters is too large, or meaningless density estimates if the number of histogram bins is too large).

  • KSVD convergence is not fully guaranteed. The dictionary update given fixed sparse codes ensures that the error decreases. However, the sparse coding step given a fixed dictionary may not decrease the error – it depends on the behaviour of the approximate sparse coding algorithms (OMP, MP, etc.).
  • You can speed up the algorithm by removing extremely “unpopular” dictionary elements, or removing duplicate (or near-duplicate) columns of the dictionary.
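The cleanup heuristics in the last bullet can be sketched as below. This is an assumed, minimal version (the function name and thresholds are mine): atoms that are rarely used, or nearly duplicate another atom, are replaced by the currently worst-represented data point, normalized to unit norm.

```python
import numpy as np

def prune_atoms(A, S, Y, min_uses=2, max_coherence=0.99):
    """Cleanup heuristic between K-SVD iterations: replace 'unpopular'
    atoms (used by fewer than min_uses signals) and near-duplicate atoms
    (|<a_j, a_k>| > max_coherence) with the worst-represented data point."""
    A = A.copy()
    residuals = np.linalg.norm(Y - A @ S, axis=0)    # per-signal error
    G = np.abs(A.T @ A) - np.eye(A.shape[1])         # pairwise coherences
    for k in range(A.shape[1]):
        uses = np.count_nonzero(S[k, :])
        if uses < min_uses or G[k].max() > max_coherence:
            j = int(np.argmax(residuals))            # worst-represented signal
            A[:, k] = Y[:, j] / (np.linalg.norm(Y[:, j]) + 1e-12)
            residuals[j] = 0.0                       # don't reuse this signal
            G = np.abs(A.T @ A) - np.eye(A.shape[1])
    return A

# toy usage: plant a duplicate atom and prune it away
rng = np.random.default_rng(5)
A = rng.standard_normal((10, 5)); A /= np.linalg.norm(A, axis=0)
A[:, 1] = A[:, 0]                                    # duplicate atom
S = rng.standard_normal((5, 30)) * (rng.random((5, 30)) < 0.5)
Y = rng.standard_normal((10, 30))
A_clean = prune_atoms(A, S, Y)
```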

113

114 of 206

(Many) Applications of KSVD

  • Image denoising
  • Image inpainting
  • Image deblurring
  • Blind compressive sensing
  • Classification
  • Compression
  • And you can work on many more ☺

114

115 of 206

Application: Image Compression

  • Recall JPEG algorithm
  • It divides the image to be compressed into non-overlapping patches of size 8 x 8.
  • It computes the 2D-DCT coefficients of each patch, quantizes them and stores only the non-zero coefficients – using Huffman encoding.
  • KSVD can be used in image compression by replacing the orthonormal 2D-DCT dictionary by an overcomplete dictionary trained on a database of similar images.

115

116 of 206

Application: Image Compression (Training Phase)

  • Training set for dictionary learning: a set of 11000 patches of size 8 x 8 – taken from a face image database. Dictionary size K = 441 atoms (elements).
  • OMP used in the sparse coding step during training – stopping criterion is a fixed number of coefficients T0 = 10.
  • Over-complete Haar and DCT dictionaries – of size 64 x 441 – and ortho-normal DCT basis of size 64 x 64 (JPEG), also used for comparison.

116

117 of 206

117

118 of 206

Application: Image Compression (Testing Phase)

  • A lossy image compression algorithm is evaluated using a rate-distortion curve – the X-axis shows the average number of bits per pixel (BPP) needed to store the signal.
  • The Y-axis is the compression error or PSNR. Normally, an acceptable error e is fixed and the number of bits is calculated.
  • The test image is divided into non-overlapping patches of size 8 x 8.
  • Each patch is projected onto the trained dictionary and its sparse code is obtained using OMP given a fixed error e.

118

119 of 206

Application: Image Compression (Testing Phase)

  • The encoded image contains the following:
  • Sparse-code coefficients for each patch and the indices of each non-zero coefficient (in the dictionary).
  • The number of coefficients used to represent each patch (different patches will need different numbers of coefficients).

119

120 of 206

Application: Image Compression (Testing Phase)

  • The average number of bits per pixel (BPP) is calculated as:

120

Number of bits required to store the number of coefficients for each patch

Number of bits required to store the dictionary index for each coefficient

Number of bits required to code each coefficient (quantization level). The quantization makes this a lossy compression algorithm.

Sum total of the number of coefficients representing each patch

Huffman encoding

121 of 206

121

Even if the error ‘e’ for the OMP was fixed, we need to compute the total MSE between the true and the compressed image. This is due to effects of quantization while storing the sparse coefficient values for each patch.

122 of 206

122

123 of 206

  • KSVD for denoising seeks to minimize the following objective function:

123

Y = noisy image

X = underlying clean image (to be estimated)

αij = sparse dictionary coefficients for patch at location (i,j)

D = dictionary

Rij = matrix that extracts the patch xij from image X, i.e.

xij = Rij X

124 of 206

Application: Image Denoising

  • Note: The dictionary may be learned a priori from a corpus of image patches. The patches from the noisy image can then be denoised by mere sparse coding.

  • The preferable method is to train the dictionary directly on the noisy image, in tandem with the sparse coding step (as on the previous slide).

  • This avoids having to depend on the training set and allows for tuning of the dictionary to the underlying image structure (as opposed to the structure of some other images).

124

125 of 206

KSVD Algorithm for Denoising (Dictionary learned on the noisy image)

  • Set X = Y, D = overcomplete DCT
  • Until some “convergence” criterion is satisfied, repeat the following:
  • Obtain the sparse codes for every patch (typically using OMP) as follows:

  1. Perform the dictionary learning update typical for KSVD.
  2. Estimate the final image X by averaging the reconstructed overlapping patches, OR estimate X given D and αij:
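The final estimation of X (step 2) reduces to a per-pixel weighted average, because the matrix being inverted is diagonal. Below is a toy numpy sketch: lam is an assumed weight on fidelity to the noisy image, and the "denoised" patches passed in are stand-ins for the reconstructions D alpha_ij.

```python
import numpy as np

def average_patches(Y, patches, p, lam=0.5):
    """Closed-form final step of K-SVD denoising:
    X = (lam*I + sum R_ij^T R_ij)^{-1} (lam*Y + sum R_ij^T D alpha_ij).
    The matrix in brackets is diagonal, so this is a per-pixel weighted
    average of the noisy image and all denoised patches covering the pixel."""
    H, W = Y.shape
    num = lam * Y.astype(float)
    den = lam * np.ones_like(Y, dtype=float)
    idx = 0
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            num[i:i+p, j:j+p] += patches[idx].reshape(p, p)
            den[i:i+p, j:j+p] += 1.0
            idx += 1
    return num / den

# toy usage: here the "denoised" patches are simply the clean patches
rng = np.random.default_rng(6)
X_clean = rng.random((8, 8))
Y = X_clean + 0.2 * rng.standard_normal((8, 8))
p = 4
patches = [X_clean[i:i+p, j:j+p].ravel()
           for i in range(8 - p + 1) for j in range(8 - p + 1)]
X_hat = average_patches(Y, patches, p)
```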

125

126 of 206

KSVD Algorithm for Denoising (Dictionary learned on the noisy image)

126

This equation expresses, in a mathematically rigorous way, how X is reconstructed: each pixel is a weighted average of its values in the overlapping denoised patches and in the noisy image.

127 of 206

127

Baseline for comparison: method by Portilla et al, “Image denoising using scale mixture of Gaussians in the wavelet domain”, IEEE TIP 2003.

128 of 206

128

129 of 206

129

130 of 206

130

131 of 206

Application of KSVD: Filling in Missing Pixels

131

132 of 206

Application of KSVD: Filling in Missing Pixels

  • An over-complete dictionary is trained a priori on a set of face images.
  • A test face image (not part of the training set) is synthetically degraded by masking out 50 to 70 percent of the pixels.
  • Patches from the degraded image are sparse coded by projection onto the trained dictionary using OMP.
  • OMP is modified so that only the non-degraded pixels are considered during any error computation (the dictionary elements are therefore appropriately re-scaled to unit norm).
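A sketch of the masked OMP idea in the last bullet (a simplified toy version, not the paper's code): correlations and least-squares fits are computed only over the observed pixels, the restricted atoms are renormalized so atom selection is not biased by how many of an atom's pixels survive, and the full unmasked atoms are used for the final reconstruction.

```python
import numpy as np

def masked_omp(D, y, mask, T0):
    """OMP restricted to observed pixels (mask == True)."""
    Dm = D[mask, :]                               # atoms restricted to mask
    norms = np.linalg.norm(Dm, axis=0) + 1e-12    # rescale restricted atoms
    r = y[mask].astype(float)
    support = []
    for _ in range(T0):
        k = int(np.argmax(np.abs(Dm.T @ r) / norms))   # best-matching atom
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(Dm[:, support], y[mask], rcond=None)
        r = y[mask] - Dm[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return D @ x          # reconstruct with the full, unmasked atoms

# toy check: a patch that is exactly 2-sparse in D, with ~40% pixels missing
rng = np.random.default_rng(7)
D = rng.standard_normal((25, 40)); D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(40); x_true[[3, 17]] = [1.5, -2.0]
y = D @ x_true
mask = rng.random(25) < 0.6
rec = masked_omp(D, y * mask, mask, T0=2)
```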

132

133 of 206

Application of KSVD: Filling in Missing Pixels

133

134 of 206

Application of KSVD: Filling in Missing Pixels

  • The patches are reconstructed in the following manner:

134

NOTE: Dictionary elements without masking!!

135 of 206

KSVD Application: Coded Exposure Image

135

It is a coded superposition (i.e. summation) of T sub-frames within a unit integration time of the video camera.

Conventional capture

(simple integration across time, without modulation by binary codes)

Coded exposure image (captured in one unit integration time of the camera)

Binary code at time instant t

Sub-frame at time instant t

136 of 206

KSVD Application: Sparse Coding in Video CS

136

y is a vectorized form of the snapshot image I of n pixels.

f is a vectorized form of the underlying video with T frames and each frame having n pixels, hence f has nT pixels in total.

Ψ is a 3D DCT basis – the basis can also be “learned offline on a representative set of videos”

Φ is a matrix of size n x nT containing values from the binary code S. See below:

Diag(St) represents an n x n diagonal matrix. The diagonal contains the elements of the mask St.
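The structure of Φ can be constructed directly - a toy numpy sketch with illustrative sizes:

```python
import numpy as np

def build_phi(codes):
    """Stack the per-sub-frame binary masks as diagonal blocks:
    Phi = [Diag(S_1) | Diag(S_2) | ... | Diag(S_T)], of size n x nT,
    so that y = Phi f sums the coded sub-frames into one snapshot."""
    T, n = codes.shape
    return np.hstack([np.diag(codes[t]) for t in range(T)])

# toy check: the coded snapshot equals the mask-weighted sum of sub-frames
rng = np.random.default_rng(8)
T, n = 4, 6
codes = (rng.random((T, n)) < 0.5).astype(float)   # binary codes S_t
frames = rng.random((T, n))                        # vectorized sub-frames
Phi = build_phi(codes)
f = frames.reshape(T * n)                          # stacked video vector
y = Phi @ f                                        # coded exposure snapshot
```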

137 of 206

Dictionary Learning

  • Done offline – training set was 20 video sequences, each video rotated in 8 directions and played forward + backward = 320 videos.
  • All videos were at the target frame rate (around 2000 fps: we work with a 60 fps camera and want a roughly 36-fold gain).
  • Video-patch size was 7 x 7 x 36 = 1764 x 1
  • Offline learning: KSVD (*), K = 100,000 atoms
  • Sparse coding done online (using OMP)

137

KSVD is a dictionary learning technique. Given a set of N input patches arranged as an n x N matrix, it learns a set of K vectors (an n x K matrix) such that a sparse linear combination of these vectors approximates each patch as closely as possible. There are K coefficients per patch, of which most are encouraged to be 0 (sparse).

138 of 206

138

139 of 206

139

Video reconstruction results on real data available below:

https://www.youtube.com/watch?v=JAYC0C3NIdY

  • Ideally, the videos whose patches are used for training must have the TARGET frame rate.
  • Suppose we have a 60 fps video camera and want to increase its temporal resolution to say 60T fps. This would require coded versions of T consecutive sub-frames in the Hitomi camera (cf: slides on CS systems) to be superposed to produce a snapshot.
  • Thus from a single snapshot with n x n pixels, we would like to reconstruct a video of size n x n x T. The dictionary needs to be trained on patches of size m x m x T (m much less than n).
  • Thus ideally, we would require the availability of videos acquired at a frame rate of 60T (or close to it).
  • If you wish to design a system with a different value of T (say T’), you may ideally need to re-train the dictionary on patches taken from videos with appropriate frame rate.

140 of 206

Blind compressed sensing

  • In standard compressed sensing, we assume knowledge of the basis (may be orthonormal or even overcomplete) in which the signal is sparse.
  • However this knowledge may not be available to us always.
  • For example, consider a remote sensing application with compressive acquisitions of water bodies followed by forests over mountain ranges.
  • Images of the former are likely to be highly sparse in DCT whereas the latter may not be – and another basis may serve a better purpose.
  • Blind compressed sensing is the task of inferring the signal basis as well as the signal coefficients directly from the compressed data.

140

141 of 206

Blind compressed sensing

  • Consider N signals and their compressive measurements in the following form:

  • Our sparsity model for each signal is as follows:

  • Combined model is:

141

142 of 206

Blind compressed sensing

  • The objective function to be minimized is:

  • The algorithm for this is alternating in nature: sparse coding assuming a fixed dictionary, and dictionary updates assuming sparse codes.

  • Sparse coding: OMP

142

143 of 206

Blind compressed sensing

  • The dictionary update is as follows:

143

144 of 206

Blind compressed sensing

  • Taking derivative w.r.t. dk and setting to 0 gives:

  • Repeat for other columns iteratively.

144

This matrix will be invertible with probability 1 provided N is such that Nm >= n (for random sensing matrices). Notice that it is important for every point to have a different sensing matrix; otherwise the dictionary cannot be inferred.
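A numpy sketch of this column update (toy sizes; the helper name update_column is mine). Starting from the true dictionary and the true codes, one pass over the columns reproduces the dictionary exactly; it also illustrates why per-point sensing matrices matter - the normal matrix below accumulates rank across the different Φi.

```python
import numpy as np

def update_column(Phis, ys, D, S, k):
    """Closed-form update of dictionary column k in blind CS: set the
    derivative of sum_i ||y_i - Phi_i D s_i||^2 w.r.t. d_k to zero."""
    n = D.shape[0]
    M = np.zeros((n, n))
    b = np.zeros(n)
    for Phi, y, s in zip(Phis, ys, S.T):
        if s[k] == 0.0:
            continue                              # point doesn't use atom k
        s_rest = s.copy(); s_rest[k] = 0.0
        resid = y - Phi @ (D @ s_rest)            # residual excluding atom k
        M += s[k] ** 2 * Phi.T @ Phi
        b += s[k] * Phi.T @ resid
    D = D.copy()
    D[:, k] = np.linalg.solve(M, b)
    return D

# toy check: with true codes and N*m >= n, the columns are recovered exactly
rng = np.random.default_rng(9)
n, K, m, N = 10, 5, 4, 60
D_true = rng.standard_normal((n, K))
S = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.5)
Phis = [rng.standard_normal((m, n)) for _ in range(N)]
ys = [Phi @ D_true @ S[:, i] for i, Phi in enumerate(Phis)]
D = D_true.copy()
for k in range(K):
    D = update_column(Phis, ys, D, S, k)
```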

145 of 206

Blind CS: Pseudocode

  • Start with a random dictionary, infer the sparse codes for each data point.
  • Given the sparse codes, infer each column of the dictionary keeping the others fixed. Infer the corresponding dictionary coefficients as well.
  • Repeat the previous two steps till a convergence criterion is reached.

145

146 of 206

Blind compressed sensing: experiments

  • Synthetic dictionary of size 50 by 10 (K = 10) generated iid from N(0,1).
  • Training signals generated using a linear combination of these dictionary columns, with coefficients generated from N(0,80).
  • Corresponding compressive measurements are generated using a different random projection for each training signal.
  • The dictionary and coefficients are inferred from the compressed data, starting with random initial conditions.

146

147 of 206

Blind compressed sensing: experiments

  • The inferred dictionary columns are compared to the true dictionary columns using dot products.
  • The number of correctly identified (i.e. dot product of more than 0.95) dictionary columns is computed.
  • Plots seen on next slide.

147

148 of 206

148

149 of 206

149

150 of 206

150

151 of 206

Transform Learning for Classification

151

152 of 206

Transform learning

  • Techniques like PCA, NMF, KSVD, MOD, etc. are designed for (compact) image representation.

  • They sometimes yield great results on image classification.

  • But they are not designed for optimal image classification.

152

153 of 206

Transform learning

  • Given input points {xi}, i = 1 to N, we will learn a transformation vector w with the aim of optimally classifying the transformed points yi = wTxi.

  • Two techniques for this:
  • Fisher’s linear discriminant
  • Mutual information based discriminant

153

154 of 206

154

Good w

Not so good w

155 of 206

Fisher’s linear discriminant (FLD)

  • Also called linear discriminant analysis (LDA).

  • We start off considering c = 2 classes C1 and C2 and generalize later.

  • We have:

155

156 of 206

Fisher’s linear discriminant (FLD)

  • Consider:

156

Criterion for a good w:

The transformed mean vectors should be as far apart as possible.

But that is not enough. We want the difference between the transformed mean vectors to be large RELATIVE to the variances (spreads) of the transformed points of each class. Otherwise, the points of the two classes may overlap due to a large spread, even if their means are far apart.

157 of 206

Fisher’s linear discriminant (FLD)

  • Consider:

  • The vector w should be designed based on maximizing the following criterion:

157

Between-class scatter matrix

158 of 206

158

Between-class scatter matrix

Within-class scatter matrix

Find w which maximizes this criterion – often called the Fisher criterion

159 of 206

Fisher’s linear discriminant

  • Taking derivatives and setting to 0, we have:

159

Called a Generalized eigenvalue problem.

Assumes that SW is not singular.

160 of 206

Fisher’s linear discriminant

  • In the c = 2 case:

  • Thus FLD determines a direction that optimizes the criterion mentioned earlier.

  • Since SB is rank 1, only a single such direction is possible.
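In the two-class case the solution has the familiar closed form w ∝ SW^(-1)(m1 - m2). A minimal numpy sketch on a toy pair of elongated Gaussian classes (all numbers illustrative); the recovered direction aligns with the axis separating the means rather than the axis of largest variance:

```python
import numpy as np

def fld_direction(X1, X2):
    """Fisher direction for two classes: w proportional to S_W^{-1}(m1 - m2),
    where S_W is the within-class scatter (sum of the two class scatters)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# two elongated Gaussian classes whose means differ along the x-axis only
rng = np.random.default_rng(10)
cov = np.array([[1.0, 0.0], [0.0, 25.0]])     # large spread along y
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X2 = rng.multivariate_normal([3.0, 0.0], cov, size=500)
w = fld_direction(X1, X2)
```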

160

161 of 206

Fisher’s linear discriminant: classification

  • Thus FLD determines a direction that optimizes the criterion mentioned earlier.

  • It reduces dimensionality from d to 1.

  • The actual classification however proceeds by nearest neighbor method in the transformed space.

161

162 of 206

Fisher’s linear discriminant: case of c classes

  • In this case, we project the d-dimensional points xi onto a c-1 dimensional space (we assume d >= c).

  • Generalizing the within-class scatter matrix:

162

163 of 206

Fisher’s linear discriminant: case of c classes

  • Define the total scatter matrix:

163

164 of 206

Fisher’s linear discriminant: case of c classes

164

165 of 206

Fisher’s linear discriminant: case of c classes

  • We perform a dimensionality reduction from d dimensions to c-1 dimensions using a transform matrix W of size d x (c-1) where:

  • We define scatter matrices in the transformed space:

165

166 of 206

Fisher’s criterion: case of c classes

  • The Fisher criterion which we maximize w.r.t. W is:

  • This is a generalized eigenvalue problem again.

166

The determinant of the scatter matrix is a scalar measure of the scatter. As the determinant is the product of eigenvalues, it is a product of variances in the principal directions and thus the square of the scattering volume of the hyperellipsoid.

167 of 206

Fisher’s criterion: case of c classes

  • The solution is a set of c-1 generalized eigenvectors of the following form:

  • There are no more than c-1 such eigenvectors with non-zero eigenvalues, as the rank of SB is at most c-1 (SB is the sum of c rank-one matrices, and the vectors mi - m are linearly dependent).

167

168 of 206

Fisher’s linear discriminant: uniqueness

  • Is the solution W unique?

  • Answer is no – it is unique only up to an arbitrary non-singular linear transformation because

168

169 of 206

Fisherfaces: applications in face recognition

  • Problem: the matrix SW becomes singular.

  • Why? Because the number of images is usually much smaller than the image dimensions!

  • PCA to the rescue!

169

170 of 206

Fisherfaces: applications in face recognition

  • Reduce the dimensionality of the d-dimensional points – of which there are N – using PCA.

  • Project onto the eigenvectors with the largest N-c eigenvalues – this step causes no loss of information. (This technique is called “Fisherfaces” – see equation 5 of the Fisherfaces paper.)

  • Then perform FLD to reduce dimensionality to c-1. See equation on next slide.

170

171 of 206

Fisherfaces: applications in face recognition

  • The criterion for this modified FLD is as follows:

171

d x (N-c) matrix

172 of 206

Results with Fisherfaces

  • Training data: Database of face images under different lighting conditions.

  • Subsets 1, 2, 3, 4, 5 where the latitudinal and longitudinal angles of the light source direction are respectively within 15, 30, 45, 60, 75 degrees of the camera axis.

  • Interpolation: Training on sets 1 and 5 – testing on the others.

  • Extrapolation: training on set 1, testing on the others.

172

173 of 206

173

Fisherfaces produces better results!

174 of 206

174

Fisherfaces produces better results!

175 of 206

Classification using Mutual Information

  • The FLD approach is constrained to reduce data dimensions to at most c-1 when there are c classes.

  • It is desirable to remove this limitation and have more flexibility.

  • Mutual information = quantity that characterizes the dependence between two random variables.

175

176 of 206

Mutual information: definition

  • Given random variables X and Y, the mutual information between them, I(X,Y), is defined as:

  • For continuous random variables, it is:

176

177 of 206

Mutual information: definition

  • It thus tells you how much information X and Y contain about each other – or how close they come to being independent of each other.

  • MI has the property of being non-negative and symmetric.

  • Zero-valued for independent random variables.
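For discrete variables the definition can be computed directly from a joint probability table - a small numpy sketch:

```python
import numpy as np

def mutual_information(joint):
    """I(X,Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x) P(y)) ),
    computed from a joint probability table (rows: x, cols: y)."""
    px = joint.sum(axis=1, keepdims=True)     # marginal of X
    py = joint.sum(axis=0, keepdims=True)     # marginal of Y
    mask = joint > 0                          # 0 * log(0) taken as 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# independent variables have zero MI; if Y is a copy of X, I(X,Y) = H(X)
indep = np.outer([0.3, 0.7], [0.5, 0.5])
copy = np.diag([0.5, 0.5])                    # Y = X, uniform on two symbols
```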

177

178 of 206

MI for classification

  • Consider data {xi} for i = 1 to N, where each xi is in RD.

  • Each xi has label ci which belongs to exactly one of Nc classes.

  • Let X = R.V. for data, C = R.V. for class label.

  • Infer transform vector w such that the mutual information between {ci} and {yi = wTxi} is maximized.

178

179 of 206

MI for classification

  • Infer transform vector w such that the mutual information between {ci} and {yi = wTxi} is maximized.

  • For our problem, we seek to maximize:

  • To compute MI, we need expressions for all the probabilities.

179

180 of 206

Probability expressions

  • The expression for P(c) is

  • Expression for p(y):

180

181 of 206

Segue: Kernel density estimation

  • What motivates the specific expression for p(y|cp)?

  • This is a common way of performing non-parametric estimation of the probability density.
  • Non-parametric: because it does not assume any family of densities (e.g. Gaussian, Laplacian, Cauchy, etc.)

181

182 of 206

Segue: Kernel density estimation

  • G in the expression below is called the kernel, which is often chosen to be Gaussian.

  • This estimator is often called the Parzen-Rosenblatt density estimator.

  • Histogramming is another form of non-parametric density estimation – but produces discontinuous estimates.

  • Note: Histogramming and kernel method are both estimates of a true underlying (unknown) probability density.
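A minimal numpy sketch of the Parzen-Rosenblatt estimator with a Gaussian kernel, using the rule-of-thumb bandwidth sigma = n^(-0.2); the grid and sample sizes are illustrative:

```python
import numpy as np

def kde(y, samples, sigma):
    """Parzen-Rosenblatt estimator with a Gaussian kernel:
    p_hat(y) = (1/n) * sum_i N(y; y_i, sigma^2)."""
    z = (y - samples[:, None]) / sigma     # shape (n_samples, n_query)
    k = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
    return k.mean(axis=0)                  # average kernel over samples

rng = np.random.default_rng(11)
samples = rng.standard_normal(2000)
sigma = len(samples) ** -0.2               # rule-of-thumb bandwidth O(n^{-1/5})
grid = np.linspace(-4, 4, 81)
p_hat = kde(grid, samples, sigma)
```

The resulting estimate is smooth, in contrast to the discontinuous estimate a histogram would give on the same samples.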

182

183 of 206

Segue: Kernel density estimation

  • The choice of σ in this estimator is critical:

  • The optimal σ depends on the second derivatives of the underlying true density – which is unknown.

  • A rule of thumb is to choose σ = O(n^(-1/5)), i.e. O(n^(-0.2)), where n is the number of samples.

183

184 of 206

Probability expressions

  • Expression for p(y):

184

185 of 206

Quadratic mutual information

  • The mutual information expression involves the log of summations – this is difficult to optimize efficiently.

  • Instead, we use the quadratic mutual information (QMI):

185

186 of 206

QMI expressions

186

187 of 206

QMI expressions

187

Within-class differences

Differences of all pairs of samples (regardless of class)

188 of 206

QMI expressions for derivatives

188

189 of 206

QMI optimization

  • Proceeds by projected gradient ascent with adaptive step-size and can reach a local maximum of the QMI:

  • This was for optimizing over a single direction.
  • For multiple directions, we need to optimize over a matrix W (size D x d, d < D) such that yi = WTxi.
  • We need to impose orthonormality of W – hence the need for a projection step.

189

190 of 206

QMI optimization: orthogonalization

  • Let w1, w2,…,wd be the column vectors (projection directions) of W – which is not orthogonal at this point.

  • Define the following operator:

  • The next slide gives the orthogonalization procedure – the Gram-Schmidt orthogonalization procedure.

190

191 of 206

191

This forces u2 to be orthogonal to u1.

Gram Schmidt Orthogonalization Procedure
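The procedure can be sketched in a few lines - a toy numpy sketch that orthonormalizes a set of direction vectors (stored as rows of the array here, for convenience):

```python
import numpy as np

def gram_schmidt(W):
    """Gram-Schmidt: orthonormalize the vectors (rows of W) in order,
    subtracting from each one its projections onto the already-processed,
    orthonormalized vectors."""
    U = []
    for w in W:
        u = w - sum((u_j @ w) * u_j for u_j in U)   # remove components along u_j
        U.append(u / np.linalg.norm(u))             # normalize the remainder
    return np.array(U)

rng = np.random.default_rng(12)
W = rng.standard_normal((3, 5))      # three directions in R^5
U = gram_schmidt(W)
```

Note that the first direction is unchanged (only rescaled), and each subsequent one is forced to be orthogonal to all of its predecessors.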

192 of 206

Results

  • Three classes:
  • C1: Bimodal Gaussians with centers at (1,0,0) and (-1,0,0)
  • C2: Bimodal Gaussians with centers at (0,1,0) and (0,-1,0)
  • C3: Single Gaussian at (0,0.7,1)

192

193 of 206

193

194 of 206

Compressed Sensing with Overcomplete Dictionaries

194

195 of 206

Towards CS with overcomplete dictionaries

  • A basis in a vector space contains vectors that are all linearly independent.
  • A frame generalizes this concept to vectors that may be linearly dependent.
  • A set of vectors ek, 1 <= k <= n, form a frame if there exist positive values A and B, A <= B, such that for every vector v, we have:

195

196 of 206

Towards CS with overcomplete dictionaries

  • A frame is overcomplete or redundant if it is not a basis (of the vector space).
  • A frame is called a tight frame if A = B. For a tight frame, we have for any v:
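A concrete check of the tight-frame identity, using the classical three-vector "Mercedes-Benz" frame for R^2, which is tight with frame constant A = 3/2:

```python
import numpy as np

# three unit vectors at 120-degree spacing: a tight frame for R^2 with A = 3/2
angles = np.array([np.pi / 2,
                   np.pi / 2 + 2 * np.pi / 3,
                   np.pi / 2 + 4 * np.pi / 3])
E = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # frame vectors as rows

def frame_energy(v):
    """sum_k |<v, e_k>|^2 -- for a tight frame this equals A * ||v||^2."""
    return float(((E @ v) ** 2).sum())

rng = np.random.default_rng(13)
v = rng.standard_normal(2)           # an arbitrary test vector
```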

196

197 of 206

Towards CS with overcomplete dictionaries

  • We mention a generalization of the restricted isometry property called the D-RIP.

197

Note that Σs is the image under D of all s-sparse vectors. The only difference is that D here has more columns than rows unlike the case of orthonormal matrices before. Also random Gaussian and Bernoulli matrices obey D-RIP with high probability given O(s log n) rows, just like they obey RIP.

198 of 206

CS with tight frames

  • Let Ψ be a tight frame of size n x d, and Φ be a m x n measurement matrix satisfying D-RIP with δ2s < 0.08 or δ7s < 0.6. Then the solution x* to the optimization problem P1* below satisfies:

198

Candes et al, Compressed sensing with coherent and redundant dictionaries, Applied and Computational Harmonic Analysis, 2010

199 of 206

References

199

200 of 206

Appendix: Covariance matrix

  • In probability and statistics, the expected value of a scalar random variable x is called its mean:

  • The variance of a scalar random variable is the expected value of its squared deviation around the mean:

200

Sample values of the random variable

Probability density function of x

201 of 206

Appendix: Covariance matrix

  • The expected value of a vector random variable is called its mean vector:

201

k-th sample value of xi

202 of 206

Appendix: Covariance matrix

  • The variance is now replaced by a covariance matrix. Each entry contains the covariance between two elements of the vector:

202

203 of 206

Appendix: A word about Lagrange Multipliers

  • Consider you want to find the minimum or maximum of f(x,y).
  • Normally, you take the derivative and set it to 0 and check the sign of the second derivative.
  • Now you want to find the optimum subject to a constraint that h(x,y) = 0.

203

204 of 206

Appendix: A word about Lagrange Multipliers

  • Physical analogy: if you drop a pebble into a parabola-shaped bowl, the pebble will fall to the bottom-most point.
  • But if you placed a wooden plank into the bowl, the pebble will settle somewhere on the plank!

204

205 of 206

205

Suppose we are walking from (x1,y1) to (x2,y2) along the constraint curve h(x,y) = 0. Initially, the tangent vector along h(x,y) = 0 has a component along –grad(f), and hence the value of f(x,y) decreases as we move along h(x,y) = 0. If the point (x*,y*) is a local minimum of f(x,y) on the curve, then a small motion along h(x,y) = 0 from (x*,y*) must produce no first-order change in f: if the tangent had a component along grad(f), moving in one of the two directions along the curve would decrease f, contradicting local minimality. Hence the tangent to h(x,y) = 0 at (x*,y*) is perpendicular to grad(f). As we move further, the motion along h(x,y) = 0 acquires a component along +grad(f), leading to an increase in the value of f(x,y). At the minimum point, grad(f) is perpendicular to the tangent, i.e. grad(f) and grad(h) are collinear (since grad(h) is always perpendicular to the tangent of its own level curve). Hence, there exists some value λ (the Lagrange multiplier) such that we have:

206 of 206

206