1 of 109

Compressive Sensing

CS 754

Ajit Rajwade


Outline of the Lectures

  • Introduction and Motivation
  • CS Theory: Key Theorems
  • Practical Compressive Sensing Systems
  • Reconstruction algorithms for Compressive Sensing Systems
  • Proof of some of the key results


Introduction + MOTIVATION


The Era of Big Data

  • Huge amounts of data being constantly acquired

  • In many applications
  • E-commerce, stock markets
  • Weather forecasting
  • Sports (esp. cricket!)
  • News reports
  • Image data (medical, surveillance, satellite, etc.)


https://en.wikipedia.org/wiki/Big_data

CC image courtesy of Ender005 on WikiCommons licensed under CC-BY-SA-4.0


The Era of Big Data

  • Do we really need to measure all these data?

  • Can we save resources by intelligent acquisition?

  • Resources:
  • Time (MRI machine)
  • Battery Power (mobile phones)
  • Radiation administered to patients (CT machine)
  • Chemical reagents (Covid-19 RT-PCR testing)

  • Use inherent properties of (redundancy in) data


Conventional Sensing and Compression

  • Before applying signal/image compression algorithms like MPEG, JPEG, JPEG-2000 etc., the measuring devices typically measure large amounts of data.

  • The data is then converted to a transform domain where a majority of the transform coefficients turn out to have near-zero magnitude and can be discarded with small reconstruction error but great savings in storage.

  • Example: a digital camera has a detector array where several thousand pixel values are first stored. This 2D array is fed to the JPEG algorithm which computes DCT coefficients of each block in the image, discarding smaller-valued DCT coefficients.


Converting to transform domain: DFT

  • All discrete signals can be represented perfectly as linear combinations of complex sinusoidal functions of different frequencies.

  • This is called as the Discrete Fourier transform (See next slide).

  • What purpose does it serve? It enables determining the frequency content of the signal (eg: which frequencies were present in an audio signal and at what strength?)


Fourier Transform

The Discrete Fourier Transform, with the annotations from the slide:

F(u) = (1/√N) Σ_{n=0}^{N−1} f(n) e^{−i2πun/N}, for u = 0, 1, …, N−1, where i = sqrt(−1)

Here f(n) is the value of signal f at location n (f is a vector of size N), and F(u) is the u-th Fourier coefficient, u being the frequency index. In matrix-vector form, F = Hᵀf, where F is the vector of N Fourier coefficients, and H is the N x N orthonormal matrix (Fourier basis matrix) whose columns are the (complex) sinusoidal bases – the u-th column of H corresponding to Fourier coefficient F(u).

Fourier Basis Matrix


Time index n is constant over all entries in a given row. Frequency index u is constant across all entries in a given column. Thus each column has a fixed frequency (of the complex exponential).
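As a quick check of these conventions, here is a small sketch (assuming NumPy is available; N = 8 is an arbitrary choice) that builds the orthonormal Fourier basis matrix and verifies that it is unitary and consistent with the FFT:

```python
import numpy as np

# Build the N x N orthonormal (unitary) Fourier basis matrix H, whose u-th
# column is the complex exponential of frequency index u:
# H[n, u] = exp(i*2*pi*u*n/N) / sqrt(N).
N = 8
n = np.arange(N).reshape(-1, 1)   # time index (constant along each row)
u = np.arange(N).reshape(1, -1)   # frequency index (constant along each column)
H = np.exp(2j * np.pi * n * u / N) / np.sqrt(N)

# H is unitary: its conjugate transpose is its inverse.
assert np.allclose(H.conj().T @ H, np.eye(N))

# The Fourier coefficients of a signal f are F = H^H f (conjugate transpose),
# which matches numpy's FFT up to the 1/sqrt(N) scaling.
f = np.random.default_rng(0).standard_normal(N)
F = H.conj().T @ f
assert np.allclose(F, np.fft.fft(f) / np.sqrt(N))

# The signal is perfectly recovered as a linear combination of the
# sinusoidal columns of H, weighted by the Fourier coefficients.
assert np.allclose(H @ F, f)
```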


2D Fourier Transform

In matrix-vector form, F = Hᵀf, where F and f are vectors of length MN. Note that though F and f are 2D signals, they are represented here in vectorized form (vectors of size MN x 1). H is a matrix of size MN x MN, whose columns are indexed by a frequency pair (u,v) (note Huv).

Discrete Cosine Transform (DCT) in 1D


Note that by convention transpose here refers to the conjugate transpose (this is also a convention followed by MATLAB).


Discrete Cosine Transform (DCT) in 1D


Across any column of H, the frequency index u is constant and time (or spatial) index n varies. Across any row of H, the frequency index varies and n is constant.


DCT

  • Expresses a signal as a linear combination of cosine bases (as opposed to the complex exponentials as in the Fourier transform). The coefficients of this linear combination are called DCT coefficients.
  • Is real-valued unlike the Fourier transform!
  • Has better compaction properties for signals and images – as compared to DFT.


  • DCT basis matrix is orthonormal. The dot product of any row (or column) with itself is 1. The dot product of any two different rows (or two different columns) is 0. The inverse is equal to the transpose.

  • Being orthonormal, it preserves the squared norm, i.e.

  • DCT is NOT the real part of the Fourier!

  • DCT basis matrix is NOT symmetric.

  • Columns of the DCT matrix are called the DCT basis vectors.
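These properties are easy to verify numerically. A sketch, assuming NumPy (N = 8 is arbitrary), that builds the orthonormal DCT-II matrix and checks orthonormality, asymmetry, and that it differs from the real part of the DFT:

```python
import numpy as np

# Orthonormal DCT-II basis matrix C (size N x N): the u-th row is
# C[u, n] = a(u) * cos(pi * (2n + 1) * u / (2N)),
# with a(0) = sqrt(1/N) and a(u) = sqrt(2/N) for u >= 1.
N = 8
n = np.arange(N)
u = np.arange(N).reshape(-1, 1)
C = np.cos(np.pi * (2 * n + 1) * u / (2 * N)) * np.sqrt(2.0 / N)
C[0, :] = np.sqrt(1.0 / N)

# Orthonormal: the inverse equals the transpose.
assert np.allclose(C @ C.T, np.eye(N))
assert np.allclose(C.T @ C, np.eye(N))

# The DCT matrix is NOT symmetric...
assert not np.allclose(C, C.T)

# ...and it is NOT the real part of the (orthonormal) DFT matrix.
W = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N) / np.sqrt(N)
assert not np.allclose(C, W.real)
```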


DCT in 2D


The DCT matrix in this case will have size MN x MN, and it will be the Kronecker product of two DCT matrices – one of size M x M, the other of size N x N. The DCT matrix for the 2D case is also orthonormal, it is NOT symmetric, and it is NOT the real part of the 2D DFT.

Do not interpret Hnmuv as a 4-D index, rather it is an element of the column of H indexed by frequency (u,v). Which element? An element indexed by spatial coordinates (n,m).


What do the DCT bases look like? (1D case)


What do the DCT bases look like? (2D case)


The DCT transforms an 8×8 block of input values into a linear combination of these 64 patterns. The patterns are referred to as the two-dimensional DCT basis functions, and the output values are referred to as transform coefficients.

Each image here is obtained from the 8 x 8 outer product of a pair of DCT basis vectors. Each image is stretched between 0 and 255 – on a common scale.


Again: DCT is NOT the real part of the DFT

(Figure: the real part of the DFT basis shown next to the DCT basis, for visual comparison.)

What is Compressive Sensing?

  • Conventional sensing is a “measure and compress+throw” paradigm, which is wasteful!

  • This is especially true for time-consuming acquisitions like MRI, CT, video photography, etc.
  • Compressive sensing is a new technology where the data are acquired/measured directly in a compressed format!


What is Compressive Sensing?

  • These compressed measurements are then fed to some optimization algorithm (also called inversion algorithm) to produce the complete signal.
  • This part is implemented in software.
  • Under suitable conditions that the original signal and the measurement system must fulfill, the reconstruction is guaranteed to have very little or even zero error!


Why compressive sensing?

  • It has the potential to dramatically improve acquisition speed for MRI, CT, hyper-spectral data and other modalities.

  • Potential to dramatically improve video-camera frame rates without sacrificing spatial resolution.


Shannon’s sampling theorem

  • A band-limited signal with maximum frequency B can be accurately reconstructed from its uniformly spaced digital samples if the rate of sampling exceeds 2B (called Nyquist rate).

  • Independently discovered by Shannon, Whittaker, Kotelnikov and Nyquist.


(Figure: an actual analog signal; its samples; the digital signal sampled from the analog signal; and the reconstructed signal – the reconstruction is accurate if Nyquist’s condition is satisfied.)

Band-limited signal

  • A truly band-limited signal is an ideal concept and would take infinite time to transmit or infinite memory to store.
  • Because a band-limited signal can never be time-limited (or space-limited).
  • But many naturally occurring signals when measured by a sensing device are approximately band-limited (example: camera blurring function reduces high frequencies).


Aliasing

  • Sampling an analog signal with maximum frequency B at a rate less than or equal to 2B causes an artifact called aliasing.

(Figure: the original signal, and aliasing when the sampling rate is less than 2B.)

(Figure: aliasing when the sampling rate is equal to 2B; no aliasing when the sampling rate is more than 2B.)

Whittaker-Shannon Interpolation Formula

  • The optimal reconstruction for band-limited signals from their digital samples proceeds using the sinc interpolant. It yields a time-domain (or space-domain) signal of infinite extent, with values that are very small outside a certain window.
  • The associated formula is called Whittaker-Shannon interpolation formula.

f(t) = Σ_{n=−∞}^{∞} f(nT) sinc((t − nT)/T)

where f(nT) is the magnitude of the n-th sample, and T is the sampling period = 1/(sampling rate).


The sinc function (it is the Fourier transform of the rect function)
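A minimal numerical sketch of Whittaker-Shannon interpolation (NumPy assumed; the test signal sinc(t)², the sampling period 0.4 and the truncation to 201 samples are illustrative choices):

```python
import numpy as np

# Whittaker-Shannon interpolation: f_hat(t) = sum_n f(nT) sinc((t - nT)/T).
# Test signal f(t) = sinc(t)^2 is band-limited to B = 1 (its spectrum is a
# triangle on [-1, 1]), so any sampling rate above 2B = 2 suffices.
f = lambda t: np.sinc(t) ** 2        # np.sinc(x) = sin(pi*x)/(pi*x)

T = 0.4                              # sampling period -> rate 2.5 > 2B = 2
t = np.linspace(-5.0, 5.0, 501)      # dense grid for evaluating the result

# Finite truncation of the (infinite) interpolation sum.
f_hat = sum(f(n * T) * np.sinc((t - n * T) / T) for n in range(-100, 101))

err = np.max(np.abs(f_hat - f(t)))
assert err < 1e-2   # small residual, due only to truncating the sum
```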


Some limitations of Shannon’s theorem

  • The samples need to be uniformly spaced (there are extensions for non-uniformly spaced samples, with the equivalent of Nyquist rate being an average sampling rate).
  • The sampling rate needs to be very high if the original signal contains higher frequencies (to avoid aliasing).
  • Does not account for several nice properties of naturally occurring signals (talks only about band-limitedness which is not perfectly realizable).


Motivation for CS: Candes’ puzzling experiment (Circa 2004)


Ref: Candes, Romberg and Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information”, IEEE Transactions on Information Theory, Feb 2006.


Motivation for CS: Candes’ puzzling experiment (Circa 2004)

  • The top-left image (previous slide) is a standard phantom used in medical imaging, called the Logan-Shepp phantom.
  • The top-right image shows 22 radial directions with 512 samples along each, representing those Fourier frequencies (remember we are dealing with 2D frequencies!) which were measured.
  • Bottom left: reconstruction obtained using inverse Fourier transform, assuming the rest of the Fourier coefficients were zero.

  • Bottom right: image reconstructed by solving the constrained optimization problem in the yellow box on the previous slide. It gives a zero-error result!


THEORY OF COMPRESSIVE SENSING


CS: Theory

  • Let a naturally occurring signal be denoted f.
  • Let us denote the measurement of this signal (using some device) as y.
  • Typically, the mathematical relationship between f and y can be expressed as a linear equation of the form y = Φf.
  • Φ is called the measurement matrix (or the “sensing matrix” or the “forward model” of the measuring device).
  • For a standard digital camera, Φ may be approximated by a Gaussian blur.


Measuring devices can be represented as linear systems


CS: Theory

  • Let f be a vector with n elements.

  • Since the measurements need to be compressive, Φ must have fewer rows (m) than columns (n) to produce a measurement vector y with m elements.

  • We know y and Φ, and we wish to estimate f.


CS: Theory

  • We know y and Φ, and we wish to estimate f.

  • We know that in general, this is an under-determined linear system, and hence there is no unique solution.

  • Why?
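The answer, in a runnable sketch (NumPy assumed; the sizes m = 10, n = 30 are arbitrary): any null-space vector of Φ can be added to a solution without changing the measurements at all.

```python
import numpy as np

# y = Phi f with m < n: the system is under-determined, so the null-space
# of Phi is non-trivial and the solution is never unique.
rng = np.random.default_rng(0)
m, n = 10, 30
Phi = rng.standard_normal((m, n))
f = rng.standard_normal(n)
y = Phi @ f

# A basis vector of the null-space of Phi, found via the SVD: the last
# right singular vector corresponds to a zero singular value.
_, _, Vt = np.linalg.svd(Phi)
null_vec = Vt[-1]
assert np.allclose(Phi @ null_vec, 0, atol=1e-10)

f_other = f + 5.0 * null_vec          # a very different signal...
assert np.allclose(Phi @ f_other, y)  # ...with exactly the same measurements
assert not np.allclose(f_other, f)
```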


CS: Theory

  • But CS theory states, that in certain cases, this system does have a unique solution.

  • Conditions to be satisfied:
  • Vector f should be sparse (so not all vectors in Rn are potential solutions).
  • Φ should be “very different from” (“incoherent with”) any row-subset of the identity matrix.


CS: Theory

  • It turns out this is wonderful news for signal and image processing.
  • Why?
  • Natural signals/images have a sparse representation in some well-known orthonormal basis Ψ such as Fourier, DCT, Haar wavelet etc.
  • The measurement matrix Φ can be designed to be incoherent (poorly correlated) with the signal basis matrix Ψ, i.e. the sensing waveforms (rows of Φ) have a very dense representation in the signal basis.


Signal Sparsity

  • Many signals have sparse* representations in standard orthonormal bases (denoted here as Ψ).
  • Example: f = Ψθ, where ||θ||0 << n.


Actually, the representation is compressible, i.e. most of the coefficients are close to zero, but not exactly zero. We will clarify this issue very soon.

This is the L0 norm of a vector = the number of non-zero elements in it


Signal Sparsity

The Barbara image (left), of size 512 x 512, and its reconstruction from just the top 100,000 (out of 262,144) DCT coefficients. The difference is visually indistinguishable, and similar results hold for wavelet transforms as well.

Incoherent Sampling

  • Consider a sensing matrix Φ of size m by n, where m is much less than n (if m >= n, there is no compression in the measurements!).
  • We will assume (for convenience) that each row of Φ is unit-normalized.
  • CS theory states that Φ and Ψ should be “incoherent” with each other.


Incoherent Sampling

  • The coherence between Φ and Ψ is defined as: μ(Φ, Ψ) = √n · max over j,i of |⟨φj, ψi⟩|, where φj is the j-th row of Φ and ψi is the i-th column of Ψ.

  • We want this quantity μ to be as small as possible.
  • Its value always lies in the range [1, √n].
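A sketch of this computation (NumPy assumed; n = 64 is arbitrary), confirming the two extreme cases of the spike basis paired with the Fourier basis (μ = 1) and with itself (μ = √n):

```python
import numpy as np

def coherence(Phi, Psi):
    """mu(Phi, Psi) = sqrt(n) * max_{j,i} |<phi_j, psi_i>|, where phi_j
    are the (unit-norm) rows of Phi and psi_i the columns of Psi."""
    n = Psi.shape[0]
    inner = np.abs(Phi.conj() @ Psi)   # (j, i) entry = <row j of Phi, col i of Psi>
    return np.sqrt(n) * inner.max()

n = 64
I = np.eye(n)                          # spike (Dirac) sensing basis
k = np.arange(n)
Fourier = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

# Spike rows vs. Fourier columns: every inner product has magnitude
# 1/sqrt(n), so mu = 1 -- maximal incoherence.
assert np.isclose(coherence(I, Fourier), 1.0)

# Identity vs. identity: mu = sqrt(n) -- maximal coherence.
assert np.isclose(coherence(I, I), np.sqrt(n))
```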


Incoherent sampling: Example (1)

  • Let the signal basis Ψ be the Fourier basis. A sampling basis Φ that is incoherent with it is the standard spike (Dirac) basis, φk(t) = δ(t − k), which corresponds to the simplest and most conventional sampling basis in space or time (i.e. Φ = identity matrix).

  • The associated coherence value is 1 (why?).


Incoherent sampling: Example (2), or: Randomness is cool! ☺

  • Sensing matrices whose entries are i.i.d. random draws from Gaussian or Bernoulli (+1/-1) distributions are incoherent with any given orthonormal basis Ψ with a very high probability.

  • Implication: we want our sensing matrices to behave like noise! ☺


Signal Reconstruction

  • Let the measured data be given as: y = Φf = ΦΨθ.

  • The coefficients of the signal f in Ψ – denoted as θ – and hence the signal f itself – can be recovered by solving the following constrained minimization problem:

(P0)  min ||θ||0 subject to y = ΦΨθ

This is the L0 norm of a vector = the number of non-zero elements in it


Signal Reconstruction

  • P0 is a very difficult optimization problem to solve – it is NP-hard.
  • Hence, a softer version (known as Basis Pursuit) is solved instead:

(P1)  min ||θ||1 subject to y = ΦΨθ

  • This is a linear programming problem and can be solved with any LP solver (in Matlab, for example) or with packages like L1-magic.

This is the L1 norm of a vector = the sum total of the absolute values of all its elements


Signal Reconstruction: uniqueness issues

  • Is P0 guaranteed to have a unique solution at all? Why? (If answer is no, compressed sensing is not guaranteed to work!)

  • Consider the case that any 2S columns of an m x n matrix Φ are linearly independent. Then any S-sparse signal f can be uniquely reconstructed from measurements y = Φf. See proof on next slide.


Signal Reconstruction: uniqueness issues

  • Proof by contradiction:
  • Suppose we have distinct S-sparse signals f and f’ such that y = Φf = Φf’.
  • Then Φ(f−f’) = 0, and hence f−f’ lies in the null-space of Φ.
  • But f−f’ is an (at most) 2S-sparse signal. Let T be the set of indices of its (at most 2S) non-zero elements. Then Φ(f−f’) = ΦT(f−f’)T = 0, where ΦT is a sub-matrix with column indices strictly from T, and (f−f’)T is a vector containing values of f−f’ strictly from T.
  • Hence the non-zero vector (f−f’)T lies in the null-space of ΦT, and hence there exist 2S columns of Φ that are not linearly independent – a contradiction!
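The uniqueness argument can be checked by brute force on a toy instance (NumPy assumed; m = 4, n = 10, S = 2 are arbitrary choices): verify that all sets of 2S columns are independent, then enumerate all S-sparse supports and confirm that only the true one fits the measurements exactly.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
m, n, S = 4, 10, 2
Phi = rng.standard_normal((m, n))

# Condition of the slide: every subset of 2S = 4 columns has full rank
# (true with probability 1 for a random Gaussian matrix).
assert all(np.linalg.matrix_rank(Phi[:, list(T)]) == 2 * S
           for T in combinations(range(n), 2 * S))

f = np.zeros(n)
f[[2, 7]] = [1.5, -0.8]            # a 2-sparse ground truth
y = Phi @ f

# Brute force over all S-sparse supports: only the true support admits
# an exact fit of y, i.e. the S-sparse solution is unique.
exact_fits = []
for T in combinations(range(n), S):
    coef, *_ = np.linalg.lstsq(Phi[:, list(T)], y, rcond=None)
    if np.allclose(Phi[:, list(T)] @ coef, y):
        exact_fits.append(T)
assert exact_fits == [(2, 7)]
```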


Signal Reconstruction: uniqueness issues

  • But are these conditions on Φ valid?
  • The answer is that we need to design measurement matrices Φ that obey this property.
  • In fact, we will see something called the “Restricted isometry property”. If Φ satisfies this property, uniqueness is guaranteed.


Why is BP efficiently solvable?

  • Because it can be expressed as a linear program, i.e. an optimization problem with a linear objective function and linear equality/inequality constraints.
  • Linear programs can be solved in polynomial time.
  • Why can BP be cast as a linear program? See next slide.


Write θ = α − β with α ≥ 0 and β ≥ 0, and solve: min Σi (αi + βi) subject to y = ΦΨ(α − β), α ≥ 0, β ≥ 0. This is an equivalent formulation of BP and it is a linear program. But why does the new formulation guarantee that the estimated α and β obey our constraint, i.e. the estimated α is non-zero only at the positive entries of θ and the estimated β is non-zero only at the negative entries of θ? See next slide for the answer.
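A sketch of this LP formulation using SciPy's linprog (SciPy/NumPy assumed; the sizes, seed and sparse coefficients are arbitrary choices, and exact recovery is expected only because m is comfortably large for this sparsity level):

```python
import numpy as np
from scipy.optimize import linprog

# Basis pursuit as a linear program via the split theta = alpha - beta,
# alpha >= 0, beta >= 0: minimize sum(alpha) + sum(beta)
# subject to [A, -A] [alpha; beta] = y.
rng = np.random.default_rng(0)
m, n = 32, 64
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
theta = np.zeros(n)
theta[[5, 20, 40]] = [1.2, -0.9, 0.6]          # a 3-sparse ground truth
y = A @ theta

c = np.ones(2 * n)                    # objective: sum of all alpha_i and beta_i
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
assert res.success
theta_hat = res.x[:n] - res.x[n:]

# m = 32 measurements are plenty for S = 3, so L1 recovers the sparse truth.
assert np.max(np.abs(theta_hat - theta)) < 1e-4
```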


Signal Reconstruction

  • The first use of L1-norm for signal reconstruction goes back to a PhD thesis in 1965 – by Logan.
  • L1-norm was used in geophysics way back in 1973 (ref: Claerbout and Muir, “Robust modeling with erratic data”, Geophysics, 1973).
  • There is something stunning about the L1-norm problem from the previous slide. We will see soon.
  • Before we proceed: note that there are other methods to (approximately) solve problem P0 – e.g. greedy approximation algorithms like orthogonal matching pursuit (OMP).
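OMP itself is not derived in these slides; the following is a minimal sketch of the standard greedy loop (NumPy assumed; the sizes and the 4-sparse ground truth are arbitrary choices):

```python
import numpy as np

def omp(A, y, S):
    """Orthogonal matching pursuit: greedily pick the column of A most
    correlated with the current residual, then re-fit on all selected
    columns by least squares; repeat for S iterations."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(S):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    theta = np.zeros(A.shape[1])
    theta[support] = coef
    return theta

rng = np.random.default_rng(0)
m, n, S = 60, 100, 4
A = rng.standard_normal((m, n)) / np.sqrt(m)
theta = np.zeros(n)
theta[[7, 23, 51, 90]] = [1.0, -1.2, 0.8, 1.5]   # a 4-sparse ground truth
y = A @ theta

# With m = 60 random measurements, OMP recovers this sparse signal.
theta_hat = omp(A, y, S)
assert np.max(np.abs(theta_hat - theta)) < 1e-8
```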


Theorem 1

  • Consider a signal vector f (having length n) with a sparse representation in basis Ψ, i.e. f = Ψθ, where S = ||θ||0 << n.

  • Suppose f is measured through the measurement matrix Φ, yielding a vector y of only m << n measurements.

  • If m ≥ C μ²(Φ,Ψ) S log(n/δ) (C is some constant), then the solution to the following is exact with probability 1−δ:

min ||θ||1 subject to y = ΦΨθ


Ref: Donoho, “Compressed Sensing”, IEEE Transactions on Information Theory, April 2006.

Ref: Candes, Romberg and Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information”, IEEE Transactions on Information Theory, Feb 2006.


Comments on Theorem 1

  • Look at the lower bound on the number of measurements: m ≥ C μ²(Φ,Ψ) S log(n/δ).
  • The sparser the signal f in the chosen basis Ψ, the fewer the number of samples required for exact reconstruction.
  • The lower the coherence (i.e. the greater the incoherence) between Ψ and Φ, the fewer the number of samples required for exact reconstruction.
  • When δ is smaller, the probability of exact reconstruction is higher. The number of samples required is higher, but the increase is only an additive term proportional to log(1/δ).


Comments on Theorem 1

  • The algorithm to reconstruct f given y makes NO ASSUMPTIONS about the sparsity level of f in Ψ, or the location/magnitude of the non-zero coordinates of θ.

  • The algorithm does not have any notion/consideration for “uniformly spaced samples” – unlike Nyquist’s theorem.

  • Even though we are penalizing the L1 norm (and not L0 norm), we are still getting a reconstruction that is the sparsest (besides being accurate).

  • Although L1-norm has been used in sparse recovery for a long time, the optimality proof was established only in 2005/2006 by Donoho, and by Candes, Romberg and Tao.


Comments on Theorem 1: A New Sampling Theorem?

  • Let the signal basis Ψ be the Fourier basis. A sampling basis Φ that is incoherent with it is the standard spike (Dirac) basis, φk(t) = δ(t − k), which corresponds to the simplest sampling basis in space or time.
  • Let the signal f be sparse in this Fourier basis.
  • Then given any m (time-domain) samples of f chosen uniformly at random (not necessarily uniformly spaced), where m ≥ C S log(n/δ), we can get an exact reconstruction of f with high probability.
  • This acts like a new sampling theorem that does not require uniformly spaced time-samples or band-limited signals.
  • Unlike Shannon’s sampling theorem, this one does not reconstruct an analog signal from the samples – only a discrete-time signal is reconstructed.


Comments on Theorem 1: A New Sampling Theorem?

  • The measurements are taken from a subset that is drawn uniformly at random from all possible subsets (of the same size).

  • Hence there will be some combination of (measurement subset, set of non-zero indices of signal coefficients) for which the reconstruction will not succeed.

  • Hence the probability of 1−δ (rather than certainty) in the guarantee for signal reconstruction.


Comments on Theorem 1: A New Sampling Theorem?

  • Here is one such example.

  • Consider a 1D signal f with N elements, consisting of uniformly spaced spikes of unit height.

  • The spacing between spikes is √N.

  • Such a signal f is called a Dirac comb. It is sparse in time domain.

  • The Fourier transform of such a signal is also a Dirac comb, with spacing 1/√N (in normalized frequency).

  • Consider that the Fourier measurements (in y) are at frequencies that are not multiples of 1/√N.

  • Then clearly, good signal reconstruction can never be guaranteed, as the observed frequencies and the frequency support of the signal are disjoint.

  • However since the measurement frequencies are chosen uniformly at random, such “pathological” cases are rare!


“signal support” is a technical term = set of signal indices with non-zero values.
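The Dirac-comb example can be reproduced directly (NumPy assumed; N = 64, so √N = 8):

```python
import numpy as np

# Dirac comb with N = 64 and spike spacing sqrt(N) = 8: its DFT is again
# a comb, supported only on frequency indices that are multiples of 8
# (i.e. normalized frequencies that are multiples of 1/sqrt(N)).
N = 64
f = np.zeros(N)
f[::8] = 1.0                          # spikes at n = 0, 8, 16, ...

F = np.fft.fft(f)
on_comb = np.arange(N) % 8 == 0
assert np.allclose(F[on_comb], 8.0)   # comb frequencies carry all the energy
assert np.allclose(F[~on_comb], 0.0)  # every other frequency is exactly zero

# If the measured Fourier samples all avoid the comb frequencies, the
# measurements are identically zero and reconstruction is hopeless.
measured = F[~on_comb]
assert np.max(np.abs(measured)) < 1e-10
```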


Intuition behind incoherence


This signal of n elements is sparse in the time domain. If you draw m < n samples in time, most of them will most likely be zero. Instead, the sampling or sensing vectors (i.e. rows of the sensing matrix Φ) should be pseudo-random and incoherent patterns, as shown on the right.


Intuition behind incoherence

  • Consider a signal that is sparse in (say) DCT basis – as in the left subfigure on the previous slide.
  • Remember: we do not know in advance which DCT coefficients are non-zero and which aren’t.
  • The measurements are given as: y = Φf = ΦΨθ.


Intuition behind incoherence

  • Now suppose our measurement functions (i.e. rows of Φ) were also sparse in the DCT basis.
  • Then taking such measurements carries no information about the original signal – as most of these measurements will be zero!


Intuition behind incoherence

  • Hence the measurement functions should be non-sparse linear combinations of the DCT basis vectors, i.e. each measurement should be a non-sparse linear combination of all the DCT coefficients.
  • Thereby each measurement will have information about all the DCT coefficients.


New concept: Restricted Isometry Property (RIP)

  • For integer S = 1,2,…,n, the restricted isometry constant (RIC) δS of a matrix A = ΦΨ (*) of size m by n is the smallest number such that for any S-sparse vector θ, the following is true (||·|| denotes the L2 norm):

(1 − δS) ||θ||² ≤ ||Aθ||² ≤ (1 + δS) ||θ||²

  • We say that A obeys the restricted isometry property (RIP) of order S if δS is not too close to 1 (rather, as close to 0 as possible).
  • If A obeys RIP of order S, no S-sparse signal θ ≠ 0 can lie in the null-space of A (if it did, then we would have ||Aθ||² = 0, and that obviously does not preserve the squared magnitude of the vector θ).


(*) Without loss of generality, we will assume that the rows of A have unit magnitude
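A small empirical illustration of this near-isometry (NumPy assumed; m, n, S and the number of trials are arbitrary choices; note that this samples random S-sparse vectors, whereas the true RIC δS is a worst case over all of them):

```python
import numpy as np

# For a random Gaussian matrix A (entries N(0, 1/m)), the energy
# ||A theta||^2 of random S-sparse unit vectors concentrates around
# ||theta||^2 = 1, which is the behavior the RIP formalizes.
rng = np.random.default_rng(0)
m, n, S = 400, 800, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = []
for _ in range(1000):
    support = rng.choice(n, S, replace=False)
    theta = np.zeros(n)
    theta[support] = rng.standard_normal(S)
    theta /= np.linalg.norm(theta)              # unit-norm S-sparse vector
    ratios.append(np.sum((A @ theta) ** 2))     # = ||A theta||^2 / ||theta||^2

# All sampled energy ratios stay close to 1.
assert 0.6 < min(ratios) and max(ratios) < 1.4
```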


Restricted Isometry Property

  • Let’s suppose we wanted to sense a signal x that is S-sparse in some orthonormal basis in which it has representation θ, i.e. x = Ψθ with ||θ||0 ≤ S.
  • Then the following is undesirable for A: ||Aθ||² ≈ 0, i.e. the measurements carry (almost) no information about the signal.
  • One way to ensure that this doesn’t happen is to design A such that: (1 − δS) ||θ||² ≤ ||Aθ||² ≤ (1 + δS) ||θ||².


Restricted Isometry Property (RIP)

  • Note that the difference between two S-sparse vectors is at most 2S-sparse. Then, if A obeys RIP of order 2S, we have the following for any two S-sparse vectors θ1 and θ2:

(1 − δ2S) ||θ1 − θ2||² ≤ ||A(θ1 − θ2)||² ≤ (1 + δ2S) ||θ1 − θ2||²

  • Thus A should approximately preserve the squared differences between any two S-sparse vectors.


Theorem 2

  • Suppose the matrix A = ΦΨ of size m by n (where sensing matrix Φ has size m by n, and basis matrix Ψ has size n by n) satisfies the RIP of order 2S with δ2S < 0.41. Let the solution of the following be denoted as θ* (for signal f = Ψθ, measurement vector y = ΦΨθ):

min ||θ||1 subject to y = ΦΨθ

Then we have:

||θ* − θ||1 ≤ C0 ||θ − θS||1 and ||θ* − θ||2 ≤ C0 ||θ − θS||1 / √S

θS is created by retaining the S largest magnitude elements of θ in their correct locations, and setting the rest to 0.


Comments on Theorem 2

  • This theorem says that the reconstruction of S-sparse signals is always exact if A satisfies the RIP condition above (for an S-sparse signal, θ = θS, so the error bound is zero).
  • For signals that are not S-sparse, the reconstruction error is almost as good as what could be provided to us by an oracle which knew the S largest coefficients (and also their indices) of the signal vector f in the basis Ψ.


Comments on Theorem 2

  • We know that most signals are never exactly sparse in standard orthonormal bases. But they are compressible, i.e. there is a sharp decay in the coefficient magnitudes, with many coefficients being nearly zero!
  • Theorem 2 handles such compressible signals robustly. Theorem 1 did not apply to compressible signals! In this respect, Theorem 2 is more powerful than Theorem 1.
  • The constant C0 is independent of n, and it is an increasing function of just δ2S in [0,1].


Compressive Sensing under Noise

  • So far we assumed that our measurements were exact, i.e. y = ΦΨθ.
  • But practical measurements are always noisy, so that y = ΦΨθ + η, where η is a noise vector with ||η||² ≤ ε.
  • Under the same assumption as before, we can estimate θ by solving the following problem:

min ||θ||1 subject to ||y − ΦΨθ||² ≤ ε

This is a convex problem, called a second-order cone program. It can be solved efficiently with standard packages including MATLAB and L1-MAGIC.


Theorem 3

  • Suppose the matrix A = ΦΨ of size m by n (where sensing matrix Φ has size m by n, and basis matrix Ψ has size n by n) satisfies the RIP of order 2S with δ2S < 0.41. Let the solution of the following be denoted as θ* (for signal f = Ψθ, measurement vector y = ΦΨθ + η, with ||η||² ≤ ε):

min ||θ||1 subject to ||y − ΦΨθ||² ≤ ε

Then we have:

||θ* − θ||2 ≤ C0 ||θ − θS||1 / √S + C1 √ε

θS is created by retaining the S largest magnitude elements of θ in their correct locations, and setting the rest to 0.


Comments on Theorem 3

  • Theorem 3 is a direct extension of Theorem 2 for the case of noisy measurements.
  • It states that the solution of the convex program (see previous slides) gives a reconstruction error which is the sum of two terms: (1) the error of an oracle solution, where the oracle told us the S largest coefficients of the signal f, and (2) a term proportional to the noise level (via the bound ε).
  • The constants C0 and C1 are very small (less than or equal to 5.5 and 6 respectively for δ2S = 0.25); they are independent of n, and they are increasing functions of just δ2S in [0,1].


Comments on Theorem 3

  • How do you choose ε in practice?

  • It depends on the noise model.

  • Note: if ||η||² ≤ κ, then one must choose ε ≥ κ (at least, with high probability).


Comments on Theorem 3

  • Consider the constraint ||y − ΦΨθ||² ≤ ε.

  • If for i = 1 to m, ηi ~ Uniform(−r, r), i.e. uniform random noise with known r, then ||η||² ≤ r²m, so choose ε ≥ r²m.

  • This represents quantization noise.


Comments on Theorem 3

  • Consider the constraint ||y − ΦΨθ||² ≤ ε.

  • Suppose for each i = 1 to m, ηi ~ N(0, σ²), with known σ.

  • Then the squared magnitude of the vector η, scaled by 1/σ², is a chi-square random variable with m degrees of freedom, so ||η||² has mean mσ².

  • Hence with very high probability, ||η||² will lie within a few standard deviations of this mean; setting ε ≥ 9mσ² is thus safe.

  • In such a case, Theorem 3 will hold with high probability (over the noise), provided A obeys the RIP.
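A quick simulation of this choice of ε (NumPy assumed; m, σ and the number of trials are arbitrary choices):

```python
import numpy as np

# ||eta||^2 for eta with i.i.d. N(0, sigma^2) entries is sigma^2 times a
# chi-square variable with m degrees of freedom (mean m*sigma^2), so the
# threshold epsilon = 9*m*sigma^2 is exceeded only with tiny probability.
rng = np.random.default_rng(0)
m, sigma, trials = 100, 0.5, 10000

noise_sq = np.sum(rng.normal(0.0, sigma, size=(trials, m)) ** 2, axis=1)

assert np.isclose(noise_sq.mean(), m * sigma**2, rtol=0.05)  # mean ~ m*sigma^2
assert np.all(noise_sq < 9 * m * sigma**2)  # none of 10000 trials exceed it
```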


Randomness is super-cool!☺

  • Consider any fixed orthonormal basis Ψ.
  • For a sensing matrix Φ constructed in any of the following ways, the matrix A = ΦΨ ϵ Rmxn will obey the RIP of order S with overwhelming probability, provided the number of rows m ≥ C S log(n/S):
  • Φ contains i.i.d. entries from a zero-mean Gaussian with variance 1/m.
  • Φ contains i.i.d. entries with values ±1/√m, each with probability 0.5 (symmetric Bernoulli).
  • Columns of Φ are sampled uniformly at random from the unit sphere in m-dimensional space.


Randomness is super-super-cool!☺

  • These properties of random matrices hold true for any orthonormal basis Ψ with very high probability.
  • As such, we do not need to tune Φ for a given Ψ.


Randomness is super-cool!☺

  • A sensing matrix Φ containing a random subset of rows of the Discrete Fourier Matrix also obeys the RIP,

  • provided the number of rows m ≥ C S (log n)⁴.

  • Similar results hold for row-subsampled versions of other orthonormal matrices whose entries have magnitude O(1/√n).


M. Rudelson and R. Vershynin, “On sparse reconstruction from Fourier and Gaussian measurements,” Commun. Pure Appl. Math., vol. 61, no. 8, pp. 1025–1045, Aug. 2008


L1 norm and L0 norm

  • There is a special relationship between the following two problems:

(P0)  min ||θ||0 subject to y = Aθ
(P1)  min ||θ||1 subject to y = Aθ

The L1 norm is a “softer” version of the L0 norm. Other Lp-norms where 0 < p < 1 are possible and impose a stronger form of sparsity, but they lead to non-convex problems. Hence L1 is preferred.


L1 is good, L2 is bad. Why?

Unit Lp-circle for different values of p: the set |α1|^p + |α2|^p = r^p in 2-D, and Σ_{i=1}^{d} |αi|^p = r^p in d dimensions (here with r = 1). Left side, top to bottom: p = 1 (square with diagonals coinciding with the Cartesian axes), p = 2 (circle in the conventional sense), p = infinity (square with sides parallel to the axes); right side: p = 2/3 (astroid).


L1 is good, L2 is bad. Why?

  • Look at the basic problem we are trying to solve:

  • In 2D (i.e. α is a 2x1 vector, y is a scalar, Φ is a 1 x 2 matrix, and y = Φα being a line in 2D space), the problem can be represented as the following diagram:

(Figure: the constraint line y = Φα in the (α1, α2) plane, touched by a growing L1-ball on the left and a growing L2-ball on the right.)

Image source: Romberg, “Imaging via compressive sampling”, IEEE Signal Processing Magazine


L1 is good, L2 is bad. Why?

  • Note that the line y = Φα represents a HARD constraint, i.e. any vector α must satisfy this constraint.
  • We want to find the vector α with minimum Lp-norm which satisfies this constraint.
  • So we grow Lp-circles of increasing radius (denoted as r and defined in the sense of the Lp-norm) starting from the origin, until the Lp-circle touches the line y = Φα.
  • The very first time the Lp-circle touches the line gives us the solution (the second time, the Lp-circle will have a greater radius, and hence the solution vector α will have a higher Lp-norm).
  • For the L1-norm, the solution vector α will be sparse (see left figure). For the L2-norm, the solution vector will not be sparse (right figure).
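The same picture in numbers (NumPy/SciPy assumed; Φ = [2, 1] and y = 2 are an arbitrary toy instance): the minimum-L2 solution is dense, while the minimum-L1 solution is sparse.

```python
import numpy as np
from scipy.optimize import linprog

# Constraint line y = Phi @ alpha with Phi = [2, 1], y = 2.
Phi = np.array([[2.0, 1.0]])
y = np.array([2.0])

# Minimum-L2 solution (pseudo-inverse): Phi^T (Phi Phi^T)^{-1} y = (0.8, 0.4).
alpha_l2 = np.linalg.pinv(Phi) @ y
assert np.allclose(alpha_l2, [0.8, 0.4])   # dense: both coordinates non-zero

# Minimum-L1 solution via the LP split alpha = a - b, with a, b >= 0.
res = linprog(np.ones(4), A_eq=np.hstack([Phi, -Phi]), b_eq=y, bounds=(0, None))
alpha_l1 = res.x[:2] - res.x[2:]
assert np.allclose(alpha_l1, [1.0, 0.0])   # sparse: it sits on the alpha1 axis
```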


Another example – see the L1 ball versus the L2 ball – both in 3D. In this case the constraint is a plane, given by y = Φx where y is scalar, x is a vector of 3 elements, and Φ has size 1 x 3.


L1 is good, L2 is bad. Why?

  • The L1-norm minimization (subject to the chosen constraint) is guaranteed to give the sparsest possible solutions provided the conditions of theorems 1, 2 and 3 are met.

  • Note that we get the sparsest possible solutions even though the L1-norm is a softened version of the L0-norm!


Compressive Sensing: Toy Example with images

  • We will compare image reconstruction by two different methods.
  • We won’t do true compressive sensing. Instead, we’ll apply different types of sensing matrices Φ synthetically (in software) to an image loaded into memory, merely for the sake of example.
  • In the first method, we compute the DCT coefficients of an image (size N = 256 x 256), and reconstruct the image using the m largest absolute value DCT coefficients (and a simple matrix transpose). m is varied from 1000 to 30000.
  • Note that this is equivalent to using a sensing matrix Φ of size m by N.


Compressive Sensing: Toy Example with images

  • In the second method, we reconstruct the image using m random linear combinations of elements of signal x, where m is again varied from 1000 to 30000.
  • The sensing matrix Φ is appropriately assembled.
  • In this case, the reconstruction proceeds by solving the basis pursuit optimization problem described earlier: L1-norm minimization of the transform coefficients, subject to the measurement constraint.


Compressive Sensing: Toy Example with images


Better edge preservation with the randomly assembled Φ – which actually uses some high frequency information (unlike conventional DCT which discards higher frequencies). Higher PSNR values with the randomly assembled sensing matrix as compared to the linear DCT-based method.

Image source: Romberg, “Imaging via compressive sampling”, IEEE Signal Processing Magazine


Compressed Sensing for Piecewise Constant Signals

  • In the case of piecewise constant signals/images, the gradients are sparse (zero everywhere except along edges).
  • For such images, we can solve the following problem (based on total variation, or the TV-norm): min TV(f) subject to ||y − Φf||² ≤ ε, where TV(f) is the sum of gradient magnitudes of f.


This optimization problem is called a second order cone program. It can be solved efficiently using packages like L1-magic. We will not go into more details, but the documentation of L1-magic is a good starting point for the curious readers.


Theorem 4 (similar to Theorem 1)

  • Consider a piecewise constant signal f (having length n) whose gradient is sparse, with S non-zero entries. Suppose f is measured through the measurement matrix Φ, yielding a vector y of only m << n measurements.

If m ≥ C S log(n/δ) (C is some constant), then the solution to the following is exact with probability 1−δ: min TV(f) subject to y = Φf.

89

Ref: Candes, Romberg and Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information”, IEEE Transactions on Information Theory, Feb 2006.

90 of 109

Experiment by Candes, Romberg and Tao

90

Ref: Candes, Romberg and Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information”, IEEE Transactions on Information Theory, Feb 2006.

91 of 109

Relevant theorem behind Candes’ experiment

  • Take a look at Theorem 1 (scroll back!).
  • Consider a 1D signal f (having N elements) that is piecewise constant – the gradients of such a signal form a sparse vector!
  • Suppose we have measured a subset C of the Fourier coefficients of f. Let us call these measurements G(u), u ∈ C. They form a vector g.
  • We wish to estimate f (exploiting the fact that it is piecewise constant) given g.
  • How do we do this?

91

92 of 109

Relevant theorem behind Candes’ experiment

  • Define vector h as follows:

h(t) = f(t) − f(t−1)

i.e. h is the discrete gradient of f, which is sparse because f is piecewise constant.

  • So we will do the following:

min over h of ‖h‖₁ subject to H(u) = (1 − e^(−j2πu/N)) G(u) for all u ∈ C

(the constraint follows because differencing in time multiplies the Fourier coefficients by the factor 1 − e^(−j2πu/N)).
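Once h is estimated, f is recovered (up to an additive constant) by a cumulative sum. A tiny sketch, with invented values:

```python
import numpy as np

f = np.array([2.0, 2.0, 5.0, 5.0, 5.0, 1.0])     # piecewise constant signal
h = np.diff(f)                                   # h(t) = f(t) - f(t-1): sparse
f_rec = np.concatenate([[0.0], np.cumsum(h)])    # integrate h back to a signal
# f_rec equals f up to the unknown additive constant f[0]:
print(np.allclose(f_rec + f[0], f))              # prints: True
```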

92

93 of 109

Experiment by Candes, Romberg and Tao

  • Invoking Theorem 1 (scroll back), we can estimate h accurately. Then we can estimate f accurately (up to an additive constant).
  • In the case of images (2D signals) that are piecewise constant, the vector h will contain the gradient magnitudes at all the pixels, i.e.

h(x, y) = √( (f(x,y) − f(x−1,y))² + (f(x,y) − f(x,y−1))² )

93

94 of 109

Experiment by Candes, Romberg and Tao

  • Our measurement matrix in this case is a partial Fourier matrix. It is known to be highly incoherent with the spatial domain!
  • The L1-norm we are penalizing in this case turns out to be the total variation (sum total of gradient magnitudes over the entire image)!
  • It was this experiment by Candes, Romberg and Tao (done around 2005) that inspired modern compressive sensing theory!

94

95 of 109

Experiment by Candes, Romberg and Tao

  • Compare the following two problems:

min over θ of ‖θ‖₁ subject to y = ΦΨθ

min over f of TV(f) subject to G(u) = F(u), u ∈ C

  • In the former case, we are measuring the image/signal (spatial domain) through Φ.
  • In the latter case, we are making measurements in the frequency domain (Fourier coefficient values) through Φ.

95

96 of 109

Mutual coherence and Restricted Isometry Property

96

97 of 109

Concept of mutual coherence

  • Consider sensing matrix Φ, representation matrix Ψ and matrix A = Φ Ψ.
  • The mutual coherence of A – denoted μ(A) - is defined as the maximum absolute normalized dot product of any column of A with any other column of A.
  • In other words:

μ(A) = max over i ≠ j of |Aᵢᵀ Aⱼ| / (‖Aᵢ‖₂ ‖Aⱼ‖₂)

where Aᵢ denotes the i-th column of A.
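The definition is straightforward to compute. A small NumPy sketch (the helper name `mutual_coherence` is ours):

```python
import numpy as np

def mutual_coherence(A):
    # Normalize each column, form the Gram matrix, and take the
    # largest absolute off-diagonal entry.
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

print(mutual_coherence(np.eye(4)))   # orthonormal columns: coherence 0
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100))
print(mutual_coherence(A))           # random matrices tend to have low coherence
```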

97

98 of 109

Theorem 5

  • Consider measurement y = Φx + η = ΦΨθ + η = Aθ + η, where ‖η‖₂ ≤ ε.
  • We want to estimate x via θ.
  • Let θS be the vector that retains the S largest-magnitude elements of θ (in their correct locations) and is zero elsewhere.
  • Then if μ(A) < 1/(2S − 1), the solution θ* to the optimization problem P2 yields error bounds of the form:

‖θ* − θ‖₂ ≤ C₀ε + C₁‖θ − θS‖₁

  • See problem P2 on next slide.

98

Ref: Cai, Wang and Xu, “Stable recovery of sparse signals and an oracle inequality”, IEEE Transactions on Information Theory, 56(7), 2010.

99 of 109

Theorem 5

  • Problem P2 is the following:

min over θ of ‖θ‖₁ subject to ‖y − Aθ‖₂ ≤ ε

  • The constants C0 and C1 depend upon S and μ(A).
  • Note the similarity between this theorem and Theorem 3!
  • This theorem handles noise as well as compressible signals, but it uses the mutual coherence criterion and not the RIP (unlike Theorem 3).

99

100 of 109

Theorem 3 versus Theorem 5

  • The sufficient condition for Theorem 5 is easily verifiable in O(mn²) time – the time needed to compute all pairwise dot products between the n columns of A (each of length m).

  • The RIC of a matrix (of arbitrary order S) cannot in general be computed efficiently.

  • Why? Consider, for S-sparse θ:

(1 − δS)‖θ‖₂² ≤ ‖Aθ‖₂² ≤ (1 + δS)‖θ‖₂²

where δS is the smallest constant for which these inequalities hold.

100

RIC of A (of order S)

101 of 109

Theorem 3 versus Theorem 5

  • Now consider the following relationship:

max over S-sparse θ of ‖Aθ‖₂²/‖θ‖₂² = max over Γ of λmax((AΓ)ᵀ(AΓ))

and likewise the minimum over S-sparse θ equals the minimum over Γ of λmin((AΓ)ᵀ(AΓ)).

101

Γ: a subset (of maximum size S) of the possible non-zero indices of vector θ

These are the maximum and minimum eigenvalues of the matrix (AΓ)ᵀ(AΓ) over all such sets Γ. In general, the maximum eigenvalue of a symmetric matrix BᵀB is given by max over x of (xᵀBᵀBx)/(xᵀx). The vector x that realizes the maximum value is the corresponding eigenvector. Likewise for the minimum eigenvalue.

102 of 109

Theorem 3 versus Theorem 5

  • This therefore requires us to enumerate all the different subsets Γ of maximum size S, and compute the maximum and minimum eigenvalues of the matrices (AΓ)ᵀ(AΓ).
  • Subset enumeration is a problem that requires O(n^S) time, since there are on the order of n^S subsets of size at most S.
  • Since there is no upper bound on S, this is clearly not an efficient algorithm – or even a polynomial-time algorithm.
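The combinatorial blow-up is easy to see in code. Here is a brute-force RIC computation (our own sketch, feasible only for toy sizes; the name `ric` is ours):

```python
import numpy as np
from itertools import combinations

def ric(A, S):
    # Brute-force restricted isometry constant of order S: over every
    # support Gamma with |Gamma| <= S, track how far the eigenvalues of
    # (A_Gamma)^T (A_Gamma) stray from 1. Cost grows as O(n^S).
    n = A.shape[1]
    delta = 0.0
    for k in range(1, S + 1):
        for gamma in combinations(range(n), k):
            G = A[:, list(gamma)].T @ A[:, list(gamma)]
            eig = np.linalg.eigvalsh(G)
            delta = max(delta, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))
    return delta

print(ric(np.eye(6), 3))   # orthonormal columns: delta is 0
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 12)) / np.sqrt(30)
print(ric(A, 2))           # small but nonzero for a random matrix
```

Even at n = 100 and S = 10 this loop would visit more than 10^13 subsets, which is why the RIC cannot be verified in practice.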

102

103 of 109

Theorem 3 versus Theorem 5

  • However, the coherence-based bounds are usually too conservative (why?), and the RIC-based bounds are considered tighter.

  • We will see why when we do the proofs of these results, but we can get an intuitive explanation from the famous Gershgorin disc theorem.

103

104 of 109

Theorem 3 versus Theorem 5

  • Gershgorin’s disc/circle theorem states that every eigenvalue of a square matrix B lies in the union of the so-called Gershgorin discs D(Bii, ri).
  • Here ri is the radius of the i-th Gershgorin disc and is defined as the sum of the absolute values of the off-diagonal elements of the i-th row.
  • In other words, every eigenvalue λ of B satisfies, for at least one index i:

|λ − Bii| ≤ ri = Σ over j ≠ i of |Bij|

104

105 of 109

Theorem 3 versus Theorem 5

  • Simple example of Gershgorin’s theorem:
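A quick numerical check of the theorem (the matrix entries are invented for the example):

```python
import numpy as np

# Gershgorin: every eigenvalue of B lies in some disc centered at a
# diagonal entry B[i,i] with radius = sum of |off-diagonal| in row i.
B = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.2],
              [0.5, 0.2, 9.0]])
centers = np.diag(B)
radii = np.sum(np.abs(B), axis=1) - np.abs(centers)
for lam in np.linalg.eigvalsh(B):
    # Each eigenvalue falls inside at least one disc, as the theorem guarantees.
    print(lam, bool(np.any(np.abs(lam - centers) <= radii)))
```

Here the discs are centered at 4, 2 and 9 with radii 1.5, 1.2 and 0.7 respectively, so the eigenvalues are already roughly localized without any eigendecomposition.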

105

106 of 109

Theorem 3 versus Theorem 5

  • Applying this theorem to the matrix AᵀA (assuming the columns of A are unit-normalized, so its diagonal entries are 1 and its off-diagonal entries are at most μ(A) in magnitude) and using the fact that the vector x in the definition is S-sparse, we can conclude that:

1 − (S − 1)μ(A) ≤ ‖Ax‖₂²/‖x‖₂² ≤ 1 + (S − 1)μ(A), i.e. δS ≤ (S − 1)μ(A)

  • Theorem 5 does not require the representation matrix Ψ to be orthonormal.
  • And neither does Theorem 3.

106

107 of 109

Some history: Observations by Logan (1965)

  • Consider a band-limited signal f (of maximum frequency B) corrupted by impulse noise n (i.e. spikes) of arbitrary magnitude, yielding the corrupted signal:

y(t) = f(t) + n(t)

  • Let the support of the impulse noise be sparse in time (a valid assumption), with size T. The support is the set of indices at which the impulse noise is non-zero.

  • Consider the following reconstruction criterion:

min over g of ‖y − g‖₁ subject to g being band-limited with maximum frequency B

107

Ref: Logan, “Properties of high-pass signals”, PhD Thesis, Columbia University, 1965

108 of 109

Some history: Observations by Logan (1965)

  • It turns out that this reconstruction yields PERFECT results regardless of the noise magnitude, provided that the following condition holds:

  • This result exploits the band-limitedness of the signal f, but does not impose constraints on the sampling rate.

  • Note: this result requires that if B is large, then T must be small.

  • This was a very early application of L1-norm optimization for signal reconstruction.

108

109 of 109

Summary

  • Key Theorems
  • Concept of Incoherence
  • Concept of RIP
  • Concept of mutual coherence
  • L0, L1 and L2 norms for reconstruction
  • Key references: within the slides, see also class web-page

109