Compressed sensing is an interesting and hot topic recently. I wrote an report on it for my optimization course. I want to summarize it in a more succinct way in a blog post.

**Overview**

Consider an underdetermined linear system Ax = b, where A is a nxN matrix and n << N, x in R^N. If A is not degenerate (or has rank n), then there will be infinitely many solutions to this system. However, if we know the solution x is sparse, i.e., the number of non-zeros is only a few, then the solution is unique and under a certain condition we may find it exactly.

**Application Background**

Many application can be formulated in the above way. For example, in signal processing, to recover the signal x, one must have enough samples to do so. According to Nyquist-Shannon sampling theorem, the sampling rate must be no less than half of the frequency. The matrix A is usually called measurement matrix, which corresponds to the measurement, or sensing in our context. However, in compressed sensing we can still recover the signal even when the samples are not enough. This may seem contradicting, but it is not. *The sampling theorem does not make any assumption to the signal, but we does – the signal is sparse.*

Another example is image 'sensing'. When we are taking images using a digital camera, we must capture all of the pixels. If the resolution is high, the resulting image file will be very large, and people may want to compress it. The modern approach in image compression is wavelet transform, which is used in JPEG2000 standard. An image can be viewed as a 2D signal, and can be represented using a set of wavelets and coefficients. This is similar to Fourier transform, where the signal can be represented by a set of sinusoids and coefficients. The wavelet or sinusoids are called 'basis'. Interestingly, we may always find a basis such that only a few components in it are significant, while others are not. The compressions relies on such an idea, we may safely discard the unimportant components but keep those important components. This will not hurt the image quality too much, and human-eyes cannot see the difference. Intuitively, we can plot the coefficient of the components, and we will see some 'sparks' in the plot, while the others are very small magnitude that is close to zero. We threshold those small magnitude and keep only those sparks. This is roughly how the image compression works. Nevertheless, if we are to abandon lots of information anyway, why not just only capture the important ones from the very beginning?

**Key Discoveries of CS**

The following summarizes the keys:

- If the solution x is sparse enough and matrix A satisfies 'Restricted Isometry Property' (RIP), then x is unique.
- RIP itself is hard to check, but if we obtain A using some random process, it is almost always guaranteed (or, with high probability, with high confident).
- We can formulate an optimization to find the solution: minimize ||x||_0 s.t. Ax=b, where ||x||_0 is the l0-norm of x. Note that the norm is not a true 'norm', but just counting the number of non-zeros in x. However, l0-norm is clearly not continuous, not differentiable and not convex, the minimization is combinatorially hard, or, NP-hard.
- We can convexify the problem. Instead of minimizing l0-norm, we minimize l1-norm, which can be recast as linear program and solved efficiently.
- When A satisfies some property, the l1-minimization has the exact solution as l0-minimization.

**Disclaimer**

The above description is based on the paper I read and my interpretation. Some description may not be very accurate or rigorous in math.