Friday, September 30, 2011

This Week in Compressive Sensing

Someone liked last week's "This Week in Compressive Sensing" so much that it got posted on Hacker News, sending a trove of newcomers to the subject. The traffic was impressive: at one point, readers on five continents were viewing this entry at the same time. Wow.

This week started with the aftermath of last Friday's stunning finding. To make it clear to some people, I wrote a small summary of what it meant. There is obviously more, and I will surely talk about it over time. One item stood out, though: the videos of the SPARS meeting sent by Jared Tanner. One of them (the first, I think) is a video of David Donoho, who is now advocating changing the units in the phase diagram (this is very much welcome). I also noted his fascinating remark that Belief Propagation acts as a sort of ensemble algorithm (checking several solutions at a time). This is the first time I have heard the workings of the algorithm described so clearly.

This week, several papers appeared on arxiv that are related to compressive sensing, they are:

Robust Sparse Analysis Regularization by Samuel Vaiter, Gabriel Peyré, Charles Dossal, Jalal Fadili. The abstract reads:
This paper studies the properties of L1-analysis regularization for the resolution of linear inverse problems. Most previous works consider sparse synthesis priors where the sparsity is measured as the L1 norm of the coefficients that synthesize the signal in a given dictionary. In contrast, the more general analysis regularization minimizes the L1 norm of the correlations between the signal and the atoms in the dictionary. The corresponding variational problem includes several well-known regularizations such as the discrete total variation and the fused lasso. We first prove that a solution of analysis regularization is a piecewise affine function of the observations. Similarly, it is a piecewise affine function of the regularization parameter. This allows us to compute the degrees of freedom associated to sparse analysis estimators. Another contribution gives a sufficient condition to ensure that a signal is the unique solution of the analysis regularization when there is no noise in the observations. The same criterion ensures the robustness of the sparse analysis solution to a small noise in the observations. Our last contribution defines a stronger sufficient condition that ensures robustness to an arbitrary bounded noise. In the special case of synthesis regularization, our contributions recover already known results, that are hence generalized to the analysis setting. We illustrate these theoretical results on practical examples to study the robustness of the total variation and the fused lasso regularizations.
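To make the synthesis/analysis distinction concrete, here is a tiny numpy sketch (my own illustration, not the authors' code): the discrete total variation penalizes the L1 norm of the correlations D x with a finite-difference analysis operator D, and a piecewise-constant signal that is dense as a vector has only a handful of nonzero analysis coefficients.

```python
import numpy as np

# Piecewise-constant signal: dense as a vector, sparse under analysis.
n = 100
x = np.concatenate([np.full(40, 2.0), np.full(35, -1.0), np.full(25, 4.0)])

# D: (n-1) x n finite-difference operator, the analysis dictionary behind
# discrete total variation
D = np.diff(np.eye(n), axis=0)

analysis_coeffs = D @ x
print("nonzeros in x:    ", np.count_nonzero(x))                # 100 (dense)
print("nonzeros in D @ x:", np.count_nonzero(analysis_coeffs))  # 2 (the jumps)
```

The two nonzero analysis coefficients sit exactly at the jumps of the signal, which is why minimizing ||D x||_1 favors piecewise-constant reconstructions.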

On Variable Density Compressive Sampling by Gilles Puy, Pierre Vandergheynst, Yves Wiaux. The abstract reads:
We advocate an optimization procedure for variable density sampling in the context of compressed sensing. In this perspective, we introduce a minimization problem for the coherence between the sparsity and sensing bases, whose solution provides an optimized sampling profile. This minimization problem is solved with the use of convex optimization algorithms. We also propose a refinement of our technique when prior information is available on the signal support in the sparsity basis. The effectiveness of the method is confirmed by numerical experiments. Our results also provide a theoretical underpinning to state-of-the-art variable density Fourier sampling procedures used in magnetic resonance imaging.

Lianlin Li. The abstract reads:
The recovery of the sparsest overcomplete representation has recently attracted intensive research activity owing to its important potential in many applied fields such as signal processing, medical imaging, communication, and so on. The problem is to seek the sparse coefficient vector of a given noisy observation over a redundant dictionary, where the observation is contaminated by error. Elad et al. gave a worst-case result, which shows the condition for stable recovery of the sparsest overcomplete representation. Although it is easy to evaluate for any given matrix, this result cannot provide a realistic guide in many cases. On the other hand, most popular analyses of sparse reconstruction rely heavily on the so-called RIP (Restricted Isometry Property) for matrices developed by Candes et al., which is usually very difficult or impossible to verify for a given measurement matrix. In this article, we introduce a simple and efficient way of determining the ability of a given D to recover the sparse signal, based on a statistical analysis of the coherence coefficients between any two different columns of the given measurement matrix. The key mechanism behind the proposed paradigm is the analysis of the statistical distribution (the mean and covariance) of these coefficients. We prove that if the resulting means are zero and the covariances are as small as possible, one can faithfully recover approximately sparse signals from a minimal number of noisy measurements with overwhelming probability. The resulting theory is not only suitable for almost all models discussed in the literature of compressed sampling (e.g., Gaussian and frequency measurements), but also provides a framework for new measurement strategies.
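The statistical quantity at the heart of this abstract is easy to look at numerically. Below is a hypothetical sketch (mine, with made-up dimensions, not from the paper) that computes the coherence coefficients, i.e. the inner products between distinct normalized columns of a measurement matrix, and checks that for a Gaussian matrix their mean is near zero and their spread is on the order of 1/sqrt(m):

```python
import numpy as np

# Coherence coefficients of a random Gaussian measurement matrix
rng = np.random.default_rng(0)
m, n = 64, 256
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)         # normalize columns

G = A.T @ A                            # Gram matrix
iu = np.triu_indices(n, k=1)           # strictly upper triangle:
mu = G[iu]                             # one coefficient per column pair

print("mean of coherence coefficients:", mu.mean())   # near 0
print("std of coherence coefficients :", mu.std())    # about 1/sqrt(m)
print("worst-case (mutual) coherence :", np.abs(mu).max())
```

The worst-case value is the classical mutual coherence used in the Elad-style bounds the abstract mentions; the mean and spread are the statistics the authors propose to analyze instead.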

We investigate distributed sensors' failure detection in networks with a small number of defective sensors. We assume that sensors measure a smooth physical phenomenon and that defective sensors' measurements significantly differ from neighboring sensor measurements. We consider that the defective sensors are well represented by binary sparse signals. We build on the sparse nature of the binary sensor failure signals and propose a new detection algorithm based on Group Testing (GT). The distributed GT algorithm estimates the set of defective sensors from a small number of linearly independent binary messages with a simple distance decoder. Furthermore, we theoretically determine the lower bound of the minimal number of linearly independent messages needed for detection guarantees in case of a single defective sensor. We show through experimentation that the number of messages required for successful detection is in practice much smaller for small and medium sized networks. We extend our framework to the detection of multiple failures case by modifying appropriately the message exchange protocol and the decoding procedure. The employed decoder is of low complexity and is robust to noisy messages. The overall method is furthermore resilient to the network dynamics because of our gossip-based message dissemination protocol. We provide results for both regular and irregular network topologies. Given a network setup, we provide parameter selection rules that improve the detection accuracy. Simulations demonstrate that in terms of detection performance the proposed method outperforms methods based on random walk measurements collection. Our method performs detection in fewer system rounds, but it requires larger communication overhead compared to random walk based algorithms that collect network measurements.
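As a toy illustration of the single-defective case (my own sketch with made-up sizes, not the authors' protocol): each binary message reports whether a random subset of sensors contains the defective one, and a simple Hamming distance decoder matches the observed outcomes against each sensor's test-membership signature.

```python
import random

# Group-testing toy: one defective sensor among n_sensors, identified from
# a handful of binary test outcomes via a distance decoder.
random.seed(1)
n_sensors, n_tests, defective = 100, 25, 42

# membership[t][s] = 1 if sensor s participates in test t
membership = [[random.randint(0, 1) for _ in range(n_sensors)]
              for _ in range(n_tests)]

# Observed outcomes: a test is positive iff it includes the defective sensor
outcomes = [membership[t][defective] for t in range(n_tests)]

# Distance decoder: pick the sensor whose signature is closest (in Hamming
# distance) to the observed outcome vector
def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

signatures = [[membership[t][s] for t in range(n_tests)]
              for s in range(n_sensors)]
estimate = min(range(n_sensors), key=lambda s: hamming(signatures[s], outcomes))
print("estimated defective sensor:", estimate)   # recovers 42
```

With 25 random binary tests, the chance that two sensors share the same signature is negligible, which is why so few messages suffice for one defect.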
We explore several reduced-dimension multiuser detection (RD-MUD) structures that significantly decrease the number of required correlation branches at the receiver front-end, while still achieving performance similar to that of the conventional matched-filter (MF) bank. RD-MUD exploits the fact that the number of active users is typically small relative to the total number of users in the system and relies on ideas of analog compressed sensing to reduce the number of correlators. We first develop a general framework for both linear and nonlinear RD-MUD detectors. We then present theoretical performance analysis for two specific detectors: the linear reduced-dimension decorrelating (RDD) detector, which combines subspace projection and thresholding to determine active users and sign detection for data recovery, and the nonlinear reduced-dimension decision-feedback (RDDF) detector, which combines decision-feedback orthogonal matching pursuit for active user detection and sign detection for data recovery. The theoretical performance results for both detectors are validated via numerical simulations.

The Statistical Inefficiency of Sparse Coding for Images (or, One Gabor to Rule them All) by James Bergstra, Aaron Courville, Yoshua Bengio. The abstract reads:
Sparse coding is a proven principle for learning compact representations of images. However, sparse coding by itself often leads to very redundant dictionaries. With images, this often takes the form of similar edge detectors which are replicated many times at various positions, scales and orientations. An immediate consequence of this observation is that the estimation of the dictionary components is not statistically efficient. We propose a factored model in which factors of variation (e.g. position, scale and orientation) are untangled from the underlying Gabor-like filters. There is so much redundancy in sparse codes for natural images that our model requires only a single dictionary element (a Gabor-like edge detector) to outperform standard sparse coding. Our model scales naturally to arbitrary-sized images while achieving much greater statistical efficiency during learning. We validate this claim with a number of experiments showing, in part, superior compression of out-of-sample data using a sparse coding dictionary learned with only a single image.

Liked this entry? Subscribe to the Nuit Blanche feed; there's more where that came from.

Thursday, September 29, 2011

Matrix Factorization This Week.

Today we have five papers, a poster and one code in the advanced matrix factorization world:

First, we start with a team that used data from the Kinect to show how good their tracking system is. The issue at hand in motion registration is identifying the features, from one frame to the next, that provide real information about the motion of objects. One always faces issues with outliers, and this paper shows how the authors use Robust PCA (PCP) with an additional constraint to make motion estimation a robust process.
Motivated by an emerging theory of robust low-rank matrix representation, in this paper, we introduce a novel solution for online rigid-body motion registration. The goal is to develop algorithmic techniques that enable a robust, real-time motion registration solution suitable for low-cost, portable 3-D camera devices. Assuming 3-D image features are tracked via a standard tracker, the algorithm first utilizes Robust PCA to initialize a low-rank shape representation of the rigid body. Robust PCA finds the global optimal solution of the initialization, while its complexity is comparable to singular value decomposition. In the online update stage, we propose a more efficient algorithm for sparse subspace projection to sequentially project new feature observations onto the shape subspace. The lightweight update stage guarantees the real-time performance of the solution while maintaining good registration even when the image sequence is contaminated by noise, gross data corruption, outlying features, and missing data. The state-of-the-art accuracy of the solution is validated through extensive simulation and a real-world experiment, while the system enjoys one to two orders of magnitude speed-up compared to well-established RANSAC solutions. The new algorithm will be released online to aid peer evaluation.
I still note that "It shows the accuracy of RPCA is still sensitive to high variance dense Gaussian noise." The code that performs this SOLO refinement over a plain-vanilla RPCA solver will, according to the paper, be available on the authors' website at some point.
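For readers new to Robust PCA, here is a minimal principal component pursuit sketch using the standard inexact augmented Lagrangian iterations. This is a generic textbook solver run on my own synthetic data, not the SOLO-refined code from the paper:

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, max_iter=500, tol=1e-7):
    """Split M into low-rank L and sparse S by principal component pursuit."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / np.abs(M).sum()      # step-size heuristic from the
    S = np.zeros_like(M)                     # PCP literature
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value thresholding
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(s, 1.0 / mu)) @ Vt
        # S-step: entrywise thresholding of the residual
        S = shrink(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic check: a rank-2 matrix plus 5% gross corruptions
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S0 = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
S0[mask] = 5.0 * rng.choice([-1.0, 1.0], size=int(mask.sum()))
M = L0 + S0
Lhat, Shat = rpca(M)
print("relative error on the low-rank part:",
      np.linalg.norm(Lhat - L0) / np.linalg.norm(L0))
```

Gross but sparse corruptions are absorbed into S, which is exactly what makes PCP attractive for separating outlying feature tracks from rigid motion.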
This paper presents GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm), an efficient and robust online algorithm for tracking subspaces from highly incomplete information. The algorithm uses a robust $l^1$-norm cost function in order to estimate and track non-stationary subspaces when the streaming data vectors are corrupted with outliers. We apply GRASTA to the problems of robust matrix completion and real-time separation of background from foreground in video. In this second application, we show that GRASTA performs high-quality separation of moving objects from background at exceptional speeds: In one popular benchmark video example, GRASTA achieves a rate of 57 frames per second, even when run in MATLAB on a personal laptop.
No implementation is available for the moment.

Finally, we have: Sparse PCA via Augmented Lagrangian Methods and Application to Informative Feature Selection by Nikhil Naikal, Allen Y. Yang, S. Shankar Sastry. The abstract reads:
Bag-of-words (BoW) methods are a popular class of object recognition methods that use image features (e.g., SIFT) to form visual dictionaries and subsequent histogram vectors to represent object images in the recognition process. The accuracy of the BoW classifiers, however, is often limited by the presence of uninformative features extracted from the background or irrelevant image segments. Most existing solutions to prune out uninformative features rely on enforcing pairwise epipolar geometry via an expensive structure-from-motion (SfM) procedure. Such solutions are known to break down easily when the camera transformation is large or when the features are extracted from low-resolution, low-quality images. In this paper, we propose a novel method to select informative object features using a more efficient algorithm called Sparse PCA. First, we show that using a large-scale multiple-view object database, informative features can be reliably identified from a high-dimensional visual dictionary by applying Sparse PCA on the histograms of each object category. Our experiment shows that the new algorithm improves recognition accuracy compared to the traditional BoW methods and SfM methods. Second, we present a new solution to Sparse PCA as a semidefinite programming problem using the Augmented Lagrangian Method. The new solver outperforms the state of the art for estimating sparse principal vectors as a basis for a low-dimensional subspace model.

The source code for this instance of Sparse PCA using the ALM method can be found here and is listed on the Matrix Factorization Jungle page.
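As a side note for readers, the simplest way to get a feel for the sparse PCA problem is the truncated power iteration below. This is a generic sketch on synthetic data (a planted sparse leading eigenvector), not the Augmented Lagrangian SDP solver released by the authors:

```python
import numpy as np

def sparse_pca_tpower(C, k, n_iter=50):
    """Leading k-sparse principal direction via truncated power iteration."""
    # deterministic start: indicator of the largest diagonal entry of C
    v = np.zeros(C.shape[0])
    v[np.argmax(np.diag(C))] = 1.0
    for _ in range(n_iter):
        w = C @ v
        w[np.argsort(np.abs(w))[:-k]] = 0.0   # keep k largest-magnitude entries
        v = w / np.linalg.norm(w)
    return v

# Covariance with a planted 5-sparse leading direction plus a small ridge
n = 50
u = np.zeros(n)
u[:5] = 1.0 / np.sqrt(5)
C = 10.0 * np.outer(u, u) + 0.1 * np.eye(n)

v = sparse_pca_tpower(C, k=5)
print("recovered support:", np.flatnonzero(v))   # [0 1 2 3 4]
```

The support of the recovered vector plays the role of the informative feature set in the BoW application above: histogram bins outside the support are pruned.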

 We introduce a fast and robust subspace-based approach to appearance-based object tracking. The core of our approach is based on Fast Robust Correlation (FRC), a recently proposed technique for the robust estimation of large translational displacements. We show how the basic principles of FRC can be naturally extended to formulate a robust version of Principal Component Analysis (PCA) which can be efficiently implemented incrementally and therefore is particularly suitable for robust real-time appearance-based object tracking. Our experimental results demonstrate that the proposed approach outperforms other state-of-the-art holistic appearance-based trackers on several popular video sequences.
The attendant poster is Fast and robust appearance-based tracking.

And finally, we have:

The inclusion of a dictionary learning capability within scikit-learn, a Python package, in version 0.9


This paper considers the recovery of a low-rank matrix from an observed version that simultaneously contains both (a) erasures: most entries are not observed, and (b) errors: values at a constant fraction of (unknown) locations are arbitrarily corrupted. We provide a new unified performance guarantee on when the natural convex relaxation of minimizing rank plus support succeeds in exact recovery. Our result allows for the simultaneous presence of random and deterministic components in both the error and erasure patterns. On the one hand, corollaries obtained by specializing this one single result in different ways recover (up to poly-log factors) all the existing works in matrix completion, and sparse and low-rank matrix recovery. On the other hand, our results also provide the first guarantees for (a) recovery when we observe a vanishing fraction of entries of a corrupted matrix, and (b) deterministic matrix completion.

There is currently no release of any implementation from this group, which really means that they are not featured on the Matrix Factorization Jungle page. As we say in Texas, it's a damn shame.


Wednesday, September 28, 2011

Videos of Plenary Speakers at SPARS11 / A comment by Shrinivas Kudekar

Jared Tanner sent me the following:

Dear Igor,
The Plenary speakers for SPARS11 had their presentations recorded (video), and these videos are now available on the conference website:
Would you mind mentioning them on your blog, for those who didn't attend SPARS11?
All the best,
Absolutely, thanks Jared.

Also in the comment section of Build it and people will come is the following thoughtful and insightful remark by Shrinivas Kudekar :
Dear Igor, 
Dear Igor, 
I am not sure which comment from anonymous you disagree with. But let me make a point that the entire innovation, as Henry pointed out, is in the creation of this special measurement matrix. This special construction, called a coupled measurement matrix, allows larger measurements at the boundary. This allows one to recover the signal at the boundary. Then the special coupling structure takes over and the whole effect is propagated like a domino effect. The coupling can be understood by taking multiple copies of the system and then joining them together (almost any random connections work!) in some local window (this window is typically smaller than the number of copies you take). However, since we have a finite number of copies, there is a termination at the boundary. Surprisingly, it is this termination (high measurements at the boundary in CS, or a low rate code in coding theory) that propagates through the entire structure. It is like providing small information at the boundary (a genie tells you the exact information at the boundary) and this effect moves in. In the context of coding theory, this special structure elevates the BP threshold (low complexity algorithm) of the coupled codes to the MAP threshold (best possible threshold) of the "underlying" uncoupled system. And to complete the story, the MAP threshold can be made arbitrarily close to the ultimate Shannon threshold by increasing the degrees of the graphical code, thus achieving the capacity (till now the proof is only for the erasure channel, but the phenomenon is observed to hold for any general binary-input channel). In fact, this coupling phenomenon is now observed in many problems in communication science, computer science and statistical physics. One paper where some results are compiled is 
Also a nice compilation of recent results on Spatial Coupling and Coding and Communications can be found on Prof. Urbanke's webpage: 
Thanks Shrinivas.
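Shrinivas's description translates into a very simple matrix shape. Here is a toy construction (mine, with arbitrary block sizes) of a spatially coupled measurement matrix: the signal is split into L blocks, each group of measurements only sees a small window of neighboring blocks, and the boundary block receives extra measurements, the seed from which recovery propagates domino-style:

```python
import numpy as np

# Toy spatially coupled measurement matrix with a seeded boundary block.
rng = np.random.default_rng(0)
L = 10            # number of blocks (copies of the system)
n_b = 20          # signal components per block
w = 2             # coupling window: block i is measured with blocks i..i+w
m_b = 6           # measurements per block
m_seed = 14       # extra-large measurement count for the boundary block

blocks = []
for i in range(L):
    rows = m_seed if i == 0 else m_b
    A_i = np.zeros((rows, L * n_b))
    lo, hi = i * n_b, min(L, i + w + 1) * n_b
    A_i[:, lo:hi] = rng.standard_normal((rows, hi - lo))  # window only
    blocks.append(A_i)
A = np.vstack(blocks)

print("measurement matrix shape:", A.shape)   # (14 + 9*6, 200) = (68, 200)
# The first signal block is the most heavily measured: the seed.
print("rows touching component 0:", np.count_nonzero(A[:, 0]))
```

The band structure is the "local window" of the comment, and the oversampled first block is the genie-like boundary information that starts the domino effect.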


Tuesday, September 27, 2011

A short summary

[2012 Update: From Compressive Sensing to Machine Learning ]
Compressive Sensing (CS) started back in 2004 when Candes, Donoho, Tao and Romberg discovered that one could find, in polynomial time, the sparsest solution to certain types of underdetermined systems of linear equations. These systems have since been mapped to all sorts of sensing devices, starting with MRI and many others. The details have been told numerous times in the CS literature, so let me extract two sentences:

From (2)

The work by Donoho and Candès et al. [1], [2] demonstrated that CS reconstruction is a polynomial time problem – albeit under the constraint that more than 2K measurements are used.
From (1)
It has been shown that we cannot hope to perform CS reconstruction of a k-sparse signal using fewer than m = k + 1 measurements [3, 7]. However, the only approach for CS reconstruction using m = k + 1 is via L_0 minimization, which is known to be NP-complete and therefore impractical [8].

The paper by Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun and Lenka Zdeborová (3), featured on Friday, shows that, with probability close to 1, a polynomial time (quadratic) algorithm can perform compressive sensing reconstruction with m = k + 1 measurements.
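To see why m = k + 1 was considered impractical before this result, here is a small brute-force L_0 decoder (my own illustration of the combinatorial baseline, not the authors' algorithm). With k + 1 Gaussian measurements of a k-sparse signal, the true support is, with probability one, the only size-k support that fits the data exactly, but finding it means searching all C(n, k) candidate supports:

```python
import numpy as np
from itertools import combinations

# Brute-force L_0 reconstruction from m = k + 1 Gaussian measurements.
rng = np.random.default_rng(3)
n, k = 12, 3
m = k + 1                                  # just one more measurement than k

x = np.zeros(n)
support = [2, 5, 9]
x[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n))
y = A @ x

for T in combinations(range(n), k):        # combinatorial search: C(12,3)=220
    coef, *_ = np.linalg.lstsq(A[:, list(T)], y, rcond=None)
    if np.linalg.norm(A[:, list(T)] @ coef - y) < 1e-9:
        print("recovered support:", T)     # (2, 5, 9)
        break
```

For n = 12 and k = 3 this is only 220 least-squares solves, but the count grows combinatorially with n, which is exactly the NP-completeness of L_0 minimization mentioned in the quote.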

In A stunning development in breaking the Donoho-Tanner phase transition ?, I provided different graphs that give an idea of how this algorithm provides an advantage over a previous limit set by polynomial algorithms (L_1 minimization), which themselves started the whole compressive sensing field.

In Mirror, mirror, who's the sparsest of them all ?, I drew the different known boundaries between P and NP over time.

In Build it and people will come, some views are expressed on whether this algorithm may be useful, namely:
  • its robustness still has to be tested against noise (answer: other studies suggest that some algorithms of this type are in fact robust in this new region)
  • it imposes a measurement process that might not be a good fit for hardware (answer: it's really up to the hardware/sensor designer to decide, and it should not be a concern for algorithm development). 



Build it and people will come.

In A stunning development in breaking the Donoho-Tanner phase transition ? some anonymous commenter made the following statement:

There seems to be a little bit of confusion here...
When the signals are exactly sparse and there is no noise, the Donoho-Tanner phase transition characterizes the limits of L1 minimization. As it happens that L1 minimization outperforms other reconstruction algorithms (such as SP, CoSaMP, IHT), this phase transition also seems to apply to these algorithms. However, it will not apply to some other algorithms that outperform L1 minimization, like iteratively reweighted L1.
We also do know what the limit of compressive sensing is in this noise-free setting. It is 2*s measurements, where s is the number of non-zero coefficients in the signal (note this is independent of signal dimension). Furthermore, there are polynomial time algorithms coming from coding theory that achieve this limit (although they are not stable in the presence of noise). This limit can be loosened to s+1 measurements if reconstruction with probability 1 is acceptable (hence the red line in their phase-transition figures).
From what I can tell after skimming through this paper, the authors also follow a coding theory approach: They design their own measurement matrices and reconstruction algorithms for these matrices , and outperform L1 minimization. It seems all their results are on the exact sparse signals with no measurement noise, and they mention that the stability of the algorithm will be studied in further work.
Y'all are nitpicking. Indeed, as Bob Sturm showed us in his recent comparison of different reconstruction solvers such as AMP, SL0, and BP (Phase transitions on Hawai Time; by the way, I am patiently waiting with a dry martini on the beach to see some computations showing IRL1 performing better than SL0), Bob produced the following graph:

The real Donoho-Tanner phase transition (for L_1 solvers) is designated as "B" on this graph. In other words, we already knew that SL0 and AMP do better than DT, but not by a significant margin. In particular, this is the reason I did not use the word 'stunning' last year when I featured Shrinivas Kudekar and Henry Pfister's work (The Effect of Spatial Coupling on Compressive Sensing).

I realize there are issues with the stability of coding theory algorithms in the presence of noise, but these issues don't seem so hard to overcome either. Take, for instance, the recent poster of Jeremy Vila and Phil Schniter at SAHD (An Empirical-Bayes Approach to Compressive Sensing via Approximate Message Passing), where they also use an AMP algorithm and try to delimit the noisy phase transition.

Here they seem to be doing better than DT with noise. Hence, I would not be overly surprised if the algorithm of Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun and Lenka Zdeborová is robust to noise, though I would evidently expect some degradation.

The other argument, on the specific measurement matrix, is weak in my view. We have already seen instances of compressive sensing with specific measurement matrices and specially designed reconstruction algorithms. Once the noise issue is addressed, the specific measurement matrices will be an issue for the hardware designers, but we have seen this before; two examples come to my mind,

and those hardware designers will manage provided the phase transition is larger than that given by DT.

While writing this entry, I also asked Henry Pfister for his point of view on this new algorithm. Henry kindly responded with:

It is well-known that the MAP decoder requires only p*N measurements for a random Gaussian measurement matrix of a "single-Gaussian" signal model where each signal component is drawn i.i.d. according to

f(x) = (1-p)*\delta_0 (x) + p * \phi_1 (x).

Here, we denote the pdf of zero-mean Gaussian with standard deviation s by

\phi_s(x) = exp(-x^2 / (2*s^2)) / sqrt(2*pi*s^2).

The trick is finding a low-complexity decoder which achieves this. In that respect, the work in [KMSSZ] is very interesting.

One portion of our paper analyzed verification-based (a.k.a. SBB) decoding with a signal model equivalent to the single-Gaussian model and clearly showed gains are possible with spatially-coupled measurement matrices. We are now considering the question of why we did not observe larger gains in our CS-BP experiment. Other than the possibility of erroneous results, three important differences are clear:

(1) In our spatial-coupling paper, the CS-BP test used signals whose components were drawn i.i.d. from the distribution

f(x) = (1-p)*\phi_{0.5} (x) + p*\phi_{10} (x).

For this two-Gaussian model, the MAP threshold is strictly worse than the \alpha = \rho line. In fact, one can show  that the slope at \rho=0 must be infinite for this case (like the L1 and BP curves in [KMSSZ]).

(2) Our measurement ensemble was very sparse (e.g., each signal component is involved in 5-9 measurements). In their paper, the measurement ensemble is dense as N->Infinity for fixed L.  In their simulations for M=N/3, each signal component is involved in 250-1000 measurements because they use 3 measurement blocks and L ranges between 10 and 40. They also optimize the variance of entries in their measurement matrix.

(3) The last obvious difference is the BP decoder. We used a function-passing BP decoder based on quantized functions. They used a mean-variance passing BP decoder based on the Donoho-Montanari Gaussian approximation of BP.

It would be very interesting to determine the contribution that each of these three changes made to the impressive results they show.

Finally, the name "seeded BP" for their decoder is somewhat of a misnomer, because their main innovation is a particularly effective flavor of spatial coupling for measurement matrices with AMP-like decoding.
Best Regards,
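Henry's single-Gaussian signal model is easy to make concrete. A few lines of numpy (my own sketch, just to illustrate the density f(x) = (1-p)*\delta_0(x) + p*\phi_1(x)) sample from it and check the empirical statistics:

```python
import numpy as np

# Sample the "single-Gaussian" sparse signal model: each component is 0
# with probability 1 - p and N(0, 1) with probability p.
rng = np.random.default_rng(0)
N, p = 100_000, 0.1

mask = rng.random(N) < p                   # Bernoulli(p) support
x = np.where(mask, rng.standard_normal(N), 0.0)

print("empirical sparsity:", np.count_nonzero(x) / N)   # about p = 0.1
print("std of nonzeros   :", x[mask].std())             # about 1.0
```

Under this model, the sparsity p is exactly the fraction of measurements per signal component that the MAP decoder needs, which is the benchmark the spatially coupled constructions aim for.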
So, what happened on Friday?

The sudden realization that the line in the sand separating P and NP had shifted: there is now a polynomial algorithm that can decode the sparsest signals of an underdetermined linear system of equations in a region that was previously thought to require a combinatorial search.
For that matter, I seem to recall that Emmanuel Candes initially reconstructed a perfectly sparse Shepp-Logan phantom with an L1 solver. The line in the sand moved when he did that back in 2003; it moved again on Friday.

A final insight was provided by another anonymous reader as to why this algorithm works:
It seems that their main contribution is a particular sensing matrix that allows a few sparse elements to be detected, and then propagates the knowledge of these elements forward to detect the other ones. While this is certainly interesting, I wonder if the physics of many sensing applications will actually allow this.
This initial detection of a few elements, which then provides the capability of detecting the others, is intriguing and reminds me of structured sparsity. However, as I stated earlier, I completely disagree with anonymous on her/his last point (would the hardware designers come if the measurement matrix is so special?). Actually, I have only six words, heard many times in Aggieland:

Build it and people will come

Thanks Henry and Bob and the two anonymous commenters for providing these insights.


Monday, September 26, 2011

A postdoc in Compressive Sensing and an announcement for ICPRAM 2012 on High-Dimensional Inference from Limited Data: Sparsity, Parsimony, and Adaptivity

Amit Ashok sent me the following postdoc announcement:

Postdoctoral Position in Compressive Sensing at University of Arizona (College of Optical Sciences)
=============================================================
To apply, go to:  
The DARPA Knowledge Enhanced Compressive Measurement (KECoM) project has the primary goals of
  1. developing a mathematical framework for incorporating prior knowledge of tasks and observed signals into the design of compressive sensors, and 
  2. applying this framework to relevant sensing modalities. 
Under the KECoM program, the UA is investigating information-theoretic techniques for sensor optimization under prior knowledge, as well as application to a compressive spread-spectrum receiver, a compressive radar transmitter/receiver, and a compressive spectral imager. 
An additional postdoctoral researcher position is available for the DARPA KECoM effort. Original postdoctoral appointment is for one year with possible renewal. Outstanding UA benefits include health, dental, vision, and life insurance; paid vacation, sick leave, and holidays; UA/ASU/NAU tuition reduction for the employee and qualified family members; access to UA recreation and cultural activities; and more!   
Duties and Responsibilities:
  •  Provide advanced technical expertise in RF compressive receiver and compressive spectral imager design 
  •  Provide technical expertise in manifold learning, Bayesian experiment design, and information-theoretic principles, such as rate-distortion theory, relating to estimation and  detection theory 
  •  Periodic reporting and documentation as required by program timeline 
  •  Occasional travel may be required  
Additional Minimum Qualifications     
  •  Ph.D. in Electrical Engineering, Optical Sciences/Engineering, Mathematics, Applied Physics, OR related discipline 
  •  Demonstrated experience in information theory, manifold learning, AND/OR statistical signal processing 
  •  Demonstrated ability for independent research with associated publication record 
  •  Demonstrated ability to work in a team environment 
  •  Demonstrated ability to communicate effectively, both verbally and in writing  
                          Preferred Qualifications     
  • Experience in compressive sensing and statistical inference 
  • Experience in radar target recognition 
  • Experience in numerical optimization techniques 
  • Experience with one or more high-level programming languages (C, C++, etc.)

In a different direction, Rui Castro sent me the following:

Dear Igor,

I’m writing on behalf of myself and Jarvis Haupt from the University of Minnesota. We are organizing a special session on "High-Dimensional Inference from Limited Data: Sparsity, Parsimony, and Adaptivity" at the 2012 International Conference on Pattern Recognition Applications and Methods (ICPRAM). Our vision for this special session is to include contributions in a broad range of topics related to theory and methods for exploiting sparsity and other low-dimensional representations for high-dimensional inference tasks. The conference itself will be held February 6-8, 2012 in Vilamoura, Portugal. Additional information on the conference can be found at, and further information on our (and other) special session(s) can be found at

It would be great if you can announce this special session in your 'nuit-blanche' blog, as I believe there will be readers that might be interested in sending a contribution. Let me know if this is possible. I thank you in advance.

Best regards,

Rui Castro
It is possible.

Liked this entry ? subscribe to the Nuit Blanche feed, there's more where that came from

Sunday, September 25, 2011

Mirror, mirror, who's the sparsest of them all ?

The dark part shows the region where we think finding the sparsest solution to an underdetermined system of linear equations is NP-hard. As time goes on, it sure looks like this line is receding. In other words, with every new algorithm shown to work beyond the old limit, the territory of combinatorial problems shrinks and the NP vs. P line in the sand shifts.


Friday, September 23, 2011

A stunning development in breaking the Donoho-Tanner phase transition ?

Extraordinary claims require extraordinary proof, which is really why this sort of discussion is important. Similarly, sometimes you are so attached to a certain view of the truth, and faced with something so different, that you can entirely misread what is being said. If you read this morning's entry, you might get the feeling that I am a little ambivalent about how interesting a paper entitled Statistical physics-based reconstruction in compressed sensing by Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová truly is. Let's put this in perspective: our current understanding so far is that the universal phase transition observed by Donoho and Tanner seems to appear with all the solvers featured here, that it fits many ensembles (not just Gaussian; I remember my jaw dropping when Jared Tanner showed it worked for the ensembles of Piotr Indyk, Radu Berinde et al.), and that the only way to break it is to consider structured sparsity, as shown by Phil Schniter at the beginning of the week. In most people's minds, the L_1 solvers are really a good proxy for L_0 solvers, since even greedy solvers (the closest we can get to L_0 solvers) seem to provide similar results. Then there are results like those of Shrinivas Kudekar and Henry Pfister (Figure 5 of The Effect of Spatial Coupling on Compressive Sensing) that look like some sort of improvement, but not a large one. In all, a slight improvement over that phase transition could, maybe, be attributed to a slightly different solver or ensemble (measurement matrices). So this morning I made the point that, given what I understood of the graphs displayed in the article, it might at best be a small improvement over the Donoho-Tanner phase transition, known to hold not only for Gaussian but for other types of matrices and for different kinds of solvers, including greedy algorithms and SL0 (which simulates some sort of L_0 approach).
"At best" was really an overstatement, but I was intrigued, mostly because of the use of an AMP solver, so I fired off an inquisitive e-mail on the subject to the corresponding author:

Dear Dr. Krzakala, 
... I briefly read your recent paper on arxiv with regards to your statistical physics based reconstruction capability and I am wondering if your current results are within the known boundary of what we know of the phase transition found by Donoho and Tanner or if it is an improvement on it. I provided an explanation of what I meant in today's entry ( If this is an improvement, I'd love to hear about it. If it is not an improvement, one wonders if some of the deeper geometrical findings featured by the Donoho-Tanner phase transition have a bearing on phase transition on real physical systems.
Best regards,

The authors responded quickly with:

Dear Igor, 
Thanks for writing about our work in your blog.
Please notice, however, that our axes in the figure you show are not the same as those of Donoho and Tanner. For a signal with N components, we define \rho N as the number of non-zeros in the signal, and \alpha N as the number of measurements. In our notation Donoho and Tanner's parameters are rho_DT = rho/alpha and delta_DT = alpha. We are attaching our figure plotted in Donoho and Tanner's way. Our green line is then exactly the DT's red line (since we do not put any restriction of the signal elements), the rest is how much we can improve on it with our method. Asymptotically (N\to \infty) our method can reconstruct exactly till the red line alpha=rho, which is the absolute limit for exact reconstruction (with exhaustive search algorithms).
So we indeed improve a lot over the standard L1 reconstruction!
We will be of course very happy to discuss/explain/clarify details if you are interested.
With best regards
Florent, Marc, Francois, Yifan, and Lenka

The reason I messed up reading the variables is that I was probably not expecting something this stunning.
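To avoid tripping over the axes as I did, the change of variables the authors describe fits in a couple of lines (the function name is mine):

```python
def to_donoho_tanner(alpha, rho):
    """Convert the paper's coordinates (measurement rate alpha = M/N,
    density rho = K/N) into Donoho-Tanner coordinates
    (delta_DT = M/N, rho_DT = K/M = rho/alpha)."""
    return alpha, rho / alpha

# A point on the paper's alpha = rho diagonal maps to rho_DT = 1,
# i.e., the absolute limit where #measurements equals #non-zeros.
delta_dt, rho_dt = to_donoho_tanner(0.3, 0.3)
print(delta_dt, rho_dt)
```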

Thanks to Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, and Lenka Zdeborová for their rapid feedback.


This Week in Compressive Sensing

[Update: The second part of this entry led to the following entries: A stunning development in breaking the Donoho-Tanner phase transition ? , Build it and people will come. ]

Today we have a poster: Efficient Deterministic Compressive Sensing for Images with Chirp and Reed-Muller codes, and two papers from arxiv:

One-bit compressed sensing by linear programming by Yaniv Plan and Roman Vershynin. The abstract reads:
In compressed sensing, one's goal is to recover a sparse vector from a minimal number of linear measurements. It is known that O(s log(n/s)) random measurements are sufficient to stably recover an s-sparse vector in n-dimensional space, and the solution can be obtained by convex optimization. However, in general this theory breaks down in the absence of infinite bit precision. We study maximally quantized compressed sensing in which one sees a single bit per measurement, specifically the sign of the linear measurement. We demonstrate that a simple linear program achieves accurate estimation from O(s log^2(n/s)) one-bit measurements. This result also extends to approximately sparse signals (compressible signals). Our results are universal; they imply that one measurement scheme will successfully recover all sparse vectors simultaneously. Our argument is based on solving an equivalent geometric problem on random hyperplane tessellations.

This is a new and deep geometric insight into how to solve these problems, and it seems directly amenable to a CVX formulation. For those of you interested in this 1-bit approach, here are a toolbox and a demo with implementations that let you study and evaluate it:

These implementations can also be found on the big picture in compressive sensing.
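The linear program at the heart of the paper is short enough to sketch. Below is my own toy implementation (not the authors' code) of the Plan-Vershynin program, min ||x||_1 subject to y_i &lt;a_i, x&gt; >= 0 and sum_i y_i &lt;a_i, x&gt; = m, using scipy's LP solver. Keep in mind that one-bit measurements determine only the direction of the signal, never its scale:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, s = 20, 200, 3
x0 = np.zeros(n)
x0[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = np.sign(A @ x0)                  # one bit per measurement

# LP: min ||x||_1  s.t.  y_i <a_i, x> >= 0  and  sum_i y_i <a_i, x> = m,
# with the usual split x = u - v, u >= 0, v >= 0.
B = y[:, None] * A                   # rows y_i * a_i
c = np.ones(2 * n)
A_ub = np.hstack([-B, B])            # -(y_i a_i) . x <= 0  (sign consistency)
b_ub = np.zeros(m)
row = B.sum(axis=0)
A_eq = np.hstack([row, -row])[None, :]
b_eq = [float(m)]                    # normalization fixing the scale
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]
cosine = abs(x_hat @ x0) / (np.linalg.norm(x_hat) * np.linalg.norm(x0))
print(cosine)                        # alignment of the recovered direction
```

With a couple hundred one-bit measurements for a 3-sparse vector in dimension 20, the recovered direction lines up closely with the true one.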

Next, I am not quite sure what to think of the following result for compressive sensing reconstruction (albeit I am sure it is a new insight in statistical physics): Statistical physics-based reconstruction in compressed sensing by Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová. The abstract:
Compressed sensing is triggering a major evolution in signal acquisition. It consists in sampling a sparse signal at low rate and later using computational power for its exact reconstruction, so that only the necessary information is measured. Currently used reconstruction techniques are however limited to acquisition rates considerably higher than the true density of the signal. We design a new class of "seeded" compressed sensing measurement matrices for which a belief propagation-based algorithm achieves exact reconstruction of the signal even at measurement rates very close to the lowest possible ones. This construction is motivated by a statistical physics study of the problem and by the theory of crystal nucleation. The gain in efficiency obtained is confirmed by numerical studies of several cases.
Namely, the authors show this figure, with the oversampling and undersampling ratios inverted:

 and seem to show that their reconstruction divides the figure into two equal triangular regions, where reconstruction is possible in one but not in the other, with the limiting case on the diagonal. I went ahead and looked up the Donoho-Tanner phase transition figure and more specifically went back to Precise Undersampling Theorems (by Donoho and Tanner) to check the phase transitions for each category of polytopes:

The green transition is for the simplex polytopes, i.e., from Lemma 3.1, it is for the positive solutions to underdetermined linear systems. I then put a dark line where the authors have identified the limiting case.

Quite clearly, it does not improve on the simplex result, but it does improve on the crosspolytope and the hypercube results.
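Phase diagrams like these are built empirically, one (undersampling, sparsity) point at a time. Here is a minimal sketch of a single such experiment, casting basis pursuit as a linear program with scipy (my own toy code, not tied to any of the papers above):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve min ||x||_1 s.t. Ax = b as an LP with x = u - v, u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
# delta = m/n = 0.5, rho = s/m = 0.1: a point deep inside the success region
n, m, s = 40, 20, 2
x0 = np.zeros(n)
x0[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
x_hat = basis_pursuit(A, A @ x0)
print(np.max(np.abs(x_hat - x0)))   # exact recovery up to solver tolerance
```

Repeating this over a grid of (delta, rho) points and many random draws, and recording the empirical success rate, is exactly how the diagrams above get their sharp transition.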


Thursday, September 22, 2011

Why am I thinking there is a need for a better dictionary learning or calibration algorithm ?

From the latest Current Biology:

Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies


Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, Jack L. Gallant 

  • A new motion-energy model can describe BOLD signals evoked by natural movies
  • The model reveals how motion information is represented in early visual areas
  • Speed tuning in human early visual areas depends on eccentricity
  • The model provides reconstructions of natural movies from evoked BOLD signals


Quantitative modeling of human brain activity can provide crucial insights about cortical representations [1,2] and can form the basis for brain decoding devices [3,4,5]. Recent functional magnetic resonance imaging (fMRI) studies have modeled brain activity elicited by static visual patterns and have reconstructed these patterns from brain activity [6,7,8]. However, blood oxygen level-dependent (BOLD) signals measured via fMRI are very slow [9], so it has been difficult to model brain activity elicited by dynamic stimuli such as natural movies. Here we present a new motion-energy [10,11] encoding model that largely overcomes this limitation. The model describes fast visual information and slow hemodynamics by separate components. We recorded BOLD signals in occipitotemporal visual cortex of human subjects who watched natural movies and fit the model separately to individual voxels. Visualization of the fit models reveals how early visual areas represent the information in movies. To demonstrate the power of our approach, we also constructed a Bayesian decoder [8] by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of the viewed movies. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.
The rest of the paper can be seen here. Of specific interest are the following two videos:

The left clip is a segment of the movie that the subject viewed while in the magnet. The right clip shows the reconstruction of this movie from brain activity measured using fMRI. The reconstruction was obtained using only each subject's brain activity and a library of 18 million seconds of random YouTube video that did not include the movies used as stimuli. Brain activity was sampled every one second, and each one-second section of the viewed movie was reconstructed separately.

This video is organized as follows: the movie that each subject viewed while in the magnet is shown at upper left. Reconstructions for three subjects are shown in the three rows at bottom. All these reconstructions were obtained using only each subject's brain activity and a library of 18 million seconds of random YouTube video that did not include the movies used as stimuli. The reconstruction at far left is the Average High Posterior (AHP). The reconstruction in the second column is the Maximum a Posteriori (MAP). The other columns represent less likely reconstructions. The AHP is obtained by simply averaging over the 100 most likely movies in the reconstruction library. These reconstructions show that the process is very consistent, though the quality of the reconstructions does depend somewhat on the quality of brain activity data recorded from each subject.
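The Average High Posterior step described above is conceptually simple; here is a toy sketch with made-up features and scores (nothing here comes from the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
library = rng.standard_normal((1000, 64))   # stand-in library: 1000 clips, 64-dim features
scores = rng.standard_normal(1000)          # stand-in for per-clip posterior log-probabilities

# Average High Posterior: average the 100 most likely library clips
top = np.argsort(scores)[-100:]
ahp = library[top].mean(axis=0)
```

The averaging is what gives the AHP reconstructions their blurry but consistent look: individual library clips disagree on details but agree on coarse structure.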


Illustrating the rocket equation with water

I don't know about you but I have a lot of respect for people who do things that I thought were not feasible. This morning, I came across one example: a two stage water rocket. Wow, just wow!


Matrix Factorization This Week

What's new in Matrix Factorization this week ? Well,

Here is some of the disconnect I am beginning to see. On the one hand, you have people pushing for new varieties of matrix factorizations based on sparsity, rank, structured sparsity and so on. On the other, many implementations in Python, R or other languages are lagging, and many implementers (I did not say all) seem to think that the term "matrix factorization" is equivalent to NMF. Maybe the next GSOC projects, in whatever language, should pay more attention to this shift and to the need for a traceable connection between implementations and actual algorithms. In the case of sparse PCA, some of the PCA algos don't even seem to talk about the same beast.

Anyway, here are the different papers that showed up on my radar screen this week: Let us first start with a poster on an algorithm that aims at speeding up the .... NMF :-)

Statistical Optimization of Non-Negative Matrix Factorization by Anoop Korattikara, Levi Boyles, Max Welling, Jingu Kim, Haesun Park.

We also have:

Low-rank matrix recovery via iteratively reweighted least squares minimization by Massimo Fornasier, Holger Rauhut, and Rachel Ward. The abstract reads:
We present and analyze an efficient implementation of an iteratively reweighted least squares algorithm for recovering a matrix from a small number of linear measurements. The algorithm is designed for the simultaneous promotion of both a minimal nuclear norm and an approximatively low-rank solution. Under the assumption that the linear measurements fulfill a suitable generalization of the Null Space Property known in the context of compressed sensing, the algorithm is guaranteed to recover iteratively any matrix with an error of the order of the best k-rank approximation. In certain relevant cases, for instance for the matrix completion problem, our version of this algorithm can take advantage of the Woodbury matrix identity, which allows to expedite the solution of the least squares problems required at each iteration. We present numerical experiments which confirm the robustness of the algorithm for the solution of matrix completion problems, and demonstrate its competitiveness with respect to other techniques proposed recently in the literature.

The  implementation is here. It will be listed on the Matrix Factorization Jungle page.
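The Woodbury identity the abstract leans on, (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}, is what lets one trade an n x n inverse for a k x k one when the update is low-rank. A quick numerical sanity check (my own toy example, unrelated to the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
A = np.diag(rng.uniform(1.0, 2.0, n))    # cheap-to-invert diagonal part
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

A_inv = np.diag(1.0 / np.diag(A))
# Woodbury: only a k x k matrix gets inverted, instead of the full n x n one
small = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
woodbury = A_inv - A_inv @ U @ small @ V @ A_inv
direct = np.linalg.inv(A + U @ C @ V)
print(np.max(np.abs(woodbury - direct)))
```

When the reweighting only perturbs a fixed matrix by a rank-k term, each least squares solve in the iteration can reuse this trick.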
Multi-label Learning via Structured Decomposition and Group Sparsity by Tianyi Zhou and Dacheng Tao. The abstract reads:
In multi-label learning, each sample is associated with several labels. Existing works indicate that exploring correlations between labels improve the prediction performance. However, embedding the label correlations into the training process significantly increases the problem size. Moreover, the mapping of the label structure in the feature space is not clear. In this paper, we propose a novel multi-label learning method ``Structured Decomposition + Group Sparsity (SDGS)''. In SDGS, we learn a feature subspace for each label from the structured decomposition of the training data, and predict the labels of a new sample from its group sparse representation on the multi-subspace obtained from the structured decomposition. In particular, in the training stage, we decompose the data matrix $X\in R^{n\times p}$ as $X=\sum_{i=1}^kL^i+S$, wherein the rows of $L^i$ associated with samples that belong to label $i$ are nonzero and consist a low-rank matrix, while the other rows are all-zeros, the residual $S$ is a sparse matrix. The row space of $L_i$ is the feature subspace corresponding to label $i$. This decomposition can be efficiently obtained via randomized optimization. In the prediction stage, we estimate the group sparse representation of a new sample on the multi-subspace via group \emph{lasso}. The nonzero representation coefficients tend to concentrate on the subspaces of labels that the sample belongs to, and thus an effective prediction can be obtained. We evaluate SDGS on several real datasets and compare it with popular methods. Results verify the effectiveness and efficiency of SDGS.
I am sure the code will be available soon.

Divide-and-Conquer Matrix Factorization by Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. The abstract reads:
This work introduces SUBMF, a parallel divide-and-conquer framework for noisy matrix factorization. SUBMF divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that SUBMF enjoys high probability recovery guarantees comparable to those of its base algorithm.

The SubMF site is here with the attendant code. It will be listed on the Matrix Factorization Jungle page.
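The divide-and-conquer idea itself fits in a few lines. Below is a deliberately simplified caricature (not SUBMF): split the columns of an exactly low-rank matrix, factor each block independently, then recombine by projecting onto the joint column space.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 80, 4
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exactly rank r

# Divide: factor each column block separately (rank-r truncated SVD)
bases = []
for idx in np.array_split(np.arange(n), 4):
    U, _, _ = np.linalg.svd(M[:, idx], full_matrices=False)
    bases.append(U[:, :r])

# Conquer: merge the per-block column spaces and project M onto them
Q, _ = np.linalg.qr(np.hstack(bases))
M_hat = Q @ (Q.T @ M)
print(np.max(np.abs(M_hat - M)))   # exact for an exactly rank-r matrix
```

On noisy or only approximately low-rank data, the recombination step is where the randomized matrix approximation machinery of the paper earns its keep.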

Model Order Selection for Boolean Matrix Factorization by Pauli Miettinen and Jilles Vreeken. The abstract reads:
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general— making it applicable for any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior

The BMF code is here. It will also be listed on the Matrix Factorization Jungle page.

Nonnegative matrix factorization (NMF) is now a common tool for audio source separation. When learning NMF on large audio databases, one major drawback is that the complexity in time is O(FKN) when updating the dictionary (where (F, N) is the dimension of the input power spectrograms, and K the number of basis spectra), thus forbidding its application on signals longer than an hour. We provide an online algorithm with a complexity of O(FK) in time and memory for updates in the dictionary. We show on audio simulations that the online approach is faster for short audio signals and allows to analyze audio signals of several hours.
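For context, here is the batch baseline: the classical Lee-Seung multiplicative updates for the Frobenius objective (a generic sketch, not the paper's online algorithm), whose dictionary update is the O(FKN) step the authors avoid:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 30, 100, 5
V = rng.random((F, K)) @ rng.random((K, N))   # nonnegative data to factor

W = rng.random((F, K)) + 0.1                  # dictionary of K basis spectra
H = rng.random((K, N)) + 0.1                  # activations
eps = 1e-12
errs = []
for _ in range(100):
    # Lee-Seung multiplicative updates for ||V - WH||_F^2
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)      # the O(FKN) dictionary update
    errs.append(np.linalg.norm(V - W @ H))
```

Both updates touch all N columns of V at every iteration, which is exactly what becomes prohibitive on spectrograms hours long.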

We propose an unsupervised inference procedure for audio source separation. Components in nonnegative matrix factorization (NMF) are grouped automatically in audio sources via a penalized maximum likelihood approach. The penalty term we introduce favors sparsity at the group level, and is motivated by the assumption that the local amplitude of the sources are independent. Our algorithm extends multiplicative updates for NMF ; moreover we propose a test statistic to tune hyperparameters in our model, and illustrate its adequacy on synthetic data. Results on real audio tracks show that our sparsity prior allows to identify audio sources without knowledge on their spectral properties.

Task-Driven Dictionary Learning by Julien Mairal, Francis Bach, and Jean Ponce. The abstract reads:
Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in large-scale settings, and is well suited to supervised and semi-supervised classification, as well as regression tasks for data that admit sparse representations.

Distributed User Profiling via Spectral Methods by Dan-Cristian Tomozei, Laurent Massoulié. The abstract reads:
User profiling is a useful primitive for constructing personalised services, such as content recommendation. In the present paper we investigate the feasibility of user profiling in a distributed setting, with no central authority and only local information exchanges between users. We compute a profile vector for each user (i.e., a low-dimensional vector that characterises her taste) via spectral transformation of observed user-produced ratings for items. Our two main contributions follow: i) We consider a low-rank probabilistic model of user taste. More specifically, we consider that users and items are partitioned in a constant number of classes, such that users and items within the same class are statistically identical. We prove that without prior knowledge of the compositions of the classes, based solely on few random observed ratings (namely $O(N\log N)$ such ratings for $N$ users), we can predict user preference with high probability for unrated items by running a local vote among users with similar profile vectors. In addition, we provide empirical evaluations characterising the way in which spectral profiling performance depends on the dimension of the profile space. Such evaluations are performed on a data set of real user ratings provided by Netflix. ii) We develop distributed algorithms which provably achieve an embedding of users into a low-dimensional space, based on spectral transformation. These involve simple message passing among users, and provably converge to the desired embedding. Our method essentially relies on a novel combination of gossiping and the algorithm proposed by Oja and Karhunen.

We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2- or ℓ∞-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
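For the simpler disjoint-group case, the proximal operator of λ Σ_g ||x_g||_2 is plain block soft-thresholding; the overlapping-group case treated in the paper is precisely what calls for the min-cost flow and splitting machinery. A minimal sketch of the disjoint case:

```python
import numpy as np

def prox_group_l2(x, groups, lam):
    """Block soft-thresholding: prox of lam * sum_g ||x_g||_2, disjoint groups."""
    out = np.zeros_like(x)
    for g in groups:
        norm = np.linalg.norm(x[g])
        if norm > lam:
            out[g] = (1 - lam / norm) * x[g]   # shrink the whole group
        # otherwise the whole group is set to zero
    return out

x = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
y = prox_group_l2(x, groups, 1.0)
# first group (norm 5) is shrunk, second (norm ~0.14) is zeroed out entirely
```

This all-or-nothing behavior at the group level is what makes these norms select whole groups of variables at once.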

Abstract: Joint analysis of data from multiple sources has the potential to improve our understanding of the underlying structures in complex data sets. For instance, in restaurant recommendation systems, recommendations can be based on rating histories of customers. In addition to rating histories, customers' social networks (e.g., Facebook friendships) and restaurant categories information (e.g., Thai or Italian) can also be used to make better recommendations. The task of fusing data, however, is challenging since data sets can be incomplete and heterogeneous, i.e., data consist of both matrices, e.g., the person by person social network matrix or the restaurant by category matrix, and higher-order tensors, e.g., the "ratings" tensor of the form restaurant by meal by person.
In this paper, we are particularly interested in fusing data sets with the goal of capturing their underlying latent structures. We formulate this problem as a coupled matrix and tensor factorization (CMTF) problem where heterogeneous data sets are modeled by fitting outer-product models to higher-order tensors and matrices in a coupled manner. Unlike traditional approaches solving this problem using alternating algorithms, we propose an all-at-once optimization approach called CMTF-OPT (CMTF-OPTimization), which is a gradient-based optimization approach for joint analysis of matrices and higher-order tensors. We also extend the algorithm to handle coupled incomplete data sets. Using numerical experiments, we demonstrate that the proposed all-at-once approach is more accurate than the alternating least squares approach.
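A toy, matrix-only version of the all-at-once idea (a sketch of the principle, not CMTF-OPT itself): two data matrices share one factor, and a single gradient descent fits both residuals jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 30, 20, 25, 3
A0 = rng.standard_normal((n, r))
X = A0 @ rng.standard_normal((r, p))   # e.g., a ratings matrix
Y = A0 @ rng.standard_normal((r, q))   # e.g., a side-information matrix sharing A0

A = rng.standard_normal((n, r))
B = rng.standard_normal((p, r))
C = rng.standard_normal((q, r))
lr, loss = 1e-3, []
for _ in range(1000):
    Rx = A @ B.T - X
    Ry = A @ C.T - Y
    # joint objective ||X - AB^T||_F^2 + ||Y - AC^T||_F^2; all factors move at once
    gA = 2 * (Rx @ B + Ry @ C)
    gB = 2 * Rx.T @ A
    gC = 2 * Ry.T @ A
    A -= lr * gA
    B -= lr * gB
    C -= lr * gC
    loss.append(np.linalg.norm(Rx) ** 2 + np.linalg.norm(Ry) ** 2)
```

The shared factor A is pulled by both residuals in the same update, which is the coupling that alternating schemes only achieve indirectly.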

Making Tensor Factorizations Robust to Non-Gaussian Noise by E. C. Chi and Tammy Kolda. The abstract reads:

Tensors are multi-way arrays, and the CANDECOMP/PARAFAC (CP) tensor factorization has found application in many different domains. The CP model is typically fit using a least squares objective function, which is a maximum likelihood estimate under the assumption of independent and identically distributed (i.i.d.) Gaussian noise. We demonstrate that this loss function can be highly sensitive to non-Gaussian noise. Therefore, we propose a loss function based on the 1-norm because it can accommodate both Gaussian and grossly non-Gaussian perturbations. We also present an alternating majorization-minimization (MM) algorithm for fitting a CP model using our proposed loss function (CPAL1) and compare its performance to the workhorse algorithm for fitting CP models, CP alternating least squares (CPALS).
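The sensitivity the authors point out shows up already in the scalar case: fitting a constant under squared loss gives the mean, under the 1-norm it gives the median, and a single gross outlier drags the former much more than the latter.

```python
import numpy as np

data = np.array([1.0, 1.1, 0.9, 1.0, 1.05, 100.0])  # one grossly non-Gaussian outlier

ls_fit = data.mean()       # argmin_c sum_i (x_i - c)^2 : the least squares fit
l1_fit = np.median(data)   # argmin_c sum_i |x_i - c|   : the 1-norm fit

print(ls_fit, l1_fit)      # the mean is dragged toward 100; the median barely moves
```

The same effect, entry by entry, is what makes the 1-norm CP objective robust where the least squares one breaks down.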

Credit: Thierry Legault, video of a pass of the UARS satellite 8/9 days before atmospheric reentry, at an altitude of only 250 km, taken from the ground with a 14" telescope. More info on
