Research not for publishing papers, but for fun, for satisfying curiosity, and for revealing the truth.

This blog reports latest progresses in
(1) Signal Processing and Machine Learning for Biomedicine, Neuroimaging, Wearable Healthcare, and Smart-Home
(2) Sparse Signal Recovery and Compressed Sensing of Signals by Exploiting Spatiotemporal Structures
(3) My Works


Monday, February 28, 2011

Misunderstandings on Sparse Bayesian Learning (SBL) for Compressed Sensing (1)

I will gradually discuss some misunderstandings on the sparse Bayesian learning (SBL), which I have seen in some published papers.

One misunderstanding is the "noise variance" term (written in $\lambda$ or $\sigma^2$ in the literature on SBL). From the Bayesian view, it seems OK to view $\lambda$ as the noise variance. And hence, in computer simulations, people think that setting $\lambda$ to be the true noise variance can allow SBL algorithms to reach their full strength for compressed sensing, and thus this strategy is fair to SBL algorithms when doing performance comparison.

But I have to say, for SBL algorithms the optimal value for $\lambda$ is not the true noise variance, and thus some performance comparisons are not correct!

Here I show some simple experiment results. We can see the optimal value for $\lambda$ is generally larger than the true noise variance.

I use the M-SBL algorithm proposed by David in 2007. The code can be downloaded from the link. Note that this code is a modified version of David's code, since some parameter settings in the latter code are not suitable for performance comparison in low SNR cases.

The Gaussian dictionary matrix was of size 40 by 120 (i.e. N=40, M=120). The number of measurement vectors, L, was 3. The true noise variance was 0.01 (The SNR was around 10dB). The number of nonzero rows in the source matrix (or called the coefficient matrix, the solution matrix), K, was set to be 4, 12, and 16. For each different K, the experiment was repeated 200 times. In each trial, the M-SBL algorithm was fed with different values of $\lambda$. Its performance was measured by two measures: the failure rate and the mean square error (MSE). The failure rate is defined in David's M-SBL paper. These are general experiment settings, for detailed description of the experiments, one can see my paper.

Here is the result (click the figure for full view):
From the figure, we can draw the following conclusions:
(1) The optimal value of $\lambda$ is not equal to the true noise variance. The optimal value is larger than the true noise variance.
(2) Different measures correspond to different optimal values (or different ranges of optimal values). In other words, for given experiment settings, the optimal value for $\lambda$ in terms of MSE is different to the optimal one in terms of the failure rate. This observation is more obvious when using my T-SBL/T-MSBL. 
(3) Changing any experiment setting (e.g. K, L, N, M), the optimal value will accordingly change (in the above experiment, I changed the K). For example, in the figure we can see: with K (the number of nonzero rows in the coefficient matrix) increasing, the optimal value is decreasing, but still larger than the true noise variance.

In fact, I have to add the fourth conclusion (although I don't show here):
(4) For different SBL algorithms, the optimal values for $\lambda$ are different.

Based on these conclusions, now we know that the strategy of setting $\lambda$ being the true noise variance is problematic in algorithm comparison. Suppose we compare two SBL algorithms, say SBL (I) and SBL (II). And suppose that for given experiment settings,  the optimal $\lambda$ of SBL (I) is much closer to the true noise variance than the one of SBL (II). Then, when we set the $\lambda$ to be the true noise variance for the two algorithms, SBL (I) most likely shows better performance than SBL (II), although SBL (II) actually has better performance than SBL (I) if both the two algorithms choose their optimal $\lambda$ values.

Another lesson from the experiment is: when changing any experiment setting, we should again find the optimal $\lambda$ value; otherwise, we can get wrong conclusion. For example, in the above experiment, if we fixed $\lambda$ to be the true noise variance (0.01), we could find that the algorithm's MSE at K=4 is even larger than the MSE at K=12 ! (note that the smaller the K is, the easier the inverse problem is).

Now I think you've believed how important to choose the optimal $\lambda$ in performance comparison.  Especially, when two algorithms have very close performance curves, you must be very cautious to avoid the above wrong strategies.

NOTE: The above conclusions and lessons not only are applied to the comparison of SBL algorithms, but also are applied to the comparison of SBL algorithms to non-SBL algorithms, and also are applied to the comparison of non-SBL algorithms (the regularization parameters of $\ell_1$ algorithms exist similar phenomena!).

You may ask: although the optimal values of $\lambda$ are important, in practice we probably have no way to find them. Yes, you are right. The above conclusions and lessons should be kept in mind when people are really care about which one is better than which one. But in practice, you can use some widely used methods to choose a sub-optimal value for $\lambda$ and do performance comparison. But in this case, you need to be very careful about your conclusions. At least, you cannot say, algorithm (I) is better than algorithm (II). You need to emphasis that when using your $\lambda$ setting strategy, algorithm (I) shows better performance than algorithm (II), and keep in mind that your reader may draw the opposite conclusion when using his/her own strategy for finding a sub(optimal) value for $\lambda$.

The fact that the optimal $\lambda$ is not equal to the true noise variance should not be surprising. In fact, $\lambda$ can be interpreted as noise variance is only reasonable when the dictionary matrix is not underdetermined.

Similarly,  the matrix B in my T-SBL/T-MSBL algorithms cannot be simply interpreted as the covariance matrix of sources. And one should not be surprised that using a single matrix B can lead to excellent performance even if different sources have different temporal correlations.


Reference:

Zhilin Zhang, Bhaskar D. Rao, Clarify Some Issues on the Sparse Bayesian Learning for Sparse Signal Recovery, Technical Report, University of California, San Diego, September, 2011

No comments:

Post a Comment