I realized that I had almost forgotten to continue this topic. Well, let's continue now. In particular, I want to say more about the regularization parameter lambda.
In the first blog entry of this topic (see Misunderstandings on Sparse Bayesian Learning (SBL) for Compressed Sensing (1)), I emphasized that:
(1) For a given SBL algorithm, the optimal lambda is different under different experiment settings.
(2) Different SBL algorithms have different optimal lambdas under the same experiment settings.
(3) For most SBL algorithms, the noise variance is not the optimal lambda value.
Now I'll say a few things about lambda learning rules. Generally, most SBL algorithms have their own learning rules for lambda. However, as I emphasized previously, these learning rules basically cannot lead to the best performance of their associated SBL algorithms. To give a clear understanding of this, I'll show you a simulation result below.
The simulation compares four SBL algorithms on the SMV model (single measurement vector case) in a noisy environment (noise variance = 0.01). The four SBL algorithms are as follows:
(1) Wipf & Rao's EM-SBL (published in IEEE TSP, 2004, title: Sparse Bayesian Learning for Basis Selection) using the basic EM updating rule, denoted by EM-SBL
(2) Qiu & Dogandzic's SBL (published in IEEE TSP, 2010, title: Variance-component based sparse signal reconstruction and model selection), denoted by ExCov
(3) Ji, Xue & Carin's Bayesian Compressive Sensing (published in IEEE TSP, 2008, title: Bayesian Compressive Sensing), denoted by BCS
(4) My T-MSBL for the SINGLE measurement vector model (accepted by IEEE Journal of Selected Topics in Signal Processing, 2011, title: Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning), denoted by T-MSBL for SMV. Although T-MSBL is derived for the MMV case, it can also be used for the SMV case. In this case, T-MSBL is essentially the same as EM-SBL, except that T-MSBL uses a more effective learning rule for lambda.
First, let's see how the value of lambda affects recovery performance. I plotted each algorithm's performance as a function of lambda (note that in the SMV case, EM-SBL and T-MSBL have the same performance curve as a function of lambda): each algorithm estimated the solution when fed a fixed lambda value, and this value was varied from 0.001 to 0.33. The performance curves of EM-SBL (T-MSBL for SMV), ExCov, and BCS are plotted as a red solid line, a blue solid line, and a green solid line, respectively, in the following figure.
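To make the sweep concrete, here is a minimal MATLAB sketch of this kind of experiment. The call solve_sbl(Phi, y, lambda) is a hypothetical wrapper standing for any of the above algorithms run with lambda held fixed; it is not actual released code:

% Sweep a fixed lambda over a grid and record the normalized MSE for each value.
% Phi, y, x_true are as generated in the experiment settings at the end of this post.
lambdaGrid = logspace(log10(0.001), log10(0.33), 50);
NMSE = zeros(size(lambdaGrid));
for k = 1:length(lambdaGrid)
    x_hat = solve_sbl(Phi, y, lambdaGrid(k));          % estimate with this fixed lambda
    NMSE(k) = norm(x_hat - x_true)^2 / norm(x_true)^2;
end
semilogx(lambdaGrid, NMSE); xlabel('lambda'); ylabel('NMSE');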
As we can see, if we could obtain the optimal lambda for each algorithm, then EM-SBL (T-MSBL for SMV) and ExCov would have similar performance, while BCS would have the poorest performance. However, if we choose a wrong lambda, say lambda = 0.0381 (this value was calculated according to the suggestion in the Basis Pursuit Denoising paper), then we may conclude that the performance ranking is: EM-SBL better than BCS better than ExCov. But if we choose lambda = 0.0042 (this value was calculated according to the suggestion in the BCS code), then we may conclude that the performance ranking is: ExCov better than EM-SBL better than BCS. So, careless choices of lambda may lead to wrong conclusions. Again, we've seen that the noise variance (0.01) is not the optimal lambda value for any of the SBL algorithms. However, it can be seen as a good estimate of lambda for ExCov in this case.
Second, let's see how the lambda learning rules affect recovery performance. In fact, if we use the lambda learning rules to estimate lambda, we find that the conclusion of the performance comparison changes again. I plotted the performance of EM-SBL, ExCov, and my T-MSBL when each used its own lambda learning rule, shown as a red dashed line with square marks, a blue dashed line with circle marks, and a red dashed line with star marks, respectively (for clear display, I omitted the performance of BCS in this case).
You can see that the lambda learning rule of EM-SBL leads to very poor performance, so poor that even a random guess at lambda may lead to better performance. I've said that in the SMV case the only essential difference between EM-SBL and T-MSBL is the lambda learning rule. Now you can see that the lambda learning rule of my T-MSBL leads to the best performance among these algorithms.
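To be concrete about what a lambda learning rule is: in the basic EM-SBL, lambda is re-estimated inside the iteration loop together with the hyperparameters gamma_i. Below is a minimal MATLAB sketch of one EM-SBL iteration using the standard EM updates; the notation is mine and this is not the released code:

% One EM-SBL iteration for the SMV model y = Phi*x + v.
% gamma (M x 1) holds the hyperparameters; lambda plays the role of the noise variance.
[N, M] = size(Phi);
Gamma = diag(gamma);
Sigma = Gamma - Gamma*Phi' * ((lambda*eye(N) + Phi*Gamma*Phi') \ (Phi*Gamma)); % posterior covariance
mu    = Sigma * (Phi' * y) / lambda;                                           % posterior mean (estimate of x)
% The EM lambda learning rule (uses the current gamma, before gamma is updated):
lambda = ( norm(y - Phi*mu)^2 + lambda * sum(1 - diag(Sigma)./gamma) ) / N;
% The EM update of the hyperparameters:
gamma = mu.^2 + diag(Sigma);

(In practice, gamma_i that fall below a small threshold are pruned to keep the updates stable; more on that threshold in the next blog entry.) T-MSBL for SMV keeps the same gamma updates but replaces the lambda line above with its own learning rule.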
Reading papers over the past year, I did find that some people relied heavily on the lambda learning rules of SBL. An obvious example: in terms of recovery performance, SBL algorithms are much better than Lasso, Basis Pursuit, Matching Pursuit, etc. (this is basically a "truth" well known in our lab). However, in some published simulation results, SBL had poorer performance than those algorithms. This is probably due to wrong choices of lambda, or to the use of lambda learning rules that behaved very poorly.
Here I used the same experiment settings as above to compare the SBL algorithms with Lasso (fed with the optimal regularization value) and Subspace Pursuit (fed with the true number of nonzero elements in the solution vector). Lasso and Subspace Pursuit are plotted as black lines with different marks in the following figure:
Clearly, if we allow EM-SBL to use its lambda learning rule, its performance is poorer than that of Lasso and Subspace Pursuit. However, EM-SBL actually has much better performance than Lasso and Subspace Pursuit if it uses its optimal lambda value, uses the noise variance as its lambda value, or uses a lambda value obtained from cross-validation, etc.
So, the lambda learning rule is also a crucial factor when evaluating an SBL algorithm's performance. None of the lambda learning rules can lead to the best performance of its associated SBL algorithm, but some are more effective than others.
Another conclusion is that deriving a more effective lambda learning rule is as valuable as deriving a new SBL algorithm using other computational frameworks or models. For example, ExCov is actually an extension of EM-SBL that treats nonzero elements with large variance and those with small variance in different ways. Because of this, the performance curve of ExCov as a function of lambda is different from that of EM-SBL. When both algorithms use their optimal lambda values, ExCov performs slightly better than EM-SBL. When both algorithms use their lambda learning rules, ExCov performs much better than EM-SBL. But if EM-SBL uses the lambda learning rule of T-MSBL (again, note that EM-SBL and T-MSBL are basically the same in the SMV case, except for the lambda learning rules), EM-SBL can outperform ExCov (see the performance curve denoted by T-MSBL for SMV). In this sense, it may be better to derive an effective lambda learning rule for an old model than to derive algorithms based on new models.
Actually, lambda learning rules are even more important for SBL algorithms in practical applications, because in practice you cannot obtain the optimal lambda value. Although you can use cross-validation, modified L-curve methods, or other methods to choose a value for lambda, the computational load is heavy for large-scale datasets.
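To see why the load is heavy, consider a simple hold-out validation scheme for lambda (a single fold of cross-validation): the whole solver must be re-run for every candidate lambda, and running multiple folds multiplies the cost further. A rough MATLAB sketch, with solve_sbl the same hypothetical wrapper as above:

% Choose lambda by validating on held-out measurements (rows of Phi).
N = length(y);
lambdaGrid = logspace(-3, -0.5, 20);
cvErr = zeros(size(lambdaGrid));
idx  = randperm(N);
nVal = round(N/5);                                            % hold out 20% of the measurements
valI = idx(1:nVal);  trnI = idx(nVal+1:end);
for k = 1:length(lambdaGrid)
    x_hat = solve_sbl(Phi(trnI,:), y(trnI), lambdaGrid(k));   % one full solver run per candidate
    cvErr(k) = norm(y(valI) - Phi(valI,:)*x_hat)^2;           % residual on the held-out rows
end
[~, best] = min(cvErr);
lambdaCV = lambdaGrid(best);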
In simulations you may find that some algorithms exhibit excellent performance when using the noise variance as the lambda value, while others behave poorly when doing so. If you conclude that the former algorithms must be better than the latter, you are wrong. In practical applications it is usually difficult to estimate the noise variance accurately; errors in the estimate cannot be avoided. And note that the optimal lambda values of the SBL algorithms are not far from each other, and not far from the true noise variance. So, the conclusion you reach in simulations when using the true noise variance as the lambda value can be totally different from the conclusion you reach in practice when using an estimated noise variance as the lambda value.
In the next blog entry, I will discuss the threshold used to prune out small gamma_i.
For reproducibility, the experiment settings are as follows (a MATLAB sketch of the data generation appears after the list):
Gaussian random dictionary matrix of size 30 x 80
number of nonzero elements: D = 5
the nonzero elements are generated as: nonzeroW = sign(randn(D,1)).* ( rand(D,1)*0.5 + 0.5 );
their locations are chosen randomly
noise variance: 0.01
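Here is a MATLAB sketch of the data generation under these settings. Variable names other than nonzeroW are mine, and whether the dictionary columns were normalized is not stated above, so I omit that step:

% Generate one trial under the settings listed above.
N = 30; M = 80; D = 5;
Phi = randn(N, M);                                        % Gaussian random dictionary, 30 x 80
nonzeroW = sign(randn(D,1)).* ( rand(D,1)*0.5 + 0.5 );    % the D nonzero elements
loc = randperm(M);  loc = loc(1:D);                       % random locations of the nonzeros
x_true = zeros(M,1);  x_true(loc) = nonzeroW;
y = Phi*x_true + sqrt(0.01)*randn(N,1);                   % additive noise with variance 0.01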
Reference:
Zhilin Zhang, Bhaskar D. Rao, Clarify Some Issues on the Sparse Bayesian Learning for Sparse Signal Recovery, Technical Report, University of California, San Diego, September 2011
Zhilin, I like your blog!! That's awesome, man. Can you also provide a list of good papers to start with in SBL? Thanks!
Hi Ali,
Thank you for your comments. Sparse Bayesian learning was first proposed by Mike Tipping. Here are his seminal papers:
[1] Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244.
[2] Faul, A. C. and M. E. Tipping (2002). Analysis of sparse Bayesian learning. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, pp. 383–389. MIT Press.
Above, [1] gives a comprehensive description of SBL, and [2] gives some theoretical analysis.
You can check his website for more papers/overviews/tutorials/slides on SBL: http://www.miketipping.com/index.php?page=rvm
Tipping's work basically focuses on problems in machine learning. David Wipf introduced SBL into the field of sparse signal recovery/compressed sensing, with significant theoretical analysis. To see how SBL is used and developed in this field, see:
[3] D.P. Wipf and B.D. Rao, Sparse Bayesian Learning for Basis Selection, IEEE Transactions on Signal Processing, vol. 52, No. 8, August 2004
I think the above three papers are the starting point on the road to SBL. From there, you can look for more papers by Tipping, Wipf, and others (such as those I mentioned in the blog entry), based on your interests and preferences.
Best
Zhilin