I realized that I've almost forgot to continue this topic. Well, let's continue now. Particularly, I want to put more words on the regularization parameter lambda.
In the first blog entry of this topic (see Misunderstandings on Sparse Bayesian Learning (SBL) for Compressed Sensing (1)), I emphasized that
(1) for a given SBL algorithm, the optimal lambda is different under different experiment settings.
(2) Different SBL algorithms have different optimal lambda under the same experiment settings.
(3) For most SBL algorithms, noise variance is not the optimal lambda value.
Now I'll talk somethings on the lambda learning rules. Generally, most SBL algorithms have their own learning rules for lambda. However, as I emphasized previously, these learning rules basically cannot lead to the best performance for their associated SBL algorithms. To give a clear understanding on this, I'll show you a simulation result below.
The simulation is a comparison of 4 SBL algorithms for the SMV model (single measurement vector case) in a noisy environment (noise variance = 0.01). The 4 SBL algorithms are as follows:
(1) Wipf & Rao's EM-SBL (published in IEEE TSP, 2004, title: Sparse Bayesian Learning for Basis Selection) using the basic EM updating rule, denoted by EM-SBL
(2) Qiu & Dogandzic's SBL (published in IEEE TSP, 2010, title: Variance-component based sparse signal reconstruction and model selection), denoted by ExCov
(3) Ji, Xue & Carin's Bayesian Compressive Sensing (published in IEEE TSP, 2008, title: Bayesian Compressive Sensing), denoted by BCS
(4) My T-MSBL for the SINGLE measurement vector model (accepted by IEEE Journal of Selected Topics in Signal Processing, 2011, title: Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning), denoted by T-MSBL for SMV. Although T-MSBL is derived for the MMV cases, it can be also used for SMV cases. In these cases, T-MSBL is essentially the same as EM-SBL, but T-MSBL uses an efficient learning rule for the lambda.
First, let's see how the lambda value affects their recovery performance. I plotted their performance as a function of the lambda (Note in SMV cases, EM-SBL and T-MSBL have the same performance curve as a function of lambda), i.e. all the algorithms estimated the solution when fed with a fixed lambda value, and the lambda value varied from 0.001 to 0.33. The performance curve of EM-SBL (T-MSBL for SMV), ExCov, and BCS are plotted as a red solid line, a blue solid line, and a green solid line, respectively, in the following figure.
So, unthoughtful choices of lambda may lead to wrong conclusions. Again, we've seen the noise variance (0.01) is not the optimal lambda values of all the SBL algorithms. However, it can be seen as a good estimate for the lambda for ExCov in this case.
Second, let's see the how lambda learning rules affect the recovery performance. In fact, if we use lambda learning rules to estimate the lambda, we can find the conclusion on performance comparison will change again. I plotted the performance of EM-SBL, ExCov, and my T-MSBL when they used their own lambda leaning rules, shown in red dashed line with square marks, blue dashed line with circle marks, and red dashed line with star marks (for clear display, I omitted the performance of BCS in this case).
You can see, the lambda learning rule of EM-SBL leads to very very poor performance, so poor that even a random guess on lambda may leads to better performance. I've said that in SMV cases the only difference between EM-SBL and T-MSBL is basically the lambda learning rules. Now you can see, the lambda learning rule of my T-MSBL leads to the best performance among the algorithms.
When I read papers in the past one year, I did find that some people heavily relied on the lambda learning rules of SBL. An obvious example is, in terms of recovery performance, SBL algorithms are much better than Lasso, Basis Pursuit, Matching Pursuit etc (this is basically a "truth" well known in our lab). However, in some published simulation results, we found SBL had poorer performance than those algorithms. This is probably due to wrong choices of lambda values, or the use of some lambda learning rules that behaved very poorly.
Here I used the same experiment settings as above to compare the SBL algorithm with Lasso (fed with the optimal regularization value) and Subspace Pursuit (fed with the true number of nonzero elements in the solution vector). Lasso and Subspace Pursuit are plotted as black line with different marks in the following figure:
So, the lambda learning rule is also a crucial factor when evaluating a SBL algorithm's performance. All the lambda learning rules cannot lead to the best performance for their associated SBL algorithms. But some are more effective than others.
Another conclusion is, the work to derive a more effective lambda learning rule is the same valuable as the work to derive a new SBL algorithm using other computation frameworks or models. For example, ExCov is actually an extension of EM-SBL, which treats the nonzero elements of large variance and small variance with different ways. Due to this, the performance curve of ExCov as a function of lambda values is different to that of EM-SBL. When both algorithms choose their optimal lambda value, ExCov has slightly better performance than EM-SBL. However, if both algorithms use their lambda learning rules, ExCov has much better performance than EM-SBL. But if EM-SBL uses the lambda learning rule of T-MSBL (again, note that EM-SBL and T-MSBL is basically the same in SMV cases, except the lambda learning rules), EM-SBL can have better performance than ExCov (see the performance curve denoted by T-MSBL for SMV case). In this sense, it may be better to derive an effective lambda learning rule based on old models than to derive algorithms based on new models.
Actually, lambda learning rules are more important for SBL algorithms in practical applications. This is because in practice, you cannot obtain the optimal lambda value. Although you can use cross-validation, modified L-curve methods or other methods to choose a value for the lambda, for a large-scale dataset, the computational load is heavy.
In simulations you may find some algorithms exhibit excellent performance when use the noise variance as the lambda values, while others behave poorly when use the noise variance as their lambda values. If you conclude that the former algorithms should be better than the latter, you are wrong. This is because it is difficult to accurately estimate the noise variance in practical applications in most cases; errors in estimating the noise variance cannot be avoided. And you should note that the optimal lambda values of SBL algorithms are not far from each other, and not far from the true noise variance. So, the conclusion you get in simulations when use the true noise variance as the lambda values can be totally different to the conclusion you get in practice when you use the estimated noise variance as the lambda values.
In the next blog entry, I will discuss the issue on the threshold to prune out small gamma_i.
For reproducibility, the experiment settings are as follows:
Gaussian random dictionary matrix of the size 30 x 80
nonzero element number: D = 5
The nonzero elements are generated as: nonzeroW = sign(randn(D,1)).* ( rand(D,1)*0.5 + 0.5 );
And their locations are chosen randomly.
noise variance: 0.01
Zhilin Zhang, Bhaskar D. Rao, Clarify Some Issues on the Sparse Bayesian Learning for Sparse Signal Recovery, Technical Report, University of California, San Diego, September, 2011