My blog reports on quantitative financial analysis, artificial intelligence for stock investment and trading, and the latest progress in signal processing and machine learning

Wednesday, June 17, 2015

5 Reasons Apache Spark is the Swiss Army Knife of Big Data Analytics


We are living in exponential times, especially where data is concerned. The world is moving fast, and more data is generated every day. With all that data coming your way, you need the right tools to deal with the growing volume.

If you want to get any insights from all that data, you need tools that can process massive amounts of data quickly and efficiently. Fortunately, the open source Big Data landscape is also growing rapidly, and more and more tools are coming to market to help you. One of these open source tools gaining fame at the moment is Apache Spark.

This week I was invited to join the IBM Spark analyst session and the Apache™ Spark Community Event in San Francisco, where the latest news on Apache Spark was shared.

IBM announced its ongoing commitment to the Apache Spark community. At the core of this commitment, IBM wants to offer Spark as a Service on the IBM Cloud as well as integrate Spark into all of its analytics platforms. It has also donated the IBM SystemML machine learning technology to the Spark open source ecosystem, allowing the Spark community to benefit from it. In addition, IBM announced the Spark Technology Centre and the objective of educating over a million data scientists and engineers on Spark.

Apache Spark is currently very hot, and many organizations are using it to do more with their data. It can be seen as the Analytics Operating System, and it has the potential to disrupt the Big Data ecosystem. It is an open source tool with over 400 developers contributing to it. Let me explain what Apache Spark is and show you how it could benefit your organization:

Apache Spark Enables Deep Intelligence Everywhere

Spark is an open source tool that was developed in the AMPLab at UC Berkeley. Apache Spark is a general-purpose engine for large-scale data processing, scaling up to thousands of nodes. It is an in-memory distributed computing engine that adapts to almost any environment. This enables users and developers to build models quickly, iterate faster, and apply deep intelligence to data across the organization.

Spark’s distinguishing feature is its Resilient Distributed Datasets (RDDs). An RDD is a collection of objects stored in memory or on disk across a cluster, which is automatically rebuilt on failure. Spark's in-memory primitives offer performance up to 100 times faster than the two-stage, disk-based MapReduce paradigm, addressing several of MapReduce's challenges.

Spark lets data scientists and developers work together in a unified platform. It enables developers to execute Python or Scala code across a cluster instead of on a single machine. Users can load data into a cluster’s memory and query it repeatedly. This in-memory, iterative design makes Spark particularly useful for machine learning algorithms.

Spark is very well suited for the Big Data era, as it supports the rapid development of Big Data applications. Code can easily be reused across batch, streaming and interactive applications.

In a CrowdChat hosted by IBM the week before the analyst session, several important features of Spark were discussed, including:

Real time querying of your data;

High-speed stream processing of low latency data;

Clear separation of importing data and distributed computation;

Spark’s libraries, including: Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX;

A large community and support for Spark from major vendors including IBM, MapR, Cloudera, Intel, Hortonworks and many other Big Data platforms.

5 Use Cases of Apache Spark

According to Vaibhav Nivargi, early adopters of Spark by sector include consumer-packaged goods (CPG), insurance, media and entertainment, pharmaceuticals, retail, and automotive.

However, there are also multiple use cases for Spark wherever high-velocity, high-volume, constant streams of (un)structured data are generated, most of which will be machine data. Use cases include fraud detection, log processing as part of IT Operations Analytics, sensor data processing and, of course, data related to the Internet of Things.

Experts believe that Spark is likely to evolve into the preferred tool for high-performance Internet of Things applications that will eventually generate petabytes of data across multiple channels. Apache Spark is already used by multiple companies; some examples of these implementations follow:

1. ClearStory Data

ClearStory Data uses Apache Spark as the basis for its data harmonization. Spark enabled the company to create a visualization tool that allows users to slice and dice massive amounts of data while visualizations adjust instantly. This allows their users to collaborate on the data without any delay.

2. Conviva

Another use case of Apache Spark, in particular Spark Streaming, comes from Conviva. Conviva is one of the largest video companies in the world, processing more than 4 billion video streams per month. In order to achieve this, they dynamically select and optimize sources to deliver the highest playback quality.

Spark Streaming has enabled them to learn the different network conditions in real time and feed this directly into the video player. This allows them to optimize the streams and ensure that all four billion monthly streams receive the correct amount of buffering.

3. Yahoo!

Yahoo! has been using Apache Spark for some time and has several projects running with it. One of them uses Spark to offer the right content to the right visitor, i.e. personalization. Machine learning algorithms determine individual visitors’ interests to give them the right news when they visit Yahoo! These same algorithms also help to categorize news stories as they arise.

Personalization is in fact very difficult: it requires high-speed, (near) real-time processing power to build a profile of a visitor the moment they enter your website. Apache Spark helps in this process.

4. Huawei

The telecommunications division of Huawei uses Spark's machine learning libraries to analyse massive amounts of network data. This data can reveal congested subpaths or routers within the network that slow down the traffic. Once they have identified these hotspots, they can add hardware to relieve the congestion.

5. RedRock

RedRock was developed during Spark Hacker Days, a hackathon organized among IBM employees in which 100 innovations were created with Spark within 10 days. One of these innovations is RedRock, which is currently in private alpha.

RedRock is a Twitter analysis tool that allows users to search and analyse Twitter to get insights such as categories, topics, sentiment and geography. It lets the user act on data-driven insights discovered from Twitter.

It was developed following the IBM design thinking framework, a cyclical process of understanding, exploring, prototyping and evaluating. RedRock uses a 19-node Spark cluster with 3.3 terabytes of in-memory Twitter data. Thanks to Spark, RedRock can analyse and visualize this data within seconds for any query.

Five Ways Spark Improves Your Business

Apache Spark already has a lot of traction, and with more companies partnering up and using Spark, it is likely to evolve into something really big. This is quite understandable if you look at the different advantages Spark can offer your business:

1. Spark is the right tool for analytic challenges that demand low-latency in-memory machine learning and graph analytics. This is especially relevant for companies that focus on the Internet of Things.

Since we will see the Internet of Things popping up in every imaginable industry in the coming years, Spark will enable organizations to analyse all that data coming from IoT sensors as it can easily deal with continuous streams of low-latency data. This will enable organizations to create real-time dashboards to explore their data and to monitor and optimize their business.

2. Spark will drastically improve data scientist productivity. It enables faster, iterative product development, using popular programming languages. The high-level libraries of Spark, including streaming data, machine learning, support for SQL queries, and graph processing, can be easily combined to create complex workflows.

This enables data scientists to create new Big Data applications faster. In fact, it requires 2-5x less code. It will result in reduced time-to-market for new products as well as faster access to insights within your organization using these applications.

Spark also enables data scientists to prototype solutions without the requirement to submit code to the cluster every time, leading to better feedback and iterative development.

3. According to James Kobielus, the Internet of Things “may spell the end of data centres as we’ve traditionally known them”. In the coming years we will see that most of the core functions of a data centre, storage and processing, can be decentralized, i.e. performed directly on the IoT devices instead of in a centralized data centre. This is called Fog Computing and was recently described by Ahmed Banafa on Datafloq.

Fog computing will give organizations unprecedented computing power at relatively low costs. Apache Spark is very well suited for the analysis of massive amounts of highly distributed data. Apache Spark could therefore, potentially, be the catalyst required to have fog computing take off and to prepare organizations for the Internet of Things. This in turn could enable organizations to create new products and applications for their customers, creating new business models and revenue streams.

4. Spark’s framework works on top of the Hadoop Distributed File System, which is a major advantage for those who are already familiar with Hadoop and already have a Hadoop cluster within their organization. Spark uses the same data formats and adheres to data locality for efficient processing, and it can be deployed on existing Hadoop clusters or run side by side with them. This allows organizations that already work with Hadoop to easily update their Big Data environment.

5. There is a large community around Spark, with over 400 developers from around the world contributing and a long list of vendors supporting it. Combined with its compatibility with popular programming languages, this gives organizations a large pool of developers who can work with Spark. Instead of having to hire expensive programmers for an unfamiliar tool or programming language, organizations can rely on Spark's libraries and its compatibility with Java, Python and Scala.

The Swiss Army Knife of Big Data Analytics

Spark is a powerful open-source data analytics, cluster-computing framework. It has become very popular because of its speed, iterative computing and better data access thanks to its in-memory caching. Its libraries enable developers to create complex applications faster and better, enabling organizations to do more with their data.

Because of its wide range of applications and its ease of use, Spark is also called the Swiss army knife of Big Data Analytics. And with the buzz already happening around Spark, the large community supporting it and the multiple use cases we have already seen, Spark could evolve into the next big thing within the Big Data ecosystem.

Saturday, May 9, 2015

Yann LeCun's Comments on Extreme Learning Machine (ELM)

Yann LeCun commented on ELM on his Facebook page, which I quote below:

What's so great about "Extreme Learning Machines"?

There is an interesting sociological phenomenon taking place in some corners of machine learning right now. A small research community, largely centered in China, has rallied around the concept of "Extreme Learning Machines".

Frankly, I don't understand what's so great about ELM. Would someone please care to explain?

An ELM is basically a 2-layer neural net in which the first layer is fixed and random, and the second layer is trained. There is a number of issues with this idea.

First, the name: an ELM is *exactly* what Minsky & Papert call a Gamba Perceptron (a Perceptron whose first layer is a bunch of linear threshold units). The original 1958 Rosenblatt perceptron was an ELM in that the first layer was randomly connected.

Second, the method: connecting the first layer randomly is just about the stupidest thing you could do. People have spent the almost 60 years since the Perceptron to come up with better schemes to non-linearly expand the dimension of an input vector so as to make the data more separable (many of which are documented in the 1974 edition of Duda & Hart). Let's just list a few: using families of basis functions such as polynomials, using "kernel methods" in which the basis functions (aka neurons) are centered on the training samples, using clustering or GMM to place the centers of the basis functions where the data is (something we used to call RBF networks), and using gradient descent to optimize the position of the basis functions (aka a 2-layer neural net trained with backprop).

Setting the layer-one weights randomly (if you do it in an appropriate way) can possibly be effective if the function you are trying to learn is very simple, and the amount of labelled data is small. The advantages are similar to that of an SVM (though to a lesser extent): the number of parameters that need to be trained supervised is small (since the first layer is fixed) and easily regularized (since they constitute a linear classifier). But then, why not use an SVM or an RBF net in the first place?

There may be a very narrow area of simple classification problems with small datasets where this kind of 2-layer net with random first layer may perform OK. But you will never see them beat records on complex tasks, such as ImageNet or speech recognition.

The ELM's inventor, G.-B. Huang, replied by pointing out that the answers can be found in his paper: "What are Extreme Learning Machines? Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle"
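For concreteness, the architecture LeCun describes, a fixed random first layer followed by a trained linear readout, can be sketched in a few lines of NumPy. This is an illustrative toy with made-up data, not Huang's reference implementation:

```python
# Illustrative ELM-style classifier: a fixed random hidden layer plus a
# trained linear readout. Toy data and sizes are made up for this sketch.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class problem: two Gaussian blobs in 2-D.
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# Layer 1: random weights, fixed forever (never trained).
n_hidden = 50
W = rng.normal(size=(2, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)  # random nonlinear feature expansion

# Layer 2: the only trained part -- a ridge-regularized least-squares readout.
lam = 1e-3
beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

accuracy = np.mean(np.sign(H @ beta) == y)
```

On these well-separated blobs the random expansion is enough for the linear readout to classify nearly perfectly, which is consistent with LeCun's point: the scheme can work on very simple problems, while offering no mechanism to learn better features for hard ones.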

Sunday, January 25, 2015

IEEE-CS Unveils Top 10 Technology Trends for 2015

LOS ALAMITOS, Calif., 1 December 2014 – In the coming year, while consumers will be treated to a dizzying array of augmented reality, wearables, and low-cost 3D printers, computer researchers will be tackling the underlying technology issues that make such cutting-edge consumer electronics products possible. IEEE Computer Society today announced the top 10 most important technology trends for 2015 and explores how these technologies will be integrated into daily life.

Cybersecurity in general will remain a critical concern, with increased focus on security for cloud computing and deeply embedded devices. And interoperability and standards will be top priorities to unleash the potential of Software-defined Anything (SDx) and the Internet of Anything (IoA).

"Researchers have been working to address these issues for a number of years, however 2015 should see real progress in these areas," said incoming IEEE Computer Society President Thomas Conte, an electrical and computer science professor at Georgia Tech. "We are reaching an inflection point for 3D printing, which will revolutionize manufacturing, and the exponential growth in devices connected to the Internet makes interoperability and standards critical."

Among the advances that IEEE Computer Society experts forecast:

1) The time is right for wearable devices: Imagine a wearable device that tells time, sends and receives email and messages, makes calls, and even tracks exercise routines. Smartwatches hitting the market do all that and more. Both established players and small startups in 2015 will be actively involved in developing new devices, applications, and protocols for the wearable electronics market.

2) Internet of Anything will become all-encompassing: The reality that up to 26 billion things will be connected on the Internet by 2020 is sinking in. The Internet of Things and Internet of Everything in 2015 will morph into the Internet of Anything. IoA envisions a common software "ecosystem" capable of accommodating any and all sensor inputs, system states, operating conditions, and data contexts — an overarching "Internet Operating System."

3) Building security into software design: As the volume of data explodes, along with the means to collect and analyze that information, building security into software design and balancing security and privacy are becoming top priorities.

4) Industry will tackle Software-defined Anything (SDx) interoperability and standards: Software-defined networking's programmability will turn various network appliances into a warehouse of apps. Several standards groups are working on interoperability issues, including the Open Networking Foundation (ONF), the Internet Engineering Task Force (IETF), the European Telecommunications Standards Institute (ETSI), and the International Telecommunication Union (ITU).

5) Cloud security and privacy concerns grow: The celebrity photo hacking scandal and iCloud breach in China in 2014 have brought cloud security to the forefront for 2015. Enterprises are moving workloads to the cloud and expecting enterprise-level security. To avoid system fragility and defend against vulnerability exploitation by cyber attackers, various cybersecurity techniques and tools are being developed for cloud systems.

6) 3D Printing is poised for takeoff: Next year will see production of the first 3D-printed car. The 3D-printed car is just one of hundreds of uses that enterprises and consumers are finding for 3D printing, which will revolutionize manufacturing by lowering costs and time to market. Also in 2015, sales of 3D printers are expected to take off, driven by low-cost printing and uptake by a variety of industries.

7) Predictive Analytics will be increasingly used to identify outcomes: Business intelligence in 2015 will be less about examining the past and more about predicting the future. While predictive modeling techniques have been researched by the data mining community for several decades, they are now impacting every facet of our lives. In organizational settings, predictive analytics has gained widespread adoption over the past decade as firms look to compete on analytics.

8) Embedded Computing security will get added scrutiny: Deeply-embedded computing systems often perform extremely sensitive tasks, and in some cases, such as healthcare IT, these are lifesaving. Emerging deeply-embedded computing systems are prone to more serious or life-threatening malicious attacks. These call for revisiting traditional security mechanisms not only because of the new facets of threats and more adverse effects of breaches, but also due to the resource limitations of these often-battery-powered and extremely constrained computing systems.

9) Augmented Reality Applications will grow in popularity: Mobile apps using augmented reality help the colorblind see colors, travelers explore unfamiliar cities, shoppers imagine what they look like in different outfits, and even help drivers locate their parked cars. The availability of inexpensive graphics cards and sensors, and the popularity of applications in such areas as gaming and virtual worlds, is bringing augmented reality into the mainstream.

10) Smartphones will provide new opportunities for Continuous Digital Health: The way we deal with our health is undergoing a major transformation, not only because mobile Internet technology has made it possible to have continuous access to personal health information, but also because breaking the trend of ever-growing healthcare costs is increasingly necessary. Connectivity, interoperability, sensing, and instant feedback through smartphones all provide new opportunities for gaining insights into our health behavior.

To view the full list of IEEE Computer Society technology trends, visit

Sunday, January 11, 2015

Wearables, Sensables, and Opportunities at CES 2015

ApplySci released a summary of wearable healthcare devices at CES 2015, which I quote below. ApplySci commented that "Samsung's Simband is best positioned to take wearables into medical monitoring". Note that in Simband, PPG is used to monitor heart rate for fitness tracking and health monitoring, and I have already developed superior PPG-based heart rate estimation algorithms for this task. Some general frameworks have been published in, or are under review by, IEEE Transactions on Biomedical Engineering. You can check my website for more details:

It was the year of Digital Health and Wearable Tech at CES.  Endless watches tracked vital signs (and many athletes exercised tirelessly to prove the point).   New were several ear based fitness monitors (Brag), and some interesting TENS pain relief wearables (Quell).  Many companies provided  monitoring for senior citizens, and the most interesting only notified caregivers when there was a change in learned behavior (GreenPeak).  Senior companion robots were missing, although robots capable of household tasks were present (Oshbot).  3D printing was big (printed Pizza)–but where were 3d printed bones and organs?  Augmented reality was popular (APX,Augmenta)–but mostly for gaming or industrial use.  AR for health is next.

Two companies continue to stand out in Digital Health.  Samsung’s Simband  is best positioned to take wearables into  medical monitoring, with its multitude of sensors, open platform, and truly advanced health technologies. And  MC10‘s electronics that bend, stretch, and flex will disrupt home diagnosis, remote monitoring, and smart medical devices.

We see two immediate opportunities.  The brain, and the pulse.

1.  A few companies at CES claimed to monitor brain activity, and one savvy brand (Muse) provided earphones with soothing sounds while a headband monitored attention.  While these gadgets were fun to try, no one at CES presented extensive brain state interpretation to address cognitive and emotional issues.

2.  Every athlete at CES used a traditional finger based pulse sensor.  A slick wearable that can forgo the finger piece will make pulse oximetry during sports fun, instead of awkward.  As with every gadget, ensuring accuracy is key, as blind faith in wearables can be dangerous.

ApplySci looks forward to CES 2016, and the many breakthroughs to be discovered along the way, many of which will be featured at Wearable Tech + Digital Health NYC 2015.

Thursday, November 27, 2014

A smartwatch for elders that works independently of a Wi-Fi network

We have seen many smartwatches made by major IT giants and many startups. Now here is another smartwatch made by a crowdfunded startup, Lively.

This senior monitoring sensor system operates independently of a Wi-Fi network: it relies on the Lively Hub, which has its own cellular connection.

The watch gives medicine reminders, or alerts when medicine has been missed. It has an emergency button that contacts a dispatcher and alerts family members. Fitness tracking features, such as steps taken, are included and can be viewed by the wearer or family. The wearable is waterproof and can be worn in the shower.

Saturday, November 1, 2014

ECG on the run: Continuous ECG surveillance of marathon athletes is feasible

Article From:

The condition of an athlete's heart has for the first time been accurately monitored throughout the duration of a marathon race. The real-time monitoring was achieved by continuous electrocardiogram (ECG) surveillance and data transfer over the public mobile phone network to a telemedicine centre along the marathon route. This new development in cardiac testing in endurance athletes, said investigators, "would allow instantaneous diagnosis of potentially fatal rhythm disorders."

Following trials in two marathon races, the investigators now describe online ECG surveillance as feasible and "a promising preventive concept." They explain in their first report of the technique how "in the case of life-threatening arrhythmias, the emergency services located along the running track could be alerted to take the runners at risk out of the race and start extended cardiologic diagnostics and treatment."

The investigators, from the Center for Cardiovascular Telemedicine, Charité-Universitätsmedizin in Berlin, present their results at the first European Congress on e-Cardiology and e-Health, with a full report published in the European Journal of Preventive Cardiology.

Proof of the method's concept was achieved during two marathon races in Germany, in each of which five healthy runners (mean age 41.7 years) were equipped with a small ECG device and smart phone worn on the arm. Data transfer between the ECG monitor and phone was by Bluetooth technology. The ECG data were streamed from the device to the investigators' telemedicine centre in Berlin, where the data were monitored live and stored for later analysis.

During the trials all ten participants completed the two marathons without problems (in a mean time of 3h 37min) but there were differences in the quality of ECG streaming. In the first race, with more than 7000 runners and 150,000 spectators, there was virtually no accurate streaming from the ECG device because of errors in both the Bluetooth connection and connectivity of the phone to the mobile phone network.

New software to connect both devices was thus introduced for the second race six months later (with more than 15,000 runners and 300,000 spectators), and the athletes were asked to wear each device (ECG and smart phone) on the same arm. As a result of the changes, the quality of streaming ECG data was "excellent," with mean transfer time for an ECG wave complex measured at just 72 seconds.

Thus, on this second attempt feasibility was demonstrated in the two essential parameters: rapid transfer time of ECG data; and the continuity of ECG information between individual mobile phones and the medical centre.

Next on the agenda, say the investigators, is a miniaturised ECG device "to improve comfort and acceptance." And generally, they add, the system should ideally "be able to transmit ECG signals reliably, even under extreme conditions, such as running, with extensive body movements of the sweating athletes. Moreover, there should be no interruption in ECG data transfer within a mobile phone network, even under the condition of an extreme workload caused by thousands of mobile phone customers (athletes and spectators) allocated in a very limited geographical area."

As background to the study the investigators note that sudden cardiac death is rare -- though not unknown -- among marathon runners and other endurance athletes. In 2012 one 42-year-old runner died at the end of the London marathon, the event's second death in three years. All such tragic events are widely publicised -- as in the case of UK soccer player Fabrice Muamba, whose heart stopped for 78 minutes during a televised game in 2012 -- and raise inevitable questions about cardiovascular evaluation in endurance sports. In Italy, for example, the risk of sudden cardiac death is now considered so real that preparticipation screening (with ECG) is obligatory in all athletes and sports players.

This study's principal investigator, Professor Friedrich Köhler, confirmed that the risk of sudden cardiac death in endurance running is indeed "low," citing a 2012 study in which the incidence of cardiac arrests was put at 0.54 per 100,000 runners. However, he added that preparticipation screening is not able to detect this risk with any certainty.

Now, for real time evaluation to step up from proof of concept to the practical level, there are still technical problems to solve. "First," said Professor Köhler, "we need a way to handle the monitoring of, say, a thousand runners. One solution could be some 'intelligent' IT middleware, which might identify and select all pathological ECG findings for further analysis in the telemedical centre. And second, if the system does signal an abnormal ECG, how do we identify that individual runner at risk among so many runners -- and how do we alert the paramedics out on the course?"

Nevertheless, the two experiments reported today suggest that the concept works, and Professor Köhler indicated its high public health potential in other areas. "The marathon might be just a first indication for the continuous surveillance of vital parameters with mobile phone technology," he said. "There are opportunities in other endurance sports and even in other fields -- perhaps in the drivers of high speed express trains."

Story Source:
The above story is based on materials provided by the European Society of Cardiology. Note: Materials may be edited for content and length.

Journal Reference:
  1. S. Spethmann, S. Prescher, H. Dreger, H. Nettlau, G. Baumann, F. Knebel, F. Koehler. Electrocardiographic monitoring during marathon running: a proof of feasibility for a new telemedical approach. European Journal of Preventive Cardiology, 2014; 21 (2 Suppl): 32. DOI: 10.1177/2047487314553736

Friday, October 31, 2014

Two Papers are Ranked the No.1 and No.2 Most Cited Articles published in 2013 and 2014 in IEEE T-BME

I just learned that my two papers are ranked the No.1 and No.2 Most Cited Articles Published in 2013 and 2014 in the journal IEEE Transactions on Biomedical Engineering. The journal link is here:

The two papers are:

Compressed Sensing for Energy-Efficient Wireless Telemonitoring of Noninvasive Fetal ECG via Block Sparse Bayesian Learning
Zhilin Zhang, Tzyy-Ping Jung, Scott Makeig, Bhaskar D. Rao
IEEE Trans. on Biomedical Engineering, vol. 60, no. 2, pp. 300-309, 2013
[Remark: This paper used BSBL to reconstruct raw fetal ECG. It showed that BSBL can directly recover non-sparse correlated signals without using any dictionary matrix. It may be the first solid evidence showing that exploiting correlation is an effective way to reconstruct non-sparse signals in any domain.]

Compressed Sensing of EEG for Wireless Telemonitoring with Low Energy Consumption and Inexpensive Hardware
Zhilin Zhang, Tzyy-Ping Jung, Scott Makeig, Bhaskar D. Rao
IEEE Trans. on Biomedical Engineering, vol. 60, no. 1, pp. 221-224, 2013
[Remark: This paper applied BSBL to wireless telemonitoring of EEG, showing that DCT coefficients can be recovered using a block-structure model. Some explanations of why the DCT dictionary matrix is effective for raw EEG are given in my ST-SBL paper.]

Thursday, October 23, 2014

What the Internet of Things Will Look Like in 2025 (Infographic)

There is a post showing the technical development and future of the smart home. It's very interesting and closely related to my research:

(Picture from the post in

Wednesday, May 29, 2013

Use Block Sparse Bayesian Learning (BSBL) for Practical Problems

On LinkedIn, Phil, Leslie, and I had a productive discussion on the use of BSBL and EM-GM-AMP in practical problems. The whole discussion can be seen here. For convenience, I copied my words on the use of BSBL and related issues below.

Below are several practical examples where intra-block correlation exists and BSBL can be used. In fact, there are many examples in practice.

1. Localization of distributed sources (not point-sources). In EEG/MEG source localization, the sources are generally not single points; they have spatial extent, so they are called distributed sources. When modeling the problem using a sparse linear regression model y = Ax + v, the coefficient vector x is expected to contain several nonzero blocks, each nonzero block corresponding to a distributed source. Entries in the same nonzero block are highly correlated in amplitude, since they are all associated with the same source.

(Remark: There is also a body of work assuming the sources are point sources, in which case the problem becomes a traditional DOA problem. However, BSBL can be applied in this case as well, as I will explain later.)

2. In compressed sensing of ECG, an ECG signal has a clear block structure (generally all blocks are nonzero, but some are very close to zero), and within each block the entries are highly correlated in amplitude. Fig. 1 in my T-BME paper gives a feeling for this.

(Remark: Almost all physiological signals have correlation in the time domain, although some of them do not have a clear block structure. However, BSBL can also be used for such signals, as I will explain later.)

3. One should note that the mathematical model used in compressed sensing is a linear regression model, and such a model has numerous applications in almost every field. For many applications, the regression coefficients have block structure: in each block, the entries are associated with the same physical "force" or "factor", and thus are correlated in amplitude.

In fact, as long as a signal has block structure, it is highly possible that intra-block correlation also exists. Even if the intra-block correlation is 0, it is not harmful to run BSBL, because, as you can see from my T-SP paper, it also has excellent performance (and outperforms all the known block-structure-based algorithms) when the intra-block correlation is 0. You do not need to test whether the correlation exists. You just need to run it. That's it.

Now I want to emphasize that BSBL can also be used for general-sparse problems (i.e., problems with no structure in the coefficient vector x). This is due to two factors.

1. One factor is that truly sparse signals do not exist in most practical problems; in fact, such signals are non-sparse.

As I mentioned in our personal communication, and as Phil and many others have also noted, truly sparse signals do not exist in most practical problems. Those "sparse" signals are compressible: most of their entries in the time domain (or their coefficients in some transformed domain) are close to zero but strictly nonzero. In other words, they are non-sparse! General-sparse recovery algorithms (such as Lasso, CoSaMP, OMP, FOCUSS, and basic SBL) can recover the entries with large amplitude, but they always struggle to recover the entries with small amplitude. So the quality of the recovered data has a "glass ceiling", which is not very high.
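This "compressible rather than truly sparse" picture, and the resulting glass ceiling, can be sketched numerically. The power-law decay below is an illustrative toy signal of my own, not real ECG or EEG:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 256

# A "compressible" vector: amplitudes decay like a power law, so every entry
# is strictly nonzero, but only a few are large.
x = np.sign(rng.standard_normal(M)) * (np.arange(1, M + 1) ** -1.5)

k = 16                                   # keep only the k largest entries
idx = np.argsort(np.abs(x))[::-1][:k]
x_k = np.zeros(M)
x_k[idx] = x[idx]

# An algorithm that recovers only the large entries is limited by the energy
# in the small-but-nonzero tail -- the "glass ceiling".
tail = np.linalg.norm(x - x_k) / np.linalg.norm(x)
print(np.count_nonzero(x), round(tail, 3))
```

Every one of the 256 entries is nonzero, yet the best 16-term approximation still leaves a small but irreducible residual, which is exactly the ceiling a general-sparse solver hits.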

However, BSBL has a unique property: it recovers non-sparse signals (or signals with non-sparse representation coefficients) very well. I have a number of works demonstrating this. Please refer to my T-BME papers:
[1] Compressed Sensing for Energy-Efficient Wireless Telemonitoring of Noninvasive Fetal ECG via Block Sparse Bayesian Learning
[2] Compressed Sensing of EEG for Wireless Telemonitoring with Low Energy Consumption and Inexpensive Hardware

In fact, we have achieved much better results than [1-2], which will be released soon.

2. The second factor is that BSBL is a kind of multiscale regression algorithm (although a very naïve one).

In DOA and similar applications (e.g., earthquake detection, brain source localization, and some communication problems), the matrix A (recall the model y = Ax + v) is highly coherent (i.e., the columns of A are highly correlated with each other). In this case, even if x is a very sparse vector, the problem is very difficult, especially in noisy situations. Basic SBL achieves better performance than most existing algorithms, but we found that BSBL performs much better still. This is mainly because BSBL, via the block partition, divides the whole search space (all candidate locations in x) into a number of sub-search-spaces (the candidate nonzero blocks), which makes the localization problem easier. Since x is sparse, there are generally only a few nonzero blocks and many zero blocks. During the iterations, BSBL gradually prunes the zero blocks, so the problem becomes easier and easier. One could dynamically change the block partition in BSBL according to some criterion, but experience shows this is not necessary: BSBL eventually finds the correct locations of the nonzero entries in x (although the remaining entries in a nonzero block have very small amplitude).

Fig. 4 in my ICASSP 2012 paper may give some feeling for what I mean by "the remaining entries in a nonzero block have very small amplitude".
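To give a rough sense of the coherence issue, the toy sketch below builds a steering-vector dictionary for a uniform linear array over a fine angle grid (all parameters are illustrative choices of mine) and measures its mutual coherence; it also counts how a block partition shrinks the number of candidate supports:

```python
import numpy as np

# Steering-vector dictionary for a uniform linear array over a 1-degree angle
# grid (illustrative parameters). Such dictionaries are highly coherent:
# neighboring columns are almost parallel, which makes DOA-type problems hard.
m = 10                                    # sensors
grid = np.deg2rad(np.arange(-90, 91))     # 181 candidate directions
A = np.exp(1j * np.pi * np.outer(np.arange(m), np.sin(grid)))

# Mutual coherence: largest |inner product| between distinct unit-norm columns.
An = A / np.linalg.norm(A, axis=0)
G = np.abs(An.conj().T @ An)
np.fill_diagonal(G, 0.0)
mu = G.max()                              # close to 1 for this dictionary

# A block partition with blocks of 16 shrinks the search from 181 candidate
# locations to ceil(181/16) = 12 candidate blocks, and zero blocks are then
# pruned as the iterations proceed.
n_blocks = -(-181 // 16)
print(round(mu, 3), n_blocks)
```

The near-unit coherence is why even a very sparse x is hard to localize here, and the much smaller number of candidate blocks is the "sub search spaces" effect described above.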

Last but not least, I want to say that a comparison between BSBL and other algorithms would be helpful to everybody, and I would be very happy to see the results. But I suggest performing the comparison on practical problems, since readers generally come from various application fields with questions like "which algorithm is the best for my problem?". Putting the comparison in the context of specific practical problems is more informative and genuinely helpful. Computer simulations raise several issues: choosing a suitable performance index, ensuring the simulation model is consistent with the underlying models of practical problems, and deciding on the criterion for choosing among algorithms (e.g., for an audio compressed sensing problem, should one treat it as a general-sparse recovery problem, a block-sparse recovery problem, or a non-sparse recovery problem?). Because of these issues, the conclusions may not be solid, and may even be more or less misleading.

MSE is generally not a good performance index. For example, MSE is not recommended for assessing image quality (what the suitable performance index for images should be is still a hot topic in the image processing field). In my experience with compressed sensing of EEG, I even found MSE consistently misleading when comparing BSBL with my ST-SBL algorithms. This is why, in my two T-BME papers (and other forthcoming papers), I used a task-oriented performance measure. That is, after recovering the data,

(1) performing task-required signal processing or pattern recognition on the recovered data, obtaining result A;
(2) performing the same task-required signal processing or pattern recognition on the original data (or the recovered data by another algorithm), obtaining result B;
(3) comparing result A with result B, which tells me which algorithm has better data recovery ability.
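A minimal sketch of this task-oriented comparison, using a nearest-centroid classifier on synthetic two-class data as a stand-in for the task-required processing (the data, classifier, and noise level are all hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the task-required pattern recognition: a nearest-centroid
# classifier applied to each row of X.
def nearest_centroid_predict(X, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Synthetic two-class data standing in for epochs of a physiological signal.
n, dim = 200, 32
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, dim)) + 3.0 * labels[:, None]
centroids = np.stack([X[labels == c].mean(axis=0) for c in (0, 1)])

# Pretend the "recovered" data is the original plus some algorithm's
# reconstruction error.
X_rec = X + 0.3 * rng.standard_normal((n, dim))

# Result B: task performance on the original data.
acc_orig = (nearest_centroid_predict(X, centroids) == labels).mean()
# Result A: the same task on the recovered data.
acc_rec = (nearest_centroid_predict(X_rec, centroids) == labels).mean()
print(acc_orig, acc_rec)
```

The comparison of interest is between the two task accuracies, not between X and X_rec sample by sample; a recovery algorithm with worse MSE can still preserve the task-relevant structure better.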

Many people have asked me whether my BSBL codes can be used for complex-valued problems. My answer is YES, but you need to transform your complex-valued problem into a real-valued one first, as I showed here. The transform is very simple; it takes no more than a minute.
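The transform is the standard real-valued embedding of a complex linear model. A small numpy sketch (the function name to_real is my own, for illustration):

```python
import numpy as np

# Real-valued embedding of a complex model y = A x:
#   [Re(y)]   [Re(A)  -Im(A)] [Re(x)]
#   [Im(y)] = [Im(A)   Re(A)] [Im(x)]
# so any real-valued solver (e.g. the BSBL codes) can be applied directly.
def to_real(A, y):
    A_r = np.block([[A.real, -A.imag],
                    [A.imag,  A.real]])
    y_r = np.concatenate([y.real, y.imag])
    return A_r, y_r

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
y = A @ x

A_r, y_r = to_real(A, y)
x_r = np.concatenate([x.real, x.imag])
print(np.allclose(A_r @ x_r, y_r))   # True
```

After solving the real-valued system, the first and second halves of the solution give the real and imaginary parts of x.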

I will update my BSBL codes in the near future so that the transform is no longer needed.

Friday, April 12, 2013

Compressed Sensing of EEG Using Wavelet Dictionary Matrices

Since my paper "Compressed Sensing of EEG for Wireless Telemonitoring with Low Energy Consumption and Inexpensive Hardware" (IEEE T-BME, vol. 60, no. 1, 2013) was published, lots of people have asked me how to do compressed sensing of EEG using wavelets. Their problem was that Matlab has no built-in function to generate the DWT basis matrix (i.e., the matrix D in my paper); one has to generate such matrices using other wavelet toolboxes. I have now updated my codes with a guide to generating such dictionary matrices using Wavelab, plus a demo (demo_useDWT.m) showing how to use a DWT basis matrix as the dictionary matrix for compressed sensing of EEG. The codes can be downloaded here.
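For readers without Wavelab, an orthonormal DWT basis matrix can also be built directly. The sketch below constructs a Haar DWT matrix in numpy as a self-contained stand-in; the actual codes use Wavelab-generated dictionaries, and Haar is just the simplest wavelet to build by hand:

```python
import numpy as np

# Build an orthonormal Haar DWT matrix W of size n x n (n a power of two).
# The dictionary matrix D (synthesis basis) is then the transpose, D = W.T.
def haar_matrix(n):
    W = np.array([[1.0]])
    while W.shape[0] < n:
        m = W.shape[0]
        top = np.kron(W, np.array([1.0, 1.0]))            # scaling rows
        bot = np.kron(np.eye(m), np.array([1.0, -1.0]))   # wavelet rows
        W = np.vstack([top, bot]) / np.sqrt(2.0)
    return W

n = 256
W = haar_matrix(n)
D = W.T                      # synthesis dictionary: signal = D @ coefficients

# Orthonormality check and a round trip on a toy random-walk "EEG" segment.
sig = np.cumsum(np.random.default_rng(5).standard_normal(n))
coeffs = W @ sig
print(np.allclose(W @ W.T, np.eye(n)), np.allclose(D @ coeffs, sig))
```

In the compressed sensing model one then senses y = Phi @ sig = (Phi @ D) @ coeffs, so Phi @ D plays the role of the matrix A and the recovered coefficients are mapped back through D.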

BTW: Please keep in mind that EEG is generally not sparse except in some special situations.

Wednesday, December 12, 2012

I successfully defended my Ph.D. dissertation today

I have two good reasons to remember today. One is that I successfully defended my Ph.D. dissertation. The other is that today is 12/12/12 -- you won't have a day with this triple pattern again in this century :)

(The EBU-1 building of UCSD; the little house on the top is the `Fallen Star')

Wednesday, December 5, 2012

Welcome to attend my dissertation defense on Dec.12

Finally, my dissertation defense is scheduled for 9:15am - 11:15am on Dec. 12 (Wednesday) in EBU1 4309.

Welcome to attend!

Below is the title and the abstract of my presentation.

Sparse Signal Recovery Exploiting Spatiotemporal Correlation

Sparse signal recovery algorithms have significant impact on many fields, including signal and image processing, information theory, statistics, data sampling and compression, and neuroimaging. The core of sparse signal recovery algorithms is to find a solution to an underdetermined inverse system of equations, where the solution is expected to be sparse or approximately sparse. Motivated by practical problems, numerous algorithms have been proposed. However, most algorithms ignore the correlation among nonzero entries of a solution, which is often encountered in a practical problem. Thus, it is unclear how this correlation affects an algorithm's performance and whether the correlation is harmful or beneficial.

This work aims to design algorithms which can exploit a variety of correlation structures in solutions and reveal the impact of these correlation structures on algorithms' recovery performance.

To achieve this, a block sparse Bayesian learning (BSBL) framework is proposed. Based on this framework, a number of sparse Bayesian learning (SBL) algorithms are derived to exploit intra-block correlation in a canonical block sparse model, temporal correlation in a canonical multiple measurement vector model, spatiotemporal correlation in a spatiotemporal sparse model, and local temporal correlation in a canonical time-varying sparse model. Several optimization approaches are employed in the algorithm development, including the expectation-maximization method, the bound-optimization method, and the fixed-point method. Experimental results show that these algorithms significantly outperform existing algorithms.

With these algorithms, we find that different correlation structures affect the quality of estimated solutions to different degrees. However, if these correlation structures are present and exploited, algorithms' performance can be largely improved. Inspired by this, we connect these algorithms to Group-Lasso type algorithms and iterative reweighted $\ell_1$ and $\ell_2$ algorithms, and suggest strategies to modify them to exploit the correlation structures for better performance.

The derived SBL algorithms have been used with considerable success in various challenging applications such as wireless telemonitoring of raw physiological signals and prediction of cognition levels of patients from their neuroimaging measures. In the former application, the derived SBL algorithms are the only algorithms so far that achieve satisfactory results. This is because raw physiological signals are neither sparse in the time domain nor sparse in any transformed domains, while the derived SBL algorithms can maintain robust performance for these signals. In the latter application, the derived SBL algorithms achieved the highest prediction accuracy on common datasets, compared to published results. This is because the BSBL framework provides flexibility to exploit both correlation structures and nonlinear relationship between response variables and predictor variables in regression models.

Sunday, November 25, 2012

A new BSBL algorithm has been derived

A fast BSBL algorithm has been derived, and the work has been submitted to IEEE Signal Processing Letters.

Below is the paper:

Fast Marginalized Block SBL Algorithm
by Benyuan Liu, Zhilin Zhang, Hongqi Fan, Zaiqi Lu, Qiang Fu

The preprint can be downloaded at:

Here is the abstract:
The performance of sparse signal recovery can be improved if both the sparsity and the correlation structure of signals are exploited. One typical correlation structure is intra-block correlation in block sparse signals. To exploit this structure, a framework called the block sparse Bayesian learning (BSBL) framework has been proposed recently. Algorithms derived from this framework show promising performance, but their speed is not very fast, which limits their applications. This work derives an efficient algorithm from the framework using a marginalized likelihood maximization method, so it can exploit both block sparsity and intra-block correlation of signals. Compared to existing BSBL algorithms, it has similar recovery performance but is much faster, making it more suitable for recovering large-scale datasets.

Scientists See Promise in Deep-Learning Programs

The New York Times just published a report on deep learning. Some of the applications mentioned in the report really refreshed my knowledge of the area. Enjoy!