My blogs reporting quantitative financial analysis, artificial intelligence for stock investment & trading, and latest progress in signal processing and machine learning

Wednesday, June 17, 2015

5 Reasons Apache Spark is the Swiss Army Knife of Big Data Analytics


We are living in exponential times, especially when it comes to data. The world is moving fast and more data is generated every day. With all that data coming your way, you need the right tools to deal with the growing volume.

If you want to get any insights from all that data, you need tools that can process massive amounts of it quickly and efficiently. Fortunately, the Big Data open source landscape is also growing rapidly, and more and more tools are coming to market to help with this. One open source tool that is gaining fame at the moment is Apache Spark.

This week I was invited to join the IBM Spark analyst session and the Apache™ Spark Community Event in San Francisco, where the latest news on Apache Spark was shared.

IBM announced its continued commitment to the Apache Spark community. At the core of this commitment, IBM wants to offer Spark as a Service on the IBM Cloud as well as integrate Spark into all of its analytics platforms. It has also donated its SystemML machine learning technology to the Spark open source ecosystem, allowing the Spark community to benefit from it. In addition, IBM announced the Spark Technology Centre and the objective to educate over a million data scientists and engineers on Spark.

Apache Spark is currently very hot and many organizations are using it to do more with their data. It can be seen as the Analytics Operating System and it has the potential to disrupt the Big Data ecosystem. It is an open source tool with over 400 contributing developers. Let me explain what Apache Spark is and show you how it could benefit your organization:

Apache Spark Enables Deep Intelligence Everywhere

Spark is an open source tool that was developed in the AMPLab at UC Berkeley. It is a general-purpose engine for large-scale data processing that scales to thousands of nodes. It is an in-memory distributed computing engine that adapts to a wide range of environments. This enables users and developers to quickly build models, iterate faster and apply deep intelligence to data across the organization.

Spark’s distinguishing feature is its Resilient Distributed Datasets (RDDs). An RDD is a collection of objects that can be stored in memory or on disk across a cluster and that is automatically rebuilt on failure. Spark’s in-memory primitives can make it up to 100 times faster than the two-stage, disk-based MapReduce paradigm. It therefore addresses several of the challenges of MapReduce.

Spark lets data scientists and developers work together on a unified platform. It enables developers to execute Python or Scala code across a cluster instead of on one machine. Users can load data into a cluster’s memory and query it repeatedly. This makes Spark an advanced analytics tool that is particularly useful for machine learning algorithms, which typically pass over the same data many times.
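To make the caching and rebuild-on-failure idea more concrete, here is a minimal single-process sketch in plain Python. It is not Spark code: `ToyRDD`, `lose_cache` and `compute_count` are hypothetical names invented for this illustration, and a real RDD is partitioned across many machines. The sketch only shows the principle that a dataset remembers how it was derived, so a lost in-memory copy can be recomputed.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, single process, no partitions)."""

    def __init__(self, compute, parent=None):
        self._compute = compute        # how to derive rows from the parent's rows
        self._parent = parent          # lineage: the dataset this one came from
        self._cache_enabled = False
        self._cache = None
        self.compute_count = 0         # how often the rows were actually computed

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda _rows: list(data))

    def map(self, fn):
        return ToyRDD(lambda rows: [fn(r) for r in rows], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda rows: [r for r in rows if pred(r)], parent=self)

    def cache(self):
        """Ask for the rows to be kept in memory after the first computation."""
        self._cache_enabled = True
        return self

    def lose_cache(self):
        """Simulate losing the in-memory copy, for example after a node failure."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)   # served from memory, nothing recomputed
        parent_rows = self._parent.collect() if self._parent is not None else None
        rows = self._compute(parent_rows)
        self.compute_count += 1
        if self._cache_enabled:
            self._cache = rows
        return list(rows)


logs = ToyRDD.from_list(["INFO ok", "ERROR disk", "ERROR net", "INFO done"])
errors = logs.filter(lambda line: line.startswith("ERROR")).cache()

first = errors.collect()    # computed from the source data, then cached
second = errors.collect()   # answered from the cache
errors.lose_cache()         # the cached copy disappears
third = errors.collect()    # rebuilt by replaying the recorded filter step
```

After these calls `errors.compute_count` is 2: once for the first query and once for the rebuild, while the second query was served from memory.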

Spark is very well suited for the Big Data era, as it supports the rapid development of Big Data applications. Code can easily be reused across batch, streaming and interactive applications.

In a CrowdChat hosted by IBM the week before the analyst session, several important features of Spark implementations were discussed, including:

Real time querying of your data;

High-speed stream processing of low-latency data;

Clear separation of importing data and distributed computation;

Spark’s libraries, including: Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX;

A large community and support for Spark from major vendors including IBM, MapR, Cloudera, Intel, Hortonworks and many other Big Data platforms.

5 Use Cases of Apache Spark

According to Vaibhav Nivargi, early adopters of Spark by sector include consumer-packaged goods (CPG), insurance, media and entertainment, pharmaceuticals, retail and automotive.

However, there are also multiple use cases for Spark wherever high-velocity, high-volume, constant streams of (un)structured data are generated, most of which will be machine data. Use cases include fraud detection, log processing as part of IT Operations Analytics, sensor data processing and, of course, data related to the Internet of Things.

Experts believe that Spark is likely to evolve into the preferred tool for high-performance Internet of Things applications that will eventually generate petabytes of data across multiple channels. Apache Spark is already used by multiple companies. Some examples of these implementations follow:

1. ClearStory Data

ClearStory Data uses Apache Spark as the basis for its data harmonization. Spark enabled the company to create a visualization tool that allows users to slice and dice massive amounts of data while visualizations adjust instantly. This allows users to collaborate on the data without any delay.

2. Conviva

Another user of Apache Spark, in particular Spark Streaming, is Conviva. Conviva is one of the largest video companies in the world, processing more than 4 billion video streams per month. To achieve this, it dynamically selects and optimizes sources to deliver the highest playback quality.

Spark Streaming has enabled them to learn the different network conditions in real-time and feed this directly into the video player. This allows them to optimize the streams and ensure that all four billion videos receive the correct amount of buffering.

3. Yahoo!

Yahoo! has been using Apache Spark for some time and has several projects running on it. One of them uses Spark to offer the right content to the right visitor, i.e. personalization. Machine learning algorithms determine individual visitors’ interests to give them the right news when they visit Yahoo!. These same algorithms also help to categorize news stories as they arise.

Personalization is in fact very difficult and it requires high-speed, (near) real-time processing power to understand a profile of a visitor upon entering your website. Apache Spark helps in this process.

4. Huawei

The telecommunications division of Huawei uses the Machine Learning libraries of Spark to analyse massive amounts of network data. This network data can show traffic patterns that reveal subpaths or routers within the network that are congested and that slow down the traffic. When they have identified the hotspots in the network, they can add hardware to solve the congestion.

5. RedRock

RedRock was developed during Spark Hacker Days. This hackathon was organized among IBM employees, and within 10 days they created 100 innovations with Spark. One of these innovations is RedRock, which is currently in private alpha.

RedRock is a Twitter analysis tool that allows users to search and analyse Twitter to get insights such as categories, topics, sentiment and geography. It lets the user act on data-driven insights discovered from Twitter.

It was developed following the IBM design thinking framework, a cyclical process of understanding, exploring, prototyping and evaluating. RedRock uses a 19-node Spark cluster holding 3.3 terabytes of Twitter data in memory. Thanks to Spark, RedRock can analyse and visualize this amount of data within seconds for any query.

Five Ways Spark Improves Your Business

Apache Spark already has a lot of traction, and with more companies partnering up and using Spark, it is likely to evolve into something really big. This is quite understandable if you look at the different advantages Spark can offer your business:

1. Spark is the right tool for analytic challenges that demand low-latency in-memory machine learning and graph analytics. This is especially relevant for companies that focus on the Internet of Things.

Since we will see the Internet of Things popping up in every imaginable industry in the coming years, Spark will enable organizations to analyse all that data coming from IoT sensors as it can easily deal with continuous streams of low-latency data. This will enable organizations to create real-time dashboards to explore their data and to monitor and optimize their business.

2. Spark will drastically improve the productivity of Big Data scientists. It enables faster, iterative product development, using popular programming languages. The high-level libraries of Spark, which cover streaming data, machine learning, SQL queries and graph processing, can be easily combined to create complex workflows.

This enables data scientists to create new Big Data applications faster. In fact, it requires 2-5x less code. It will result in reduced time-to-market for new products as well as faster access to insights within your organization using these applications.

Spark also enables data scientists to prototype solutions without the requirement to submit code to the cluster every time, leading to better feedback and iterative development.

3. According to James Kobielus, the Internet of Things “may spell the end of data centres as we’ve traditionally known them”. In the coming years we will see that most of the core functions of a data centre, storage and processing, can be decentralized, i.e. performed directly on the IoT devices instead of in a centralized data centre. This is called Fog Computing and was recently described by Ahmed Banafa on Datafloq.

Fog computing will give organizations unprecedented computing power at relatively low costs. Apache Spark is very well suited for the analysis of massive amounts of highly distributed data. Apache Spark could therefore, potentially, be the catalyst required to have fog computing take off and to prepare organizations for the Internet of Things. This in turn could enable organizations to create new products and applications for their customers, creating new business models and revenue streams.

4. Spark can run on top of the Hadoop Distributed File System (HDFS). This is a major advantage for those who are already familiar with Hadoop and already have a Hadoop cluster within their organization. Spark works well with Hadoop, using the same data formats and respecting data locality for efficient processing. It can be deployed on existing Hadoop clusters or work side-by-side with them. This allows organizations that already work with Hadoop to easily upgrade their Big Data environment.

5. There is a large community around Spark, with over 400 contributing developers from around the world and a long list of vendors supporting it. Combined with its compatibility with popular programming languages, this offers organizations a large pool of developers who can work with Spark. Instead of having to hire expensive programmers for an unfamiliar tool or programming language, organizations can rely on Spark’s libraries and its compatibility with Java, Python and Scala.

The Swiss Army Knife of Big Data Analytics

Spark is a powerful open-source cluster-computing framework for data analytics. It has become very popular because of its speed, its support for iterative computing and the faster data access provided by its in-memory caching. Its libraries enable developers to create complex applications faster and better, enabling organizations to do more with their data.

Because of its wide range of applications and its ease of use, Spark is also called the Swiss army knife of Big Data Analytics. With the buzz already happening around Spark, the large community supporting it and the multiple use cases we have already seen, Spark could evolve into the next big thing within the Big Data ecosystem.

Saturday, May 9, 2015

Yann LeCun's Comments on Extreme Learning Machine (ELM)

Yann LeCun commented on ELM on his Facebook page. I quote his post below:

What's so great about "Extreme Learning Machines"?

There is an interesting sociological phenomenon taking place in some corners of machine learning right now. A small research community, largely centered in China, has rallied around the concept of "Extreme Learning Machines".

Frankly, I don't understand what's so great about ELM. Would someone please care to explain?

An ELM is basically a 2-layer neural net in which the first layer is fixed and random, and the second layer is trained. There is a number of issues with this idea.

First, the name: an ELM is *exactly* what Minsky & Papert call a Gamba Perceptron (a Perceptron whose first layer is a bunch of linear threshold units). The original 1958 Rosenblatt perceptron was an ELM in that the first layer was randomly connected.

Second, the method: connecting the first layer randomly is just about the stupidest thing you could do. People have spent the almost 60 years since the Perceptron to come up with better schemes to non-linearly expand the dimension of an input vector so as to make the data more separable (many of which are documented in the 1974 edition of Duda & Hart). Let's just list a few: using families of basis functions such as polynomials, using "kernel methods" in which the basis functions (aka neurons) are centered on the training samples, using clustering or GMM to place the centers of the basis functions where the data is (something we used to call RBF networks), and using gradient descent to optimize the position of the basis functions (aka a 2-layer neural net trained with backprop).

Setting the layer-one weights randomly (if you do it in an appropriate way) can possibly be effective if the function you are trying to learn is very simple, and the amount of labelled data is small. The advantages are similar to that of an SVM (though to a lesser extent): the number of parameters that need to be trained supervised is small (since the first layer is fixed) and easily regularized (since they constitute a linear classifier). But then, why not use an SVM or an RBF net in the first place?

There may be a very narrow area of simple classification problems with small datasets where this kind of 2-layer net with random first layer may perform OK. But you will never see them beat records on complex tasks, such as ImageNet or speech recognition.
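To make the structure described in the quoted comment concrete, here is a minimal sketch in plain Python of a 2-layer net whose first layer is fixed and random and whose second layer is the only part that is trained. The output weights are fitted here by ridge-regularized least squares, which is how ELMs are commonly described, but the activation (`tanh`), the weight range, the hidden size and the names `train_elm` and `solve_linear` are all assumptions of this sketch and are not taken from the comment or from any ELM paper.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def train_elm(xs, ys, n_hidden=20, ridge=1e-6, seed=0):
    """Fit only the output layer on top of a fixed random hidden layer."""
    rng = random.Random(seed)
    n_in = len(xs[0])
    # First layer: random weights and biases that are never updated.
    w = [[rng.uniform(-2.0, 2.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [rng.uniform(-2.0, 2.0) for _ in range(n_hidden)]

    def hidden(x):
        acts = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
                for row, bi in zip(w, b)]
        return acts + [1.0]                             # constant bias feature

    h = [hidden(x) for x in xs]
    k = n_hidden + 1
    # Second layer: solve (H^T H + ridge * I) beta = H^T y.
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
    beta = solve_linear(gram, rhs)
    return lambda x: sum(bj * hj for bj, hj in zip(beta, hidden(x)))


# XOR: a tiny problem that is not linearly separable in the raw inputs.
xs = [[0, 0], [0, 1], [1, 0], [1, 1]]
ys = [0, 1, 1, 0]
model = train_elm(xs, ys)
predictions = [round(model(x)) for x in xs]
```

On a toy problem like this the random expansion is enough for the linear output layer to fit the four points, which is consistent with the remark above that such nets may do fine on simple problems with small datasets. The sketch says nothing about harder tasks.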

The ELM's inventor, G.-B. Huang, replied by pointing out that the answers can be found in his paper, "What are Extreme Learning Machines? Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle".

Sunday, January 25, 2015

IEEE-CS Unveils Top 10 Technology Trends for 2015

LOS ALAMITOS, Calif., 1 December 2014 – In the coming year, while consumers will be treated to a dizzying array of augmented reality, wearables, and low-cost 3D printers, computer researchers will be tackling the underlying technology issues that make such cutting-edge consumer electronics products possible. IEEE Computer Society today announced the top 10 most important technology trends for 2015 and explores how these technologies will be integrated into daily life.

Cybersecurity in general will remain a critical concern, with increased focus on security for cloud computing and deeply embedded devices. And interoperability and standards will be top priorities to unleash the potential of Software-defined Anything (SDx) and the Internet of Anything (IoA).

"Researchers have been working to address these issues for a number of years, however 2015 should see real progress in these areas," said incoming IEEE Computer Society President Thomas Conte, an electrical and computer science professor at Georgia Tech. "We are reaching an inflection point for 3D printing, which will revolutionize manufacturing, and the exponential growth in devices connected to the Internet makes interoperability and standards critical."

Among the advances that IEEE Computer Society experts forecast:

1) The time is right for wearable devices: Imagine a wearable device that tells time, sends and receives email and messages, makes calls, and even tracks exercise routines. Smartwatches hitting the market do all that and more. Both established players and small startups in 2015 will be actively involved in developing new devices, applications, and protocols for the wearable electronics market.

2) Internet of Anything will become all-encompassing: The reality that up to 26 billion things will be connected on the Internet by 2020 is sinking in. The Internet of Things and Internet of Everything in 2015 will morph into the Internet of Anything. IoA envisions a common software "ecosystem" capable of accommodating any and all sensor inputs, system states, operating conditions, and data contexts — an overarching "Internet Operating System."

3) Building security into software design: As the volume of data explodes, along with the means to collect and analyze that information, building security into software design and balancing security and privacy are becoming top priorities.

4) Industry will tackle Software-defined Anything (SDx) interoperability and standards: Software-defined networking's programmability will turn various network appliances into a warehouse of apps. Several standards groups are working on interoperability issues, including the Open Networking Foundation (ONF), the Internet Engineering Task Force (IETF), the European Telecommunications Standards Institute (ETSI), and the International Telecommunication Union (ITU).

5) Cloud security and privacy concerns grow: The celebrity photo hacking scandal and iCloud breach in China in 2014 has brought cloud security to the forefront for 2015. Enterprises are moving workloads to the cloud and expecting enterprise-level security. To avoid system fragility and defend against vulnerabilities exploration from cyber attackers, various cybersecurity techniques and tools are being developed for cloud systems.

6) 3D Printing is poised for takeoff: Next year will see production of the first 3D-printed car. The 3D-printed car is just one of hundreds of uses that enterprises and consumers are finding for 3D printing, which will revolutionize manufacturing by lowering costs and time to market. Also in 2015, sales of 3D printers are expected to take off, driven by low-cost printing and uptake by a variety of industries.

7) Predictive Analytics will be increasingly used to identify outcomes: Business intelligence in 2015 will be less about examining the past and more about predicting the future. While predictive modeling techniques have been researched by the data mining community for several decades, they are now impacting every facet of our lives. In organizational settings, predictive analytics has gained widespread adoption over the past decade as firms look to compete on analytics.

8) Embedded Computing security will get added scrutiny: Deeply-embedded computing systems often perform extremely sensitive tasks, and in some cases, such as healthcare IT, these are lifesaving. Emerging deeply-embedded computing systems are prone to more serious or life-threatening malicious attacks. These call for revisiting traditional security mechanisms not only because of the new facets of threats and more adverse effects of breaches, but also due to the resource limitations of these often-battery-powered and extremely constrained computing systems.

9) Augmented Reality Applications will grow in popularity: Mobile apps using augmented reality help the colorblind see colors, travelers explore unfamiliar cities, shoppers imagine what they look like in different outfits, and even help drivers locate their parked cars. The availability of inexpensive graphics cards and sensors, and the popularity of applications in such areas as gaming and virtual worlds, is bringing augmented reality into the mainstream.

10) Smartphones will provide new opportunities for Continuous Digital Health: The way we deal with our health is undergoing a major transformation, not only because mobile Internet technology has made it possible to have continuous access to personal health information, but also because breaking the trend of ever-growing healthcare costs is increasingly necessary. Connectivity, interoperability, sensing, and instant feedback through smartphones all provide new opportunities for gaining insights into our health behavior.

To view the full list of IEEE Computer Society technology trends, visit

Sunday, January 11, 2015

Wearables, Sensables, and Opportunities at CES 2015

ApplySci released a summary of wearable healthcare devices at CES 2015, which I quote below. ApplySci commented that "Samsung's Simband is best positioned to take wearables into medical monitoring". Note that Simband uses PPG to monitor heart rate for fitness tracking and health monitoring, and I have already developed superior PPG-based heart rate estimation algorithms for it. Some general frameworks have been published in, or are under review by, IEEE Transactions on Biomedical Engineering. You can check my website for more details.

It was the year of Digital Health and Wearable Tech at CES.  Endless watches tracked vital signs (and many athletes exercised tirelessly to prove the point).   New were several ear based fitness monitors (Brag), and some interesting TENS pain relief wearables (Quell).  Many companies provided  monitoring for senior citizens, and the most interesting only notified caregivers when there was a change in learned behavior (GreenPeak).  Senior companion robots were missing, although robots capable of household tasks were present (Oshbot).  3D printing was big (printed Pizza)–but where were 3d printed bones and organs?  Augmented reality was popular (APX,Augmenta)–but mostly for gaming or industrial use.  AR for health is next.

Two companies continue to stand out in Digital Health.  Samsung’s Simband  is best positioned to take wearables into  medical monitoring, with its multitude of sensors, open platform, and truly advanced health technologies. And  MC10‘s electronics that bend, stretch, and flex will disrupt home diagnosis, remote monitoring, and smart medical devices.

We see two immediate opportunities.  The brain, and the pulse.

1.  A few companies at CES claimed to monitor brain activity, and one savvy brand (Muse) provided earphones with soothing sounds while a headband monitored attention.  While these gadgets were fun to try, noone at CES presented extensive brain state interpretation to address cognitive and emotional issues.

2.  Every athlete at CES used a traditional finger based pulse sensor.  A slick wearable that can forgo the finger piece will make pulse oximetry during sports fun, instead of awkward.  As with every gadget, ensuring accuracy is key, as blind faith in wearables can be dangerous.

ApplySci looks forward to CES 2016, and the many breakthroughs to be discovered along the way, many of which will be featured at Wearable Tech + Digital Health NYC 2015.