Research not for publishing papers, but for fun, for satisfying curiosity, and for revealing the truth.

This blog reports the latest progress in
(1) Signal Processing and Machine Learning for Biomedicine, Neuroimaging, Wearable Healthcare, and Smart-Home
(2) Sparse Signal Recovery and Compressed Sensing of Signals by Exploiting Spatiotemporal Structures
(3) My Works


Saturday, May 9, 2015

Yann LeCun's Comments on Extreme Learning Machine (ELM)

Yann LeCun commented on ELM on his Facebook page (https://www.facebook.com/yann.lecun/posts/10152872571572143). I quote his post below:

What's so great about "Extreme Learning Machines"?

There is an interesting sociological phenomenon taking place in some corners of machine learning right now. A small research community, largely centered in China, has rallied around the concept of "Extreme Learning Machines".

Frankly, I don't understand what's so great about ELM. Would someone please care to explain?

An ELM is basically a 2-layer neural net in which the first layer is fixed and random, and the second layer is trained. There is a number of issues with this idea.

First, the name: an ELM is *exactly* what Minsky & Papert call a Gamba Perceptron (a Perceptron whose first layer is a bunch of linear threshold units). The original 1958 Rosenblatt perceptron was an ELM in that the first layer was randomly connected.

Second, the method: connecting the first layer randomly is just about the stupidest thing you could do. People have spent the almost 60 years since the Perceptron to come up with better schemes to non-linearly expand the dimension of an input vector so as to make the data more separable (many of which are documented in the 1974 edition of Duda & Hart). Let's just list a few: using families of basis functions such as polynomials, using "kernel methods" in which the basis functions (aka neurons) are centered on the training samples, using clustering or GMM to place the centers of the basis functions where the data is (something we used to call RBF networks), and using gradient descent to optimize the position of the basis functions (aka a 2-layer neural net trained with backprop).

Setting the layer-one weights randomly (if you do it in an appropriate way) can possibly be effective if the function you are trying to learn is very simple, and the amount of labelled data is small. The advantages are similar to that of an SVM (though to a lesser extent): the number of parameters that need to be trained supervised is small (since the first layer is fixed) and easily regularized (since they constitute a linear classifier). But then, why not use an SVM or an RBF net in the first place?

There may be a very narrow area of simple classification problems with small datasets where this kind of 2-layer net with random first layer may perform OK. But you will never see them beat records on complex tasks, such as ImageNet or speech recognition.
http://www.extreme-learning-machines.org/

The inventor of ELM, G.-B. Huang, replied by pointing out that the answers can be found in his paper: "What are Extreme Learning Machines? Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle" (http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Rosenblatt-Neumann.pdf)
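
For readers who have not seen the architecture under discussion, here is a minimal sketch of the kind of network the quoted post describes: a 2-layer net whose first layer is fixed and random and whose second layer is the only part that is trained. It is an illustration only, not code from LeCun or Huang. The tanh activation, the Gaussian weight initialization, the ridge-regularized least-squares fit of the output layer, the XOR toy data, and all function names are my own assumptions, chosen to keep the example small and dependent on nothing but the Python standard library.

```python
import math
import random


def hidden_features(x, weights, biases):
    """First layer: fixed random projection followed by tanh (never trained)."""
    return [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_random_first_layer(xs, ys, n_hidden=25, ridge=1e-6, seed=0):
    """Draw the first layer at random, then fit only the linear output layer."""
    rng = random.Random(seed)
    n_inputs = len(xs[0])
    # Assumption: standard-normal weights and biases for the fixed first layer.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    h = [hidden_features(x, weights, biases) for x in xs]
    # Second layer: ridge regression via the normal equations
    # (H^T H + ridge * I) beta = H^T y.
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n_hidden)]
    beta = solve_linear_system(gram, rhs)
    return weights, biases, beta


def predict(x, weights, biases, beta):
    """Linear readout of the random hidden features."""
    feats = hidden_features(x, weights, biases)
    return sum(b_j * h_j for b_j, h_j in zip(beta, feats))


# Toy problem (an assumption for illustration): XOR with +/-1 targets.
xor_inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
xor_targets = [-1.0, 1.0, 1.0, -1.0]
model = fit_random_first_layer(xor_inputs, xor_targets)
scores = [predict(x, *model) for x in xor_inputs]
labels = [1 if s > 0 else -1 for s in scores]
print(labels)
```

Only `beta` is learned here, so the supervised part reduces to a regularized linear fit. That matches the point in the quoted post that the trained parameters are few and easily regularized. A toy problem like this says nothing about the complex tasks the post mentions.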