Categories
Summary

A Read on “Deep Neural Networks for Acoustic Modeling in Speech Recognition” by Hinton et al.

* This is the now-classic paper in deep learning: it was the first time people confirmed that deep learning could improve ASR significantly. It is important in the fields of both deep learning and ASR. It's also one of the first papers I read on deep learning, back in 2012-13.

* Many people trace the origin of deep learning to image recognition, e.g. many kids would tell you stories about ImageNet, AlexNet and the history from then on. But the first important application of deep learning was perhaps speech recognition.

* So what was going on with ASR before deep learning? For the most part, if you could come up with a technique that cut a state-of-the-art system's WER by 10%, your PhD thesis was good. If your technique could consistently beat previous techniques across multiple systems, you would usually get a fairly good job at a research institute in the Big 4.

* The only technique I recall doing better than 10% relative improvement is discriminative training, which got ~15% in many domains. That happened back in 2003-2004. In ASR, the term "discriminative training" has a very complicated connotation, so I am not going to explain it much. This just gives you the context of how powerful deep learning is.

* You might be curious what "relative improvement" is. Suppose your original WER is 18% and you improve it to 17%; then your relative improvement is 1%/18% = 5.56%. So a 10% relative improvement really means you go down to 16.2%. (Yes, ASR is that tough.)
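The arithmetic above is worth making concrete; a two-line sketch:

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER improvement, as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

# Going from 18% to 17% absolute WER is only a ~5.6% relative gain.
print(round(relative_improvement(0.18, 0.17), 4))  # 0.0556
# A 10% relative gain from 18% lands at 16.2% absolute WER.
print(round(0.18 * (1 - 0.10), 3))  # 0.162
```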

* So here comes replacing the GMM with a DNN. These days, it sounds like a no-brainer, but back then it was a huge deal. Many people in the past had tried to stuff various ML techniques into the system to replace the GMM, but no one could successfully beat the standard GMM-HMM. So this is innovative.

* Now for how the system is set up – the ancestor of this work traces back to Bourlard and Morgan's "Connectionist Speech Recognition", in which the authors built a context-independent HMM system by replacing VQ scores with a shallow neural network. At that time, the units were chosen to be CI states.

* Hinton's, and perhaps Deng's, thinking is interesting: the units were chosen to be context-dependent states. Now that's a new change, and it reflects how modern HMM systems are trained.
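To make the hybrid setup concrete: in the Bourlard-and-Morgan style of system the network outputs state posteriors p(s|x), but the HMM decoder wants likelihoods p(x|s). By Bayes' rule p(x|s) = p(s|x)·p(x)/p(s), and since p(x) is the same for every state, decoding can use the "scaled likelihood" p(s|x)/p(s). A minimal sketch (all numbers are made up for illustration):

```python
import numpy as np

def scaled_likelihoods(posteriors, state_priors, floor=1e-8):
    """Convert NN state posteriors into HMM-usable scaled likelihoods
    by dividing out the state priors (floored to avoid division by 0)."""
    return posteriors / np.maximum(state_priors, floor)

post = np.array([0.7, 0.2, 0.1])    # NN posteriors for one frame (toy)
priors = np.array([0.5, 0.3, 0.2])  # state priors from training alignments (toy)
print(scaled_likelihoods(post, priors))
```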

* Then there is how the network is actually trained. Here you can see the early deep learners' stress on pre-training, because training was very expensive at that point. (I suspect they weren't using GPUs.)
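The pre-training in the paper is greedy and layer-wise: train one layer at a time on the previous layer's output, then fine-tune the whole stack. The paper uses RBMs; the sketch below substitutes a tied-weight linear autoencoder as a toy stand-in just to show the greedy loop structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(x, hidden_dim, epochs=50, lr=0.001):
    """Greedy pre-training of one layer as a tied-weight linear
    autoencoder (a toy stand-in for the RBMs used in the paper)."""
    w = rng.normal(0, 0.01, (x.shape[1], hidden_dim))
    for _ in range(epochs):
        err = x @ w @ w.T - x                  # reconstruction error
        grad = x.T @ err @ w + err.T @ x @ w   # d(0.5*||err||^2)/dw
        w -= lr * grad / len(x)
    return w

# Stack layers greedily: each layer trains on the previous layer's
# output; supervised fine-tuning of the whole stack would follow.
x = rng.normal(size=(200, 20))
weights = []
for dim in (16, 8):                 # hypothetical layer sizes
    w = pretrain_layer(x, dim)
    weights.append(w)
    x = np.tanh(x @ w)              # hidden activations feed the next layer
```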

* Then there is the use of cross-entropy to train the model, frame by frame. Later on, in other systems, many people moved to sentence-based (sequence-level) criteria for training. In this sense, the paper is dated.

* None of this is trivial work. And the result is stellar: we are talking about an 18%-33% relative gain (p.14). To ASR people, that's unreal.

* The paper also foresees some future uses of DNNs, such as bottleneck features and articulatory features. You probably know the former already. The latter is more esoteric in ASR, so I am not going to talk about it much.
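For readers who don't know the former: bottleneck features come from training a DNN with one deliberately narrow hidden layer, then using that layer's activations as input features for a conventional GMM-HMM system. A shape-only sketch with random (untrained) weights and hypothetical layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# 42-dim bottleneck in the middle of the stack (all sizes hypothetical).
layer_dims = [40, 512, 42, 512, 100]
weights = [rng.normal(0, 0.1, (a, b))
           for a, b in zip(layer_dims[:-1], layer_dims[1:])]

def bottleneck_features(frames, weights, bottleneck_index=2):
    """Forward-propagate and return activations at the narrow layer."""
    h = frames
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if i + 1 == bottleneck_index:   # stop at the bottleneck layer
            return h
    return h

frames = rng.normal(size=(10, 40))      # 10 frames of 40-dim features (toy)
feats = bottleneck_features(frames, weights)
print(feats.shape)  # (10, 42)
```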

Anyway, that’s what I have. Enjoy the reading!


A Read on “Regularized Evolution for Image Classifier Architecture Search”

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on "Regularized Evolution for Image Classifier Architecture Search", the paper version of AmoebaNet, the latest result in AutoML. (Or see this page: https://research.googleblog.com/…/using-evolutionary-automl…)

* If you recall, Google already has several results on using RL and evolution strategies (ES) to discover model architectures. NASNet is one example.

* So what's new? The key idea is the so-called regularized evolution strategy. What does it mean?

* Basically, it is a tweak of the more standard tournament selection, commonly used as a means of selecting individuals from a population. (https://en.wikipedia.org/wiki/Tournament_selection)

* Tournament selection is not too difficult to describe:
– Choose random individuals from the population.
– Choose the best candidate according to certain optimizing criterion.

You can also use a probabilistic scheme to decide whether to use the second- or third-best candidate instead. You might also think of it as throwing away the worst N candidates.
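The two steps above can be sketched in a few lines; here the "individuals" are just integers whose fitness is their own value, a toy stand-in for trained architectures:

```python
import random

random.seed(0)

def tournament_select(population, fitness, k=3):
    """Plain tournament selection: sample k individuals at random
    and keep the fittest one."""
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

pop = list(range(20))
winners = [tournament_select(pop, fitness=lambda x: x) for _ in range(5)]
print(winners)   # biased toward large values
```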

* The AutoML paper calls this original method, by Miller and Goldberg (1995), the non-regularized evolution method.

* What is "regularized" then? Instead of throwing away the worst N candidates, the authors propose to throw away the oldest-trained candidate.
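That one-line change gives the full aging-evolution loop: each cycle, the best of a random sample is mutated into a child, the child joins the population, and the oldest member is removed regardless of fitness. A toy sketch where "architectures" are just numbers and "mutation" is Gaussian noise:

```python
import random
from collections import deque

random.seed(0)

def regularized_evolution(cycles=200, pop_size=20, sample_size=5):
    """Regularized (aging) evolution with toy individuals: fitness
    is the value itself, mutation adds small Gaussian noise."""
    population = deque(random.random() for _ in range(pop_size))
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample)                     # tournament over the sample
        child = parent + random.gauss(0, 0.05)   # toy "mutation"
        population.append(child)
        population.popleft()                     # age out the OLDEST, not the worst
    return max(population)

print(regularized_evolution())   # best fitness tends to drift upward
```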

* Now, you won't see a justification of why this method is better until the "Discussion" section. Okay, let's go with the authors' intended flow. As it turns out, the regularized method is better than the non-regularized method. e.g. on CIFAR-10, the evolved model is ~10% relatively better than either man-made models or NASNet. On ImageNet, it performs better than Squeeze-and-Excite Net as well as NASNet. (Squeeze-and-Excite Net was the ILSVRC 2017 winner.)

* One technicality when you read the paper: the G-X datasets are actually gray-scale versions of the normal X datasets. e.g. G-CIFAR-10 is the gray-scale version of CIFAR-10. The intention is probably twofold: 1) to scale the problem down, and 2) to avoid overfitting to only the standard test sets of the problems.

* Now, this is all great. But why is the "regularized" approach better, then? How would the authors explain it?

* I don’t want to come up with a hypothesis. So let me just quote the last paragraph here: “Under regularized evolution, all models have a short lifespan. Yet, populations improve over longer timescales (Figures 1d, 2c,d, 3a–c). This requires that its surviving lineages remain good through the generations. This, in turn, demands that the inherited architectures retrain well (since we always train from scratch, the weights are not heritable). On the other hand, non-regularized tournament selection allows models to live infinitely long, so a population can improve simply by accumulating high-accuracy models. Unfortunately, these models may have reached their high accuracy by luck during the noisy training process. In summary, only the regularized form requires that the architectures remain good after they are retrained.”

* And also: "Whether this mechanism is responsible for the observed superiority of regularization is conjecture. We leave its verification to future work."