The definitive weekly newsletter on A.I. and Deep Learning, published by Waikit Lau and Arthur Chan. Our background spans MIT, CMU, Bessemer Venture Partners, Nuance, BBN, etc. Every week, we curate and analyze the most relevant and impactful developments in A.I.
We are back! The big news this week is perhaps that Prof. LeCun is stepping down as the chief of Facebook A.I. Research (FAIR). More in the news section.
We also have a bunch of interesting content in our blog section, e.g. Arthur's review of Course 4 of deeplearning.ai. Course 4 focuses on image classification as an application of deep learning. Arthur walks through how it compares with an existing class such as cs231n.
Then, in our paper section, we present a read of the classic paper, "Deep Neural Networks for Acoustic Modeling in Speech Recognition".
As always, if you like our newsletter, feel free to forward it to your friends/colleagues!
This newsletter is a labor of love from us. All publishing costs and operating expenses are paid out of our pockets. If you like what we do, you can help defray our costs by sending a donation via link. For crypto enthusiasts, you can donate by sending Eth to this address: 0xEB44F762c58Da2200957b5cc2C04473F609eAA65.
We have seen a lot of demand from AIDL members for a new specialized group just on speech recognition. ASR is a really hot space lately. Our new group, ASRDL, will be a great place for you to learn about the latest and greatest and to join the discussions.
Prof. LeCun is stepping down as the chief of FAIR. His replacement is Jérôme Pesenti, former CEO of the AI startup BenevolentTech. Pesenti will report directly to the CTO of Facebook, Mike Schroepfer. According to Quartz's article, LeCun will still decide the research directions of FAIR, but day-to-day operations will report up to Pesenti.
What do we make of the event? We think this might be a necessary change as A.I. goes from research to being applied across all Facebook products. Facebook has been relying on A.I. in various aspects of the company's operation, e.g. news ranking, face tagging and translation. Yet a large A.I. machinery also requires a large software team to maintain. We speculate that Mr. Pesenti might be a better choice for this next stage of evolution.
From the respected Ross Girshick: Facebook has now open-sourced Detectron, which includes two techniques - the Mask R-CNN algorithm for segmentation, and the technique from "Focal Loss for Dense Object Detection" by T.-Y. Lin et al. for object detection.
This is the now-classic paper in deep learning, in which, for the first time, people confirmed that deep learning can improve ASR significantly. It is important in the fields of both deep learning and ASR. It's also one of the first papers I read on deep learning, back in 2012-13.
Many people trace the origin of deep learning to image recognition, e.g. many beginners would tell you stories about ImageNet, AlexNet, and so on. But the first important application of deep learning is perhaps speech recognition.
So what was going on with ASR before deep learning, then? For the most part, if you could come up with a technique that cut a state-of-the-art system's WER by 10% relative, your PhD thesis was good. If your technique could consistently beat previous techniques across multiple systems, you usually got a fairly good job at a research institute or one of the Big 4.
The only technique which I recall being better than 10% relative improvement is discriminative training. It got ~15% in many domains. That happened back in 2003-2004. In ASR, the term "discriminative training" has a very complicated connotation, so I am not going to explain it much. This just gives you the context of how powerful deep learning is.
You might be curious what "relative improvement" is. E.g. suppose your original WER is 18% and you improve it to 17%; then your relative improvement is 1%/18% = 5.56%. So a 10% relative improvement really means you go down to 16.2%. (Yes, ASR is that tough.)
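To make the arithmetic concrete, here is a tiny helper of our own (not something from the paper) that computes relative WER improvement:

```python
def relative_improvement(old_wer, new_wer):
    """Relative WER improvement: the absolute drop divided by the old WER."""
    return (old_wer - new_wer) / old_wer

# Going from 18% WER to 17% is only a ~5.6% relative improvement...
print(round(relative_improvement(0.18, 0.17) * 100, 2))  # 5.56

# ...while a genuine 10% relative improvement from 18% lands you at 16.2%.
print(round(0.18 * (1 - 0.10) * 100, 2))  # 16.2
```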
So here comes replacing the GMM with a DNN. These days, it sounds like a no-brainer, but back then it was a huge deal. Many people had tried to stuff various ML techniques in to replace the GMM, but no one had successfully beaten the GMM-HMM system. So this was innovative.
Then there is how the system is set up - the ancestor of this work traces back to Bourlard and Morgan's "Connectionist Speech Recognition", in which the authors built a context-independent HMM system by replacing VQ scores with a shallow neural network. At that time, the units were chosen to be CI states.
Hinton's, and perhaps Deng's, thinking here is interesting: the units were chosen to be context-dependent states. That was a new change, and it reflects how modern HMM systems are trained.
Then there is how the network is actually trained. You can see the early DLers' stress on using pre-training, because training was very expensive at that time. (I suspect it wasn't done on GPUs.)
Then there is the use of frame-level cross-entropy to train the model. Later on, in other systems, many people switched to a sentence-based objective for training. So in this sense, the paper is dated.
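As a sketch of what frame-level training looks like (my own minimal illustration, not the paper's recipe, with made-up dimensions and a random layer standing in for the DNN): each acoustic frame gets a CD-state label from a forced alignment, and the objective is just the average per-frame cross-entropy, with no sequence-level term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 frames of 13-dim features, 4 hypothetical CD-state classes.
frames = rng.normal(size=(5, 13))
labels = np.array([0, 0, 2, 3, 3])  # per-frame targets from a forced alignment

# A single random softmax layer stands in for the DNN's output layer.
W = rng.normal(size=(13, 4))
logits = frames @ W
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Frame-level cross-entropy: the negative log-probability of each frame's
# own state label, averaged over frames, each frame scored independently.
loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```

Sequence-level objectives, by contrast, score whole utterances against competing hypotheses rather than treating each frame independently.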
None of this is trivial work, but the results are stellar: we are talking about an 18%-33% relative gain (p. 14). To ASR people, that's unreal.
The paper also foresees some future uses of DNNs, such as bottleneck features and articulatory features. You probably know the former already. The latter is more esoteric in ASR, so I am not going to talk about it much.