One of us (Arthur) has been working in the speech recognition industry, so we are quite aware that Facebook is developing its own recognizer. Yet it is also painfully clear from the article that Facebook's direction of development is misguided and doomed to fail. So we are writing this piece to analyze the cause, and what Facebook could potentially do about it.
To start off, what is so special about automatic speech recognition (ASR) within the business of machine learning? If you delve deep, speech as a pattern is subtle and hard to model. These days it is usually represented as a sequence of n-dimensional feature vectors, and modeling is done with non-trivial models such as hidden Markov models (HMMs) or, in deep learning, long short-term memory networks (LSTMs). Even now, training a good ASR model on 10,000+ hours of audio takes a lot of resources to do well. First of all, data collection has to be based on the specific accent of your market, e.g. a US English model would not work too well in other English-speaking countries.
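To make this concrete, here is a minimal sketch (ours, not Facebook's pipeline) of what "speech as a sequence of n-dimensional vectors" looks like in code: a small PyTorch LSTM acoustic model that maps 40-dimensional frame features to per-frame phone scores. The feature dimension, hidden size, and phone-set size are illustrative assumptions.

# Minimal sketch: speech modeled as a sequence of n-dimensional feature vectors,
# fed to an LSTM acoustic model. Dimensions below are illustrative, not anyone's
# production configuration.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_phones=48):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_phones)

    def forward(self, feats):               # feats: (batch, frames, feat_dim)
        hidden_states, _ = self.lstm(feats) # one hidden state per ~10 ms frame
        return self.out(hidden_states)      # per-frame phone scores

# One utterance of ~3 seconds = ~300 frames of 40-dim filterbank features.
model = LSTMAcousticModel()
utterance = torch.randn(1, 300, 40)
print(model(utterance).shape)               # torch.Size([1, 300, 48])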
Then there is data labeling, or in speech recognition parlance, transcription. How do you find the linguistic experts to transcribe data? Unlike image recognition, crowdsourcing might give you poor-quality transcripts, which result in poorly-performing models. So do you want to hire transcribers in-house? Would that fit your budget? And who is going to manage them?
So say you have transcribed data; you are still only halfway there. Unlike the workflow of image recognition training, many practical speech recognition training pipelines still consist of multiple steps. (Seq-to-seq training works, but it takes a lot of data.) As a result, training is still a magical step which takes specialists to tend. E.g. what if the training fails? Do you need to fix the source code? And how do you interpret the various failure modes in training? None of these questions are trivial. ASR codebases are usually complex programs which require programmers to understand deep topics such as dynamic programming, numerical optimization and matrix computation. Many of them are written in low-level languages such as C/C++ (i.e. not Python) and take an experienced coder to maintain. That's why most company job postings for speech recognition usually require master's or doctoral degrees from candidates.
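To give a flavor of the kind of algorithmic code ASR engineers maintain, below is a small, self-contained example (ours, for illustration only) of the dynamic programming that shows up everywhere in this field: word error rate (WER) computed as an edit distance between a reference and a hypothesis transcript. Decoders and forced aligners use the same family of algorithms, just at a much larger scale and usually in C/C++.

# Word error rate via edit distance: the classic dynamic-programming exercise
# hiding inside every ASR evaluation and decoding tool.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = dp[i - 1][j] + 1
            insert = dp[i][j - 1] + 1
            dp[i][j] = min(substitute, delete, insert)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the lights", "turn the light on"))  # 0.75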
Once you can create and maintain a speech recognizer, the final question is how you grow its capability. For example, it's hard to estimate the time cost of training recognizers for different languages, so how far out should you set your goals? If engineers fail to train a recognizer on time, what should you do? Those are difficult questions. In our opinion, if it takes a team time to fix bugs but they get it right in a production system, it is worth all the time it takes. But in our fast-paced world of development, not all companies would, or could, give so much time to tend a recognizer. Not to mention that the best speech recognizers, in our opinion, also implement the most advanced mathematical models.
Let's go back to the case of Facebook. From the article, we learn that product managers would switch the domain of a speech recognizer every half year or so, and it can go from news transcription to voice dialogue. For those of us who have worked in this business for a long time, such switching of tasks is the most nightmarish scenario. Half a year might not even give you enough time to collect data. Not to mention that you might need to tailor product features to one type of domain; e.g. there are a lot of differences between a recognizer that runs offline and one that runs live (see the sketch below). Have the Facebook product managers ever considered these issues? We don't think so.
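To make the offline-versus-live distinction concrete, here is a tiny PyTorch sketch (our illustration, not Facebook's design): an offline recognizer can encode the whole utterance at once, even with a bidirectional model, while a live recognizer must be causal and process audio chunk by chunk, carrying its state forward. All sizes are made up for illustration.

# Offline: whole utterance available, bidirectional context allowed.
# Live/streaming: audio arrives in chunks, model must be causal and stateful.
import torch
import torch.nn as nn

offline_encoder = nn.LSTM(40, 256, batch_first=True, bidirectional=True)
streaming_encoder = nn.LSTM(40, 256, batch_first=True)  # unidirectional only

full_utterance = torch.randn(1, 300, 40)       # whole recording available offline
offline_out, _ = offline_encoder(full_utterance)

state = None
for chunk in full_utterance.split(30, dim=1):  # live audio arrives in small chunks
    streaming_out, state = streaming_encoder(chunk, state)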
What we would suggest Facebook do, if they are serious about building a speech recognizer, is to form a cross-departmental team with a powerful and influential team leader. Members of the team, even the managers, should have a strong understanding of how a recognizer should be built. We understand it is tough, but such a team might give Facebook a real chance to at least sustain the effort in speech recognition, and perhaps, when the stars align, come up with a product that is on par with Google, Amazon or Apple.
Our two cents.
OpenAI released Spinning Up last week. It's mind-blowing because it's perhaps the first time we have seen a major institution release a course on deep reinforcement learning.
Looking into the details of Spinning Up, the goal is to bridge the gap between papers and actual RL implementations. So in some sense, it is more of a tutorial you can follow while taking an online class such as David Silver's RL course. The tutorial is very detailed, and we also saw a fairly extensive bibliography which seems more relevant than existing standard RL textbooks and resources.
As for support, OpenAI promises high-bandwidth support for Spinning Up. We hope they can sustain the effort, because OpenAI as an organization occasionally drops initiatives, such as OpenAI Gym.