Category Archives: Speech Recognition

Future Plan for "The Grand Janitor Blog"

I have been crazily busy so blogging was rather slow for me.   Though I have a stronger and stronger feeling that my understanding is closer to the state of the art of speech recognition.   And for now, the state of the art of speech recognition, we got to talk about the whole deep neural network trend.

There is nothing conceptually new in the use of hybrid HMM-DBN-DNN.   It has been proposed under the name HMM-ANN in the past.   What is new is that there is new algorithm which allow fast training of multi-layered neural network.   It is mainly due to Hinton's breakthrough in 2006: it suggests training a DBN-DNN can be first initialized by pretrained RBM.

I am naturally very interested in this new trend.   IBM, Microsoft and Googles' results show that DBN-DNN is not a toy model we saw last two decades.

Well, that's all for my excitement on DBN, I still have tons of things to learn.    Back to the "Grand Janitor Blog",  as I had tried to improve the blog layout 4 months ago,  I got to say I feel very frustrated by Blogger and finally decide to move to WordPress.

I hope to move within the next month or so.  I will write a more proper announcement later on.


On Kurzweil : a perspective of an ASR practitioner

Many people who don't work on the fields of AI, NLP and ASR have heard of Kurzweil.   To my surprise, many seem to give little thought on what he said and just follow his theory wholeheartedly.

In this post, I just want to comment on one single little thing, which is whether real-time speech-to-speech translation can be achieved in 2010s.  This is a very specific prediction from Kurzweil's book "The Singularity is Near".

My discussion would mainly focus on ASR first.  So even though my examples below are not exactly pure ASR systems, I will skip the long winding wording of saying "ASR of System X".  And to be frank, MT and Response system probably goes through similar torturous development process anyway.   So, please, don't tell me that "System X is actually ASR + Y", that sort of besides the point.

Oh well, you probably ask why bother, don't we have a demo of real-time speech-to-speech translation from Microsoft already?

True, but if you observe the demo carefully, it is based on read speech.  I don't want to speculate much but I doubt it is a generic language model which wasn't tuned to the lecture.   In a nutshell, I disbelieve it is something you can use it in real-life.

Let's think of a more real-life example: Siri, are we really getting 100% correct response now?  (Not to boil down to ASR WER ...... yet)  I don't think so. Even with adaptation, I don't think Siri understand what I said every single time.    For most of the time, I follow the unofficial command list of Siri, let it improve with adaptation..... but still, it is not perfect.

Why? It is the hard cold reality: ASR is still not perfect, with all the advancement in HMM-based speech recognition.  All the key technologies we know in the last 20 years: CMLLR, MMIE, MPE, fMPE, DMT, consensus network, multipass decodings, SAT, SAT with MMIE or all the nicest front-ends, all the engineerings.   Nope, we are not yet having a feel-good accuracy.  Indeed, human speech recognition is not 0% WER neither but for some reasons, the current state-of-the-art ASR performance is not reaching there.

And Siri, we all know is the state-of-the-art.

Just digress a little bit: Now most of the critics when they write to this point, will then lament that "oh, there is just some invisible barrier out there and human just couldn't make a Tower Babel, blabla....".  I believe most of these "critics" have absolutely no ideas what they are talking about.   To identify these air-head critics, just try to see if they put "cognitive science" into the articles, then you don't know they never work on real-life ASR system.

I, on the hand, do believe one day we can get there.  Why?  Because when people work on one of these speech recognition evaluation tasks, many would tell you : given a certain test set and with enough time and gumption, you would be able to create a system without any errors.  So to me, it is more of an issue of whether some guys grinding on the problem, but not feasibility issue.

So where are we now in ASR?  Recently, In ICASSP 2012,  a Google paper, trained 87 thousand hour of data.  That is probably the largest scale of training I know.  Oh well, where are we now? 10%.  Go down from 12%.  So the last big experiment I know, it's probably the 3000 hours experiment back in 2006-7.  The Google authors are probably using a tougher test set.  So the initial recognition rate was yet again lower.

Okay, speculation time.  So let's assume, that human can always collect 10 times more labelled data for every 6-7 years AND we can do an AM training on them. When will we go to have say 2% WER on the current Google test set?   If we just think of very simple linear interpolation.  It will take 4 * 6 years = 24 years to collect 10000 times more data (or 8 billion hour of data).    So we are way-way past the 2010s deadline from Kurzweil.

And that's a wild speculation.   Computation resources probably will work out itself by that time.  What I doubt most is whether the progress would be linear.  

Of course, it might be non-linearly better too.  But here is another point: it's not just about the training set, it's about the test set.  If we truly want a recognizer to work for *everyone* in the planet, then the very right thing to do is test your recognizer on our whole population.  If we can't then you want to sample enough human speech to represent the Earth's population, the current test set might not be representative enough.   So it is possible that when we increase our test set, we found that the initial recognition rate has go down again.   And it seems to me our test set is still in the state of mimicking human population.

My discussion so far are mostly on acoustic model.  On the language model side,  the problem will mainly on domain specificity.   Also bear in mind, human language can evolve.  So, say we want to build a system which build a customized language model for each human being in the planet.  At a particular moment of time, you might not be able to get enough data to build such a language model.

For me, the point of the whole discussion is that ASR is an engineering system, not some idealistic discussion topic.  There will always be tradeoff.   You may say: "What if a certain technology Y emerge in the next 50 years?" I heard that a lot Y could be quantum computing or brain simulation or brain-human interface or machine implementation of brain.    Guys..... those, I got to admit are very smart idea in our time, and give it another 30-40 years, we might see something useful.   For now, ASR really has nothing to do with them.  I never heard of machine implementation of the audio cortex, or even an accurate construction of audio pathway.  Nor, there is an easy progress of dissecting mammal inner ear and bring understanding on what's going on in human ear.   From what I know, we seem to know some, but there are lots of other things we don't know.

That's why I think it's better to buckle down and just to try to work out our stuffs.  Meaning, try to come up with more interesting mathematical model, try to come up with more computational efficient method.   Those .... I think are meaningful discussion.   As for Kurzweil, no doubt he is a very smart guy, but at least on ASR, I don't think he knows what he talks about.

Of course, I am certainly not the only person who complains Kurzweil.  Look at how Douglas Hofstadter's criticism:

"It’s as if you took a lot of very good food and some dog excrement and blended it all up so that you can't possibly figure out what's good or bad. It's an intimate mixture of rubbish and good ideas, and it's very hard to disentangle the two, because these are smart people; they're not stupid."

Sounds like very reasonable to me.


Acoustic Score and Its Signness

Over the years, I got asked about why acoustic score could be a positive number all the time. That occasionally lead to a kind of big confusion from beginner users. So I write this article as a kind of road sign for people.

Acoustic score per frame is essentially the log value of continuous distribution function (cdf). In Sphinx's case, the cdf is a multi-dimensional Gaussian distribution. So Acoustic score per phone will be the log likelihood of the phone HMM. You can extend this definition to word HMM.

For the sign. If you think of a discrete probability distribution, then this acoustic score thingy should always be negative. (Because log of a decimal number is negative.) In the case of a Gaussian distribution though, when the standard deviation is small, it is possible that the value is larger than 1. (Also see this link). So those are the time you will see a positive value.

One thing you might feel disharmonious is the magnitude of the likelihood you see. Bear in mind, Sphinx2 or Sphinx3 are using a very small logbase. We are also talking about a multi-dimensional Gaussian distribution. It makes numerical values become bigger.


Also see:
My answer on the Sphinx Forum

Speech Recognition vs SETI

If you track news of CMUSphinx, you may notice that the Sourceforge guys start to distribute data through BitTorrent (link).

That's a great move.   One of the issues in ASR is the lack of machine power in training.  To make a blunt example, it's possible to squeeze extra performance by searching for the best training parameters.    Not to say a lot of modern training techniques take some time to run.

I do recommend all of your help the effort.  Again, me not involved at all, just feel that it is a great cause.

Of course, things in ASR are never easy so I want to give two subtle points about the whole distributed approach of training.

Improvement over the years?

First question you may ask,  now does that mean, ASR can be like project such as SETI, which would automatically improve over the years?  Not yet, ASR still has its unique challenge.

The major part I would see is how we can incrementally increase phonetically-balanced transcribed audio.   Note that it is not just audio, but transcribed audio.  Meaning: someone needs to go to listen to the audio, spending 5-10 times real time to write down what the audio really say word-by-word.   All these transcriptions need to clean up and in a certain format.  

This is what Voxforge tries to achieve and it's not a small undertaking.   Of course, comparing to the speed of the industry development, the progress is still too slow.  The last time I heard, Google was training their acoustic model with 38000 hours of data.   A WSJ corpus is a toy task compared to it.

Now, thinking in this way, let's say if we want to build the best recognizer through open source, what is the bottleneck?  I bet the answer doesn't lie on machine power,  whether we have enough transcribed data would be the key.   So that's something to ponder about.

(Added Dec 27, 2012, on the part of initial amount of data, Nickolay corrected me saying that amount of data from Sphinx is already in terms of 10000 hours.   That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".

So it does sound like Sphinx has the amount of data which rivals commercial companies.  I am very interested to see how we can train an acoustic model with that amount of data.)

We build it, they will come?

ASR is always shrouded with misunderstanding.   Many believe it is a solved problem, many believe it is a unsolvable problem.   99.99% of world population are uninformed about the problem.   
I bet a lot of people would be fascinated by SETI, which .... Woa .... allows you to communicated to unknown intelligent sentients in the universe.   Rather than on ASR, which ..... Em ..... basically many regards as a source of satires/parodies these days.  
So here comes another problem,  the public don't understand ASR enough to see it as an important problem.   When you think about this more,  this is a dangerous situation.   Right now, couple of big companies control the resource of training cutting-edge speech recognizers.    So let's say in the futre everyone needs to talk with a machine in a daily basis.   These big companies would be so powerful that they can control our daily life.   To be honest to you, this thought haunts me from time to time.   
I believe we should continue to spread information on how to properly use an ASR system.  At the same time, continue to build application to show case ASR and let the public understand its inner-working.   Unlike subatomic particle physics,  HMM-based ASR is not that difficult to understand.   On this part, I appreciate all the effort which are done by developers of CMUSphinx, HTK, Julius and all other open source speech recognition projects.


I love the recent move of Sphinx spreading acoustic data using BitTorrent,  it is another step to work towards a self-improving speech recognition system.   There are still things we need to ponder in the open source speech community.   I mentioned a couple, feel free to bring up more in the comment section. 


What should be our focus in Speech Recognition?

If you worked in a business long enough, you start to understand better what type of work are important.   As many things in life, sometimes the answer is not trivial.   For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder.  For many, the decoder, the speech recognizer, oorr the "computer thing" which does all the magic of recognizing speech, is the core of the works.

Indeed, working on a decoding is loads of fun.  If you a fresh new programmer, it is also one of those experiences, which will teach you a lot of things.   Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format issues, system issues.   You will also touch a fairly advanced dynamic programming problem : writing a Viterbi search.   For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in house recognizers.

Writing a speech recognizer is also very important when you need to deal with speed issues.  You might want to fit a recognizer into your mobile phone or even just a chip.   For example, in Voci, an FPGA-based speech recognizer was built to cater ultra-high speed speech recognition (faster than 100xRT).   All these system-related issues required understanding of the decoder itself.

This makes speech recognition an exciting field similar to chess programming.  Indeed the two fields are very similar in terms of code development.   Both require deep understanding of search as a process. Both have eccentric figures popped up and popped out.   There are more stories untold than told in both field.  Both are fascinating fields.

There is one thing which speech recognition and chess programming are very different.   This is also a subtle point which even many savvy and resourceful programmers don't understand.   That is how each of these machines derived their knowledge sources.   In speech, you need to have a good model to do decent jobs for your task.   In chess though, most programmers can proceed to write a chess player with the standard piece values.   As a result, there is a process before anyone can use a speech recognizer.  That is to first train an acoustic model and a language model.  

The same decoder, having different acoustic models and language models, can give users perceptions ranging from a total trainwreck to the a modern wonder, borderline to magic.   Those are the true ingredients of our magic.   Unlike magicians though, we are never shy to talk about these secret ingredients.   They are just too subtle to discuss.   For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition.  It usually results in absolutely 10% performance gap."  Those are not party talks.  Those are talks when you want to have no friends. 🙂

In both type of tasks, one require learning different from a programming training.   10 years ago, those skill are generally carried by "Mathematician, Statistician or People who specialized in Machine Learning".   Now there is new name : "Big Data Analyst".

Before I stopped, let me mention another type of work, which are important in real life.  What I want to say is transcription and dictionary work.   If you asked some high-minded researchers in the field, they will almost think those are not interesting work.   Yet, in real-life, you can almost always learn something new and improve your systems based on them.  May be I will talk about this more next time.

The Grand Janitor

I am back

Hi Guys,
     I stopped using this blog for 3 years and now I decide to claim it.  My life as the "Grand Janitor" of the Sphinx software is very memorable for me.   It was unfortunate for me to stop the blog and had only write on-line in other venues. 

     I will start to blog more about speech recognition and natural language processing.  This is probably time for me to read up again.  My another blog, Random Thought of Arthur Chan, will solely put my thought on other random things in the world.

     In any case, it's good to meet all of you again.  We'll have fun.

The Grand Janitor