Category Archives: ASR

The Search Programmer

In every team of any serious ASR or NLP company, there has to be one person who is the "search guy". Not search as in search engine, but search as in searching in AI. The equivalent of a chess engine programmer in a chess program, or perhaps of an engine specialist for race cars. Usually this person has three important roles:

  1. Program the engine,
  2. Add new features to the engine,
  3. Maintain the engine through its lifetime.

This job is usually taken by someone with a title such as "Speech Scientist" or "Speech Engineer". They usually have blended skills of both programming and statistics. It's a tough job, but it's also a highly satisfying one, because the success of a company usually depends on whether features can be integrated quickly. That gives the "search guy" a mythical status even among data scientists - a search engineer needs to work effectively with two teams: one with a mostly research background in statistics and machine learning, the other with a mostly programming background, whose job is to churn out pseudocode, implementation and architecture diagrams daily.

I tend to think the power of the "search guy" is both understated and overstated.

It's understated because there are many companies which only use other people's engines, so they can't quite get the edge of customizing an engine. Those which use an open source implementation are better off, because they preserve the right to change the engine, which gives them leverage on intellectual property and trade secrets. Those who buy a commercial engine from a large company enjoy good performance for a few years, but then get squeezed by the huge price of upgrading and constrained by an overly restrictive license.

(Shameless promotion here: Voci is an exception. We are very nice to our clients. Check us out here. 🙂 )

It's overstated because the skill of programming a search is nothing but a series of logical exercises. The pity is that programming a search algorithm, or a dynamic program (DP) in general, takes many kinds of expertise. The knowledge can only be found sporadically across different subjects. Some might learn the basics of DP in an algorithms book such as CLRS, but mere knowledge of programming doesn't give you insight into how to debug an issue in the search. You do need a solid understanding of the domain knowledge (such as POS tagging and speech recognition) and the theory (such as machine learning) to get the job done correctly.
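
To give a feel for the "textbook" kind of DP, here is a minimal Python sketch of word-level edit distance, which is also the usual basis for scoring WER. The function names are my own for illustration and are not taken from any toolkit. A real decoder search is far more involved than this; the point is only how small the core recurrence is.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    # dist[i][j] = cheapest way to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)
    return dist[len(ref)][len(hyp)]


def word_error_rate(ref, hyp):
    """Errors divided by the number of reference words (ref must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
print(word_error_rate(reference, hypothesis))
```

Writing the recurrence is the easy part. Knowing why a hypothesis was pruned, or why a score looks wrong, is where the domain knowledge comes in.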


Different HMMSets in HTK

HTK was my first speech toolkit. It's fun to use and you can learn a lot about ASR by following the manual carefully and deliberately.

If you are still using HMM/GMM technology (interesting but why?), here is a thread from a year ago on why there are different HMM types in HTK.

One thought I have: when I first started out in ASR, I seldom thought of any human elements in a design. Of course, that had to do with the difficulty of understanding all the terminology and algorithms.

Yet ASR research has a lot to do with rival groups coming up with different ideas, each betting against the others on the success of a certain technique.

So sometimes you would hope that competition makes technology finer. Yet a highly competitive environment only nurtures followers, rather than competitive loner groups such as Prof. Young's, or MSR (who AFAIK built the first working version of DNN-based ASR).



I'm a student who's looking into the HTK source code to get some idea
about practical implementation of HMMs. I have a question related to
the design choices of HTK.

AFAIK, the current working set of HMMs (HMMSet) has 4 types: plain,
shared, tied, discrete.
HMM sets with normal continuous emission densities are "plain" and
"shared", only difference being that some parameters are shared in the
latter. Sets with semi-continuous emission densities (shared Gaussian
pools for each stream) are called "tied" and discrete emission
densities are "discrete".

If someone uses HTK, isn't there a high chance of using only one of
these types? The usage of these types is probably mutually exclusive.
So my question is, why not have separate training and recognition
tools for continuous, semi-continuous and discrete HMM sets? Here are
some pros and cons of the current design I can think of, which of
course can be wrong:

- less code duplication
- simpler interface for the user

- more code complexity
- more contextual information required to read, more code jumps
- unused variables and memory, examples: vq and fv in struct
Observation, mixture alignment in discrete case

If I were to implement HMMs supporting all these emission densities,
what path should I follow? How feasible is it to use OOP principles to
create a better design? If so, why weren't they leveraged in HTK?

Warm regards,

(I trimmed out Mr. Neil Nelson's reply, which basically suggests people should use Kaldi instead.)

Max and Neil

I don’t usually respond to HTK questions, but this one was hard to resist.

I designed the first version of HTK in Cambridge in 1988 soon after moving from Manchester where I worked for a while on programming language and compiler design. I was a strong advocate of modular design, abstraction and OOP. However, at that time, C++ was a bit of a nightmare. There was little standardisation across operating systems and some implementations were very inefficient. As a result I decided that since HTK had to be very efficient and portable across platforms, it would be written in C, but the architecture would be modular and class like. Hence, header files look like class interfaces, and body files look like class method implementations.

When HTK was first designed, the “experts” in the US DARPA program had decided that continuous density HMMs would never scale and that discrete and semi-continuous HMMs were the way to go. I thought they were wrong, but decided to hedge my bets and built in support for all three - whilst at the same time taking care that the implementation of continuous densities was not compromised by the parallel support for discrete and semi-continuous. By 1993 the Cambridge group (and the LIMSI group in France) were demonstrating that continuous density HMMs were significantly better than the other modelling approaches. So although we tried to maintain support for different emission density models, in practice we only used continuous densities for all of our research in Cambridge.

It is a source of considerable astonishment to me that HTK is still in active use 25 years later. Of course a lot has been added over the years, but the basic architecture is little changed from the initial implementation. So I guess I got something right - but as Neil says, things have moved on and today there are good alternatives to HTK. Which is best depends on what you want to do with it!

Steve Young
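
To make the three kinds of emission density discussed in the thread concrete, here is a minimal Python sketch. All function and parameter names are hypothetical; this is not HTK's code or API, only an illustration of how the likelihood of one frame would be computed in each case, assuming diagonal-covariance Gaussians.

```python
import math


def diag_gauss_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    p = 1.0
    for xi, m, v in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return p


def continuous_likelihood(x, weights, gaussians):
    """Continuous ("plain"/"shared"): the state has its own mixture components."""
    return sum(w * diag_gauss_pdf(x, m, v) for w, (m, v) in zip(weights, gaussians))


def tied_likelihood(x, state_weights, pool):
    """Semi-continuous ("tied"): one shared Gaussian pool, per-state weights only."""
    # The pool scores depend only on the frame, so they can be reused by every state.
    pool_scores = [diag_gauss_pdf(x, m, v) for m, v in pool]
    return sum(w * s for w, s in zip(state_weights, pool_scores))


def discrete_likelihood(vq_index, table):
    """Discrete: the observation is a VQ codebook index, the state is a lookup table."""
    return table[vq_index]
```

Seen this way, the trade-off raised in the question is visible: a single observation type has to carry both a feature vector and a VQ index if one set of tools is to serve all three cases.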

ASR Software from Academic Research

Before Voci, I worked in three types of work environment: academic institutes, industrial research labs and startups (such as Speechworks and Scanscout). There is one common thread: all of these environments require a strong development background. My role has always been that of a craftsman, working under the supervision of scientists, researchers, principal investigators or company owners to produce software and achieve a certain goal. Of course, my specialty is ASR, which has always been my dearest topic.

There are many things you can say about software engineering in each environment. But in my view, producing quality software in academia is probably the toughest situation. I am not alone. See the post "Producing Good Software From Academia" by Prof. John Regehr. My observation is very similar to what Regehr suggests: career professors are very unlikely to have time to maintain a good software package. Most professors either hire research programmers (guys like me) or assign these coding tasks to graduate students.

What I want to add here is that both paths are difficult. Research staff, for example, have high mobility. The story usually goes like this: whoever maintains and develops a certain project's source code decides to join Google/Amazon/IBM or a startup. That obviously makes sense. Commercial companies pay way more than academic institutions. Research staff, just like other human beings, are driven by economic laws and seek better employment. (Or for me, more fun.)

If you assign the tasks to students, on the other hand, you face the problem of how to balance the load among all students. My observation of a couple of my bosses is that it is a very hard problem. It usually results in either

  1. a certain privileged student becoming the "golden boy/girl" of the group, who doesn't need to do any grunt work, or
  2. the group becoming more a company than a research group: most of the time, research is sidelined so that the whole group can survive.

One version I heard was that if you work in academic research, only around one-fourth of your time remains as "research time". It is sad but it's brutally true.

Another, deeper issue is the merit system: maintaining a codebase for the community is not rewarded and is sometimes just unappreciated. Writing papers, on the other hand, earns you accolades. This is a misfortune of our era: software maintenance is a very important discipline. People who are willing to spend this effort should be rewarded fairly, on an equal footing with researchers.


Learning ASR Through Coding

In a way, speech recognition is not that different from many other skills. You need a lot of practice to really grasp how certain things can be done. For example, if you have never written a Viterbi algorithm, it's probably hard for you to convince anybody that you know the search aspect of ASR. And if you have never written an estimation algorithm, then your knowledge of training will be shaky.
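
For readers who have never written one, here is a minimal Python sketch of the Viterbi algorithm over a toy HMM, working in the log domain. The function signature and the toy model (a two-state "Healthy"/"Fever" example with made-up numbers) are my own choices for illustration. A real decoder adds beam pruning, lexicon and language model structure, and much more.

```python
import math


def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return (best log score, best state path) for an observation sequence."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Best predecessor of s at this time step
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Toy model with made-up numbers, purely for illustration
states = ["Healthy", "Fever"]
log_init = to_log({"Healthy": 0.6, "Fever": 0.4})
log_trans = to_log({"Healthy": {"Healthy": 0.7, "Fever": 0.3},
                    "Fever": {"Healthy": 0.4, "Fever": 0.6}})
log_emit = to_log({"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
                   "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}})

best_score, best_path = viterbi(["normal", "cold", "dizzy"], states,
                                log_init, log_trans, log_emit)
print(best_path, math.exp(best_score))
```

Working in log probabilities avoids underflow on long utterances, which is one of those small things you only appreciate after writing the code yourself.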

Of course, this might be too hard for many people. Who will have time to write a decoder or a trainer? Fair enough. I guess the next best choice is to study the implementations of open source speech recognizers and try to modify them to fit your goal. In the process, you will start to build up understanding.

Which recognizers?

Let me say one thing to learners these days: you guys are lucky. When I tried to learn any ASR coding back in 2000, you had to join a certain speech lab and get a license for HTK before you could do any tracing and modification. Now you have many choices: HTK, Sphinx, Kaldi, Julius, the RWTH recognizer, etc. So which recognizers should you learn?

I will name three of them, HTK, Sphinx and Kaldi. Why?

Why HTK?

You want to learn HTK because it has a well-designed and coherent interface. It also has some of the best training technology: its ML training is assumption free and takes care of small issues such as silence/short pauses and multiple pronunciations. It has one-of-a-kind large vocabulary MMIE training. All of this work is very nice.

HTK also has a well-written tutorial. If you own either the TIMIT or the RM corpus, you can usually train the whole thing by following the instructions. While going through the tutorial, you gain a valuable understanding of the data structures commonly used in speech recognition.

Though I mainly worked on Sphinx, there were around 2-3 years of my life when I used HTK on a day-to-day basis. The manual itself is good literature that can teach you a lot of things. I believe many designers of speech recognizers actually learned from the HTK source code as well.

Why Sphinx?

"Because you work on Sphinx!"  True, I am biased in this case.   But I do have a legitimate reason to like Sphinx and claim that knowledge of Sphinx is more useful.

If you compare the history of HTK and Sphinx systems development, you will notice that HTK's very nice interface stemmed from the design effort in the Entropic stage, whereas Sphinx as a whole is more the work of PhD students, faculty and staff. In other words, Sphinx tools are more "hacky" than HTK. So as a project, you will find that Sphinx seems to be more incoherent, e.g. there are many recognizers, written in C or Java. The system itself seems to have a steep learning curve.

Very true, those are weaknesses. But one thing I like about Sphinx is that it is fertile ground for any enthusiast to play with. The free BSD license gives people a chance to incorporate any part of the code into their projects. As a result, historically, there are many companies which use Sphinx in their company code.

Before we go, you may ask "Which Sphinx?" If you ask 5 guys from the CMU Sphinx project, they will give you 5 different answers. But let me just offer my point of view, which I think is more related to learning. Nick, the current maintainer-at-large, and I once chatted, and he believed that the current Sphinx project should only support a triple: Sphinx4/pocketsphinx/SphinxTrain. I support that view. As a project, we should only support and maintain a focused number of components.

Though if you are an enthusiast, I highly recommend that you study more. Other than the triple, you will find Sphinx2 and Sphinx3 have their own interesting parts. Not all of them were transferred to Sphinx4 or pocketsphinx, but they are nonetheless fun code to read, e.g. how were triphones implemented in the different Sphinxes? Even with all the computation we have these days, I don't think full triphone expansion works for a real-time system. I believe in that aspect, 2 and 3 are very interesting.

Why Kaldi?

I am very excited about Kaldi. You can think of it as the "new HTK with the Sphinx license". The technology is strong and new, e.g. there is a branch which has all the deep neural network-based training. The recognizer is based on WFST. Best of all, all components are under very liberal licenses. So you can surely do many amazing things with it.

The only reason why I don't recommend it more is that it is still relatively new. Open source toolkits have strange lives: if they are supported by funding, they can live forever. If they are not, their fate is quite unpredictable. Take the MITLM toolkit: there was a year or so when the maintainer had left and there was no new maintainer. I am sure that during that time users needed to patch a thing or two. It is certainly a very interesting toolkit. (Because of its automatic optimization of mKN smoothing weights.) But sometimes it's hard to predict what will happen.

In a way, the development of Kaldi is rare: someone decided to share the best technology of our time with everybody. (WFST, SGMM, DNN are all examples.) I can only wish that the project goes on. If I could, maybe I would like to contribute a thing or two.



"The Grand Janitor Blog V2" Started

I moved "The Grand Janitor Blog" to WordPress.   Nothing much, Blogger is simply too constraining.  I don't like the theme.  I can't really customize a thing.  I can't put an ad there if I want to sell something.   So it was really annoying and it's time to change.

But then what's new with V2? First of all, I might blog more about how machine learning influences speech recognition. It's not news that machine learning is the source of how speech recognition works. It has always been like that. Many experts who work in speech recognition have deep knowledge of pattern recognition. When you look at their papers, you can sense that they have studied a certain machine learning method in great depth. So they can come up with creative ideas to improve the bottom line, which is the only thing I care about. I don't really care about the thousand APIs wrapped around a certain recognizer. I only care about the guts inside the decoder and the trainer. Those components are what really matter, but they are also the components which are most misunderstood.

So why now? It's obvious that the latest development of DBN-DNN (the "next big thing") is one factor. I was told in school (10+ years ago) that GMM is the state of the art. But things are rapidly changing: the work of Prof. Hinton has given a theoretical basis for making DBN-DNN training practically feasible. Enthusiasts, some rather sophisticated, are gathering around the Kaldi forum.

As for me, I will describe myself as a recovering ASR programmer. What does it mean? It means I need to grok ASR from theory to implementation. That's tough. I found myself studying again, dusting off my "Advanced Calculus" and trying to read and think creatively about texts such as "Connectionist Speech Recognition: A Hybrid Approach" by Bourlard and Morgan. (It's a highly entertaining technical text!) Perhaps more in the future. But when you try to drill a certain skill in your life, there has got to be a point where you need to go back to the basics. Re-think all the things you thought you knew. Re-prove all the proofs you thought you understood. That takes time and patience, but in the end it is also how you come up with new ideas.

As for the readers, sorry for never getting back to your suggested blog messages. You might be interested in a code trace of a certain part of Sphinx. You might be interested in how certain parts of the program work. I have kept a list of them and will probably write up something when I have time. No promises though; I have been very busy. And to be frank: everyone who works in ASR is busy. That perhaps explains why there are not many actively maintained blogs on speech recognition.

Of course, I will keep on posting on other diverse topics such as programming and technology.   I am still a geek.  I don't think anyone can change that. 🙂

In any case, feel free to connect with me and have fun with speech recognition!


Arthur Chan, "The Grand Janitor"

On Kurzweil: a perspective of an ASR practitioner

Many people who don't work in the fields of AI, NLP and ASR have heard of Kurzweil. To my surprise, many seem to give little thought to what he said and just follow his theory wholeheartedly.

In this post, I just want to comment on one single little thing, which is whether real-time speech-to-speech translation can be achieved in the 2010s. This is a very specific prediction from Kurzweil's book "The Singularity is Near".

My discussion will mainly focus on ASR first. So even though my examples below are not exactly pure ASR systems, I will skip the long-winded wording of saying "ASR of System X". And to be frank, MT and response systems probably go through a similarly tortuous development process anyway. So, please, don't tell me that "System X is actually ASR + Y"; that is sort of beside the point.

Oh well, you probably ask why bother, don't we have a demo of real-time speech-to-speech translation from Microsoft already?

True, but if you observe the demo carefully, it is based on read speech. I don't want to speculate much, but I doubt it is a generic language model which wasn't tuned to the lecture. In a nutshell, I don't believe it is something you can use in real life.

Let's think of a more real-life example: Siri. Are we really getting 100% correct responses now? (Not to boil it down to ASR WER ... yet.) I don't think so. Even with adaptation, I don't think Siri understands what I say every single time. Most of the time, I follow the unofficial command list of Siri and let it improve with adaptation ... but still, it is not perfect.

Why? It is the cold, hard reality: ASR is still not perfect, even with all the advancement in HMM-based speech recognition. All the key technologies we have known in the last 20 years: CMLLR, MMIE, MPE, fMPE, DMT, consensus networks, multipass decoding, SAT, SAT with MMIE or all the nicest front-ends, all the engineering. Nope, we do not yet have a feel-good accuracy. Indeed, human speech recognition is not at 0% WER either, but for some reason, the current state-of-the-art ASR performance is not getting there.

And Siri, we all know, is the state of the art.

Just to digress a little bit: most critics, when they write up to this point, will then lament that "oh, there is just some invisible barrier out there and humans just couldn't build a Tower of Babel, blabla....". I believe most of these "critics" have absolutely no idea what they are talking about. To identify these air-head critics, just see if they put "cognitive science" into their articles; then you know they have never worked on a real-life ASR system.

I, on the other hand, do believe one day we can get there. Why? Because when people work on one of these speech recognition evaluation tasks, many would tell you: given a certain test set and with enough time and gumption, you would be able to create a system without any errors. So to me, it is more an issue of whether some guys keep grinding on the problem, not a feasibility issue.

So where are we now in ASR? Recently, at ICASSP 2012, a Google paper trained on 87 thousand hours of data. That is probably the largest scale of training I know of. Oh well, where are we now? 10% WER, down from 12%. The last big experiment I know of before that is probably the 3000-hour experiment back in 2006-7. The Google authors are probably using a tougher test set. So the initial recognition rate was yet again lower.

Okay, speculation time. Let's assume that humans can always collect 10 times more labelled data every 6-7 years AND that we can do AM training on it. When will we get to, say, 2% WER on the current Google test set? If we just think of a very simple linear extrapolation (2 points of WER for every tenfold increase in data), it will take 4 * 6 years = 24 years to collect 10000 times more data (or about 870 million hours of data). So we are way, way past the 2010s deadline from Kurzweil.
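
The back-of-envelope arithmetic can be written out as a few lines of Python. The function name and parameters are hypothetical. The two key inputs are assumptions taken loosely from the numbers above: every tenfold increase in data buys 2 absolute points of WER (reading 12% to 10% as one such step), and each step takes 6 years. Note that 87 thousand hours times 10000 comes to about 870 million hours.

```python
import math


def extrapolate(start_wer, target_wer, gain_per_step, years_per_step,
                start_hours, data_factor=10):
    """Linear extrapolation: each data_factor-fold increase in data buys
    gain_per_step absolute WER points and takes years_per_step years."""
    steps = math.ceil((start_wer - target_wer) / gain_per_step)
    years = steps * years_per_step
    hours = start_hours * data_factor ** steps
    return steps, years, hours


steps, years, hours = extrapolate(
    start_wer=10, target_wer=2,  # WER in percent
    gain_per_step=2,             # assumed: 12% -> 10% counted as one step
    years_per_step=6,            # assumed: low end of "6-7 years"
    start_hours=87_000,          # size of the Google training set
)
print(steps, years, hours)
```

Change `gain_per_step` or `years_per_step` and the date moves a lot, which is exactly why this is only a wild speculation.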

And that's a wild speculation. Computation resources will probably work themselves out by that time. What I doubt most is whether the progress would be linear.

Of course, it might be non-linearly better too. But here is another point: it's not just about the training set, it's about the test set. If we truly want a recognizer to work for *everyone* on the planet, then the right thing to do is to test the recognizer on our whole population. If we can't, then we want to sample enough human speech to represent the Earth's population, and the current test set might not be representative enough. So it is possible that when we enlarge our test set, we find that the initial recognition rate has gone down again. And it seems to me our test sets are still at the stage of trying to mimic the human population.

My discussion so far has mostly been on the acoustic model. On the language model side, the problem will mainly be domain specificity. Also bear in mind that human language can evolve. So, say we want to build a system which builds a customized language model for each human being on the planet. At a particular moment in time, you might not be able to get enough data to build such a language model.

For me, the point of the whole discussion is that ASR is an engineering system, not some idealistic discussion topic. There will always be tradeoffs. You may say: "What if a certain technology Y emerges in the next 50 years?" I hear that a lot. Y could be quantum computing or brain simulation or brain-human interface or machine implementation of the brain. Guys ... those, I have got to admit, are very smart ideas of our time, and give them another 30-40 years and we might see something useful. For now, ASR really has nothing to do with them. I have never heard of a machine implementation of the auditory cortex, or even an accurate construction of the auditory pathway. Nor has there been easy progress in dissecting the mammalian inner ear to bring understanding of what's going on in the human ear. From what I know, we seem to know some things, but there are lots of other things we don't know.

That's why I think it's better to buckle down and just try to work out our stuff. Meaning, try to come up with more interesting mathematical models, try to come up with more computationally efficient methods. Those ... I think are meaningful discussions. As for Kurzweil, no doubt he is a very smart guy, but at least on ASR, I don't think he knows what he is talking about.

Of course, I am certainly not the only person who complains about Kurzweil. Look at Douglas Hofstadter's criticism:

"It’s as if you took a lot of very good food and some dog excrement and blended it all up so that you can't possibly figure out what's good or bad. It's an intimate mixture of rubbish and good ideas, and it's very hard to disentangle the two, because these are smart people; they're not stupid."

Sounds very reasonable to me.