Category Archives: sphinxtrain

Grand Janitor's Blog February and March Summary

I haven't been very productive at blogging over the last two months. Here are a couple of worthwhile blog posts and news items you might find interesting.

GJB also reached the milestone of 100 posts. Thanks for your support!

Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

Future Windows Phone speech recognition revealed in leaked video

Google Keep

Feel free to connect with me on Plus, LinkedIn and Twitter.


Good ASR Training System

The term "speech recognition" is a misnomer.

Why do I say that? I explained this point in an old article, "Do We Have True Open Source Dictation?", which I wrote back in 2005. To recap, a speech recognition system consists of a Viterbi decoder, an acoustic model and a language model. You could have a great recognizer but poor accuracy if the models are bad.

So how does that relate to you, a developer or researcher of ASR? The answer is that ASR training tools and processes usually become a core asset in your inventory. In fact, I can tell you that when I need to work on acoustic model training, I need to work on it full time, and it is one of the most absorbing things I have done.

Why is that? When you look at the development cycles of all the tasks in making an ASR system, training is the longest. With the wrong tool, it is also the most error-prone. As an example, just take a look at the Sphinx forum: you will find that the majority of non-Sphinx4 questions are related to training, such as "I can't find the path of a certain file" or "the whole thing just got stuck in the middle".

Many first-time users complain with frustration (and occasionally disgust) about why it is so difficult to train a model. The frustration probably stems from the expectation, "Shouldn't it be well-defined?" The answer is again no. In fact, how a model should be built (or even which model should be built) is always subject to change. It is also one of the two subfields in ASR, at least IMO, which are still creative and exciting in research. (The other one: noisy speech recognition.) What an open source software suite like Sphinx provides is a standard recipe for everyone.

That said, is there something we can do better in an ASR training system? There is a lot, I would say. Here are some suggestions:

  1. A training experiment should be easy to create, move and copy.
  2. A training experiment should be exactly repeatable when the input is exactly the same.
  3. The experimenter should be able to verify the correctness of an experiment before it starts.

Ease of Creation of an Experiment

You can think of a training experiment as a recipe ... though not exactly. When we humans read a recipe and carry it out again, we make mistakes.

But hey! We are working with computers. Why should we need to fix small things in the recipe at all? So in a computer experiment, what we are shooting for is an experiment which can be easily created and moved around.

What does that mean? It basically means there should be no executables which are hardwired to one particular environment. There should also be no hardware or architecture assumptions in the training implementation. If there are, they should be hidden.
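To make the idea concrete, here is a minimal sketch of an experiment whose files are all described relative to one root directory, so that copying or moving the directory is enough to relocate the experiment. The layout, the file names and the function resolve_experiment are all hypothetical; none of them come from SphinxTrain.

```python
from pathlib import Path

# Hypothetical experiment layout: every entry is relative to the experiment root.
EXPERIMENT_FILES = {
    "dictionary": "etc/task.dic",
    "transcripts": "etc/train.transcription",
    "features": "feat",
    "models": "model_parameters",
}


def resolve_experiment(root, files=EXPERIMENT_FILES):
    """Anchor every configured path at the experiment root.

    Absolute entries are rejected, because they would tie the experiment
    to one particular machine or directory layout.
    """
    root = Path(root)
    resolved = {}
    for name, rel in files.items():
        if Path(rel).is_absolute():
            raise ValueError(f"{name} is hardwired to an absolute path: {rel}")
        resolved[name] = root / rel
    return resolved


if __name__ == "__main__":
    # The same recipe works unchanged for two copies of the experiment.
    for root in ("exp_a", "exp_b"):
        print(resolve_experiment(root)["dictionary"])
```

The design choice is simply that the only environment-specific value is the root, and it is passed in at run time rather than stored in the recipe.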

Repeatability of an Experiment

Similar to the previous point, should we allow differences between runs of the same training experiment? The answer should be no. So one trick you hear from experienced experimenters is that you should keep the seeds of the random number generators. This avoids minute differences between different runs of an experiment.
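As a small illustration of keeping the seed, the sketch below initializes some Gaussian means from a private, explicitly seeded generator. The function init_means and its parameters are made up for this example; the point is only that the same seed gives exactly the same starting point.

```python
import random


def init_means(num_gaussians, dim, seed):
    """Draw initial means from a generator that is seeded explicitly.

    Using a private random.Random instance (rather than the module-level
    generator) keeps the result independent of any other code that also
    draws random numbers.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_gaussians)]


if __name__ == "__main__":
    run_1 = init_means(4, 3, seed=1234)
    run_2 = init_means(4, 3, seed=1234)
    print(run_1 == run_2)  # same seed, same initialization
```

Recording the seed alongside the rest of the experiment settings is what makes the run repeatable later.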

Here someone would ask: shouldn't we allow a small difference between experiments? We are essentially running a physical experiment.

I think that's a valid approach. But to be conscientious, you would then want to run each experiment many times and calculate an average. That is my problem with this thinking: repeating an experiment makes everything slower. For example, what if you see a 1% absolute drop in your experiment? Do you let it go? Or do you just chalk it up to noise? Once you allow yourself not to repeat an experiment exactly, there will be tons of questions you should ask.

Verifiability of an Experiment

Running an experiment sometimes takes days. How do you make sure the run is correct? I would say you should first make sure trivial issues such as missing paths, missing models or incorrect settings are screened out and corrected.

One of my bosses used to make a strong point of asking me to verify input paths every single time. This is a good habit and it pays dividends. Can we do similar things in our training systems?
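A check like this can be automated before the long run starts. Below is a minimal sketch, assuming the experiment's inputs are collected in a dictionary from a descriptive name to a path; the function name and the example entries are hypothetical.

```python
import os


def find_missing_inputs(paths):
    """Return the names of all configured inputs that do not exist on disk."""
    return sorted(name for name, path in paths.items()
                  if not os.path.exists(path))


if __name__ == "__main__":
    # Hypothetical inputs of a training run.
    inputs = {
        "working directory": os.getcwd(),
        "dictionary": os.path.join(os.getcwd(), "etc", "task.dic"),
    }
    missing = find_missing_inputs(inputs)
    if missing:
        raise SystemExit("Missing inputs, not starting: " + ", ".join(missing))
    print("All inputs found.")
```

Failing in the first second, with a list of every missing input, is much cheaper than failing a day into the experiment.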

Apply it on Open Source

What I mentioned above is highly influenced by my experience in the field. I personally found that sites which have great infrastructure for transferring experiments between developers are the strongest and fastest growing.

Putting all these ideas into open source would mean a very different development paradigm. For example, do we want to have a centralized experiment database which everyone shares? Do we want to put common resources, such as existing parameterized inputs (e.g. MFCC), somewhere in common for everyone? Should we integrate the retrieval of these inputs into our experiment recipe?

Those are important questions. In a way, I think they are the most important type of questions we should ask in open source, because despite much volunteer effort, the performance of open source models still lags behind that of commercial models. I believe it is an issue of methodology.

sphinxbase 0.8 and SphinxTrain 1.08

I have done some analysis of sphinxbase 0.8 and SphinxTrain 1.08 to understand whether they are very different from sphinxbase 0.7 and SphinxTrain 1.0.7. I don't see big differences, but it is still a good idea to upgrade.

  • (sphinxbase) The bug in cmd_ln.c is a must-fix. Basically, the freeing was wrong for all ARG_STRING_LIST arguments. So chances are you will get a crash when someone specifies a wrong argument name and cmd_ln.c forces an exit, because this eventually leads to a call to cmd_ln_val_free.
  • (sphinxbase) There were also a couple of changes in the fsg tools. Mostly, I feel those are rewrites.
  • (SphinxTrain) SphinxTrain, on the other hand, has new tools such as the g2p framework. Those are mostly OpenFst-based tools, and it is worthwhile to have them in SphinxTrain.

One final note here: CMUSphinx in general is starting to turn to C++. C++ is something I love and hate. It can sometimes be nasty, especially when dealing with compilation. At the same time, using C to emulate OOP features is quite painful. So my hope is that we use a subset of C++ which is robust across different compiler versions.

January 2013 Write-up

Miraculously, I still have some momentum for this blog and I have kept on the daily posting schedule.

Here is a write-up for this month. Feel free to look at this post on how I plan to write this blog:

Some Vision of the Grand Janitor's Blog

Sphinx' Tutorials and Commentaries

SphinxTrain1.07's bw:

Commentary on SphinxTrain1.07's bw (Part I)
Commentary on SphinxTrain1.07's bw (Part II)

Part I describes the high-level layout, and Part II describes how the state network is built.

Acoustic Score and Its Sign
Subword Units and their Occasionally Non-Trivial Meanings

Sphinx 4 from a C background : Material for Learning


Goldman Sachs not Liable
Aaron Swartz......

Other writings:

On Kurzweil : a perspective of an ASR practitioner



Commentary on SphinxTrain1.07's bw (Part I)

I was once asked by a fellow who didn't work in ASR how the estimation algorithms in speech recognition work. That's a tough question to answer. At a high level, you can explain how the properties of the Q function allow an increase of likelihood after each re-estimation. You can also explain how the Baum-Welch algorithm is derived from the Q function, how the estimation algorithm can eventually be expressed in Greek letters, and naturally link it to the alpha and beta passes. Finally, you can also just write down the re-estimation formulae and let people puzzle over them.

All are options, but none was what I or the fellow wanted. We hoped that somehow there would be one single point of entry to understanding the Baum-Welch algorithm. Once we got there, we would grok it. Unfortunately, that's impossible for Baum-Welch. It is really a rather deep algorithm, which takes several types of understanding.

In this post, I narrow the discussion down to just Baum-Welch in SphinxTrain1.07. I will focus on the coding aspect of the program. Two points of emphasis here:

  1. How is Baum-Welch for speech recognition in practice different from the theory?
  2. How are different parts of the theory mapped to the actual code?

In fact, in Part I, I will just describe the high-level organization of the Baum-Welch algorithm in bw. I assume readers know what the Baum-Welch algorithm is. In Part II, I will focus on the low-level functions such as next_utt_states, forward, backward_update and accum_global.

(At a certain point, I might write another post just to describe Baum-Welch. This will help my math as well......)

Unlike the post on setting up Sphinx4, this is not a post for the faint of heart. So skip the post if you feel dizzy.

Some Fun Reading Before You Move On

Before you move on, here are three references which I found highly useful for understanding Baum-Welch in speech recognition. They are:

  1. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Chapter 6, "Theory and Implementation of Hidden Markov Models", p.343 and p.369. Comments: In general, the whole of Chapter 6 is essential for understanding HMM-based speech recognition. There is also a full derivation of the re-estimation formulae. Unfortunately, it only gives the formula without proof for the most important case, in which the observation probability is expressed as a Gaussian Mixture Model (GMM).
  2. X. D. Huang, A. Acero and H. W. Hon, Spoken Language Processing. Chapter 8, "Hidden Markov Models". Comments: written by one of the authors of Sphinx 2, Xuedong Huang, the book is a very good review of spoken language systems. Chapter 8 in particular has detailed proofs of all the re-estimation algorithms. If you want to choose one book on speech recognition to buy, this is the one. The only thing I would say is that the typeface of the Greek letters is kind of ugly.
  3. X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition. Chapters 5, 6 and 7. Comments: again by Xuedong Huang. I think these are the most detailed derivations of continuous HMMs I have ever seen in a book. (There might be good papers I don't know of.) Related to Sphinx, it has a chapter on semi-continuous HMMs (SCHMM) as well.

bw also features rather nice code comments. My understanding is that it was mostly written by Eric Thayer, who put great effort into pulling multiple fragmented codebases together to form the embryo of today's SphinxTrain.

Baum-Welch algorithm in Theory

Now that you have read the references, at a very high level, what does a program for Baum-Welch estimation do? To summarize, we can think of it this way:

* For each training utterance

  1. Build an HMM-network to represent it. 
  2. Run Forward Algorithm
  3. Run Backward Algorithm
  4. From the forward/backward passes, calculate the statistics (or counts, or posterior scores, depending on what you call them).

* After we run through all utterances, estimate the parameters (means, variances, transition probabilities, etc.) from the statistics.

Sounds simple?  I actually skipped a lot of details here but this is the big picture.
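The outline above can be sketched in a few dozen lines for a toy case. What follows is my own minimal illustration, not code from bw: it assumes a discrete-observation HMM (so there are no Gaussians), a single fixed fully connected model instead of a per-utterance network, and no scaling or beam. All the names (forward, backward, baum_welch_iteration, pi, A, B) are chosen for this example only.

```python
import math


def forward(obs, pi, A, B):
    """alpha[t][j]: probability of obs[0..t] and being in state j at time t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_iteration(utterances, pi, A, B):
    """One pass over all utterances, then one re-estimation.

    Returns the new (pi, A, B) and the total log-likelihood of the data
    under the OLD parameters.
    """
    n, m = len(pi), len(B[0])
    init = [0.0] * n                       # accumulators (the "statistics")
    trans = [[0.0] * n for _ in range(n)]
    emit = [[0.0] * m for _ in range(n)]
    log_lik = 0.0
    for obs in utterances:
        alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
        lik = sum(alpha[-1])
        log_lik += math.log(lik)
        for t, o in enumerate(obs):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / lik   # state posterior
                emit[i][o] += gamma
                if t == 0:
                    init[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):                   # transition posterior
                        trans[i][j] += (alpha[t][i] * A[i][j]
                                        * B[j][obs[t + 1]] * beta[t + 1][j] / lik)
    new_pi = [c / len(utterances) for c in init]
    new_A = [[c / sum(row) for c in row] for row in trans]
    new_B = [[c / sum(row) for c in row] for row in emit]
    return new_pi, new_A, new_B, log_lik


if __name__ == "__main__":
    pi, A = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.5, 0.5], [0.1, 0.9]]
    data = [[0, 1, 1, 0], [1, 1, 0]]
    for _ in range(5):
        pi, A, B, ll = baum_welch_iteration(data, pi, A, B)
        print(ll)  # should never decrease from one iteration to the next
```

The inner loop over utterances only accumulates; the parameters change once, after all the data has been seen. That split is the same one you will find in the outline above.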

Baum-Welch algorithm in Practice

There are several practical concerns when doing Baum-Welch. These are particularly important when it is implemented for speech recognition.

  1. Scaling of alpha/beta scores: this is explained in detail in Rabiner's book (p.365-p.368). The gist is that when you calculate the alpha or beta scores, they can easily go beyond the floating-point range of any machine (in practice, they underflow). It turns out there is a beautiful way to avoid this problem.
  2. Multiple observation sequences, or streams: this is a little bit archaic, but some researchers still work on having multiple streams of features for speech recognition (e.g. combining the lip signal and the speech signal).
  3. Speed: most implementations you see are not based on a full run of the forward or backward algorithm. To improve speed, most implementations use a beam to constrain the search.
  4. Different types of states: you can have HMM states which are emitting or non-emitting. How you handle them complicates the implementation.

You will see that bw takes care of a lot of these practical issues. In my opinion, that is the reason why the whole program is a little bit bloated (5000 lines in total).
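To give a feel for the first concern, here is a minimal sketch of a scaled forward pass, again for a toy discrete HMM of my own rather than anything taken from bw. At every frame the alpha values are renormalized to sum to one, and the logarithms of the normalizers are added up; their sum is the log-likelihood of the whole sequence. The function name and the model values are assumptions made for this example.

```python
import math


def scaled_forward_loglik(obs, pi, A, B):
    """Log-likelihood of obs, computed with per-frame scaling of alpha."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)                  # probability mass of this frame
        alpha = [a / scale for a in alpha]  # keep alpha in a safe range
        log_lik += math.log(scale)
    return log_lik


if __name__ == "__main__":
    pi, A = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.5, 0.5], [0.1, 0.9]]
    long_utterance = [0, 1] * 5000          # 10000 frames
    # The raw likelihood is far too small for a double, but its log is fine.
    print(scaled_forward_loglik(long_utterance, pi, A, B))
```

Without the renormalization, the alpha values of an utterance this long would underflow to zero long before the last frame.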

Tracing of bw: High Level

Now we get to the code level. I will follow the version of bw from SphinxTrain1.07. I don't see many changes in 1.08 yet, so this tracing is very likely to be applicable for a while.

I will organize the tracing in this way. First I will go through the high-level flow of each function. Then I will describe some interesting places in the code by line number.

main() - src/programs/bw/main.c

This is the high level of main.c (Line 1903 to 1914)

main ->
  if it is not MMIE training:

We will first go forward with main_initialize():

-> initialize the model inventory, which essentially means 4 things: means (mean), variances (var), transition matrices (tmat) and mixture weights (mixw).
-> a lexicon (or .... a dictionary)
-> model definition
-> feature vector type
-> lda (lda matrix)
-> cmn and agc
-> svspec
-> codebook definition (ts2cb)
-> mllr for SAT type of training.

Interesting codes:

  • Line 359: extract the diagonal of the matrix if we specified a full one.
  • Line 380: precompute the Gaussian distribution. That usually means the constant term, and it almost always makes the code faster.
  • Line 390: specify the type of re-estimation.
  • Line 481: checkpointing. I never use this one, but it seems to be something that allows the training to restart if the network fails.
  • Line 546 to 577: apply the MLLR transformation to the models, for SAT-type training.

(Note to myself: got to understand why svspec was included in the code.)


Now let's go to main_reestimate. In a nutshell, this is where the looping occurs.

-> for every utterance
   -> corpus_get_generic_featurevec (get feature vector (mfc))
   -> feat_s2mfc2feat_live (get the feature vector)
   -> corpus_get_sent (get the transcription)
   -> corpus_get_phseg (get the phoneme segmentation)
   -> pdumpfn (open a dump file; this is more related to Dave's constrained Baum-Welch research)
   -> next_utt_states() /* create the state sequence network. One key function in bw. I will trace it in more detail. */
   -> if it is not in Viterbi mode
      -> baum_welch_update() /* i.e. Baum-Welch update */
      otherwise
      -> viterbi() /* i.e. Viterbi update */

Interesting code:

  • Line 702: several parameters of the algorithm are initialized, including abeam, bbeam, spthres and maxuttlen.
    • abeam and bbeam are essentially the beam sizes which control the forward and backward algorithms.
    • maxuttlen: this controls how long an utterance can be and still be read in. These days, I seldom see this parameter set to something other than 0 (i.e. no limit).
    • spthres: "State posterior probability floor for reestimation.  States below this are not counted". Another parameter I seldom use......
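To illustrate what a beam does in a forward pass, here is a minimal sketch of pruning the active states of one frame. I am assuming here that the beam is a ratio relative to the best score in the frame; check forward.c for what bw actually does with abeam. The function prune_to_beam and the state names are hypothetical.

```python
def prune_to_beam(alpha, beam):
    """Keep only the states whose score is within `beam` of the best one.

    alpha maps a state name to its (unscaled) forward score for one frame;
    beam is a ratio between 0 and 1, so a smaller beam keeps more states.
    """
    threshold = max(alpha.values()) * beam
    return {state: score for state, score in alpha.items()
            if score >= threshold}


if __name__ == "__main__":
    frame_scores = {"AH_1": 1.0, "AH_2": 1e-3, "SIL_1": 1e-120}
    # Only the states that survive are expanded in the next frame.
    print(sorted(prune_to_beam(frame_scores, beam=1e-100)))
```

The trade-off is the usual one: a tighter beam makes the pass faster, but it can drop a state that would have mattered later in the utterance.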


-> for each utterance
   -> forward() (forward.c) (<- This is where the forward algorithm is. Very complicated: 700 lines.)
   -> if -outphsegdir is specified, dump a phoneme segmentation.
   -> backward_update() (backward.c) (Do the backward algorithm and also update the accumulator.)
      (<- This is even more complicated: 1400 lines.)
   -> accum_global() (Global accumulation.)
      (<- Sort of long, but it's more trivial than forward and backward.)

Now this is the last function for today. If you look back at the section "Baum-Welch algorithm in Theory", you will notice how the procedure is mapped onto Sphinx. Several thoughts:

  1. One thing to notice is that forward, backward_update and accum_global need to work together. But you have to realize that all of these are long, complicated functions. So, as with next_utt_states, I will leave the discussion to another post.
  2. Another comment here: backward_update not only carries out the backward pass. It also updates the statistics.

Conclusion of this post

In this post, I went through a high-level description of the Baum-Welch algorithm as well as how the theory is mapped onto the C codebase. In my next post (will there be one?), I will focus on the low-level functions such as next_utt_states, forward, backward_update and accum_global.

Feel free to comment.

Me and CMU Sphinx

As I update this blog more frequently, I have noticed that more and more people are directed here. Naturally, there are many questions about some of my past work, for example, "Are you still answering questions in the CMUSphinx forum?", and more generally requests for certain tutorials. So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I probably don't hope to be a maintainer-at-large any more. Nick has proven himself to be the most awesome maintainer in our history. Under his stewardship, Sphinx has prospered in the last couple of years. That's what I hoped for and that's what we all hoped for.

So for that reason, you probably won't see me much in the forum answering questions. Rather, I will spend most of my time implementing, experimenting and getting some work done.

There are many things that ought to be done in Sphinx. Here is my top 5 list:

  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKbook-like documentation: i.e. Hieroglyphs.
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of the Sphinx software, such as using a WFST-based approach.

This is not a small undertaking, so I am planning to spend a lot of time relearning the software. Yes, you heard it right: learning the software. In general, I found myself very ignorant of a lot of the software details of Sphinx in 2012. There have been many changes. The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain. On PocketSphinx and Sphinx4, I need to learn a lot.

That is why in this blog, you will see a lot of posts about my status in learning a certain piece of speech recognition software. Some could be minute details. I share them because people can figure out a lot by going through my status. From time to time, I will also pull these posts together to form a tutorial post.

Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot of things about programming, languages and other technology-related stuff. Part of it is that I am interested in many things. The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages. This might also attract a wider audience in the future.

In any case, I hope I can keep going. And I hope you enjoy my articles!

The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx. It is an intermittent effort because I am wearing several hats at Voci.

I find myself going back to Sphinx more and more often. Being more experienced, I have started to approach the project again carefully: tracing code, taking notes and understanding what has been going on. It was a humbling experience - speech recognition has changed, and Sphinx has improved more than I could have imagined.

The life of maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences I have had in my life. Unfortunately, not many of my friends know that. So Sphinx and I were pretty much disconnected for several years.

So, what I plan to do is to reconnect. One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.

Let's start small: I just restarted the RSS feeds. You may also see some cross links to my other two blogs: Cumulomaniac, a site on my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.

Both sites are in Chinese. I have been actively working on them and try to update them weekly.

So why do I keep this blog then? Obviously the reason is speech recognition. Though, I have started to realize that doing speech recognition involves much more than just writing a speech recognizer. So from now on, I will post on other topics such as natural language processing and video processing, as well as a lot of low-level programming information.

This will mean it is a very niche blog. Could I keep up at all? I don't know. As with my other blogs, I will try to write around 50 messages first and see if there is any momentum.