
Commentary on SphinxTrain1.07’s bw (Part II : next_state_utt.c’s First Half)

I will go on with the analysis of bw.  In my last post, we looked at the high-level structure of the bw program.  So now we turn to the details of how the state lattice is built, or in other words, how next_state_utt.c works.

As always, here is the standard disclaimer: this is another long and super technical post which I only expect a small group of programmers with an esoteric interest in the Baum-Welch algorithm to read.  It's not for the faint of heart, but if you understand the whole thing, you will have some basic but genuine understanding of how the Baum-Welch algorithm works in real life.

The name of the module, next_state_utt.c, is quite unassuming, but it is an important key to understanding the Baum-Welch algorithm.  The way the state lattice is structured affects how parameter estimation works.  The same statement holds not only for Baum-Welch estimation but also for other estimation algorithms in speech.

But what is so difficult about coding the lattice?  Here are two points I think are worth pointing out:

  1. I guess an average programmer can probably work out a correct concatenation of all phone HMMs, if all phones are context-independent, in 3 days to 1 week.  But most advanced systems use context-dependent phones.  So you have to make sure that at the right time, the right triphone state is used.
  2. In Sphinx, things get a bit more complicated because there is a distinction between positions of triphones.  This is quite specific to Sphinx and you won't find it in systems such as HTK, so it further complicates the coding.  As you will see in the following discussion, Sphinx backs off from position-dependent triphone estimation in several cases.  In my view, the result might not be too different from a position-independent triphone system.  (It's certainly fun to read. 🙂 )
My goal here is to analyze next_state_utt.c: how it works, and which parts of it one could improve.

High-Level Layout of next_utt_state.c:

 next_utt_state.c  
-> mk_wordlist (mk_wordlist.c)
-> mk_phone_list (mk_phonelist.c)
-> cvt2triphone (cvt2triphone.c)
-> state_seq_make (state_seq_make.c)

Nothing fancy here.  We first make a word list (mk_wordlist), then make a phone list (mk_phone_list), then convert the phone list to triphones (cvt2triphone), then create the state sequence (state_seq_make).

Now before we go on, just at this level, you may already discover one issue with Sphinx's bw.  It uses a flat list to represent the phone models.  So if you want to model a certain word with multiple pronunciations, you probably can't do it without changing the code.
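For reference, dictionaries in the Sphinx format mark alternative pronunciations with a numeric suffix, something like this (the entries here are my own illustration):

 CLINTON     K L IY N T AX N
 CLINTON(2)  K L IH N T AX N

Since the transcript is expanded into one flat phone sequence, only the single pronunciation returned by the lexicon lookup gets used; you would need a word network to let the trainer choose between variants.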

Another important thing to note: just like many non-WFST systems, it is not that easy to make a simple phoneme system with Sphinx.  (HTK is an exception, but there you can always turn on a flag to expand the context; just look it up in the manual.)  Say you want your phoneme system to consist of one-phoneme words; then you would want your dictionary to look like this:

AA AA
AE AE
B B
.
.
.

But then, if a word is a phone, do you actually want to build a network of cross-word triphones?  You probably do if you are shooting for performance – all of the most accurate phoneme-based systems have some sort of context dependency in there.  (Brno's recognizer probably has some, though I don't really grok why it is so good.)

But if you want to do your own interesting experiments, this fixed behavior may not suit your appetite.  Maybe you just want a context-independent phone system for some toy experiments.  With bw, though, you are essentially always building a triphone system.  So it might or might not be what you like.

So if you really want to trigger the CI-model behavior, what can you do?  As we will see in my next post, in cvt2triphone.c, if the model definition file only specifies CI phones, then no triphone conversion will occur.  In a way, the system assumes that if you just train a CI model, you will get a CI model, but there is no explicit way to turn triphone expansion off.

mk_wordlist 

mk_wordlist is rather trivial:

 char **mk_wordlist(char *str,
                    uint32 *n_word)
 {
     uint32 n_w;
     uint32 i;
     char **wl;

     n_w = n_words(str);
     wl = ckd_calloc(n_w, sizeof(char *));

     /* strtok() tokenizes the line in place on spaces and tabs */
     wl[0] = strtok(str, " \t");
     for (i = 1; i < n_w; i++) {
         wl[i] = strtok(NULL, " \t");
     }
     assert(strtok(NULL, " \t") == NULL);

     *n_word = n_w;
     return wl;
 }

Given one line of transcript, mk_wordlist transforms it into an array of C strings.  Note that only the array of pointers is allocated: strtok chops the input buffer up in place, so each returned string points into the original transcript buffer.
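A minimal usage sketch (the header path is my assumption; what matters is that the input buffer must be writable, since strtok modifies it):

 #include <stdio.h>
 #include <s3/mk_wordlist.h>  /* assumed header; declares mk_wordlist() and pulls in uint32 */

 int main(void)
 {
     /* A writable buffer, not a pointer to a string literal:
      * strtok() will insert '\0' terminators into it in place. */
     char trans[] = "MY NAME IS CLINTON";
     uint32 n_word, i;
     char **wl = mk_wordlist(trans, &n_word);

     for (i = 0; i < n_word; i++)
         printf("word[%u] -> %s\n", i, wl[i]);
     return 0;
 }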

mk_phone_list

mk_phone_list is still fairly trivial, but there is a bit more detail.


1:  acmod_id_t *mk_phone_list(char **btw_mark,
2:                            uint32 *n_phone,
3:                            char **word,
4:                            uint32 n_word,
5:                            lexicon_t *lex)
6:  {
7:      uint32 n_p;
8:      lex_entry_t *e;
9:      char *btw;
10:     unsigned int i, j, k;
11:     acmod_id_t *p;
12:     /*
13:      * Determine the # of phones in the sequence.
14:      */
15:     for (i = 0, n_p = 0; i < n_word; i++) {
16:         e = lexicon_lookup(lex, word[i]);
17:         if (e == NULL) {
18:             E_WARN("Unable to lookup word '%s' in the lexicon\n", word[i]);
19:             return NULL;
20:         }
21:         n_p += e->phone_cnt;
22:     }
23:     /*
24:      * Allocate the phone sequence
25:      */
26:     p = ckd_calloc(n_p, sizeof(acmod_id_t));
27:     /*
28:      * Allocate the between word markers
29:      */
30:     btw = ckd_calloc(n_p, sizeof(char));
31:     for (i = 0, k = 0; i < n_word; i++) {   /* for each word */
32:         e = lexicon_lookup(lex, word[i]);
33:         if (e->phone_cnt == 0)              /* Ignore words with no pronunciation */
34:             continue;
35:         for (j = 0; j < e->phone_cnt-1; j++, k++) { /* for all but the last phone in the word */
36:             p[k] = e->ci_acmod_id[j];
37:         }
38:         p[k] = e->ci_acmod_id[j];           /* move over the last phone */
39:         btw[k] = TRUE;                      /* mark word boundary following
40:                                                kth phone */
41:         ++k;
42:     }
43:     *btw_mark = btw;
44:     *n_phone = n_p;
45:     assert(k == n_p);
46:     return p;
47: }

In lines 15-22, we first look up the pronunciation of each word.  (Remember, right now we can only look up one pronunciation per word.)  The function then allocates an array of phone IDs (of type acmod_id_t).

Now here is the special part of the code: besides the array of phones, it also allocates an array called the "between-word markers".  So what's the mechanism?  Let me give an example.

Suppose you have a transcript with the word sequence "MY NAME IS CLINTON".

       mk_wordlist would create

     word[0] -> MY
     word[1] -> NAME
     word[2] -> IS
     word[3] -> CLINTON

       mk_phone_list (with my best guess of the pronunciations) would create

     ph[0]  -> M     btw[0]  -> 0
     ph[1]  -> AY    btw[1]  -> 1
     ph[2]  -> N     btw[2]  -> 0
     ph[3]  -> EY    btw[3]  -> 0
     ph[4]  -> M     btw[4]  -> 1
     ph[5]  -> IY    btw[5]  -> 0
     ph[6]  -> S     btw[6]  -> 1
     ph[7]  -> K     btw[7]  -> 0
     ph[8]  -> L     btw[8]  -> 0
     ph[9]  -> IY    btw[9]  -> 0
     ph[10] -> N     btw[10] -> 0
     ph[11] -> T     btw[11] -> 0
     ph[12] -> AX    btw[12] -> 0
     ph[13] -> N     btw[13] -> 1

So essentially btw marks whether a word ends at a certain phone.

I believe such a representation is chosen for convenience: it facilitates determining whether a phone is at the beginning, in the middle, or at the end of a word, which is exactly the information position-dependent triphones need (see the sketch below).
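As an illustration, here is a small hypothetical helper (mine, not SphinxTrain's) showing how the btw array lets you recover a phone's word position – the begin/internal/end/single distinction the triphone conversion cares about:

 typedef enum { POS_BEGIN, POS_INTERNAL, POS_END, POS_SINGLE } word_posn_t;

 /* Phone k starts a word if it is the first phone overall or the
  * previous phone carried a between-word marker; it ends a word if
  * its own marker is set. */
 word_posn_t phone_position(const char *btw, unsigned int k)
 {
     int starts = (k == 0) || btw[k-1];
     int ends   = btw[k];

     if (starts && ends) return POS_SINGLE;
     if (starts)         return POS_BEGIN;
     if (ends)           return POS_END;
     return POS_INTERNAL;
 }

Run over the example above, this would tag ph[0] (M) as a word-begin phone, ph[1] (AY) as a word-end phone, ph[3] (EY) as word-internal, and so on.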

An alternative here would be to insert an optional silence between words.  According to the HTK Book, this usually reduces the WER slightly.  It would also seem reasonable to figure out the location of a phone using silences as markers.

A digression: acmod_id_t

 typedef unsigned char ci_acmod_id_t;

 #define NO_CI_ACMOD  (0xff)
 #define MAX_CI_ACMOD (0xfe)

 typedef uint32 acmod_id_t;

 #define NO_ACMOD  (0xffffffff)  /* The ID representing no acoustic model. */
 #define MAX_ACMOD (0xfffffffe)  /* The max ID possible */

Now, we used acmod_id_t in mk_phone_list, but what is it really?  The snippet above is a detour into acmod_id_ds.t ("ds" stands for data structure).

acmod_id_t is essentially a uint32, but the maximum usable ID, MAX_ACMOD, is 0xfffffffe, i.e. 2^32 - 2 rather than 2^32 - 1.  Why the missing one?  Because the all-ones value 0xffffffff is reserved as NO_ACMOD, the sentinel meaning "no acoustic model".  The same pattern explains why MAX_CI_ACMOD was defined as 0xfe.

The more interesting part: ci_acmod_id_t is only an unsigned char.  This is obviously another limitation: for some languages, one may want to express more than 255 phones.  (Why 255?  With 0xff reserved as NO_CI_ACMOD, the usable IDs are 0x00 through 0xfe, which is 255 values.)
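To make the limit concrete, here is a tiny standalone demonstration (my own toy code, not SphinxTrain's) of what happens past 255 CI phones:

 #include <stdio.h>

 typedef unsigned char ci_acmod_id_t;
 #define NO_CI_ACMOD (0xff)

 int main(void)
 {
     ci_acmod_id_t id  = 255;                  /* a 256th phone collides with the sentinel */
     ci_acmod_id_t big = (ci_acmod_id_t) 256;  /* a 257th phone silently wraps around */

     printf("id == NO_CI_ACMOD? %d\n", id == NO_CI_ACMOD);  /* prints 1 */
     printf("big = %u\n", (unsigned) big);                  /* prints 0 */
     return 0;
 }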

We'll meet acmod_set a little more later.  But let us move on first – sometimes code tracing is better motivated when you see the code before the data structure.  Many suggest otherwise: indeed, once you know the data, the code makes more sense.  But in practice, you will most likely read the code first and then need to connect things together.  So IMO both approaches have their merits in code tracing.

So far ….. and next post

To avoid cluttering a single post, I will stop here and put the rest of next_utt_states.c (cvt2triphone and state_seq_make) in another post.  But I want to summarize several things I have observed so far:
  1. The current bw doesn't create a word network, so it has trouble handling multiple pronunciations.
  2. The current bw automatically expands triphone contexts; there is no explicit way to turn this off.
  3. bw does not put optional silences into the network.
  4. bw doesn't work for more than 255 CI phones.
Btw, SphinxTrain1.08 has several changes which replace mk_wordlist with data structures from sphinxbase.  Those are encouraging changes.  If I have time, I will cover them.
Arthur


Commentary on SphinxTrain1.07’s bw (Part I)

I was once asked by a fellow who didn't work in ASR how the estimation algorithms in speech recognition work.  That's a tough question to answer.  At a high level, you can explain how the properties of the Q function guarantee an increase of likelihood after each re-estimation.  You can also explain how the Baum-Welch algorithm is derived from the Q function, how the estimation algorithm can eventually be expressed in Greek letters, and how it naturally links to the alpha and beta passes.  Finally, you can also just write down the re-estimation formulae and let people puzzle over them.

All are options, but none of them is what I wanted, nor what the fellow wanted.  We hoped that somehow there is a single point of entry to understanding the Baum-Welch algorithm: once we get there, we will grok.  Unfortunately, that's impossible for Baum-Welch.  It is really a rather deep algorithm, which takes several kinds of understanding.

In this post, I narrow the discussion down to just Baum-Welch in SphinxTrain1.07.  I will focus on the coding aspect of the program.  Two points of emphasis here:

  1. How Baum-Welch in practical speech recognition differs from the theory.
  2. How different parts of the theory map to the actual code.

In fact, in Part I, I will just describe the high-level organization of the Baum-Welch algorithm in bw.  I assume the readers know what the Baum-Welch algorithm is.  In Part II, I will focus on the low-level functions such as next_utt_state, forward, backward_update and accum_global.

(At a certain point, I might write another post just to describe Baum-Welch itself.  That would help my math as well……)

Unlike my post on setting up Sphinx4, this is not a post for the faint of heart.  So skip it if you feel dizzy.

Some Fun Reading Before You Move On

Before you move on, here are three references which I found highly useful for understanding Baum-Welch in speech recognition:

  1. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Chapter 6, "Theory and Implementation of Hidden Markov Model," p.343 and p.369.  Comments: in general, the whole Chapter 6 is essential for understanding HMM-based speech recognition.  There is also a full derivation of the re-estimation formulae.  Unfortunately, it only gives the formula without proof for the most important case, in which the observation probability is expressed as a Gaussian Mixture Model (GMM).
  2. X. D. Huang, A. Acero and H. W. Hon, Spoken Language Processing. Chapter 8, "Hidden Markov Models."  Comments: written by one of the authors of Sphinx 2, Xuedong Huang, the book is a very good review of spoken language systems.  Chapter 8 in particular has detailed proofs of all the re-estimation algorithms.  If you want to buy just one book on speech recognition, this is the one.  The only thing I would say is that the typeface of the Greek letters is kind of ugly.
  3. X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition. Chapters 5, 6, 7.  Comments: again by Xuedong Huang; I think these are the most detailed derivations of continuous HMMs I have ever seen in book form.  (There might be good papers I don't know of.)  Related to Sphinx, it has a chapter on semi-continuous HMMs (SCHMM) as well.

bw also features rather nice code commentary.  My understanding is that it was mostly written by Eric Thayer, who put great effort into pulling multiple fragmented codebases together to form the embryo of today's SphinxTrain.

Baum-Welch algorithm in Theory

Now that you have read the references: at a very high level, what does a Baum-Welch estimation program do?  To summarize, we can think of it this way:

* For each training utterance:

  1. Build an HMM network to represent it.
  2. Run the forward algorithm.
  3. Run the backward algorithm.
  4. From the forward/backward scores, calculate the statistics (or counts, or posterior scores, depending on what you call them).

* After we run through all the utterances, estimate the parameters (means, variances, transition probabilities etc.) from the statistics.
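For concreteness, here are two of the classic re-estimation formulae (see Rabiner's Chapter 6) that the accumulated statistics feed into, where \gamma_t(i) is the posterior probability of occupying state i at time t and \xi_t(i,j) the posterior of taking the transition from i to j, both computed from the alpha and beta scores:

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
    \qquad
    \bar{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j)}

In words: the new transition probability is the expected count of transitions from i to j divided by the expected count of visits to i, and the new mean is a posterior-weighted average of the observed feature vectors.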

Sounds simple?  I actually skipped a lot of details here, but this is the big picture.

Baum-Welch algorithm in Practice

There are several practical concerns when doing Baum-Welch in real life.  These are particularly important when it is implemented for speech recognition.
  1. Scaling of alpha/beta scores: this is explained in detail in Rabiner's book (p.365-p.368).  The gist is that when you calculate the alpha or beta scores, the products of probabilities can easily fall outside the numeric range of any machine.  It turns out there is a beautiful way to avoid this problem: normalize the scores at each frame and keep the logs of the scale factors (see the sketch after this list).
  2. Multiple observation sequences, or streams: this is a little bit archaic, but there is still some research on using multiple streams of features for speech recognition (e.g. combining the lip signal and the speech signal).
  3. Speed: most implementations you see are not based on a full run of the forward or backward algorithm.  To improve speed, most implementations use a beam to constrain the computation.
  4. Different types of states: you can have HMM states which are emitting or non-emitting.  How you handle them complicates the implementation.
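Here is a minimal sketch of the scaling trick for the forward pass, assuming a plain fully-connected HMM; all the names (a, b, alpha, log_scale) are mine, and the real forward.c is far more involved:

 #include <math.h>

 typedef unsigned int uint32;

 /* a[i][j]  : transition probability from state i to state j
  * b[t][j]  : output probability of state j for the frame at time t
  * alpha    : alpha[0][*] is assumed already initialized and normalized
  * log_scale: receives the log of each per-frame scale factor; summing
  *            these over t recovers the utterance log-likelihood      */
 void forward_scaled(double **alpha, double *log_scale,
                     double **a, double **b,
                     uint32 n_state, uint32 n_frame)
 {
     uint32 i, j, t;
     double sum;

     for (t = 1; t < n_frame; t++) {
         sum = 0.0;
         for (j = 0; j < n_state; j++) {
             alpha[t][j] = 0.0;
             for (i = 0; i < n_state; i++)
                 alpha[t][j] += alpha[t-1][i] * a[i][j];
             alpha[t][j] *= b[t][j];
             sum += alpha[t][j];
         }
         /* Normalize so the alphas at each frame sum to one; this keeps
          * the recursion inside machine range no matter how long the
          * utterance is. */
         for (j = 0; j < n_state; j++)
             alpha[t][j] /= sum;
         log_scale[t] = log(sum);
     }
 }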

You will see that bw has taken care of a lot of these practical issues.  In my opinion, that is why the whole program is a little bloated (around 5000 lines in total).


Tracing of bw: High Level

Now we get into the code level.  I will follow the version of bw from SphinxTrain1.07.  I don't see many changes in 1.08 yet, so this tracing is likely to stay applicable for a while.

I will organize the tracing this way: first I will go through the high-level flow, then I will describe some interesting places in the code by line number.

main() – src/programs/bw/main.c

This is the high-level structure of main.c (lines 1903 to 1914):

 main ->
     main_initialize()
     if it is not MMIE training
         main_reestimate()
     else
         main_reestimate_mmi()

main_initialize()
We will first go through main_initialize():

 main_initialize
 -> initialize the model inventory, which essentially means 4 things: means (mean),
    variances (var), transition matrices (tmat) and mixture weights (mixw)
 -> a lexicon (or .... a dictionary)
 -> model definition
 -> feature vector type
 -> lda (the LDA matrix)
 -> cmn and agc
 -> svspec
 -> codebook definition (ts2cb)
 -> mllr for SAT-type training

Interesting code:

  • Line 359: extract the diagonal matrices if full covariance matrices were specified.
  • Line 380: precompute the Gaussian distribution constants.  That usually means the normalization constant, and it almost always makes the code faster.
  • Line 390: specify what type of re-estimation to do.
  • Line 481: checkpointing.  I have never used it, but it seems to be something that allows training to restart if the network fails.
  • Lines 546 to 577: apply the MLLR transformation to the models, for SAT-type training.

(Note to self: I have got to understand why svspec was included in the code.)

main_reestimate()

Now let's go to main_reestimate.  In a nutshell, this is where the looping occurs.

 -> for every utterance
     -> corpus_get_generic_featurevec (get the feature vector (mfc))
         -> feat_s2mfc2feat_live (convert the mfc to the final feature vector)
     -> corpus_get_sent (get the transcription)
     -> corpus_get_phseg (get the phoneme segmentation)
     -> pdumpfn (open a dump file; this is more related to Dave's constrained Baum-Welch research)
     -> next_utt_states() /* create the state sequence network. One key function in bw. I will trace it in more detail. */
     -> if it is not in Viterbi mode
         -> baum_welch_update() /* i.e. Baum-Welch update */
        else
         -> viterbi() /* i.e. Viterbi update */

Interesting code:

  • Line 702: several parameters for the algorithm are initialized, including abeam, bbeam, spthres and maxuttlen.
    • abeam and bbeam are essentially the beam widths which constrain the forward and backward algorithms respectively.
    • maxuttlen: this controls the maximum length of utterance that will be read in.  These days, I seldom see this parameter set to anything other than 0 (i.e. no limit).
    • spthres: "State posterior probability floor for reestimation.  States below this are not counted."  Another parameter I seldom use……

baum_welch_update()

 baum_welch_update()
 -> for each utterance
        forward() (forward.c) (<- This is where the forward algorithm lives. Very complicated: 700 lines.)
        if -outphsegdir is specified, dump a phoneme segmentation
        backward_update() (backward.c: run the backward algorithm and also update the accumulators)
            (<- This is even more complicated: 1400 lines.)
     -> accum_global() (global accumulation)
            (<- Sort of long, but more straightforward than forward and backward.)

Now this is the last function for today.  If you look back at the section "Baum-Welch algorithm in Theory", you will notice how the procedure maps onto Sphinx.  Several thoughts:

  1. One thing to notice is that forward, backward_update and accum_global need to work together.  But you have to realize that all of these are long, complicated functions.  So, as with next_utt_state, I will separate the discussion into another post.
  2. Another comment here: backward_update not only carries out the backward pass; it also updates the accumulated statistics (a sketch of that computation follows below).
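To give a feel for what "updating the statistics" means, here is a highly simplified sketch of the state posterior computation.  This is my own paraphrase of the textbook math, not backward.c's actual code (which also handles scaling, beams and mixture components):

 #include <stddef.h>

 /* gamma[t][i]: posterior probability of occupying state i at frame t,
  * obtained by combining the forward and backward scores. */
 void compute_state_posteriors(double **alpha, double **beta,
                               double **gamma,
                               size_t n_state, size_t n_frame)
 {
     size_t i, t;
     double norm;

     for (t = 0; t < n_frame; t++) {
         norm = 0.0;
         for (i = 0; i < n_state; i++)
             norm += alpha[t][i] * beta[t][i];
         for (i = 0; i < n_state; i++)
             gamma[t][i] = alpha[t][i] * beta[t][i] / norm;
         /* bw would now weight frame t's feature vector by gamma[t][i]
          * and add it into the mean/variance accumulators for state i;
          * accum_global() later folds the per-utterance sums into the
          * global accumulators. */
     }
 }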

Conclusion of this post

In this post, I went through a high-level description of the Baum-Welch algorithm as well as how the theory maps onto the C codebase.  In my next post (will there be one?), I will focus on the low-level functions such as next_utt_state, forward, backward_update and accum_global.
Feel free to comment. 
Arthur