The Grand Janitor Blog V3 – Page 26 – Speech Recognition, Artificial Intelligence, and Random Musing of Arthur Chan

Aaron Swartz……

I have been busy lately, so I haven’t posted much.

The only story I find worthwhile lately is the death of Aaron Swartz. I am late to the whole event, but the Wikipedia page is an amazing source on what’s really going on.

You should also watch Lessig’s comments here.

Btw, last time I checked, “Maximum Likelihood from Incomplete Data via the EM algorithm” costs $29. I am glad many wrote great tutorials on EM but not so happy when I couldn’t read the original paper.

Arthur

BNF KLEE Matt Might

Readings from Jan 15, 2013

Post author By grandjanitor
Post date January 15, 2013

The language of languages by Matt Might: this cleared things up for me on several syntactic aspects of a few types of grammars.

KLEE : Hmm. This is very interesting. Given a function, this tool can automatically generate test cases with high coverage. I think this could improve the current test suite of the C-based Sphinx.

Arthur

FAMA-C John Regehr KLEE

John Regehr on Hiding Bugs

Post author By grandjanitor
Post date January 15, 2013

Hiding Bugs from Branch Coverage : Great article on how bugs in a function can hide from branch coverage. All the code is in C. It also mentions the use of advanced tools such as KLEE and Frama-C.

Arthur

chess conditional random field Harvard log MIT Python Readings

Readings for Jan 14, 2013

Post author By grandjanitor
Post date January 14, 2013

Approximation relating lg, ln, and log10 by John Cook
Digits in Power of 2 by John Cook
Rise and Fall of Third Normal Form by John Cook
The Crown Game Affair : Fascinating account of how cheating in chess was detected and caught.
Introduction to Conditional Random Field : Eerily similar to HMM; that’s perhaps why there were many “cute” ideas published on it in the past.
Stuff Harvard People Like by Ed Chen: Entertaining post. I do know some people from MIT, and the study fits the stereotype I recognize. For Harvard, not as much; some are really mellow fellows. Also, not everyone in a school is savvy with computers. So the samples Ed Chen collected may or may not be representative. (To his credit, he named all his assumptions in his post.)

Arthur

3.5 3.7 C++ java sphinx 3.X Sphinx 4 Sphinx 4 from C background tutorial

Sphinx 4 from a C background : Material for Learning Sphinx 4

Post author By grandjanitor
Post date January 10, 2013

I have been quite focused on SphinxTrain lately, so I haven’t touched Sphinx 4 for a while. As I have one afternoon which I can use at leisure (not really), I decided to take a look at some basic material again.

Sphinx-4, as a recognizer, is an interesting piece of software to me, a recovering recognizer programmer. It seems remote but oddly familiar. It is sort of a dreamland for experimenting with different decoding strategies. From Sphinx 3.5 to 3.7, I tried to make Sphinx 3.X more general in terms of search. That effort was tough, mainly because the programs were in C. As you might guess, those modifications required reinventing a lot of good software engineering mechanisms (such as classes).

Sphinx-4 is now widely studied. There are many projects using Sphinx-4, and its architecture is analyzed on many sites. That’s why I have an abundance of material for learning the recognizer. (Yay! 🙂 )

Here are the top pages on my radar now, and I am going to study them in detail:

Introduction : What Sphinx-4 is, and how to use it.
Sphinx 4 Application Programmer Guide : What excites me is the model switching capability. I also love the way the current recognizer can be linked to multiple languages.
Configuration Manager : That’s an interesting part as well. This is a recognizer in which every component is configurable. Is it a good thing? There are pros and cons to a hierarchical configuration system. But most of the time, I think that’s a better way than a flat command-line structure.
Instrumentation : How to test the decoder, with examples on TIDIGITS and many more databases.
FAQ: Here is a list of questions which make me curious.
The White Paper : Extremely illuminating. I also appreciate the scholarship when they compare different versions of Sphinx.
The 2003 paper: I haven’t gone through this one yet but it’s certainly something I want to check out.

Arthur

infosec stylometric analysis

Stylometric Analysis

Just read a story about how machine learning can be used to identify anonymous hackers or crackers. I am actually not too surprised by that capability. According to the article, the accuracy is currently around 66% to 80%, which I take to be the detection rate. There is a 5000-word minimum, which is an attempt to make the word distribution estimation robust. (Backoff strategy comes to mind immediately……)

The final limitation of this method is that the text has to be in English. That ….. I don’t think is such a big deal. No one can communicate effectively if they mix up multiple languages in text. If they do, it probably means the mixture of languages is a kind of language itself.

Thinking deeper, this will probably prompt intelligent hackers to speak less in public forums. They will probably use secure channels instead to establish a connection for obtaining information.

Will hackers be deterred by this new method? I guess not. Many in the hacker community are well aware that the latest machine learning methods can detect their existence. For example, usage of IRC is already one particular signature that one can detect on the network.

In any case, the topic of how NLP can be applied to infosec always fascinates me. I hope I can work on it someday.

Arthur

Uncategorized

No readings for today …… but…..

Post author By grandjanitor
Post date January 10, 2013

It’s a digression but check this out: This is a heart-breaking story of comedian Anthony Griffin.

Arthur

backward algorithm Baum-Welch algorithm bw Forward algorithm L Rabiner open source speech recognition scaling source code speed sphinxtrain SphinxTrain1.07 streams X. D. Huang

Commentary on SphinxTrain1.07’s bw (Part I)

Post author By grandjanitor
Post date January 10, 2013

I was once asked by a fellow who didn’t work in ASR how the estimation algorithms in speech recognition work. That’s a tough question to answer. At a high level, you can explain how the properties of the Q function allow an increase in likelihood after each re-estimation. You can also explain how the Baum-Welch algorithm is derived from the Q function, how the estimation algorithm can eventually be expressed in Greek letters, and naturally link it to the alpha and beta passes. Finally, you can also just write down the re-estimation formulae and let people puzzle over them.

All are options, but this is not what I wanted, nor what the fellow wanted. We hoped that somehow there is one single point of entry to understanding the Baum-Welch algorithm. Once we get there, we will grok. Unfortunately, that’s impossible for Baum-Welch. It is really a rather deep algorithm, which takes several types of understanding.

In this post, I narrow down the discussion to just Baum-Welch in SphinxTrain1.07. I will focus on the coding aspect of the program. Two points of emphasis here:

How is Baum-Welch for speech recognition in practice different from the theory?
How are different parts of the theory mapped to the actual code?

In fact, in Part I, I will just describe the high-level organization of the Baum-Welch algorithm in bw. I assume the readers know what the Baum-Welch algorithm is. In Part II, I will focus on the low-level functions such as next_utt_states, forward, backward_update, accum_global.

(At a certain point, I might write another post just to describe Baum-Welch. This will help my math as well……)

Unlike the post on setting up Sphinx4, this is not a post for the faint of heart. So skip the post if you feel dizzy.

Some Fun Reading Before You Move On

Before you move on, here are three references which I found highly useful to understand Baum-Welch in speech recognition. They are

L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Chapter 6. “Theory and Implementation of Hidden Markov Model.” p.343 and p.369. Comments: In general, the whole of Chapter 6 is essential for understanding HMM-based speech recognition. There is also a full derivation of the re-estimation formulae. Unfortunately, it only gives the formula without proof for the most important case, in which the observation probability is expressed as a Gaussian Mixture Model (GMM).
X. D. Huang, A. Acero and H. W. Hon, Spoken Language Processing. Chapter 8. “Hidden Markov Models” Comments: written by one of the authors of Sphinx 2, Xuedong Huang, the book is a very good review of spoken language systems. Chapter 8 in particular has detailed proofs of all the re-estimation algorithms. If you want to choose one book to buy on speech recognition, this is the one. The only thing I would say is that the typeface of the Greek letters is kind of ugly.
X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition. Chapters 5, 6, 7. Comments: again by Xuedong Huang. I think these are the most detailed derivations I have ever seen on continuous HMMs in books. (There might be good papers I don’t know of.) Related to Sphinx, it has a chapter on semi-continuous HMM (SCHMM) as well.

bw also features rather nice code commentary. My understanding is that it was mostly written by Eric Thayer, who put great effort into pulling multiple fragmented codebases together to form the embryo of today’s SphinxTrain.

Baum-Welch algorithm in Theory

Now that you have read the references, at a very high level, what does a Baum-Welch estimation program do? To summarize, we can think of it this way:

* For each training utterance

Build an HMM-network to represent it.
Run Forward Algorithm
Run Backward Algorithm
From the Forward/Backward passes, calculate the statistics (or counts, or posterior scores, depending on what you call them).

* After we run through all utterances, estimate the parameters (means, variances, transition probability etc….) from the statistics.

Sounds simple? I actually skipped a lot of details here but this is the big picture.
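To make the outline above concrete, here is a minimal sketch of one Baum-Welch iteration for a toy discrete HMM. It is an illustration under loud assumptions and not how bw does it: every utterance shares one fully connected HMM (so there is no per-utterance network building), observations are discrete symbols rather than Gaussian mixtures, the initial distribution pi is held fixed, and there is no scaling or pruning. All function names are made up for this example.

```python
import math


def forward(pi, A, B, obs):
    """alpha[t][i] = P(o_0..o_t, state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1}..o_{T-1} | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def accumulate(pi, A, B, obs, trans_acc, emit_acc):
    """Add one utterance's posterior counts to the accumulators."""
    n = len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    like = sum(alpha[-1])
    for t, o in enumerate(obs):
        for i in range(n):
            # State occupancy posterior (gamma) feeds the emission counts.
            emit_acc[i][o] += alpha[t][i] * beta[t][i] / like
            if t + 1 < len(obs):
                for j in range(n):
                    # Transition posterior (xi) feeds the transition counts.
                    trans_acc[i][j] += (alpha[t][i] * A[i][j]
                                        * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / like)
    return math.log(like)


def baum_welch_iteration(pi, A, B, utterances):
    """One pass over all utterances, then re-estimate A and B (pi is fixed)."""
    n, m = len(A), len(B[0])
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    log_like = sum(accumulate(pi, A, B, obs, trans_acc, emit_acc)
                   for obs in utterances)
    # Re-estimation is just normalizing each row of accumulated counts.
    new_A = [[x / sum(row) for x in row] for row in trans_acc]
    new_B = [[x / sum(row) for x in row] for row in emit_acc]
    return new_A, new_B, log_like
```

Calling baum_welch_iteration repeatedly, feeding the returned matrices back in, should give a total log-likelihood that never decreases from one iteration to the next.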

Baum-Welch algorithm in Practice

There are several concerns when doing Baum-Welch in practice. These are particularly important when it is implemented for speech recognition.

Scaling of alpha/beta scores : this is explained in detail in Rabiner’s book (p.365-p.368). The gist is that when you calculate the alpha or beta scores, they can easily underflow the floating-point range of any machine. It turns out there is a beautiful way to avoid this problem.
Multiple observation sequences, or streams: this is a little bit archaic, but there are still some researchers working on multiple streams of features for speech recognition (e.g. combining the lip signal and the speech signal).
Speed: most implementations you see are not based on a full run of the forward or backward algorithm. To improve speed, most implementations use a beam to constrain the search.
Different types of states: you can have HMM states which are emitting or non-emitting. How you handle them complicates the implementation.

You will see bw takes care of a lot of these practical issues. In my opinion, that is the reason why the whole program is a little bit bloated (5000 lines in total).
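The scaling issue is easy to reproduce. Below is a small sketch, assuming a toy discrete HMM and made-up function names, that compares a plain forward pass with one that renormalizes the alpha scores at every frame and accumulates the logs of the normalizers. It only illustrates the general idea of per-frame scaling and is not taken from bw.

```python
import math


def forward_naive(pi, A, B, obs):
    """Plain forward pass; returns P(obs), which underflows on long inputs."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def forward_scaled(pi, A, B, obs):
    """Forward pass with per-frame rescaling; returns log P(obs)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)                   # normalizer for this frame
        log_like += math.log(scale)          # log P(obs) = sum of log normalizers
        alpha = [a / scale for a in alpha]   # rescaled alphas sum to one
    return log_like
```

On a short sequence the two agree (exponentiate the scaled result to compare). On a sequence of a few thousand frames the naive likelihood collapses toward zero in double precision, while the scaled version still returns a finite log-likelihood.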

Tracing of bw: High Level

Now we get to the code level. I will follow the version of bw from SphinxTrain1.07. I don’t see many changes in 1.08 yet, so this tracing is very likely to be applicable for a while.

I will organize the tracing this way. First I will go through the high-level flow of the program. Then I will describe some interesting places in the code by line number.

main() – src/programs/bw/main.c

This is the high level of main.c (Line 1903 to 1914)

 main ->   
    main_initialize()  
    if it is not mmie training  
      main_reestimate()  
    else  
      main_reestimate_mmi()

main_initialize()
We will first go forward with main_initialize()

 main_initialize  
 -> initialize the model inventory, essentially means 4 things, means (mean) variances (var), transition matrices (tmat), mixture weights (mixw).  
 -> a lexicon (or .... a dictionary)  
 -> model definition  
 -> feature vector type  
 -> lda (lda matrix)  
 -> cmn and agc  
 -> svspec  
 -> codebook definition (ts2cb)  
 -> mllr for SAT type of training.

Interesting code:

Line 359: extract the diagonal matrix if we specified a full one.
Line 380: precompute parts of the Gaussian distribution. That usually means the constant term, and it almost always makes the code faster.
Line 390: specify what type of reestimation.
Line 481: checkpoint. I never use this one, but it seems like something that allows the training to restart if the network fails.
Line 546 to 577: do MLLR transformation for models: for SAT type of training.

(Note to myself: got to understand why svspec was included in the code.)

main_reestimate()

Now let’s go to main_reestimate. In a nutshell, this is where the looping occurs.

      -> for every utterance.   
        -> corpus_get_generic_featurevec (get feature vector (mfc))  
        -> feat_s2mfc2feat_live (get the feature vector)  
        -> corpus_get_sent (get the transcription)  
        -> corpus_get_phseg (get the phoneme segmentation.)  
        -> pdumpfn (open a dump file, this is more related to Dave's constrained Baum-Welch research)  
        -> next_utt_states() /* create the state sequence network. One key function in bw. I will trace it in more detail. */ 
        -> if it is not in Viterbi mode.  
         -> baum_welch_update()  /* i.e. Baum-Welch update */
         else   
         -> viterbi()  /* i.e. Viterbi update */

Interesting code:

Line 702: several parameters for the algorithm are initialized, including abeam, bbeam, spthres, maxuttlen.

abeam and bbeam are essentially the beam sizes which control the forward and backward algorithms.
maxuttlen: this controls how large an utterance will be read in. These days, I seldom see this parameter set to something other than 0 (i.e. no limit).
spthres: “State posterior probability floor for reestimation. States below this are not counted”. Another parameter I seldom use……
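For readers who have not seen a beam in a forward pass, here is a generic sketch of the idea. It assumes a toy discrete HMM and a relative beam (a state survives if its score is at least beam times the best score in the same frame). The function names and the pruning rule are assumptions for illustration. How bw actually applies abeam and bbeam is something to check in forward.c and backward.c.

```python
def prune(scores, beam):
    """Keep states whose score is within a factor `beam` of the best one."""
    if not scores:
        return scores
    threshold = beam * max(scores.values())
    return {s: v for s, v in scores.items() if v >= threshold}


def forward_with_beam(pi, A, B, obs, beam):
    """Forward pass that only expands states surviving the beam.

    beam is in [0, 1]: 0 keeps every state, values near 1 keep only the best.
    Returns (approximate likelihood, largest number of active states).
    """
    n = len(pi)
    active = prune({i: pi[i] * B[i][obs[0]] for i in range(n)}, beam)
    widest = len(active)
    for o in obs[1:]:
        nxt = {}
        for i, a in active.items():          # pruned states are never expanded
            for j in range(n):
                nxt[j] = nxt.get(j, 0.0) + a * A[i][j] * B[j][o]
        active = prune(nxt, beam)
        widest = max(widest, len(active))
    return sum(active.values()), widest
```

With beam set to 0 this reduces to the full forward pass. A tighter beam does less work per frame but can only lower the returned likelihood, since pruning drops non-negative terms from the sum.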

baum_welch_update()

 baum_welch_update()  
 -> for each utterance
       forward() (forward.c) (<- This is where the forward algorithm is. Very complicated: 700 lines)  
       if -outphsegdir is specified, dump a phoneme segmentation.  
       backward_update() (backward.c: does the backward algorithm and also updates the accumulator)  
           (<- This is even more complicated: 1400 lines)  
 -> accum_global() (Global accumulation.)   
         (<- Sort of long, but it's more trivial than forward and backward.)

Now this is the last function for today. If you look back at the section “Baum-Welch algorithm in Theory”, you will notice how the procedure is mapped onto Sphinx. Several thoughts:

One thing to notice is that forward, backward_update and accum_global need to work together. But you have got to realize all of these are long, complicated functions. So, as with next_utt_states, I will leave the discussion to another post.
Another comment here: backward_update not only carries out the backward pass. It also updates the statistics.

Conclusion of this post

In this post, I went through a high-level description of the Baum-Welch algorithm as well as how the theory is mapped onto the C codebase. In my next post (will there be one?), I will focus on the low-level functions such as next_utt_states, forward, backward_update and accum_global.

Feel free to comment.

Arthur

Berkeley license cmn cmu sphinx global GNU license. HCopy HTK HVite local sphinxbase sphinx_fe Thought wave2feat

Two Views of Time-Signal : Global vs Local

Post author By grandjanitor
Post date January 8, 2013

As I have been working on Sphinx at work and have started to chat with Nicholay more, one thing I realize is that several frequently used components of Sphinx need rethinking. Here is one example related to my recent work.

A speech signal, or …… in general a time signal, can be processed in two ways: you either process it as a whole, or you process it in blocks. The former you can call a global view; the latter you can call a local view. Of course, there are many other names (block/utterance, block/whole), but essentially the terminology means the same thing.

Most of the time, global and local processing give the same result. So you can simply say that the two types of processing are equivalent.

Of course, that is no longer so once you start an operation which uses information from the whole signal. For a very simple example, look at cepstral mean normalization (CMN). Implementing CMN in block mode is certainly an interesting problem. For example, how do you estimate the mean if you only have a running window? When you think about it a little bit, you will realize it is not a trivial problem. That’s probably why there are still papers on cepstral mean normalization.
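A small sketch may help show why the two views diverge for CMN. It assumes frames are plain lists of floats and uses the simplest possible local estimator, a cumulative mean over the frames seen so far. The names cmn_global and RunningCMN are made up for this example, and real block-mode CMN implementations typically use more careful estimators (priors, decaying windows).

```python
def cmn_global(frames):
    """Global view: subtract the mean computed over the whole utterance."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


class RunningCMN:
    """Local view: the mean is estimated only from the frames seen so far."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def process_block(self, block):
        for f in block:                       # update the running statistics
            self.total = [s + x for s, x in zip(self.total, f)]
            self.count += 1
        mean = [s / self.count for s in self.total]
        return [[x - m for x, m in zip(f, mean)] for f in block]
```

Feeding an utterance through RunningCMN block by block gives early frames that are normalized with a mean that has not yet seen the rest of the signal, so they differ from the global result. Only the final block, by which point the running mean equals the global mean, comes out the same.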

Translated to Sphinx: if you look at sphinxbase’s sphinx_fe, you will realize that the implementation is based on the local mode, i.e. every once in a while, samples are consumed, processed and written to disk. There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode and bw) will do this on their own.

It’s all good, though there is an interesting consequence: what the SF guys mean by “feature” is really all the processing that can be done in the local sense, rather than the “feature” you see in either the decoders or bw.

This special point of view is ingrained in sphinxbase/sphinxX/sphinxtrain (Sphinx4? Not sure yet.). This is quite different from what you will find in HTK, which sees the feature vector as the vector used in Viterbi decoding.

That brings me to another point. If you look deeper, HTK tools such as HVite/HCopy are highly abstract. Each tool was designed to take care of its own problem well. HCopy is really meant to provide just the features, whereas HVite just runs the Viterbi algorithm on a bunch of features. It’s nothing complicated. On the other hand, Sphinx is more speech-oriented. In that world, life is more intertwined. That’s perhaps why you seldom hear of people using Sphinx to do research other than speech recognition. You can, on the other hand, do other machine learning tasks in HTK.

Which view is better? If you ask me, I wish both HTK and Sphinx were released under a Berkeley license. Tons of real-life work could be saved because each covers some useful functionality.

Given that only one of them is released under a liberal license (Sphinx), maybe what we need is to absorb some design paradigms from HTK. For example, HTK has a sense of organizing data as pipes. That is something SphinxTrain can use. This will enhance the work of Unix users, who usually contribute the most in the community.

I also hope that eventually there will be good clones of HTK tools made available under a Berkeley/GNU license. Not that I don’t like the status quo: I am happy to read the code of HTK (unlike the time before 2.2……). But as you work in the industry for a while, you see that many are actually using both Sphinx and HTK to solve their speech research-related problems. Of course, many of these guys (if they are honest) need to come up with extra development time to port some HTK functions into their own production systems. Not tough, but you will wonder whether the time could be better spent ……

Arthur

hours Readings Thought todo list

Readings at Jan 8, 2012

Testing Redux by Vivek Halder

Comment: I second Vivek. In a large-scale project, having no testing is a project-killing act. One factor to consider: in real life, the mythical 100X productive programmers are rarely seen. Even then, these programmers can make a mistake or two. Therefore, not having any automatic testing for a group is a very bad thing.

On the other hand, should an individual programmer always follow automatic testing? Yes and no. Yes in the sense that you should always write a test for your program. No in the sense that you shouldn’t believe testing will make your program automatically correct.

Bring back the 40-hour work week by Sara Robinson

Comment: very well said. I hope this more hours = more work done crap can end soon.

Todon’t by Jeff Atwood

Comment: I like todo lists a lot, but sometimes it takes me a while to organize them. They also grow very quickly. A good suggestion here is that you should not only add to your list but also keep changing priorities. One of the purposes of a todo list is to record your life, but it has nothing to do with how you move forward with your life.

Arthur