
Landscape of Open Source Speech Recognition Software (II : Simon)

Around December last year, I wrote an article on open source speech recognizers. I covered HTK, Kaldi and Julius. One thing you should know: just like CMUSphinx, all of these packages contain their own implementations of the Viterbi algorithm. So when you ask someone in the field of speech recognition, they will usually say the open source speech recognizers are Sphinx, HTK, Kaldi and Julius.

That's how I usually view speech recognition too. After years of working in the industry, though, I started to realize that the definition speech recognizer = Viterbi algorithm could be constraining. In fact, from the user's point of view, a good speech application system should be a combination of

a recognizer + good models + good GUI.

I like to call the former type of "speech recognizer" a "speech recognition engine" and the latter type a "speech recognition application". Both types of "speech recognizers" are worthwhile. From the users' point of view, differentiating them might be just a technicality.

While I am recovering as a speech recognition programmer (another bit of name throwing 🙂), one thing I notice is that much effort is going into writing "speech recognition applications". It is a good trend because most people from academia never spent much time writing good speech applications. And in open source, we badly need good applications such as dictation machines, IVR and C&C.

One effort which really impressed me is Simon. It is weird because most of the time I only care about engine-level software. But in the case of Simon, you can see that a couple of its features really solve real-life problems and are integrated into the bigger theme of open source speech recognition.

  • In 0.4.0, Simon started to integrate with Sphinx. So if someone wants to develop it commercially, they can.
  • The Simon team also intentionally built context switching into the application, which is good work as well. In general, if you always use a huge dictionary, you are just over-recognizing words in a certain context.
  • Last but not least, I like the fact that it integrates with Voxforge. Voxforge is the open source answer to the large speech databases of commercial speech companies. So integration with Voxforge will ensure an increasing amount of data for your application.
So kudos to the Simon team! I believe this is the right kind of thinking to start a good speech application.
Arthur

Commentary on SphinxTrain1.07's bw (Part I)

I was once asked by a fellow who didn't work in ASR how the estimation algorithms in speech recognition work. That's a tough question to answer. At a high level, you can explain how the properties of the Q function allow an increase of likelihood after each re-estimation. You can also explain how the Baum-Welch algorithm is derived from the Q function, how the estimation algorithm can eventually be expressed in Greek letters, and naturally link it to the alpha and beta passes. Finally, you can also just write down the re-estimation formulae and let people puzzle over them.

All are options, but this is not what I wanted, nor what the fellow wanted. We hoped that somehow there is one single point of entry to understanding the Baum-Welch algorithm. Once we get there, we will grok it. Unfortunately, that's impossible for Baum-Welch. It is really a rather deep algorithm, which takes several types of understanding.

In this post, I narrow down the discussion to just Baum-Welch in SphinxTrain1.07.  I will focus on the coding aspect of the program.   Two stresses here:

  1. How is Baum-Welch for speech recognition in practice different from the theory?
  2. How are different parts of the theory mapped to the actual code?

In fact, in Part I, I will just describe the high-level organization of the Baum-Welch algorithm in bw. I assume the readers know what the Baum-Welch algorithm is. In Part II, I will focus on the low-level functions such as next_utt_states, forward, backward_update and accum_global.

(At a certain point, I might write another post just to describe Baum-Welch. This will help my math as well......)

Unlike the post on setting up Sphinx4, this is not a post for the faint of heart. So skip the post if you feel dizzy.

Some Fun Reading Before You Move On

Before you move on, here are three references which I found highly useful to understand Baum-Welch in speech recognition. They are

  1. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Chapter 6. "Theory and Implementation of Hidden Markov Model." p.343 and p.369.    Comments: In general, the whole of Chapter 6 is essential for understanding HMM-based speech recognition. There is also a full derivation of the re-estimation formulae. Unfortunately, it only gives the formula without proof for the most important case, in which the observation probability is expressed as a Gaussian Mixture Model (GMM).
  2. X. D. Huang, A. Acero and H. W. Hon, Spoken Language Processing.  Chapter 8. "Hidden Markov Models" Comments: written by one of the authors of Sphinx 2, Xuedong Huang, the book is a very good review of spoken language systems. Chapter 8 in particular has detailed proofs of all the re-estimation algorithms. If you want to choose one book to buy in speech recognition, this is the one. The only thing I would say is that the typeface of the Greek letters is kind of ugly.
  3. X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition. Chapter 5, 6, 7. Comments: again by Xuedong Huang. I think these are the most detailed derivations I have ever seen on continuous HMMs in books. (There might be good papers I don't know of.) Related to Sphinx, it has a chapter on semi-continuous HMMs (SCHMM) as well.

bw also features rather nice code comments. My understanding is that it was mostly written by Eric Thayer, who put great effort into pulling multiple fragmented codebases together to form the embryo of today's SphinxTrain.

Baum-Welch algorithm in Theory

Now that you have read the references, at a very high level, what does a Baum-Welch estimation program do? To summarize, we can think of it this way:

* For each training utterance

  1. Build an HMM-network to represent it. 
  2. Run Forward Algorithm
  3. Run Backward Algorithm
  4. From the Forward/Backward results, calculate the statistics (or counts, or posterior scores, depending on what you call them).

* After we run through all utterances, estimate the parameters (means, variances, transition probability etc....) from the statistics.

Sounds simple?  I actually skipped a lot of details here but this is the big picture.
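
To make the outline above concrete, here is a minimal sketch in Python for a toy discrete-observation HMM. It is not how bw is written: step 1 is skipped (the HMM is assumed to be given directly as pi, A and B), there is no scaling or pruning, and the names forward, backward and baum_welch_step are made up for illustration only.

```python
import math


def forward(pi, A, B, obs):
    """alpha[t][i] = P(o_1..o_t, state at t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1}..o_T | state at t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, A, B, utterances):
    """One pass over all utterances, then one re-estimation.

    Returns the new parameters and the total log-likelihood of the
    data under the OLD parameters.
    """
    n, m = len(A), len(B[0])
    init_acc = [0.0] * n                        # accumulators ("counts")
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    total_loglik = 0.0
    for obs in utterances:
        alpha = forward(pi, A, B, obs)          # step 2
        beta = backward(A, B, obs)              # step 3
        lik = sum(alpha[-1])
        total_loglik += math.log(lik)
        for t in range(len(obs)):               # step 4: statistics
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / lik
                emit_acc[i][obs[t]] += gamma
                if t == 0:
                    init_acc[i] += gamma
                if t < len(obs) - 1:
                    for j in range(n):
                        trans_acc[i][j] += (alpha[t][i] * A[i][j]
                                            * B[j][obs[t + 1]]
                                            * beta[t + 1][j] / lik)
    # After all utterances: estimate parameters from the statistics.
    new_pi = [c / len(utterances) for c in init_acc]
    new_A = [[c / sum(row) for c in row] for row in trans_acc]
    new_B = [[c / sum(row) for c in row] for row in emit_acc]
    return new_pi, new_A, new_B, total_loglik
```

Calling baum_welch_step repeatedly on the same utterances should give a total log-likelihood that never goes down, which is the property the Q function buys you. The sketch assumes every utterance has at least two frames, otherwise the transition counts are all zero.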

Baum-Welch algorithm in Practice

There are several practical concerns when running Baum-Welch. These are particularly important when it is implemented for speech recognition.
  1. Scaling of alpha/beta scores: this is explained in detail in Rabiner's book (p.365-p.368). The gist is that when you calculate the alpha or beta scores, they can easily fall outside the numeric range of any machine. It turns out there is a beautiful way to avoid this problem.
  2. Multiple observation sequences, or streams: this is a little bit archaic, but there is still some research on having multiple streams of features for speech recognition (e.g. combining the lip signal and the speech signal).
  3. Speed: most implementations you see are not based on a full run of the forward or backward algorithm. To improve speed, most implementations use a beam to constrain the search.
  4. Different types of states: you can have HMM states which are emitting or non-emitting. How you handle them complicates the implementation.

You will see bw has taken care of a lot of these practical issues.   In my opinion, that is the reason why the whole program is a little bit bloated (5000 lines total).  
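
To give a feel for item 1, here is a small sketch of the scaling idea as I understand it from Rabiner: renormalize alpha at every frame and accumulate the logs of the scale factors. It uses a toy discrete HMM, and the function names are hypothetical; none of this is taken from bw.

```python
import math


def forward_unscaled(pi, A, B, obs):
    """Plain forward pass; returns P(obs), which underflows on long input."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def forward_scaled(pi, A, B, obs):
    """Forward pass with per-frame scaling; returns log P(obs)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)                  # 1 / c_t in Rabiner's notation
        alpha = [a / scale for a in alpha]  # scaled alpha sums to 1
        loglik += math.log(scale)
    return loglik
```

On a few frames both functions agree (after taking the exponential of the scaled result). On a few thousand frames the unscaled version collapses to 0.0 while the scaled one still returns a finite log-likelihood.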


Tracing of bw: High Level

Now we get to the code level. I will follow the version of bw from SphinxTrain1.07. I don't see many changes in 1.08 yet, so this tracing is very likely to be applicable for a while.

I will organize the tracing in this way. First I will go through the high-level flow of each function. Then I will describe some interesting places in the code by line numbers.

main() - src/programs/bw/main.c

This is the high level of main.c (Line 1903 to 1914)

 main ->
     main_initialize()
     if it is not mmie training
         main_reestimate()
     else
         main_reestimate_mmi()

main_initialize()
We will first go forward with main_initialize()

 main_initialize
     -> initialize the model inventory, which essentially means 4 things: means (mean), variances (var), transition matrices (tmat), mixture weights (mixw).
     -> a lexicon (or .... a dictionary)
     -> model definition
     -> feature vector type
     -> lda (lda matrix)
     -> cmn and agc
     -> svspec
     -> codebook definition (ts2cb)
     -> mllr for SAT type of training.

Interesting codes:

  • Line 359: extract the diagonal matrix if we specified a full one.
  • Line 380: precompute the Gaussian distribution. That usually means the constant, and it almost always makes the code faster.
  • Line 390: specify what type of reestimation.
  • Line 481: check point. I never use this one but it seems like something that allows the training to restart if the network fails.
  • Line 546 to 577: do MLLR transformation for models, for SAT type of training.
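
As an illustration of what "precomputing the constant" usually means for a diagonal-covariance Gaussian, here is a minimal Python sketch. It is not a transcription of what bw does at line 380; the function names and the layout of the precomputed values are my own assumptions.

```python
import math


def precompute_gaussian(var):
    """Precompute the parts of a diagonal Gaussian that do not depend on x.

    Returns (log_norm, inv_2var), where log_norm is the log of the
    normalization constant and inv_2var[d] = 1 / (2 * var[d]).
    """
    log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
    inv_2var = [1.0 / (2.0 * v) for v in var]
    return log_norm, inv_2var


def log_density(x, mean, log_norm, inv_2var):
    """log N(x; mean, diag(var)) using the precomputed terms."""
    return log_norm - sum((xi - mi) ** 2 * w
                          for xi, mi, w in zip(x, mean, inv_2var))
```

The point is that the logarithms and divisions are done once per Gaussian rather than once per frame, so the per-frame work is only subtractions, multiplications and additions.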

(Note to myself: got to understand why svspec was included in the code.)

main_reestimate()

Now let's go to main_reestimate. In a nutshell, this is where the looping occurs.

 -> for every utterance
     -> corpus_get_generic_featurevec (get feature vector (mfc))
     -> feat_s2mfc2feat_live (get the feature vector)
     -> corpus_get_sent (get the transcription)
     -> corpus_get_phseg (get the phoneme segmentation.)
     -> pdumpfn (open a dump file, this is more related to Dave's constrained Baum-Welch research)
     -> next_utt_states() /* create the state sequence network. One key function in bw. I will trace it in more detail. */
     -> if it is not in Viterbi mode
         -> baum_welch_update() /* i.e. Baum-Welch update */
        else
         -> viterbi() /* i.e. Viterbi update */

Interesting code:

  • Line 702: several parameters for the algorithm are initialized, including abeam, bbeam, spthres and maxuttlen.
    • abeam and bbeam are essentially the beam sizes which control the forward and backward algorithms.
    • maxuttlen: this controls how large an utterance will be read in. These days, I seldom see this parameter set to something other than 0 (i.e. no limit).
    • spthres: "State posterior probability floor for reestimation.  States below this are not counted".  Another parameter I seldom use......
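
To show what a beam does to a forward pass in general, here is a tiny Python sketch. I am assuming the beam is a ratio relative to the best score in the frame; check forward.c for what bw really does with abeam. The function name prune_alpha is hypothetical.

```python
def prune_alpha(alpha, beam):
    """Zero out states whose score is below beam * (best score in the frame).

    alpha: forward scores for one frame (linear domain).
    beam:  a ratio in (0, 1]; a smaller beam keeps more states.
    """
    threshold = max(alpha) * beam
    return [a if a >= threshold else 0.0 for a in alpha]
```

States pruned to zero contribute nothing to the next frame, so an implementation can skip them entirely. That is where the speed-up comes from, at the cost of a forward pass which is no longer exact.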

baum_welch_update()

 baum_welch_update()
     -> for each utterance
         forward() (forward.c) (<- This is where the forward algorithm is. Very complicated. 700 lines)
         if -outphsegdir is specified, dump a phoneme segmentation.
         backward_update() (backward.c. Do the backward algorithm and also update the accumulator)
         (<- This is even more complicated. 1400 lines)
     -> accum_global() (Global accumulation.)
     (<- Sort of long, but it's more trivial than forward and backward.)

Now this is the last function for today. If you look back at the section "Baum-Welch algorithm in Theory", you will notice how the procedure is mapped onto Sphinx. Several thoughts:

  1. One thing to notice is that forward, backward_update and accum_global need to work together. But you have to realize all of these are long, complicated functions. So, like next_utt_states, I will leave the discussion to another post.
  2. Another comment here: backward_update not only carries out the backward pass. It also updates the statistics.

Conclusion of this post

In this post, I went through a high-level description of the Baum-Welch algorithm as well as how the theory is mapped onto the C codebase. In my next post (will there be one?), I will focus on the low-level functions such as next_utt_states, forward, backward_update and accum_global.
Feel free to comment. 
Arthur

Where to start when tracing source code of a speech recognition toolkit?

Modern speech recognition software is a complicated piece of software. To understand it, you need to have some basic understanding of the principles of speech recognition, as well as some idea of the programming language being used.

By now, you may have heard a lot of people say they know about a speech recognizer. And by now, you probably realize that most of these people have absolutely no idea what's going on inside a recognizer. So if you are reading this blog message, you are probably telling yourself, "I might want to trace the codebase of some recognizer." Be it Sphinx, HTK, Julius, Kaldi or whatever codebase you are looking at.

For the above toolkits, I will say I only know Sphinx in detail, and probably a little bit about HTK's HVite. But I won't say the same for the others. In fact, even in Sphinx, I only know the Sphinx 3/SphinxTrain/sphinxbase triplet intimately. So just like you, I hope to learn more.

So this raises the question: how would you trace a speech recognition toolkit codebase? If you think it is easy, it is probably because you have worked in speech recognition for a while, and you probably shouldn't read this post.

Let's just use Sphinx as an example. There are hundreds of files in each component of Sphinx. So where should you start? A blunt approach would be reading the files one by one. That's not a smart way. So here is a suggestion for you: focus on the following four things,

  1. Viterbi algorithm
  2. Workflow of training
  3. Baum-Welch algorithm. 
  4. Estimation algorithms of language models. 
When you know where the Viterbi algorithm is, you will soon figure out how the feature vector is generated. In the same vein, if you know where the Baum-Welch algorithm is, you will probably know how the statistics are generated. If you know the workflow of the training, then you will understand how the model "evolves". If you know how the language model is estimated, then you will have an understanding of one of the most important heuristics of the search.
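
If you are not sure what to look for when hunting for item 1, here is a bare-bones log-domain Viterbi in Python for a toy discrete HMM. Real decoders add beams, lexicon trees and language model scores on top, but the core recursion (a max over predecessors plus a back-pointer) is what you want to find in the code. The function name and the argument layout are my own, not those of any toolkit.

```python
import math


def viterbi(log_pi, log_A, log_B, obs):
    """Return (best state path, its log score) for a discrete HMM."""
    n = len(log_pi)
    score = [log_pi[i] + log_B[i][obs[0]] for i in range(n)]
    backptrs = []
    for o in obs[1:]:
        new_score, ptr = [], []
        for j in range(n):
            # Best predecessor of state j at this frame.
            best_i = max(range(n), key=lambda i: score[i] + log_A[i][j])
            new_score.append(score[best_i] + log_A[best_i][j] + log_B[j][o])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, max(score)
```

Compare this with the forward algorithm: the only structural difference is that the sum over predecessors becomes a max, which is why the two are usually easy to tell apart when reading source code.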
Some of you may protest: how about the front-end? Isn't that important too? True, but not when you try to understand a codebase. For all practical purposes, a feature vector is just an N-dimensional vector. The waveform is just an NxT matrix. You can certainly do a lot of fancy things on this NxT matrix. But when you think of Viterbi and Baum-Welch, they probably just read the frames and then calculate Gaussian distributions. That's pretty much all you want to know about a front-end.
How about adaptation algorithms? Those I think are important. But they should probably come after understanding the major things in the code. No matter whether you are doing adaptation online or in speaker adaptive training, it is something on top of the Baum-Welch algorithm. Some implementations put adaptation inside the Baum-Welch executable. There is certainly nothing wrong with that. But it is still a kind of add-on.
How about the decoding API? Those are useful things to know, but they are more important when you just need to write an application. For example, in Sphinx4, you just need to know how to call the Recognizer class. In sphinx3, live_decode is what you need to know. But only understanding those won't give you much insight into how the decoder really works.
How about the data structures? Those are sort of important and should be understood when you try to understand a certain algorithm. In the case of languages such as Java and C++, you should probably take notes on any custom-made data structure, or on whether the designers call a specific data structure library, like Boost in C++.
I guess this pretty much sums it all up. Now let me get back to one non-trivial item on the list, which is the workflow of training. Many of you might think that recognition systems differ from each other because they have different decoders. Dead wrong! As I stress from time to time, they differ because they have different acoustic models and language models. That's why in many research labs, much effort is put into preserving the parameters and procedures of how models are trained. Much effort is also put into fine-tuning this procedure.
On this part, I have to say open source speech recognition still has a long, long way to go. For starters, there is not much sharing of recipes among speech hobbyists. What many try to do is to search for a good model. If you don't know how to train a model, you probably don't even know how to improve it for your own project.
Arthur

How to Ask Questions in the Sphinx Forum?

Many go to different open source toolkits to look for a ready-to-use speech recognizer, and seldom get what they want. Many feel disappointed and complain that developers of open source speech recognizers just can't catch up with commercial products. Few know why and few decide to write about the reason.

People in the field blame Hollywood for the lion's share of the problem. Indeed, many people believe ASR should work as it does in scenes from 2001: A Space Odyssey or Star Trek. We are far, far away from there. You may say SIRI is getting close. True. But when you look closer, SIRI doesn't always get what you say right; her strength lies in the very intelligent response system.

Unlike compilers such as GCC, speech recognition packages such as the CMU Sphinx project and HTK are toolkits. The mathematical models these toolkits provide were trained and fit to a certain group of samples, whereas applications such as Google Voice or SIRI gather 100 or even 1000 times more data when they train a model. This is the fundamental reason why you don't get the premium recognition rate you think you are entitled to.

Many people (me included) see that as a problem. Unfortunately, collecting clean transcribed data has always been a problem. Voxforge is the only attempt I am aware of to resolve the issue. They are still growing, but it will be a while before they can collect enough data to rival commercial applications.

* * *
Now what does that tell you when you ask questions in the CMU Sphinx forum or other speech recognition forums? For users who expect out-of-the-box super performance, I would say "Sorry, we are not there yet." In fact, speech recognition in general is probably not yet at the level of performance shown in the original Star Trek (that would require accent adaptation and very good noise cancellation, since the characters seem to be able to use the recognizer any time they like).

How about the many users who have a little bit (or much) programming background? I would say one important thing. As a programmer, you are probably used to looking at the code, understanding what it does, doing something cute and feeling awesome from time to time. You can't do that if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst. For example, when you feel the recognition rate is bad, what is your evidence? What is your data set? What is the size of your data set? If you have a set, can you share the set? If you don't have a numerical measure, have you at least used pencil and paper to mark down some results and some mistakes? Report them when you ask questions, then you will get useful answers back.
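
If you have nothing numerical yet, the usual measure is word error rate (WER). Here is a minimal Python sketch which computes it for one reference/hypothesis pair using word-level edit distance. The function name is my own, and it assumes the reference is not empty; the scoring tools shipped with toolkits are more careful about text normalization and alignment.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("one two three four", "one too three"))  # 0.5
```

A number like this over a few dozen utterances, together with the utterances themselves, tells the people on the forum far more than "the recognition is bad".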

If you look at programming forums, many ask questions with the source attached so that people can reproduce the problem easily. Some even go further and pinpoint the location of the problem. This is probably what you want to do if you get stuck.

* * *

Before I end this post, let's also bring up the issue of how ASR problems are usually solved. Like...... if you see performance is bad, what should you do?

Some speech recognition problems can be solved readily. For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly. If you see completely crappy speech recognition performance, then I would first check whether the front-end of the decoder matches exactly the front-end used to train the models.

For the rest, the strength of the model is really the issue. So most of your time should be spent on learning and understanding techniques of model improvement. For example, do you want to collect data and boost your acoustic model? Or if you know more about the domain, can you crawl some text on the web to help your language model? Those are the first ideas you should think about.

There is also an esoteric group of people in the world who ask a different question, "Can we use a different estimation algorithm to make the model better?" That is the basis of MMIE, MPE and MFE. If you find yourself mathematically proficient (perhaps you need to be very proficient......), then learning those techniques and implementing some of them would help boost performance as well. Techniques such as MMIE are just the basics; each site has its own specialized techniques, which you might want to know about.

Of course, you normally don't have to think so deep. Adding more data is usually the first step of ASR improvement. If you start to think about something advanced, and if you can, please try to put your implementation somewhere public so that everyone in the world can try it out. These are small things to do, but I believe if we keep on doing small things right, there will be a day we can make open source speech recognizers as good as the commercial ones.

Arthur

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

As I am back, I have started to visit all my old friends: all the open source speech recognition toolkits. The usual suspects are still around. There are also many new kids in town, so this is a good time to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long. It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post as you probably know something about it if you are reading this blog. Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post. This post serves only as an overview. Most of the toolkits listed here have rich documentation. You will find much useful information there.

HTK

I checked out the Cambridge HTK web page. Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as other new kids in town such as KALDI.
HTK has always been one of my top 3 speech recognition systems, since most of my graduate work was done using HTK. There are also many tricks you can do with the tools.
As a toolkit, I also find its software engineering practice admirable. For example, the command-line tools are built on common libraries underneath. (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.) When reading the source code, you sense much regularity and there doesn't seem to be much duplicated code.
The license disallows commercial use but that's okay. With ATK, which is released under a freer license, you can also include the decoder code in a commercial application.

Kaldi

The new kid in town. It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques. His recognizer attracts much interest as it has implemented features such as subspace GMM and FST-based speech recognition. Of all the toolkits, its features feel the most "modern".
I only have a little exposure to the toolkit (but am determined to learn more). Unlike Sphinx and HTK, it is written in C++ instead of C. As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*. In my setup, it took around 5G of disk space to compile. It probably means I haven't set it up correctly ...... or more likely, the executables are not stripped. That means working on Kaldi's source code actively would take some discretion in terms of hard disk space.
Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as the unifying knowledge source representation. In contrast, you may say most of the current open source speech recognizers use ad-hoc knowledge sources.

Are there any differences in terms of performance, you ask? In my opinion, probably not much if you are doing an apples-to-apples comparison. The strength of using WFST is that when you need to introduce new knowledge, in theory you don't have to hack the recognizer. You just need to write your knowledge as an FST and compose it with your knowledge network, then you are all set.
In reality, WFST-based technology still seems to have practical problems. As the vocabulary grows large and the knowledge sources get more complicated, the composed decoding WFST naturally outgrows the system memory. As a result, many sites propose different techniques to make the decoding algorithm work.
Those are downsides, but the appeal of the technique should not be overlooked. That's why Kaldi has become one of my favorite toolkits recently.

Julius

Julius is still around! And I am absolutely jubilant about it. Julius is a high-speed speech recognizer which can decode with a 60k vocabulary. One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed this idea from Julius when I first wrote it.
Julius is only a decoder, and the beauty of it is that it never claims to be more than that. Accompanying the software, there is a new Juliusbook, which is the guide on how to use the software. I think the documentation goes into greater depth than other similar documentation.
Julius comes with a set of Japanese models, not English. This might be one of the reasons why it is not as popular (or rather, as talked about) as HTK/Sphinx/Kaldi.
(Note at 20130320: I later learned that Julius also comes with an English model now. In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news. I am not surprised. HTK was used as the acoustic model trainer.)

So far......

I went through three of my favorite recognition toolkits.  In the next post, I will cover several other toolkits available. 
Arthur

Getting back to the project.....

After several years of not touching Sphinx (or for that matter, any serious coding), I started to have a conversation with myself, namely, the me who maintained Sphinx 3.X 6 years ago.

When I was working on the project, I was tasked to work on Sphinx 3. I have been an advocate of Sphinx 3 ever since. To tell the truth, I might have overdone it - there are many great recognizers in the world. Just look within the family: Sphinx 4, PocketSphinx and recently MultiSphinx by Dave are all great recognizers. (Dave has also fixed a lot of my bugs. So if you look into the source code, you will see places where he screamed, and I paraphrase, "Arthur, what are you talking about?")

Experience with many outside companies changed me.   I literally turned from a naive twenty something guy to a thirty something guy.   Still naive, but my world view has certainly changed.   In fact, for many purposes,  I found that learning all components of Sphinx is very beneficial.

Let's think of it this way: each of the projects from CMU Sphinx was meant to solve a practical problem in real life. For example, in Sphinx 4, not only do you have great out-of-the-box performance, you also get native code which can be incorporated into Java-based servers. This is a huge plus when you are thinking of writing a web application. And web applications will be around for a long time.

The same goes for PocketSphinx: it is meant to be a version of Sphinx which can be integrated into different embedded systems. I have yet to learn about MultiSphinx, but I always have faith in Dave and his ideas.

This makes me want to learn again. It's weird: once you open your mind, you will see doors everywhere. For me, my next targets would be learning Sphinx 4 and PocketSphinx. Both of them have great importance. Will I still work on Sphinx 3? Probably. X can always be bigger than 8. It's the programming reality which makes me change. As I think of it now, it's a good change, a very good change.

The Grand Janitor

The Grand Janitor After CMU Sphinx

I left the development of CMU Sphinx around 6 years ago. Geez. Talk about changes. During that time, I went to work for one startup and one defense contractor, and started numerous non-speech-related blogs.

I certainly had fun but felt adrift at the same time - both companies I worked with are extraordinary but their causes are not mine. As you know, life without a cause is a tough life.

And now I am inspecting Sphinx and open source speech recognition again. Wow, there are tons of changes. The awareness of the need for open source speech recognition has never been so acute and high. The performance of open source speech recognition still requires a lot of work, but it is no longer unthinkable to deploy an open source speech recognizer in a real application.

There are more resources for learning how to use a speech recognizer, thanks to dedicated Sphinx developers such as David Huggins-Daines and Nickolay Shmyrev. Many more people are learning how to properly use Sphinx and there is more documentation around.

There are also more resources for building a speech recognizer. One notable effort is Voxforge, led by Ken McClean, which is dedicated to accumulating clean and transcribed data over time. Though I don't know how large it is, I admire Ken's dedication. Someone should have started such a project a long, long time ago. Now that it is started, there is a chance that open source data will be an important source of speech data in the future.

In the last 6 years, I could only act as a bystander of Sphinx development. I changed jobs again recently and will work with a company which is close to Sphinx. I don't know how much *real* work I will do. But I am glad that Sphinx and I have crossed paths again. At the very least, I hope to contribute ideas to the community and help this great project grow.

The Grand Janitor