OpenEars and RapidEars

There have been a couple of messages about open source speech recognition on iOS.  That certainly caught my eye.  For now, I believe one of the most well-known names is OpenEars.  It is built on top of PocketSphinx and it provides speech synthesis functionality as well.

It looks pretty promising.  So I will probably start to play with it as well.

I just added a link to OpenEars here in the Grand Janitor's Blog.   Btw, I should think of a better way to syndicate the different ASR-related sites.

Arthur

After the 50th Post, some vision for the Grand Janitor's Blog

I have been trying to gather some topic ideas for this blog.   On LinkedIn, I heard quite a bit of feedback on a discussion about decoding algorithms.    I have thought about this a bit.   Here are some sample topics which I will discuss:

  • How to set up Recognizer X on Platform Y?
  • What are the interesting properties of Program A in speech recognition package B?
  • What are the best practices when trying to use a certain model-training package?
  • Tuning results in a domain.
  • News: a certain package has a new version coming out! What is special about it?
  • Thoughts on speech recognition problems.
  • Thoughts on programming problems.
  • Algorithms/data structures for decoding/parameter estimation, like you suggested in this post.
  • What is an interesting development direction for a speech recognition package?
  • Others: software/algorithms in some related fields such as NLP and MT.

Of course, I will keep on chatting about some ongoing components of Sphinx which I find interesting:

  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. HTK-book-like documentation, i.e. Hieroglyphs.
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of the Sphinx software, such as adopting a WFST-based approach.

Unlike academic journals, I hope this will be more of an informal hub of ASR information.   I don't know if I can keep it up, but at least that's what I am going to try.

In any case, this blog will occupy much of my spare time next year.   Feel free to comment and give me suggestions.  And of course ....... Happy New Year!

Arthur

You got to have thick skin......

Linus Chews Up Kernel Maintainer for Introducing Userspace Bug.

http://developers.slashdot.org/story/12/12/29/018234/linus-chews-up-kernel-maintainer-for-introducing-userspace-bug

That's just the way it is.  Remember, there are always better, stronger, more famous, more powerful programmers working with you.  So criticism will come your way one day.

My take: as long as you don't work with them in person, just treat them as an icon on the screen.   😀

Arthur

Where to start when tracing source code of a speech recognition toolkit?

Modern speech recognition software is a complicated piece of software.  To understand it, you need some basic understanding of the principles of speech recognition, as well as some familiarity with the programming language being used.

By now, you may have heard a lot of people say they know about a speech recognizer.   And by now, you probably realize that most of these people have absolutely no idea what's going on inside a recognizer.   So if you are reading this blog post, you are probably telling yourself, "I might want to trace the codebase of some recognizer," be it Sphinx, HTK, Julius, Kaldi or whatever codebase you are looking at.

Of the above toolkits, I would say I only know Sphinx in detail, and probably a little bit about HTK's HVite.  I won't say the same for the others.  In fact, even in Sphinx, I am only intimately familiar with the Sphinx 3/SphinxTrain/sphinxbase triplet.   So just like you, I hope to learn more.

So this begs the question: how would you trace a speech recognition toolkit's codebase? If you think it is easy, it is probably because you have worked in speech recognition for a while, and you probably don't need to read this post.

Let's just use Sphinx as an example: there are hundreds of files in each component of Sphinx.   So where should you start?    A blunt approach would be to read each file one by one.   That's not a smart way to do it.   So here is a suggestion for you: focus on the following four things:

  1. Viterbi algorithm
  2. Workflow of training
  3. Baum-Welch algorithm. 
  4. Estimation algorithms for language models.

When you know where the Viterbi algorithm is, you will soon figure out how the feature vectors are generated.  In the same vein, if you know where the Baum-Welch algorithm is, you will probably know how the statistics are generated.   If you know the workflow of training, then you will understand how the model is "evolved".   If you know how the language model is estimated, then you will have an understanding of one of the most important heuristics of the search.
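
To make the first item concrete, here is a minimal, self-contained sketch of the Viterbi recursion over a discrete HMM.  It is in Python with toy, made-up probabilities rather than the C or Java you will find in the actual toolkits, and a real decoder buries the same loop under beam pruning, lexical trees and log-domain tricks, but the core dynamic program you are hunting for looks like this:

```python
def viterbi(observations, states, log_init, log_trans, log_emit):
    """Toy Viterbi: return the best state sequence and its log score.

    observations : list of observation symbols (feature frames, in a real decoder)
    states       : list of state ids
    log_init     : dict state -> log P(state at t=0)
    log_trans    : dict (prev, cur) -> log P(cur | prev)
    log_emit     : dict (state, obs) -> log P(obs | state)
    """
    # delta[t][s] = best log score of any path that ends in state s at time t
    delta = [{s: log_init[s] + log_emit[(s, observations[0])] for s in states}]
    backptr = [{}]

    for t in range(1, len(observations)):
        delta.append({})
        backptr.append({})
        for s in states:
            # Pick the best predecessor state for s at time t.
            best_prev = max(states, key=lambda p: delta[t - 1][p] + log_trans[(p, s)])
            delta[t][s] = (delta[t - 1][best_prev] + log_trans[(best_prev, s)]
                           + log_emit[(s, observations[t])])
            backptr[t][s] = best_prev

    # Backtrace from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]
```

If you run this on a two-state toy HMM with a handful of observations, you can check the backtrace by hand before going hunting for the real thing in the codebase.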

Some of you may protest: how about the front end? Isn't that important too?  True, but not when you are trying to understand a codebase.  For all practical purposes, a feature vector is just an N-dimensional vector, and a whole utterance of features is just an NxT matrix.   You can certainly do a lot of fancy things with this NxT matrix.   But from the point of view of Viterbi and Baum-Welch, they just read the frames and calculate Gaussian distributions.  That's pretty much all you need to know about a front end.
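
To see what "read the frames and calculate Gaussian distributions" boils down to, here is a small sketch of the per-frame log-likelihood of a diagonal-covariance Gaussian, which is the basic state score an HMM-based recognizer evaluates over and over.  The dimensions and NumPy usage are mine for illustration (e.g. 39-dimensional MFCCs plus deltas); the real code in a toolkit is heavily optimized but computes essentially the same quantity:

```python
import numpy as np

def diag_gaussian_loglik(frame, mean, var):
    """log N(frame; mean, diag(var)) for one N-dimensional feature frame."""
    # Constant term: -0.5 * (N*log(2*pi) + sum(log var))
    const = -0.5 * (frame.shape[0] * np.log(2.0 * np.pi) + np.sum(np.log(var)))
    # Quadratic term: -0.5 * sum((x - mu)^2 / var)
    return const - 0.5 * np.sum((frame - mean) ** 2 / var)

# A whole utterance of features is just an N x T matrix of such frames.
N, T = 39, 200                       # e.g. 39-dim MFCC + deltas, 200 frames (~2 s)
features = np.random.randn(N, T)     # stand-in for whatever the front-end produces
mean, var = np.zeros(N), np.ones(N)  # one toy Gaussian; a real model has thousands
scores = [diag_gaussian_loglik(features[:, t], mean, var) for t in range(T)]
```

From the decoder's point of view, that list of per-frame scores is all the front end ever amounts to.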

How about adaptation algorithms?  Those I think are important, but they should probably come after you understand the major things in the code.   Whether you are doing adaptation online or within speaker adaptive training, it is something built on top of the Baum-Welch algorithm.   Some implementations stick adaptation inside the Baum-Welch executable; there is certainly nothing wrong with that, but it is still a kind of add-on.

How about the decoding API?  That is a useful thing to know, but it matters more when you just need to write an application.  For example, in Sphinx 4 you just need to know how to call the Recognizer class; in Sphinx 3, live_decode is what you need to know.   But understanding those alone won't give you much insight into how the decoder really works.
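
Just to show how thin that application-facing layer usually is, here is a hypothetical sketch of the typical decoder lifecycle: initialize with models, start an utterance, feed audio, end the utterance, read back the hypothesis.  The class and method names below are invented for illustration only; the real names differ between Sphinx 4's Recognizer, Sphinx 3's live_decode and other toolkits, but the shape of the loop is the same:

```python
# Hypothetical decoder wrapper -- all names here are invented for illustration.
class ToyRecognizer:
    def __init__(self, acoustic_model, language_model, dictionary):
        # A real toolkit would load the models and allocate the search structures here.
        self.models = (acoustic_model, language_model, dictionary)
        self.frames = []

    def start_utterance(self):
        self.frames = []

    def process_audio(self, chunk):
        # A real decoder would run the front-end and extend the Viterbi search here.
        self.frames.append(chunk)

    def end_utterance(self):
        pass  # a real decoder would finalize the search and prepare the backtrace

    def hypothesis(self):
        return "<best word sequence would come from the backtrace>"

# Typical application code: the whole "decoding API" is just this loop.
rec = ToyRecognizer("hmm_dir", "lm_file", "dict_file")
rec.start_utterance()
for chunk in [b"\x00" * 3200] * 10:   # stand-in for audio buffers from a file or mic
    rec.process_audio(chunk)
rec.end_utterance()
print(rec.hypothesis())
```

Knowing this loop is enough to write an application, but as I said, it tells you almost nothing about what happens inside the search.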

How about the data structures?  Those are fairly important and should be understood when you try to understand a particular algorithm.   In languages such as Java and C++, you should probably take notes on any custom-made data structures, and on whether the designer calls out to specific data structure libraries, like Boost in C++.

I guess this pretty much sums it up.  Now let me get back to one non-trivial item on the list: the workflow of training.   Many of you might think that recognition systems differ from each other because they have different decoders.  Dead wrong!  As I have stressed from time to time, they differ because they have different acoustic models and language models.  That's why in many research labs much effort is put into preserving the parameters and procedures of how the models are trained, and much effort is also put into fine-tuning this procedure.
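
To give a feel for what such a procedure looks like, here is a schematic acoustic-model training "recipe".  The stage names and numbers are hypothetical, but the ordering roughly mirrors what HTK- or SphinxTrain-style recipes do; the point is that all of this is data, and changing any of these knobs gives you a different model even with an identical decoder:

```python
# Schematic training recipe -- stage names and parameter values are hypothetical.
RECIPE = [
    ("extract_features",     {"feature_type": "mfcc", "base_dims": 13, "deltas": True}),
    ("flat_initialize",      {"states_per_phone": 3}),
    ("train_ci_models",      {"baum_welch_iterations": 8}),
    ("train_cd_untied",      {"baum_welch_iterations": 8}),
    ("build_decision_trees", {"questions_file": "linguistic_questions.txt"}),
    ("tie_states",           {"num_tied_states": 4000}),
    ("train_cd_tied",        {"baum_welch_iterations": 8, "gaussians_per_state": 16}),
]

def run_recipe(recipe, corpus):
    """Run each stage in order; a real system checkpoints the model in between."""
    for stage, params in recipe:
        print(f"Running {stage} on {corpus} with {params}")

run_recipe(RECIPE, "my_corpus")
```

Preserving and fine-tuning a recipe like this, end to end, is exactly the kind of procedure those research labs go to great lengths to protect.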

On this front, I have to say open source speech recognition still has a long, long way to go.  For starters, there is not much sharing of recipes among speech hobbyists.   What many try to do is search for a good model.   But if you don't know how to train a model, you probably don't even know how to improve it for your own project.
Arthur