All posts by grandjanitor


For a period of time, getting up was a daunting thing for me. You see... computers used to be a tool that let me realize myself. I liked to work and play with one. It was not a job.

Since when did that change for me? It was when I started to think of a computer solely as a tool for making money. That's how many people in the field think. Programming is no longer a pursuit of skill. It is a way to get a higher salary, win programming competitions and have bragging rights at the lunch table. Knowledge of speech recognition? It is not for solving one of the biggest problems in human history. It is for winning contracts from defense, beating other sites and, again, bragging to your esteemed colleagues. These things sicken me.
In my view, it is fine to think about money. In fact, everyone should take care of their own personal finances and have a basic understanding of economics... BUT... that doesn't mean everything has to be driven solely by money.
Rather, everyone should have a passion which allows them to wake up every day, not daunted by the workload of the day, but thinking "Wow, there are 10 cool things I want to do. Which should I work on today?" and feeling excited about life.

Readings at Dec 18, 2012

From time to time, I will put interesting technology readings on my blog. Enjoy.

  1. The value of typing code: by John Cook. After all these years, I have to concur that code I didn't type is not code that I grok.
  2. The Founder's Dilemma: recommended by Joel Spolsky. It sounds like an interesting book to check out, as I am sick of overly qualitative statements in the startup world.
  3. Tutorial on Python NLTK: by Sujit Pal. Python NLTK is something I have wanted to check out for a long time.
  4. Pure Virtual Destructor in C++: by Eli Bendersky.
  5. Dumping A C++ Object Memory Layout With Clang: by Eli Bendersky.

How to Ask Questions in the Sphinx Forum?

Many people go to the different open source toolkits looking for a ready-to-use speech recognizer, and seldom get what they want. Many feel disappointed and curse that the developers of open source speech recognizers just can't catch up with commercial products. Few know why, and fewer decide to write about the reason.

People in the field blame Hollywood for the lion's share of the problem. Indeed, many people believe ASR should work like the scenes in 2001: A Space Odyssey or Star Trek. We are far, far away from that. You may say SIRI is getting close. True. But when you look closer, SIRI doesn't always get what you say right; her strength lies in a very intelligent response system.

Unlike compilers such as GCC, speech recognition packages such as CMU Sphinx and HTK are toolkits. The mathematical models these toolkits provide were trained to fit a certain group of samples. Applications such as Google Voice or SIRI, on the other hand, gather 100 or even 1000 times more data when they train a model. This is the fundamental reason why you don't get the premium recognition rate you think you are entitled to.

Many people (me included) see that as a problem. Unfortunately, collecting cleanly transcribed data has always been difficult. Voxforge is the only attempt I am aware of to resolve the issue. They are still growing, but it will be a while before they can collect enough data to rival commercial applications.

* * *
Now what does that tell you when you ask questions in the CMU Sphinx forum or other speech recognition forums? For users who expect out-of-the-box super performance, I would say, "Sorry, we are not there yet." In fact, speech recognition in general is probably not yet at the level of performance shown in the original Star Trek (that would require accent adaptation and very good noise cancellation, since the characters seem to be able to use the recognizer any time they like).

How about the many users who have a little (or much) programming background? I would say one important thing. As a programmer, you have probably gotten used to looking at code, understanding what it does, doing something cute and feeling awesome from time to time. You can't do that if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst. For example, when you feel the recognition rate is bad, what is your evidence? What is your data set? What is its size? If you have a set, can you share it? If you don't have a numerical measure, have you at least used pencil and paper to mark down some results and some mistakes? Report them when you ask questions, and you will get useful answers back.
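To make "numerical measure" concrete: the standard metric is word error rate (WER), the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. Here is a minimal sketch in Python (my own illustration, not code from any toolkit; real scoring tools also report the alignment):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as Levenshtein distance over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "me" was deleted, "weather" was substituted: 2 errors / 4 words = 0.5
print(word_error_rate("show me the weather", "show the whether"))  # 0.5
```

Saying "I measured 35% WER on a 50-utterance test set" is exactly the kind of statement that gets you a useful answer in a forum.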

If you look at programming forums, many people ask questions with the source code attached so that others can reproduce the problem easily. Some go even further and pinpoint the location of the problem. This is probably what you want to do if you get stuck.

* * *

Before I end this post, let's also bring up the issue of how an ASR problem is usually solved. Like... if you see that performance is bad, what should you do?

Some speech recognition problems can be solved readily. For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly. If you see completely crappy speech recognition performance, then I would first check whether the front-end of the decoder matches exactly the front-end used to train the models.
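The digit-string symptom usually means the grammar only allows one digit per utterance. As an illustration (the grammar text below is my own sketch in JSGF, a grammar format the Sphinx decoders accept, not a file shipped with any release), compare a rule that accepts a single digit with one that loops:

```python
# The common mistake: the public rule matches exactly ONE digit,
# so the decoder stops after recognizing a single word.
single_digit = """
#JSGF V1.0;
grammar digits;
public <digit> = oh | zero | one | two | three | four
               | five | six | seven | eight | nine ;
"""

# The fix: make the public rule repeat the digit rule.
# JSGF uses '+' for one-or-more repetition.
digit_string = """
#JSGF V1.0;
grammar digits;
<digit> = oh | zero | one | two | three | four
        | five | six | seven | eight | nine ;
public <digits> = <digit>+ ;
"""
```

With the second grammar, the decoder is allowed to chain digits into a string instead of forcing an utterance boundary after each one.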

For the rest, the strength of the model is really the issue. So most of your time should be spent learning and understanding techniques of model improvement. For example, do you want to collect data and boost your acoustic model? Or, if you know more about the domain, can you crawl some text on the web to help your language model? Those are the first ideas you should think about.
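On the language model side, a first sanity check on crawled text is simply whether your domain's word sequences show up in it at all. A minimal n-gram counting sketch (no smoothing or vocabulary handling, and not any toolkit's actual LM trainer):

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Count n-grams in whitespace-tokenized text,
    padded with sentence-boundary markers."""
    words = ["<s>"] * (n - 1) + text.lower().split() + ["</s>"]
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngram_counts("please call john please call mary")
# A trigram your users actually say, but which never appears in the
# crawled text, would show up here with a zero count -- a hint that
# the crawl does not cover your domain.
print(counts[("please", "call", "john")])  # 1
```

Real language model toolkits build smoothed probabilities on top of exactly these kinds of counts, which is why coverage of the raw text matters so much.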

There is also an esoteric group of people in the world who ask a different question: "Can we use a different estimation algorithm to make the model better?" That is the basis of MMIE, MPE and MFE. If you find yourself mathematically proficient (perhaps you need to be very proficient...), then learning those techniques and implementing some of them would help boost performance as well. Those I mentioned, such as MMIE, are just the basics; each site has its own specialized techniques that you might want to know.

Of course, you normally don't have to think so deep. Adding more data is usually the first step of ASR improvement. If you do start on something advanced, and if you can, please try to put your implementation somewhere public so that everyone in the world can try it out. These are small things, but I believe that if we keep doing small things right, there will be a day when open source speech recognizers are as good as the commercial ones.


Landscape of Open Source Speech Recognition software at the end of 2012 (I)

Now that I am back, I am starting to visit all my old friends - all the open source speech recognition toolkits. The usual suspects are still around. There are also many new kids in town, so this is a good place to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long. It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post, as you probably know something about it if you are reading this blog. Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post. This post serves only as an overview. Most of the toolkits listed here have rich documentation. You will find much useful information there.


HTK

I checked out the Cambridge HTK web page. Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as new kids in town such as Kaldi.
HTK has always been one of my top 3 speech recognition systems, since most of my graduate work was done using HTK. There are also many tricks you can do with the tools.
As a toolkit, I also find its software engineering practice admirable. For example, the command-line tools are built on common libraries written beneath them. (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.) When reading the source code, you feel much regularity, and there doesn't seem to be much duplicated code.
The license disallows commercial use, but that's okay. With ATK, which is released under a freer license, you can include the decoder code in a commercial application.


Kaldi

Kaldi is the new kid in town. It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques. His recognizer attracts much interest, as it has implemented features such as subspace GMMs and an FST-based decoder. All of these features feel more "modern".
I have only a little exposure to the toolkit (but am determined to learn more). Unlike Sphinx and HTK, it is written in C++ instead of C. As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*. In my setup, it took around 5G of disk space to compile. That probably means I haven't set it up correctly... or, more likely, that the executables are not stripped. It also means that working actively on Kaldi's source code takes some discretion in terms of disk space.
Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as its unifying knowledge source representation. To contrast, you may say that most current open source speech recognizers use ad-hoc knowledge sources.

Are there any differences in terms of performance, you ask? In my opinion, probably not much if you are doing an apples-to-apples comparison. The strength of using WFSTs is that when you need to introduce new knowledge, in theory you don't have to hack the recognizer. You just write your knowledge as an FST and compose it with your knowledge network, and you are all set.
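Composition itself is a simple idea: wherever the output symbol of one transducer matches the input symbol of the next, the arcs chain together and the weights add (in the tropical semiring). A toy pure-Python sketch, using my own edge-list representation and ignoring epsilon transitions entirely (real systems use a library such as OpenFst, which handles all of that):

```python
def compose(t1, t2):
    """Compose two weighted transducers given as edge lists
    (src_state, input, output, dest_state, weight).
    States of the result are pairs; weights add (tropical semiring).
    Epsilon handling is omitted for simplicity."""
    edges = []
    for (p, a, b, p2, w1) in t1:
        for (q, b2, c, q2, w2) in t2:
            if b == b2:  # output of t1 must match input of t2
                edges.append(((p, q), a, c, (p2, q2), w1 + w2))
    return edges

# T1 rewrites "a" to "b"; T2 rewrites "b" to "c"; the composition
# rewrites "a" to "c" with the two weights summed.
t1 = [(0, "a", "b", 1, 0.5)]
t2 = [(0, "b", "c", 1, 1.0)]
print(compose(t1, t2))  # [((0, 0), 'a', 'c', (1, 1), 1.5)]
```

In a recognizer, the same operation chains grammar, lexicon and context-dependency transducers into one decoding network, which is exactly why new knowledge can be "just another FST to compose in".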
In reality, WFST-based technology still seems to have practical problems. As the vocabulary grows large and the knowledge sources get more complicated, the composed decoding WFST naturally outgrows system memory. As a result, many sites propose different techniques to make the decoding algorithm work.
Those are downsides, but the appeal of the technique should not be overlooked. That's why Kaldi has become one of my favorite toolkits recently.


Julius

Julius is still around! And I am absolutely jubilant about it. Julius is a high-speed speech recognizer which can decode with a 60k vocabulary. One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed that idea from Julius when I first wrote it.
Julius is only a decoder, and the beauty of it is that it never claims to be more than that. Accompanying the software there is a new Juliusbook, which is the guide on how to use it. I think the documentation goes into greater depth than other similar documentation.
Julius comes with a set of Japanese models, not English ones. This might be one of the reasons why it is not as popular (or as talked about) as HTK/Sphinx/Kaldi.
(Note at 20130320: I later learned that Julius also comes with an English model now. In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news. I am not surprised. HTK was used as the acoustic model trainer.)

So far......

I went through three of my favorite recognition toolkits. In the next post, I will cover several other available toolkits.

The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx. It is an intermittent effort because I wear several hats at Voci.

I find myself going back to Sphinx more and more often. Being more experienced, I am starting to approach the project again carefully: tracing code, taking notes and understanding what has been going on. It has been a humbling experience - speech recognition has changed, and Sphinx has improved more than I could imagine.
The life of maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences of my life. Unfortunately, not many of my friends know that. So Sphinx and I were pretty much disconnected for several years.
So what I plan to do is reconnect. One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.
Let's start small: I just restarted the RSS feeds. You may also see some cross links to my other two blogs: Cumulomaniac, a site on my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.
Both sites are in Chinese; I have been actively working on them and try to update weekly.
So why do I keep this blog, then? Obviously the reason is speech recognition. Though I am starting to realize that doing speech recognition involves much more than just writing a speech recognizer. So from now on, I will also post on other topics such as natural language processing, video processing and low-level programming.
This means it is a very niche blog. Can I keep it up at all? I don't know. As with my other blogs, I will try to write around 50 posts first and see if there is any momentum.

Self Criticism : Hieroglyph

When I was working on CMU Sphinx, I was an aggressive young guy who loved to start many projects (still am). So I started many projects, and not many of them were completed. I wasn't completely insane: what the project lacked at that point of its development was passion and momentum. Working on many things gave us a sense that we were moving forward.

One of the projects for which I feel I should be responsible is Hieroglyph. It was meant to be a complete set of documentation on how several Sphinx components work together. But when I finished the 3rd draft, my startup work kicked in. That's why what you see is only an incomplete form of the document.

Fast-forward 6 years, and it is unfortunate that the document is still the most comprehensive source on Sphinx if you want to understand the underlying structure/methods of the CMU Sphinx C-based executables. The current CMU Sphinx encompasses way more than I decided to cover. For example, the Java-based Sphinx4 has gained a large following. And pocketsphinx is pretty much the de-facto speech recognizer for embedded speech recognition.

If you have been following me (unlikely but possible), I have personally changed substantially. For example, my job experience taught me that Java is a very important language, and that having a recognizer in Java would significantly boost the project. I also feel that embedded speech recognition is probably the real future of our lives.

Back to Hieroglyph: suffice to say it is not yet a sufficient document. I hope I can go back to it and ask what I can do to make it better.


New Triplet is Released

I just learned this from the CMUSphinx main site: it sounds like a new triplet of sphinxbase and SphinxTrain has been released.

I took a look at the changes. Most of them work towards better reuse between SphinxTrain and sphinxbase. I think this is very encouraging.

There have been around 600-700 SVN updates since the last major triplet release. I think Nick and the SF guys are doing a great job on the toolkit.

As for training, one encouraging part is that there are efforts to improve the training procedure. I have always maintained that model training is the heart of speech recognition. A good model is the key to good speech recognition performance. And great performance is the key to a great user experience.

Is CMU Sphinx walking on the right path? I am still waiting to see, but I am increasingly optimistic.


(P.S. I have nothing to do with this release. Though I guess it's time to go back to actual open-source coding.)

CMU Sphinx Documentation

I was browsing the documentation section of the CMUSphinx site and was very impressed. Compared to my ad-hoc version of the documents back then, or the old robust group documents, it is a huge improvement.

What is challenging about developing documentation for speech recognition? I believe the toughest part is that some people still see speech recognition as a programming task. In real life, though, a speech recognition application should be viewed as a data analysis task.

Here is why: suppose you work on a normal programming task. Once you figure out the algorithm, your job is pretty much done.

In a speech app, though, that is just a tiny step towards a good system. For example, you might notice that your dictionary is not refined enough, such that some of the words are not recognized correctly. Or you find that something is wrong with your language model, such that certain trigrams never appear.

Those tasks, in terms of skill sets, require a person to stay in front of the Linux console until the Eureka moment comes: "Oh, that's what's wrong!" So the job of "Speech Scientist" usually requires knowledge of statistics, machine learning and, more generally, good analytic skills.

Basic Linux skills are also extremely important: for example, a senior researcher once showed me how he did many things solely with perl one-liners. As it turns out, when you can wield perl one-liners correctly, you can solve many text processing problems with one command! This saves you a lot of time writing throw-away scripts and lets you focus on analyzing why things are going wrong.

Back to good speech application documentation: one of the challenging parts is conveying this real-life workflow of a Speech Scientist to the open source community. Many of us learn (and strive to learn more...) this kind of skill the hard way: writing reports, papers and presentations, and being ready to get feedback from other people. You will also find yourself amazed by some brilliant insights and analyses. (There are stupid analyses too, but that's part of life...)

The Sphinx project has collectively gone a long way on this front. If you have time, check it out; I found much of the material very useful.

The Grand Janitor.

What should be our focus in Speech Recognition?

If you have worked in a business long enough, you start to understand better what types of work are important. As with many things in life, sometimes the answer is not trivial. For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder. For many, the decoder - the speech recognizer, or the "computer thing" which does all the magic of recognizing speech - is the core of the work.

Indeed, working on a decoder is loads of fun. If you are a fresh new programmer, it is also one of those experiences which will teach you a lot. Unlike thousands of small "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format and system issues. You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search. For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.
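The core recursion of a Viterbi search fits in a few lines; everything that makes a real decoder hard (beam pruning, lexical trees, cross-word triphones) is layered on top of it. A toy sketch over a two-state HMM, with made-up states and probabilities purely for illustration:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely state sequence for an observation sequence.
    Scores are kept in log space, as real decoders do."""
    # best[t][s] = (log prob of best path ending in state s at time t, backpointer)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            score, prev = max(
                (best[t - 1][p][0] + math.log(trans_p[p][s])
                 + math.log(emit_p[s][obs[t]]), p)
                for p in states)
            best[t][s] = (score, prev)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Toy model: "sil" tends to emit low energy, "speech" high energy.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# ['sil', 'speech', 'speech']
```

In a real recognizer, the "states" are HMM states of context-dependent phones and the observations are acoustic feature vectors, but the recursion is the same.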

Writing a speech recognizer is also very important when you need to deal with speed issues. You might want to fit a recognizer into your mobile phone or even just a chip. For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high-speed speech recognition (faster than 100xRT). All these system-related issues require understanding of the decoder itself.

This makes speech recognition an exciting field, similar to chess programming. Indeed, the two fields are very similar in terms of code development. Both require a deep understanding of search as a process. Both have eccentric figures popping up and disappearing again. There are more stories untold than told in both fields. Both are fascinating.

There is one thing in which speech recognition and chess programming are very different. This is a subtle point which even many savvy and resourceful programmers don't understand: how each of these machines derives its knowledge sources. In speech, you need a good model to do a decent job on your task. In chess, though, most programmers can proceed to write a chess player with the standard piece values. As a result, there is a process before anyone can use a speech recognizer: first training an acoustic model and a language model.

The same decoder, with different acoustic and language models, can give users perceptions ranging from a total trainwreck to a modern wonder, borderline magic. Those models are the true ingredients of our magic. Unlike magicians, though, we are never shy about these secret ingredients. They are just too subtle to discuss. For example, you won't go to a party and tell your friends that "using an ML estimate is not as good as using an MPFE estimate in speech recognition. It usually results in an absolute 10% performance gap." Those are not party talks. Those are talks for when you want to have no friends. 🙂

Both types of task require learning that differs from programming training. 10 years ago, those skills were generally carried by mathematicians, statisticians or people who specialized in machine learning. Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work. If you ask some high-minded researchers in the field, they will almost certainly think it is not interesting work. Yet in real life you can almost always learn something new and improve your systems based on it. Maybe I will talk about this more next time.

The Grand Janitor


Again, I feel rejuvenated. The last few months of experience have started to make me more unified, both as a person and as a technical person. When you start to work on something which draws on everything you know in life, you know that you are walking the right path.

Things are starting to look more and more interesting.

The Grand Janitor