Development of Sphinx 3.X (X = 6 to 8) and Its Ramifications

One of the things I did back in my Sphinx days was the so-called "Great Refactoring" of Sphinx 3, SphinxTrain and sphinxbase. It was started by me but mostly taken up by Dave (in a disgruntled manner 🙂 ). I am writing this article to reflect on the whole process and ask whether I did the right thing.

The background is like this: as you know, the CMU Sphinx project has many recognizers: Sphinx2, 3, 4, PocketSphinx and MultiSphinx. It's easy to understand why that happened in the first place. CMU is a university and understandably has many different types of projects. In essence, when someone thinks of a good new idea, they will simply implement a recognizer. The by-product would be a PhD thesis or some kind of project report.

There is nothing wrong with that. Think of the pain of understanding and changing a recognizer which has 10-30 thousand lines of code, and you will know that it is not for the faint of heart. Many of the original programmers of the recognizers also had practical reasons to ignore code reusability: many of them had deadlines to meet. So I always feel empathy towards them.

Of course, on the other side of the coin, having many recognizers gives users a mild amount of pain. Just comparing 3.0 and 3.3, the command-line interface had changed (e.g. -meanfn became -mean). So when people need to interface with the code, it takes some understanding. The bigger problem is this: can you expect a feature that appears in one of the decoders to appear in another? This kind of inconsistency is very hard to explain to normal users.

So here came the first change, in 3.5, around 6-7 years ago: I decided to merge 3.0's series of tools and recognizer with 3.3, the fast decoder. I have to say, the decision was mainly driven by youthful naivete and year-long insomnia ( 🙂 ). There was also frustration from users which drove me to make those changes. In 3.5, the main thing I did was just to "port" the tools from the old 3.0, such as allphone, astar and align, to 3.x. There were some command-line interface changes. So far, all was cool.

Then came 3.6. At this point, I started to realize that a lot of underlying functions and libraries were duplicated. For example, we had multiple GMM computation routines, but you couldn't use them in all the tools which call GMM computation. For instance, allphone in 3.5 used GMM computation, but you couldn't expect any fast GMM computation from 3.4 to be usable in allphone, simply because the library wasn't shared.

So what did the young and naive me think? Let's try to write a single architecture to incorporate all these different things! (!!!!) Now... this is where I think things went wrong.

Let me explain a little bit more. There is a legitimate reason why the original programmer (Ravi) decided to split the tools into multiple parts and let the code be duplicated: the issues in align are not necessarily the issues in decode. If the programmer of align needs to consider the issues of decode, then it will take a long time to really get any programming done.

This happened to be the case with Sphinx 3.X. Now, for the development of Sphinx 3.X, there was another undesirable factor: I decided to leave. I simply couldn't overcome the economic force at the time; a startup company was willing to hire me.

To complicate matters, we *also* decided to factor out the common parts between SphinxTrain and sphinx3 to avoid code duplication between the two. Again, it was driven by a legitimate concern: the fact that there were two feature extraction routines in two packages constantly made users ask themselves whether the front-ends were matched.

All of these, except my leaving, were good things, but they entailed coding time. The end effect was that the effort became too big and too time-consuming. 3.6 took me around 1 year to write and release. I put out an official release around mid-2006, but there were still too many issues in the program. Later, for 3.8, Dave took over and really fixed many bugs. So I always think it was Dave who brought Sphinx 3.X to its current stable form.

To the credit of the guys in the team, they really bashed me: Evandro, being circumspect and consistent, always asked whether it was a good idea in the first place. Ravi, always the wise man, brought up the issues of merging the code. And of course, there is Dave, who deserves most of the credit for fixing a lot of nasty bugs.

So, in fact, it is really I who should be blamed in the process. I guess I am finally mature enough to apologize to everyone.

So you may wonder why I am saying all of this. Oh well, first of all, that's because I am going to put work into the recognizers again. Not just Sphinx 3, but all the other recognizers. So my first hope is that I don't repeat my past mistakes.

Now, given that the code has been iterated on over the last 6 years, the benefit of merging the code in Sphinx 3 is really starting to show. People can do a lot more things than in the past. Is it good enough? I don't think so. Sphinx 3 has a lot of potential but it's very misunderstood. In a nutshell, I need to put more work into it in the future.

The Grand Janitor

Being a programmer in your 20s and 30s

It's funny how a person changes. I always thought my 20s were my best time. They sort of were. Generally, that is the time when you are energetic, can burn as much as you can, and naively think that life and relationships can last forever. Also, you unconditionally trust other people.

Things turn when you are in your 30s: you start to realize that your skill and your capacity for growth have limits. In exchange, you grow wiser. In my case, I found my reads on people are much better; I started to go behind other people's words and try to understand their intentions. I started to treasure genuine friendship and to protest contrived politeness and faked honesty.

I also started to know when is the best time to be quiet and when is the best time to give a comeback. The former is important because if you are the only one who shines, your team will perpetually have the capability of one person, who is you.

The latter is also important because if you are always quiet, there are people who will step on your toes harder and harder. They will think you are weak and can be bullied. In real life, as in high school, bullies love to bully the weak. Making sure they have a hard time doing so is a very important life skill.

I will never go back to the time when I could hack a program for 20 hours, sleep, and then hack it again for another 20 hours. Do I regret it? Probably not. In exchange, I learned that sometimes you can solve a 20-hour problem in 2 hours, and you can still sleep and make a living. For that matter, it seems to be a better deal. 🙂

The Grand Janitor

Start to look at the repository tree

Programming as a profession is a strange one. If you are a doctor, you can usually carry your knowledge and skills from one place to another, provided that you have exactly the same tools. If you are a programmer, your speed and skill are partially determined by the tools you build in-house for a particular place. So for example, I am not supposed to use any tool I built when I worked at the small video-advertising start-up. Even if I could do something in 1 second back then, once I change jobs I need to start over and rebuild the tool. We are probably talking about days to rebuild the tool and weeks to refine it again.

There is one exception: if you have worked in open source, much of your code is stored in a public place. Even when you have left your job for a long time, it is legit for you to use it again. You don't have to solve the same problem again and again. This is the beauty of open source and I have benefited greatly from it personally.

As I start to regain my muscles in Sphinx, I notice that there have been many changes in the last 6 years. Just look at the top level of Subversion:
File                      Rev.   Age        Author         Last log entry
CLP/                      10079  23 months  dhdfu          Finally add an -F argument to use the full path in the control file as the label…
PocketSphinxAndroidDemo/  11117  9 months   nshmyrev       Wrapper for nbest
SimpleLM/                 22     12 years   rickyhoughton  Initial revision
Speech-Recognizer-SPX/    8933   3 years    nshmyrev       Update module to recent pocketsphinx API
SphinxTrain/              11350  9 days     nshmyrev       Extract warped features during 000 stage if VTLN is enabled. See for detailsht
archive_s3/               7289   4 years    egouvea        Fixed error message in decoder script reporting failure in bw, and made result d…
cmuclmtk/                 11035  10 months  nshmyrev       Fixes bug in wngram2idngram and adds a test for it
cmudict/                  11348  3 weeks    air            cleaned up documentation and code (a bit) recompiled the dict
gst-sphinx/               7848   4 years    dhdfu          Support changing language models at runtime (maybe)
htk2s3conv/               11336  6 weeks    nshmyrev       Adds warning about different number of mixtures
jsgfparser/               7230   4 years    dhdfu          Fix the main program to output the only public rule if no rule is specified, and…
logios/                   11339  4 weeks    tkharris       remove duplicated code
misc_scripts/             10147  22 months  dhdfu          handle zero references
multisphinx/              10945  12 months  dhdfu          clean up better and introduce vocabulary maps
pocketsphinx/             11351  8 days     nshmyrev       Updated lat2dot script. I need to move it to the other location though
pocketsphinx-extra/       9972   2 years    dhdfu          add sc models with mixture_weights and mdef.txt files
scons/                    5868   5 years    egouvea        updated the scons support to reflect that plugin.jar is now part of the package
share/                    5532   6 years    egouvea        Setting dsp and dsw files to have have windows EOL regardless where it's downloa…
sphinx2/                  8767   3 years    egouvea        Updated the sphinx-2 MS files to MS .NET, consistent with the other packages, an…
sphinx3/                  11329  2 months   nshmyrev       Patch to solve memory issues in python module. See for detailshttps://bugzilla
sphinx4/                  11344  3 weeks    nshmyrev       Properly sets logger for AudioFileDataSource. Thanks to Bandele Ola.
sphinx_fsttools/          10791  14 months  nshmyrev       Some bit in AM to FST conversion
sphinxbase/               11346  3 weeks    nshmyrev       Properly select buffer size when using audioresample. Thanks to balkce See fo…
tools/                    9009   3 years    nshmyrev       Updated to the latest release of sphinx4
web/                      10249  21 months  nshmyrev       There is no sphinx3 development anymore

How exciting is that? There were only 6 to 7 top-level directories 7 years ago!

From now on, I will start to post more notes on the different tools in the repository.

The Grand Janitor

Getting back to the project.....

After several years of not touching Sphinx (or, for that matter, any serious coding), I started to have a conversation with myself, namely, the me who maintained Sphinx 3.X 6 years ago.

When I was working on the project, I was tasked with working on Sphinx 3. I have been an advocate of Sphinx 3 ever since. To tell the truth, I might have overdone it: there are many great recognizers in the world. Just look within the family: Sphinx 4, PocketSphinx and, recently, MultiSphinx by Dave are all great recognizers. (Dave has also fixed a lot of my bugs. So if you look into the source code, you will see places where he screamed, or to paraphrase, "Arthur, what are you talking about?")

Experience with many outside companies changed me. I literally turned from a naive twenty-something guy into a thirty-something guy. Still naive, but my world view has certainly changed. In fact, for many purposes, I found that learning all the components of Sphinx is very beneficial.

Let's think of it this way: each of the projects from CMU Sphinx was meant to solve a practical problem in real life. For example, in Sphinx 4, not only do you have great out-of-the-box performance, you also get native code which can be incorporated into Java-based servers. This is a huge plus when you are thinking of writing a web application. And web applications will be around for a long time.

The same goes for PocketSphinx: it is meant to be a version of Sphinx which can be integrated into different embedded systems. I have yet to learn about MultiSphinx but I always have faith in Dave and his ideas.

This makes me want to learn again. It's weird: once you open your mind, you see doors everywhere. For me, my next targets are learning Sphinx 4 and PocketSphinx. Both of them are of great importance. Will I still work on Sphinx 3? Probably. X can always be bigger than 8. It's the programming reality which made me change. As I see it now, it's a good change, a very good change.

The Grand Janitor

The Grand Janitor After CMU Sphinx

I have been away from the development of CMU Sphinx for around 6 years. Geez. Talk about changes. During that time, I went to work for one startup and one defense contractor, and started numerous non-speech-related blogs.

I certainly had fun but felt adrift at the same time: both companies I worked for are extraordinary but their causes are not mine. As you know, life without a cause is a tough life.

And now I am inspecting Sphinx and open source speech recognition again. Wow, there are tons of changes. The awareness of the need for open source speech recognition has never been so acute. The performance of open source speech recognition still requires a lot of work, but it is no longer unthinkable to deploy an open source speech recognizer in a real application.

There are more resources for learning how to use a speech recognizer, thanks to dedicated Sphinx developers such as David Huggins-Daines and Nickolay Shmyrev. Many more people are learning how to properly use Sphinx and there is more documentation around.

There are also more resources for building a speech recognizer. One notable effort is VoxForge, led by Ken MacLean, which is dedicated to accumulating clean, transcribed data over time. Though I don't know how large it is, I admire Ken's dedication. Someone should have started such a project a long, long time ago. Now that it has started, there is a chance that open source data will be an important source of speech data in the future.

In the last 6 years, I could only act as a bystander to Sphinx development. I changed jobs again recently and will work with a company which is close to Sphinx. I don't know how much *real* work I will do. But I am glad that Sphinx and I are crossing paths again. At the very least, I hope to contribute ideas to the community and help this great project grow.

The Grand Janitor

I am back

Hi Guys,
     I stopped using this blog for 3 years and now I have decided to reclaim it. My life as the "Grand Janitor" of the Sphinx software is very memorable for me. It was unfortunate that I stopped the blog and only wrote on-line in other venues.

     I will start to blog more about speech recognition and natural language processing. This is probably the time for me to read up again. My other blog, Random Thought of Arthur Chan, will be solely for my thoughts on other random things in the world.

     In any case, it's good to meet all of you again.  We'll have fun.

The Grand Janitor

Random Thought: Cloud


When I was in college in Hong Kong, I loved to stare at the blue sky and just watch pieces of cloud floating from my left to my right. There was much open space in the University. My favorite thing to do was to skip classes and watch some clouds.

To many of my friends, that was a ridiculous habit, though most of them saw it as part of my little eccentricities in my little unsung college career.

In other words, I have done worse. 🙂 So they were not truly surprised, and I was not that disappointed by their misunderstanding of clouds.

My true disappointment came when I tried to share this interesting hobby with a mathematically-oriented friend. This guy is genuinely smart. In terms of math, I think he is about 5 years ahead of me. So I thought he would understand.

So I told him my true intention in watching clouds: I would like to predict the weather based on observing the clouds. That, to me, is a totally reasonable application of mathematics. This was his response:

"You read "Wind and Cloud" too much."

("Wind and Cloud" is a popular martial arts comic book in Hong Kong. It's about two martial arts experts, "Wind" and "Cloud", and their adventures in China.)

Many people have asked me why I chose to live in the US instead of Hong Kong, or even Greater China. This story is probably an example of why.

In Hong Kong (or probably Greater China), it is difficult for students to imagine that advanced mathematics could have anything to do with complex subjects such as meteorology at all. Also, there is a big gap between the expert knowledge of a certain field and the general public. So even if you have a technical background and you are smart enough to learn, you could still be ignorant of other fields.

Of course, an even deeper problem is that imagination and creativity are not emphasized in technical subjects such as science and mathematics. In the secondary school curriculum, these subjects were usually not taught in a way that inspires students to discover mathematics themselves. This explains the behavior of my smart friend.

There are social consequences of this: students who grow up like this will probably be unable to appreciate interesting thoughts from the young. That is to say, scientific and technical workers are not truly appreciated. This is compounded by the general money-loving attitude in Hong Kong. You will not be surprised that science and technology are tough to develop there.

We cannot say education in the States is perfect; there are tons of holes and problems in it as well. But perhaps because Americans are more adventurous by nature, they always see possibilities. That's why if you asked a smart student in the U.S. the same question, you would probably get an account of the General Circulation Model, how the basic equations are written, and how the Navier-Stokes equations can be used in this problem. (If we digressed, we would chat about how the Navier-Stokes equations are the subject of one of the 7 Millennium Prize Problems.)

I don't resent my friend's comment. What I saw was that a smart person like him was wasted in the system. How many more of these situations have happened in the past? I have no idea. What I know is that this is the true impediment to producing good scientific and technical workers.

Statistically Insignificant Me

This is slightly related to my last post. It concerns the interesting issue of whether we should share the bookshelf in the first place.

Why is it an issue? Well, privacy. Suppose someone is malicious and tries to figure you out. The best way is to gather all the information about you and work against you.

Another concern of mine is rather interesting and absolutely speculative: what if the information I read affects my thoughts, and what if people could reconstruct them just from the information I read? That would open up a lot of interesting applications, e.g. we might be able to better predict what a person will do.

Just as in other time series problems such as speech recognition and quantitative analysis, a human life could simply be defined by a series of timed events. Some (I forget the quote) believe that one human life could be stored on a hard disk, and some have started to collect human lives to see whether they can be modeled.

Information about what you read could tell a lot about who you are. Do you read Arthur C. Clarke? Do you read Jane Austen? Do you read Stephen King? Do you read Lora Roberts? From that information, one could build a machine learner to map back to who you are and how you make decisions. We might just call this a kind of personality modeling.

It seems to me these are entirely possible from the standpoint of what we know. Yet I still decided to share my bookshelf. Why?

Well, this was a crystal-clear moment for me (and perhaps for you as well) which helped me make a decision. Very simply, *I* am statistically insignificant.

If you happen to come to this web page, the only reason you come is because you are connected to me. How likely is that to happen?

I know about 150 persons in my life. The world has about 6 billion. So that simply means the chance of me being discovered is around 2.5 x 10^-8 (150 divided by 6 billion). It is already pretty low.

Now, when other people know me and recommend me to someone else, this probability will be boosted because 1) my PageRank will increase, 2) people who follow my links deep enough will eventually discover my bookshelves.

Yet if I try to stay low-profile (say, not doing SEO, not recommending friends to go to my page), then it is reasonable to expect the factor mentioned to be smaller than 1.

Further, 2.5 x 10^-8 is an upper bound as an estimate because
1, Not all my friends are interested in me (discounting factor: 0.6, a conservative one; the actual number is probably higher but I just don't want to face it. 😉 )
2, My friends who are interested in me might not follow my links (discounting factor: 0.01)

So we are talking about an event with probability as low as 10^-9 or 10^-10 here (2.5 x 10^-8 x 0.6 x 0.01 = 1.5 x 10^-10). That seems to me close to a cheap cryptographic algorithm.
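
For anyone who wants to check the back-of-the-envelope numbers, here is a minimal sketch in Python that takes the figures above at face value (150 acquaintances, 6 billion people, discounting factors of 0.6 and 0.01). The variable names are made up for illustration, and multiplying the two discounts together assumes they are independent, which is a simplification.

```python
# Back-of-the-envelope estimate using the figures quoted in the post.
# All names here are illustrative; nothing comes from a library.
acquaintances = 150        # people the author says he knows
world_population = 6e9     # "about 6 billion"

# Chance that a randomly chosen person is connected to the author.
p_connected = acquaintances / world_population

interested = 0.6           # discount: friends who are actually interested
follows_links = 0.01       # discount: interested friends who follow the links

# Assumption: the two discounts are independent, so they simply multiply.
p_discovered = p_connected * interested * follows_links

print(f"p_connected  = {p_connected:.1e}")
print(f"p_discovered = {p_discovered:.1e}")
```

The raw ratio comes out at about 2.5 x 10^-8, and with both discounts applied the estimate drops to about 1.5 x 10^-10, which sits in the 10^-9 to 10^-10 range mentioned above.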

But notice here, my security does not come from hiding or cryptography. My security merely comes from my statistical insignificance. In plain English, I am very open but no one cares. And I am still a happy treebear. 😉

That's why you see my bookshelf. Long story for a simple decision. If you happen to read this, I hope you enjoy it.

-a

Visual Bookshelves

I love to read and like to write reviews for every book I read. None of them will change the world but I still love to do it. That's why, by definition, I'm a bookworm. I don't even feel shy about it. 😉

I went quite far: I tried to record every book I read and started to put them in a blog called "ContentGeek". Luckily, I hadn't gone very far, because once I discovered Visual Bookshelves, there was no need for me to do it at all.

Visual Bookshelves allows users to look up a book from Amazon, add comments and store it in a database. It also shows the covers of the books. What more could I want?

So anyway, this is the link to my visual bookshelves:

http://www.cs.cmu.edu/~archan/personal/bookshelf.html

Enjoy.

-a