Category Archives: Sphinx

Apology, Updates and Misc.

There have been some questions on LinkedIn about the whereabouts of this blog. As you may have noticed, I haven't posted any updates for a while. I have been crazy busy with work at Voci (good!) and with many life challenges, just like everyone else. I am having a lot of fun programming, as I am working with two of my favorite languages, C and Python. Life is not bad at all.

My apologies to all readers, though; it can be tough to blog sometimes. Hopefully, this situation will change later this year.

A couple of worthwhile news items in ASR: Goldman Sachs won the trial in the Dragon lawsuit. There is also VB's piece on MS doubling the speed of their recognizer.

I don't know what to make of the lawsuit; I only feel a bit sad. Dragon has been the home of many elite speech programmers, developers and researchers. Many old-timers of speech were there. Most of them sigh about the whole L&H fiasco. If I were them, I would feel the same. In fact, once you know a bit of ASR history, you will notice that the fall of L&H gave rise to one you-know-its-name player of today. So in a way, the fate of two generations of ASR guys was altered.

As for the MS piece, we are following another trend these days, which is the emergence of DBN. Is it surprising? Probably not; it's rather easy to speed up neural network calculation. (Training is harder, but that's where DBN is strong compared to previous NN approaches.)

On Sphinx, I will point out one recent bug report contributed by Ricky Chan, which exposed a problem in bw's MMIE training. I have yet to try it, but I believe Nick has already incorporated the fix into the open-source code base.

Another item which Nick has been stressing lately is to use Python, instead of Perl, as the scripting language of SphinxTrain. I think that's a good trend. I like Perl and use one-liners and map/grep-type programs a lot. Generally, though, it's hard to find a concrete coding standard for Perl, whereas Python seems to be cleaner and naturally leads to OOP. This is an important issue: Perl programmers and Perl programming styles seem to be spawned from many different types of languages. The original (bad) C programmer would fondly use globals and write functions with 10 arguments. The original C++ programmer might expect language support for OOP but find that "it is just a hash". These style differences can make Perl training scripts hard to maintain.

That's why I like Python more. Even a very bad script seems to convert itself into a more maintainable one. There is also a good pathway for connecting Python and C. (Cython is probably the best.)

In any case, that's what I have this time.  I owe all of you many articles.  Let's see if I can write some in the near future.

Arthur

Me and CMU Sphinx

As I update this blog more frequently, I have noticed that more and more people are directed here. Naturally, there are many questions about some of my past work. For example, "Are you still answering questions in the CMUSphinx forum?", and general requests for certain tutorials. So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I don't expect to be a maintainer-at-large any more. Nick has proved himself to be the most awesome maintainer in our history. Under his stewardship, Sphinx has prospered in the last couple of years. That's what I hoped for and what we all hoped for.
So for that reason, you probably won't see me much in the forum answering questions. Rather, I will spend most of my time implementing, experimenting and getting some work done.
There are many things that ought to be done in Sphinx. Here is my top-5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. HTKbook-like documentation, i.e. Hieroglyphs.
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of Sphinx software, such as using WFST-based approach.
This is not a small undertaking, so I am planning to spend a lot of time relearning the software. Yes, you heard it right: learning the software. In general, I find myself very ignorant of a lot of the software details of Sphinx in 2012. There have been many changes. The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain. On PocketSphinx and Sphinx4, I need to learn a lot.
That is why in this blog you will see a lot of posts about my progress in learning a certain piece of speech recognition software. Some could be minute details. I share them because people can figure out a lot by going through my notes. From time to time, I will also pull these posts together to form a tutorial post.
Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot about programming, languages and other technology-related stuff. Part of it is that I am interested in many things. The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages. This might also attract a wider audience in the future.
In any case,  I hope I can keep on.  And hope you enjoy my articles!
Arthur

How to Ask Questions in the Sphinx Forum?

Many go to different open source toolkits to look for a ready-to-use speech recognizer, and seldom get what they want. Many feel disappointed and curse that developers of open source speech recognizers just can't catch up with commercial products. Few know why, and few decide to write about the reason.

People in the field blame Hollywood for the lion's share of the problem. Indeed, many people believe ASR should work as it does in scenes from 2001: A Space Odyssey or Star Trek. We are far, far away from there. You may say Siri is getting close. True. But when you look closer, Siri doesn't always get what you say right; her strength lies in the very intelligent response system.

Unlike compilers such as GCC, speech recognition packages such as the CMU Sphinx project and HTK are toolkits, not finished products. The mathematical models these toolkits provide were trained and fit to a certain group of samples, whereas applications such as Google Voice or Siri gather 100 or even 1000 times more data when they train a model. This is the fundamental reason why you don't get the premium recognition rate you think you are entitled to.

Many people (me included) see that as a problem. Unfortunately, collecting clean transcribed data has always been difficult. Voxforge is the only attempt I am aware of to resolve the issue. They are still growing, but it will be a while before they can collect enough data to rival commercial applications.

* * *
Now what does that tell you when you ask questions in the CMU Sphinx forum or other speech recognition forums? For users who expect out-of-the-box super performance, I would say "Sorry, we are not there yet." In fact, speech recognition in general has probably not yet reached the performance shown in the original Star Trek (that would require accent adaptation and very good noise cancellation, since the characters seem to be able to use the recognizer any time they like).

How about the many users who have a little (or a lot of) programming background? I would say one important thing. As a programmer, you are probably used to looking at the code, understanding what it does, doing something cute and feeling awesome from time to time. You can't work that way if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst. For example, when you feel the recognition rate is bad, what is your evidence? What is your data set? What is the size of your data set? If you have a set, can you share it? If you don't have a numerical measure, have you at least used pencil and paper to mark down some results and some mistakes? Report them when you ask questions, and you will get useful answers back.
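To make "numerical measure" concrete, here is a minimal sketch of the usual one, word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The function names and the two sample utterances are made up for illustration; this is not code from Sphinx, and real scoring tools also normalize case and punctuation, which this sketch does not.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words


# Hypothetical test set: (what was said, what the recognizer returned).
results = [
    ("one two three four", "one too three"),
    ("call home", "call home"),
]
print("WER: %.1f%% over %d utterances" % (100 * word_error_rate(results), len(results)))
```

Even a number like this from a few dozen utterances, together with the size of the set, tells a forum reader far more than "the recognition is bad".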

If you look at programming forums, many people ask questions with the source attached so that others can reproduce the problem easily. Some even go further and pinpoint the location of the problem. This is probably what you want to do if you get stuck.

* * *

Before I end this post, let's also bring up the issue of how an ASR problem is usually solved. If you see that performance is bad, what should you do?

Some speech recognition problems can be solved readily. For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly. If you see completely crappy speech recognition performance, then I would first check whether the front-end of the decoder matches exactly the front-end used to train the models.
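The front-end check can be done mechanically. The sketch below assumes both sets of front-end settings are available as text in a "-flag value" per line layout (Sphinx acoustic models usually ship a feat.params file in roughly this form); the helper names and all the values shown are invented for illustration, so treat it as a starting point rather than a Sphinx tool.

```python
def parse_params(text):
    """Parse lines such as '-lowerf 130' into a {flag: value} dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("-"):
            value = parts[1]
            try:
                value = float(value)  # so that '6800' and '6800.0' compare equal
            except ValueError:
                pass                  # keep non-numeric values as strings
            params[parts[0]] = value
    return params


def frontend_mismatches(train, decode):
    """Return {flag: (training value, decoding value)} for every flag that differs."""
    mismatches = {}
    for flag in sorted(set(train) | set(decode)):
        if train.get(flag) != decode.get(flag):
            mismatches[flag] = (train.get(flag), decode.get(flag))
    return mismatches


# Made-up settings: what the model was trained with vs. what the decoder uses.
TRAIN = """-lowerf 130
-upperf 6800
-nfilt 25
"""
DECODE = """-lowerf 130
-upperf 3500
-nfilt 25
-transform dct
"""

for flag, (trained, decoded) in frontend_mismatches(
        parse_params(TRAIN), parse_params(DECODE)).items():
    print("%s: trained with %r, decoding with %r" % (flag, trained, decoded))
```

A flag that appears on only one side shows up with None on the other; that means a default is in effect there, and the default still has to be looked up by hand before concluding the two front-ends differ.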

For the rest, the strength of the model is really the issue. So most of your time should be spent on learning and understanding techniques of model improvement. For example, do you want to collect data and boost your acoustic model? Or, if you know more about the domain, can you crawl some text on the web and help your language model? Those are the first ideas you should think about.

There is also an esoteric group of people in the world who ask a different question: "Can we use a different estimation algorithm to make the model better?" That is the basis of MMIE, MPE and MFE. If you find yourself mathematically proficient (perhaps you need to be very proficient......), then learning those techniques and implementing some of them would help boost performance as well. Techniques such as MMIE are just the basics; each site has its own specialized techniques that you might want to know about.

Of course, you normally don't have to think so deeply. Adding more data is usually the first step of ASR improvement. If you start to think about something more advanced, and if you can, please try to put your implementation somewhere public so that everyone in the world can try it out. These are small things to do, but I believe that if we keep doing small things right, there will be a day when we can make open source speech recognizers as good as the commercial ones.

Arthur

The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx. It is an intermittent effort because I am wearing several hats at Voci.

I find myself going back to Sphinx more and more often. Being more experienced, I have started to approach the project carefully again: tracing code, taking notes and understanding what has been going on. It was a humbling experience: speech recognition has changed, and Sphinx has improved more than I could imagine.
Maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences of my life. Unfortunately, not many of my friends know. So Sphinx and I were pretty much disconnected for several years.
So what I plan to do is reconnect. One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.
Let's start small: I just restarted the RSS feeds. You may also see some cross links to my other two blogs: Cumulomaniac, a site with my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.
Both sites are in Chinese. I have been actively working on them and try to update them weekly.
So why do I keep this blog then? Obviously the reason is speech recognition. Though, I have started to realize that doing speech recognition involves much more than just writing a speech recognizer. So from now on, I will post on other topics such as natural language processing and video processing, as well as a lot of low-level programming information.
This will make it a very niche blog. Could I keep up at all? I don't know. As with my other blogs, I will try to write around 50 messages first and see if there is any momentum.
Arthur

Restart

Again, I feel rejuvenated. The last few months of experience have started to make me more unified, both as a person and as a technical person. When you start to work on something which draws on all you know in your life, you know that you are walking on the right path.

Things are starting to look more and more interesting.

The Grand Janitor

Start to look at the repository tree

Programming as a profession is a strange one. If you are a doctor, you can usually carry your knowledge and skills from one place to another, provided that you have exactly the same tools. If you are a programmer, your speed and skill are partially determined by the tools you build in house for a particular place. So, for example, I am not supposed to use any tool I built when I worked at the small video-advertising start-up. Even if I could do something in 1 second back then, once I change jobs I need to start over and rebuild the tool. We are probably talking about days to rebuild the tool and weeks to refine it again.

There is one exception: if you worked in open source, much of your code is stored in a public place. Even when you have left your job for a long time, it is legit for you to use it again. You don't have to solve the same problem again and again. This is the beauty of open source, and I have benefited greatly from it personally.
As I start to regain my muscles in Sphinx, I notice that there have been many changes in the last 6 years. Just look at the top level of Subversion:
File                      Rev.   Age        Author         Last log entry
CLP/                      10079  23 months  dhdfu          Finally add an -F argument to use the full path in the control file as the label…
PocketSphinxAndroidDemo/  11117  9 months   nshmyrev       Wrapper for nbest
SimpleLM/                 22     12 years   rickyhoughton  Initial revision
Speech-Recognizer-SPX/    8933   3 years    nshmyrev       Update module to recent pocketsphinx API
SphinxTrain/              11350  9 days     nshmyrev       Extract warped features during 000 stage if VTLN is enabled. See for detailsht
archive_s3/               7289   4 years    egouvea        Fixed error message in decoder script reporting failure in bw, and made result d…
cmuclmtk/                 11035  10 months  nshmyrev       Fixes bug in wngram2idngram and adds a test for it
cmudict/                  11348  3 weeks    air            cleaned up documentation and code (a bit) recompiled the dict
gst-sphinx/               7848   4 years    dhdfu          Support changing language models at runtime (maybe)
htk2s3conv/               11336  6 weeks    nshmyrev       Adds warning about different number of mixtures
jsgfparser/               7230   4 years    dhdfu          Fix the main program to output the only public rule if no rule is specified, and…
logios/                   11339  4 weeks    tkharris       remove duplicated code
misc_scripts/             10147  22 months  dhdfu          handle zero references
multisphinx/              10945  12 months  dhdfu          clean up better and introduce vocabulary maps
pocketsphinx/             11351  8 days     nshmyrev       Updated lat2dot script. I need to move it to the other location though
pocketsphinx-extra/       9972   2 years    dhdfu          add sc models with mixture_weights and mdef.txt files
scons/                    5868   5 years    egouvea        updated the scons support to reflect that plugin.jar is now part of the package
share/                    5532   6 years    egouvea        Setting dsp and dsw files to have have windows EOL regardless where it's downloa…
sphinx2/                  8767   3 years    egouvea        Updated the sphinx-2 MS files to MS .NET, consistent with the other packages, an…
sphinx3/                  11329  2 months   nshmyrev       Patch to solve memory issues in python module. See for detailshttps://bugzilla
sphinx4/                  11344  3 weeks    nshmyrev       Properly sets logger for AudioFileDataSource. Thanks to Bandele Ola.
sphinx_fsttools/          10791  14 months  nshmyrev       Some bit in AM to FST conversion
sphinxbase/               11346  3 weeks    nshmyrev       Properly select buffer size when using audioresample. Thanks to balkce See fo…
tools/                    9009   3 years    nshmyrev       Updated to the latest release of sphinx4
web/                      10249  21 months  nshmyrev       There is no sphinx3 development anymore
How exciting is that? There were only 6 or 7 top-level directories 7 years ago!
From now on, I will start to post more notes on different tools in the repository.
The Grand Janitor

I am back

Hi Guys,
     I stopped using this blog for 3 years and have now decided to reclaim it. My life as the "Grand Janitor" of the Sphinx software is very memorable for me. It was unfortunate that I stopped the blog and only wrote online in other venues.

     I will start to blog more about speech recognition and natural language processing. This is probably the time for me to read up again. My other blog, Random Thought of Arthur Chan, will solely hold my thoughts on other random things in the world.

     In any case, it's good to meet all of you again.  We'll have fun.

The Grand Janitor