Category Archives: sphinxbase

Grand Janitor's Blog February and March Summary

I wasn't very productive at blogging over the last two months.  Here are a couple of noteworthy blog posts and news items you might find interesting.

GJB also reached the milestone of 100 posts. Thanks for your support!
Newsworthy:


Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

Future Windows Phone speech recognition revealed in leaked video


Google Keep


Feel free to connect with me on Plus, LinkedIn and Twitter.

Arthur

sphinxbase 0.8 and SphinxTrain 1.0.8

I have done some analysis of sphinxbase 0.8 and SphinxTrain 1.0.8 to understand whether they are very different from sphinxbase 0.7 and SphinxTrain 1.0.7.  I don't see big differences, but it is still a good idea to upgrade.

  • (sphinxbase) The bug fix in cmd_ln.c is a must-have.  Basically, the freeing was wrong for all ARG_STRING_LIST arguments.  So chances are you will get a crash when someone specifies a wrong argument name and cmd_ln.c forces an exit, because that exit path eventually leads to cmd_ln_val_free.
  • (sphinxbase) There were also a couple of changes in the fsg tools.  Most of them feel like rewrites to me.
  • (SphinxTrain) SphinxTrain, on the other hand, has new tools such as the g2p framework.  Those are mostly OpenFst-based tools, and it's worthwhile to have them in SphinxTrain.
One final note here: CMUSphinx in general shows a tendency to turn to C++.   C++ is something I love and hate.  It can sometimes be nasty, especially when dealing with compilation.  At the same time, using C to emulate OOP features is quite painful.   So my hope is that we use a subset of C++ which is robust across different compiler versions.
Arthur 

Two Views of Time Signals: Global vs Local

As I have been working on Sphinx at work and have started to chat with Nicholay more, one thing I realize is that several frequently used components of Sphinx need rethinking.  Here is one example related to my recent work.

A speech signal, or more generally any time signal, can be processed in two ways: you either process it as a whole, or you process it in blocks.  You can call the former a global view and the latter a local view.  Of course, there are many other names, such as block/utterance or block/whole, but essentially the terminology means the same thing.

Most of the time, global and local processing give the same result, so you can simply say that the two types of processing are equivalent.
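
To make the equivalence concrete, here is a minimal sketch in Python (it is not code from sphinxbase).  It uses a pre-emphasis filter, y[n] = x[n] - alpha * x[n-1], as a stand-in for a typical local operation.  The function names, alpha = 0.97 and the block size are my own illustrative choices.  As long as the last sample of each block is carried over to the next block, processing in blocks gives the same output as processing the whole signal at once.

```python
def preemphasize(samples, alpha=0.97, prev=0.0):
    """Apply y[n] = x[n] - alpha * x[n-1]; return (output, last input sample)."""
    out = []
    for x in samples:
        out.append(x - alpha * prev)
        prev = x  # remember the current input for the next sample
    return out, prev


def preemphasize_in_blocks(samples, block_size, alpha=0.97):
    """Same filter, but fed block by block with the state carried across blocks."""
    out = []
    prev = 0.0
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        filtered, prev = preemphasize(block, alpha, prev)
        out.extend(filtered)
    return out


signal = [1.0, 2.0, 4.0, 3.0, -1.0, 0.5, 2.5]
whole, _ = preemphasize(signal)                        # global view
blocks = preemphasize_in_blocks(signal, block_size=3)  # local view
print(blocks == whole)
```

The only thing the block version needs is one sample of state.  That is why this kind of operation fits comfortably into a block-based front end.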

Of course, that is no longer true once you start doing an operation which uses information from the whole signal.   For a very simple example, look at cepstral mean normalization (CMN).  Implementing CMN in block mode is certainly an interesting problem.  For example, how do you estimate the mean if you only have a running window?   When you think about it a little bit, you will realize it is not a trivial problem.  That's probably why there are still papers on cepstral mean normalization.
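
Here is a rough sketch of the difference in Python.  It is not how sphinxbase or the decoders actually implement CMN.  cmn_global sees the whole (non-empty) utterance, while cmn_running only sees the past and uses an exponentially decayed mean, which is just one of several possible estimators.  The function names, the decay value and the initial mean are assumptions for illustration.

```python
def cmn_global(frames):
    """Subtract the per-dimension mean computed over the whole utterance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def cmn_running(frames, initial_mean, decay=0.9):
    """Subtract a running mean that is updated after every frame."""
    mean = list(initial_mean)
    out = []
    for f in frames:
        # Normalize with the estimate built from past frames only.
        out.append([x - m for x, m in zip(f, mean)])
        # Then fold the current frame into the estimate.
        mean = [decay * m + (1.0 - decay) * x for x, m in zip(f, mean)]
    return out


frames = [[1.0, 10.0], [3.0, 14.0]]
print(cmn_global(frames))
print(cmn_running(frames, initial_mean=[0.0, 0.0], decay=0.5))
```

The two outputs differ because the running version has to guess the mean before it has seen the utterance.  How to pick the initial mean and the decay is exactly the non-trivial part.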

Translated to Sphinx: if you look at sphinxbase's sphinx_fe, you will realize that the implementation is based on the local mode, i.e. every once in a while, samples are consumed, processed and written to disk.    There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode and bw) will do this on their own.

That's all good, though there is an interesting consequence: what the SF guys call a "feature" is really all the processing that can be done in the local sense, rather than the "feature" you see in either the decoders or bw.

This particular point of view is ingrained within sphinxbase/sphinxX/sphinxtrain (Sphinx4? Not sure yet).  This is quite different from what you will find in HTK, which sees the feature vector as the vector used in Viterbi decoding.

That brings me to another point.  If you look deeper, HTK tools such as HVite/HCopy are highly abstract.  Each tool was designed to take care of its own problem well.  HCopy is really meant to provide just the features, whereas HVite just runs the Viterbi algorithm on a bunch of features.   It's nothing complicated.  Sphinx, on the other hand, is more speech-oriented.  In that world, life is more intertwined.   That's perhaps why you seldom hear of people using Sphinx for research other than speech recognition.  You can, on the other hand, do other machine learning tasks in HTK.

Which view is better?  If you ask me, I wish both HTK and Sphinx were released under a Berkeley license.  Tons of real-life work could be saved because each covers some useful functionality.

Given that only one of them is released under a liberal license (Sphinx), maybe what we need is to absorb some design paradigms from HTK.  For example, HTK has a sense of organizing data as pipes.   That is something SphinxTrain can use.   It would make work easier for Unix users, who usually contribute the most to the community.

I also hope that eventually there will be good clones of the HTK tools made available under a Berkeley/GNU license.  Not that I dislike the status quo: I am happy to be able to read the HTK code (unlike the time before 2.2......).   But once you have worked in the industry for a while, you see that many people actually use both Sphinx and HTK to solve their speech research-related problems.   Of course, many of these guys (if they are honest) need to find extra development time to port some HTK functions into their own production systems.  Not tough, but you will wonder whether the time could be better spent ......

Arthur

Me and CMU Sphinx

As I update this blog more frequently, I have noticed more and more people being directed here.   Naturally, there are many questions about some of my past work, for example, "Are you still answering questions in the CMUSphinx forum?", as well as general requests for certain tutorials.  So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I probably don't want to be a maintainer-at-large any more.   Nick has proven himself to be the most awesome maintainer in our history.   Under his stewardship, Sphinx has prospered in the last couple of years.  That's what I hoped for and that's what we all hoped for.
So for that reason, you probably won't see me much in the forum answering questions.  Rather, I will spend most of my time implementing, experimenting and getting some work done.
There are many things that ought to be done in Sphinx.  Here is my top 5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx maintenance
  3. HTKBook-like documentation, i.e. Hieroglyphs.
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of the Sphinx software, such as using a WFST-based approach.
This is not a small undertaking, so I am planning to spend a lot of time relearning the software.  Yes, you heard it right: learning the software.  In general, I found myself very ignorant of a lot of the software details of Sphinx as of 2012.   There have been many changes.  The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain.   On PocketSphinx and Sphinx4, I need to learn a lot.
That is why in this blog you will see a lot of posts about the status of my learning of a certain piece of speech recognition software.   Some could be minute details.   I share them because people can figure out a lot by going through my notes.   From time to time, I will also pull these posts together to form a tutorial post.
Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot about programming, languages and other technology-related stuff.  Part of it is that I am interested in many things.  The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages.   This might also attract a wider audience in the future.
In any case, I hope I can keep it up.  And I hope you enjoy my articles!
Arthur

The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx.  It is an intermittent effort because I am wearing several hats at Voci.

I find myself going back to Sphinx more and more often.   Being more experienced, I have started to approach the project again carefully: tracing code, taking notes and understanding what has been going on.  It has been a humbling experience: speech recognition has changed, and Sphinx has improved more than I could have imagined.
Maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences of my life.   Unfortunately, not many of my friends know.  So Sphinx and I were pretty much disconnected for several years.
So what I plan to do is reconnect.    One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.
Let's start small: I just restarted the RSS feeds.   You may also see some cross links to my other two blogs: Cumulomaniac, a site on my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.
Both sites are in Chinese.  I have been actively working on them and try to update them weekly.
So why do I keep this blog then?  Obviously, the reason is speech recognition.   Though I have started to realize that doing speech recognition involves much more than just writing a speech recognizer.   So from now on, I will also post on other topics such as natural language processing, video processing and a lot of low-level programming information.
This will mean it is a very niche blog.   Could I keep up at all?  I don't know.   As with my other blogs, I will try to write around 50 messages first and see if there is any momentum.
Arthur