Category Archives: cmu sphinx

Python multiprocessing

As my readers may have noticed, I haven't updated this blog for a while because I have had a pretty heavy workload. It didn't help that I was sick in the middle of March as well. Excuses aside though, I am happy to be back. Even if I can't write much about Sphinx and programming, I think it's still worth it to keep posting links.

I have also received requests to write in more detail about individual parts of Sphinx. I love these requests, so feel free to send me more. Of course, it usually takes me some time to fully grok a certain part of Sphinx before I can describe it in an approachable way. Until then, I can only ask for your patience.

Recently I have come across parallel processing a lot and was intrigued by how it works in practice. In Python, a natural choice is the multiprocessing library. So here is a simple example of how you can run multiple processes in Python. It is very useful on modern CPUs, which have multiple cores.

Here is an example program showing how that could be done:

 1: import multiprocessing
 2: import subprocess
 3: jobs = []
 4: for i in range(N):
 5:     p = multiprocessing.Process(target=process,
 6:                                 name='TASK' + str(i),
 7:                                 args=(i, ......
 8:                                       )
 9:                                 )
10:     jobs.append(p)
11:     p.start()
12: for j in jobs:
13:     if j.is_alive():
14:         print('Waiting for job %s' % (j.name))
15:     j.join()

The program is fairly trivial. Interestingly enough, it is also quite similar to the multithreading version in Python. Lines 5 to 11 are where the tasks are created and started, and lines 12 to 15 simply wait for them to finish. (N and process stand for your own task count and worker function.)

It feels a little bit less elegant than using Pool, because Pool provides a waiting mechanism for the entire pool of tasks. Right now, I am essentially waiting, one job at a time, for whichever jobs are still running by the time job 1 has finished.
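For comparison, here is a minimal sketch of what the Pool version might look like. The worker `process` below is a hypothetical stand-in that just squares its argument, `run_all` is a name I made up, and the pool size of 2 is arbitrary; swap in your own task and sizes.

```python
import multiprocessing


def process(i):
    """Hypothetical stand-in for the real per-task work."""
    return i * i


def run_all(n_tasks, n_workers=2):
    """Run process(0..n_tasks-1) on a pool and wait for all of them."""
    # map() blocks until every task has finished and keeps the input order;
    # leaving the with-block shuts the pool down.
    with multiprocessing.Pool(processes=n_workers) as pool:
        return pool.map(process, range(n_tasks))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this file.
    print(run_all(8))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The single `pool.map` call replaces both loops of the hand-rolled version: starting the workers and waiting for all of them.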

Is it worthwhile to go down another path, namely thread-based programming? One thing I learned in this exercise is that in older versions of Python, a multi-threaded program can paradoxically be slower than the single-threaded one. (See this link from Eli Bendersky.) It may be less of a problem in recent versions of Python, though.

Arthur

Acoustic Score and Its Signness

Over the years, I have been asked time and again why the acoustic score can be a positive number. That occasionally leads to big confusion among beginner users. So I am writing this article as a kind of road sign for people.

The acoustic score per frame is essentially the log value of a continuous probability density function (pdf). In Sphinx's case, the pdf is a multi-dimensional Gaussian distribution. The acoustic score per phone is then the log likelihood of the phone HMM. You can extend this definition to a word HMM.

As for the sign: if you think of a discrete probability distribution, then this acoustic score thingy should never be positive, because the log of a number between 0 and 1 is negative. In the case of a Gaussian distribution though, when the standard deviation is small, it is possible for the density value to be larger than 1. (Also see this link.) Those are the times you will see a positive value.
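A quick way to convince yourself is to evaluate a one-dimensional Gaussian density at its mean for two different standard deviations. This is only a toy sketch in natural log, not how Sphinx computes its scores; the function name and the numbers are my own choices for illustration.

```python
import math


def gaussian_log_density(x, mean, std):
    """Natural log of a 1-D Gaussian density evaluated at x."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


# Wide Gaussian: the peak density is about 0.4, so the log is negative.
wide = gaussian_log_density(0.0, mean=0.0, std=1.0)

# Narrow Gaussian: the peak density is about 4.0, so the log is positive.
narrow = gaussian_log_density(0.0, mean=0.0, std=0.1)

print(wide)    # about -0.92
print(narrow)  # about 1.38
```

A density only has to integrate to 1, so nothing stops it from rising above 1 over a narrow range.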

One thing you might find jarring is the magnitude of the likelihood you see. Bear in mind that Sphinx2 and Sphinx3 use a very small log base, one just slightly above 1. We are also talking about a multi-dimensional Gaussian distribution. Both make the numerical values bigger.
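To get a feel for the magnitudes, here is a rough sketch of how a natural-log likelihood blows up once it is re-expressed in a log base just above 1. The base of 1.0003, the 39 dimensions and the per-dimension value of -1.5 are all illustrative assumptions, not values read from any Sphinx configuration.

```python
import math

LOG_BASE = 1.0003       # assumed, illustrative base just above 1
N_DIMS = 39             # assumed feature dimension
PER_DIM_LOGLIK = -1.5   # assumed natural-log likelihood from each dimension


def to_log_base(natural_log_value, base=LOG_BASE):
    """Re-express a natural-log value as a logarithm in the given base."""
    return natural_log_value / math.log(base)


# With a diagonal covariance, the per-dimension log likelihoods simply add up.
frame_loglik = N_DIMS * PER_DIM_LOGLIK
frame_score = to_log_base(frame_loglik)

print(frame_loglik)        # -58.5
print(round(frame_score))  # on the order of -195000
```

Dividing by the log of a base this close to 1 multiplies everything by a few thousand, which is why per-frame scores can look enormous.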

Arthur

Also see:
My answer on the Sphinx Forum

Two Views of Time-Signal : Global vs Local

As I have been working on Sphinx at work and have started to chat with Nicholay more, one thing I realize is that several frequently used components of Sphinx need a rethink. Here is one example related to my recent work.

A speech signal, or ...... in general any time signal, can be processed in two ways: you either process it as a whole, or you process it in blocks. You can call the former a global view and the latter a local view. Of course, there are many other names, such as block/utterance or block/whole, but essentially the terminology means the same thing.

Most of the time, global and local processing give the same result, so you can simply say that the two types of processing are equivalent.

Of course, that stops being true once you start to use an operation which needs information from the whole signal. For a very simple example, look at cepstral mean normalization (CMN). Implementing CMN in block mode is certainly an interesting problem. For example, how do you estimate the mean if you only have a running window? When you think about it a little bit, you will realize it is not a trivial problem. That's probably why there are still papers on cepstral mean normalization.
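To make the difference concrete, here is a small sketch that contrasts the two views of CMN. It is not the Sphinx implementation: `cmn_global` and `cmn_running` are names I made up, the sliding-window mean is just one simple way to estimate the mean online, and the window length is an arbitrary assumption.

```python
from collections import deque


def cmn_global(frames):
    """Global view: subtract the mean computed over the whole utterance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def cmn_running(frames, window=100):
    """Local view: subtract the mean of the last `window` frames seen so far."""
    history = deque()
    total = None
    out = []
    for f in frames:
        if total is None:
            total = [0.0] * len(f)
        history.append(f)
        total = [t + x for t, x in zip(total, f)]
        if len(history) > window:
            # Drop the oldest frame so the sum covers only the window.
            old = history.popleft()
            total = [t - x for t, x in zip(total, old)]
        mean = [t / len(history) for t in total]
        out.append([x - m for x, m in zip(f, mean)])
    return out


frames = [[1.0], [2.0], [3.0], [4.0]]
print(cmn_global(frames))             # [[-1.5], [-0.5], [0.5], [1.5]]
print(cmn_running(frames, window=2))  # [[0.0], [0.5], [0.5], [0.5]]
```

The two outputs differ even on this tiny input, because the running version never sees the frames that have not arrived yet.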

Translating this to Sphinx: if you look at sphinxbase's sphinx_fe, you will realize that the implementation is based on the local mode, i.e. every once in a while, samples are consumed, processed and written to disk. There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode and bw) will do this on their own.

It's all good, though there is an interesting consequence: what the SF guys call a "feature" is really all the processing that can be done in the local sense, rather than the "feature" you see in either the decoders or bw.

This special point of view is ingrained within sphinxbase/sphinxX/sphinxtrain (Sphinx4? Not sure yet). This is quite different from what you will find in HTK, which sees the feature vector as the vector used in Viterbi decoding.

That brings me to another point. If you look deeper, HTK tools such as HVite and HCopy are highly abstract. Each tool was designed to take care of its own problem well. HCopy is really meant to provide just the features, whereas HVite just runs the Viterbi algorithm on a bunch of features. It's nothing complicated. Sphinx, on the other hand, is more speech-oriented. In that world, life is more intertwined. That's perhaps why you seldom hear of people using Sphinx for research other than speech recognition. You can, on the other hand, do other machine learning tasks in HTK.

Which view is better? If you ask me, I wish both HTK and Sphinx were released under the Berkeley license. Tons of real-life work could be saved because each covers some useful functionality.

Given that only one of them (Sphinx) is released under a liberal license, maybe what we need is to absorb some design paradigms from HTK. For example, HTK has a sense of organizing data as pipes. That is something SphinxTrain can use. This would help Unix users, who usually contribute the most in the community.

I also hope that eventually there will be good clones of the HTK tools made available under a Berkeley/GNU license. Not that I dislike the status quo: I am happy to be able to read the code of HTK (unlike the time before 2.2......). But as you work in the industry for a while, you see that many are actually using both Sphinx and HTK to solve their speech research-related problems. Of course, many of these guys (if they are honest) need to come up with extra development time to port some HTK functions into their own production systems. Not tough, but you will wonder whether the time could be better spent ......

Arthur

Me and CMU Sphinx

As I update this blog more frequently, I have noticed more and more people being directed here. Naturally, there are many questions about some of my past work, for example, "Are you still answering questions in the CMUSphinx forum?", and more generally requests for certain tutorials. So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again but no, I probably don't want to be a maintainer-at-large any more. Nick has proved himself to be the most awesome maintainer in our history. Under his stewardship, Sphinx has prospered in the last couple of years. That's what I hoped for and that's what we all hoped for.

For that reason, you probably won't see me much in the forum answering questions. Rather, I will spend most of my time implementing, experimenting and getting some work done.

There are many things that ought to be done in Sphinx. Here is my top 5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKbook-like documentation : i.e. Hieroglyphs.
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of the Sphinx software, such as using a WFST-based approach.

This is not a small undertaking, so I am planning to spend a lot of time relearning the software. Yes, you heard it right: learning the software. In general, I find myself very ignorant of a lot of the software details of Sphinx as of 2012. There have been many changes. The parts I have really kept up with are probably sphinxbase, sphinx3 and SphinxTrain. On PocketSphinx and Sphinx4, I need to learn a lot.

That is why, in this blog, you will see a lot of posts about my progress in learning a certain piece of speech recognition software. Some could be minute details. I share them because people can figure out a lot by following my progress. From time to time, I will also pull these posts together to form a tutorial post.

Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot of things about programming, languages and other technology-related stuff. Part of it is that I am interested in many things. The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages. This might also attract a wider audience in the future.

In any case, I hope I can keep it up. And I hope you enjoy my articles!

Arthur

Self Criticism : Hieroglyph

When I was working on CMU Sphinx, I was more of an aggressive young guy and loved to start many projects (still do). So I started many projects and not many of them were completed. I wasn't completely insane: what we lacked at that point of development was passion and momentum, so working on many things gave a sense that we were moving forward.

One of the projects, for which I feel I should take responsibility, is the Hieroglyph. It was meant to be a complete set of documentation on how several Sphinx components work together. But when I finished the 3rd draft, my startup work kicked in. That's why what you can see is only an incomplete form of the document.

Fast-forward 6 years, and it is unfortunate that the document is still the most comprehensive source if you want to understand the underlying structure and methods of the CMU Sphinx C-based executables. The current CMU Sphinx encompasses way more than I decided to cover. For example, the Java-based Sphinx4 has gained a large following. And pocketsphinx is pretty much the de facto speech recognizer for embedded speech recognition.

If you have been following me (unlikely but possible), you will know that I have personally changed substantially. For example, my job experience taught me that Java is a very important language and that having a recognizer in Java significantly boosts the project. I also feel embedded speech recognition is probably the real future.

Back to Hieroglyph: suffice it to say, it is not yet a sufficient document. I hope that I can go back to it and ask what I can do to make it better.

Arthur

New Triplet is Released

Just learned from the CMUSphinx main site: it sounds like there is a new triplet of sphinxbase, pocketsphinx and SphinxTrain released.

http://cmusphinx.sourceforge.net/2012/12/new-release-sphinxbase-0-8-pocketsphinx-0-8-and-sphinxtrain-0-8/

I took a look at the changes. Most of them work towards better reuse between SphinxTrain and sphinxbase. I think this is very encouraging.

There have been around 600-700 SVN updates since the last major release of the triplet. I think Nick and the SF guys are doing a great job on the toolkit.

As for training, one encouraging part is that there are efforts to improve the training procedure. I have always maintained that model training is the heart of speech recognition. A good model is the key to good speech recognition performance. And great performance is the key to a great user experience.

When will CMU Sphinx walk on the right path?   I am still waiting but I am increasingly optimistic.

Arthur

(PS. I have nothing to do with this release.  Though, I guess it's time to go back to actual open-source coding.)

CMU Sphinx Documentation

I was browsing the documentation section of cmusphinx.org and was very impressed. Compared to my ad-hoc version of the documents back at www.cs.cmu.edu/~archan, or the old robust group document, it is a huge improvement.

What is the challenge in developing documentation for speech recognition? I believe the toughest part is that some people still see speech recognition as a programming task. In real life though, a speech recognition application should be viewed as a data analysis task.

Here is why: suppose you work on a normal programming task. Once you figure out the algorithm, your job is pretty much done.

On a speech app though, that is just a tiny step towards a good system. For example, you might notice that your dictionary is not refined enough, so that some of the words are not recognized correctly. Or you find that something is wrong with your language model, such that certain trigrams never appear.

Those tasks, in terms of skill sets, require a person to sit in front of the Linux console until they come up with a Eureka moment: "Oh, that's what's wrong!". So the job of "Speech Scientist" usually requires knowledge of statistics and machine learning, and more generally good analytic skills.

Your basic Linux skills are also extremely important: e.g. a senior researcher once showed me how he did many things solely with perl one-liners. As it turns out, when you can wield perl one-liners correctly, you can solve many text processing problems with one command! This saves you a lot of time writing throw-away scripts and allows you to focus on analyzing why things are going wrong.

Back to good speech application documentation: one of the challenging parts is to convey this real-life workflow of a Speech Scientist to the open source community. Many of us learn (and strive to learn more...) this kind of skill the hard way: writing reports, papers and presentations, and being ready to get feedback from other people. You will also find yourself amazed by some brilliant insights and analyses. (There are stupid analyses too, but that's part of life......)

The Sphinx project collectively has come a long way on this front. If you have time, check out
http://cmusphinx.org/wiki. I found much of the material very useful.

The Grand Janitor.

What should be our focus in Speech Recognition?

If you work in a business long enough, you start to understand better what types of work are important. As with many things in life, sometimes the answer is not trivial. For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder. For many, the decoder, the speech recognizer, or the "computer thing" which does all the magic of recognizing speech, is the core of the work.

Indeed, working on a decoder is loads of fun. If you are a fresh new programmer, it is also one of those experiences which will teach you a lot of things. Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format issues and system issues. You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search. For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.

Writing a speech recognizer is also very important when you need to deal with speed issues. You might want to fit a recognizer into your mobile phone or even just a chip. For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high speed speech recognition (faster than 100xRT). All these system-related issues require an understanding of the decoder itself.

This makes speech recognition an exciting field similar to chess programming. Indeed the two fields are very similar in terms of code development. Both require a deep understanding of search as a process. Both have eccentric figures popping up and popping out. There are more stories untold than told in both fields. Both are fascinating fields.

There is one way in which speech recognition and chess programming are very different. This is also a subtle point which even many savvy and resourceful programmers don't understand: how each of these machines derives its knowledge sources. In speech, you need to have a good model to do a decent job on your task. In chess though, most programmers can proceed to write a chess player with the standard piece values. As a result, there is a process before anyone can use a speech recognizer: first training an acoustic model and a language model.

The same decoder, given different acoustic models and language models, can give users perceptions ranging from a total trainwreck to a modern wonder, bordering on magic. Those are the true ingredients of our magic. Unlike magicians though, we are never shy to talk about these secret ingredients. They are just too subtle to discuss. For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition. It usually results in a 10% absolute performance gap." Those are not party talks. Those are talks for when you want to have no friends. 🙂

Both types of task require learning that is different from programming training. 10 years ago, those skills were generally held by "Mathematicians, Statisticians or People who specialized in Machine Learning". Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work. If you ask some high-minded researchers in the field, they will almost always think those are not interesting work. Yet in real life, you can almost always learn something new and improve your systems based on them. Maybe I will talk about this more next time.

The Grand Janitor

Development of Sphinx 3.X (X = 6 to 8) and its Ramification.

One of the things I did back in my Sphinx days was the so-called "Great Refactoring" of Sphinx 3, SphinxTrain and sphinxbase. It was started by me but mostly taken up by Dave (in a disgruntled manner 🙂). I am writing this article to reflect on the whole process and ask whether I did the right thing.

The background is like this: as you know, the CMU Sphinx project has many recognizers: Sphinx2, 3, 4, PocketSphinx and MultiSphinx. It's easy to understand why that happened in the first place. CMU is a university and understandably has many different types of projects. In essence, when someone thinks of a good new idea, they simply implement a recognizer. The by-product would be a PhD thesis or some kind of project report.

There is nothing wrong with that. Think of the pain of understanding and changing a recognizer which has 10-30 thousand lines of code, and you will know that it is not for the faint of heart. Many of the original programmers of the recognizers also had practical reasons to ignore code re-usability - many of them had deadlines to meet. So I always feel empathy towards them.

Of course, on the other side of the coin, having many recognizers gives users a mild amount of pain. Just looking at 3.0 and 3.3, the command-line interface had changed (e.g. -meanfn became -mean). So when people needed to interface with the code, it took some understanding. The bigger problem is this: should you expect a certain feature that appears in one of the decoders to appear in another? This kind of inconsistency is very hard to explain to normal users.

So here came the first change, at 3.5, around 6-7 years ago: I decided to merge 3.0's series of tools and recognizer with 3.3, the fast decoder. I have to say, the decision was mainly driven by youthful naivete and year-long insomnia ( 🙂 ). There was also frustration from users which drove me to make those changes. In 3.5, the main thing I did was just to "port" the tools from the old 3.0, such as allphone, astar and align, to 3.x. There were some command-line interface changes. So far, all cool.

Then came 3.6. At this point, I started to realize that a lot of underlying functions and libraries were duplicated. For example, we had multiple GMM computation routines but you couldn't use them in all the tools which call GMM computation. For instance, allphone in 3.5 used GMM computation, but you couldn't expect any fast GMM computation in 3.4 to be usable in allphone, simply because the library wasn't shared.

So what did the young and naive me think? Let's try to write a single architecture to incorporate all these different things! (!!!!) Now... this is where I think things went wrong.

Let me explain a little bit more. There is a legitimate reason why the original programmer (Ravi) decided to split the tools into multiple parts and let the code be duplicated: the issues in align are not necessarily the issues of decode. If the programmer of align needs to consider the issues of decode, then it will take a long time to really get any programming done.

This happened to be the case with Sphinx 3.X. Now for the development of Sphinx 3.X, there was another undesirable factor: I decided to leave. I simply couldn't overcome the economic force at the time - a startup company was willing to hire me.

To complicate the matter, we *also* decided to factor out the common parts between SphinxTrain and sphinx3 to avoid code duplication between the two. Again, it was driven by a legitimate concern: the fact that there were two feature extraction routines in two packages constantly made users ask themselves whether the front-ends were matched.

All of these, except my leaving, were good things, but they entailed coding time. The end effect was that it made the effort too big and too time-consuming. 3.6 took me around 1 year to write and release. I made an official release around mid-2006 but there were still too many issues in the program. For the later 3.8, Dave took over and really fixed many bugs. So I always think it was Dave who made Sphinx 3.X into its current stable form.

To the credit of the guys in the team, they really bashed me: Evandro, being circumspect and consistent, always asked whether it was a good idea in the first place. Ravi, always the wise man, brought up the issues of merging the code. And of course, there is Dave, who deserves most of the credit for fixing a lot of nasty bugs.

So, in fact, it is really I who should be blamed in the process. I guess I am finally mature enough to apologize to everyone.

So you may wonder why I am saying all of this? Oh well, first of all, that's because I am going to put work into the recognizers again. Not just Sphinx 3, but all the other recognizers. So my first hope is that I don't repeat my past problems.

Now, given that the code has been iterated on over the last 6 years, the benefit of merging the code in Sphinx 3 is starting to really show. People can do a lot more things than in the past. Is it good enough? I don't think so. Sphinx 3 has a lot of potential but it's very misunderstood. In a nutshell, I need to put more work into it in the future.

The Grand Janitor

Getting back to the project.....

After several years of not touching Sphinx (or for that matter, any serious coding), I have started to have a conversation with myself, namely, the me who maintained Sphinx 3.X 6 years ago.

When I was working on the project, I was tasked to work on Sphinx 3. I have been an advocate of Sphinx 3 ever since. To tell the truth, I might have overdone it - there are many great recognizers in the world. Just look within the family: Sphinx 4, PocketSphinx and recently MultiSphinx by Dave are all great recognizers. (Dave has also fixed a lot of my bugs. So if you look into the source code, you will see places where he screamed or, to paraphrase, "Arthur, what are you talking about?")

Experience with many outside companies changed me. I literally turned from a naive twenty-something guy into a thirty-something guy. Still naive, but my world view has certainly changed. In fact, for many purposes, I have found that learning all the components of Sphinx is very beneficial.

Let's think of it this way: each of the projects from CMU Sphinx was meant to solve a practical problem in real life. For example, in Sphinx 4, not only do you have great out-of-the-box performance, you also get native code which can be incorporated into Java-based servers. This is a huge plus when you are thinking of writing a web application. And web applications will be around for a long time.

The same goes for PocketSphinx: it is meant to be a version of Sphinx which can be integrated into different embedded systems. I have yet to learn about MultiSphinx but I always have faith in Dave and his ideas.

This makes me want to learn again. It's weird: once you open your mind, you will see doors everywhere. For me, my next targets are learning Sphinx 4 and PocketSphinx. Both of them have great importance. Will I still work on Sphinx 3? Probably. X can always be bigger than 8. It's the programming reality which made me change. As I think of it now, it's a good change, a very good change.

The Grand Janitor