Around December last year, I wrote an article on open source speech recognizers covering HTK, Kaldi and Julius. One thing you should know is that, just like CMUSphinx, all of these packages contain their own implementation of the Viterbi algorithm. So if you ask someone in the field of speech recognition, they will usually say the open source speech recognizers are Sphinx, HTK, Kaldi and Julius.
That's how I usually view speech recognition too. After years working in the industry, though, I have started to realize that this definition of speech recognizer = Viterbi algorithm can be constraining. In fact, from the user's point of view, a good speech application system should be a combination of
a recognizer + good models + good GUI.
I like to call the former type of "speech recognizer" a "speech recognition engine" and the latter type a "speech recognition application". Both types of "speech recognizers" are worthwhile to work on. From the users' point of view, it might just be a technicality to differentiate them.
As a recovering speech recognition programmer (another name throwing 🙂 ), one thing I notice is that much effort now goes into writing "speech recognition applications". It is a good trend, because most people in academia never spent much time writing good speech applications. And in open source, we badly need good applications such as dictation machines, IVR and C&C.
One effort which really impressed me is Simon. That is unusual for me, because most of the time I only care about engine-level software. But in Simon's case, you can see that a couple of its features really solve problems in real life and integrate into the bigger theme of open source speech recognition.
- In 0.4.0, Simon started to integrate with Sphinx. So if someone wants to develop it commercially, they can.
- The Simon team also deliberately built context switching into the application, which is good work as well. In general, if you always use a huge dictionary, you are just over-recognizing words for any given context.
- Last but not least, I like the fact that it integrates with Voxforge. Voxforge is the open source answer to the large speech databases of commercial speech companies. So integration with Voxforge will ensure an increasing amount of data for your application.
So kudos to the Simon team! I believe this is the right kind of thinking to start a good speech application.
I have done some analysis on sphinxbase 0.8 and SphinxTrain 1.0.8, trying to understand whether they are very different from sphinxbase 0.7 and SphinxTrain 1.0.7. I don't see big differences, but it is still a good idea to upgrade.
- (sphinxbase) The bug fix in cmd_ln.c is a must-have. Basically, the freeing was wrong for all ARG_STRING_LIST arguments. So chances are you will get a crash when someone specifies a wrong argument name: cmd_ln.c forces an exit, which eventually leads to a cmd_ln_val_free on the badly handled value.
- (sphinxbase) There were also a couple of changes in the fsg tools. Mostly I feel those are rewrites.
- (SphinxTrain) SphinxTrain, on the other hand, has new tools such as the g2p framework. Those are mostly openfst-based tools, and it's worthwhile to put them into SphinxTrain.
One final note here: there is a tendency for CMUSphinx, in general, to turn to C++. C++ is something I love and hate. It can sometimes be nasty, especially when dealing with compilation. At the same time, using C to emulate OOP features is quite painful. So my hope is that we use a subset of C++ which is robust across different compiler versions.
As my readers may have noticed, I haven't updated this blog because of a pretty heavy workload. It doesn't help that I was sick in the middle of March as well. Excuses aside though, I am happy to come back. Even if I can't write much about Sphinx and programming, I think it's still worth it to keep posting links.
I have also received requests to write in more detail about individual parts of Sphinx. I love these requests, so feel free to send me more. Of course, it usually takes me some time to fully grok a certain part of Sphinx before I can describe it in an approachable way. So until then, I can only ask for your patience.
Recently I have been running into parallel processing a lot and was intrigued by how it works in practice. In Python, a natural choice is the multiprocessing library. So here is a simple example of how you can run multiple processes in Python. It is very useful on modern CPUs, which have multiple cores.
Here is an example program on how that could be done:
 1: import multiprocessing
 2: import subprocess
 3: jobs = []
 4: for i in range(N):
 5:     p = multiprocessing.Process(target=process,
 6:                                 name='TASK' + str(i),
 7:                                 args=(i,))
 8:     jobs.append(p)
 9:
10: for p in jobs:
11:     p.start()
12: for j in jobs:
13:     if j.is_alive():
14:         print('Waiting for job %s' % j.name)
15:     j.join()
The program is fairly trivial. Interestingly enough, it is also quite similar to the multithreaded version in Python. Lines 5 to 11 are where you set up and start your tasks, and I just wait for the tasks to finish in Lines 12 to 15.
It feels a little less elegant than using Pool, which provides a waiting mechanism for the entire pool of tasks. Right now, I am essentially waiting on each job in turn, even on jobs which are still running by the time job 1 is finished.
Is it worthwhile to go down the other path, which is thread-based programming? One thing I learned in this exercise is that in older versions of Python, a multi-threaded program can paradoxically be slower than the single-threaded one because of the GIL. (See this link from Eli Bendersky.) It may be less of an issue in recent Python though.
Taeuber's Paradox and the Life Expectancy Brick Wall by Kas Thomas
Simplicity is Wonderful, But Not a Requirement by James Hague
Yeah. I knew a professor who always wanted to rewrite speech recognition systems so that they would be easier for research. Ahh...... modern speech recognition systems are complex anyway. Not making mistakes is already very hard, not to mention building a good research system which is easy for everyone to use. (Remember, everyone has a different research goal.)
I read "Why Hating Your Shitty Job Only Makes It Worse". There is something positive about the article, but I can't completely agree with the authors.
Part of the dilemma of working in a traditional office space is that, inevitably, some kind of a*holes and bad systems will appear in your life. The question is whether you want to ignore them or not. You should be keenly aware of your work conditions and make a rational decision about staying or leaving.
I worked on Sphinx 3 a lot. These days, it is generally regarded as an "old-style" recognizer compared to Sphinx 4 and PocketSphinx. It is also not officially supported by the SF guys.
Coders of speech recognition think a little differently. They usually stick to a certain codebase which they feel comfortable with. For me, it is not just a personal preference; it also reflects how much I know about a certain recognizer. For example, I know quite a bit about how Sphinx 3 performs. These days, I am trying to learn how Sphinx 4 fares as well. So far, if you ask me to choose an accurate recognizer, I will still probably choose Sphinx 3, not because the search technology is better (Sphinx 4's is way superior), but because it can easily be made to support several advanced modeling types. This seems to be how the 2010 developer meeting concluded as well.
But that's just me. In fact, I am bullish on all Sphinx recognizers. One thing I want to note is the power of Sphinx 4 in development: many projects are based on Sphinx 4. These days, if you want to get a job in speech recognition, knowing Sphinx 4 is probably a good ticket. That's why I am quite keen on learning more about it, so hopefully I can write about both recognizers more.
In any case, this is a Sphinx 3 article. I will probably write more on each of its components. Feel free to comment.
How Sphinx 3 is initialized:
Here is a listing of the functions involved in Sphinx 3's initialization, which I got from Sphinx 3.0.8. Essentially, there are 3 layers of initialization: kb_init, kbcore_init and s3_am_init. Separating kb_init and kbcore_init probably started very early in Sphinx 3, whereas separating s3_am_init from kbcore_init was probably my doing. (So all blame is on me.) That was to support -hmmdir.
kb_init
    -> kbcore_init (*)
    -> set operation mode
    -> look for feat.params very early on
    -> s3_am_init (*)
        -> misc. model init, such as mgau_init
        -> dict2pid_build  <- should be put into search
        -> read in mdef
            -> depends on -senmgau type (.semi or .s3cont)
        -> -hmmdir overrides all other sub-parameters
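To make the layering concrete, here is a hypothetical Python sketch of how a kb_init -> kbcore_init -> s3_am_init chain can let -hmmdir override the individual sub-parameters. None of these function bodies are actual Sphinx 3 code, and the model-file names are made up for illustration:

```python
import os

def s3_am_init(config):
    # acoustic-model layer: if -hmmdir is given, derive every model
    # file from that directory, overriding the sub-parameters
    if config.get('hmmdir'):
        for name in ('mdef', 'mean', 'var', 'tmat', 'mixw'):
            config[name] = os.path.join(config['hmmdir'], name)
    return {name: config.get(name)
            for name in ('mdef', 'mean', 'var', 'tmat', 'mixw')}

def kbcore_init(config):
    # core layer: acoustic model plus (in the real thing) dict, lm, ...
    return {'am': s3_am_init(config)}

def kb_init(config):
    # top layer: operation mode, search structures, etc. would go here
    return {'core': kbcore_init(config)}

kb = kb_init({'hmmdir': '/models/hmm', 'mdef': 'custom.mdef'})
print(kb['core']['am']['mdef'])  # -hmmdir wins over the explicit mdef
```

Pulling the acoustic-model layer out into its own function is what makes an override switch like this easy: it is applied once, in one place, before any sub-parameter is consumed.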
I guess I was too confident again when building up this site. Once again, work has taken much of my time and I couldn't get back to this blog soon.
It also depends on whether my work is camera-ready. Currently I have a couple of articles which are ready to hash out but need some refinement.
Let's see if I can come back a month later. For now, you might see a lot of filler material on this blog.
Can you stop loving this video?
The original link
Read it. It doesn't just apply to research; it applies to any job.