cmu sphinx documentation – The Grand Janitor Blog V3

New Triplet is Released

Just learned this from the CMUSphinx main site: a new triplet of sphinxbase, pocketsphinx and SphinxTrain has been released.

http://cmusphinx.sourceforge.net/2012/12/new-release-sphinxbase-0-8-pocketsphinx-0-8-and-sphinxtrain-0-8/

I took a look at the changes. Most of them work towards better reuse between SphinxTrain and sphinxbase. I think this is very encouraging.

There have been around 600-700 SVN updates since the last major release of the triplet. I think Nick and the SF guys are doing a great job on the toolkit.

As for training, one encouraging part is that there are efforts to improve the training procedure. I have always maintained that model training is the heart of speech recognition. A good model is the key to good speech recognition performance. And great performance is the key to a great user experience.

When will CMU Sphinx walk on the right path? I am still waiting but I am increasingly optimistic.

Arthur

(PS. I have nothing to do with this release. Though, I guess it’s time to go back to actual open-source coding.)

I was browsing the documentation section of cmusphinx.org and was very impressed. Compared to my ad-hoc documents back at www.cs.cmu.edu/~archan, or the old robust group documentation, it is a huge improvement.

What is the challenge in developing documentation for speech recognition? I believe the toughest part is that some people still see speech recognition as a programming task. In real life, though, a speech recognition application should be viewed as a data analysis task.

Here is why: suppose you work on a normal programming task. Once you figure out the algorithm, your job is pretty much done.

On a speech app, though, that is just a tiny step towards a good system. For example, you might notice that your dictionary is not refined enough, so some words are not recognized correctly. Or you might find that something is wrong with your language model, so certain trigrams never appear.
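
The two checks mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`find_oov_words`, `count_trigrams`) and the toy data are hypothetical and not part of any Sphinx tool, and a real check would read the dictionary and transcripts from files.

```python
from collections import Counter


def find_oov_words(transcripts, dictionary_words):
    """Return a count of transcript words that are missing from the dictionary."""
    vocab = {w.lower() for w in dictionary_words}
    oov = Counter()
    for line in transcripts:
        for word in line.lower().split():
            if word not in vocab:
                oov[word] += 1
    return oov


def count_trigrams(transcripts):
    """Count word trigrams, padding each sentence with <s> and </s>."""
    counts = Counter()
    for line in transcripts:
        words = ["<s>"] + line.lower().split() + ["</s>"]
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts


# Toy data (hypothetical): a real run would load these from files.
transcripts = ["open the door", "open the window please"]
dictionary_words = ["open", "the", "door", "please"]

oov = find_oov_words(transcripts, dictionary_words)
trigrams = count_trigrams(transcripts)

print("Words missing from the dictionary:", dict(oov))
print("Count of 'open the door':", trigrams[("open", "the", "door")])
print("Count of 'close the door':", trigrams[("close", "the", "door")])
```

A word that shows up in the first report needs a dictionary entry, and a trigram with a zero count is one the language model never saw in this text. Neither result is a fix in itself; it only tells you where to start looking.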

Those tasks, in terms of skill sets, require a person to stay in front of the Linux console until the Eureka moment comes: “Oh, that’s what’s wrong!”. So the job of “Speech Scientist” usually requires knowledge of statistics and machine learning and, more generally, good analytic skills.

Basic Linux skills are also extremely important. For example, a senior researcher once showed me how he did many things solely with Perl one-liners. As it turns out, when you can wield Perl one-liners correctly, you can solve many text processing problems with one command! This saves you a lot of time writing throw-away scripts and lets you focus on analyzing why things are going wrong.
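
As a rough illustration of the kind of throw-away text processing meant here, the sketch below counts word frequencies over a few transcript lines. It is written in Python rather than Perl, and the helper name `word_frequencies` and the sample lines are hypothetical.

```python
from collections import Counter


def word_frequencies(lines):
    """Count how often each whitespace-separated word occurs, ignoring case."""
    return Counter(word for line in lines for word in line.lower().split())


# Hypothetical sample; in practice the lines would come from a transcript file.
lines = ["go to the store", "go to the park", "stop"]
freq = word_frequencies(lines)

# Print the three most frequent words with their counts.
for word, count in freq.most_common(3):
    print(count, word)
```

The point is less the particular count than the habit: a question about the data should cost one short command, not an afternoon of scripting.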

Back to good speech application documentation: one of the challenging parts is to convey this real-life workflow of a Speech Scientist to the open source community. Many of us learn (and strive to learn more…) this kind of skill the hard way: writing reports, papers and presentations, and being ready to get feedback from other people. You will also find yourself amazed by some brilliant insights and analyses. (There are stupid analyses too, but that’s part of life…)

The Sphinx project collectively has come a long way on this front. If you have time, check out
http://cmusphinx.org/wiki. I found much of the material very useful.

The Grand Janitor.