
Speech Recognition vs SETI

If you follow CMUSphinx news, you may have noticed that the SourceForge folks have started to distribute training data through BitTorrent (link).

That’s a great move.  One of the issues in ASR is the lack of machine power for training.  To give a blunt example, it is possible to squeeze out extra performance just by searching for the best training parameters.  Not to mention that many modern training techniques take a long time to run.

I do recommend that all of you help the effort.  Again, I am not involved at all; I just feel that it is a great cause.

Of course, things in ASR are never easy, so I want to raise two subtle points about the whole distributed approach to training.

Improvement over the years?

The first question you may ask: does that mean ASR can be like a project such as SETI, which would automatically improve over the years?  Not yet; ASR still has its unique challenges.

The major part, as I see it, is how we can incrementally grow the pool of phonetically balanced transcribed audio.  Note that it is not just audio, but transcribed audio.  Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down, word by word, what the audio really says.  (So transcribing 1,000 hours of audio costs roughly 5,000-10,000 person-hours.)  All these transcriptions then need to be cleaned up and put into a consistent format.

This is what VoxForge tries to achieve, and it is not a small undertaking.  Of course, compared to the speed of industry development, the progress is still too slow.  The last I heard, Google was training its acoustic model on 38,000 hours of data.  The WSJ corpus is a toy task compared to that.

Now, thinking along these lines: if we want to build the best recognizer through open source, what is the bottleneck?  I bet the answer doesn’t lie in machine power; whether we have enough transcribed data is the key.  So that’s something to ponder.

(Added Dec 27, 2012: on the initial amount of data, Nickolay corrected me, saying that the amount of data collected for Sphinx is already on the order of 10,000 hours.  That includes “librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages”.

So it does sound like Sphinx has an amount of data that rivals commercial companies’.  I am very interested to see how we can train an acoustic model with that much data.)

We build it, they will come?

ASR has always been shrouded in misunderstanding.  Many believe it is a solved problem; many believe it is an unsolvable problem.  99.99% of the world’s population is uninformed about the problem.

I bet a lot of people would be fascinated by SETI, which …. woa …. lets you communicate with unknown intelligent beings in the universe.  Rather than by ASR, which ….. em ….. many these days regard mainly as a source of satire and parody.

So here comes another problem: the public doesn’t understand ASR well enough to see it as an important problem.  When you think about it more, this is a dangerous situation.  Right now, a couple of big companies control the resources for training cutting-edge speech recognizers.  So let’s say that in the future everyone needs to talk to a machine on a daily basis.  These big companies would then be so powerful that they could control our daily lives.  To be honest with you, this thought haunts me from time to time.

I believe we should continue to spread information on how to properly use an ASR system, and at the same time continue to build applications that showcase ASR and let the public understand its inner workings.  Unlike subatomic particle physics, HMM-based ASR is not that difficult to understand.  On this front, I appreciate all the work done by the developers of CMUSphinx, HTK, Julius, and all the other open source speech recognition projects.
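To back up that claim: at its core, an HMM decoder just finds the best state sequence with the Viterbi algorithm.  Here is a minimal sketch in Python with made-up toy numbers (two states, three frames); a real recognizer scores thousands of context-dependent states per frame, but the recursion is the same.

```python
import numpy as np

# A minimal Viterbi decoder over a toy 2-state HMM.
# All probabilities below are invented for illustration.
states = ["S1", "S2"]
log_pi = np.log([0.6, 0.4])          # initial state probabilities
log_A = np.log([[0.7, 0.3],          # transition matrix A[i][j]
                [0.4, 0.6]])
# log_B[t][j]: log-likelihood of the frame at time t under state j
# (in a real recognizer these come from GMM acoustic scores)
log_B = np.log([[0.9, 0.2],
                [0.1, 0.8],
                [0.3, 0.7]])

T, N = log_B.shape
delta = np.full((T, N), -np.inf)     # best log-score ending in state j
back = np.zeros((T, N), dtype=int)   # backpointers for traceback

delta[0] = log_pi + log_B[0]
for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] + log_A[:, j]
        back[t, j] = np.argmax(scores)
        delta[t, j] = scores[back[t, j]] + log_B[t, j]

# Trace back the single best state sequence
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print([states[s] for s in path])     # prints ['S1', 'S2', 'S2']
```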

Conclusion

I love the recent move of Sphinx distributing acoustic data over BitTorrent; it is another step toward a self-improving speech recognition system.  There are still things the open source speech community needs to ponder.  I mentioned a couple; feel free to bring up more in the comment section.

Arthur

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

Now that I am back, I have started visiting all my old friends: the open source speech recognition toolkits.  The usual suspects are still around, and there are also many new kids in town, so this is a good time to take a look.

It was a good exercise for me; five years of not thinking about open source speech recognition is a bit long.  It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post, as you probably know something about it already if you are reading this blog.  Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post.  This post serves only as an overview; most of the toolkits listed here have rich documentation, and you will find much useful information there.

HTK

I checked out the Cambridge HTK web page.  Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which is still great but not as exciting as the new kids in town such as Kaldi.

HTK has always been one of my top three speech recognition systems, since most of my graduate work was done using it.  There are also many tricks you can do with its tools.

As a toolkit, I also find its software engineering practices admirable.  For example, the command-line tools are all built on common libraries underneath.  (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.)  When reading the source code, you sense a lot of regularity, and there doesn’t seem to be much duplicated code.

The license disallows commercial use, but that’s okay.  With ATK, which is released under a freer license, you can still include the decoder code in a commercial application.

Kaldi

The new kid in town.  It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques.  His recognizer attracts much interest as it implements features such as subspace GMMs and FST-based decoding.  Above all, these features feel more “modern”.

I have only had a little exposure to the toolkit (but I am determined to learn more).  Unlike Sphinx and HTK, it is written in C++ instead of C.  As of this writing, Kaldi’s compilation takes a long time and the binaries are *huge*.  On my setup, it took around 5 GB of disk space to compile.  That probably means I haven’t set it up correctly …… or, more likely, the executables are not stripped.  Either way, actively working on Kaldi’s source code takes some discretion in terms of disk space.

Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as the unifying representation of its knowledge sources.  By contrast, you could say most current open source speech recognizers use ad hoc knowledge-source representations.

Are there any differences in terms of performance, you ask?  In my opinion, probably not much if you are doing an apples-to-apples comparison.  The strength of WFSTs is that when you need to introduce new knowledge, in theory you don’t have to hack the recognizer: you just write your knowledge as an FST, compose it with your existing knowledge network, and you are all set.
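To make that concrete, here is a toy sketch of composition in plain Python: an unweighted, epsilon-free transducer is just a set of arcs, and composing a one-word “lexicon” with a trivial “grammar” yields a network that maps phones to words.  The data structures and labels are invented for illustration; real toolkits use OpenFst, which adds weights, epsilon handling, and heavy optimization.

```python
# Toy (unweighted, epsilon-free) transducers: arcs are
# (src, in_label, out_label, dst) plus a start state and final states.

def compose(t1, t2):
    """Compose t1 with t2: follow arcs whose t1 output label matches
    the t2 input label; states of the result are pairs of states."""
    start = (t1["start"], t2["start"])
    arcs, finals, stack, seen = [], set(), [start], {start}
    while stack:
        p, r = stack.pop()
        if p in t1["finals"] and r in t2["finals"]:
            finals.add((p, r))
        for (p1, a, b, q) in t1["arcs"]:
            if p1 != p:
                continue
            for (r1, b2, c, s) in t2["arcs"]:
                if r1 == r and b2 == b:
                    arcs.append(((p, r), a, c, (q, s)))
                    if (q, s) not in seen:
                        seen.add((q, s))
                        stack.append((q, s))
    return {"start": start, "finals": finals, "arcs": arcs}

# L: a one-word lexicon mapping the phones k-ae-t to the word CAT
L = {"start": 0, "finals": {3},
     "arcs": [(0, "k", "CAT", 1), (1, "ae", "-", 2), (2, "t", "-", 3)]}
# G: a trivial grammar looping over the word CAT (passing '-' through)
G = {"start": 0, "finals": {0},
     "arcs": [(0, "CAT", "CAT", 0), (0, "-", "-", 0)]}

LG = compose(L, G)  # phones in, words out: a tiny decoding network
for arc in LG["arcs"]:
    print(arc)
```

The point of the formalism is exactly this uniformity: the lexicon, the grammar, and even context dependency can all be expressed as transducers and combined with the same compose operation.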
In reality, WFST-based technology still seems to have practical problems.  As the vocabulary grows large and the knowledge sources get more complicated, the fully composed decoding WFST can naturally outgrow system memory.  As a result, many sites have proposed different techniques to make the decoding algorithm work.

Those are downsides, but the appeal of the technique should not be overlooked.  That’s why Kaldi has become one of my favorite toolkits recently.

Julius

Julius is still around!  And I am absolutely jubilant about it.  Julius is a high-speed speech recognizer that can decode with a 60k-word vocabulary.  One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed this idea from Julius when I first wrote it.
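For the curious, here is a minimal sketch of the idea in Python (all shapes, counts, and the beam value are made up, and a single unit-variance Gaussian stands in for a full GMM): score the cheap context-independent (CI) phone models first, keep only the base phones within a beam of the best, and evaluate the expensive context-dependent (CD) states only when their base phone survives, backing the rest off to the CI score.

```python
import numpy as np

# Toy illustration of CI-GMM-based selection (CIGMMS-style pruning).
# Shapes, counts, and the beam are invented; a real recognizer
# evaluates thousands of context-dependent states per frame.

rng = np.random.default_rng(0)
n_ci, n_cd, dim = 50, 2000, 39          # CI phones, CD states, feature dim
ci_means = rng.normal(size=(n_ci, dim))
cd_means = rng.normal(size=(n_cd, dim))
cd_to_ci = rng.integers(0, n_ci, n_cd)  # each CD state's base CI phone
frame = rng.normal(size=dim)            # one acoustic feature frame

def log_score(means, x):
    # Unit-variance single Gaussian as a stand-in for a full GMM score
    return -0.5 * np.sum((means - x) ** 2, axis=1)

# Step 1: cheap pass over all CI phone models
ci_scores = log_score(ci_means, frame)

# Step 2: keep CI phones scoring within a beam of the best one
beam = 10.0
survivors = set(np.flatnonzero(ci_scores >= ci_scores.max() - beam))

# Step 3: evaluate a CD state only if its base phone survived;
# otherwise back off to the (already computed) CI score
cd_scores = np.empty(n_cd)
for j in range(n_cd):
    if cd_to_ci[j] in survivors:
        cd_scores[j] = log_score(cd_means[j:j + 1], frame)[0]
    else:
        cd_scores[j] = ci_scores[cd_to_ci[j]]

n_eval = sum(int(cd_to_ci[j] in survivors) for j in range(n_cd))
print(f"evaluated {n_eval} of {n_cd} CD states for this frame")
```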
Julius is only a decoder, and the beauty of it is that it never claims to be more than that.  Accompanying the software is the new Juliusbook, a guide to using it.  I think the documentation goes into greater depth than that of comparable projects.

Julius comes with a set of Japanese models, not English ones.  This might be one of the reasons why it is not as popular (or rather, not as talked about) as HTK/Sphinx/Kaldi.

(Note added 2013-03-20: I later learned that Julius now also comes with an English model.  In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news.  I am not surprised; HTK was used as the acoustic model trainer.)

So far……

I have gone through three of my favorite recognition toolkits.  In the next post, I will cover several other available toolkits.
Arthur