training scripts – The Grand Janitor Blog V3

The term “speech recognition” is a misnomer.

Why do I say that? I explained this point in an old article, “Do We Have True Open Source Dictation?”, which I wrote back in 2005. To recap, a speech recognition system consists of a Viterbi decoder, an acoustic model and a language model. You could have a great recognizer but poor accuracy if the models are bad.

So how does that relate to you, a developer/researcher of ASR? The answer is that ASR training tools and processes usually become a core asset in your inventory. In fact, I can tell you that whenever I need to work on acoustic model training, I need to work on it full time, and it’s one of the most absorbing things I have done.

Why is that? When you look at the development cycle of all the tasks in making an ASR system, training is the longest. With the wrong tool, it is also the most error-prone. As an example, just take a look at the Sphinx forum: you will find that the majority of non-Sphinx4 questions are related to training, like “I can’t find the path of a certain file” or “the whole thing just got stuck in the middle”.

Many first-time users complain with frustration (and occasionally disgust) about why it is so difficult to train a model. The frustration probably stems from the perception “Shouldn’t it be well-defined?” The answer is again no. In fact, how a model should be built (or even which model should be built) is always subject to change. It’s also one of the two subfields in ASR, at least IMO, which are still creative and exciting in research. (The other one: noisy speech recognition.) What an open source software suite like Sphinx provides is a standard recipe for everyone.

That said, is there something we can do better for an ASR training system? There is a lot, I would say. Here are some suggestions:

A training experiment should be created, moved and copied with ease,
A training experiment should be exactly repeatable given the input is exactly the same,
The experimenter should be able to verify the correctness of an experiment before an experiment starts.

Ease of Creation of an Experiment

You can think of a training experiment as a recipe... well, not exactly. When we humans read a recipe and carry it out again, we make mistakes.

But hey! We are working with computers. Why do we need to fix small things in the recipe at all? So in a computer experiment, what we are shooting for is an experiment which can be easily created and moved around.

What does that mean? It basically means there should be no executables which are hardwired to one particular environment. There should also be no hardware/architecture assumptions in the training implementation. If there are, they should be hidden.
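As a minimal sketch of what "not hardwired" could look like, assume a hypothetical experiment description in which every path is stored relative to the experiment's root directory. The names EXPERIMENT, resolve_paths and is_portable, and the file names, are made up for illustration and are not part of Sphinx or any other toolkit.

```python
from pathlib import PurePosixPath

# Hypothetical experiment description: every path is relative to the
# experiment root, so the whole directory can be copied or moved as a unit.
EXPERIMENT = {
    "dictionary": "etc/words.dic",
    "transcripts": "etc/train.transcription",
    "features": "feat",
}

def resolve_paths(root, experiment):
    """Return the full paths for one particular location of the experiment."""
    root = PurePosixPath(root)
    return {name: root / rel for name, rel in experiment.items()}

def is_portable(experiment):
    """True if no entry is hardwired to an absolute location."""
    return not any(PurePosixPath(rel).is_absolute() for rel in experiment.values())
```

Moving the experiment then only means calling resolve_paths with a different root; nothing inside the description itself has to be edited.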

Repeatability of an Experiment

Similar to the previous point, should we allow differences between runs of the same training experiment? The answer should be no. So one trick you hear from experienced experimenters is that you should fix and keep the seeds of the random number generators. This avoids minute differences between different runs of an experiment.
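The seed trick can be illustrated with a small sketch. Here run_experiment is a hypothetical stand-in for a training run with random initialization; a real trainer would need every source of randomness it uses to be seeded the same way.

```python
import random

def run_experiment(seed):
    """Stand-in for a training run that uses random initialization."""
    rng = random.Random(seed)  # private generator, seeded explicitly
    # Pretend these are randomly initialized model parameters.
    return [rng.gauss(0.0, 1.0) for _ in range(5)]

first = run_experiment(seed=1234)
second = run_experiment(seed=1234)
print(first == second)  # prints True: same seed, same "model"
```

Storing the seed together with the rest of the experiment settings is what makes the run repeatable later.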

Here someone might ask: shouldn’t we allow a small difference between experiments? After all, we are essentially running a physical experiment.

I think that’s a valid approach. But to be conscientious, you would then want to run each experiment many times and calculate an average. This is my problem with that thinking: it is slower to repeat an experiment. For example, what if you see your experiment has a 1% absolute drop? Do you let it go? Do you just chalk it up to noise? Once you allow yourself not to repeat an experiment exactly, there will be tons of questions you have to ask.
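To make the cost of the "allow small differences" approach concrete, here is a rough sketch of the bookkeeping it implies. The function looks_like_noise, the two-standard-deviation threshold, and the word error rates are all made up for illustration; they are not measurements from any real system.

```python
import statistics

def looks_like_noise(baseline_runs, new_result, num_std=2.0):
    """Compare one new result against the spread of repeated baseline runs.

    The threshold of num_std standard deviations is an arbitrary
    assumption, not a recommended statistical test.
    """
    mean = statistics.mean(baseline_runs)
    std = statistics.stdev(baseline_runs)
    return abs(new_result - mean) <= num_std * std

# Made-up word error rates (%) from five non-repeatable baseline runs.
baseline = [20.1, 19.8, 20.4, 20.0, 19.7]
print(looks_like_noise(baseline, 20.3))  # within the observed spread
print(looks_like_noise(baseline, 21.0))  # a 1% absolute change stands out
```

Note that answering the question at all required five baseline runs, which is exactly the slowdown described above.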

Verifiability of an Experiment

Running an experiment sometimes takes days. How do you make sure the run is correct? I would say you should first make sure trivial issues such as missing paths, missing models, or incorrect settings are screened out and corrected.

One of my bosses used to insist that I verify input paths every single time. This is a good habit and it pays dividends. Can we do similar things in our training systems?
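A training system could run this kind of check itself before any long job starts. The following is only a sketch: find_missing_inputs and check_before_training are hypothetical helpers, not functions of any existing toolkit.

```python
import os

def find_missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

def check_before_training(paths):
    """Fail fast, before hours of computation, if any input is missing."""
    missing = find_missing_inputs(paths)
    if missing:
        raise FileNotFoundError("missing inputs: " + ", ".join(missing))
```

Calling check_before_training on the dictionary, transcripts and feature directories at the top of a training script turns a failure several hours in into a failure in the first second.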

Apply it on Open Source

What I mentioned above is highly influenced by my experience in the field. I personally found that sites which have great infrastructure for transferring experiments between developers are the strongest and fastest growing.

To put all these ideas into open source would mean a very different development paradigm. For example, do we want to have a centralized experiment database which everyone shares? Do we want to put common resources, such as existing parameterized inputs (such as MFCC), somewhere in common for everyone? Should we integrate the retrieval of these inputs into our experiment recipe?

Those are important questions. In a way, I think they are the most important type of question we should ask in open source, because despite much volunteer effort, the performance of open source models is still lagging behind the commercial models. I believe it is an issue of methodology.

Arthur

Where to start when tracing source code of a speech recognition toolkit?

Modern speech recognition software is a complicated piece of software. To understand it, you need to have some basic understanding of the principles of speech recognition, as well as some idea of the programming language being used.

By now, you may have heard a lot of people say they know about speech recognizers. And by now, you have probably realized that most of these people have absolutely no idea what’s going on inside a recognizer. So if you are reading this blog post, you are probably telling yourself, “I might want to trace the code of some recognizer,” be it Sphinx, HTK, Julius, Kaldi or whatever codebase you are looking at.

Of the above toolkits, I would say I only know Sphinx in detail, and probably a little bit about HTK’s HVite. I can’t say the same for the others. In fact, even within Sphinx, I only know the Sphinx 3/SphinxTrain/sphinxbase triplet intimately. So just like you, I hope to learn more.

So this raises the question: how would you trace a speech recognition toolkit codebase? If you think it is easy, it is probably because you have worked in speech recognition for a while, and you probably don’t need to read this post.

Let’s just use Sphinx as an example. There are hundreds of files in each component of Sphinx. So where should you start? A blunt approach would be reading the files one by one. That’s not the smart way. So here is a suggestion for you: focus on the following four things,

Viterbi algorithm
Workflow of training
Baum-Welch algorithm
Estimation algorithms of language models

When you know where the Viterbi algorithm is, you will soon figure out how the feature vector is generated. In the same vein, if you know where the Baum-Welch algorithm is, you will probably know how the statistics are generated. If you know the workflow of training, then you will understand how the model is “evolved”. If you know how the language model is estimated, then you will understand one of the most important heuristics of the search.
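To help recognize the Viterbi algorithm when you come across it in a codebase, here is a textbook-style sketch for a small HMM in the log domain. It is not how Sphinx or any other toolkit implements it (real decoders add pruning, lexicon structure and language model scores), and the function name and argument layout are assumptions made for this example.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence of an HMM, computed in the log domain.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor of state s at time t.
            best_prev = max(range(num_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Toy example: two states, uniform transitions, three frames.
lg = math.log
toy_path, toy_score = viterbi(
    [lg(0.9), lg(0.1)],
    [[lg(0.5), lg(0.5)], [lg(0.5), lg(0.5)]],
    [[lg(0.9), lg(0.1)], [lg(0.1), lg(0.9)], [lg(0.1), lg(0.9)]],
)
print(toy_path)  # [0, 1, 1]
```

In a real decoder the log_emit values come from evaluating the acoustic model on feature vectors, which is why finding the Viterbi loop leads you straight to where features are consumed.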

Some of you may protest: how about the front-end? Isn’t that important too? True, but not when you try to understand a codebase. For all practical purposes, a feature vector is just an N-dimensional vector. The waveform, once processed by the front-end, is just an NxT matrix. You can certainly do a lot of fancy things with this NxT matrix. But when you think of Viterbi and Baum-Welch, they probably just read the frames and then calculate Gaussian distributions. That’s pretty much all you need to know about the front-end.
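The "read the frames and calculate Gaussian distributions" step can be sketched as follows, assuming a single diagonal-covariance Gaussian and an utterance stored as a list of T frames, each an N-dimensional list (the transpose of the NxT layout above). Real acoustic models use mixtures of Gaussians per state; the function names here are hypothetical.

```python
import math

def diag_gaussian_loglik(frame, mean, var):
    """Log likelihood of one N-dimensional frame under a diagonal Gaussian."""
    total = 0.0
    for x, m, v in zip(frame, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def score_utterance(frames, mean, var):
    """Score every frame of an utterance: T frames in, T log likelihoods out."""
    return [diag_gaussian_loglik(f, mean, var) for f in frames]
```

From the point of view of Viterbi or Baum-Welch, this list of per-frame scores is all that the front-end ultimately contributes.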

How about adaptation algorithms? Those I think are important, but they should probably come after understanding the major things in the code. No matter whether you are doing adaptation online or in speaker adaptive training, it is something on top of the Baum-Welch algorithm. Some implementations put adaptation inside the Baum-Welch executable. There is certainly nothing wrong with that, but it is still a kind of add-on.

How about the decoding API? It is useful to know, but it matters more when you just need to write an application. For example, in Sphinx4, you just need to know how to call the Recognizer class. In sphinx3, live_decode is what you need to know. But understanding only those won’t give you much insight into how the decoder really works.

How about the data structures? Those are sort of important and should be understood when you try to understand a certain algorithm. In the case of languages such as Java and C++, you should probably take note of any custom-made data structures, or of whether the designer calls a specific data structure library, like Boost in C++.

I guess this pretty much sums it all up. Now let me get back to one non-trivial item on the list, which is the workflow of training. Many of you might think that recognition systems differ from each other because they have different decoders. Dead wrong! As I stress from time to time, they differ because they have different acoustic models and language models. That’s why in many research labs, much effort is put into preserving the parameters and procedures of how models are trained. Much effort is also put into fine-tuning this procedure.

On this part, I have to say open source speech recognition still has a long, long way to go. For starters, there is not much sharing of recipes among speech hobbyists. What many try to do is search for a good model. But if you don’t know how to train a model, you probably don’t even know how to improve it for your own project.

Arthur