Tag Archives: ASR

The Search Programmer

In every team at any serious ASR or NLP company, there has to be one person who is the "search guy".  Not search as in search engine, but search as in searching in AI: the equivalent of the chess-engine programmer on a chess project, or perhaps the engine specialist on a racing team.  Usually this person has three important roles:

  1. Program the engine,
  2. Add new features to the engine,
  3. Maintain the engine through its lifetime.

This job is usually taken by someone with a title such as "Speech Scientist" or "Speech Engineer".   They usually have blended skills in both programming and statistics.   It's a tough job, but it's also a highly satisfying one, because the success of a company usually depends on whether features can be integrated quickly.   That gives the "search guy" a mythical status even among data scientists: a search engineer needs to work effectively with two teams, one with a mostly research background in statistics and machine learning, the other with a mostly programming background, churning out pseudocode, implementations and architecture diagrams daily.

I tend to think the power of the "search guy" is both understated and overstated.

It's understated because many companies only use other people's engines, so they can't quite get the edge of customizing an engine.  Those that use an open-source implementation are better off, because they preserve the right to change the engine, which gives them leverage on intellectual property and trade secrets.  Those who buy a commercial engine from a large company may enjoy good performance for a few years, but then get squeezed by huge upgrade prices and constrained by overly restrictive licenses.

(Shameless promotion here:  Voci is an exception.  We are very nice to our clients. Check us out here. 🙂 )

It's overstated because the skill of programming a search is, at heart, a series of logical exercises.   The catch is that programming a search algorithm, or a dynamic program (DP) in general, takes many kinds of expertise.  The knowledge is scattered sporadically across different subjects.  You might learn the basics of DP from an algorithms book such as CLRS, but mere knowledge of programming doesn't give you insight into how to debug a search issue.  You need a solid understanding of the domain (such as POS tagging or speech recognition) and the theory (such as machine learning) to get the job done correctly.
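To make the "search" part concrete: at the heart of most ASR decoders sits a dynamic program such as the Viterbi algorithm.  Here is a minimal sketch over a toy HMM; all names and numbers are illustrative, not taken from any real toolkit.

```python
# Minimal Viterbi sketch over a toy HMM.  Real decoders work in the
# log domain with pruning; this keeps raw probabilities for clarity.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for an observation sequence."""
    # V[t][s] = best score of any path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

Debugging even this twenty-line toy requires knowing both the code and the model behind it, which is exactly the blend of skills described above.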


Different HMMSets in HTK

HTK was my first speech toolkit. It's fun to use and you can learn a lot of ASR by following the manual carefully and deliberately.

If you are still using HMM/GMM technology (interesting, but why?), here is a thread from a year ago on why there are different HMM set types in HTK.

One thought I have: when I first started out in ASR, I seldom thought about the human elements in a design. Part of that, of course, was the difficulty of understanding all the terminology and algorithms.

Yet ASR research has a lot to do with rival groups coming up with different ideas, each betting against the others on the success of a certain technique.

So sometimes you would hope that competition would refine the technology. Yet a highly competitive environment only nurtures followers, rather than independent groups such as Prof. Young's, or MSR (which, AFAIK, built the first working version of DNN-based ASR).



I'm a student who's looking into the HTK source code to get some idea
about practical implementation of HMMs. I have a question related to
the design choices of HTK.

AFAIK, the current working set of HMMs (HMMSet) has 4 types: plain,
shared, tied, discrete.
HMM sets with normal continuous emission densities are "plain" and
"shared", only difference being that some parameters are shared in the
latter. Sets with semi-continuous emission densities (shared Gaussian
pools for each stream) are called "tied" and discrete emission
densities are "discrete".

If someone uses HTK, isn't there a high chance of using only one of
these types? The usage of these types is probably mutually exclusive.
So my question is, why not have separate training and recognition
tools for continuous, semi-continuous and discrete HMM sets? Here are
some pros and cons of the current design I can think of, which of
course can be wrong:

Pros:
- less code duplication
- simpler interface for the user

Cons:
- more code complexity
- more contextual information required to read, more code jumps
- unused variables and memory, examples: vq and fv in struct Observation, mixture alignment in discrete case

If I were to implement HMMs supporting all these emission densities,
what path should I follow? How feasible is it to use OOP principles to
create a better design? If so, why weren't they leveraged in HTK?

Warm regards,

(I trimmed out Mr. Neil Nelson's reply, which basically suggests that people should use Kaldi instead.)

Max and Neil

I don’t usually respond to HTK questions, but this one was hard to resist.

I designed the first version of HTK in Cambridge in 1988 soon after moving from Manchester where I worked for a while on programming language and compiler design. I was a strong advocate of modular design, abstraction and OOP. However, at that time, C++ was a bit of a nightmare. There was little standardisation across operating systems and some implementations were very inefficient. As a result I decided that since HTK had to be very efficient and portable across platforms, it would be written in C, but the architecture would be modular and class like. Hence, header files look like class interfaces, and body files look like class method implementations.

When HTK was first designed, the “experts” in the US DARPA program had decided that continuous density HMMs would never scale and that discrete and semi-continuous HMMs were the way to go. I thought they were wrong, but decided to hedge my bets and built in support for all three - whilst at the same time taking care that the implementation of continuous densities was not compromised by the parallel support for discrete and semi-continuous. By 1993 the Cambridge group (and the LIMSI group in France) were demonstrating that continuous density HMMs were significantly better than the other modelling approaches. So although we tried to maintain support for different emission density models, in practice we only used continuous densities for all of our research in Cambridge.

It is a source of considerable astonishment to me that HTK is still in active use 25 years later. Of course a lot has been added over the years, but the basic architecture is little changed from the initial implementation. So I guess I got something right - but as Neil says, things have moved on and today there are good alternatives to HTK. Which is best depends on what you want to do with it!

Steve Young
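As a postscript to Max's OOP question: a polymorphic emission-density design might look like the sketch below.  This is purely illustrative; every class and method name is invented, and it mirrors no actual HTK structure.

```python
# Hedged sketch of an OOP emission-density interface: one abstract base
# class, with continuous and discrete variants behind the same method.

import math
from abc import ABC, abstractmethod

class EmissionDensity(ABC):
    @abstractmethod
    def log_prob(self, observation):
        """Log-likelihood of one observation under this density."""

class DiagonalGaussian(EmissionDensity):
    """Single diagonal-covariance Gaussian (continuous case)."""
    def __init__(self, means, variances):
        self.means, self.variances = means, variances

    def log_prob(self, observation):
        return sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(observation, self.means, self.variances)
        )

class DiscreteDensity(EmissionDensity):
    """Table lookup over VQ codebook indices (discrete case)."""
    def __init__(self, probs):
        self.probs = probs

    def log_prob(self, observation):
        return math.log(self.probs[observation])
```

The decoder then calls `log_prob` without caring which variant it holds - which is essentially what HTK's C modules simulate with its class-like header/body discipline.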

Patterns in ASR Coding

Many ASR toolkits appear in the form of unix executables.   But the nature of ASR tools is quite different from general unix tools.   I will name 3 differences here:

  1. Complexity: A flexible toolkit demands an external scripting framework from its developers.  SphinxTrain used to be glued together by perl, now by python.   Kaldi, on the other hand, is mainly glued by shell scripts.  I heard Cambridge has its own tools to run experiments correctly.
  2. Running Time: In ASR, it takes a long time to verify whether something is correct.   So there are things you can't do: a very agile code-and-test style of development doesn't work well.   I have seen people try it, and it leaves many bugs in the codebase.
  3. Numerical Issues: Coding changes in numerical algorithms can cause subtle changes in results, and it is tricky to make such changes well.  When these changes penetrate to production, they are usually very hard to debug.  When they affect performance, the result can be disastrous for you and your clients.
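On the first point, the script-level glue can be as simple as a driver that shells out to each tool in turn and fails loudly when a stage breaks.  A toy sketch in Python; any actual tool names you put into it would be your own:

```python
# Toy script-level glue around unix-style tools: run each pipeline
# stage as a subprocess and stop the whole run on the first failure.

import subprocess

def run_stage(name, cmd):
    """Run one pipeline stage, raising if it exits nonzero."""
    print(f"=== {name}: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"stage '{name}' failed:\n{result.stderr}")
    return result.stdout
```

A training run then becomes a sequence of `run_stage` calls, which is essentially what the perl/python/shell glue in SphinxTrain and Kaldi does at a much larger scale.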

In a nutshell, we are dealing with a piece of software which is complex and mission-critical.  The issue is how you continue to develop and maintain such software.

In this article, I will talk about how this kind of coding can be done right.   You should notice that I don't favor a monolithic design of experimental tools, e.g. "why don't we just write one single tool that does everything (to train/to decode)?"  There is a place for that mindset in software engineering; e.g. Mercurial is designed that way, and I heard it is very competitive with Git.   But I prefer a unix-tool type of design, close to HTK, Sphinx and Kaldi: i.e. you write many tools, each with a different purpose, and simply glue them together for your own ends. I will call code changes in these little unix tools code-level changes, and changes at the scripting level script-level changes.

Many of these thoughts were taught to me by experienced people in the field.   Some are applicable in other fields, such as Think Before Code and Conclude from your Tests.  Others apply to machine-learning-specific problems: Match Results Numerically, Always Record Results.

Think Before Code

In our time, the agile development paradigm is very popular.  Maybe too popular, in my view.  Agile development is being deployed in too many places where I think it is inappropriate.  ASR is one of them.

As a coder in ASR, you usually do two things: make code-level changes (in C/C++/Java) or script-level changes (in Perl/Python).  In a nutshell, you are programming in a complex piece of software.   Since testing can take a long time, the code-and-test paradigm won't work well for you.

On the other hand, deliberate-and-slow thinking is your first line of defense against any potential issues.  You should ask yourself a couple of questions before making any change:

  1. Do you understand the purpose of each of the tools in your script?
  2. Do you understand the underlying principle of the tool?
  3. Do you understand the I/O?
  4. Would you expect any changes to alter the I/O at all?
  5. For each tool, do you understand the code?
  6. What is your change?
  7. Where are your changes?  How many things do you need to change? (10 files? 100 files? List them out.)
  8. In your head, after you make the change, do you expect it to work? Why?  Convince yourself.

These are some of the questions you should ask yourself.  Granted, you don't have to have all the answers, but the more you know, the more you reduce potential future issues.

Conclude from your Tests, not from your Head

After all the thinking, are we done? No, you should still test your code; in fact you should test your code like a professional tester.  Bombard your well-thought-out program with tests.   Fix all compiler warnings; valgrind it to fix leaks.   If you don't fix a certain thing, make sure you have a very good reason, because any change to your decoder and trainer can have many ramifications for upper layers of software, for you and for your colleagues.

The worst way to think about ASR coding is to say "it should work!"  No.  Sometimes it doesn't, and you are too naive for not testing the code.

Who makes such mistakes? It is hard to nail down. My observation is that it's those who always try to think through problems in their heads and have a strong conviction that they are right.    They are usually fresh grads (all kinds: Bachelors, Masters, PhDs, they are everywhere), or people who only work on research and haven't done much real-life coding.  In a nutshell, it is a philosophy thing:  some people tend to think their a priori thought will work as-is.   That is 8th-century thinking.  Always verify your changes with experiments.

Also, no one says testing always eliminates all problems.  But if you think and test, the chances of making mistakes are tremendously reduced.

Scale It Down

The issue with a large amount of testing in ASR is that it takes a long time.   So what should you do?

Scale it down.

e.g. Suppose you have a 1000-utterance test and you want to reduce the testing time.  Make it a 100-utterance test, or even 10.  That allows you to verify your change quickly.

e.g. If an issue appears in a 1-minute utterance, see if you can reproduce it in a 6-second one.

e.g. If you are trying a procedure on 1000 hours of data, test it with 100 hours first.

These are just some examples.  This is a very important paradigm because it allows you to move on with your work faster.
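The first example can be sketched in a few lines.  Here is a hedged Python version that picks a repeatable subset of an utterance list; the one-id-per-entry format is hypothetical:

```python
# Scale a test down: pick a small, deterministic subset of utterance
# ids so a change can be verified in minutes instead of hours.

import random

def scale_down(utterance_ids, n, seed=0):
    """Return a sorted random subset of size n, repeatable via the seed."""
    rng = random.Random(seed)          # fixed seed: same subset every run
    if n >= len(utterance_ids):
        return list(utterance_ids)
    return sorted(rng.sample(utterance_ids, n))
```

The fixed seed matters: if the quick test changes membership between runs, you can no longer tell your change's effect apart from sampling noise.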

Match Results Numerically

If you make an innocuous change but the results are slightly different, you should be very worried.

The first question you should ask is "How can this happen at all?" For example, if you just add a command-line option, your decoding results shouldn't change.

Are there any implicit or explicit random number generators in the code?  Have you accidentally taken in user input?  Otherwise, how could your innocuous change cause the results to change?

Be wary of anyone who says "It is just a small change.  Who cares? The results won't change." No, always question the size of the changes.   Ask how many significant digits match if there is any difference.   If you can, learn about the intrinsic error introduced by floating-point calculation.  (e.g. "What Every Computer Scientist Should Know About Floating-Point Arithmetic" is a good start.)
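The classic illustration of that intrinsic error: floating-point addition is not associative, so even reordering a sum - say, after a refactor - can shift results in the last digits.

```python
# IEEE-754 doubles: the same three numbers summed in two orders
# give two different answers.

a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)        # False
print(abs(a - b))    # tiny but nonzero
```

Multiply that tiny discrepancy across millions of likelihood computations in a decoder and "the results won't change" stops being a safe assumption.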

There is an opposing thought: that it should be okay to have some numerical changes.  I don't buy it, because once you allow yourself to drift 0.1% ten times, you will have a 1% drift that can't be explained.  The only time you should let it go is when you encounter randomness you can't control.  Even then, you should still explain why your performance would change.

Predict before Change

Do you expect your changes to give better results?  Or worse?  Can you explain to yourself why your change should be good or bad?

In terms of results, we are talking mainly about 3 things:  word error rate, speed and memory usage.

Set up an Experimental Framework

If you are at all serious about ML or ASR, you will have tested your code many times.  And if you have tested your code many times, you will realize you can't track all your experiments in your head.  You need a system.

I have written an article in V1 about this subject.  In a nutshell, make sure you can repeat/copy/record all your experimental details, including binary versions and parameters.
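A minimal sketch of what "record all your experimental details" can mean in practice; nothing here assumes any particular toolkit, and the field names are my own invention:

```python
# Before a run starts, dump the command line, parameters, code version
# and a timestamp into a JSON log that lives next to the results.

import json
import subprocess
import sys
import time

def record_experiment(params, log_path="experiment.json"):
    """Write enough metadata to repeat this run later."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip()
    except OSError:
        commit = ""
    record = {
        "argv": sys.argv,                  # how the script was invoked
        "params": params,                  # tunables for this run
        "git_commit": commit or "unknown", # which code produced the result
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    with open(log_path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

It is crude, but a month later this file answers the question "which binary and which parameters produced that number?" - which your memory will not.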

Record your Work

With complexity of your work, you should make sure you keep enough documentation.  Here are some ideas:

  1. Version control system: for your code
  2. Bug tracking: for your bugs and feature requests
  3. Planning documents: for what you need to do in a certain task
  4. Progress notes: to record on a daily basis what you have done and learned experimentally

Yes, you should have many records by now.  If you don't have any, I worry about you.  Chances are some important experimental details have been forgotten.  And if you don't see what you are doing as an experiment... whoa.  I wonder how you explain what you do to other people.


That's what I have today.  This article summarizes many important concepts on how to maximize the success of any coding change.    Some of these are habits which take time to set up and get used to, but from my experience they are invaluable.  I find myself writing features that have fewer problems.  Or at least when there are problems, they are ones I couldn't have anticipated.


"The Grand Janitor Blog V2" Started

I moved "The Grand Janitor Blog" to WordPress.   Nothing much: Blogger is simply too constraining.  I don't like the themes, I can't really customize a thing, and I can't put an ad there if I want to sell something.   It was really annoying, and it was time for a change.

But then, what's new in V2?   First of all, I might blog more about how machine learning influences speech recognition.  It's nothing new that machine learning is the source of progress in speech recognition; it has always been like that. Many experts who work in speech recognition have deep knowledge of pattern recognition.  When you look at their papers, you can sense that they have studied a certain machine learning method in great depth, so they can come up with creative ideas to improve the bottom line, which is the only thing I care about.  I don't really care about the thousand APIs wrapped around a certain recognizer.  I only care about the guts inside the decoder and the trainer.  Those components are what really matter, but they are also the most misunderstood.

So why now?  It's obvious that the latest development of DBN-DNN (the "next big thing") is one factor.   I was told in school (10+ years ago) that GMM was the state of the art.  But things are rapidly changing: the work of Prof. Hinton has given a theoretical basis for making DBN-DNN training practically feasible.   Enthusiasts, some rather sophisticated, are gathering around the Kaldi forum.

As for me, I will describe myself as a recovering ASR programmer.   What does that mean?  It means I need to grok ASR from theory to implementation. That's tough.  I found myself studying again, dusting off my "Advanced Calculus" and trying to read, and think creatively about, texts such as "Connectionist Speech Recognition: A Hybrid Approach" by Bourlard and Morgan. (It's a highly entertaining technical text!)  Perhaps more in the future.   But when you try to drill a certain skill in your life, there has got to be a point where you go back to the basics.   Re-think all the things you thought you knew.  Re-prove all the proofs you thought you understood.    That takes time and patience, but in the end it is also how you come up with new ideas.

As for the readers: sorry for never getting back to your suggested blog topics.  You might be interested in a code trace of a certain part of Sphinx, or in how certain parts of the program work.  I keep a list of them and will probably write something up when I have time.   No promises though;  I have been very busy.   And to be frank, everyone who works in ASR is busy.  That perhaps explains why there are not many actively maintained blogs on speech recognition.

Of course, I will keep posting on other diverse topics such as programming and technology.   I am still a geek.  I don't think anyone can change that. 🙂

In any case, feel free to connect with me and have fun with speech recognition!


Arthur Chan, "The Grand Janitor"