Category Archives: HTK

Me and My Machines

I wrote a page a long time ago about the machines I used.   When you work with computing for a while, every one of your computers means something to you.    That's why I try not to throw them away easily. Occasionally I also bought scrap computers, fixed them up, and felt like I did a good thing for the planet.

Anyway, here is a list of the machines I have used.  Some with more stories than others:

  1. A 286 (1991-1992?): The first computer I ever touched, back in junior high school.  There was a geeky senior dude who tried to teach us the basics of databases, and none of us really understood him. He wasn't nice to us, who were like 12-13 years old.  I disliked his attitude and called him out.   He was so unhappy that he stormed out of the computer room.   We eventually learned stuff like LOGO and basic DOS commands on these very slow 286s. (Well, you can optimize the hell out of them though.)
  2. A 486-66DX (1994-1996?):  My first computer, which I had since high school.  I wasn't very into computers at that time. I used it to play TIE Fighter and wrote documents using Word.  I also did several assignments on microprocessor programming (i.e., basic Assembly stuff).   It was incredibly slow, and it took a long time to compile a Visual C++ skeleton Windows program.   Later, I gave it to a girl and she just threw the whole thing away.   (Shame on me. I threw away a relic of computer history.)
  3. A P166 "Mars" (1996-2000): I bought this in my second year of college.   Since I spent most of my money on this machine, I worked part-time during my degree.    I was finally able to do some interesting stuff on a computer, such as GUI programming.   The GUI programming got me a good contract from a librarian who was trying to develop cataloging software.   I also wrote my first isolated-word speech recognizer on it.    Later I ran a speech recognizer written by a guy named Ricky Chan.    That recognizer was then used in my final year project.   Unfortunately, both the cataloging software and my final year project were disasters:  I didn't know how to fix memory leaks in C/C++ at that point.   All my programs died horribly.   Good Ricky Chan had nothing to do with it.  It was all my fault. But the horror of Windows 95's blue screen still haunts me even these days.  Of course, both the librarian and my then-boss saw me in a very dim light.  (They probably still do.)  I was cleaning my basement this year and Mars was getting too dirty.  So I painfully threw it away with tears in my eyes.
  4. A P500 "Jupiter" (2000-):  I bought this in my first year of graduate school, half a year after I started receiving stipends.    This was the moment I was very into HTK (the Hidden Markov Model Toolkit).  I still kept Mars, but training HMMs for connected digit recognition using TIDIGITS on my P166 with 16MB would take close to a week.   My P500, though, allowed me to run TIMIT, and I was even able to train triphones (Woo!).    I also gleefully ran every step from the HTK manual V2.2 even though I had no idea what I was doing.   Jupiter was also the machine on which I wrote the modified Viterbi algorithm in my thesis (formally, the Frame-Skipping Viterbi Algorithm (FSVA)).  I still keep the mid-tower body of "Jupiter", but it hasn't been working well since around 6 years ago.
  5. A Book Mini-PC (2000): In between Mars and Jupiter, I bought a mini-form-factor PC.  I tried to install Red Hat Linux on it, but I was very bad at any Linux installation then.   Eventually the motherboard burned out, and I gave it to a friend who claimed to know how to fix motherboards.    (He never got back to me.)
  6. "eea045" (2000-2003):  A lab machine I used back at HKUST.  It was first a Pentium 500MHz, but soon my boss upgraded it to 1.7GHz.   I was jubilant to use it to run acoustic model training, and I ran most of my thesis experiments on it.
  7. A Toshiba laptop (2002): My mom gave it to me because she said it wasn't running too well.  It died on me right on the day I was going to present my Master's thesis.   Luckily, someone helped me borrow a machine from the EEE department, so now I am a happy Master.
  8. "Snoopy" (2001-2003): I was then a Junior Speech Scientist at Speechworks, and this Pentium 500 was assigned to me.   It is also the first of the four machines I used with funny names.
  9. "Grandpa" (2001-2003): The laptop assigned to me at Speechworks.   It got me through a lot of funny crises.   I really missed "Grandpa" when I was laid off from Speechworks.
  10. iBuddie 4 A928 (2002-2003):  A thing called a desknote at the time:  it's like a laptop, but you always have to give it juice.   Again, its motherboard burned out.  And again, I didn't quite know how to fix it.
  11. "Lumpy" (2003-2006): This is the machine assigned to me at CMU SCS,  and I asked the admins many times if the name was some kind of very profound joke.  "No" was their answer.  But I always knew it was a setup. 😐  Always knew.
  12. "Grumpy"/"Big Baby" (2003-): This is a Dell Inspiron 9100 I bought at a hefty price of $3000.  Even in 2004, it was a heavy laptop.   I used it for most of my CMU work, including hacking Sphinx and writing papers.    Prof. Alex Rudnicky, my then-boss at CMU, always jokingly asked me if Big Baby was a docking station.   (Seriously, no.)   I also used it as my temporary laptop at Scanscout.   The laptop is so large and heavy that I used it as a dumbbell at Scanscout.
  13. "The Cartoon Network" (2003-2006): This is the name of the cluster in the CMU Sphinx Group, used by many students from the Robust Group, by me and David Huggins-Daines, Alex's student, as well as by Evandro, who was then working for Prof. Jack Mostow.  The machines were all named after cartoon characters from Cartoon Network:  for example, Blossom,  Bubbles and Buttercup were three 2GHz machines which were not too reliable.   I kept asking Alex to name one of the machines Mojo Jojo.  But he kept refusing.  (Why? Why, Alex?)
  14. A G4 (2004-2006): This is the first Mac I ever used in my life, but it's one of the most important.   I used it to develop for a project called CALO (Cognitive Agent that Learns and Organizes), now venerable because several SRI participants started an engine nowadays called Siri.   But what I learned was simpler:  Apple would grow big. Since then I have invested in Apple regularly, with reasonable profit.
  15. A Lenovo laptop (2007-2008):  In my short stay at Scanscout,  I used this machine exclusively to compile and develop what was then called the SSFramework ("ScanScout Framework"), a Java-Tomcat stack which Scanscout used to serve video ads.   I ghosted it to have two partitions: Windows and Linux.   I mostly worked on Windows.  At that point, I always had small issues here and there when switching back to Linux.  Usually, the very versatile tech guru Dr. Tadashi Yonezaki would help me. Dr. Yonezaki later became the Chief Scientist of Scanscout.
  16. "Scanscout's Machines" (2007-2008): I can't quite remember how the setup was, but all machines at early Scanscout were shared by the core technology scientists, like Tadashi or me, and several developers and QAs.   I wasn't too into "The Scout" (as a couple of early alumni called it).   So I left the company after only 1.5 years.   A good ending though: Scanscout was later acquired by Tremor Video and got listed.
  17. Inspiron 530 "Inspirie" (2008-): There was around half a year after I resigned from Scanscout when I was unemployed.   I stayed home most of the time, read a lot, and played tons of poker and backgammon online.  That was also the time I bought Inspirie.   For a long time, it wasn't doing much other than being a home media center.    In the last few years, though, Inspirie has played an important role as I try to learn deep learning.   I ran all of Theano's tutorials on it (despite it being very, very slow).
  18. Machines I used at an S&P 500 company (2009-2011): Between "The Scout" and Voci, I was hired by a mid-size research institute as a Staff Scientist and took care of much of the experimental work within the group.   It was a tough job with long hours, so my mind usually got very numb.   I can only vaguely remember around 3 incidents of my terminal being broken.    That was also the time I was routinely using around 200 to 300 cores, which I guess was around 10-15% of all cores available within the department.   I was always told to tone down my usage.  Since there were a couple of guys in the department exactly like me, recklessly sending jobs to the queue,  the admins decided on a scheme to limit the number of cores we could use.
  19. A 2011 MacBook Pro 17-inch "Macky" (2011-): After several years of saving, I finally bought my first MacBook.   I LOVE IT SO MUCH! It was also the first time in many years that I felt computing was fun.  I wrote several blogs and several little games with Macky, but mostly it was the machine I carried around.   Unfortunately, a horrible person poured tea on top of it.   So its display was permanently broken, and I have to connect it to an LCD all the time.   But it is still the machine I love most, because it made me love computing again.
  20. "IBM P2.8 4 cores" (2011-): A machine assigned to me by Voci. Most of my recent work on Voci's speech recognition framework was done on it.
  21. "Machines from Voci" (2011-): They are fun machines.  Part of it is due to the rise of GPUs.  Unfortunately I can't talk about their settings too much. Let's just say Voci has been doing great work with them.
  22. "A 13-inch MacBook" (2014-): This is my current laptop.   I took most of my Coursera classes with it.    I feel great about its size and easy-goingness.
  23. "An HP Stream" (2015-): My current Windows machine.  I hate Windows, but you've got to use it sometimes. A $200 price tag seems about right.
  24. "Dell M70" and "HP Pavilion dv2000" (2015-): i.e., the machines you see in the image at the top of this post.   I bought each of them for less than $10 from Goodwill.   Both of them have no problems in operation, only small physical issues such as dents and broken hinges.   A screwdriver and some electrical tape fixed them easily.

There you have it.  The 24 sets of machines I have touched.  Mostly the history of some unknown silicon, but also my personal perspective on computing.


(Edit at Dec 24: Fixed some typos.)

Two Views of Time-Signal : Global vs Local

As I have been working on Sphinx at work and have started to chat with Nickolay more, one thing I realize is that several frequently used components of Sphinx need a rethink.  Here is one example related to my recent work.

Speech signals, or time signals in general, can be processed in two ways: you either process them as a whole, or you process them in blocks.  The former you can call a global view; the latter, a local view.  Of course, there are many other names: block/utterance, block/whole, but essentially the terminology means the same thing.

Most of the time, global and local processing give the same results.   So you can simply say: the two types of processing are equivalent.
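You can convince yourself of this equivalence with a few lines of code.  Below is a sketch (Python with NumPy; the function names are my own) that applies pre-emphasis, a filter with one sample of memory, first to a whole signal and then block by block while carrying that one sample of state across block boundaries.  The two outputs match exactly:

```python
import numpy as np

def preemphasis_global(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1], computed over the whole signal at once
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def preemphasis_blocks(x, block=160, alpha=0.97):
    # the same filter applied block by block, carrying one sample of state
    out, prev = [], 0.0
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        y = np.copy(blk)
        y[0] -= alpha * prev          # state carried across the block boundary
        y[1:] -= alpha * blk[:-1]
        out.append(y)
        prev = blk[-1]
    return np.concatenate(out)

x = np.random.randn(1000)
print(np.allclose(preemphasis_global(x), preemphasis_blocks(x)))  # True
```

As long as the per-block state is carried correctly, the block size does not matter; the equivalence holds for any block length.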

Of course, not when you start an operation which uses information that isn't locally available.   For a very simple example, look at cepstral mean normalization (CMN).  Implementing CMN in block mode is certainly an interesting problem.  For example, how do you estimate the mean if you only have a running window?   When you think about it a little, you realize it is not a trivial problem. That's probably why there are still papers on cepstral mean normalization.
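To see why CMN breaks the equivalence, here is a toy sketch (again Python/NumPy, my own illustration, and not how any real front end implements it): global CMN subtracts the mean of the whole utterance, while a running-window variant can only use the frames seen so far, so the two outputs generally disagree frame by frame.

```python
import numpy as np

def cmn_global(feats):
    # subtract the mean cepstrum computed over the whole utterance
    return feats - feats.mean(axis=0)

def cmn_running(feats, window=100):
    # local/streaming variant: subtract the mean of the last `window` frames only
    out = np.empty_like(feats)
    for t in range(len(feats)):
        lo = max(0, t - window + 1)
        out[t] = feats[t] - feats[lo:t + 1].mean(axis=0)
    return out

feats = np.random.randn(300, 13)      # 300 frames of 13-dim cepstra
g = cmn_global(feats)
r = cmn_running(feats, window=100)
print(np.allclose(g, r))  # False: the running window sees different statistics
```

Choosing the window size, warming up the estimate at the start of the utterance, and deciding whether to look ahead are exactly the non-trivial parts the papers argue about.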

Translated to Sphinx: if you look at sphinxbase's sphinx_fe, you will realize that the implementation is based on the local mode, i.e., every once in a while, samples are consumed, processed and written onto the disk.    There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode or bw) will do this stuff on their own.

It's all good, though there is an interesting consequence: what the SF guys call a "feature" is really all the processing that can be done in the local sense, rather than the "feature" you see in either the decoders or bw.

This point of view is ingrained within sphinxbase/sphinx3/SphinxTrain (Sphinx4? not sure yet).  It is quite different from what you will find in HTK, which sees the feature vector as the vector used in Viterbi decoding.

That brings me to another point.  If you look deeper, HTK tools such as HVite and HCopy are highly abstract: each tool was designed to take care of its own problem well. HCopy really just provides the features, whereas HVite just runs the Viterbi algorithm on a bunch of features.   It's nothing complicated.  On the other hand, Sphinx is more speech-oriented.  In that world, life is more intertwined.   That's perhaps why you seldom hear of people using Sphinx for research other than speech recognition, while you can do other machine learning tasks in HTK.

Which view is better?  If you ask me, I wish both HTK and Sphinx were released under a Berkeley license.  Tons of real-life work could be saved, because each covers some useful functionality.

Given that only one of them is released under a liberal license (Sphinx),  maybe what we need is to absorb some design paradigms from HTK.  For example, HTK has a sense of organizing data as pipes.   That is something SphinxTrain could use.   It would help Unix users, who usually contribute the most to the community.

I also hope that eventually there will be good clones of HTK tools made available under a Berkeley/GNU license.  Not that I don't like the status quo: I am happy to read the code of HTK (unlike the time before 2.2......).   But as you work in the industry for a while, you see that many are actually using both Sphinx and HTK to solve their speech research-related problems.   Of course, many of these guys (if they are honest) need extra development time to port some HTK functions into their own production systems.  Not tough, but you wonder whether the time could be better spent ......


Speech Recognition vs SETI

If you follow news of CMUSphinx, you may have noticed that the SourceForge guys have started to distribute training data through BitTorrent (link).

That's a great move.   One of the issues in ASR is the lack of machine power in training.  To give a blunt example, it's possible to squeeze out extra performance by searching for the best training parameters.    Not to mention that a lot of modern training techniques take a long time to run.

I do recommend that all of you help the effort.  Again, I'm not involved at all; I just feel that it is a great cause.

Of course, things in ASR are never easy, so I want to raise two subtle points about the whole distributed approach to training.

Improvement over the years?

The first question you may ask: does that mean ASR can be like a project such as SETI, which automatically improves over the years?  Not yet; ASR still has its own unique challenges.

The major part, as I see it, is how we can incrementally increase the amount of phonetically-balanced transcribed audio.   Note that it is not just audio, but transcribed audio.  Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down what the audio really says, word by word.   All these transcriptions need to be cleaned up and put into a certain format.

This is what Voxforge tries to achieve, and it's not a small undertaking.   Of course, compared to the speed of industry development, the progress is still too slow.   The last I heard, Google was training their acoustic model with 38,000 hours of data.   The WSJ corpus is a toy task compared to that.

Now, thinking this way: let's say we want to build the best recognizer through open source.  What is the bottleneck?  I bet the answer doesn't lie in machine power;  whether we have enough transcribed data would be the key.   So that's something to ponder.

(Added Dec 27, 2012: on the initial amount of data, Nickolay corrected me, saying that the amount of data available to Sphinx is already on the order of 10,000 hours.   That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".

So it does sound like Sphinx has an amount of data that rivals commercial companies.  I am very interested to see how we can train an acoustic model with that amount of data.)

We build it, they will come?

ASR has always been shrouded in misunderstanding.   Many believe it is a solved problem; many believe it is an unsolvable problem.   99.99% of the world population is uninformed about the problem.

I bet a lot of people would be fascinated by SETI, which .... Woa .... allows you to communicate with unknown intelligent sentients in the universe.   Rather than by ASR, which ..... Em ..... many basically regard as a source of satires/parodies these days.

So here comes another problem:  the public doesn't understand ASR enough to see it as an important problem.   When you think about this more,  it is a dangerous situation.   Right now, a couple of big companies control the resources for training cutting-edge speech recognizers.    So let's say in the future everyone needs to talk to a machine on a daily basis.   These big companies would be so powerful that they could control our daily lives.   To be honest with you, this thought haunts me from time to time.

I believe we should continue to spread information on how to properly use an ASR system, and at the same time continue to build applications to showcase ASR and let the public understand its inner workings.   Unlike subatomic particle physics,  HMM-based ASR is not that difficult to understand.   On this part, I appreciate all the efforts of the developers of CMUSphinx, HTK, Julius and all other open source speech recognition projects.


I love the recent move of Sphinx spreading acoustic data using BitTorrent;  it is another step towards a self-improving speech recognition system.   There are still things we need to ponder in the open source speech community.   I mentioned a couple; feel free to bring up more in the comments section.


Me and CMU Sphinx

As I update this blog more frequently, I've noticed more and more people being directed here.   Naturally,  there are many questions about my past work.   For example, "Are you still answering questions in the CMUSphinx forum?", and general requests for certain tutorials.  So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I probably won't be a maintainer-at-large any more.   Nick has proven himself to be the most awesome maintainer in our history.   Through his stewardship, Sphinx has prospered in the last couple of years.  That's what I hoped for and that's what we all hoped for.

So for that reason, you probably won't see me much in the forum answering questions.  Rather, I will spend most of my time implementing, experimenting and getting some work done.
There are many things that ought to be done in Sphinx.  Here is my top-5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKBook-like documentation, i.e., Hieroglyphs.
  4. Regression tests for all tools in SphinxTrain.
  5. In general, modernization of the Sphinx software, such as using a WFST-based approach.
This is not a small undertaking, so I am planning to spend a lot of time relearning the software.  Yes, you heard it right: learning the software.  In general, I found myself very ignorant of a lot of the software details of Sphinx in 2012.   There have been many changes.  The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain.   On PocketSphinx and Sphinx4, I need to learn a lot.
That is why in this blog you will see a lot of posts about my status in learning a certain piece of speech recognition software.   Some could be minute details.   I share them because people can figure out a lot by going through my status updates.   From time to time, I will also pull these posts together to form a tutorial post.
Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot about programming, languages and other technology-related stuff.  Part of it is that I am interested in many things.  The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages.   This might also attract a wider audience in the future.
In any case,  I hope I can keep it up.  And I hope you enjoy my articles!

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

As I am back, I've started to visit all my old friends: the open source speech recognition toolkits.  The usual suspects are still around.  There are also many new kids in town, so this is a good place to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long.   It feels like I am getting back in touch with an old part of myself.

I will skip CMU Sphinx in this blog post, as you probably know something about it if you are reading this blog.   Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post.   This post serves only as an overview.  Most of the toolkits listed here have rich documentation; you will find much useful information there.


HTK

I checked out the Cambridge HTK web page.  Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as other new kids in town such as Kaldi.
HTK has always been one of my top-3 speech recognition systems, since most of my graduate work was done using HTK.   There are also many tricks you can do with the tools.
As a toolkit, I also find its software engineering practice admirable.   For example, the commands are built on common libraries written beneath them.  (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.)   When reading the source code, you feel much regularity, and there doesn't seem to be much duplicated code.
The license disallows commercial use, but that's okay.  With ATK, which is released under a freer license, you can also include the decoder code in a commercial application.


Kaldi

The new kid in town.   It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques.   His recognizer attracts much interest as it has implemented features such as subspace GMMs and an FST-based recognizer.   All of these features feel more "modern".
I have only had a little exposure to the toolkit (but am determined to learn more).   Unlike Sphinx and HTK, it is written in C++ instead of C.   As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*.   In my setup, it took around 5GB of disc space to compile.   It probably means I haven't set it up correctly ...... or more likely, the executables are not stripped.   That means working on Kaldi's source code actively takes some discretion in terms of HD space.
Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as the unifying knowledge source representation.   To contrast, you may say most of the current open source speech recognizers use ad-hoc knowledge sources.

Are there any differences in terms of performance, you ask?  In my opinion, probably not much if you are doing an apples-to-apples comparison.   The strength of using WFSTs is that when you need to introduce new knowledge,  in theory you don't have to hack the recognizer.  You just need to write your knowledge as an FST and compose it with your knowledge network, and you are all set.
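To make the "compose it with your knowledge network" idea concrete, here is a toy composition in the tropical semiring (path weights add).  This is my own minimal sketch in Python, not Kaldi's implementation; Kaldi builds on OpenFst, which also handles epsilons, arc sorting and lazy expansion, all of which this sketch deliberately ignores.

```python
# Toy WFST composition: an arc of the composed machine exists wherever
# fst1's output symbol matches fst2's input symbol, and weights add.

def compose(fst1, fst2):
    """Each FST is (arcs, finals): arcs are (src, insym, outsym, weight, dst)
    tuples and finals maps final state -> final weight."""
    arcs1, finals1 = fst1
    arcs2, finals2 = fst2
    arcs = []
    for (p1, a, b, w1, q1) in arcs1:
        for (p2, b2, c, w2, q2) in arcs2:
            if b == b2:  # fst1's output feeds fst2's input
                arcs.append(((p1, p2), a, c, w1 + w2, (q1, q2)))
    finals = {(s1, s2): w1 + w2
              for s1, w1 in finals1.items()
              for s2, w2 in finals2.items()}
    return arcs, finals

# "Lexicon" L: phone 'hh' maps to word 'hi' with weight 1.0
L = ([(0, 'hh', 'hi', 1.0, 1)], {1: 0.0})
# "Grammar" G: accepts the word 'hi' with weight 2.0
G = ([(0, 'hi', 'hi', 2.0, 1)], {1: 0.0})

arcs, finals = compose(L, G)
print(arcs)  # [((0, 0), 'hh', 'hi', 3.0, (1, 1))]
```

The composed machine maps the phone directly to the word with the combined weight: exactly the trick that lets you splice new knowledge in without touching the decoder.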
In reality, WFST-based technology seems to still have practical problems.  As the vocabulary gets large and the knowledge sources get more complicated, the composed decoding WFST naturally outgrows the system memory.   As a result, many sites propose different techniques to make the decoding algorithm work.
Those are downsides, but the appeal of the technique should not be overlooked.   That's why Kaldi has become one of my favorite toolkits recently.


Julius

Julius is still around!  And I am absolutely jubilant about it.  Julius is a high-speed speech recognizer which can decode with a 60k-word vocabulary. One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed this idea from Julius when I first wrote it.
Julius is only the decoder, and the beauty of it is that it never claims to be more than that.  Accompanying the software there is a new Juliusbook, the guide on how to use the software.  I think the documentation is in greater depth than other similar documentation.
Julius comes with a set of Japanese models, not English.   This might be one of the reasons why it is not as popular (or as talked about) as HTK/Sphinx/Kaldi.
(Note at 20130320: I later learned that Julius now also comes with an English model.  In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news.  I am not surprised; HTK was used as the acoustic model trainer.)

So far......

I went through three of my favorite recognition toolkits.  In the next post, I will cover several other available toolkits.

What should be our focus in Speech Recognition?

If you have worked in a business long enough, you start to understand better what types of work are important.   As with many things in life, sometimes the answer is not trivial.   For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder.  For many, the decoder, the speech recognizer, or the "computer thing" which does all the magic of recognizing speech, is the core of the work.

Indeed, working on a decoder is loads of fun.  If you are a fresh new programmer, it is also one of those experiences which will teach you a lot.   Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format issues and system issues.   You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search.   For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.
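For the curious, the core of that dynamic programming problem fits on one page.  Below is a minimal Viterbi sketch in Python over log probabilities, entirely my own toy formulation; a real decoder adds beam pruning, lexical trees, Gaussian selection and a mountain of engineering on top of this skeleton.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Best state path.  obs_loglik[t][s] = log P(o_t | s),
    log_trans[r][s] = log P(s | r), log_init[s] = log P(state s at t=0)."""
    T, S = len(obs_loglik), len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(S)]
    back = []
    for t in range(1, T):
        prev, delta, bp = delta, [], []
        for s in range(S):
            best = max(range(S), key=lambda r: prev[r] + log_trans[r][s])
            delta.append(prev[best] + log_trans[best][s] + obs_loglik[t][s])
            bp.append(best)
        back.append(bp)
    path = [max(range(S), key=lambda s: delta[s])]   # best final state
    for bp in reversed(back):                         # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path))

# Two "sticky" states; the observations switch their preference at t=2
lg = math.log
trans = [[lg(0.9), lg(0.1)], [lg(0.1), lg(0.9)]]
init = [lg(0.5), lg(0.5)]
obs = [[lg(0.9), lg(0.1)], [lg(0.9), lg(0.1)], [lg(0.05), lg(0.9)]]
print(viterbi(obs, trans, init))  # [0, 0, 1]
```

Note how the sticky transition probabilities resist the switch: the path only jumps to state 1 because the final observation favors it overwhelmingly.  Balancing these two forces is much of what decoder tuning is about.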

Writing a speech recognizer is also very important when you need to deal with speed issues.  You might want to fit a recognizer into your mobile phone or even just a chip.   For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high-speed speech recognition (faster than 100x real time).   All these system-related issues require understanding of the decoder itself.

This makes speech recognition an exciting field, similar to chess programming.  Indeed, the two fields are very similar in terms of code development.   Both require a deep understanding of search as a process. Both have eccentric figures popping in and out.   There are more stories untold than told in both fields.  Both are fascinating fields.

There is one thing in which speech recognition and chess programming are very different.   This is also a subtle point which even many savvy and resourceful programmers don't understand.   That is how each of these machines derives its knowledge sources.   In speech, you need a good model to do a decent job on your task.   In chess, though, most programmers can proceed to write a chess player with the standard piece values.   As a result, there is a process before anyone can use a speech recognizer: first training an acoustic model and a language model.

The same decoder, given different acoustic models and language models, can give users perceptions ranging from a total trainwreck to a modern wonder, borderline magic.   Those are the true ingredients of our magic.   Unlike magicians, though, we are never shy to talk about these secret ingredients.   They are just too subtle to discuss.   For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition.  It usually results in an absolute 10% performance gap."  Those are not party talks.  Those are talks for when you want to have no friends. 🙂

Both types of tasks require learning different from a programming training.   10 years ago, those skills were generally carried by "mathematicians, statisticians, or people who specialized in machine learning".   Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work.   If you ask some high-minded researchers in the field, they will almost certainly think those are not interesting work.   Yet in real life, you can almost always learn something new and improve your systems based on them.  Maybe I will talk about this more next time.

The Grand Janitor