Categories
ASR Programming

The Search Programmer

In every team of any serious ASR or NLP company, there has to be one person who is the “search guy”.  Not search as in search engine, but search as in AI search.  The equivalent of a chess engine programmer in a chess program, or perhaps the engine specialist on a race car team.   Usually this person has three important roles:

  1. Program the engine,
  2. Add new features to the engine,
  3. Maintain the engine throughout its lifetime.

This job is usually taken by someone with a title such as “Speech Scientist” or “Speech Engineer”.   They usually have blended skills in both programming and statistics.   It’s a tough job, but it’s also a highly satisfying one, because the success of a company usually depends on whether features can be integrated quickly.   That gives the “search guy” a mythical status even among data scientists: a search engineer needs to work effectively with two teams, one with a mostly research background in statistics and machine learning, the other with a mostly programming background, whose job is to churn out pseudocode, implementations and architecture diagrams daily.


I tend to think the power of the “search guy” is both understated and overstated.

It’s understated because there are many companies which only use other people’s engines, so they never quite get the edge of customizing an engine.  Those which use an open-source implementation are better off, because they preserve the right to change the engine, which gives them leverage on intellectual property and trade secrets.  Those who buy a commercial engine from a large company enjoy good performance for a few years, but then get squeezed by the huge price of upgrading and constrained by an overly restrictive license.

(Shameless promotion here:  Voci is an exception.  We are very nice to our clients. Check us out here. 🙂 )

It’s overstated because the skill of programming a search is nothing but a series of logical exercises.   The pity is that programming a search algorithm, or a dynamic program (DP) in general, takes many kinds of expertise.  The knowledge can only be found sporadically across different subjects.  Some might learn the basics of DP from an algorithms book such as CLRS, but mere knowledge of programming doesn’t give you insight into how to debug an issue in the search.  You also need a solid understanding of the domain (such as POS tagging or speech recognition) and of the theory (such as machine learning) to get the job done correctly.

Arthur

Categories
Machine Learning

For the Not-So-Uninitiated: Review of Ng’s Coursera Machine Learning Class

I had heard about Prof. Andrew Ng’s Machine Learning class for a long time.  As MOOCs go, this is a famous one; you could say the class actually popularized MOOCs.   Many people seem to have benefited from the class, and it has a roughly 70% positive rating.   I have no doubt that Prof. Ng has done a good job teaching non-data-scientists a lot of difficult concepts in machine learning.

On the other hand, if you are a more experienced practitioner of ML, someone like me who has worked in one subfield of the industry (eh, speech recognition……) for a while, would the class be useful for you?

I think the answer is yes for several reasons:

  1. You want to connect the dots: when most of us work on a particular machine learning problem for a while, it’s easy to fall into the tunnel vision inherent to that type of machine learning.   e.g.  for a while, people thought that using 13 dimensions of MFCC was the norm in ASR.  So if you learn machine learning through ASR, it’s natural to think that feature engineering is not important. That cannot be more wrong! If you look at write-ups from Kaggle winners, most will tell you they spent the majority of their time engineering features.  So learning machine learning from the ground up gives you a new perspective.
  2. You want to learn the language of machine learning properly: one thing I found useful about Ng’s class is that it doesn’t assume you know everything (unlike many postgraduate-level classes).   e.g. I found that Ng’s explanation of the terms bias vs. variance makes a lot of sense, because the terms have to be interpreted in a particular way to make sense.  Before his class, I always had to conjure up the equations for bias and variance in my head.   True, it’s more elegant that way, but for the most part an intuitive feeling is more crucial at work.
  3. You want to practice:  suppose you are like me and have been focusing on one area of ASR; in my case, I spend a good portion of my time just working on the codebase of the in-house engine.  Chances are you will lack opportunities to train yourself on other techniques.  e.g.  I had never implemented linear regression (a one-liner, see the sketch after this list) or logistic regression before.  So this class gives you an opportunity to play with these things hands-on.
  4. Your knowledge is outdated: you might have learned pattern recognition or machine learning once back in school, but technology has changed and you want to keep up.  I think Ng’s class is a good starter class.  There are more difficult ones such as Hinton’s Neural Networks for Machine Learning, the Caltech class by Prof. Yaser Abu-Mostafa, or the CMU class by Prof. Tom Mitchell.  If you are already proficient, yes, maybe you should jump to those first.
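
(As an aside on point 3: linear regression really can be a one-liner.  Here is a tiny sketch with made-up toy data, using numpy’s least-squares solver; the variable names are mine, not the class’s.)

import numpy as np

# toy data: 5 examples, 2 features plus a bias column (made-up numbers)
X = np.array([[1.0, 0.5, 1.0],
              [2.0, 1.0, 1.0],
              [3.0, 1.5, 1.0],
              [4.0, 2.0, 1.0],
              [5.0, 2.5, 1.0]])
y = np.array([1.1, 2.1, 2.9, 4.2, 5.0])

# the "one-liner": solve the least-squares problem for theta
theta = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)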

So this is how I see Ng’s class.  It is deliberately simple and leans toward the practical side.  Math is minimal and calculus is nada.  There is no deep learning, and you don’t have to implement the algorithms that train an SVM.   There is none of the latest stuff such as random forests and gradient boosting.   But it’s a good starter class, and it gives you a good warm-up if you haven’t been learning for a while.

Of course, this also speaks to the downsides of the class: there are just too many practical techniques which are not covered.  For example, if you work on a few machine learning problems, you will notice that an SVM with an RBF kernel is not the most scalable option; random forests and gradient boosting are usually a better choice.   And even when using an SVM, a linear kernel with the right package (such as pegasus-ml) gives you a much faster run.  In practice, that could mean the difference between delivering and not.   So this is what Ng’s class is lacking: it doesn’t cover many important modern techniques.

In a way, you should see it as your first machine learning class.   The realistic expectation is that you need to keep on learning.  (Isn’t that true of everything?)

Issues aside, I feel very grateful to be learning something new in machine learning again.  I took my last ML class back in 2002, and the landscape of the field was so different back then.    For that, let’s thank Prof. Ng! And happy learning.

Arthur

Postscript (April 2017)

Since taking this first Coursera class, I have taken several other classes, such as Dragomir Radev’s NLP and, perhaps more interesting to you, Hinton’s Neural Networks for Machine Learning.    You can find my reviews at the following hyperlinks:

Radev’s Coursera Introduction to Natural Language Processing – A Review

A Review on Hinton’s Coursera “Neural Networks and Machine Learning”

I also have a mind to write a review for the perfect beginner in machine learning, so stay tuned! 🙂

(20151112) Edit: tunnel effects -> tunnel vision.   Fixed some writing issues.
(20170416) In the process of organizing my articles.  So I do some superficial edits.

Reference:

Andrew Ng’s Coursera Machine Learning Class : https://www.coursera.org/learn/machine-learning/home/welcome

Geoff Hinton’s Neural Networks for Machine Learning:  https://www.coursera.org/course/neuralnets

The Caltech class: https://work.caltech.edu/telecourse.html

The CMU class: http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

 

 

Categories
Classification Machine Learning Regression

Gradient Descent For Logistic Regression

I was binge-watching (no kidding) all the videos from Andrew Ng’s Coursera ML class.   Maybe I will write a review at some point.  In short, it is highly recommended for anyone who works in data science and machine learning to go through the class and spend some time finishing the homework step by step.

What I want to talk about though is an interesting mathematical equation you can find in the lectures, namely the gradient descent update for logistic regression.   You might notice that the gradient descent updates for both linear regression and logistic regression have the same form in terms of the hypothesis function, i.e.

$latex \theta_j := \theta_{j} - \alpha \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}……(1)$

The notation can be found in Prof. Ng’s lectures at Coursera.  You can also find the lecture notes here.

So why is it the case then? In a nutshell, it has to do with how the cost function $latex J(\theta)$ was constructed.  But let us back up and do some simple calculus exercises on how the update equation can be derived.

In general, updating the parameter $latex \theta_j$ with gradient descent follows

$latex \theta_j := \theta_{j} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}……(2)$

So we first consider linear regression with hypothesis function,

$latex H_{\theta}(\pmb{x}) = \theta^T \pmb{x}……(3)$

and cost function,

$latex J(\theta) = \frac{1}{2}\sum_{i=1}^M (H_{\theta}(\pmb{x}^{(i)})- y^{(i)})^2……(4)$.

So….

$latex \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^M \frac{\partial J(\theta)}{\partial H_{\theta}(\pmb{x}^{(i)})} \frac{\partial H_{\theta}(\pmb{x}^{(i)})}{\partial \theta_j} $

$latex = \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}$ for $latex j = 1 \ldots N$.

So we arrive at update equation (1).

Before we go on, notice that in our derivation for linear regression, we used the chain rule to simplify. Many of these super-long expressions can be simplified much more easily if you happen to know the trick.

So how about logistic regression? The hypothesis function of logistic regression is
$latex H_{\theta}(\pmb{x}) = g(\theta^T \pmb{x})……(5)$

where $latex g(z)$ is the sigmoid function

$latex g(z) = \frac{1}{1+e^{-z}}…… (6)$.

as we can plot in Diagram 1.

Diagram 1: A sigmoid function
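
(If you want to reproduce Diagram 1 yourself, here is a minimal sketch, assuming numpy and matplotlib are available.)

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 200)
g = 1.0 / (1.0 + np.exp(-z))   # the sigmoid of equation (6)

plt.plot(z, g)
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("A sigmoid function")
plt.show()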

The sigmoid function is widely used in engineering and science. For our discussion, here’s one very useful property:

$latex \frac{dg(z)}{dz} = g(z) (1 – g(z)) …… (7)$

Proof:
$latex \frac{dg(z)}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}}$
$latex = -(\frac{-e^{-z}}{(1+e^{-z})^2})$
$latex = \frac{e^{-z}}{(1+e^{-z})^2}$
$latex = g(z)(1-g(z))$

as $latex 1-g(z) = \frac{e^{-z}}{1+e^{-z}}$.

Now that we have all the tools, let’s go ahead and calculate the gradient term for the logistic regression cost function, which is defined as

$latex J(\theta) = \sum_{i=1}^M \lbrack -y^{(i)}\log H_{\theta}(x^{(i)})-(1-y^{(i)})\log (1- H_\theta(x^{(i)}))\rbrack$

The gradient is

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M \lbrack -y^{(i)} \frac{H'_{\theta}(x^{(i)})}{H_{\theta}(x^{(i)})} + (1- y^{(i)}) \frac{H'_{\theta}(x^{(i)})}{1-H_{\theta}(x^{(i)})}\rbrack ……(8)$

So, making use of equation (7) and the chain rule, the derivative of the hypothesis w.r.t. $latex \theta_k$ is:

$latex H'_{\theta}(x^{(i)}) = H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} ……(9)$

Substitute (9) into (8),

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M -y^{(i)}\frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{H_{\theta}(x^{(i)})} + \sum_{i=1}^M (1- y^{(i)}) \frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{1-H_{\theta}(x^{(i)})}$
$latex = \sum_{i=1}^M\lbrack -y^{(i)} (1-H_{\theta}(x^{(i)}))x_k^{(i)} \rbrack + \sum_{i=1}^M\lbrack (1- y^{(i)}) H_{\theta}(x^{(i)})x_k^{(i)} \rbrack$
$latex = \sum_{i=1}^M \lbrack -y^{(i)} + y^{(i)} H_{\theta}(x^{(i)}) + H_{\theta}(x^{(i)}) - y^{(i)} H_{\theta}(x^{(i)}) \rbrack x_k^{(i)}$

As you may have observed, the second and the fourth terms cancel out. So we end up having:

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M (H_{\theta}(x^{(i)}) -y^{(i)})x_k^{(i)}$,

which, plugged into (2), brings us back to update rule (1).
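
To make the update rule concrete, here is a small numpy sketch of batch gradient descent for logistic regression using exactly the update above.  The toy data, the learning rate and the iteration count are made up for illustration; this is a sketch, not Prof. Ng’s reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: M = 4 examples, N = 2 parameters (bias folded in as a column of ones)
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
alpha = 0.1                           # learning rate

for _ in range(1000):
    h = sigmoid(X.dot(theta))         # H_theta(x^(i)) for every i
    grad = X.T.dot(h - y)             # sum_i (H_theta(x^(i)) - y^(i)) x^(i)
    theta = theta - alpha * grad      # the update rule derived above

print(theta)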

This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive at the same update rule. What we should appreciate is that the design of the cost function is part of the reason why such a “coincidence” happens. That’s also why I appreciate Ng’s simple lectures: they use a set of derivations which bring beginners into machine learning more easily.

Arthur

Reference:
Prof. Ng’s lectures : https://www.coursera.org/learn/machine-learning

Properties of the sigmoid function can be found on the Wikipedia page: https://en.wikipedia.org/wiki/Sigmoid_function

Many linear separator-based machine learning algorithms can be trained using simple gradient descent. Take a look at Chapter 5 of Duda, Hart and Stork’s Pattern Classification.

Categories
Uncategorized

Doing It

I am often in a group trying to tackle a task. We might be missing a certain implementation, a certain proof, or even some simple knowledge of a certain programming language. And you can guess the consequence: all planning efforts stop there. One of us would say “Let’s not pursue that path. There’s too much risk and unknown.”

I have pondered these situations for a while. There is a surprisingly simple remedy for all of these problems: spend a little time trying out a simple solution.

For example, when you don’t know a certain language, can you write an “if”-clause, or even a “Hello World”, in that language? Further, can you write a for-loop? For me, googling this kind of question for around 10 minutes usually gives me basic knowledge of the language and lets me implement a simple program. And surprisingly, that’s what ~70% of tasks need: repeat an action (a for-loop) under a certain condition (an if-statement).

Another example: suppose you don’t have an implementation of, say, linear regression.   Can you simplify the problem first? e.g. can you first implement linear regression with one variable (i.e. only the linear term)?  If you can do that, can you then extend your implementation to multiple variables?   Implementing an algorithm, unlike understanding an algorithm, doesn’t take as much time.   So it’s silly to get caught up before you start.
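
To make this concrete, here is a minimal sketch of the one-variable case with plain gradient descent (the toy numbers and the learning rate are made up).  Once this works, going to multiple variables is mostly a matter of turning the scalars into vectors.

# single-variable linear regression, y ~ theta0 + theta1 * x,
# fitted with plain gradient descent on toy data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

theta0, theta1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    grad0 = sum((theta0 + theta1 * x - y) for x, y in zip(xs, ys))
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # should end up close to 0 and 2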

As a final example, how about learning a certain machine learning algorithm?  What should you do first?  My approach would be to first run some existing examples of the algorithm.  For example, if you want to learn the basics of SVMs, try to run examples from the libsvm tutorial.    If you want to learn DNNs, go through the deep learning tutorial.    For sure, you will not get every nuance of machine learning this way.  e.g.  how do you make a decision when you get stuck?  How do you decide whether your machine learning algorithm is suitable for the task?   Yes, those take experience and more in-depth methodology.  (Check out Andrew Ng’s Machine Learning class.) But knowing these algorithms top-down, or solely through the mathematics, is unlikely to get you too far.

One common retort from people opposed to just doing it is that you usually don’t get state-of-the-art performance in the first few passes.   I do agree with parts of that viewpoint: sometimes more planning gives you a more refined first solution.  But most of the time, you don’t conjure knowledge out of nothing; you build it up.   It’s useless to be too caught up in thinking sometimes.

I am also not opposed to understanding.   For example, doing derivations and proofs for an algorithm always helps your implementation.   But in practice, getting a feel for an algorithm also helps you understand it.

If you ask me, what is really shameful is that we are sometimes blocked by a subtle fear of learning something new.   That doesn’t do any good for you or for the team.  That’s what I am driving at in this article.

Arthur

Reference:
I was inspired by this post : “Concrete, Then Abstract“.  I also highly recommend “Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values” to my readers.

Categories
Uncategorized

The Simplification Trick in IBM Model 1

As part of my daily reading routine, I try to browse through “Statistical Machine Translation” by Koehn, a famous researcher in the field.  Of course, the basis of SMT is the IBM Models, and the basis of the IBM Models is IBM Model 1.   So one key derivation is how a sum over an exponential number of alignments ($latex (l_f+1)^{l_e}$ of them) reduces to a computation linear in $latex l_f$ and $latex l_e$.   It boils down to the derivation at equation 4.10:

(Here I assume the reader knows the basic formulation of t-tables. Please refer to Ch. 4 of [1].)

$latex p(e|f)$

$latex =\sum_{a} p(e,a|f) ….. (1)$

$latex = \sum_{a(1)=0}^{l_f}\ldots\sum_{a(l_e)=0}^{l_f} p(e,a|f) …… (2)$

$latex = \sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)}) …… (3)$

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\prod_{j=1}^{l_e} t(e_j|f_{a(j)}) …… (4)$

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i) …… (5)$

It must be trivial to some very smart people, but how come $latex (4)$ became $latex (5)$ all of a sudden?

It turns out this is an important trick: it is used again in deriving the count function $latex c$ in equation 4.13 (an exercise at the back of Chapter 4). The same trick is also used in Model 2, where the alignment probability is modeled. So it is worth a bit of our time to understand why it works.

Usually this kind of expression reduction can be proved formally by mathematical induction. But as noted in [2], mathematical induction doesn’t usually give you useful insight. So let us use another approach.

Let us first consider just the sum over $latex a(1)$ in equation $latex (4)$:

$latex \sum_{a(1)=0}^{l_f} \prod_{j=1}^{l_e}t(e_j|f_{a(j)}) …… (6)$

Now consider the first term of expression $latex (6)$, i.e. the $latex a(1)=0$ term. What is it? It’s simply:

$latex t(e_1 | f_0) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

The second term?

$latex t(e_1 | f_1) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

So you probably start to get the drift. The $latex m$-th term of expression $latex (6)$ is

$latex t(e_1 | f_m) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

Since all $latex l_f + 1$ terms share the common factor $latex \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$, you can factorize $latex (6)$ as

$latex (t(e_1|f_0) + t(e_1|f_1) + \ldots + t(e_1 | f_{l_f}))\prod_{j=2}^{l_e}t(e_j|f_{a(j)}) $

or

$latex (\sum_{i=0}^{l_f}t(e_1|f_i))\prod_{j=2}^{l_e}t(e_j|f_{a(j)}) $

As a result, you have absorbed one summation sign and reduced the product by one factor (in case you didn’t notice, $latex j$ now starts at 2).

So, back to our question: you just need to repeat the above procedure $latex l_e$ times, once for each of $latex a(1), \ldots, a(l_e)$, and you will end up re-expressing the sums in equation $latex (4)$ as

$latex (\sum_{i=0}^{l_f}t(e_1|f_i)) \times (\sum_{i=0}^{l_f}t(e_2|f_i)) \ldots \times (\sum_{i=0}^{l_f}t(e_{l_e}|f_i))$

So here comes equation $latex (5)$:

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i)$
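
If you don’t trust the algebra, the identity is easy to check numerically by brute force.  Here is a small sketch with a made-up t-table (it needs Python 3.8+ for math.prod); the variable names are mine, not Koehn’s.

import itertools
import math
import random

l_e, l_f = 3, 2   # 3 English positions, 2 foreign words plus NULL (index 0)
# made-up translation table t(e_j | f_i), indexed as t[j][i]
t = [[random.random() for _ in range(l_f + 1)] for _ in range(l_e)]

# left-hand side: enumerate all (l_f + 1)^(l_e) alignments and sum the products
lhs = sum(math.prod(t[j][a[j]] for j in range(l_e))
          for a in itertools.product(range(l_f + 1), repeat=l_e))

# right-hand side: the factorized form of equation (5)
rhs = math.prod(sum(t[j][i] for i in range(l_f + 1)) for j in range(l_e))

print(lhs, rhs)   # the two numbers agree up to floating-point error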

Strictly speaking, this is not a post about SMT but more about how one can simplify sums and products. In many papers this kind of math is usually skipped, and the reader is assumed to know what to do. (I can imagine some friends of mine saying “if you don’t know this stuff, you don’t deserve to do ….!”) Of course, for mere mortals like us, it can take some time and effort to figure out.

Direct manipulation is certainly one way to go. There are also interesting rules for manipulating summation indices which sometimes greatly simplify the derivation. If you are interested in reading more about the topic, I found that Chapter 2 of Concrete Mathematics ([2]) is a good start.

Reference:
[1] Philipp Koehn, Statistical Machine Translation, 2010.
[2] Graham, Knuth and Patashnik, Concrete Mathematics, 2002.

Categories
SMT

Building a simple SMT

I had a couple of vacation days last week.   For fun, I decided to train a statistical machine translator (SMT).   Since I wanted to use open-source tools, the natural choice was Moses with GIZA++.  So this note is about how you can get started smoothly.  I don’t plan to write a detailed tutorial because Moses’ tutorial is nice enough already.  What I note here is more about how to deal with different stumbling blocks.

Which Tutorial to Follow?

If you have never run an SMT training before,  perhaps the more solid way to start is to follow the “Baseline System” link (a better name could be “How to train a baseline system”).    There you will find a rather detailed tutorial on how to train a set of models from the WMT13 mini news commentary data.

Compilation

I found that the most difficult part of the process is to compile Moses.   I don’t blame anybody; C++ programs can generally be difficult to compile.

Boost

Build Boost from source and make sure libbz2 is installed first.  Then life will be much easier.

cmph

While it is not mandatory, I would highly recommend installing cmph before compiling Moses, because it enables compilation of the table-compressing tools processPhraseTableMin and processLexicalTableMin.  Without them, decoding will take a long, long time.

Actual bjaming

Doing

./bjam --with-boost=<boost_dir> --with-cmph=<cmph_dir> -j 4

works fairly well for me, until it gets to the ./misc directory, where I found I needed to manually add the Boost path to the compilation.

Training

Training is fairly trivial once you have moses compiled correctly and put everything in your root directory.

On the other hand, if you compiled your code somewhere other than ~/, do expect some debugging to be necessary.  e.g. mert-moses.pl requires a full path in the --mertdir argument.

Results:

BLEU = 23.34, 60.1/29.7/16.7/9.9 (BP=1.000, ratio=1.018, hyp_len=76112, ref_len=7475)

Conclusion

Here you have it: some notes on the simplest recipe for a non-expert (like me).   If I have a chance, I will analyze how the source code works.  Again, just for fun.

Arthur

Categories
ASR

Different HMMSets in HTK

HTK was my first speech toolkit. It’s fun to use and you can learn a lot of ASR by following the manual carefully and deliberately.

If you are still using HMM/GMM technology (interesting, but why?), here is a thread from a year ago on why there are different HMM types in HTK.

One thought I have: when I first started out in ASR, I seldom thought of the human elements in a design. Of course, that has to do with the difficulty of understanding all the terminology and algorithms.

Yet ASR research has a lot to do with rival groups coming up with different ideas, each betting against the others on the success of a certain technique.

So sometimes you would hope that competition makes technology finer. Yet a highly competitive environment only nurtures followers, rather than maverick groups such as Prof. Young’s, or MSR (which, AFAIK, built the first working version of DNN-based ASR).

Arthur


Hi,

I’m a student who’s looking into the HTK source code to get some idea
about practical implementation of HMMs. I have a question related to
the design choices of HTK.

AFAIK, the current working set of HMMs (HMMSet) has 4 types: plain,
shared, tied, discrete.
HMM sets with normal continuous emission densities are “plain” and
“shared”, only difference being that some parameters are shared in the
latter. Sets with semi-continuous emission densities (shared Gaussian
pools for each stream) are called “tied” and discrete emission
densities are “discrete”.

If someone uses HTK, isn’t there a high chance of using only one of
these types? The usage of these types is probably mutually exclusive.
So my question is, why not have separate training and recognition
tools for continuous, semi-continuous and discrete HMM sets? Here are
some pros and cons of the current design I can think of, which of
course can be wrong:

pros:
– less code duplication
– simpler interface for the user

cons:
– more code complexity
– more contextual information required to read, more code jumps
– unused variables and memory, examples: vq and fv in struct
Observation, mixture alignment in discrete case

If I were to implement HMMs supporting all these emission densities,
what path should I follow? How feasible is it to use OOP principles to
create a better design? If so, why weren’t they leveraged in HTK?

Warm regards,
Max

(I trimmed out Mr. Neil Nelson’s reply, which basically suggests people should use Kaldi instead.)


Max and Neil

I don’t usually respond to HTK questions, but this one was hard to resist.

I designed the first version of HTK in Cambridge in 1988 soon after moving from Manchester where I worked for a while on programming language and compiler design. I was a strong advocate of modular design, abstraction and OOP. However, at that time, C++ was a bit of a nightmare. There was little standardisation across operating systems and some implementations were very inefficient. As a result I decided that since HTK had to be very efficient and portable across platforms, it would be written in C, but the architecture would be modular and class like. Hence, header files look like class interfaces, and body files look like class method implementations.

When HTK was first designed, the “experts” in the US DARPA program had decided that continuous density HMMs would never scale and that discrete and semi-continous HMMs were the way to go. I thought they were wrong, but decided to hedge my bets and built in support for all three – whilst at the same time taking care that the implementation of continuous densities was not compromised by the parallel support for discrete and semi-continuous. By 1993 the Cambridge group (and the LIMSI group in France) were demonstrating that continuous density HMMs were significantly better than the other modelling approaches. So although we tried to maintain support for different emission density models, in practice we only used continuous densities for all of our research in Cambridge.

It is a source of considerable astonishment to me that HTK is still in active use 25 years later. Of course a lot has been added over the years, but the basic architecture is little changed from the initial implementation. So I guess I got something right – but as Neil says, things have moved on and today there are good alternatives to HTK. Which is best depends on what you want to do with it!

Steve Young

Categories
Uncategorized

How to use Latex in WordPress?

This is one of those small things that could take you a while to figure out.   In your “Text” tab, what you want to do is type:

$latex your-latex-code-here$

For example

$latex \gamma$ would give you

$latex \gamma$.

So what could go wrong?

If you mistakenly put a space after the first dollar sign, e.g.
$[SPACE]latex \gamma$ 

Then you will get the latex code on the screen. (Argh.)

Here are some more complicated examples:

$ latex \int_0^1 f(x) \, \mathrm{d} x $ is

$latex \int_0^1 f(x) \, \mathrm{d} x$

And

$ latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}. $ is

$latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$

I think this is a useful thing to know, because if you copy a LaTeX formula from the web it can take you a while to get it rendering right and looking nice.

Arthur

Categories
The Grand Janitor Blog

Updates of The Grand Janitor Blog

It has been a while since I last blogged.  I was mostly busy with work, so blogging was on the sidelines.   So this time I spent a bit of time updating my publication list and getting my hands wet again.

What have I learned in the last 1.5 years?  “ASR is solved.” That’s the one comment I have heard since DNN-based systems came out around 3 years ago, i.e. when they became widely adopted by around 40-50 sites around the world.

Of course, this was said before: it happened when people started to use adaptive techniques and saw significant gains.   Perhaps it was also said when people first discovered using GMMs as the state distributions, or first used HMMs instead of DTW.

So while I understand people are getting more elated, we probably still have problems to solve.  For example, look at the latest IBM research on Switchboard; there is probably quite a lot of room to improve the current state-of-the-art NN-based systems.   Not to mention, online video transcription still seems to be a hard problem.  That’s perhaps why many people are still working on ASR.

Hopefully I can come back more often, ideally one post per week.  We will see how it goes.

Arthur

Categories
Programming

Patterns in ASR Coding

Many toolkits in ASR appear in the form of Unix executables.   But the nature of ASR tools is quite a bit different from general Unix tools.   I will name three differences here:

  1. Complexity: A flexible toolkit demands that developers have an external scripting framework.  SphinxTrain used to be glued together by Perl, now by Python.   Kaldi, on the other hand, is mainly glued together by shell scripts.  I heard Cambridge has its own tools to run experiments correctly.
  2. Running Time: The thing about coding ASR is that it takes a long time to verify whether something is correct.   So there are things you can’t do: a very agile, code-and-test style of development doesn’t work well.   I have seen people try it, and it leaves many bugs in the codebase.
  3. Numerical Issues: Small changes in numerical algorithms can cause subtle changes in the results, and it is tricky to code such changes well.  When these changes penetrate to production, they are usually very hard to debug.  When they affect performance, the result could be disastrous to you and your clients.

In a nutshell, we are dealing with a piece of software which is complex and mission-critical.  The issue is how you continue to develop and maintain such software.

In this article, I will talk about how this kind of coding can be done right.   You should notice that I don’t favor a monolithic design of experimental tools, e.g. “why don’t we just write one single tool that does everything (to train/to decode)?”  There is a place for that mindset in software engineering; e.g. Mercurial is designed that way and I heard it is very competitive with Git.   But I prefer a Unix-tool type of design, close to HTK, Sphinx and Kaldi, i.e. you write many tools, each with a different purpose, and you simply glue them together for your own needs.  I will call code changes in these little Unix tools code-level changes, and changes at the scripting level simply script-level changes.

Many of these thoughts were taught to me by experienced people in the field.   Some are applicable to other fields, such as Think Before Code and Conclude from your Tests.  Others apply to machine-learning-specific problems: Match Results Numerically and Always Record Results.

Think Before Code

In our time, the agile development paradigm is very popular.  Maybe too popular, in my view.  Agile development is being deployed in too many places where I think it is inappropriate, and ASR is one of them.

As a coder in ASR, you usually do one of two things: make code-level changes (in C/C++/Java) or script-level changes (in Perl/Python).  In a nutshell, you are programming within a complex piece of software.   Since testing can take a long time, a code-and-test paradigm won’t work too well for you.

On the other hand, deliberate-and-slow thinking is your first line of defense against any potential issues.  You should ask yourself a couple of questions before making any change:

  1. Do you understand the purpose each of the tools in your script?
  2. Do you understand the underlying principle of the tool?
  3. Do you understand the I/O?
  4. Would you expect your changes to alter the I/O at all?
  5. For each tool, do you understand the code?
  6. What is your change?
  7. Where are your changes?  How many things do you need to change? (10 files? 100 files? List them out.)
  8. In your head, after you make the change, do you expect your change will work? Why?  Convince yourself.

These are some of the questions you should ask yourself.  Granted, you don’t have to have all the answers, but the more you know, the more you reduce potential future issues.

Conclude from your Tests, not from your Head

After all the thinking, are we done? No, you should still test your code; in fact you should test your code like a professional tester.  Bombard your well-thought-out program with tests.   Fix all compiler warnings, and valgrind it to fix leaks.   If you don’t fix a certain thing, make sure you have a very, very good reason, because any change in your decoder or trainer can have many ramifications for upper layers of software, for you and for your colleagues.

The worst way to think about ASR coding is to say “it should work!”.  No.  Sometimes, it doesn’t. You are being naive if you don’t test the code.

Who makes such mistakes? It is hard to nail down. My observation is that it is those who always try to think through problems in their head and have a strong conviction that they are right.    They are usually fresh grads (all kinds: Bachelors, Masters, PhDs, they are everywhere), or people who only work on research and haven’t done much real-life coding.  In a nutshell, it is a “philosophy” thing: some people tend to think their a priori thoughts will work as-is.   This is 8th-century thinking.  Always verify your changes with experiments.

Also, no one says testing eliminates all problems.  But if you think and test, the chances of making mistakes are tremendously reduced.

Scale It Down

The issue with a large amount of testing in ASR is that it takes a long time.   So what should you do?

Scale it down.

e.g. Suppose you have a 1000-utterance test and you want to reduce the testing time.  Make it a 100-utterance test, or even 10.  That allows you to verify your change quickly.

e.g. If an issue appears in a 1-minute utterance, see if you can reproduce the same issue on a 6-second one.

e.g. If you are trying a procedure on 1000 hours of data, try to test it on 100 hours first.

These are just some examples.  This is a very important paradigm because it allows you to move on with your work faster.
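
For instance, carving a smaller test set out of a Sphinx-style fileids list (one utterance id per line) is a one-minute script; the file names here are hypothetical.

# keep only the first 100 utterance ids of a (hypothetical) test.fileids list
N = 100
with open("test.fileids") as fin, open("test_small.fileids", "w") as fout:
    for lineno, line in enumerate(fin):
        if lineno >= N:
            break
        fout.write(line)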

Match Results Numerically

If you make an innocuous change but the results are slightly different, you should be very worried.

The first question you should ask is “How can this happen at all?” For example, let’s say you add a command-line option: your decoding results shouldn’t change.

Are there any implicit or explicit random number generators in the code?  Or have you accidentally taken in user input?  Otherwise, how could your innocuous change cause changes in the results?

Be wary of anyone who says “It is just a small change.  Who cares? The results won’t change.” No: always question the size of the change.   Ask how many significant digits match if there is any difference.   If you can, try to learn more about the intrinsic error introduced by floating-point calculation.  (e.g. “What Every Computer Scientist Should Know About Floating-Point Arithmetic” is a good start.)

There is another, opposing thought: that it should be okay to have some numerical changes.  I don’t really buy it, because once you allow yourself to drift 0.1% ten times, you will have a 1% drift which can’t be explained.  The only time you should let it go is when you encounter randomness you can’t control.  Even in those cases, you should still explain why your performance would change.
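
A crude but useful habit is to diff results numerically instead of eyeballing them.  Here is a minimal sketch; the file names are hypothetical and it assumes the results are whitespace-separated numbers.

# report the largest absolute difference between two result files
def numbers(path):
    with open(path) as f:
        return [float(tok) for tok in f.read().split()]

old, new = numbers("results_old.txt"), numbers("results_new.txt")
assert len(old) == len(new), "different number of values"

worst = max(abs(a - b) for a, b in zip(old, new))
print("largest absolute difference:", worst)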

Predict before Change

Do you expect your changes would give better results?  Or worse results?  Can you explain to yourself why your change could be good/bad?

In terms of results, we are talking mainly about three things: word error rate, speed and memory usage.

Setup an Experimental Framework

If you are at all serious about ML or ASR, you will have tested your code many times.  And if you have tested your code many times, you will realize you can’t keep track of all your experiments in your head.  You need a system.

I have written an article about this subject in V1 of this blog.  In a nutshell, make sure you can repeat/copy/record all your experimental details, including binary versions and parameters.
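
As a rough illustration of what “record” can mean, here is a tiny logging helper.  It assumes your code lives in a git repository; the field names, file names and numbers are placeholders, not a prescription.

import json
import subprocess
import time

def log_experiment(params, results, logfile="experiments.jsonl"):
    """Append one experiment record: timestamp, code version, parameters, results."""
    record = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "params": params,
        "results": results,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

# example usage (made-up numbers)
log_experiment({"lm_weight": 9.5, "beam": 1e-60}, {"WER": 23.4})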

Record your Work

Given the complexity of your work, you should make sure you keep enough documentation.  Here are some ideas:

  1. Version control system: for your code
  2. Bug tracking: for your bugs and feature requests
  3. Planning document: for what you need to do in a certain task
  4. Progress note: record on a daily basis what you have done/learned experimentally.

Yes, you should have many records by now.  If you don’t have any, I feel worried about you.  Chances are some important experimental details have been forgotten.  Or if you don’t see what you are doing as an experiment…… well, I wonder how you explain what you do to other people.

Conclusion

That’s what I have today.  This article summarizes many important concepts on how to maximize your chances of success when making any coding change.    Some of these are habits which take time to set up and get used to.   From my experience, though, these habits are invaluable.  I find myself writing features which have fewer problems, or at least, when there are problems, they are ones I couldn’t have anticipated.

Arthur