The Simplification Trick in IBM Model 1

As a daily reading routine, I try to browse through "Statistical Machine Translation" by Koehn, a famous researcher in the field.  Of course, the basic of SMT is IBM Models and the basic of IBM Models is IBM Model 1.   So one key derivation is how come when there exponential alignment (${l_f}^{l_e}$) to a computational complexity linear to ($latex{l_f}) and ($latex{l_e}).   So it boils down the derivation at equation 4.10:

(Here I assume the readers know the basic formulation of t-tables. Please refer to Ch 4 of [1])

$p(e|f)$

$=\sum_{a} p(e,a|f) ..... (1)$

$= \sum_{a(1)=0}^{l_f}\ldots\sum_{a(l_e)=0}^{l_f} p(e,a|f) ...... (2)$

$= \sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)}) ...... (3)$

$= \frac{\epsilon}{(l_f + 1)^{l_e}}\sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\prod_{j=1}^{l_e} t(e_j|f_{a(j)}) ...... (4)$

$= \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i) ...... (5)$

It must be trivial to some very smart people, but how come $(4)$ became $(5)$ all the sudden?

It turns out this is an important trick, it was used in deriving the count function $c$ later in equation 4.13. (An exercise at the back of Chapter 4.) The same trick was also used in Model 2 when alignment probability is modeled. So it might worth a bit of our time to understand why.

Usually this kind of expression reduction can be formally solved by mathematical induction. But as noted in [2]. Mathematical induction doesn't usually give you useful insight. So let us use another approach.

Let us first consider the expression with $l_e$ summation signs in equation $(4)$:

$\sum_{a(1)=0}^{l_f} \prod_{j=1}^{l_e}t(e_j|f_i) ...... (6)$

So further we consider the first term of the expression at $(6)$. What is it? It's simply:

$t(e_1 | f_0) \prod_{j=2}^{l_e}t(e_j|f_i)$.

The second term?

$t(e_1 | f_1) \prod_{j=2}^{l_e}t(e_j|f_i)$.

So you probably start to get the drift. The $m$-th term of expression $(6)$ is

$t(e_1 | f_m) \prod_{j=2}^{l_e}t(e_j|f_i)$.

Since all $l_f$ terms have the common factor $\prod_{j=2}^{l_e}t(e_j|f_i)$, you can factorize $(6)$ as

$(t(e_1|f_0) + t(e_1|f_1) + \ldots + t(e_1 | f_{l_f})\prod_{j=2}^{l_e}t(e_j|f_i)$

or

$\sum_{i=0}^{l_f}t(e_1|f_i)\prod_{j=2}^{l_e}t(e_j|f_i)$

As a result, you reduce the number of products by one term (in case you don't notice. $j$ is now 2).

So back to our question. You just need to follow above procedure for $l_e$-times, you will end up re-express equation $(4)$ as

$(\sum_{i=0}^{l_f}t(e_1|f_i)) \times (\sum_{i=0}^{l_f}t(e_2|f_i)) \ldots \times (\sum_{i=0}^{l_f}t(e_{l_e}|f_i))$

So here come equation $(5)$

$= \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i)$

Straightly speaking this is not a post for SMT, but more on how one can simplify summands and multiplicands. In many papers, this kind of math is usually skipped and assume the readers know what to do. (I can imagine some friends of mine would say "if you don't know these stuffs, you don't deserve to do ....!") Of course, for mere mortals like us, it could take some time and effort to figure it out.

Direct manipulation is certainly one way to go. There are also interested rules on how summation indices which sometimes greatly simplify the derivation. If you are interested to read more about the topic, I found that Chapter 2 of Concrete Mathematics ([2]) is a good start.

Reference:
[1] Statistical Machine Translation, Philip Koehn 2010.
[2] Concrete Mathematics, Graham et al 2002.

How to use Latex in WordPress?

This is one of these small things it could take you a while to figure out.   In your "Text" tab, what you want to do is type:

$latex your-latex-code-here$

For example

$latex \gamma$ would give you

$\gamma$.

So what could go wrong?

If you mistakenly put space in after the first dollar sign. e.g.
$[SPACE]latex \gamma$

Then you will get the latex code on the screen. (Argh.)

Her are some more complicated examples:

$latex \int_0^1 f(x) \, \mathrm{d} x$ is

$\int_0^1 f(x) \, \mathrm{d} x$

And

$latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$ is

$\displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$

I think this is a useful thing to know because if you copy latex formula from the web it would take you a while to get it right and nice.

Arthur

Geeky:

Is Depression Really Biochemical (AssertTrue)

Meetings are Mutexes (Vivek Haldar)

So True.  It doesn't count all the time you use to prepare a meeting.

Exhaustive Testing is Not a Proof of Correctness

True, but hey.  Writing regression tests is never a bad thing. If you rely only on your brain on testing, it bounds to fail one way or the other.

Apple :

Yahoo:

Yahoo The Marissa Mayer Turnaround

Out of all commentaries on Marissa Mayer's realm.  I think Jean-Louis Gassée goes straight to the point and I agree most.   You cannot use a one size fit all policy.  So WFH is not always appropriate as well.

Management:

The Management-free Organization

Taeuber's Paradox and the Life Expectancy Brick Wall by Kas Thomas

Simplicity is Wonderful, But Not a Requirement by James Hague

Yeah.  I knew a professor who always want to rewrite speech recognition systems such that is easier for research.   Ahh...... modern speech recognition systems are complex any way.   Not making mistakes is already very hard.   Not to say building a good research system which easy to use for everyone. (Remember, everyone has their different research goal.)

Arthur

Toning down on Blogging......

Guess I was too confident again to build up this site.  Once again I feel work took me much time and couldn't work on this blog soon.

It also depends on whether my work are something camera-ready.   Currently I have couple of articles which are ready to hash out but need some refinement.

Let's see if I can come back for month a month later.  For now, you might see a lot of filler materials on this blog.

Arthur

Read it.  It doesn't just apply in research, it applies in any job.

Arthur

Collapse Video

It caught my eyes and made me think......

Arthur

No readings for today ...... but.....

It's a digression but check this out:  This is a heart-breaking story of comedian Anthony Griffin.

Arthur

After the 50th Post, some vision on the Grand Janitor's Blog

I tried to gather some topic ideas on this blog.   From LinkedIn, I hear quite a bit of feedbacks on discussion about decoding algorithms.    I have thought about this a bit.   Here are some sample topics which I will discuss,

• How to setup Recognizer X on a Platform Y?
• What are the interesting properties of Program A in speech recognition package B?
• What are the best practice when you try to use a certain model training package?
• Tuning results in a domain.
• News: certain package has a new version coming out! What are something special?
• Thoughts in speech recognition problem.
• Thoughts in programming problem.
• Algorithms/Data structures for decoding/parameter estimation : like you suggested in this post.
• What is an interesting development direction for a speech recognition package?
• Others: softwares/algorithm in some related fields such as NLP, MT.

Of course, I will keep on chatting about some on-going components of Sphinx which I feel interested,

1. Sphinx 4 maintenance and refactoring
2. PocketSphinx's maintenance
3. An HTKbook-like documentation : i.e. Hieroglyphs.
4. Regression tests on all tools in SphinxTrain.
5. In general, modernization of Sphinx software, such as using WFST-based approach.
Unlike academic journals, I hope this is more an informal hub of ASR information.   I don't know if I can keep on. but at least that's what I am going to try.
In any case, this blog will occupy much of my spare time next year.   Feel free to comments and give me suggestions.  And of course ....... Happy New Year!

Arthur