Statistically Insignificant Me

Slightly related to my last post, this touches on an interesting question: should we share our bookshelves in the first place?

Why is it an issue? Well, privacy. Suppose someone malicious is trying to figure you out. The best way is to gather all the information about you they can and use it against you.

Another concern of mine is rather interesting and absolutely speculative: what if the information I read affects my thoughts, and what if people could reconstruct those thoughts just from the information I read? That would open up a lot of interesting applications; e.g., we might be able to better predict what a person will do.

Just like other time-series problems such as speech recognition and quantitative analysis, a human life could simply be defined as a series of timed events. Some (I forget the quote) believe that one human life could be stored on a hard disk, and some have started collecting records of human lives to see whether they can be modeled.

Information about what you read tells a lot about who you are. Do you read Arthur C. Clarke? Do you read Jane Austen? Do you read Stephen King? Do you read Lora Roberts? From that information, one could build a machine learner that reverse-maps to who you are and how you make decisions. We might call this a kind of personality modeling.
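As a toy sketch of what such a reverse mapping could look like, here is a minimal nearest-neighbor matcher over bags of authors. Everything in it, the profile labels especially, is invented for illustration; a real personality model would need real data and a real learner.

```python
# Toy "personality modeling" from bookshelves: nearest-neighbor matching
# over sets of authors. All profiles and labels below are made up.

def jaccard(a, b):
    """Jaccard similarity between two sets of authors."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical labeled bookshelves: author sets -> crude profile labels.
profiles = {
    "sci-fi rationalist": {"Arthur C. Clarke", "Isaac Asimov", "Philip K. Dick"},
    "romance reader":     {"Jane Austen", "Lora Roberts", "Charlotte Bronte"},
    "thriller fan":       {"Stephen King", "Dean Koontz", "Thomas Harris"},
}

def guess_profile(bookshelf):
    """Return the profile label whose author set best matches the shelf."""
    return max(profiles, key=lambda label: jaccard(profiles[label], bookshelf))

shelf = {"Arthur C. Clarke", "Stephen King", "Isaac Asimov"}
print(guess_profile(shelf))  # sci-fi profile shares two of the three authors
```

With real data one would of course use many more features than author names (genres, reading pace, re-reads), but the shape of the idea is the same: map reading history into a space where similar readers land close together.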

It seems to me all of this is entirely possible given what we know. Yet I still decided to share my bookshelf. Why?

Well, there was a crystal-clear moment for me (and perhaps for you as well) that helped me make the decision. Very simple: *I* am statistically insignificant.

If you happen to come to this web page, the only reason is that you are connected to me. How likely is that to happen?

I know about 150 people in my life. The world has about 6 billion. So the chance of my being discovered is around 150 / 6×10^9 ≈ 2.5 × 10^-8, which is already pretty low.

Now, when people who know me recommend me to someone else, this probability gets boosted, because 1) my PageRank will increase, and 2) people who follow my links deep enough will eventually discover my bookshelves.

Yet if I try to stay low-profile (say, by not doing SEO and not recommending my page to any friends), it is reasonable to expect that boosting factor to stay smaller than 1.

Further, 2.5 × 10^-8 is an upper bound as an estimate, because:
1, Not all my friends are interested in me (discounting factor: 0.6, a conservative one; the actual number is probably lower, but I just don't want to face it. 😉 )
2, My friends who are interested in me might not follow my links (discounting factor: 0.01).

So we are talking about an event with a probability as low as 10^-9 or 10^-10 here. To me, that is close to what a cheap cryptographic algorithm offers.
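This back-of-the-envelope arithmetic can be written out explicitly. A minimal sketch, where the population figure and both discounting factors are just my guesses:

```python
# Back-of-the-envelope estimate of the probability that a random person
# discovers this page. Every number here is a rough guess, not data.

friends = 150                  # people I know personally
world_population = 6e9         # roughly the world population at the time

# Chance that a random person is connected to me at all (the upper bound).
p_connected = friends / world_population

interested = 0.6               # fraction of friends actually interested in me
follows_links = 0.01           # fraction of those who follow my links this far

# Probability that someone actually reaches the bookshelf.
p_discovered = p_connected * interested * follows_links

print(f"upper bound:       {p_connected:.1e}")   # 2.5e-08
print(f"after discounting: {p_discovered:.1e}")  # 1.5e-10
```

Multiplying independent discounts like this is the crudest possible model, but for deciding whether to share a bookshelf, an order of magnitude is all that matters.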

But notice: my security comes not from hiding or cryptography. It comes merely from my statistical insignificance. In plain English, I am very open, but no one cares. And I am still a happy treebear. 😉

That's why you can see my bookshelf. A long story for a simple decision. If you happen to read this, I hope you enjoyed it.


Visual Bookshelves

I love to read and like to write a review of every book I read. None of them will change the world, but I still love doing it. That's why, by definition, I'm a bookworm. I'm not even shy about it. 😉

I went quite far: I tried to record every book I read and started putting them in a blog called "ContentGeek". Luckily, I hadn't gone very far, because once I discovered Visual Bookshelves, there was no need for me to do it all myself.

Visual Bookshelves allows users to look up a book from Amazon, add comments, and store it in a database. It also shows the covers of the books. What more could I want?

So anyway, here is the link to my visual bookshelves:



David's plan on Sphinx 3.7

A great read; it touches the heart of the implementation issues of all sphinxen. And its criticism of my implementation is right on point.

I felt very relieved when the current maintainer attacked what I did in the past. (Some features I implemented were rather stupid.) It shows that Sphinx is still alive and will stay alive.


Life in Scanscout

Hi Guys,
Scanscout is a rather interesting company. If you read this blog, you probably know that I have been there for a while.

My direct supervisor doesn't like to give away too much. I think he has a point (he is a *very* smart guy). This contradicts my philosophy of information sharing. So, as a compromise, here are a couple of things I can share. (Of course, my estimate of the probability of anyone looking at this blog is about 10^-9, so I guess it doesn't matter that much......)

1, We have a massage chair, and it is awesome.
2, We have a foosball table and hold a tournament every Friday. Beware, there are several good players. (I always get the lowest score.)
3, It is at the forefront of video advertising. I am glad I joined. 🙂

Arthur Chan



Ah, this is not exactly news. It has been around since the 2006 Johns Hopkins workshop.

mosesdecoder is probably the first open-source statistical machine translation implementation in the world. For quite a while, only the IBM-model training portion of the pipeline was available as open source, in GIZA++. So people interested in SMT would probably turn to Pharaoh, a closed-source implementation available on the web.

I could have some fun. 😉


Third Draft of Hieroglyphs

Hi all,

It has been a while since I worked on Hieroglyphs (the fancy name I made up for the Sphinx documentation). This is perhaps the only thing I haven't wrapped up at CMU. Therefore I decided to release a draft. You can find it


It still looks pretty messy, but it is starting to look like a book now.

Several chapters and sections were trimmed in this draft. You will still see a lot of "?" marks; those are signs of insufficient proofreading. Forgive me; when I have more time, I will try to fix them in the near future.

Grand Janitor

Left CMU

Hi Guys,
It was a sad decision. After long soul-searching, I decided to leave CMU and join a startup company called Scanscout. I must be out of my mind!!

Anyway, my new job requires knowledge of speech recognition, information retrieval, and video processing. These are all good fits for me. I can tell you I am having a lot of fun!

Sphinx, in particular the trio of Sphinx 3.X, SphinxTrain, and CMULMTKV3, is now maintained by David Huggins-Daines and Evandro Gouvea. I still keep a nominal maintainership, but these two are the true heroes of the story now.

However, feel free to chat with me about anything related to language processing. I am more than happy to help.

Arthur Chan

Sphinx 3.6 is officially released

Sphinx 3.6 Official Release 
The Sphinx 3.6 official release includes all changes found in Sphinx 3.6 RC I.
From 3.6 RC I to 3.6 official:
New Features:  
-Added support for Sphinx 2-style semi-continuous HMMs in Sphinx 3.
-Added sphinx3_continuous, which performs on-line decoding on both Windows and Linux platforms.
-Synchronized the frontend with Sphinx 2, adding an implementation of VTLN (i.e. -warp_type = inverse_linear, piecewise_linear, affine).
-The prefix "sphinx3_" has been added to the programs align, allphone, astar, dag, decode, decode_anytopo, and ep to avoid name confusion on some Unix systems.
For Developers:  
-All public headers (*.h) are now put under $root/include instead of the same directories as their source .c files.
-The directory libutil has been renamed libs3util.
-Sphinx3, as well as all other modules in the CMU Sphinx 
project, is now versioned by Subversion. 
Bug Fixes:  
-[1459402] A serious memory relocation bug is fixed.
-In RC I, -dither was not properly implemented; this has been fixed.
Known Problem:  
-When the model contains NaN values, the output of the result will be abnormal. At this point, this issue is resolved in SphinxTrain.
Sphinx 3.6 Release Candidate I 
The corresponding SphinxTrain tag is SPHINX3_6_CMU_INTERNAL_RELEASE.
One can check out the matching SphinxTrain for the Sphinx 3.6 release with the command:
svn co
A Summary of Sphinx 3.6 RC I 
Sphinx 3.6 is a gently refactored version of Sphinx 3.5. Our programming is defensive, and we aim only at further consolidating and unifying our code bases in Sphinx 3.
Despite our defensive programming, several interesting new features can still be found in this release. Their details are in the "New Features" section below. Here is a brief summary:
1, Further speed-up of CI-GMMS in the 4-level GMM computation scheme (4LGC).
2, Multiple regression classes and MAP adaptation in SphinxTrain.
3, Better support for using LMs in Sphinx 3.X.
4, FSG search is now supported. This is adapted from Sphinx 2.
5, Support for full-triphone search in the flat-lexicon search.
6, Some support for character sets other than the default. Models in multiple languages are now tested in Sphinx 3.X.
We hope you enjoy this release candidate. In the future, we will continue to improve the quality of CMU Sphinx and its related software.
New Features  
-Speaker Adaptation: 
a, Multiple regression classes (phoneme-based) are now supported.
-GMM Computation  
a, Improvements to CI-GMMS are now incorporated.
i, One can specify an upper limit on the number of CD senones computed in each frame with -maxcdsenpf.
ii, The best Gaussian index (BGI) is now stored and can be used as a mechanism to speed up GMM computation.
iii, A tightening factor (-tighten_factor) is introduced to smooth between the naive down-sampling technique and CI-GMMS.
b, Support for SCHMM and FCHMM.
i, decode now fully supports computation of SCHMM.
-Language Model  
a, Reading an LM in ARPA text format is now supported. Users now have the option to bypass lm3g2dmp.
b, The live-decoding API now supports switching of language models.
c, Full support for class-based LMs. See also the Bug Fixes section.
d, lm_convert is introduced; it supersedes the functionality of lm3g2dmp. Not only can lm_convert convert an LM from TXT format to DMP format, it can also do the reverse.
This part details the changes we made in the different searches.
In 3.6, the collection of algorithms can all be used under a single executable, decode. decode_anytopo is still kept for backward-compatibility purposes.
decode now supports three modes of search:
Mode 2 (FSG): FSG search (adapted from Sphinx 2).
Mode 3 (FLAT): Flat-lexicon search (the original search in decode_anytopo in 3.X, X < 6).
Mode 4 (TREE): Tree-lexicon search (the original search in decode in 3.X, X < 6).
Some of these functionalities are only applicable in one particular search. We mark them with FSG, FLAT, and TREE.
a, One can now use -bt_wsil to control whether silence should be used as the ending word. (FLAT, TREE)
b, In FLAT, full triphones can be used instead of multiplexed triphones.
c, FSG is a newly added routine in 3.6, adapted from Sphinx 2.5.
a, -dither is now supported in live_pretend and live_decode; the initial seed can be set with the -seed option. (Jerry Wolf will be very happy about this feature.)
a, One can turn on the built-in letter-to-sound rules in dict.c using -lts_mismatch.
b, The current Sphinx 3.6 is tested to work on setups for English, Mandarin Chinese, and French.
c, Changes in allphone: allphone can now generate a match and a matchseg file, just like the decode* recognizers.
Bug fixes 
-Miscellaneous memory leaks fixed in the tree search (mode 4).
-The class-based LM initialization routine used to switch the order of the word insertion penalty and the language model weight. This is now fixed.
-An assertion in vithist.c is now turned into an error message. Instead of stopping the whole program, decoding will simply fail for that sentence. We suspect this was the problem that caused possible memory wipe-outs in Sphinx 3.4 and 3.5.
-The number of CI phones can now be at most 32767 (instead of 127).
-[1236322]: libutil str2words special-character bugs.
Behavior Changes 
-The endpointer (ep) now uses the s3 log computation.
-Multi-stream GMM computation no longer truncates the pdf to 8 bits. This avoids programmer confusion.
-Except in allphone and align, when .cont. is used in -senmgau, the code automatically uses the fast GMM computation routine. To make sure multi-stream GMM computation is in effect, one needs to specify .s3cont.
-The executable dag hadn't accounted for the language weight. This issue is now fixed.
-(See also Bug Fixes) decode will now return an error message when vithist is fed a history of -1. Instead of asserting, the recognizer will dump a warning message. Usually this means the beam widths need to be increased.
Functions still under test  
-Encoding conversion in lm_convert.  
-LIUM contribution: LMs can now be represented in AT&T FSM format.
Known bugs 
-In confidence estimation, the computations of the forward and backward posterior probabilities mismatch.
-In allphone, the scores generated in the matchseg file are sometimes very low.
-The regression test on the second-stage search still has bugs.
Corresponding changes in SphinxTrain 
Please note that SphinxTrain is distributed as a separate package; you can get it with:
svn co 
-Support for generation of MAP and multiple-class MLLR.
-Support for BBI tree generation.