Thoughts from Your Humble Administrators – Jan 8, 2017

What we have been thinking about last week:

I (Arthur) am traveling, thus the late issue.

  • Our group was once called simply “Deep Learning”.  Of course, all of you have read many “Deep learning is/is not hype” articles.  So you may wonder how to call B.S. on them.
  • There is a trick – does the author mention that deep learning’s viewpoint is to “automatically learn a representation”, instead of relying on human experts or “feature engineering”?   If an article fails to mention this point, then it isn’t worth its salt.
  • “But hey Arthur, isn’t that just your definition of what deep learning is?”  I know you will say that. 🙂  Not really.  This point happens to be mentioned by Hinton in Lecture 2 of his Coursera class, and in Chapter 1 of “Deep Learning” by Goodfellow.   Goodfellow goes on to explain that one of the original meanings of “deep” is that there is a “deep hierarchical representation”.   That’s why DNNs are good and CNNs are good.  RNNs?  If you unroll one, it is deep too.  So perhaps that’s why it is also part of “deep learning”.
  • My point is: I guess it doesn’t really matter whether someone is for or against deep learning; what you want to look at is their arguments.  Do they know what they are talking about?  Hint: most of them don’t, even authors we allow to post in this forum.  But we only ask for relevancy, so if the OP has opinions, we let it go.  You are warned, though, about the validity of some of the blog posts in the forum.
  • CES 2017 is happening.  To AIDLers, perhaps the most relevant things are the intelligent agents and virtual assistants.   At least to me, the one leading the trend recently is the Amazon Echo/Dot, which is powered by Alexa.    Not only should Amazon’s ASR capability be on par with Apple’s, Microsoft’s and Google’s, Alexa also leads in far-field speech recognition (based on beam-forming) as well as keyword wake-up (i.e. the user can trigger recognition by saying “Alexa” instead of pressing a button as in older iOS).  Those are impressive features, and tough to make work well technically.
  • One thing to point out: while the ASR in many of these virtual assistants is very impressive, the dialogue system still lacks a “soul”.   This goes for all chatbots – working in a specific domain is okay, but you will quickly notice it is not real. Jose Diaz asked us if a universal, truly real virtual assistant is possible one day; I can only say it is in a universe far, far away.

Must read: all posts about Mario.  Why? We have loved him since we were young! 🙂

Arthur

If you like this message, subscribe to the Grand Janitor Blog’s RSS feed. You can also find me (Arthur) on Twitter, LinkedIn, Plus and Clarity.fm.  Together with Waikit Lau, I maintain the Deep Learning Facebook forum.  Also check out my awesome employer: Voci.


Some Thoughts on Learning Machine Learning/Data Science

I have been refreshing myself on various aspects of machine learning and data science.  For the most part it has been a very nice experience.   What I like most is that I am finally able to grok many of the machine learning jargon terms people talk about.    They gave me a lot of trouble even as a mere practitioner of machine learning, because most people just assume you have some understanding of what they mean.

Here is a little secret: all these jargon terms can range from very shallow to very deep.  For instance, “lasso” just means setting the exponent of the regularization term to 1, i.e. using an L1 penalty.   I always think it’s just that people don’t want to say the mouthful “set the exponent of the regularization term to 1”, so they came up with “lasso”.
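
To make that concrete, here is a tiny scikit-learn sketch (my own made-up example, not from any textbook): with the L1 penalty of the lasso many coefficients come out exactly zero, while the L2 penalty of ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)  # only two features actually matter

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: exponent 1 on the regularization term
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty: exponent 2, for comparison

print(lasso.coef_)  # most coefficients are exactly 0
print(ridge.coef_)  # all coefficients are small but non-zero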

Then there is the bias-variance trade-off.   Now here is a concept which is very hard to explain well.    What opened my mind is what Andrew Ng said in his Coursera lecture: “just forget the terms bias and variance”.  Then he moves on to talk about over- and under-fitting, which is a much easier concept to understand.   And then he leads you to think: when a model underfits, we have an estimator that has huge bias, and when the model overfits, the estimator allows too much variance.   Now that’s a much easier way to understand it.   Over- and under-fitting can be visualized.   Anyone who understands polynomial regression would understand what overfitting is.  That easily leads you to a eureka moment: “Oh, complex models can easily overfit!”   That’s actually the key to understanding the whole phenomenon.
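
A quick way to see it for yourself (my own toy sketch in scikit-learn, not from the lecture): fit polynomials of increasing degree to noisy data and compare the training and test scores.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.rand(30, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.randn(30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))

# Degree 1 underfits (mediocre on both sets); degree 15 overfits
# (near-perfect on training, poor on test); degree 4 sits in between.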

Not only are people getting better at explaining different concepts, several important ideas are also enunciated better.  E.g. reproducibility is huge, and it should be huge in machine learning as well.   Yet even now you see junior scientists at entry level ignore all the important measures to make their work reproducible.   That’s a pity.  In speech recognition, for example, I remember there was a dark time when training a broadcast news model was very difficult, despite the fact that we knew people had done it before.    How much time do people waste repeating other people’s work?

Nowadays, perhaps I would just ask younger scientists to take Johns Hopkins’ “Reproducible Research”.  No kidding.  Pay $49 to finish that class.

Anyway, that’s my rambling for today.   Before I go: I have been actively engaged in the Facebook Deep Learning group.  It turns out many of the forum users love to hear more about how to learn deep learning.   Perhaps I will write up more on that in the future.

Arthur


Some Notes on scikit-learn

There are many machine learning frameworks, but the one I like most is scikit-learn.  If you use Anaconda Python, it is really easy to set up.   So here are some quick notes:

How to setup a very basic training?

Here is a very simple example:

from sklearn import svm                # import SVM
from sklearn import datasets           # import datasets; we will use the iris dataset

clf = svm.SVC()                        # set up a classifier
iris = datasets.load_iris()            # load in a database
X, y = iris.data, iris.target          # set up the design matrix, i.e. the standard X input matrix and y output vector
clf.fit(X, y)                          # do training

from sklearn.externals import joblib
joblib.dump(clf, 'models/svm.pkl')     # dump the model as a pickle file
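
The snippet above trains on all of the data.  If you want a quick sanity check of the classifier, a minimal sketch with a held-out test set (assuming a scikit-learn version that has sklearn.model_selection) looks like this:

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf = svm.SVC()
clf.fit(X_train, y_train)           # train on 80% of the data
print(clf.score(X_test, y_test))    # mean accuracy on the held-out 20%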

Now a common question is: what if you have a different type of input?  So here is an example with CSV file input.  The original example comes from machinelearningmastery.com:

# Load the Pima Indians diabetes dataset from a CSV URL
import numpy as np
import urllib  # note: Python 2; on Python 3, use urllib.request.urlopen below instead
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:7]
y = dataset[:,8]

from sklearn import svm
clf = svm.SVC()
clf.fit(X, y)

from sklearn.externals import joblib
joblib.dump(clf, 'models/PID_svm.pkl')
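
To reuse the pickled model later, you can load it back and call predict.  A minimal sketch (assuming the design matrix X from above is still around):

from sklearn.externals import joblib       # on newer scikit-learn, just `import joblib`

clf2 = joblib.load('models/PID_svm.pkl')   # load the classifier back from disk
print(clf2.predict(X[:5]))                 # predictions for the first five rows of X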

 

That’s pretty much it. If you are interested, also check out some cool text classification examples here.

Arthur


Why My 10 Years of Blogging Is More Like 5 or 6 ……

Recently I imported all my posts from the old “Grand Janitor Blog“, or “V1” (as this blog is officially called “The Grand Janitor Blog V2”). Lo and behold!  It turns out I have been blogging for almost 10 years.  It started slowly in the first few years when I was working at Scanscout, and it got even slower when I was at BBN, mostly because life at both Scanscout and BBN was very stressful.   Just finishing all my work was tough, let alone blogging.

I guess another reason was the perceived numbness of being a technical person.  The 10 years between 25 and 35 are the years when you question the meaning of your career.    Both Scanscout and BBN promised me something great: Scanscout? Financial reward.  BBN? Lifetime financial stability.   But when I worked at those two companies, there was always a sense of loss.    I questioned (admittedly….. sometimes incorrectly) the purpose of some of the tasks numerous times.   I became a person who was known to be quarrelsome, and without too many good reasons.

Or in general, there were moments I just didn’t care.  I didn’t care about work, let alone caring to further myself or to write up what I learned.    That’s perhaps why I didn’t blog as much as I should have.   I could make an excuse and say it was due to contractual obligations imposed by the companies.   But the truth is I hadn’t pulled myself together as a good technical person.   That’s what I mean by numbness as a technical person.

It changed around 4-5 years ago when I joined Voci.   Not trying to brag, but Voci has a great culture.  We were encouraged to come up with innovative solutions.  If we failed, we were encouraged to learn from the failure.

I was also given an interesting task: to maintain and architect Voci’s new speech recognition engine.   Being a software architect for machine learning software has always been my strength.   Unfortunately, I had only been able to use that skill at CMU.

That’s perhaps why I started to care more – I cared about debugging, which was once a boring and tedious process to me.   I cared about the build system, which I had always relied on other “experts” to fix.   I cared about machine learning, which I had always thought was just a bunch of meaningless math.   Then I also cared about math again, just like when I was young.

So the above is the brief history of my last 10 years as a technical person and why I couldn’t blog as much. I want to say one important thing here:  I want to take personal responsibility for my lack of productivity during some of those years.  Maybe I could make an excuse and say such-and-such company was managed poorly, but I don’t want to.  For the most part, all the companies I worked for were managed by very smart and competent people.   Sure, they had issues.   But not being able to learn was really a me-thing.

I believe the antidote to the numbness is to learn.  Learn as much as you can, learn as widely as you can.   And don’t give up; one day the universe will give you a break.

As I am almost 40, my wish is to blog for another 40 years.

Arthur


Using ARPA LM with Python

During Christmas, I tried to do some small fun hacking with language modeling.  That obviously requires reading and evaluating an LM.   There are many ways to do so, but here is a new method which I really like: using the Python interface of KenLM.

So here is a note for myself (largely adapted from Victor Chahuneau):

Install boost 1.60.0

To install KenLM, you first need to install boost.  And to build boost, you need to install libbz2.

  1. First, install libbz2:
    sudo apt-get install libbz2-dev
  2. Then build boost 1.60.0: download it here, then type
    ./bootstrap.sh
    and finally
    ./b2 -j 4
  3. Install boost (you may need root permission):
    ./b2 install

Install KenLM

Now we install KenLM.  I am using Victor’s copy here.

git clone https://github.com/vchahun/kenlm.git
pushd kenlm
./bjam
python setup.py install
popd

Training an LM

Download some books from Project Gutenberg.  I am using Chambers’s Journal of Popular Literature, Science, and Art, No. 723, and I got this file:

50780-0.txt 
So all you need to do to train a trigram model (the -o 3 flag sets the order) is:

cat 50780-0.txt | /home/archan/src/github/kenlm/bin/lmplz -o 3 > yourLM.arpa

Then you can binarize the LM, which is the part I like about KenLM; it feels snappier and faster than other toolkits I have used.

/home/archan/src/github/kenlm/bin/build_binary yourLM.arpa yourLM.klm

Evaluate a Sentence with an LM

Write a python script like this:

import kenlm
model = kenlm.LanguageModel('yourLM.klm')
score = model.score('i like science fiction')  # total log10 probability of the sentence
print(score)
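
A quantity you often want next is perplexity.  This is not in Victor’s note, but a minimal sketch (assuming, as is KenLM’s convention, that score() returns the total log10 probability of the sentence including the end-of-sentence token) looks like this:

import kenlm

model = kenlm.LanguageModel('yourLM.klm')
sentence = 'i like science fiction'

logprob = model.score(sentence)        # total log10 probability
n_tokens = len(sentence.split()) + 1   # the words plus the predicted end-of-sentence token
perplexity = 10.0 ** (-logprob / n_tokens)
print(perplexity)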

Arthur


Doing It

I was usually in a group, trying to tackle a task. We might be missing a certain implementation, a certain proof, or some of us might lack basic knowledge of a certain programming language. And you can guess the consequence: all planning efforts stop there. One of us would say “Let’s not pursue that path. There’s too much risk and unknown.”

I pondered these situations for a while. There is a surprisingly simple solution to all of these problems: spend a little time trying out a simple solution.

For example, when you don’t know a certain language, can you write an “if”-clause or even a “Hello World” in that language? Further, can you write a for-loop? For me, googling this kind of question for around 10 minutes usually gives me enough basic knowledge of the language to implement a simple program. And surprisingly, that’s what ~70% of tasks need – repeating an action (a for-loop) under certain conditions (an if-statement).

Another example: suppose you don’t have an implementation of, say, linear regression.   Can you simplify the problem first? E.g. can you first implement linear regression with one variable? (I.e. only the linear term.)  If you can do that, can you then extend your implementation to multiple variables?   Implementing an algorithm, unlike understanding an algorithm, doesn’t take as much time.   So it’s silly to get caught up before you start.
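
To make this concrete, here is a rough sketch of the one-variable case in plain numpy (my own toy example, with made-up numbers), which you could later extend to multiple variables:

import numpy as np

# Toy data: y is roughly 2*x + 1 plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Closed-form least squares for y = w*x + b (one variable, linear term only)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(w, b)  # should come out close to 2 and 1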

As a final example, how about learning a certain machine learning algorithm?  What should you do first?  My approach would be to first run some existing examples of the algorithm.  For example, if you want to learn the basics of SVM, try to run examples from the libsvm tutorial.    If you want to learn DNNs, go through the Deep Learning Tutorial.    For sure, you will not get every nuance of machine learning this way.  E.g. how do you make a decision when you get stuck?  How do you decide whether your machine learning algorithm is suitable for the task?   Yes, those take experience and a more in-depth methodology.  (Check out Andrew Ng’s Machine Learning class.) But learning these algorithms top-down, or solely based on mathematics, is unlikely to get you very far.

One common retort from people opposed to just doing it is that you usually don’t get state-of-the-art performance in the first few passes.   I do agree with parts of that viewpoint.  Sometimes more planning gives you a more refined first solution.  But most of the time you don’t conjure knowledge out of nothing; you build up knowledge.   It’s useless to be too caught up in thinking sometimes.

I am also not opposed to understanding.   For example, doing derivations and proofs for an algorithm always helps your implementation.   But in practice, getting a feel for an algorithm also helps you understand it.

If you ask me, what is really shameful is that we are sometimes blocked by a subtle fear of learning something new.   That doesn’t do any good for you or the team.  That’s what I am driving at in this article.

Arthur

Reference:
I was inspired by this post : “Concrete, Then Abstract“.  I also highly recommend “Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values” to my readers.


The Simplification Trick in IBM Model 1

As a daily reading routine, I try to browse through “Statistical Machine Translation” by Koehn, a famous researcher in the field.  Of course, the basics of SMT are the IBM Models, and the basic of the IBM Models is IBM Model 1.   So one key derivation is how a sum over an exponential number of alignments ($latex (l_f+1)^{l_e}$ of them) reduces to a computational complexity linear in $latex l_f$ and $latex l_e$.   It boils down to the derivation at equation 4.10:

(Here I assume the readers know the basic formulation of t-tables. Please refer to Ch 4 of [1])

$latex p(e|f)$

$latex =\sum_{a} p(e,a|f) ….. (1)$

$latex = \sum_{a(1)=0}^{l_f}\ldots\sum_{a(l_e)=0}^{l_f} p(e,a|f) …… (2)$

$latex = \sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)}) …… (3)$

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\sum_{a(1)=0}^{l_f} \ldots\sum_{a(l_e)=0}^{l_f}\prod_{j=1}^{l_e} t(e_j|f_{a(j)}) …… (4)$

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i) …… (5)$

It must be trivial to some very smart people, but how come $latex (4)$ becomes $latex (5)$ all of a sudden?

It turns out this is an important trick: it is used again in deriving the count function $latex c$ in equation 4.13 (an exercise at the back of Chapter 4), and the same trick is also used in Model 2 when the alignment probability is modeled. So it is worth a bit of our time to understand why it works.

Usually this kind of expression reduction can be formally proved by mathematical induction. But as noted in [2], mathematical induction doesn’t usually give you useful insight. So let us use another approach.

Let us first consider the nested sums in equation $latex (4)$, focusing on the sum over $latex a(1)$ while holding the other alignment variables $latex a(2), \ldots, a(l_e)$ fixed:

$latex \sum_{a(1)=0}^{l_f} \prod_{j=1}^{l_e}t(e_j|f_{a(j)}) …… (6)$

So further we consider the first term of the expression at $latex (6)$, i.e. the term with $latex a(1)=0$. What is it? It’s simply:

$latex t(e_1 | f_0) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

The second term, with $latex a(1)=1$?

$latex t(e_1 | f_1) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

So you probably start to get the drift. The term of expression $latex (6)$ with $latex a(1)=m$ is

$latex t(e_1 | f_m) \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$.

Since all $latex l_f+1$ terms have the common factor $latex \prod_{j=2}^{l_e}t(e_j|f_{a(j)})$, you can factorize $latex (6)$ as

$latex (t(e_1|f_0) + t(e_1|f_1) + \ldots + t(e_1 | f_{l_f}))\prod_{j=2}^{l_e}t(e_j|f_{a(j)}) $

or

$latex \sum_{i=0}^{l_f}t(e_1|f_i)\prod_{j=2}^{l_e}t(e_j|f_{a(j)}) $

As a result, you have reduced the number of factors inside the remaining sums by one (in case you didn’t notice, $latex j$ now starts at 2).

So back to our question. You just need to repeat the above procedure $latex l_e$ times, once for each of $latex a(1), \ldots, a(l_e)$, and you end up re-expressing the sums in equation $latex (4)$ as

$latex (\sum_{i=0}^{l_f}t(e_1|f_i)) \times (\sum_{i=0}^{l_f}t(e_2|f_i)) \ldots \times (\sum_{i=0}^{l_f}t(e_{l_e}|f_i))$

So here comes equation $latex (5)$:

$latex = \frac{\epsilon}{(l_f + 1)^{l_e}}\prod_{j=1}^{l_e}\sum_{i=0}^{l_f}t(e_j|f_i)$
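
If you want to convince yourself numerically, here is a quick brute-force check in Python (my own sketch, with a random toy t-table; the variable names are made up for illustration): the sum over all $latex (l_f+1)^{l_e}$ alignments of the product of t-values agrees with the product of per-word sums.

import itertools
import numpy as np

l_f, l_e = 3, 4                      # small toy sentence lengths
t = np.random.rand(l_e, l_f + 1)     # t[j, i] plays the role of t(e_{j+1} | f_i)

# Left-hand side: sum over all (l_f+1)^{l_e} alignments of the product of t-values
lhs = sum(np.prod([t[j, a[j]] for j in range(l_e)])
          for a in itertools.product(range(l_f + 1), repeat=l_e))

# Right-hand side: product over j of the sum over i, as in equation (5)
rhs = np.prod([t[j, :].sum() for j in range(l_e)])

print(lhs, rhs)                      # the two numbers agree up to floating point error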

Strictly speaking this is not a post about SMT, but more about how one can simplify summands and multiplicands. In many papers, this kind of math is usually skipped on the assumption that readers know what to do. (I can imagine some friends of mine would say “if you don’t know these stuffs, you don’t deserve to do ….!”) Of course, for mere mortals like us, it can take some time and effort to figure it out.

Direct manipulation is certainly one way to go. There are also interesting rules for manipulating summation indices which sometimes greatly simplify the derivation. If you are interested in reading more about the topic, I found that Chapter 2 of Concrete Mathematics ([2]) is a good start.

Reference:
[1] Statistical Machine Translation, Philip Koehn 2010.
[2] Concrete Mathematics, Graham et al 2002.


How to use Latex in WordPress?

This is one of those small things that could take you a while to figure out.   In your “Text” tab, what you want to do is type:

$latex your-latex-code-here$

For example

$latex \gamma$ would give you

$latex \gamma$.

So what could go wrong?

If you mistakenly put a space after the first dollar sign, e.g.
$[SPACE]latex \gamma$ 

Then you will get the latex code on the screen. (Argh.)

Here are some more complicated examples:

$ latex \int_0^1 f(x) \, \mathrm{d} x $ is

$latex \int_0^1 f(x) \, \mathrm{d} x$

And

$ latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}. $ is

$latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$

I think this is a useful thing to know, because if you copy a latex formula from the web it can take you a while to get it looking right.

Arthur


Tuesday’s Links (Meetings and more)

Geeky:

Is Depression Really Biochemical (AssertTrue)

Meetings are Mutexes (Vivek Haldar)

So true.  And that doesn’t count all the time you spend preparing for the meeting.

Exhaustive Testing is Not a Proof of Correctness

True, but hey – writing regression tests is never a bad thing. If you rely only on your brain for testing, it is bound to fail one way or another.

Apple :

Apple’s iPhone 5 debuts on T-Mobile April 12 with $99 upfront payment plan
iWatchHumor (DogHouseDiaries)

Yahoo:

Yahoo The Marissa Mayer Turnaround

Out of all the commentaries on Marissa Mayer’s reign, I think Jean-Louis Gassée’s goes straight to the point, and it is the one I agree with most.   You cannot use a one-size-fits-all policy, so WFH is not always appropriate either.

Management:

The Management-free Organization


Readings at Feb 28, 2013

Taeuber’s Paradox and the Life Expectancy Brick Wall by Kas Thomas

Simplicity is Wonderful, But Not a Requirement by James Hague

Yeah.  I knew a professor who always wanted to rewrite speech recognition systems so that they would be easier to use for research.   Ahh…… modern speech recognition systems are complex anyway.   Not making mistakes is already very hard, not to mention building a good research system that is easy for everyone to use. (Remember, everyone has a different research goal.)

Arthur