Categories
Uncategorized

Some Notes on scikit-learn

There are many machine learning frameworks, but the one I like most is scikit-learn.  If you use Anaconda Python, it is really easy to set up.   So here are some quick notes:

How do you set up very basic training?

Here is a very simple example:

from sklearn import svm            # import the SVM module
from sklearn import datasets       # import the datasets module; we will use the iris dataset

clf = svm.SVC()                    # set up a classifier
iris = datasets.load_iris()        # load in a dataset
X, y = iris.data, iris.target      # set up the design matrix, i.e. the standard X input matrix and y output vector
clf.fit(X, y)                      # do the training

from sklearn.externals import joblib
joblib.dump(clf, 'models/svm.pkl') # dump the model as a pickle file
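Later you can load the model back and classify a new sample. Here is a minimal sketch, assuming the same 'models/svm.pkl' path as above (the sample values below are just made-up iris-like measurements):

from sklearn.externals import joblib

clf = joblib.load('models/svm.pkl')           # load the trained SVM back
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))    # predict the class of one sample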

Now a common question is: what if you have a different type of input? So here is an example with CSV file input. The original example comes from machinelearningmastery.com:

# Load the Pima Indians diabetes dataset from a CSV URL
import numpy as np
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the input features (columns 0-7) from the target attribute (column 8)
X = dataset[:,0:8]
y = dataset[:,8]

from sklearn import svm
clf = svm.SVC()
clf.fit(X, y)

from sklearn.externals import joblib
joblib.dump(clf, 'models/PID_svm.pkl')
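The snippet above trains and saves the model without ever measuring accuracy. Here is a hedged sketch of how one might hold out a test set before training; the 80/20 split is just an illustrative choice:

from sklearn import svm
from sklearn.model_selection import train_test_split  # in older scikit-learn this lives in sklearn.cross_validation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = svm.SVC()
clf.fit(X_train, y_train)           # train on 80% of the data
print(clf.score(X_test, y_test))    # accuracy on the held-out 20%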

 

That’s pretty much it. If you are interested, also check out some cool text classification examples here.

Arthur

Categories
Uncategorized

Why My 10 Years of Blogging Is More Like 5 or 6 ……

Recently I imported all my posts from the old “Grand Janitor Blog“, or “V1” (as this blog is officially called “The Grand Janitor Blog V2”). Lo and behold!  It turns out I have been blogging for almost 10 years.  It started slowly in the first few years when I was working at Scanscout, and it got even slower when I was at BBN, mostly because life at both Scanscout and BBN was very stressful.   Just finishing all my work was tough, let alone blogging.

I guess another reason was the perceived numbness of being a technical person.  The 10 years between 25 and 35 are the years when you question the meaning of your career.    Both Scanscout and BBN promised me something great: Scanscout? Financial reward.  BBN? Lifetime financial stability.   But when I worked at those two companies, there was always a sense of loss.    I questioned (admittedly….. sometimes incorrectly) the purpose of some of the tasks numerous times.   I became a person known to be quarrelsome, and without too many good reasons.

Or in general, there were moments I just didn’t care.  I didn’t care about work, let alone caring to further myself or to write up what I learned.    That’s perhaps why I didn’t blog as much as I should have.   I could make an excuse and say this was due to contractual obligations imposed by the companies.   But the truth is I hadn’t pulled myself together as a good technical person.   That’s what I mean by numbness as a technical person.

It changed around 4-5 years ago when I joined Voci.   Not trying to brag, but Voci has a great culture.  We were encouraged to come up with innovative solutions.  If we failed, we were encouraged to learn from the failure.

I was also given an interesting task: to maintain and architect Voci’s new speech recognition engine.   Being a software architect for machine learning software has always been my strength.   Unfortunately, before that I had only been able to use that skill at CMU.

That’s perhaps why I started to care more – I cared about debugging, which was once a boring and tedious process to me.   I cared about the build system, for which I had always relied on other “experts” to fix any issues.  I cared about machine learning, which I had always thought was just a bunch of meaningless math.   Then I also cared about math again, just like when I was young.

So the above is a brief history of my last 10 years as a technical person and why I couldn’t blog as much. I want to say one important thing here:  I want to take personal responsibility for my lack of productivity during some of those years.  Maybe I could make an excuse and say such-and-such company was managed poorly, but I don’t want to.  For the most part, all the companies I worked for were managed by very smart and competent people.   Sure, they had issues.   But not being able to learn was really a me-thing.

I believe the antidote to the numbness is to learn.  Learn as much as you can, learn as widely as you can.   And don’t give up; one day the universe will give you a break.

As I am almost 40, my wish is to blog for another 40 years.

Arthur

Categories
ANN deep learning Language Modeling SMT

Some Speculations On Why Microsoft Tay Collapsed

Microsoft’s Tay, following Google’s AlphaGo, was meant to be yet another highly intelligent A.I. program fulfilling humanity’s long-standing dream: a machine which can truly converse.   But as you know, Tay failed spectacularly.  To me, this is a highly unusual event, partly because Microsoft’s other conversational agent, Xiaoice, was extremely successful in China.   The other part is that MSR is one of the leading sites on using deep learning in various machine learning problems.   You would think that major P.R. problems such as Tay confirming “Donald Trump is the hope” and purportedly supporting genocide would be weeded out before launch.

I read many posts in the past week that attempted to describe why Tay failed, but sadly they offered me no insights.  Some were even written by respected magazines; e.g. in The New Yorker‘s “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter”, the author concluded at the end,

“If there is a lesson to be learned, it is that consciousness wants conscience. Most consumer-tech companies have, at one time or another, launched a product before it was ready, or thought that it was equipped to do something that it ended up failing at dismally. “

While I always love the prose from The New Yorker, there is really no machine which can mimic/model human consciousness (yet).   In fact, no one really knows how “consciousness” works; it’s also tough to define what “consciousness” is.   And it’s worthwhile to mention that chatbot technology is not new.   Google had released similar technology and got great press.  (See here.)  So The New Yorker piece reflects how much the public does not understand the technology.

As a result, I decided to write a Tay postmortem myself, and offer some thoughts on why this problem could occur and how one could actively avoid such problems.

Since I am writing this piece for a general audience (say, my Facebook friends), it contains only a small amount of technical detail.   If you are interested, I also list several more technical articles in the reference section.

How does a Chatbot work?  The Pre-Deep Learning Version

By now, all of us have used a chatbot or two.  There is obviously Siri, which is perhaps the first program to put speech recognition and dialogue systems in the national spotlight.  If you are familiar with the history of computing, you would probably know ELIZA [1], the first example of using a rule-based approach to respond to users.

What does that mean?  In such a system, usually a natural language parser is used to parse the human’s input, and the system then comes up with an answer using some pre-defined and mostly manually written rules.    It’s a simple approach, but when done correctly, it creates an illusion of intelligence.

The rule-based approach can go quite far.  E.g. the ALICE language is a pretty popular tool to create intelligent-sounding bots. (History as shown here.)   There are many existing tools which help programmers to create dialogue.   Programmers can also import existing dialogues into their own systems.

The problem with the rule-based approach is obvious: the responses are rigid.  So if someone uses the system for a while, they will easily notice they are talking with a machine.  In a way, you can say the illusion is easily dispelled by human observation.

Another issue with the rule-based approach is that it taxes programmers to produce a large-scale chatbot.   Even with convenient languages such as AIML (Artificial Intelligence Markup Language, used by ALICE), it would take a programmer a long, long time to come up with a chatbot, let alone one which can answer a wide variety of questions.

Converser as a Translator

Before we go on to look at chatbots in the time of deep learning, it is important to ask how we can model conversation.   Of course, you can think of it as … well… we first parse the sentence, generate entities and their grammatical relationships, and then, based on those relationships, we come up with an answer.

This approach of decomposing a sentence into its elements is very natural to human beings.   In a way, this is also how the rule-based approach arose in the first place.  But we just discussed the weakness of the rule-based approach, namely, that it is hard to program and generalize.

So here is a more convenient way to think about it: you could simply ask,  “Hey, now I have an input sentence, what is the best response?”    It turns out this is very similar to the formulation of statistical machine translation:   “If I have an English sentence, what would be the best French translation?”    As it turns out, a converser can be built with the same principles and technology as a translator.    So all the powerful technology developed for statistical machine translation (SMT) can be used to make a conversation bot.   This technology includes IBM models, phrase-based models, and syntax models [2], and the training is very similar.
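To make the analogy concrete, here is the standard noisy-channel formulation rewritten for a chatbot; this is my own paraphrase of the usual SMT equations, not anything specific to a particular system. Given an input message $latex m$, pick the response

$latex r^{*}=\arg\max_{r}P(r|m)=\arg\max_{r}P(m|r)P(r),$

where $latex P(m|r)$ plays the role of the translation model and $latex P(r)$ is a language model over responses.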

In fact, this is how many chatbots were made just before deep learning arrived.    Some methods simply use an existing translator to translate input-response pairs, e.g. [3].

The good thing about using a statistical approach, in particular, is that it generalizes much better than the rule-based approach.    Also, as the program is based on machine learning, all you have to do is to prepare (carefully) a bunch of training data.   Then an existing machine learning program will help you come up with a system automatically.   It spares the programmer from long and tedious tweaking of the bot.

How does a Chatbot work?  The Deep Learning Version

Now, given what we have discussed, how does Microsoft’s chatbot Tay work?   Since we don’t know Tay’s implementation, we can only speculate:

  1. Tay is smart, so it doesn’t sound like a purely rule-based system.  So let’s assume it is based on the aforementioned “converser-as-translator” paradigm.
  2. It’s Microsoft; there has got to be some deep neural network.  (Microsoft is one of the first sites that picked up the modern “deep” neural network paradigm.)
  3. What’s the data?  Well,  given Tay was built for millennials, the people who trained Tay must have been using dialogue between teenagers.  If I were a researcher at Microsoft [4],  maybe I would use data collected from Microsoft Messenger or Skype.   Since Microsoft has age data for all its users, the data can easily be segmented and bundled into training.

So let’s piece everything together.  Very likely,  Tay is a neural-network (NN)-based program which can intelligently translate a user’s natural language input into a response.    The program’s training is based on chat data.   So my speculation is that the data is exactly where things went wrong.   Before I conclude, I should note that the neural network in question is likely to be a Long Short-Term Memory (LSTM) network.    I believe Google’s researchers were the first to advocate such an approach [5] (it made headlines last year and the bot is known for its philosophical undertone). Microsoft did a couple of papers on how LSTMs can be used to model conversation [6].    There are also several pieces of existing bot-building software online, e.g. Andrej Karpathy’s char-RNN.    So it’s likely that Tay is based on such an approach. [7]

 

What goes wrong then?

Oh well, given that Tay is just a machine learning program, her behavior is really governed by the training material.   Since the training data is likely to be chat data, we can only conclude the data must contain some offensive speech, given the political landscape of the world.   So one reasonable hypothesis is that the researchers who prepared the training material hadn’t really filtered out hate speech and sensitive topics.    I guess one potential explanation for not doing that is that filtering would reduce the amount of training data.     But then, given the data owned by Microsoft,  that doesn’t make sense.  Say 20% of 1 billion conversations is still 200 million, which is more than enough to train a good chatterbot.  So I tend to think the issue is a human oversight.

And then, as a simple fix,  you could also give the robot a list of keywords: e.g. you can just program  a simple regular expression match of “Hitler”,  then make sure there is a special rule to respond to the user with  “No comment”.   At least the consequences wouldn’t be as huge as a takedown.     Then again, it’s another indication that there were oversights in the development.   If you only spent more time testing the program, this kind of issue would be noticed and rooted out.
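For illustration only, such a keyword guard could be as simple as the sketch below; the blocklist and the canned reply are of course made up, and a production system would need something far more thorough:

import re

# A (hypothetical) list of topics the bot should never engage with.
BLOCKLIST = re.compile(r'\b(hitler|genocide)\b', re.IGNORECASE)

def respond(user_input, generate_response):
    # Refuse both offensive inputs and offensive candidate outputs.
    if BLOCKLIST.search(user_input):
        return "No comment."
    response = generate_response(user_input)
    if BLOCKLIST.search(response):
        return "No comment."
    return response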

Conclusion

In this piece, I have come up with a couple of hypotheses about why Microsoft Tay failed.   At the end, I echo the title of The New Yorker’s piece, “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter” …. at least partially. Tay is perhaps one of the smartest chatterbots, backed by one of the strongest research organizations in the world and trained on tons of data. But it was not destroyed by Twitter or trolls. More likely, it was destroyed by human oversights and a lack of testing. In this sense, its failure is not too different from why much software fails.

Reference/Footnote

[1] Weizenbaum, Joseph “ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine”, Communications of the ACM 9 (1): 36–45,

[2] Philip Koehn, Statistical Machine Translation

[3] Alan Ritter, Colin Cherry, and William Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583–593. Association for Computational Linguistics.

[4] Woa! I could only dream! But I prefer to work on speech recognition, instead of chatterbot.

[5] Oriol Vinyal, Le Quoc, A Neural Conversational Model.

[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models

[7] A more technical point here: Using LSTM, a type of recurrent neural network (RNN), also resolved one issue of the classical models such as IBM models because the language model is usually n-gram which has limited long-range prediction capability.

Categories
Special Functions Statistics

The Probability Integral

The math of machine learning boils down to probability theory, calculus and linear algebra/matrix analysis.   The one gem which mesmerizes me most is the so-called “probability integral”,  as termed by Prof. Paul J. Nahin, author of Inside Interesting Integrals [1]. Or as you might learn in either a probability theory or random processes class:

$latex F=\int_{-\infty}^{\infty}e^{\frac{-x^2}{2}}dx$

Of course, this is related to the Gaussian distribution.    When I learnt the probability integral,  I first learned the now standard trick of “polar integration” [2]:  we first create a double integral, then transform the integrand using a polar coordinate transformation. Here is the detail. Since the integrand $latex e^{\frac{-x^2}{2}}$ is an even function, we just need to consider

$latex I=\int_{0}^{\infty}e^{\frac{-x^2}{2}}dx$

$latex I^2=\int_{0}^{\infty} e^{\frac{-x^2}{2}}dx \int_{0}^{\infty} e^{\frac{-y^2}{2}}dy=\int_{0}^{\infty}\int_{0}^{\infty}e^{\frac{-x^2}{2}}e^{\frac{-y^2}{2}}dxdy=\int_{0}^{\infty}\int_{0}^{\infty}e^{\frac{-(x^2+y^2)}{2}}dxdy$

Then let $latex x=r\cos\theta$ and $latex y=r\sin\theta$; the Jacobian is

$latex \frac{\partial(x,y)}{\partial(r,\theta)}=\begin{vmatrix}\cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta\end{vmatrix} = r (\sin^2\theta+\cos^2\theta) =r$.

Substituting into $latex I^2$,

$latex I^2=\int_{0}^{\frac{\pi}{2}}\int_{0}^{\infty}re^{\frac{-r^2}{2}}drd\theta$

$latex =\int_{0}^{\frac{\pi}{2}}(\left.-e^{\frac{-r^2}{2}}\right|_{0}^{\infty})d\theta=\int_{0}^{\frac{\pi}{2}} (0 - (-1))d\theta=\int_{0}^{\frac{\pi}{2}}d\theta= \frac{\pi}{2},$

or $latex I = \sqrt{\frac{\pi}{2}}.$

So $latex F = 2\sqrt{\frac{\pi}{2}} = \sqrt{2\pi}$.

This is a well-known result; you can find it in almost every introductory book on probability distributions.
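If you want a quick sanity check of the result, here is a small sketch using SciPy's numerical quadrature; the function names are standard SciPy/NumPy, nothing specific to this post:

import numpy as np
from scipy import integrate

# Numerically evaluate the probability integral and compare with sqrt(2*pi).
value, abs_err = integrate.quad(lambda x: np.exp(-x**2 / 2.0), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))   # both should print roughly 2.5066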

This derivation is a trick. Quite frankly, without reading the textbook, I wouldn’t have had the imagination to come up with such a smart way to derive the integral.   Perhaps that’s why Prof. Nahin said in his book Inside Interesting Integrals [4]:

THIS IS ABOUT AS CLOSE TO A MIRACLE AS YOU’LL GET IN MATHEMATICS.

and in Solved Problems in Analysis [5],  Orin J. Farrell evaluated a similar integral to $latex I$,

$latex F_{1}(x)=\int_{0}^{\infty}e^{-x^2}dx$.

And again he used the polar coordinate transform.  He then said, “Indeed, its discoverer was surely a person with great mathematical ingenuity.”  I concur.  While the whole proof procedure only uses elementary calculus known by undergraduates, it takes quite a bit of imagination to come up with such a method.  It’s easy for us to use the trick, but the originator must have been very smart.

I also think this is one of the key calculus tricks to learn if you want to study more daunting distributions such as the gamma, beta or Dirichlet distributions.   So let me give several quick notes on the proof above:

  1. You can use a similar idea to prove that the famous gamma and beta functions are related as $latex \Gamma(a)\Gamma(b)=\Gamma(a+b)B(a,b)$.  There are many proofs of this fundamental relationship between the gamma and beta functions, e.g.  PRML [3] ex. 2.6 outlines one way to a proof.   But I found that first transforming the gamma function with the substitution $latex x=y^2$ and then following the above polar coordinate transformation trick is the easiest way to prove the relationship (see the sketch after this list).
  2. As mentioned before, the same proof can also be used in calculating $latex \Gamma(\frac{1}{2})$ (Also from [5]).
  3. The polar coordinate trick is not the only way to calculate the probability integral.  Look at note [6] on the history of its derivation.
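For point 1, here is a rough sketch of the substitution I have in mind; the details are left as an exercise. Writing the gamma function as $latex \Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}dt$ and substituting $latex t=x^2$ gives

$latex \Gamma(a)=2\int_{0}^{\infty}x^{2a-1}e^{-x^2}dx,$

so that

$latex \Gamma(a)\Gamma(b)=4\int_{0}^{\infty}\int_{0}^{\infty}x^{2a-1}y^{2b-1}e^{-(x^2+y^2)}dxdy.$

Switching to polar coordinates exactly as above, the double integral factors into

$latex \Gamma(a)\Gamma(b)=\left(2\int_{0}^{\infty}r^{2(a+b)-1}e^{-r^2}dr\right)\left(2\int_{0}^{\frac{\pi}{2}}\cos^{2a-1}\theta\sin^{2b-1}\theta d\theta\right)=\Gamma(a+b)B(a,b),$

where the first factor is $latex \Gamma(a+b)$ (undo the $latex t=r^2$ substitution) and the second factor is $latex B(a,b)$ (substitute $latex u=\cos^2\theta$).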

Each of these points deserves a post of its own.  So let’s say I owe you some more posts.

References:
[1] Paul J. Nahin, Inside Interesting Integrals.

[2] Alberto Leon-Garcia, Probability and Random Processes for Electrical Engineering 2nd Edition.

[3] Christopher M. Bishop, Pattern Recognition and Machine Learning.

[4] Paul J. Nahin, Inside Interesting Integrals. p.4.

[5] Orin J. Farrell and Bertram Ross, Solved Problems in Analysis p.11.

[6] Author Unknown, The Probability Integral Link:  http://www.york.ac.uk/depts/maths/histstat/normal_history.pdf

Categories
Natural Language Processing

What does an NLP Engineer do?

After taking the very enjoyable class from Prof. Radev, I was browsing the forum for some interesting topics.   Here is one I found, posted by an anonymous student (rephrased):

“After taking this course, I still have only a vague grasp of what an NLP Software Engineer does on a daily basis.  Will the job involve data analysis and modeling? Or data engineering? Or theoretical NLP research? Could anybody share some thoughts?”

I took a stab at it; here is a rewritten and extended version:

“It’s a good question, and it’s also a tough one to answer well.   In general, it depends on the company you work for.   Let’s say you work in a bigger company in which there is a research department, separate from the standard programming team.   I would also add that some NLP engineers are actually managers, so their role is mostly the general organization of a team’s activity.

If you set aside pure coding and programming tasks,  I would say there are usually two types of tasks you will encounter:

  1. NLP component update and improvement:  e.g. say you are working on a word sense disambiguation routine.   Through data collection, you were able to increase the amount of data 10 times, so your boss assigns you to re-train the existing SVM which you trained a while ago.     Your job in this case is not that different from HW3 (Arthur: i.e. a WSD exercise using machine learning): you are going to massage the data, train a model, benchmark it and present it to your group, and hopefully eventually deploy it in real life.
  2. Integration: Now suppose you are more coding-oriented; then it’s possible that you are asked to incorporate a pre-existing program written by your colleagues, and your major task is to be the one responsible for that process.   Note that this requires a different skill set from a vanilla programmer: you do need to understand the underlying NLP technology quite well to integrate an NLP component.   Of course, there will also be issues handed off from your more research-oriented colleagues to you, such as speed or memory optimization.

Both tasks 1 and 2 can be infinitely complex and difficult in real life.  E.g. you might be handed a task which is very difficult to improve upon, maybe because previous research has exhausted most of the routes of improvement.   In that case, you will play the role of a pure researcher and think up new methods to improve the task.  Or if you are working on modeling, then a significant amount of your time will be spent on modeling, tending to different experiments, as well as data preparation (see below).

Similarly, in task 2, you can be handed a program which consists of 100k+ lines of C/C++ code.  You need to take care of all system-related issues, fire up gdb and spend many waking hours fixing a multi-threaded program.   Or you will need to make sure the plumbing of your application works with your research colleagues’ code.   Those are also difficult skills which take years to refine.

In my case, I am more of an integrator on speech recognition components.    I also do quite a dose of R&D on other ML components (see this post).    In the past, my role was more on the pure research side.   For example, my BBN role was that of an experimenter working on unsupervised topic detection.

In any case,  I think Prof. Radev’s class presents students with a very good summary of what real-life NLP people do, because we mostly practice tasks 1 and 2.

There are perhaps two exceptions to tasks 1 and 2:

  1. The first exception I would mention is data preparation.  Normally you will have to prepare data yourself.  The very nice thing that happened in our homework, where a TA wrote a working XML parser for the input data, seldom happens in real life.   That explains why many people claim “data preparation is 80% of machine learning”.
  2. The second exception is that you might be asked to analyze and present your results, i.e. data analysis.   Analysis is more the arena of statistics and data science.    It is also an important skill.  But in my experience, product-oriented engineers spend less time on those aspects, because most of the time you are either researching or developing the product.  Your analytical skill is mostly used when you are stuck.   For example, I seldom use data analysis in my work (descriptive? exploratory? causal?) unless I see a rare phenomenon.    But a data scientist is more likely to use those skills daily. “

Arthur

Categories
Uncategorized

Using ARPA LM with Python

During Christmas, I tried to do some small fun hacks with language modeling.  That obviously requires reading and evaluating an LM.   There are many ways to do so, but here is a new method which I really like: use the Python interface of KenLM.

So here is a note for myself (largely adapted from Victor Chahuneau):

Install Boost 1.60.0

To install KenLM, you first need to install Boost, and to build Boost, you need libbz2.

  1. First, install libbz2:
    sudo apt-get install libbz2-dev
  2. Then build Boost 1.60.0:  download here, then type
    ./bootstrap.sh

    , and finally

    ./b2 -j 4
  3. Install Boost:
    ./b2 install

Install KenLM

Now we install KenLM; I am using Victor’s copy here.

git clone https://github.com/vchahun/kenlm.git
pushd kenlm
./bjam
python setup.py install
popd

Training an LM

Download some books from Gutenberg; I am using Chambers’s Journal of Popular Literature, Science, and Art, No. 723, which gives me this file:

50780-0.txt

So all you need to do to train a model is:

cat 50780-0.txt | /home/archan/src/github/kenlm/bin/lmplz -o 3 > yourLM.arpa

Then you can binarize the LM, which is the part I like about KenLM: it feels snappier and faster than the other toolkits I have used.

/home/archan/src/github/kenlm/bin/build_binary yourLM.arpa yourLM.klm

Evaluate a Sentence with an LM

Write a Python script like this:

import kenlm

model = kenlm.LanguageModel('yourLM.klm')        # load the binarized model
score = model.score('i like science fiction')    # total log10 probability of the sentence
print(score)
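If you prefer a perplexity number, you can derive one from the score. A minimal sketch, assuming score() returns the total log10 probability of the sentence with the end-of-sentence token included (which is what the KenLM Python wrapper does by default, as far as I know):

import kenlm

model = kenlm.LanguageModel('yourLM.klm')

def perplexity(sentence):
    log10_prob = model.score(sentence)          # total log10 probability
    n_tokens = len(sentence.split()) + 1        # + 1 for the </s> token
    return 10.0 ** (-log10_prob / n_tokens)

print(perplexity('i like science fiction'))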

Arthur

Categories
deep learning Fun HTK Theano

Me and My Machines

I wrote a page a long time ago about the machines I have used.   When you work with computing for a while, every one of your computers means something to you.    That’s why I try not to throw them away easily. Occasionally I also buy scrap computers, fix them up and feel like I did a good thing for the planet.

Anyway, here is a list of the machines I have used, some with more stories than others:

  1. A 286 (1991-1992?): The first computer I ever touched, back in junior high school.  There was a geeky senior dude who tried to teach us the basics of databases and none of us really understood him. He wasn’t nice to us, who were like 12-13 years old.  I disliked his attitude and called him out.   He was so unhappy that he stormed out of the computer room.   We eventually learned stuff like LOGO and basic DOS commands on those very slow 286s. (Well, you could optimize the hell out of them though.)
  2. A 486-66DX (1994-1996?):  My first computer, which I had since high school.  I wasn’t very into computers at that time. I used it to play TIE Fighter and wrote documents using Word.  I also did several assignments on microprocessor programming (i.e. basic assembly stuff).   It was incredibly slow and it took a long time to compile a skeleton Visual C++ Windows program.   Later, I gave it to a girl and she just threw the whole thing away.   (Shame on me. I threw away a relic of computer history.)
  3. A P166 “Mars” (1996-2000): I bought this when I was in my second year of college.   Since I spent most of my money on this machine, I worked part-time during my degree.    I was finally able to do some interesting things on a computer, such as GUI programming.   The GUI programming got me a good contract from a librarian who was trying to develop cataloging software.  I also wrote my first isolated-word speech recognizer on it.    Later I ran a speech recognizer written by a guy named Ricky Chan.    That recognizer was then used in my final year project.   Unfortunately, both the cataloging software and my final year project were disasters:  I didn’t know how to fix memory leaks in C/C++ at that point.   All my programs died horribly.   Good Ricky Chan had nothing to do with it; it was all my fault. The horror of Windows 95’s blue screen still haunts me even these days.  Of course, both the librarian and my then-boss saw me in a very dim light.  (They probably still do.)  I was cleaning my basement this year and Mars was getting too dirty, so I painfully threw it away with tears in my eyes.
  4. A P500 “Jupiter” (2000-):  I bought this in my first year of graduate school, half a year after I started to receive a stipend.    This was the period when I was very into HTK (the Hidden Markov Model Toolkit).  I still kept Mars, but if you wanted to train HMMs for connected digit recognition using TIDIGITS, my P166 with 16MB would take close to a week.   My P500, though, allowed me to run TIMIT and I was even able to train triphones (Woo!).    I also gleefully ran every step from the HTK manual V2.2 even though I had no idea what I was doing.   Jupiter was also the machine on which I wrote the modified Viterbi algorithm in my thesis (formally, the Frame-Skipping Viterbi Algorithm (FSVA)).  I still keep the mid-frame body of “Jupiter”, but it hasn’t been working well since around 6 years ago.
  5. A Book Mini-PC (2000): In between Mars and Jupiter, I bought a mini-form-factor PC.  I tried to install Red Hat Linux on it, but I was very bad at any Linux installation then.   Eventually the motherboard burned out and I gave it to a friend who claimed to know how to fix motherboards.    (He never got back to me.)
  6. “eea045” (2000-2003):  A lab machine I used back at HKUST.  It started as a Pentium 500MHz, but soon my boss upgraded it to 1.7GHz.   I was jubilant to use it to run acoustic model training, and I also ran most of my thesis experiments on it.
  7. A Toshiba laptop (2002): My mom gave it to me because she said it wasn’t running too well.  It died on me right on the day I was going to present my Master’s thesis.   Luckily, someone helped me borrow a machine from the EEE department, so now I am a happy Master.
  8. “Snoopy” (2001-2003): I was then a Junior Speech Scientist at Speechworks, and this Pentium 500 was assigned to me.   It is also the first of the four machines I used with funny names.
  9. “Grandpa” (2001-2003): The laptop assigned to me at Speechworks.   It solved a lot of funny crises for me.   I really missed “Grandpa” when I was laid off from Speechworks.
  10. iBuddie 4 A928 (2002-2003):  A thing called a desknote at the time;  it’s like a laptop but you always have to give it juice.   Again, its motherboard burned out.  And again, I didn’t quite know how to fix it.
  11. “Lumpy” (2003-2006): This is the machine assigned to me by CMU SCS,  and I asked the admins many times if the name was some kind of very profound joke.  “No” was their answer.  But I always knew it was a setup. 😐  Always knew.
  12. “Grumpy”/”Big Baby” (2003-): This is a Dell Inspiron 9100 I bought at a hefty price of $3000.  Even in 2004, it was a heavy laptop.   I used it for most of my CMU work, including hacking Sphinx and writing papers.    Prof. Alex Rudnicky, my then-boss at CMU, always jokingly asked me if Big Baby was a docking station.   (Seriously, no.)   I also used it as my temporary laptop at Scanscout.   The laptop is so large and heavy that I used it as a dumbbell at Scanscout.
  13. “The Cartoon Network” (2003-2006): This is the name of the cluster in the CMU Sphinx Group, used by many students from the Robust Group, by me and David Huggins-Daines, Alex’s student, as well as Evandro, who was then working for Prof. Jack Mostow.  The names of the machines were all based on cartoon characters from Cartoon Network:  for example, Blossoms,  Bubbles and Buttercups were three 2GHz machines which were not too reliable.   I kept asking Alex to name one of the machines Mojo Jojo, but he kept refusing me.  (Why? Why, Alex?)
  14. A G4 (2004-2006): This is the first Mac I ever used in my life, and one of the most important.   I used it to develop for a project called CALO (Cognitive Assistant that Learns and Organizes), now venerable because several SRI participants started an engine which is nowadays called Siri.   But what I learned was simpler:  Apple would grow big. Since then I have invested in Apple regularly, with reasonable profit.
  15. A Lenovo laptop (2007-2008):  In my short stay at Scanscout,  I used this machine exclusively to compile and develop what was then called the SSFramework (“ScanScout Framework”), a Java/Tomcat stack which Scanscout used to serve video ads.   I ghosted it to have two partitions, Windows and Linux, and I mostly worked on Windows.  At that point, I always had small issues here and there when switching back to Linux.  Usually the very versatile tech guru Dr. Tadashi Yonezaki would help me. Dr. Yonezaki later became the Chief Scientist of Scanscout.
  16. “Scanscout’s Machines” (2007-2008): I can’t quite remember what the setup was, but all machines at early Scanscout were shared by core technology scientists, like Tadashi or me, and several developers and QAs.   I wasn’t too into “The Scout” (as a couple of early alumni called it), so I left the company after only 1.5 years.   A good ending though: Scanscout was later acquired by Tremor Video and got listed.
  17. Inspiron 530 “Inspirie” (2008-): There was around half a year after I resigned from Scanscout when I was unemployed.   I stayed home most of the time, read a lot and played tons of poker and backgammon online.  That was also the time I bought Inspirie.   For a long time, it wasn’t doing much other than being a home media center.    In the last few years, though, Inspirie played an important role as I tried to learn deep learning.   I ran all of Theano’s tutorials on it (despite it being very, very slow).
  18. Machines I used in an S&P 500 company (2009-2011): Between “The Scout” and Voci, I was hired by a mid-size research institute as a Staff Scientist, and took care of much of the experimental work within the group.   It was a tough job with long hours, so my mind usually got very numb.   I can only vaguely remember around 3 incidents in which my terminal was broken.    That was also the time I was routinely using around 200 to 300 cores, which I guess was around 10-15% of all the cores available within the department.   I was always told to tone down my usage.  Since there were a couple of guys in the department exactly like me, recklessly sending jobs to the queue,  the admins decided on a scheme which limited the number of cores we could use.
  19. A 2011 MacBook Pro 17-inch “Macky” (2011-): After several years of saving, I finally bought my first MacBook.   I LOVE IT SO MUCH! It was also the first time in many years that I felt computing was fun.  I wrote several blogs and several little games with Macky, but mostly it was the machine I carried around.   Unfortunately, a horrible person poured tea on top of it, so its display was permanently broken and I have to connect it to an LCD all the time.   But it is still the machine I love most, because it made me love computing again.
  20. “IBM P2.8 4 cores” (2011-): A machine assigned to me by Voci. Most of my recent work on Voci’s speech recognition framework was done on it.
  21. “Machines from Voci” (2011-): They are fun machines, partly due to the rise of GPUs.  Unfortunately I can’t talk about their settings too much. Let’s say Voci has been doing great work with them.
  22. “A 13-inch MacBook” (2014-): This is my current laptop.   I took most of my Coursera classes with it.    I feel great about its size and easy-goingness.
  23. “An HP Stream” (2015-): My current Windows machine.  I hate Windows, but you have to use it sometimes. A $200 price tag seems about right.
  24. “Dell M70” and “HP Pavilion dv2000” (2015-): i.e. the machines you see in the image at the top of this post.   I bought each of them for less than $10 from Goodwill.   Both of them work fine, with only small physical issues such as dents and broken hinges.   A screwdriver and some electrical tape fixed them easily.

There you have it: the 24 sets of machines I have touched.  Mostly the story of some unknown silicon, but also my personal perspective on computing.

Arthur

(Edit at Dec 24: Fixed some typos.)

Categories
Dan Jurafsky Dependency Parsing Dragomir Radev HMM Language Modeling Machine Learning Natural Language Processing Parsing POS tagging Programming Python SMT Word Sense Disambiguation

Radev’s Coursera Introduction to Natural Language Processing – A Review

As I promised earlier, I am going to review Prof. Dragomir Radev’s introductory class on natural language processing.   A few words about Prof. Radev: from his Wikipedia entry, Prof. Radev is an award-winning professor who co-founded the North American Computational Linguistics Olympiad (NACLO), which is the equivalent of USAMO in computational linguistics. He was also the coach of the U.S. team at the International Linguistics Olympiad 2011 and helped the team win several medals [1].    I think these are great contributions to the speech and language community.  In the late 90s, when I was still an undergraduate, there was not much recognition of computational language processing as an important computational skill.    With competitions at the high school or college level, there will be a generation of young minds who aspire to build intelligent conversational agents, robust speech recognizers and versatile question-answering machines.   (Or else everyone would think the Linux kernel is the only cool hack in town. 🙂 )

The Class

So how about the class?  I have got to say I am more than surprised and happy with it.   I was searching for an intro NLP class, and the two natural choices were Prof. Jurafsky and Manning’s and Prof. Collins’ Natural Language Processing.   Both classes received great praise and comments, and a few of my friends recommended taking both.   Unfortunately, there was no recent class offering, so I could only watch the material offline.

Then came Prof. Radev’s class.  It is, as Prof. Radev explains, “more introductory” than Collins’ class and “more focused on linguistics and resources” than Jurafsky and Manning’s.   So it is good for two types of learners:

  1. Those who just started out in NLP.
  2. Those who want to gather useful resources and start projects on NLP.

I belong to both types.   My job requires me to have more comprehensive knowledge of language and speech processing.

The Syllabus and The Lectures

The class itself is a brief survey of many important topics in NLP.   There are the basics:  parsing, tagging, language modeling.  There are advanced topics such as summarization, statistical machine translation (SMT), semantic analysis and dialogue modeling.   The lectures, except for occasional mistakes, are quite well done and filled with interesting examples.

My only criticism is perhaps the length of the videos; I would hope that most videos would be less than 10 minutes.    That makes it easier to rotate them with my other daily tasks.

The material is not too difficult to absorb for newcomers.   For starters, advanced topics such as SMT are not covered in too much mathematical detail.  (So no need to derive EM on IBM models.)  That, I think, is quite appropriate for first-time learners like me.

One more unique feature of the lectures: they are filled with interesting NACLO problems.    While NACLO is more of a high-school level competition, most of the problems are challenging even for experienced practitioners, and I found them quite stimulating.

The Prerequisites and The Homework Assignments

To me, the fun part is the homework.   There were three assignments, focusing on:

  1. Nivre’s Dependency Parser,
  2. Language Modeling and POS Tagging,
  3. Word Sense Disambiguation

All homework is based on Python.   If you know what you are doing, the assignments are not that difficult.   For me, I spent around 12-14 hours on each.   (Those were usually weekends.) Just like Ng’s Machine Learning class,   you need to match numbers with the golden reference.   I think that’s the right approach to learn any machine learning task the first time.   Blindly coming up with a system and hoping it works never gets you anywhere.

The homework does point to one issue with the class: you do need to know the basics of machine learning.  Also, anyone who has never had any programming experience will find the homework very difficult.   This probably describes many linguistics students who never took any computer science classes.  [3]    You can still “power through” and pass, but it can be unnecessarily hard.

So I recommend you first take Ng’s class or perhaps the latest Machine Learning specialization from Guestrin and Fox.   Those are the classes which will give you some basics of programming as well as the basic concepts of machine learning.

If you haven’t taken any machine learning class, one way to get through more difficult classes like this is to read the forum messages.   There were many nice people in the course answering various questions.   To be frank, if the forum didn’t exist, it would have taken me around three times longer to finish all the assignments.

Final Word

All in all, I highly recommend Prof. Radev’s class to anyone who is interested in NLP.    As I mentioned, though, the class does require prerequisites such as the basics of programming and machine learning.   So I would recommend learners first take Ng’s class before taking this one.

In any case, I want to thank Prof. Radev and all the teaching staff who prepared this wonderful course.   I also thank the many classmates who helped me through the homework.

Arthur

Postscript at 2017 April

After I wrote this review, Coursera upgraded to its new format.  It’s a pity that none of the NLP classes, including Prof. Radev’s, survived.   Too bad for NLP lovers!

There has also been a seismic shift in the field of NLP toward deep learning. While deep learning does not dominate evaluations the way it does in computer vision or speech recognition, it is perhaps the most actively researched direction right now.  So if you are curious about what’s new, consider taking the latest Stanford cs224n (2017) or Oxford’s Deep Learning for NLP.

[1] http://www.eecs.umich.edu/eecs/about/articles/2010/Radev-Linguistics.html

[2] Week 1 Lecture 1 Introduction

[3] One anecdote:  in the forum, a student was asking why you can’t just sum all the data points of a class together and pour them into scikit-learn’s fit().    I don’t blame the student, because she started late and lacked the prerequisites.   She later finished all the assignments and I really admire her determination.

Categories
Connectionist Duda and Hart Machine Learning PRML The Element of Statistical Learning Tom Mitchell

One Algorithm to rule them all – Reading “The Master Algorithm”

I read a lot of science non-fiction,  and I always wonder why there is no popular account of machine learning, especially given it is so prevalent in our time.   Machine learners are everywhere: when you use Siri, when we search the web, when we translate using services such as Google Translate, when we use email and a spam filter helps us remove most of the junk.

Prof. Pedro Domingos’ book, The Master Algorithm (TMA), is perhaps the first popular account of machine learning (that I know of).   I greatly enjoyed the book.   The book is most suitable for high school or first-year college students who want to learn more about machine learning.   But experienced practitioners (like me) will enjoy many ideas from the book.

Target Audience

Let’s ignore forgettable books with titles such as “Would MACHINES become SKYNET?”  or “Are ROBOTS TAKING OVER THE WORLD?”, the fluffiest kind of fluff. Most books I know at the introductory level of machine learning specialize in one type of technique, with titles such as “Machine Learning Technique X Without Pain”, etc.  They are more a kind of user manual.  In my view, they also lean on the practical side of things too much.  Those are good for getting a taste of machine learning, but they seldom give you a deeper understanding of what you are doing.

On the other hand,  comprehensive textbooks in the field such as Mitchell’s Machine Learning, Bishop’s Pattern Recognition and Machine Learning (also known as PRML),  Hastie’s The Elements of Statistical Learning and of course Duda and Hart’s Pattern Classification (2nd ed.) are more for practitioners who want to deepen their understanding [1].    Out of the four books I just mentioned, perhaps Machine Learning is the most readable, but it still requires prerequisite knowledge such as multivariate calculus and familiarity with Bayes’ rule.   PRML would challenge you with (more) advanced tricks of calculus, such as how to work with tricky integrals like $latex \int_{-\infty}^{\infty} e^{-x^2} dx$ or gamma functions.   These books are hardly for the general reader who does not have much mathematical sophistication.

I think TMA fills the gap between a user manual and a comprehensive textbook.  Most explanations are in words, or at most college-level math.   Yet the coverage is very similar to Machine Learning.  It is still dumbed down, but it touches many goodies (and toughies) in machine learning, such as the No Free Lunch theorem.

5 Schools of Machine Learning

In TMA,  Prof. Domingos divides existing machine learning techniques into 5 schools:

  1. Symbolists: e.g. logic-based representations, rule-based approaches,
  2. Connectionists: e.g. neural networks,
  3. Evolutionists: e.g. genetic algorithms,
  4. Bayesians: e.g. Bayesian networks,
  5. Analogizers: e.g. nearest neighbors, linear separators, SVM.

To be frank, the scheme can be hard to use in practice.  Most modern textbooks such as PRML or Duda and Hart usually discuss the Bayesian approach mixed with techniques from the other four categories.   I guess the reason is that you can always give a Bayesian interpretation to a parameter estimation technique.

Artificial neural networks (ANN) are another example, one which can fall into multiple categories in TMA’s scheme.  It’s true that ANNs were motivated by human neural networks, but an ANN’s formulation is quite different from computational models of biological neural networks.   So one way to think about an ANN is as “a stack of logistic regressors”.  [2]

Even though I think TMA’s scheme of dividing up algorithms is odd,  for a popular book I think this treatment is fair.   In a way, you could say it’s hard to find any consistent scheme to classify machine learning algorithms.   If you ask me, I will say “Woa, you should totally learn linear regression, logistic regression …..” and come up with 8 techniques.   But that, just like many textbooks, is not easy for general readers to comprehend.

I guess more importantly, you should ask if the coverage is good. Most practitioners of ML are perhaps specialists like me (in ASR) or in particular subfields.   It’s easy to get tunnel vision on what can be done and researched.

“The Master Algorithm”

So what is the eponymous “Master Algorithm” then?   The professor explains on p. 24, in what he calls the central thesis of the book:

“All knowledge – past, present, and future – can be derived from data by a single universal learning algorithm.”

Prof. Domingos then motivates why he holds such a belief, and I think this also highlights the thinking behind the latest research in machine learning.    What do I mean by that?

Well, you can think of machine learning work in real life as just testing different techniques and seeing if they work well.   For example, my recent word sense disambiguation homework recommended trying out both SVM and kNN.   (We then saw the miraculous power of SVM……)   But that’s the deal: most of the time, you pick the best technique through evaluation.

But in the last 5-6 years, we have witnessed deep neural networks (and their friends, RNNs, CNNs, etc.) become the winners of competitions in many fields.   Not only ASR [3] and computer vision [4]; we are seeing NLP records beaten by neural networks too [5].   That makes you think: could one algorithm rule them all?

Another frequently talked-about discovery concerns the human neocortex.  In the popular view of neuroscience, most of our brain’s functions are localized: for vision, there is a region called the visual cortex, and for sound, there is a region called the auditory cortex.

Then you might have heard of the amazing experiment in which researchers rewired the connection from the eyes to the auditory cortex, and the auditory cortex learned how to see.   That is a sign that neocortical circuitry can be reused [6] for many different purposes.

I think that’s what Prof. Domingos is driving at.   In the book, he also motivates the thesis from other perspectives, but I think the neuroscientific perspective probably resonates with our time the most.

No Deep Learning

While I like TMA’s general coverage,  I would have hoped for some description of deep learning, which, as I said in the last paragraph, has been beating records here and there.

But then, should we feel alarmed?   Yes, right now deep learning is showing superior results.  But so did SVM (it still does) and GMM before it.   It just means that our search for the best algorithm is still ongoing and might never end.  That’s perhaps why the good professor is not too focused on deep learning.

Conclusion

While I have reservations about the applicability of the “5 schools” categorization scheme,  I love the book’s comprehensive coverage and its central thesis.   The book is also good for a wide audience: for high school and college students hearing about machine learning for the first time, this is a good introductory book,  while for specialists like me, it can inspire new ideas and help consolidate old ones.  E.g.  this is the first time I have read that nearest neighbor is only twice as error-prone as the best imaginable classifier (p. 185).

So I highly recommend this book for anyone who is interested in machine learning.   Of course, feel free to tell me what you think in the comment section.

References:

[1] The other classic I missed here is Prof. Kevin Murphy’s Machine Learning: A Probabilistic Perspective.

[2] I heard this from Dr. Richard Socher’s DNN+NLP class.

[3] G. Hinton et al.   Deep Neural Networks for Acoustic Modeling in Speech Recognition.

[4] Alexnet: paper.

[5] I. Sutskever et al., Sequence to Sequence Learning with Neural Networks. I am thinking more along the lines of SMT.   The latest I heard: with an attention model and some tuning,  NN-based SMT beats the traditional IBM Models-based approach.

[6] In TMA, this paper on ferret is quoted.

Categories
Classification Debugging Machine Learning Programming

Experience in Real-Life Machine Learning

I have been refreshing myself on the general topic of machine learning,   mostly motivated by job requirements as well as my own curiosity.   That’s why you saw my review post on the famed Andrew Ng class.   I have also been taking Dragomir Radev’s NLP class, as well as the Machine Learning Specialization by Emily Fox and Carlos Guestrin [1].   When you are at work, it’s tough to learn.  But so far, I have managed to learn something from each class and was able to apply it in my job.

So, one question you might ask is how applicable are online or even university machine learning courses in real life?     Short answer: they are quite different. Let me try to answer this question by giving an example that came up recently.

It is a gender detection task based on voice.  This came up at work and I was tasked with improving the company’s existing detector.   For the majority of my time, I worked on dividing the dataset, which has around 1 million data points, into train/validation/test sets.   Furthermore,  from the beginning of the task I decided to create datasets of increasing size: for example, 2k, 5k, 10k, and so on up to 1 million.     This simple exercise, done mostly in Python, took me close to a week.
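As an illustration only (not the actual code I used at work), here is roughly how one could carve out progressively sized training subsets after a single shuffle, assuming the data already sits in NumPy arrays X and y:

import numpy as np

sizes = [2000, 5000, 10000, 50000, 100000, 500000, 1000000]

rng = np.random.RandomState(0)
order = rng.permutation(len(X))        # shuffle once so every subset is a random sample
X_shuf, y_shuf = X[order], y[order]

subsets = {n: (X_shuf[:n], y_shuf[:n]) for n in sizes if n <= len(X)}
# prototype on subsets[2000] first, then scale up to the full set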

Training, aka the fun part, was comparatively short and anticlimactic.  I just chose a couple of well-known methods in the field and tested them on the progressively sized datasets.   Since prototyping a system this way is so easy,  I was able to weed out weaker methods very early and come up with a better system, with a high relative performance gain.  Before I submitted the system to my boss, I also worked out an analysis of why the system doesn’t give 100%.   No surprise: it turns out the volume of the speech matters, and some individuals of the world don’t fit their sex stereotypes.    But so far the task was still quite well done, because we got better performance and we know why certain things don’t work well.   That is good knowledge to have in practice.

One twist here: after finishing the system, I found that the method which gives the best classification performance doesn’t give the best speed performance.   So I decided to choose a cheaper but still rather effective method.    It hurt my heart that the best method wasn’t used, but that’s the way it is sometimes.

Eventually, as one of the architects of the system, I also spent time making sure the integration was correct.   That took coding, much of it done in C/C++/Python.  Since there were a couple of bugs in some existing code,  I spent about a week tracing code with gdb.

The whole thing took me about three months.  Around 80% of my time was spent on data preparation and coding.  The machine learning you do in class does happen, but it only took me around 2 weeks to determine the best model, and I could have made those 2 weeks shorter by using more cores. Compared to other tasks,  the machine learning you do in class, which usually comes in the very nice form “Here is a training set, go train and evaluate it with the evaluation set”,  seldom appears in real life.  Most of the time, you are the one who prepares the training and evaluation sets.

So if you happen to work in machine learning, do expect to work on tasks such as web crawling and scraping if you work on text processing,  to listen to thousands of waveforms if you work on speech or music processing,  or to watch videos that you might not like to watch if you try to classify videos.   That’s machine learning in real life.   If you also happen to be the one who decides which algorithm to use, yes, you will have some fun.   If you happen to design a new algorithm, then you will have a lot of fun.  But most of the time, practitioners need to handle issues which can just be …. mundane.   Tasks such as web crawling are certainly not as enjoyable as applying advanced mathematics to a problem.   But they are incredibly important, and they will take up most of your time, or your organization’s as a whole.

Perhaps that’s why you hear of the term “data munging”,  or in Bill Howe’s class, “data jujitsu”.   This is a well-known skill, but not a heavily advertised one, and it is unlikely to be seen as important.    But in real life, such data processing skill is crucial.   For example, in my case, if I didn’t have the progressively sized datasets, prototyping could have taken a long time, and I might have needed to spend 4 to 5 times more experimental time to determine the best method.    Of course, debugging will also be slower if you only have one huge dataset.

In short, data scientists and machine learning practitioners spend the majority of their time as data janitors.   I think that has been a well-known phenomenon for a long time.  But now, as machine learning becomes a thing,  there is more awareness [2].  I think this is a good thing because it helps with better scheduling and division of labor if you want to manage a group of engineers on a machine learning task.

[1] I might do a review at a certain point.
[2] e.g. This NYT article.