Categories
Classification Debugging Machine Learning Programming

Experience in Real-Life Machine Learning

I have been refreshing myself on the general topic of machine learning, mostly motivated by job requirements as well as my own curiosity. That's why you saw my review post of the famed Andrew Ng class. I have also been taking Dragomir Radev's NLP class, as well as the Machine Learning Specialization by Emily Fox and Carlos Guestrin [1]. It's tough to learn while you are at work, but so far I have managed to learn something from each class and apply it in my job.

So, one question you might ask is: how applicable are online, or even university, machine learning courses in real life? Short answer: real-life work is quite different. Let me try to answer this question with an example that came up recently.

It is a gender detection task based on voice. It came up at work, and I was tasked with improving the company's existing detector. For the majority of my time, I worked on dividing the data set, which has around 1 million data points, into train/validation/test sets. Furthermore, from the beginning of the task I decided to create subsets of increasing size, for example 2k, 5k, 10k, and so on up to 1 million. This simple exercise, done mostly in Python, took me close to a week.
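For flavor, here is a minimal sketch of how such a split plus progressively sized subsets might be prepared. The function names and subset sizes are my own illustration, not the actual pipeline used at work (which also had to deal with raw audio and metadata).

```python
# A minimal sketch, assuming features/labels are already NumPy arrays.
# `split_dataset` and `make_subsets` are illustrative names, not real tooling.
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(X, y, val_frac=0.1, test_frac=0.1, seed=42):
    """Split the full data set into train/validation/test partitions."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=val_frac + test_frac, random_state=seed, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=test_frac / (val_frac + test_frac),
        random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

def make_subsets(X_train, y_train, sizes=(2_000, 5_000, 10_000, 100_000), seed=42):
    """Draw nested, progressively larger training subsets for fast prototyping."""
    order = np.random.default_rng(seed).permutation(len(y_train))
    return {n: (X_train[order[:n]], y_train[order[:n]])
            for n in sizes if n <= len(y_train)}
```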

Training, a.k.a. the fun part, was comparatively short and anticlimactic. I just chose a couple of well-known methods in the field and tested them on the progressively sized data sets. Since prototyping was so easy, I was able to weed out weaker methods very early and come up with a better system, with a high relative performance gain. Before I submitted the system to my boss, I also worked out an analysis of why the system doesn't reach 100%. No surprise: it turns out the volume of the speech matters, and some individuals simply don't fit gender stereotypes. Still, the task went well, because we got better performance and we also know why certain things don't work. That is useful knowledge in practice.

One twist here: after finishing the system, I found that the method which gives the best classification performance doesn't give the best speed. So I decided to choose a cheaper but still rather effective method. It hurts my heart that the best method wasn't used, but that's the way it is sometimes.

Eventually, as one of the architects of the system, I also spent time making sure the integration was correct. That took coding, much of it in C/C++/Python. Since there were a couple of bugs in some existing code, I spent about a week tracing code with gdb.

The whole thing took me about three months. Around 80% of my time was spent on data preparation and coding. The kind of machine learning you do in class does happen, but it only took me around two weeks to determine the best model, and I could have made those two weeks shorter by using more cores. Compared with the other tasks, the machine learning you do in class, which usually comes in the very nice form of "here is a training set, go train and evaluate it with the evaluation set," seldom appears in real life. Most of the time, you are the one who prepares the training and evaluation sets.

So if you happen to work on machine learning, do expect to work on tasks such as web crawling and scraping if you work on text processing, listening to thousands of waveforms if you work on speech or music processing, or watching videos you might not want to watch if you try to classify videos. That's machine learning in real life. If you also happen to be the one who decides which algorithm to use, yes, you will have some fun. If you happen to design a new algorithm, then you will have a lot of fun. But most of the time, practitioners need to handle issues which can be just… mundane. A task such as web crawling is certainly not as enjoyable as applying advanced mathematics to a problem. But these tasks are incredibly important, and they will take up most of your time, or your organization's as a whole.

Perhaps that's why you hear the term "data munging", or, in Bill Howe's class, "data jujitsu". It is a well-known skill, but it is not widely advertised and is unlikely to be seen as important. Yet in real life, such data processing skill is crucial. For example, in my case, if I didn't have progressively sized datasets, prototyping could have taken a long time, and I might have needed 4 to 5 times more experimental time to determine the best method. Of course, debugging is also slower if you only have one huge data set.

In short, data scientists and machine learning practitioners spend the majority of their time as data janitors. That has been a well-known phenomenon for a long time, but now that machine learning has become a thing, there is more awareness of it [2]. I think this is a good thing, because it allows better scheduling and division of labor when you manage a group of engineers on a machine learning task.

[1] I might do a review at a certain point.
[2] e.g. This NYT article.

Categories
Machine Learning

For the Not-So-Uninitiated: Review of Ng’s Coursera Machine Learning Class

I had heard about Prof. Andrew Ng's Machine Learning class for a long time. As MOOCs go, this is a famous one; you could say the class actually popularized the MOOC. Many people seem to have benefited from the class, and it has a ~70% positive rating. I have no doubt that Prof. Ng has done a good job of teaching non-data-scientists a lot of difficult concepts in machine learning.

On the other hand, if you are a more experienced practitioner of ML, i.e. someone like me who has worked in a subfield of the industry (eh, speech recognition…) for a while, would the class be useful for you?

I think the answer is yes for several reasons:

  1. You want to connect the dots: most of us work on a particular machine learning problem for a while, and it's easy to fall into the tunnel vision inherent to a certain type of machine learning. E.g., for a while, people thought that using 13 dimensions of MFCC was the norm in ASR. So if you learn machine learning through ASR, it's natural to think that feature engineering is not important. That couldn't be more wrong! If you look at what Kaggle winners write, most will tell you they spent the majority of their time engineering features. So learning machine learning from the ground up gives you a new perspective.
  2. You want to learn the language of machine learning properly: one thing I found useful about Ng's class is that it doesn't assume you know everything (unlike many postgraduate-level classes). E.g., I found that Ng's explanation of the terms bias vs. variance makes a lot of sense, because the terms have to be interpreted differently to make sense. Before his class, I always had to conjure up the equations of bias and variance in my head. True, it's more elegant that way, but for the most part an intuitive feeling is more crucial at work.
  3. You want to practice: suppose you are like me and have been focusing on one area in ASR; in my case, I spent a good portion of my time just working on the codebase of the in-house engine. Chances are you will lack opportunities to train yourself on other techniques. E.g., I had never implemented linear regression (a one-liner) or logistic regression before. So this class gives you an opportunity to play with this stuff hands-on (see the sketch after this list).
  4. Your knowledge is outdated: you might have learned pattern recognition or machine learning back in school, but the technology has changed and you want to keep up. I think Ng's class is a good starter class. There are more difficult ones, such as Hinton's Neural Networks for Machine Learning, the Caltech class by Prof. Yaser Abu-Mostafa, or the CMU class by Prof. Tom Mitchell. If you are already proficient, yes, maybe you should jump to those first.

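On point 3, here is a minimal sketch of what the linear regression "one-liner" could look like, solved with the normal equation. This is my own illustration, not code from the class.

```python
# Linear regression via the normal equation: theta = (X^T X)^-1 X^T y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 examples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

Xb = np.hstack([np.ones((100, 1)), X])             # prepend a bias column
theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y       # the "one-liner"
print(theta)                                       # roughly [0, 1.5, -2.0, 0.5]
```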
So this is how I see Ng's class. It is deliberately simple and leans toward the practical side. Math is minimal and calculus is nada. There is no deep learning, and you don't have to implement an algorithm to train an SVM. There is none of the latest stuff such as random forests and gradient boosting. But it's a good starter class, and it gives you a good warm-up if you haven't studied for a while.

Of course, this also points to the downsides of the class: there are just too many practical techniques which are not covered. For example, once you work on a few machine learning problems, you will notice that an SVM with an RBF kernel is not the most scalable option; random forests and gradient boosting are usually a better choice. And even when using an SVM, a linear kernel with the right package (such as pegasus-ml) gives you a much faster run. In practice, that can be the difference between delivering and not delivering. So this is what Ng's class is lacking: it doesn't cover many important modern techniques.

In a way, you should see it as your first machine learning class. The realistic expectation is that you will need to keep on learning. (Isn't that true of everything?)

Issues aside, I feel very grateful to be learning something new in machine learning again. I took my last ML class back in 2002, and the landscape of the field was so different back then. For that, let's thank Prof. Ng! And happy learning.

Arthur

Postscript at 2017 April

Since taking this first Coursera class, I have taken several other classes, such as Dragomir Radev's NLP and, perhaps more interesting to you, Hinton's Neural Networks for Machine Learning. You can find my reviews at the following hyperlinks:

Radev’s Coursera Introduction to Natural Language Processing – A Review

A Review on Hinton’s Coursera “Neural Networks and Machine Learning”

I also have a mind to write a review for the complete beginner in machine learning, so stay tuned! 🙂

(20151112) Edit: tunnel effects -> tunnel vision.   Fixed some writing issues.
(20170416) In the process of organizing my articles.  So I do some superficial edits.

Reference:

Andrew Ng’s Coursera Machine Learning Class : https://www.coursera.org/learn/machine-learning/home/welcome

Geoff Hinton’s Neural Networks for Machine Learning:  https://www.coursera.org/course/neuralnets

The Caltech class: https://work.caltech.edu/telecourse.html

The CMU class: http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

 

 

Categories
Classification Machine Learning Regression

Gradient Descent For Logistic Regression

I have been binge-watching (no kidding) all the videos from Andrew Ng's Coursera ML class. Maybe I will write a review at some point. In short, I highly recommend that anyone who works in data science and machine learning go through the class and spend some time finishing the homework step by step.

What I want to talk about, though, is an interesting mathematical equation you can find in the lectures, namely the gradient descent update for logistic regression. You might notice that the gradient descent updates for both linear regression and logistic regression have the same form in terms of the hypothesis function, i.e.

$latex \theta_j := \theta_{j} - \alpha \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}……(1)$

The notation follows Prof. Ng's lectures at Coursera. You can also find the lecture notes here.

So why is that the case? In a nutshell, it has to do with how the cost function $latex J(\theta)$ was constructed. But let us back up and do some simple calculus exercises to see how the update equation can be derived.

In general, updating the parameter $latex \theta_j$ with gradient descent follows

$latex \theta_j := \theta_{j} - \alpha \frac{\partial J(\theta)} {\partial \theta_j}……(2)$

So we first consider linear regression with hypothesis function,

$latex H_{\theta}(\pmb{x}) = \theta^T \pmb{x}……(3)$

and cost function,

$latex J(\theta) = \frac{1}{2}\sum_{i=1}^M (H_{\theta}(\pmb{x}^{(i)})- y^{(i)})^2……(4)$.

So….

$latex \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^M \frac{\partial J(\theta)}{\partial H_{\theta}(\pmb{x}^{(i)})} \frac{\partial H_{\theta}(\pmb{x}^{(i)})}{\partial \theta_j} $

$latex = \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}$ for $latex j = 1 \ldots N$

So we arrive at update equation (1).

Before we go on, notice that in our derivation for linear regression we used the chain rule to simplify. Many of these super-long expressions can be simplified much more easily if you happen to know the trick.

So how about logistic regression? The hypothesis function of logistic regression is
$latex H_{\theta}(\pmb{x}) = g(\theta^T \pmb{x})……(5)$

where $latex g(z)$ is the sigmoid function

$latex g(z) = \frac{1}{1+e^{-z}}…… (6)$.

as we can plot in Diagram 1.

Diagram 1: A sigmoid function

The sigmoid function is widely used in engineering and science. For our discussion, here's one very useful property:

$latex \frac{dg(z)}{dz} = g(z) (1 – g(z)) …… (7)$

Proof:
$latex \frac{dg(z)}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}}$
$latex = -(\frac{-e^{-z}}{(1+e^{-z})^2})$
$latex = \frac{e^{-z}}{(1+e^{-z})^2}$
$latex = g(z)(1-g(z))$

as $latex 1-g(z) = \frac{e^{-z}}{1+e^{-z}}$.
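If you prefer a numerical sanity check to algebra, here is a tiny sketch (my own addition, not part of the lecture) that compares property (7) against a central-difference estimate of the derivative.

```python
# Numerical check that g'(z) = g(z) * (1 - g(z)).
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # central-difference derivative
analytic = g(z) * (1 - g(z))                      # property (7)
print(np.max(np.abs(numeric - analytic)))         # tiny (~1e-10), as expected
```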

Now that we have all the tools, let's go ahead and calculate the gradient of the logistic regression cost function, which is defined as

$latex J(\theta) = \sum_{i=1}^M \lbrack -y^{(i)}\log H_{\theta}(x^{(i)})-(1-y^{(i)})\log (1- H_{\theta}(x^{(i)}))\rbrack$

The gradient is

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M \lbrack -y^{(i)} \frac{H'_{\theta}(x^{(i)})}{H_{\theta}(x^{(i)})} + (1- y^{(i)}) \frac{H'_{\theta}(x^{(i)})}{1-H_{\theta}(x^{(i)})}\rbrack ……(8)$

So, making use of Equation (7) and the chain rule, the derivative of the hypothesis w.r.t. $latex \theta_k$ is

$latex H'_{\theta}(x^{(i)}) = H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} ……(9)$

Substitute (9) into (8),

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M -y^{(i)}\frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{H_{\theta}(x^{(i)})} + \sum_{i=1}^M (1- y^{(i)}) \frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{1-H_{\theta}(x^{(i)})}$
$latex = \sum_{i=1}^M\lbrack -y^{(i)} (1-H_{\theta}(x^{(i)}))x_k^{(i)} \rbrack + \sum_{i=1}^M\lbrack (1- y^{(i)}) H_{\theta}(x^{(i)})x_k^{(i)} \rbrack$
$latex = \sum_{i=1}^M \lbrack -y^{(i)} + y^{(i)} H_{\theta}(x^{(i)}) + H_{\theta}(x^{(i)}) - y^{(i)} H_{\theta}(x^{(i)}) \rbrack x_k^{(i)}$

As you may have observed, the second and the fourth terms cancel out. So we end up with:

$latex \frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M (H_{\theta}(x^{(i)}) -y^{(i)})x_k^{(i)}$,

which, substituted into (2), brings us back to update rule (1).

This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive at the same update rule. What we should appreciate is that the design of the cost function is part of the reason why such a "coincidence" happens. That's also why I appreciate Ng's simple lectures: they use a set of derivations that eases beginners into machine learning.
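To make the result concrete, here is a minimal NumPy sketch of the batch update we just derived. It is my own illustration rather than code from the class, and it scales the gradient by 1/M, a common convention that is not in update rule (1).

```python
# Batch gradient descent for logistic regression, following the derivation above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iters=5000):
    """X: (M, N) design matrix with a bias column; y: (M,) labels in {0, 1}."""
    M, N = X.shape
    theta = np.zeros(N)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)        # H_theta(x^(i)) for every example i
        grad = X.T @ (h - y)          # sum_i (H_theta(x^(i)) - y^(i)) x^(i)
        theta -= alpha / M * grad     # gradient descent step, cf. update rule (1)
    return theta

# Toy usage: two Gaussian blobs, classes centred at -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
X = np.hstack([np.ones((200, 1)), X])             # prepend the bias column
y = np.array([0] * 100 + [1] * 100)
print(fit_logistic(X, y))                         # weights separating the blobs
```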

Arthur

Reference:
Prof. Ng's lectures: https://www.coursera.org/learn/machine-learning

Property of sigmoid function can be found at Wikipedia page: https://en.wikipedia.org/wiki/Sigmoid_function

Many linear-separator-based machine learning algorithms can be trained using simple gradient descent. Take a look at Chapter 5 of Duda, Hart and Stork's Pattern Classification.

Categories
Adria Richards browsers DARPA gcc Machine Learning

Friday’s Readings

Geeky:

GCC 4.8.0 released
Browser War Revisited
DARPA wants unique automated tools to rapidly make computers smarter

Non-Geeky:
Just As CEO Heins Predicted, BlackBerry World Now Plays Home To Over 100,000 Apps
Apple updates Podcasts app with custom stations, on-the-go playlists and less ‘skeuomorphic’ design

The whole PyCon2013’s Fork the Dongle business:

The story:
‘Sexist joke’ web developer whistle-blower fired (BBC) and then……

Breaking: Adria Richards fired by SendGrid for calling out developers on Twitter

Different views:

From someone who worked with Richards before: Adria Richards, PyCon, and How We All Lost
The apology from PlayHaven’s developer: Apology from the developer
Rachel-Sklar from BI: Rachel-Sklar takes
Someone thinks this is a good time to sell their T-shirt: Fork My Dongle T-Shirt
Is PyCon2013 so bad? (Short answer: no) What really happened at PyCon 2013

Your view:
POLL: PYCON, PLAYHAVEN, ANONYMOUS, ADRIA RICHARDS AND ONLINE SEXISM. WHERE DID IT ALL GO WRONG?

Frankly, if you want to support women in our industry, donate to this awesome 9-year-old.
9 Year Old Building an RPG to Prove Her Brothers Wrong!

Arthur