Experience in Real-Life Machine Learning

I have been refreshing myself on the general topic of machine learning.   Mostly motivated by job requirements as well as my own curiosity.   That's why you saw my review post on the famed Andrew Ng's class.   And I have been taking the Dragomir Radev's NLP class, as well as the Machine Learning Specialization by Emily Fox and Carlos Guestrin [1].   When you are at work, it's tough to learn.  But so far, I managed to learn something from each class and was able to apply them in my job.

So, one question you might ask is how applicable are on-line or even  university machine learning courses in real life?     Short answer, they are quite different. Let me try to answer this question by giving an example that come up to me recently.

It is a gender detection task based on voice.  This comes up at work and I was tasked to improve the company's existing detector.   For the majority of the my time, I tried to divide the data set, which has around 1 million data point to train/validation/test sets.   Furthermore,  from the beginning of the task I decided to create sets of dataset with increasing size.  For example, 2k, 5k, 10k..... and up to 1 million.     This simple exercise, done mostly in python, took me close to a week.

Training, aka the fun part, was comparatively short and anti-climatic.  I just chose couple of well-known methods in the field.    And test on the progressively sized data set.  Since prototyping a system is so easy,  I was able to weed out weaker methods very early and come up with a better system.    I was able to get high relative performance gain.  Before I submitted the system to my boss, I also worked out an analysis of why the system doesn't give 100%.   No surprise.  it turns out volume of the speech matters, and some individual of the world doesn't like their sex stereotypes.    But so far the tasks are still quite well-done because we get better performance as well as we know why certain things don't work well.   Those are good knowledge in practice.

One twist here, after finishing the system, I found that the method which gives the best classification performance doesn't give the best speed performance.   So I decided to choose a cheaper but still rather effective method.    It hurts my heart to see the best method wasn't used but that's the way it is sometimes.

Eventually, as one of the architects of the system, I also spent time to make sure integration is correct.   That took coding, much of it was done in C/C++/python.  Since there were couple of bugs from some existing code,  I was spending about a week to trace code with gdb.

The whole thing took me about three months.  Around 80% of my time was spent on data preparation and  coding.  Machine learning you do in class happens, but it only took me around 2 weeks to determine the best model.   I could make these 2 weeks shorter by using more cores. But compare to other tasks,  the machine learning you do in class, which is usually in the very nice form, "Here is a training set, go train and evaluate it with evaluation set.",  seldom appears in real-life.  Most of the time, you are the one who prepare the training and evaluation set.

So if you happen to work on machine learning, do expect to work on tasks such as web crawling and scraping if you work on text processing,  listen to thousands of waveforms if you work on speech or music processing,  watch videos that you might not like to watch if you try to classify videos.   That's machine learning in real-life.   If you happen to be also the one who decide which algorithm to use, yes, you will have some fun.   If you happen to design a new algorithm. then you will have a lot of fun.  But most of the time, practitioners need to handle issues, which can just be .... mundane.   Tasks such as web crawling, is certainly not as enjoyable as to apply advanced mathematics to a problem.   But they are incredibly important and they will take up most of time of your or your organization as a whole.

Perhaps that's why you heard of the term "data munging"  or in Bill Howe's class: "data jujitsu".   This is a well-known skill but not very advertised and unlikely to be seen as important.    But in real-life, such data processing skill is crucial.   For example, in my case, if I didn't have a progressive sized datasets, prototyping could take a long time.  And I might need to spend 4 to 5 times more experimental time to determine what the best method is.    Of course, debugging will also be slower if you only have a huge data set.

In short, data scientists and machine learning practitioners spend majority of their time as data janitors.   I think that's a well-known phenomenon for a long time.  But now as machine learning become a thing,  there are more awareness [2].  I think this is a good thing because it helps better scheduling and division of labors if you want to manage a group of engineers in a machine learning task.

[1] I might do a review at a certain point.
[2] e.g. This NYT article.

I was binge watching (no kidding) all videos from Andrew Ng's Coursera ML class.   May be I want to write a review at a certain point.  In short, it is highly recommendable for anyone who works in data science and machine learning to go through the class and spend some time to finish the homework step-by-step.

What I want to talk about though is an interesting mathematical equation you can find in the lecture, namely the gradient descent update or logistic regression.   You might notice that gradient descents for both linear regression and logistic regression have the same form in terms of the hypothesis function.  i.e.

$\theta_j := \theta_{j} - \alpha \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}......(1)$

Notation can be found in Prof. Ng's lecture at Coursera.  Also you can find the lecture notes at here.

So why is it the case then? In a nutshell, it has to do with how the cost function $\latex J(\theta)$ was constructed.  But let us back up and do some simple Calculus exercises on how the update equation can be derived.

In general, updating the parameter $\theta_j$ with gradient descent follows

$\theta_j := \theta_{j} - \alpha \frac{\partial J(\theta)} {\partial \theta}......(2)$

So we first consider linear regression with hypothesis function,

$H_{\theta}(\pmb{x}) = \theta^T \pmb{x}......(3)$

and cost function,

$J(\theta) = \frac{1}{2}\sum_{i=1}^M (H_{\theta}(\pmb{x}^{(i)})- y^{(i)})^2......(4)$.

So....

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial J(\theta)}{\partial H_{\theta}(\pmb{x}^{(i)})} \frac{\partial H_{\theta}(\pmb{x}^{(i)})}{\partial \theta_j}$

$= \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)} for k = 1 \ldots N$

So we arrive update equation (1).

Before we go on notice that in our derivation for linear regression, we use chain rule to simplify. Many of these super long expressions can be simplified much more easily if you happen to know the trick.

So how about logistic regression? The hypothesis function of logistic regression is
$H_{\theta}(\pmb{x}) = g(\theta^T \pmb{x})......(5)$

where $g(z)$ is the sigmoid function\$

$g(z) = \frac{1}{1+e^{-z}}...... (6)$.

as we can plot in Diagram 1.

Sigmoid function is widely used in engineering and science. For our discussion, here's one very useful property:

$\frac{dg(z)}{dz} = g(z) (1 - g(z)) ...... (7)$

Proof:
$\frac{dg(z)}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}}$
$= -(\frac{-e^{-z}}{(1+e^{-z})^2})$
$= \frac{e^{-z}}{(1+e^{-z})^2}$
$= g(z)(1-g(z))$

as $1-g(z) = \frac{e^{-z}}{1+e^{-z}}$.

Now we have all the tools, let's go forward to calculate the gradient term for the logistic regression cost function, which is defined as,

$J(\theta) = \sum_{i=1}^M \lbrack -y^{(i)}log H_{\theta}(x^{(i)})-(1-y^{(i)}log (1- H_\theta(x^{i}))\rbrack$

$\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M \lbrack -y^{(i)} \frac{H'_{\theta}(x^{(i)})}{H_{\theta}(x^{(i)})} + (1- y^{(i)}) \frac{H'_{\theta}(x^{(i)})}{1-H_{\theta}(x^{(i)})}\rbrack ......(8)$

So making use of Equation (7) and chain rule, the gradient w.r.t $\theta_k$:

$H'_{\theta}(x^{(i)}) = H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} .....(9)$

Substitute (9) into (8),

$\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M -y^{(i)}\frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{H_{\theta}(x^{(i)})} + \sum_{i=1}^M (1- y^{(i)}) \frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{1-H_{\theta}(x^{(i)})}$
$= \sum_{i=1}^M\lbrack -y^{(i)} (1-H_{\theta}(x^{(i)}))x_k^{(i)} \rbrack + \sum_{i=1}^M\lbrack (1- y^{(i)}) H_{\theta}(x^{(i)})x_k^{(i)} \rbrack$
$= x_k^{(i)} \lbrack \sum_{i=1}^M (-y^{(i)} + y^{(i)} H_{\theta}(x^{(i)}) + \sum_{i=1}^M (H_{\theta}(x^{(i)}) - y^{(i)} H_{\theta}(x^{(i)})) \rbrack$

As you may observed, the second and the fourth term cancel out. So we end up having:

$\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M (H_{\theta}(x^{(i)}) -y^{(i)})x_k^{(i)}$,

which brings us back update rule (2).

This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive the same update rule. What we should appreciate is that the design of the cost function is part of the reasons why such "coincidence" happens. But that's why I appreciate Ng's simple lecture. It is using a set of derivation which brings beginners into machine learning more easily.

Arthur

Reference:
Profession Ng's lectures : https://www.coursera.org/learn/machine-learning