I was binge-watching (no kidding) all the videos from Andrew Ng's Coursera ML class. Maybe I will write a review at some point. In short, I highly recommend that anyone who works in data science and machine learning go through the class and spend some time finishing the homework step by step.

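What I want to talk about, though, is an interesting mathematical equation you can find in the lectures, namely the gradient descent update for logistic regression. You might notice that the gradient descent updates for linear regression and logistic regression have the same form in terms of the hypothesis function, i.e.

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad (1)$$

for linear regression, and

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad (2)$$

for logistic regression. Only the hypothesis function $h_\theta$ differs between the two.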

So why is that the case? In a nutshell, it has to do with how the cost function $J(\theta)$ was constructed. But let us back up and do some simple calculus exercises to see how the update equation can be derived.

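In general, updating the parameter $\theta_j$ with gradient descent follows

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta),$$

where $\alpha$ is the learning rate.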

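So we first consider linear regression with hypothesis function

$$h_\theta(x) = \theta^T x$$

and cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$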

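So

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( \theta^T x^{(i)} - y^{(i)} \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}.
\end{aligned}$$

Substituting this into the general gradient descent update, we arrive at update equation (1).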
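One way to convince yourself of this derivative is a numerical check. The following is a minimal Python sketch, not code from the course: the function names (`h_linear`, `cost_linear`, `grad_linear`, `numeric_grad`) and the random toy data are made up for illustration. It compares the analytic gradient $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ against a central finite-difference estimate of $\partial J / \partial \theta_j$.

```python
import random


def h_linear(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost_linear(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(X)
    return sum((h_linear(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def grad_linear(theta, X, y):
    """Analytic gradient: (1/m) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    m = len(X)
    return [
        sum((h_linear(theta, xi) - yi) * xi[j] for xi, yi in zip(X, y)) / m
        for j in range(len(theta))
    ]


def numeric_grad(cost, theta, X, y, eps=1e-6):
    """Central finite-difference approximation of the gradient of `cost`."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: the first feature is the constant intercept term 1.
random.seed(0)
X = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20)]
y = [random.uniform(-1, 1) for _ in range(20)]
theta = [0.5, -0.3, 0.8]

analytic = grad_linear(theta, X, y)
numeric = numeric_grad(cost_linear, theta, X, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numeric gradient:", max_diff)
```

If the derivation is right, the printed difference should be tiny, limited only by floating-point rounding.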

Before we go on, notice that in our derivation for linear regression we used the chain rule to simplify. Many of these long expressions can be simplified much more easily if you happen to know the trick.

So how about logistic regression? The hypothesis function of logistic regression is

$$h_\theta(x) = g(\theta^T x),$$

where $g$ is the sigmoid function

$$g(z) = \frac{1}{1 + e^{-z}},$$

which is plotted in Diagram 1.

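The sigmoid function is widely used in engineering and science. For our discussion, here's one very useful property:

$$\frac{d}{dz} g(z) = g(z) \left( 1 - g(z) \right). \qquad (7)$$

Proof:

$$\begin{aligned}
\frac{d}{dz} g(z)
&= \frac{d}{dz} \left( 1 + e^{-z} \right)^{-1} \\
&= \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} \\
&= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\
&= g(z) \left( 1 - g(z) \right),
\end{aligned}$$

as $1 - g(z) = \frac{e^{-z}}{1 + e^{-z}}$.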
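The property is easy to check numerically. Here is a small Python sketch under my own naming (`sigmoid`, `sigmoid_prime`, `central_diff` and the sample points are illustrative choices, not from the lecture) that compares $g(z)(1 - g(z))$ with a finite-difference estimate of $g'(z)$.

```python
import math


def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative via the property g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_diff(f, z, eps=1e-5):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


# Arbitrary sample points chosen for illustration.
points = [-4.0, -1.0, 0.0, 0.5, 3.0]
errors = [abs(sigmoid_prime(z) - central_diff(sigmoid, z)) for z in points]
print("largest error over sample points:", max(errors))
```

The two should agree closely at every point, which is what the proof predicts.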

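Now that we have all the tools, let's go on to calculate the gradient term for the logistic regression cost function, which is defined as

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta(x^{(i)}) \right) \right].$$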

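The gradient is

$$\frac{\partial}{\partial \theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \right]. \qquad (8)$$

Making use of Equation (7) and the chain rule, the gradient of the hypothesis w.r.t. $\theta_j$ is

$$\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x_j^{(i)}. \qquad (9)$$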

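Substituting (9) into (8),

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left( 1 - h_\theta(x^{(i)}) \right) - \left( 1 - y^{(i)} \right) h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)}.
\end{aligned}$$

As you may have observed, the second and the fourth terms cancel out. So we end up with

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},$$

which brings us back to update rule (2).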
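The same kind of numerical check works for the logistic case. The sketch below is illustrative only: the helper names (`h_logistic`, `cost_logistic`, `grad_logistic`, `numeric_grad`) and the random toy data are my assumptions, not anything from the course. It evaluates the cross-entropy cost directly and compares its finite-difference gradient with $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$.

```python
import math
import random


def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h_logistic(theta, x):
    """Logistic hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost_logistic(theta, X, y):
    """Cross-entropy cost J(theta) = -(1/m) sum [y log h + (1 - y) log(1 - h)]."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = h_logistic(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m


def grad_logistic(theta, X, y):
    """Analytic gradient: (1/m) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    m = len(X)
    return [
        sum((h_logistic(theta, xi) - yi) * xi[j] for xi, yi in zip(X, y)) / m
        for j in range(len(theta))
    ]


def numeric_grad(cost, theta, X, y, eps=1e-6):
    """Central finite-difference approximation of the gradient of `cost`."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: intercept feature plus two random features, labels in {0, 1}.
random.seed(1)
X = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20)]
y = [float(random.randint(0, 1)) for _ in range(20)]
theta = [0.2, -0.4, 0.6]

analytic = grad_logistic(theta, X, y)
numeric = numeric_grad(cost_logistic, theta, X, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numeric gradient:", max_diff)
```

Note that `grad_logistic` never differentiates the logarithms explicitly. The cancellation in the derivation is what lets the simple "error times feature" form stand in for the full expression.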

This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive at the same update rule. What we should appreciate is that the design of the cost function is part of the reason why such a "coincidence" happens. That's also why I appreciate Ng's simple lecture: it uses a set of derivations that brings beginners into machine learning more easily.
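To make the shared form concrete, here is a rough Python sketch in which a single update function serves both models and only the hypothesis is swapped. Everything in it is an assumption for illustration: the names `gradient_step`, `linear` and `logistic`, the tiny data set, the learning rate of 0.1 and the iteration count.

```python
import math


def gradient_step(theta, X, y, hypothesis, alpha):
    """One update theta_j := theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(X)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        tj - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
        for j, tj in enumerate(theta)
    ]


def linear(theta, x):
    """Linear regression hypothesis theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def logistic(theta, x):
    """Logistic regression hypothesis g(theta^T x)."""
    return 1.0 / (1.0 + math.exp(-linear(theta, x)))


# Hypothetical toy data; the first feature is the constant intercept term 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_reg = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
y_cls = [0.0, 0.0, 1.0, 1.0]  # class 1 when x >= 2

theta_reg = [0.0, 0.0]
theta_cls = [0.0, 0.0]
for _ in range(5000):
    # Identical update code; only the hypothesis function differs.
    theta_reg = gradient_step(theta_reg, X, y_reg, linear, alpha=0.1)
    theta_cls = gradient_step(theta_cls, X, y_cls, logistic, alpha=0.1)

print("linear regression parameters:", theta_reg)
print("logistic predictions:", [logistic(theta_cls, xi) for xi in X])
```

Passing the hypothesis in as an argument is just one way to express the point of this post: the update rule itself does not care which of the two models it is training.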

Properties of the sigmoid function can be found on its Wikipedia page: https://en.wikipedia.org/wiki/Sigmoid_function

Many linear separator-based machine learning algorithms can be trained using simple gradient descent. Take a look at Chapter 5 of Duda, Hart and Stork's Pattern Classification.

Speech Recognition, Machine Learning, and Random Musing of Arthur Chan