
Resources for Implementing Linear Regression Based On Normal Equation

These resources are about implementing, not just understanding.

Implementation in C

(Less related, but equally interesting: the non-linear regression description.)


  • Regression by Bingham and Fry
  • Matrix Algebra by Abadir and Magnus

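As a quick sketch of what those resources implement: the normal equation solves least squares in closed form as \theta = (X^T X)^{-1} X^T y. A minimal version in Python with NumPy (the function name and toy data are my own, not from the linked C implementation):

```python
import numpy as np

def normal_equation(X, y):
    """Least squares via the normal equation: theta = (X^T X)^{-1} X^T y.

    Uses np.linalg.solve on (X^T X) theta = X^T y instead of forming an
    explicit inverse, which is cheaper and numerically more stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Fit y = 1 + 2x on noiseless data; theta should come out as [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # prepend a bias column
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)
print(theta)  # → [1. 2.]
```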

Gradient Descent For Logistic Regression

I was binge-watching (no kidding) all the videos from Andrew Ng's Coursera ML class.  Maybe I will write a review at some point.  In short, it is highly recommendable for anyone who works in data science and machine learning to go through the class and spend some time finishing the homework step by step.

What I want to talk about, though, is an interesting mathematical equation you can find in the lecture, namely the gradient descent update for logistic regression.   You might notice that the gradient descent updates for both linear regression and logistic regression have the same form in terms of the hypothesis function, i.e.

\theta_j := \theta_{j} - \alpha \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)}......(1)

Notation can be found in Prof. Ng's lectures at Coursera.  You can also find the lecture notes here.

So why is that the case? In a nutshell, it has to do with how the cost function J(\theta) was constructed.  But let us back up and do some simple calculus exercises to see how the update equation can be derived.

In general, updating the parameter \theta_j with gradient descent follows

\theta_j := \theta_{j} - \alpha \frac{\partial J(\theta)} {\partial \theta_j}......(2)

So we first consider linear regression with hypothesis function,

H_{\theta}(\pmb{x}) = \theta^T \pmb{x}......(3)

and cost function,

J(\theta) = \frac{1}{2}\sum_{i=1}^M (H_{\theta}(\pmb{x}^{(i)})- y^{(i)})^2......(4).


\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^M \frac{\partial J(\theta)}{\partial H_{\theta}(\pmb{x}^{(i)})} \frac{\partial H_{\theta}(\pmb{x}^{(i)})}{\partial \theta_j}

= \sum_{i=1}^M (H_{\theta} (\pmb{x}^{(i)}) - y^{(i)}) x_j^{(i)} for j = 1 \ldots N

So we arrive at update equation (1).
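Update equation (1) translates almost directly into code. A minimal batch gradient descent sketch in NumPy (the learning rate, iteration count, and toy data are my own choices):

```python
import numpy as np

def gradient_descent_linear(X, y, alpha=0.01, iters=5000):
    """Batch gradient descent for linear regression, per update equation (1)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = X @ theta - y          # H_theta(x^(i)) - y^(i), all i at once
        theta -= alpha * X.T @ residual   # simultaneous update of every theta_j
    return theta

# Same toy problem: fit y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # bias column plus feature
y = 1.0 + 2.0 * x
theta = gradient_descent_linear(X, y)
print(theta)  # converges toward [1, 2]
```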

Before we go on, notice that in our derivation for linear regression we used the chain rule to simplify. Many of these long expressions simplify much more easily once you know this trick.

So how about logistic regression? The hypothesis function of logistic regression is
H_{\theta}(\pmb{x}) = g(\theta^T \pmb{x})......(5)

where g(z) is the sigmoid function,

g(z) = \frac{1}{1+e^{-z}}...... (6).

as plotted in Diagram 1.

Diagram 1: A sigmoid function

The sigmoid function is widely used in engineering and science. For our discussion, here is one very useful property:

\frac{dg(z)}{dz} = g(z) (1 - g(z)) ...... (7)

\frac{dg(z)}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}}
= -(\frac{-e^{-z}}{(1+e^{-z})^2})
= \frac{e^{-z}}{(1+e^{-z})^2}
= g(z)(1-g(z))

as 1-g(z) = \frac{e^{-z}}{1+e^{-z}}.
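Property (7) is easy to sanity-check numerically: the sketch below compares g(z)(1-g(z)) against a central-difference approximation of the derivative (the grid of points and step size are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)   # property (7): g'(z) = g(z)(1 - g(z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
err = np.max(np.abs(numeric - sigmoid_grad(z)))
print(err)  # close to zero
```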

Now that we have all the tools, let's go ahead and calculate the gradient term for the logistic regression cost function, which is defined as

J(\theta) = \sum_{i=1}^M \lbrack -y^{(i)}\log H_{\theta}(x^{(i)})-(1-y^{(i)})\log (1- H_\theta(x^{(i)}))\rbrack

The gradient is

\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M \lbrack -y^{(i)} \frac{H'_{\theta}(x^{(i)})}{H_{\theta}(x^{(i)})} + (1- y^{(i)}) \frac{H'_{\theta}(x^{(i)})}{1-H_{\theta}(x^{(i)})}\rbrack ......(8)

Making use of Equation (7) and the chain rule, the derivative of the hypothesis w.r.t. \theta_k is:

H'_{\theta}(x^{(i)}) = H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} .....(9)

Substitute (9) into (8),

\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M -y^{(i)}\frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{H_{\theta}(x^{(i)})} + \sum_{i=1}^M (1- y^{(i)}) \frac{H_{\theta}(x^{(i)})(1-H_{\theta}(x^{(i)}))x_k^{(i)} }{1-H_{\theta}(x^{(i)})}
= \sum_{i=1}^M\lbrack -y^{(i)} (1-H_{\theta}(x^{(i)}))x_k^{(i)} \rbrack + \sum_{i=1}^M\lbrack (1- y^{(i)}) H_{\theta}(x^{(i)})x_k^{(i)} \rbrack
= \sum_{i=1}^M \lbrack -y^{(i)} + y^{(i)} H_{\theta}(x^{(i)}) + H_{\theta}(x^{(i)}) - y^{(i)} H_{\theta}(x^{(i)}) \rbrack x_k^{(i)}

As you may have observed, the second and the fourth terms cancel out. So we end up having:

\frac{\partial J(\theta)}{\partial\theta_k} = \sum_{i=1}^M (H_{\theta}(x^{(i)}) -y^{(i)})x_k^{(i)},

which, substituted into (2), brings us back to update equation (1).
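The derived update can also be coded directly; since it has the same form as the linear regression update, only the sigmoid wrapped around the hypothesis changes. A minimal sketch in NumPy (the toy data, learning rate, and iteration count are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for logistic regression.

    Same update form as linear regression, but with H_theta(x) = g(theta^T x).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = sigmoid(X @ theta) - y   # H_theta(x^(i)) - y^(i)
        theta -= alpha * X.T @ residual     # simultaneous update of every theta_k
    return theta

# Toy 1-D data: class 0 below x = 2.5, class 1 above.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 2.5).astype(float)
theta = gradient_descent_logistic(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print(preds)  # matches y: the learned boundary separates the two classes
```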

This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive at the same update rule. What we should appreciate is that the design of the cost function is part of the reason why such a "coincidence" happens. And that's why I appreciate Ng's simple lectures: they use a set of derivations that brings beginners into machine learning more easily.


Professor Ng's lectures: https://www.coursera.org/learn/machine-learning

Properties of the sigmoid function can be found on its Wikipedia page: https://en.wikipedia.org/wiki/Sigmoid_function

Many linear separator-based machine learning algorithms can be trained using simple gradient descent. Take a look at Chapter 5 of Duda, Hart and Stork's Pattern Classification.