The math of machine learning largely boils down to probability theory, calculus, and linear algebra/matrix analysis. The gem that mesmerizes me most is the so-called “probability integral”, a term used by Prof. Paul J. Nahin, author of Inside Interesting Integrals [1]. You might also meet it in a probability theory or random processes class:
$latex F=\int_{-\infty}^{\infty}e^{\frac{-x^2}{2}}dx$
Of course, this is related to the Gaussian distribution. When I first met the probability integral, I learned the now-standard trick of “polar integration” [2]: we first form a double integral, then transform the integrand to polar coordinates. Here is the detail. Since the integrand $latex e^{\frac{-x^2}{2}}$ is an even function, we only need to consider
$latex I=\int_{0}^{\infty}e^{\frac{-x^2}{2}}dx$
$latex I^2=\int_{0}^{\infty} e^{\frac{-x^2}{2}}dx \int_{0}^{\infty} e^{\frac{-y^2}{2}}dy=\int_{0}^{\infty}\int_{0}^{\infty}e^{\frac{-x^2}{2}}e^{\frac{-y^2}{2}}dxdy=\int_{0}^{\infty}\int_{0}^{\infty}e^{\frac{-(x^2+y^2)}{2}}dxdy$
Then let $latex x=r\cos\theta$ and $latex y=r\sin\theta$. The Jacobian is
$latex \frac{\partial(x,y)}{\partial(r,\theta)}=\begin{vmatrix}\cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta\end{vmatrix} = r(\cos^2\theta+\sin^2\theta) = r$.
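If you want to double-check this determinant, here is a minimal symbolic sketch using sympy (my own verification, not from the references):

```python
# Symbolically verify the Jacobian determinant of the polar coordinate map.
import sympy as sp

r, theta = sp.symbols('r theta', positive=True)
x = r * sp.cos(theta)
y = r * sp.sin(theta)

# Jacobian matrix d(x, y)/d(r, theta)
J = sp.Matrix([[sp.diff(x, r), sp.diff(x, theta)],
               [sp.diff(y, r), sp.diff(y, theta)]])
print(sp.simplify(J.det()))  # prints: r
```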
Substituting into $latex I^2$ above, with the first quadrant mapping to $latex 0 \le \theta \le \frac{\pi}{2}$ and $latex 0 \le r < \infty$:
$latex I^2=\int_{0}^{\frac{\pi}{2}}\int_{0}^{\infty}re^{\frac{-r^2}{2}}drd\theta$
$latex =\int_{0}^{\frac{\pi}{2}}\left(\left.-e^{\frac{-r^2}{2}}\right|_{0}^{\infty}\right)d\theta=\int_{0}^{\frac{\pi}{2}} (0-(-1))\,d\theta=\int_{0}^{\frac{\pi}{2}}d\theta= \frac{\pi}{2},$
or $latex I = \sqrt{\frac{\pi}{2}}.$
So $latex F = 2\sqrt{\frac{\pi}{2}} = \sqrt{2\pi}$.
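If you prefer a numerical sanity check, here is a minimal sketch using scipy.integrate.quad (my own verification, not part of the textbook derivation):

```python
# Numerically confirm that the probability integral equals sqrt(2*pi).
import numpy as np
from scipy.integrate import quad

F, _ = quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
print(F)                   # ~2.5066282746
print(np.sqrt(2 * np.pi))  # ~2.5066282746
```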
This is a well-known result; you can find it in almost every introductory book on probability.
This derivation is a trick. Quite frankly, without reading the textbook, I wouldn’t have had the imagination to come up with such a clever way to evaluate the integral. Perhaps that’s why Prof. Nahin wrote in Inside Interesting Integrals [4]:
THIS IS ABOUT AS CLOSE TO A MIRACLE AS YOU’LL GET IN MATHEMATICS.
In Solved Problems in Analysis [5], Orin J. Farrell evaluated a similar integral,
$latex F_{1}=\int_{0}^{\infty}e^{-x^2}dx.$
Again he used the polar coordinate transform, and he remarked, “Indeed, its discoverer was surely a person with great mathematical ingenuity.” I concur. While the whole proof uses only elementary calculus known to undergraduates, it takes quite a bit of imagination to come up with such a method. It’s easy for us to apply the trick, but the originator must have been very smart.
I also think this is one of the key calculus tricks to learn if you want to tackle more daunting distributions such as the gamma, beta, or Dirichlet distributions. So let me give several quick notes on the proof above:
- You can use a similar idea to prove that the famous gamma and beta functions are related by $latex \Gamma(a)\Gamma(b)=\Gamma(a+b)B(a,b)$. There are many proofs of this fundamental relationship; e.g., PRML [3] ex. 2.6 outlines one. But I found that first transforming the gamma function with $latex x=y^2$ and then following the polar coordinate trick above is the easiest way to prove the relationship (see the numerical check after this list).
- As mentioned before, the same trick can also be used to calculate $latex \Gamma(\frac{1}{2})$ (also from [5]).
- The polar coordinate trick is not the only way to calculate the probability integral. See [6] for the history of its derivation.
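Here is the numerical check promised in the first note, a minimal sketch with scipy.special (the test values of $latex a$ and $latex b$ are arbitrary choices of mine):

```python
# Numerically check Gamma(a)*Gamma(b) = Gamma(a+b)*B(a,b) and Gamma(1/2) = sqrt(pi).
import numpy as np
from scipy.special import beta, gamma

a, b = 2.5, 4.0  # arbitrary positive test values
print(gamma(a) * gamma(b))         # these two lines print the same number
print(gamma(a + b) * beta(a, b))
print(gamma(0.5), np.sqrt(np.pi))  # both ~1.7724538509
```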
Each of these points deserves a post of its own, so let’s say I owe you some more posts.
References:
[1] Paul J. Nahin, Inside Interesting Integrals.
[2] Alberto Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd Edition.
[3] Christopher M. Bishop, Pattern Recognition and Machine Learning.
[4] Paul J. Nahin, Inside Interesting Integrals, p. 4.
[5] Orin J. Farrell and Bertram Ross, Solved Problems in Analysis, p. 11.
[6] Author unknown, The Probability Integral. http://www.york.ac.uk/depts/maths/histstat/normal_history.pdf