# Category: DNN

(Also see my full review of Course 1 and Course 2 here.)

Fellows, as you all know by now, Prof. Andrew Ng has started a new Coursera Specialization on Deep Learning. So many of you came to me today and asked for my take on the class. As a rule, I usually don’t comment on a class unless I know something about it. (Search for my “Learning Deep Learning – Top 5 Lists” for more details.) But I’d like to make an exception for the Good Professor’s class.

AIDL member Bob Akili asked (rephrased):

What is the Difference between Deep Learning and Machine Learning?

Usually I don’t write a full blog post to answer a member’s question. But what “deep” means is such a fundamental concept in deep learning, and there are many well-meaning but incorrect answers floating around. So I think it is a great idea to answer the question clearly and hopefully dispel some of the misconceptions as well. Here is a cleaned-up and expanded version of my comment on the thread.

# Deep Learning is Just a Subset of Machine Learning

First of all, deep learning is just a subset of machine learning techniques. You may have heard from many “Deep Learning Consultant” types that “deep learning is completely different from machine learning”. But when we talk about “deep learning” these days, we are really talking about “neural networks which have more than one hidden layer”. Since neural networks are just one type of ML technique, it doesn’t make any sense to call DL “different” from ML. It might work for marketing purposes, but the thought is clearly misleading.

# Deep Learning is a kind of Representation Learning

So now we know that deep learning is a kind of machine learning. We still can’t quite answer why it is special. So let’s be more specific: deep learning is a kind of representation learning. What is representation learning? Representation learning is the opposite of another school of thought and practice: feature engineering. In feature engineering, humans are supposed to hand-craft features to make the machine work better. If you have done Kaggle before, this should be obvious to you: sometimes you just want to manipulate the raw inputs and create new features to represent your data.

Yet in some domains which involve high-dimensional data such as images, speech or text, hand-crafting features was found to be very difficult. For example, using HOG-type approaches to do computer vision can take 4-5 years of a PhD student’s time. So here we come back to representation learning: can a computer automatically learn good features?

# What is a “Deep” Technique?

Now we come to why deep learning is “deep”. Usually we call a method “deep” when we are optimizing a **nested function** in the method. So for example, if you express such a function as a graph, you will find that it has multiple layers. *The term “deep” really describes such “nestedness”.* That should explain why we typically call any artificial neural network (ANN) with more than 1 hidden layer “deep”. Or, as the general saying goes, “deep learning is just a neural network which has more layers”.
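To make the “nestedness” concrete, here is a minimal sketch in plain Python. The layer sizes, weights and the names `make_layer`, `f1`, `f2`, `f3` are all made up for illustration; the only point is that the whole network is one function written as $f_1(f_2(f_3(x)))$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(weights, biases):
    """Return a function that maps an input vector x to sigmoid(W x + b)."""
    def layer(x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, biases)]
    return layer


# Hypothetical toy parameters, chosen only for illustration.
f3 = make_layer([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1])  # innermost: first hidden layer
f2 = make_layer([[1.0, -1.0], [0.3, 0.7]], [0.0, 0.2])   # second hidden layer
f1 = make_layer([[0.6, -0.4]], [0.05])                    # outermost: output layer


def network(x):
    # The "deep" model is literally a nested function.
    return f1(f2(f3(x)))


print(network([1.0, 2.0]))
```

Training such a model means optimizing the parameters of every layer at once through this nesting, which is where the chain rule (and hence back-propagation) comes in.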

(Another appropriate term is “hierarchical”. See footnotes [3] and [4] for more detail.)

This is also the point at which Karpathy’s cs231n shows you a multi-layer CNN in which features are automatically learned, from the simplest to more complex ones. Eventually the last layer can just separate the classes using a linear classifier, because the “deep” structure has learned the right features by the time it reaches the last layer. Note the key term here is “automatic”: all these Gabor-filter-like features are not hand-made. Rather, they are results of back-propagation [3].

# Is there Anything which is “Deep” but not a Neural Network?

Actually, there are plenty: deep Boltzmann machines, deep belief networks, deep Gaussian processes. They are still discussed under unsupervised learning with neural networks, but I have always found that knowledge of graphical models is more important for understanding them.

# So is Deep Learning also a Marketing Term?

Yes and no. It depends on who you talk to. If you talk with ANN researchers/practitioners, they would just tell you “deep learning is just a neural network which has more than 1 hidden layer”. Indeed, if you think from their perspective, the term “deep learning” could just be a short form. Yet as we just said, you can also call other methods “deep”. So the adjective is not totally void of meaning. But many people would also tell you that because “deep learning” has become such a marketing term, it can now mean many different things. I will say more in the next section.

Also the term “deep learning” has been around for decades. Check out Prof. Schmidhuber’s thread for more details.

# “No Way! X is not Deep but it is also taught in Deep Learning Class, You made a Horrible Mistake!”

I said it with much authority, and I know some of you guys would just jump in and argue:

“What about word2vec? It is nothing deep at all, but people still call it Deep learning!!!” “What about all wide architectures such as ‘wide-deep learning’?” “Arthur, You are Making a HORRIBLE MISTAKE!”

Indeed, the term “deep learning” is being abused these days. More learned people, on the other hand, are usually careful about calling certain techniques “deep learning”. For example, in the cs224d 2015/2016 lectures, Dr. Richard Socher was quite cautious about calling word2vec “deep”. His supervisor, Prof. Chris Manning, who is an authority in NLP, is known to dispute whether deep learning is always useful in NLP, simply because some recent advances in NLP are not really due to deep learning [1][2].

I think these cautions make sense. Part of it is that calling everything “deep learning” just blurs what really should be credited for a certain technical improvement. The other part is that we shouldn’t see deep learning as the only type of ML we want to study. There are many ML techniques, and some of them are more interesting and practical than deep learning. For example, deep learning is not known to work well in small-data scenarios. Would I just yell at my boss and say “*Because I can’t use deep learning, I can’t solve this problem*!”? No, I would just test out random forests, support vector machines, GMMs and all the nifty methods I have learned over the years.

# Misleading Claim About Deep Learning (I) – “Deep Learning is about Machine Learning Methods which use a lot of Data!”

So now we come to the arena of misconceptions. I am going to discuss two claims which many people have been drumming up about deep learning. Neither of them is the right answer to the question “What is the Difference between Deep Learning and Machine Learning?”

The first one you have probably heard all the time: “Deep Learning is about ML methods which use a lot of data”. Or people would tell you “Oh, deep learning *just* uses a lot of data, right?” This sounds about right; deep learning these days does use a lot of data. So what’s wrong with the statement?

Here is the answer: while deep learning does use a lot of data, *before deep learning, other techniques used tons of data too!* For example, speech recognition before deep learning, i.e. HMM+GMM, could use up to 10k hours of speech. Same for SMT. And you can do SVM+HOG on ImageNet. More data is always better for those techniques as well. So if you say “deep learning uses more data”, you forget that the older techniques can also use more data.

What you can claim is that *“deep learning is a more effective way to utilize data”.* That’s very true, because once you get into either GMM or SVM, you run into scalability issues. GMM scales badly when the amount of data is around 10k hours. SVM (with the RBF kernel in particular) is super tough/slow to use when you have ~1 million data points.
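A back-of-the-envelope calculation shows why kernel methods get painful at that scale. The sketch below assumes a dense kernel (Gram) matrix of 64-bit floats is held in memory; that is an assumption for illustration only, since practical SVM solvers typically cache parts of the matrix rather than store all of it. The function name is hypothetical.

```python
def gram_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory needed for a dense n x n kernel matrix (float64 by default)."""
    return n_points * n_points * bytes_per_entry


for n in (10_000, 100_000, 1_000_000):
    terabytes = gram_matrix_bytes(n) / 1e12
    print(f"{n:>9,} points -> {terabytes:.4f} TB")
```

At 1 million points the full matrix would need 8 × 10^12 bytes, i.e. about 8 TB. The exact number matters less than the quadratic growth: 10 times more data means 100 times more kernel entries.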

# Misleading Claim About Deep Learning (II) – “Deep Learning is About Using GPU and Having Data Center!”

This particular claim is different from the previous “Data Requirement” claim, but we can debunk it in a similar manner. The reason why it is wrong? Again, *before deep learning, people were already using GPUs to do machine learning.* For example, you can use a GPU to speed up GMM. Before deep learning was hot, you needed a cluster of machines to train an acoustic model or a language model for speech recognition. You also needed tons of RAM to train a language model for SMT. So calling GPU/Data Center/RAM/ASIC/FPGA a differentiator of deep learning is just misleading.

You can say though: *“Deep learning has changed the computational model from a distributed network model to a more single-machine-centric paradigm (in which each machine has one GPU). But later approaches have also tried to combine CPU and GPU processing.”*

# Conclusion and “What you say is Just Your Opinion! My Theory makes Equal Sense!”

Indeed, you should always take what you read online with a grain of salt. Being critical is a good thing, and having your own opinion is good. But you should also try to avoid *equivocating* on an issue. Meaning: sometimes things have only one side, but you insist there are two equally valid answers. If you do so, you are perhaps making a logical error in your thinking. And a lot of people who make claims such as “deep learning is learning which uses more data and a lot of GPUs” are probably making such thinking errors.

That said, I suggest you read several good sources to judge my answer:

- Chapter 1 of Deep Learning.
- Shakir’s Machine Learning Blog on a Statistical View of Deep Learning. In particular, part VI, “What is Deep?”
- Tombone’s post on Deep Learning vs Machine Learning vs Pattern Recognition

In any case, I hope that this article helps you. I thank Bob for asking the question. Armaghan Rumi Naik debunked many misconceptions in the original thread; his understanding of machine learning is clearly above mine, and he was able to point out mistakes from other commenters. The thread is worth your reading time.

# Footnotes

[1] See “Last Words: Computational Linguistics and Deep Learning”

[2] Generally, whether DL is useful in NLP is a widely disputed topic. Take a look at Yoav Goldberg’s view on some recent GAN results on language generation. AIDL Weekly #18 also gave an exposé on the issue.

[3] Perhaps another useful term is “hierarchical”. In the case of ConvNet the term is right on. As Eric Heitzman comments at AIDL:

“(deep structure) They are *not* necessarily recursive, but they *are* necessarily hierarchical since layers always form a hierarchical structure.” After Eric’s comment, I think both “deep” and “hierarchical” are fair terms to describe methods in “deep learning”. (Of course, “hierarchical learning” is a much poorer marketing term.)

[4] In an earlier draft, I used the term “recursive” to describe “deep”, which, as Eric Heitzman at AIDL pointed out, is not entirely appropriate. “Recursive” gives people the feeling that the function is self-recursive, i.e. $f(f(\ldots f(f(\cdot))))$, but the actual functions are more “nested”, like $f_1(f_2(\ldots f_{n-1}(f_n(\cdot))))$. As a result, I removed the term “recursive” and just call the function a “nested function”.

Of course, you should be aware that my description is not very mathematically rigorous either. (I guess it is a fair wordy description though.)

History:

20170709 at 6: fix some typos.

20170711: fix more typos.

20170711 at 7:05 p.m.: I got feedback from Eric Heitzman, who points out that the term “recursive” can be deceiving. Thus I wrote footnote [4].

If you like this message, subscribe to the Grand Janitor Blog’s RSS feed. You can also find me (Arthur) at twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

# Introduction

Let me preface this article: after I wrote my top five list on deep learning resources, one oft-asked question has been “What are the Math prerequisites for learning deep learning?” My first answer is Calculus and Linear Algebra, but then I qualify that certain techniques of Calculus and Linear Algebra are more useful. For example, you should already know gradients, differentiation, partial differentiation and Lagrange multipliers; you should know matrix differentiation and preferably the trace trick, eigen-decomposition and such. If your goal is to understand machine learning in general, then having good skills in integration and knowledge of analysis helps. For example, the 1-2 star problems of Chapter 2 of PRML [1] require some knowledge of special functions such as the gamma and beta functions. Having some Math would help you go through these questions more easily.

Nevertheless, I find that people who want to learn Math first before approaching deep learning miss the point. Many engineering topics were not motivated by pure mathematical pursuit. More often than not, an engineering field is motivated by a physical observation. Mathematics is more like an aid to imagine and create a new solution. Take the case of deep learning. If you listen to Hinton, he often says he first tries to come up with an idea and makes it work mathematically later. His insistence on working on neural networks in the era of kernel methods stems more from his observation of the brain. “If the brain can do it, how come we can’t?” should be a question you ask every day when you run a deep learning algorithm. I think *these observations are fundamental* to deep learning. And you should go through the arguments for why people think neural networks are worthwhile in the first place. Reading the classic papers from Wiesel and Hubel helps. Understanding the history of neural networks helps. Once you read these materials, you will quickly grasp the big picture of much of the development of deep learning.

That said, I think there are certain topics which are fundamental in deep learning. They are not necessarily very mathematical. For example, I will name back propagation [2] as a very fundamental concept which you want to get good at. Now, you may think that’s silly. “I know backprop already!” Yes, backprop is probably in every single machine learning class. It can easily give you the illusion that you have mastered the material. But you can always learn more about a fundamental concept. And back propagation is important theoretically and practically. You will encounter back propagation whether you are a user of deep learning tools, a writer of a deep learning framework or an innovator of new algorithms. So a thorough understanding of backprop is very important, and one course is not enough.

This very long digression finally brings me to a great introductory book, Michael Nielsen’s *Neural Networks and Deep Learning* (NNDL). The reason why I think Nielsen’s book is important is that it offers an alternative discussion of back propagation as an algorithm. So I will use the rest of the article to explain why I appreciate the book so much and recommend that nearly all beginning or intermediate learners of deep learning read it.

# First Impression

I first learned about “Neural Networks and Deep Learning” (NNDL) while going through Tensorflow’s tutorial. My first thought was “ah, another blogger trying to cover neural networks”, i.e. I didn’t think it was promising. At that time, there were already plenty of articles about deep learning. Unfortunately, they often repeated the same topics without bringing anything new.

# Synopsis

Don’t make my mistake! NNDL is a great introductory book which balances the theory and practice of deep neural networks. The book has 6 chapters:

- Using neural nets to recognize digits – the basics of neural networks, a basic implementation using python (network.py)
- How the backpropagation algorithm works – various explanations of back propagation
- Improving the way neural networks learn – standard improvements to simple back propagation, another implementation in python (network2.py)
- A visual proof that neural nets can compute any function – the universal approximation theorem without the Math, plus fun games in which you approximate functions yourself
- Why are deep neural networks hard to train? – practical difficulties of using back propagation, vanishing gradients
- Deep Learning – convolutional neural networks (CNN), the final implementation based on Theano (network3.py), recent advances in deep learning (circa 2015).

The accompanying python scripts are the gems of the book. network.py and network2.py can run in plain-old python. You need Theano for network3.py, but I think the strength of the book really lies in network.py and network2.py (Chapters 1 to 3), because if you want to learn CNN, Karpathy’s lectures probably give you more bang for your buck.
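The vanishing gradient problem from Chapter 5 can be illustrated with simple arithmetic. This is a sketch under simplifying assumptions: a chain of single sigmoid neurons, one per layer, with every weight at magnitude 1. In such a chain the gradient reaching an early layer contains one factor of weight times sigmoid slope per layer it passes through, and the sigmoid slope is never larger than 0.25. The function names are hypothetical.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


# The sigmoid is steepest at z = 0, where its slope is 0.25.
MAX_SLOPE = sigmoid_prime(0.0)


def gradient_scale_bound(n_layers, weight=1.0):
    """Upper bound on the factor a gradient picks up while passing back
    through n_layers single-neuron sigmoid layers with the given weight."""
    return (abs(weight) * MAX_SLOPE) ** n_layers


for n in (1, 5, 10, 20):
    print(f"{n:>2} layers: gradient scaled by at most {gradient_scale_bound(n):.3e}")
```

Even in this best case the bound shrinks geometrically, so after 10 layers the early-layer gradient is already below one millionth of the output-layer signal.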

# Why I like Nielsen’s Treatment of Back Propagation?

Reading Nielsen’s exposition of neural networks was the sixth time I learned the basic formulation of back propagation [see footnote 3]. So what’s the difference between his treatment and my other reads?

Forget about my first two reads, because I didn’t care enough about neural networks to know why back propagation is so named. But my later reads pretty much gave me the same impression of neural networks: “a neural network is merely a stack of logistic functions. So how do you train the system? Oh, just differentiate the loss function, the rest is technicalities.” Usually the books guide you to verify certain formulae in the text. Of course, you will be guided to deduce that the “error” is actually “propagating backward” through the network. Let us call this the *network-level* view. In a *network-level* view, you really don’t care about how individual neurons operate. All you care about is seeing a neural network as yet another machine learning algorithm.

The problem with the network-level view is that it doesn’t quite explain a lot of phenomena about back propagation. *Why is it so slow sometimes? Why do certain initialization schemes matter?* Nielsen does an incredibly good job of breaking down the standard equations into 4 fundamental equations (BP1 to BP4 in Chapter 2). Once you interpret them, you will realize “Oh, saturation is really a big problem in back propagation” and “Oh, of course you have to initialize the weights of a neural network with non-zero values. Or else nothing propagates forward or backward!” These insights, while not mathematical in nature and understandable with college calculus, amount to a deeper understanding of back propagation.

Another valuable part of Nielsen’s explanation is that it comes with an accessible implementation. His first implementation (network.py) is 74 lines of idiomatic python. By adding print statements to his code, you will quickly grasp how a lot of these daunting equations are implemented in practice. For example, as an exercise, you can try to identify how he implements BP1 to BP4 in network.py. It’s true that there are other books and implementations about neural networks, but the description and the implementation don’t always come together. Nielsen’s presentation is a rare exception.
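For reference, here are the four equations as I would transcribe them; check Chapter 2 of the book for the authoritative form. The notation is: $z^l = w^l a^{l-1} + b^l$ is the weighted input of layer $l$, $a^l = \sigma(z^l)$ is its activation, $C$ is the cost, $L$ is the output layer, $\odot$ is the element-wise product, and $\delta^l \equiv \partial C / \partial z^l$ is the “error” of layer $l$.

$$
\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) && \text{(BP1)} \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) && \text{(BP2)} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j && \text{(BP3)} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j && \text{(BP4)}
\end{aligned}
$$

Both BP1 and BP2 carry a factor $\sigma'(z)$, which is close to zero when a sigmoid neuron is saturated, so a saturated neuron learns slowly. And in BP2, if $w^{l+1}$ is all zeros then $\delta^l$ is zero, so no error reaches the earlier layers.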
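If you want a warm-up before reading network.py, below is a minimal sketch of how BP1 to BP4 can map onto code. To be clear, this is not Nielsen’s code: network.py uses NumPy, while this sketch uses plain python lists, assumes the quadratic cost $\frac{1}{2}\|a^L - y\|^2$, and handles one training example. The function `backprop` and the toy 2-3-1 network are my own illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop(weights, biases, x, y):
    """Gradients of the quadratic cost 0.5 * ||a^L - y||^2 for one example.

    weights[l][j][k] connects neuron k of one layer to neuron j of the next.
    Returns (grad_w, grad_b) with the same shapes as (weights, biases).
    """
    # Forward pass: keep every weighted input z and every activation a.
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])

    # BP1: error of the output layer.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]

    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = list(delta)                                  # BP3
        grad_w[l] = [[dj * ak for ak in activations[l]]         # BP4
                     for dj in delta]
        if l > 0:
            # BP2: pull the error back through the transposed weights.
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_w, grad_b


# Hypothetical 2-3-1 network with every weight and bias set to zero.
zero_w = [[[0.0, 0.0] for _ in range(3)], [[0.0, 0.0, 0.0]]]
zero_b = [[0.0, 0.0, 0.0], [0.0]]
gw, gb = backprop(zero_w, zero_b, x=[1.0, 2.0], y=[1.0])
print("first-layer weight gradients:", gw[0])
print("output-layer weight gradients:", gw[1])
```

With all-zero weights, BP2 multiplies the output error by zero, so every first-layer gradient comes out as zero and only the output layer receives a learning signal. That is the “nothing back propagates” effect in a form you can print and inspect.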

# Other Small Things I Like

- Nielsen correctly points out that the Del symbol in machine learning is more of a convenient device than the Del operator in its usual mathematical sense.
- In Chapter 4, Nielsen discusses the universal approximation property of neural networks. Unlike standard textbooks, which point you to a bunch of papers with daunting math, Nielsen created a javascript demo which allows you to approximate functions yourself (!), which I think is a great way to learn the intuition behind the theorem.
- He points out that it’s important to differentiate the *activation* and the *weighted input*. In fact, this point is one thing which can confuse you when reading a derivation of back propagation, because textbooks usually use different symbols for activation and weighted input.

There are many such insightful comments in the book. I encourage you to read and discover them.
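The Chapter 4 idea mentioned above can also be tried outside the browser. As I understand the chapter, a sigmoid neuron with a very large weight behaves like a step, and the difference of two such steps forms a “bump”; enough bumps side by side approximate a function. The sketch below follows that idea in plain python. The target function, the number of bumps, the steepness and all the names are my own choices for illustration, not anything taken from the book.

```python
import math


def sigmoid(z):
    # Written in two branches so that very steep inputs do not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bump(x, left, right, height, steepness=1000.0):
    """Difference of two steep sigmoids: about `height` inside (left, right)."""
    return height * (sigmoid(steepness * (x - left))
                     - sigmoid(steepness * (x - right)))


def approximate(f, n_bumps):
    """Piecewise-constant approximation of f on [0, 1] built from bumps."""
    edges = [i / n_bumps for i in range(n_bumps + 1)]
    heights = [f((a + b) / 2) for a, b in zip(edges, edges[1:])]

    def g(x):
        return sum(bump(x, a, b, h)
                   for a, b, h in zip(edges, edges[1:], heights))
    return g


def target(x):
    # Hypothetical target function.
    return x * x


g = approximate(target, n_bumps=50)
for x in (0.11, 0.51, 0.91):
    print(f"x={x:.2f}  target={target(x):.4f}  approximation={g(x):.4f}")
```

Each bump costs two hidden neurons, so this construction says nothing about efficiency. It only makes the existence claim of the theorem feel plausible.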

# Things I don’t like

- There are many exercises in the book. Unfortunately, there are no answer keys. In a way, this makes Nielsen more of an old-style author who encourages readers to think. I guess this is something I don’t always like, because spending forever on one single problem doesn’t always give you better understanding.
- Chapter 6 gives the final implementation in Theano. Unfortunately, there is not much introductory material on Theano within the book. I think this is annoying but forgivable; as Nielsen pointed out, it’s hard to introduce Theano in an introductory book. I would think anyone interested in Theano should probably go through the standard Theano tutorials here and here.

# Conclusion

All in all, I highly recommend *Neural Networks and Deep Learning* to any beginning and intermediate learners of deep learning. If this is the first time you are learning back propagation, NNDL is a great general introductory book. If you are like me and already know a thing or two about neural networks, NNDL still has a lot to offer.

Arthur

[1] In my view, PRML’s problem sets have 3 ratings: 1-star, 2-star and 3-star. A 1-star problem usually requires college-level Calculus and patient manipulation; a 2-star problem requires some creative thought in problem solving or knowledge beyond basic Calculus. 3-star problems are more long-form questions and can contain multiple 2-star questions in one. For your reference, I solved around 100 out of the 412 questions. Most of them are 1-star questions.

[2] The other important concept in my mind is gradient descent, and it is still an active research topic.

[3] The 5 reads before: I “learnt” it once back at HKUST, read it in Mitchell’s book, read it in Duda and Hart, learnt it again from Ng’s lectures, and read it again in PRML. My 7th was to learn from Karpathy’s lectures; he presents the material in yet another way. So it’s worth your time to look at them.


Since I decided to revamp The Grand Janitor’s Blog last December, it has been 100 posts. (I cheat a bit, so “not since then”.)

It’s funny to describe time with the number of articles you write. In blogging though, that makes complete sense.

I have started several blogs in the past. Only 2 of them survive (Cumulomanic and “Start-Up Employees 333 weeks”, both in Chinese). When you cannot maintain your blog for more than 50 posts, your blog just dies, or simply disappears into oblivion.

Yet I made it. So here’s an important question to ask: what keeps me going?

I believe the answer is very simple. There are no bloggers so far who work in the niche of speech recognition: none on automatic speech recognition (ASR) systems, even though there has been much progress. None on engines, even though much work has been done in open source. None on applications, even though great projects such as Simon are out there.

Nor has there been discussion on how open source speech recognition can be applied in the commercial world, even though dozens of companies are now based on Sphinx (e.g. my employer Voci, EnglishCentral and Nexiwave), and they are filling the startup space.

How about how the latest technologies such as deep neural networks (DNN) and weighted finite state transducers (WFST) would affect us? I can see them in academic conferences, journals or sometimes tradeshows…… but not in a blog.

But blogging, as we all know, is probably the most prominent way people get news these days.

And news about speech recognition, once you understand it, is *fascinating.*

The only blog which comes close is Nicholay’s blog: nsh. When I was trying to recover as a speech recognition programmer, nsh was a great help. So thank you, Nick, thank you.

But there is only one nsh. There are still a lot of fascinating things to talk about…… Right?

So probably the reason why I keep on working is this: *I want to invent something I want*: a kind of information hub on speech recognition technology, commercial/open source, applications/engines, theory/implementations, the ideals/the realities.

I want to bring my unique perspective: I was in academia, in industrial research and now in the startup world so I know quite well people’s mindsets in each group.

I also want to connect with all of you. We are working on one of the most exciting technologies in the world. Not everyone understands that. It will take time for all of us to explain to our friends and families what speech recognition can really do and why it matters.

In any case, I hope you enjoy this blog. Feel free to connect with me on Plus, LinkedIn and Twitter.

Arthur

Speech Recognition Stumbles at Leeds Hospital

I wonder who the vendor is.

Interesting showcase again. Google always has pretty impressive speech technology.

Where Siri Has Trouble Hearing, a Crowd of Humans Could Help

Combining fragments of recognition is a rather interesting idea, though it’s probably not new. I am glad it is taking off though.

Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

This is huge. Once again, it says something about the power of the DNN approach. It is probably the real focus for the next 5 years.

Duolingo Adds Offline Mode And Speech Recognition To Its Mobile App

I always wonder how the algorithm works. Confidence-based verification algorithms have always been tough to get to work. But then again, the whole deal of reCAPTCHA is really to try to differentiate between humans and machines. So it’s probably not as complicated as I thought.

Some notes on DNS 12: link

The whole sentence mode is the more interesting part. Does it make users more frustrated though? I am curious.

Arthur