# What is the Difference between Deep Learning and Machine Learning?

AIDL member Bob Akili asked (rephrased):

What is the Difference between Deep Learning and Machine Learning?

Usually I don't write a full blog post to answer a member's question. But what is "deep" is such a fundamental concept in deep learning, and there are many well-meaning but incorrect answers floating around. So I think it is a great idea to answer the question clearly and hopefully clear up some of the misconceptions as well. Here is a cleaned-up and expanded version of my comment on the thread.

# Deep Learning is Just a Subset of Machine Learning

First of all, as you might have read on the internet, deep learning is just a subset of machine learning. Many "Deep Learning Consultant" types would tell you that deep learning is completely different from machine learning. But when we talk about "deep learning" these days, we are really talking about "neural networks which have more than one layer". Since neural networks are just one type of ML technique, it doesn't make any sense to call DL "different" from ML. It might work for marketing purposes, but the idea is clearly misleading.

# Deep Learning is a kind of Representation Learning

So now we know that deep learning is a kind of machine learning. We still can't quite answer why it is special. So let's be more specific: deep learning is a kind of representation learning. What is representation learning? Representation learning is the opposite of another school of thought/practice: feature engineering. In feature engineering, humans are supposed to hand-craft features to make the machine work better. If you have done Kaggle before, this should be obvious to you: sometimes you just want to manipulate the raw inputs and create new features to represent your data.

Yet in some domains which involve high-dimensional data, such as images, speech or text, hand-crafting features was found to be very difficult. e.g. Developing a HOG-type approach to computer vision could take 4-5 years of a PhD student's time. So here we come back to representation learning: can a computer automatically learn good features?

# What is a "Deep" Technique?

Now we come to why deep learning is "deep": usually we call a method "deep" when it optimizes a nested function. For example, if you express such a function as a graph, you will find that it has multiple layers. The term "deep" really describes this "nestedness". That should explain why we typically call any artificial neural network (ANN) with more than 1 hidden layer "deep". Or, as the general saying goes, "deep learning is just a neural network which has more layers".
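To make the "nestedness" concrete, here is a minimal sketch in Python. The three scalar "layers" and their weights are made up purely for illustration; a real network would use vectors and matrices, and its weights would be learned rather than fixed.

```python
import math
from functools import reduce

def compose(*layers):
    """Return the nested function layers[0](layers[1](...layers[-1](x)))."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(layers), x)

def make_layer(w, b):
    """A hypothetical scalar 'layer': an affine map followed by tanh."""
    return lambda x: math.tanh(w * x + b)

# Illustrative weights only, not taken from any trained model.
f1 = make_layer(0.5, 0.1)
f2 = make_layer(-1.2, 0.3)
f3 = make_layer(2.0, -0.4)

deep = compose(f1, f2, f3)   # f1(f2(f3(x))): three levels of nesting
shallow = f1                 # a single level, nothing nested

x = 0.7
print("deep:", deep(x), "shallow:", shallow(x))
```

Drawn as a graph, `deep` has three layers stacked on top of each other, while `shallow` has only one. That stacking is all the word "deep" refers to here.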

(Another appropriate term is "hierarchical". See footnotes [3] and [4] for more detail.)

This is also the moment when Karpathy in cs231n shows you a multi-layer CNN in which features are automatically learned, from the simplest to the more complex. Eventually your last layer can just separate the classes using a linear classifier, because there is a "deep" structure that learns the right features for it. Note that the key term here is "automatic": all these Gabor-filter-like features are not hand-made. Rather, they are the result of back-propagation [3].

# Is there Anything which is "Deep" but not a Neural Network?

Actually, there are plenty: deep Boltzmann machines, deep belief networks, deep Gaussian processes. They are still discussed under unsupervised learning with neural networks, but I have always found that knowledge of graphical models is more important for understanding them.

# So is Deep Learning also a Marketing Term?

Yes and no. It depends on who you talk to. If you talk with ANN researchers/practitioners, they would just tell you "deep learning is just a neural network which has more than 1 hidden layer". Indeed, if you think from their perspective, the term "deep learning" could just be a short form. Yet as we just said, you can also call other methods "deep". So the adjective is not totally void of meaning. But many people would also tell you that because "deep learning" has become such a marketing term, it can now mean many different things. I will say more in the next section.

Also the term "deep learning" has been around for decades. Check out Prof. Schmidhuber's thread for more details.

# "No Way! X is not Deep but it is also taught in Deep Learning Class, You made a Horrible Mistake!"

I said it with much authority and I know some of you guys would just jump in and argue:

"What about word2vec? It is nothing deep at all, but people still call it Deep learning!!!"  "What about all wide architectures such as "wide-deep learning"?" "Arthur, You are Making a HORRIBLE MISTAKE!"

Indeed, the term "deep learning" is being abused these days. More learned people, on the other hand, are usually careful about calling certain techniques "deep learning". For example, in the cs224d 2015/2016 lectures, Dr. Richard Socher was quite cautious about calling word2vec "deep". His supervisor, Prof. Chris Manning, who is an authority in NLP, is known to dispute whether deep learning is always useful in NLP, simply because some recent advances in NLP are not really due to deep learning [1][2].

I think these cautions make sense. Part of it is that calling everything "deep learning" just blurs what really should be credited for a certain technical improvement. The other part is that we shouldn't see deep learning as the only type of ML we want to study. There are many ML techniques, and some of them are more interesting and practical than deep learning. For example, deep learning is not known to work well in small-data scenarios. Would I just yell at my boss and say "Because I can't use deep learning, I can't solve this problem!"? No, I would just test out random forests, support vector machines, GMMs and all the nifty methods I have learned over the years.

# Misleading Claim About Deep Learning (I) - "Deep Learning is about Machine Learning Methods which use a lot of Data!"

So now we come to the arena of misconceptions. I am going to discuss two claims which many people have been drumming up about deep learning. Neither of them is the right answer to the question "What is the Difference between Deep Learning and Machine Learning?"

The first one you have probably heard all the time: "Deep Learning is about ML methods which use a lot of data". Or people would tell you "Oh, deep learning just uses a lot of data, right?" This sounds about right; deep learning these days does use a lot of data. So what's wrong with the statement?

Here is the answer: while deep learning does use a lot of data, before deep learning other techniques used tons of data too! e.g. Speech recognition before deep learning, i.e. HMM+GMM, could use up to 10k hours of speech. Same for SMT. And you can do SVM+HOG on ImageNet. More data is always better for those techniques as well. So if you say "deep learning uses more data", you forget that the older techniques can also use more data.

What you can claim is that "deep learning is a more effective way to utilize data". That's very true, because both GMM and SVM run into scalability issues. GMM scales badly when the amount of data is around 10k hours. SVM (with an RBF kernel in particular) is super tough/slow to use when you have ~1 million data points.
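One rough way to see the SVM part of this is simple arithmetic: a kernel method works with pairwise kernel values, and a dense kernel (Gram) matrix over n points has n x n entries. The sketch below assumes 8-byte floats and a fully materialized matrix. Practical SVM solvers avoid storing the whole matrix through caching and decomposition, so treat this as an illustration of the quadratic growth rather than a description of any particular solver. The function name is my own.

```python
def gram_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory needed to store a dense n x n kernel (Gram) matrix."""
    return n_points * n_points * bytes_per_entry

for n in (10_000, 100_000, 1_000_000):
    gib = gram_matrix_bytes(n) / 2**30
    print(f"{n:>9,} points -> {gib:,.1f} GiB")
```

At one million points the dense matrix alone would take about 8 x 10^12 bytes, far more than the RAM of a typical machine, which is one reason kernel SVMs get painful at that scale.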

# Misleading Claim About Deep Learning II - "Deep Learning is About Using GPU and Having Data Center!"

This particular claim is different from the previous "Data Requirement" claim, but we can debunk it in a similar manner. Why is it wrong? Again, before deep learning, people already used GPUs to do machine learning. For example, you can use a GPU to speed up GMM. Before deep learning was hot, you needed a cluster of machines to train an acoustic model or language model for speech recognition. You also needed tons of RAM to train a language model for SMT. So calling GPU/Data Center/RAM/ASIC/FPGA a differentiator of deep learning is just misleading.

You can say, though, "Deep learning has changed the computational model from a distributed network model to a more single-machine-centric paradigm (in which each machine has one GPU). But later approaches also tried to combine CPU and GPU processing together".

# Conclusion and "What you say is Just Your Opinion! My Theory makes Equal Sense!"

Indeed, you should always take what you read on-line with a grain of salt. Being critical is a good thing, and having your own opinion is good. But you should also try to avoid equivocating on an issue. Meaning: sometimes things have only one side, but you insist there are two equally valid answers. If you do so, you are perhaps making a logical error in your thinking. And a lot of people who make claims such as "deep learning is learning which uses more data and a lot of GPUs" are probably making such thinking errors.

That said, I suggest you read several good sources to judge my answer:

1. Chapter 1 of Deep Learning.
2. Shakir's Machine Learning Blog on a Statistical View of Deep Learning.  In particular, part VI, "What is Deep?"
3. Tombone's post on Deep Learning vs Machine Learning vs Pattern Recognition

In any case, I hope that this article helps you. I thank Bob for asking the question. Armaghan Rumi Naik debunked many misconceptions in the original thread: his understanding of machine learning is clearly above mine, and he was able to point out mistakes from other commenters. The thread is worth your reading time.

# Footnotes

[1] See "Last Words: Computational Linguistics and Deep Learning"
[2] Generally, whether DL is useful in NLP is a widely disputed topic. Take a look at Yoav Goldberg's view on some recent GAN results on language generation. AIDL Weekly #18 also gave an exposé on the issue.
[3] Perhaps another useful term is "hierarchical". In the case of ConvNet the term is right on. As Eric Heitzman comments at AIDL: "(deep structure) They are *not* necessarily recursive, but they *are* necessarily hierarchical since layers always form a hierarchical structure." After Eric's comment, I think both "deep" and "hierarchical" are fair terms to describe methods in "deep learning". (Of course, "hierarchical learning" is a much poorer marketing term.)
[4] In an earlier draft, I used the term "recursive" to describe the term "deep", which, as Eric Heitzman at AIDL pointed out, is not entirely appropriate. "Recursive" gives people the feeling that the function is self-recursive, or $f(f( \ldots f(f(*))))$, but the actual functions are more "nested", like $f_1(f_2( \ldots f_{n-1}(f_n(*))))$. As a result, I removed the term "recursive" and just call the function a "nested function".
Of course, you should be aware that my description is not too mathematically rigorous either. (I guess it is a fair wordy description though.)

History:
20170709 at 6: fix some typos.

20170711: fix more typos.

20170711 at 7:05 p.m.: I got feedback from Eric Heitzman, who pointed out that the term "recursive" can be deceiving. Thus I wrote footnote [4].

If you like this message, subscribe to the Grand Janitor Blog's RSS feed. You can also find me (Arthur) at twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

# Learning Deep Learning: The "Basic Five" - Five Beginner Classes on Deep Learning

I have been self-learning deep learning for a while: informally since 2013, when I first read Hinton's "Deep Neural Networks for Acoustic Modeling in Speech Recognition" and worked through Theano, and more "formally" through various classes since the summer of 2015, when I was freshly promoted to Principal Speech Architect [5]. It's not an exaggeration to say that deep learning changed my life and career. I have been more active than in my previous life. e.g. If you are reading this, you were probably directed here from the very popular Facebook group, AIDL, which I admin.

This article was written at the time I finished watching an older version of Richard Socher's cs224d on-line [1]. That, together with Ng's, Hinton's, Li and Karpathy's and Silver's, makes up the 5 classes I recommended in my now widely-circulated "Learning Deep Learning - My Top-Five List". I think it's fair to give this set of classes a name: the Basic Five. Because IMO, they are the first five classes you should go through when you start learning deep learning.

In this post I will say a few words on why I chose these five classes as the Five. Compared to more established bloggers such as Karpathy, Olah or Denny Britz, I am more of a learner in the space [2], experienced perhaps, yet still a learner. So this article, like my others, stresses learning. What can you learn from these classes? Less talked about, but as important: what are the limitations of learning on-line? As a learner, I think these are interesting discussions, so here you go.

# What are the Five?

Just to be clear, here are the classes I'd recommend:

And the ranking is the same as I wrote in the Top-Five List. Out of the five, four have official video playlists published on-line for free[6]. With a small fee, you can finish Ng's and Hinton's classes with certification.

# How much I actually Went Through the Basic Five

Many beginner articles come with a gigantic set of links. The authors usually expect you to click through all of them (and learn through them?). When you scrutinize the list, it could amount to more than 100 hours of video watching, and perhaps up to 200 hours of work. I don't know about you, but I doubt whether the authors really went through the list themselves.

So it's fair for me to first tell you what I've actually done with the Basic Five as of the first writing (May 13, 2017).

| Courses | My Progress |
| --- | --- |
| Ng's "Machine Learning" | Finished the class in entirety without certification. |
| Li and Karpathy's "Convolutional Neural Networks for Visual Recognition" or cs231n | Listened through the class lectures about ~1.5 times. Haven't done any of the homework. |
| Socher's "Deep Learning for Natural Language Processing" or cs224d | Listened through the class lecture once. Haven't done any of the homework. |
| Silver's "Reinforcement Learning" | Listened through the class lecture 1.5 times. Only worked out few starter problems from Denny Britz's companion exercises. |
| Hinton's "Neural Network for Machine Learning" | Finished the class in entirety with certification. Listen through the class for ~2.5 times. |

This table is likely to be updated as I go deeper into a certain class, but it should tell you the limitations of my reviews. For example, while I have watched all the class videos, only for Ng's and Hinton's classes have I finished the homework. That means my understanding of two of the three "Stanford Trinities"[3] is weaker, and my understanding of reinforcement learning is not solid either. Together with my work at Voci, Hinton's class gives me stronger insight than the average commenter on topics such as unsupervised learning.

# Why The Basic Five? And Three Millennial Machine Learning Problems

Taking classes is for learning, of course. The five classes certainly give you the basics, if you love to learn the fundamentals of deep learning. (Also take a look at footnote [7].) The five are not the only classes I have sat through in the last 1.5 years, so the choice is not arbitrary. So oh yeah, those are the things you want to learn. Got it? That's my criterion. 🙂

But that's what a thousand other bloggers would tell you as well. I want to give you a more interesting reason. Here you go:

Go back in time to the year 2000. That was around the time Google had just launched its search engine; there was no series of Google products and surely no ImageNet. What were the most difficult problems for machine learning? I think you would see three of them:

1. Object classification,
2. Statistical machine translation,
3. Speech recognition.

So what's so special about these three problems then?  If you think about that, back in 2000, all three were known to be hard problems.  They represent three seemingly different data structures -

1. Object classification - 2-dimensional, dense array of data
2. Statistical machine translation (SMT) - discrete symbols, seemingly related by loose rules that humans call grammars and translation rules
3. Automatic speech recognition (ASR) - 1-dimensional time series, has similarity to object classification (through the spectrogram), and is loosely bound by rules such as a dictionary and word grammar.

And you would recall that all three problems attract interest from the government, big institutions such as the Big Four, and startup companies. If you master one of them, you can make a living. Moreover, once you learn them well, you can transfer the knowledge to other problems. For example, handwritten character recognition (HWR) resembles ASR, and conversational agents work similarly to SMT. That is because the three problems are great metaphors for many other machine learning problems.

Now, okay, let me tell you one more thing: even now, there are people still making (or trying to make) a living by solving these three problems. I never said they are solved. e.g. What if we increase the number of classes from 1000 to 5000? What if, instead of Switchboard, we work on conference speech or speech from Youtube? What if I ask you to translate so well that even a human cannot tell the difference? That should convince you: "Ah, if there is one method that could solve all three problems, learning that method would be a great idea!"

And as you can guess, deep learning is the one method that revolutionized all three fields[4]. That's why you want to take the Basic Five. The Basic Five is not meant to make you a top researcher in the field of deep learning; rather, it teaches you just the basics. And at this point of your learning, knowing a powerful template for solving problems is important. You will also find that going through the Basic Five makes you able to understand the majority of deep learning problems these days.

So here's why I chose the Five. Ng's class and NNML are the essential basics of deep learning. Li and Karpathy's class teaches you object classification up to the state of the art. Socher's class teaches you where deep learning stands in NLP; it forays into SMT and ASR only a little bit, but you will have enough to start.

My explanation so far excludes Silver's reinforcement learning class. That admittedly is the odd one out. I add Silver's class because RL is increasingly used even in traditionally supervised learning tasks. And of course, to know the place of RL, you need a solid understanding. Silver's class is perfect for the purpose.

# What You Actually Learn

In a way, the choice also reflects what's really important when learning deep learning. So I will list 8 points here, because they are repeated across different courses.

1. Basics of machine learning: this is mostly from Ng's class. But themes such as bias-variance are repeated in NNML and Silver's class.
2. Gradient descent: its variants (e.g. ADAM) and its alternatives (e.g. second-order methods). It's a never-ending study.
3. Backpropagation: how to view it? As optimizing a function, as a computational graph, as a flow of gradients. Different classes give you different points of view. And don't skip them even if you have learned it once.
4. Architecture: The big three families are DNN, CNN and RNN. Why some of them emerge and re-emerge in history. The details of how they are trained and structured. None of the courses teaches you everything, but going through the five will teach you enough to survive.
5. Image-specific techniques: not just classification, but localization/detection/segmentation (as in cs231n 2016 L8, L13). Not just convolution, but "deconvolution" and why we don't like calling it "deconvolution". 🙂
6. NLP-specific techniques: word2vec, GloVe, and how they are applied in NLP problems such as sentiment classification.
7. (Advanced) Basics of unsupervised learning: mainly from Hinton's class, and mainly about techniques from 5 years ago such as RBM, DBN, DBM and autoencoders, but they are the basics if you want to learn more advanced ideas such as GAN.
8. (Advanced) Basics of reinforcement learning: mainly from Silver's class, from the DP-based model to Monte-Carlo and TD.
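To illustrate points 2 and 3 together, here is a small sketch of backpropagation viewed as gradients flowing backward through a computational graph, checked against a numerical gradient and then used for plain gradient descent. The network (one scalar input, one tanh hidden unit, one linear output, squared-error loss), the parameter values and the learning rate are all my own illustrative choices; they are not taken from any of the five classes.

```python
import math

def forward(params, x, t):
    """Forward pass: h = tanh(w1*x + b1), y = w2*h + b2, loss = 0.5*(y - t)^2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2, (h, y)

def backward(params, x, t):
    """Backpropagation: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y) = forward(params, x, t)
    dy = y - t                   # dL/dy
    dw2, db2 = dy * h, dy        # gradients of the output layer
    dh = dy * w2                 # gradient flowing into the hidden unit
    dz = dh * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dw1, db1 = dz * x, dz        # gradients of the hidden layer
    return [dw1, db1, dw2, db2]

def numerical_grad(params, x, t, eps=1e-6):
    """Central-difference estimate of the same gradients, used as a sanity check."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, t)[0] - forward(down, x, t)[0]) / (2 * eps))
    return grads

params = [0.3, -0.1, 0.8, 0.2]   # illustrative initial weights
x, t = 1.5, 0.5                  # one made-up training example

analytic = backward(params, x, t)
numeric = numerical_grad(params, x, t)
grad_check_ok = all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric))

# Plain gradient descent on the single example.
lr = 0.1
loss_before = forward(params, x, t)[0]
for _ in range(100):
    grads = backward(params, x, t)
    params = [p - lr * g for p, g in zip(params, grads)]
loss_after = forward(params, x, t)[0]

print("gradient check passed:", grad_check_ok)
print("loss before:", loss_before, "loss after:", loss_after)
```

The gradient check is a habit worth keeping: whenever you derive backprop by hand, comparing it with a finite-difference estimate catches most sign and indexing mistakes.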

# The Limitation of Autodidacts

By the time you finish the Basic Five, if you have genuinely learned something from them, recruiters will start to knock on your door. What you think and write about deep learning will appeal to many people. Perhaps you start to answer questions on forums? Or you might even write LinkedIn articles which get many Likes.

All good, but be cautious! During my year of administering AIDL, I've seen many people who purportedly took many deep learning classes, but after a few minutes of discussion, I could point out holes in their understanding. Some, after some probing, turned out to have taken only 1 class in its entirety. So they don't really grok deeper concepts such as back propagation. In other words, they could still improve, but they just refuse to. No wonder: with the hype of deep learning, many smart fellows just choose to start a company or code without really taking the time to grok the concepts well.

That's a pity. And all of us should be aware that self-learning is limited. If you decide to take a formal education path, like going to grad school, most of the time you will sit with people who are as smart as you and willing to point out your issues daily. So any of your weaknesses will be revealed sooner.

You should also be aware that, with deep learning being so hyped, the holes in your understanding are unlikely to be uncovered. That has nothing to do with whether you have a job. Many companies just want to hire someone to work on a task, and expect you to learn while working.

So what should you do then? I guess my first advice is to be humble and be aware of the Dunning-Kruger effect. Self-learning usually gives people an intoxicating feeling that they have learned a lot. But learning a lot doesn't mean you know everything. There are always higher mountains, and you do yourself a disservice if you stop learning.

The second thought is that you should try out your skills. e.g. It's one thing to know about CNN, it's another to run a training with ImageNet data. If you are smart, the former takes a day. The latter takes much planning, a powerful machine, and some practice to get even AlexNet trained.

My final advice is to talk with people and understand your own limitations. e.g. After reading many posts on AIDL, I noticed that while many people understand object classification well enough, they don't really grasp the basics of object localization/detection. In fact, I didn't either, even after the first pass through the videos. So what did I do? I just went through the videos on localization/detection again and again until I understood[8].

# After the Basic Five.......

So some of you would ask "What's next?" Yes, you have finished all these classes, and it feels as if you can't learn any more! Shake that feeling off! There are tons of things you still want to learn. So I list several directions you can go:

• Completionist: As of the first writing, I still haven't done all the homework in all five classes. Doing homework can really help your understanding, so if you are like me, I would suggest you go back to the homework and test your understanding.
• Intermediate Five: You have just learned the basics, so it's time to move to the next level. I don't have a concrete idea of the next 5 classes yet, but for now I would go with Koller's Bayesian Network, Columbia's EdX CSMM 102x, Berkeley's Deep Reinforcement Learning, Udacity's Reinforcement Learning and finally Oxford Deep NLP 2017.
• Drilling the Basics of Machine Learning: This goes in another direction: let's work on your fundamentals. For that, you can study Math topics forever. I would say the more important and non-trivial parts are perhaps Linear Algebra, Matrix Differentiation and Topology. Also check out this very good link on how to learn college-level Math.
• Specialize in one field: If you want to master just one field out of the Three Millennial Machine Learning Problems I mentioned, it's important for you to keep looking at specialized classes on computer vision or NLP. Since I don't want to clutter this point, I will discuss the relevant classes/material in future articles.
• Writing: That's what many of you have been doing, and I think it helps further your understanding. One thing I would suggest is to always write something new and something you want to read yourself. For example, there are too many blog posts on Computer Vision Using Tensorflow in the world. So why not write one which is all about what people don't know? For example, practical transfer learning for object detection. Or what is deconvolution? Or a literature review on some non-trivial architecture such as Mask-RCNN, comparing it with existing encoder-decoder structures? Writing this kind of article takes more time, but remember that quality trumps quantity.
• Coding/Githubbing: There is a lot of room for re-implementing ideas from papers and open-sourcing them. It is also a very useful skill, as many companies need it to reproduce many trendy deep learning techniques.
• Research: If you genuinely understand deep learning, you might see that many techniques need refinement. Indeed, currently there are plenty of opportunities to come up with better techniques. Of course, writing papers at the level of a professional researcher is tough and it's out of my scope. But only when you can publish will people give you respect as part of the community.
• Framework: Hacking a framework at the C/C++ level is not for the faint of heart. But if you are my type, who loves low-level coding, trying to come up with a framework yourself could be a great way to learn more. e.g. Check out Darknet, which is, surprisingly, written in C!

# Conclusion

So here you go. The complete Basic Five: what they are, why they are basic, and where you go from here. In a way, it's also a summary of what I have learned so far from various classes since Jun 2015. As with my other posts, if I learn more in the future, I will keep this post updated. I hope this post keeps you learning deep learning.

Arthur Chan

Footnote:
[1] Before 2017, there was no coherent set of Socher's class available on-line.  Sadly there was also no legitimate version.  So the version I refer to is a mixture of 2015 and 2016 classes.   Of course, you may find a legitimate 2017 version of cs224n on Youtube.

[2] My genuine expertise is speech recognition; unfortunately that's not a topic I can share much about due to IP issues.

[3] "Stanford Trinity" is a term I learned from Andreessen Horowitz's AI Playbook list.

[4] Some of you (e.g. from AIDL) would jump up and say "No way! I thought that NLP wasn't solved by deep learning yet!" That's because you are one lost soul, misinformed by misinformed blog posts. ASR was the first field to be tackled by deep learning, and that dates back to 2010. And most systems you see in SMT are seq2seq-based.

[5] I have been in the business of speech recognition since 1998, when I worked on a voice-activated project for my undergraduate degree back at HKUST. It was a mess, but that's how I started.

[6] As for the last one, you can always search for it on Youtube. Of course, it is not legit for me to share it here.

[7] I also audit,

I also took,

[8] It's still a subject that *I* could explore. For example, just the logistics seem to be hard enough to set up.

* * *

History:

20170513: First version finished

-------------


Since I started to re-learn machine learning, I have written several review articles on various classes, books and resources. Here is a collection of links:

For the Not-So-Uninitiated: Review of Ng's Coursera Machine Learning Class

One Algorithm to rule them all - Reading "The Master Algorithm"

Radev's Coursera Introduction to Natural Language Processing - A Review

Learning Deep Learning - My Top-Five List

Learning Machine Learning - Some Personal Experience

A Review on Hinton's Coursera "Neural Networks and Machine Learning"

Reading Michael Nielsen's "Neural Networks and Deep Learning"

Arthur

# A Review on Hinton's Coursera "Neural Networks and Machine Learning"

For me, finishing Hinton's deep learning class, Neural Networks and Machine Learning (NNML), was a long overdue task. As you know, the class was first launched back in 2012. I was not so convinced by deep learning back then. Of course, my mind changed around 2013, but by then the class was archived. It was not until 2 years later that I decided to take Andrew Ng's class on ML, and finally I was able to go through Hinton's class once. But only last October, when the class relaunched, did I decide to take it again, i.e. watch all the videos a second time, finish all the homework and get passing grades for the course. As you will see from my journey, this class is hard. Some videos I watched 4-5 times before grokking what Hinton said. Some assignments made me take long walks to think them through. Finally I made it through all 20 assignments, and even bought a certificate for bragging rights. It was a refreshing, thought-provoking and satisfying experience.

So this piece is my review of the class, why you should take it and when. I also discuss one question which has been floating around forums from time to time: given all the deep learning classes available now, is Hinton's class outdated? Or is it still the best beginner class? I will chime in on the issue at the end of this review.

# The Old Format Is Tough

I admire people who could finish this class in Coursera's old format. NNML is well-known to be much harder than Andrew Ng's Machine Learning, as multiple reviews said (here, here). Many of my friends who have PhDs cannot quite follow what Hinton said in the last half of the class.

No wonder: when Karpathy reviewed it in 2013, he noted that there was an influx of non-MLers working on the course. For newcomers, it must be bewildering to try to understand topics such as energy-based models, which many people have a hard time following. Or what about the deep belief network (DBN), which people these days still mix up with the deep neural network (DNN)? And quite frankly, I still don't grok some of the proofs in lecture 15 after going through the course, because deep belief networks are difficult material.

The old format only allowed 3 attempts per quiz, with tight deadlines, and you only had one chance to finish the course. One homework requires deriving the matrix form of backprop from scratch. All of this makes the class unsuitable for busy individuals (like me), and more suitable for second to third year graduate students, or even experienced practitioners who have plenty of time (but who does?).

# The New Format Is Easier, but Still Challenging

I took the class last October, when Coursera had changed most classes to the new format, which allows students to re-take.  [1]  That takes away some of the difficulty, and makes the class more suitable for busy people. That doesn't mean you can go easy on the class: for the most part, you need to review the lectures, work out the Math, draft pseudocode etc. The homework which requires you to derive backprop is still there. The upside: you can still have all the fun of deep learning. 🙂 The downside: you shouldn't expect to get through the class without spending 10-15 hours/week.

# Why the Class is Challenging -  I: The Math

Unlike Ng's class and cs231n, NNML is not easy for beginners without a background in calculus.  The math is not too difficult: mostly differentiation with the chain rule, some intuition about what a Hessian is, and, more importantly, vector differentiation.  But if you have never learned these, the class will be over your head.  Take at least Calculus I and II before you join, and know some basic equations from the Matrix Cookbook.

# Why the Class is Challenging - II:  Energy-based Models

Another reason why the class is difficult is that the last half of the class is based on so-called energy-based models, i.e. models such as the Hopfield network (HopfieldNet), the Boltzmann machine (BM) and the restricted Boltzmann machine (RBM).  Even if you are used to the math of supervised learning methods such as linear regression, logistic regression or even backprop, the math of the RBM can still throw you off.  No wonder: many of these models have physical origins, such as the Ising model.  Deep learning research also frequently uses ideas from Bayesian networks, such as explaining away.  If you have no basic background in either physics or Bayesian networks, you will feel quite confused.
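
To give a flavor of what "energy-based" means, here is a sketch of the energy function of a Hopfield network with binary units.  The notation is my own assumption for illustration: $s_i$ is the state of unit $i$, $b_i$ its bias, and $w_{ij}$ the symmetric weight between units $i$ and $j$.

$$E(\mathbf{s}) = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$$

In a Boltzmann machine, the same kind of energy defines a probability over configurations, $p(\mathbf{s}) \propto e^{-E(\mathbf{s})}$, so low-energy states are the likely ones.  That is where the link to physical models such as the Ising model comes from.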

In my case, I spent quite some time Googling and reading the relevant literature.  That powered me through some of the quizzes, but I won't pretend I understand those topics, because they can be deep and unintuitive.

# Why the Class is Challenging - III: Recurrent Neural Network

If you learn RNNs these days, it is probably from Socher's cs224d or from reading Mikolov's thesis.  The LSTM would easily be your only thought on how to resolve exploding/vanishing gradients in RNNs.  Of course, there are other ways: echo state networks (ESN) and Hessian-free methods.  They are seldom talked about these days.  Again, their formulation is quite different from standard methods such as backprop and gradient descent.  But learning them gives you breadth, and makes you question whether the status quo is the right thing to do.

# But is it Good?

You bet! Let me qualify that statement in the next section.

# Why is it good?

Suppose you just want to use some of the fancier tools in ML/DL.  I guess you can just go through Andrew Ng's class, test out a bunch of implementations, then call yourself an expert.  That's what many people do these days.  In fact, Ng's Coursera class is designed to give you a taste of ML, and indeed you should be able to wield many ML tools after the course.

That said, you should realize that your understanding of ML/DL is still ... rather shallow.  Maybe you are thinking, "Oh, I have a bunch of data, let's throw it into Algorithm X!" or "Oh, we just want to use XGBoost, right? It always gives you the best results!"  You should realize that a performance number isn't everything.  It's important to understand what's going on with your model.  You can easily make costly, short-sighted and ill-informed decisions when you lack understanding.  It has happened to many of my peers, to me, and sadly even to some of my mentors.

Don't make that mistake!  Always seek better understanding! Try to grok.  If you have only done Ng's neural network assignment, by now you would still wonder how it can be applied to other tasks.  Go for Hinton's class, feel perplexed by what the Prof says, and iterate.  Then you will start to build up a better understanding of deep learning.

Another, more technical note: if you want to learn deep unsupervised learning, I think this should be your first course as well.  Prof. Hinton teaches you the intuition behind many of these machines, and you will also have a chance to implement them.  For models such as the Hopfield net and the RBM, it's quite doable if you know basic Octave programming.

# So it's good, but is it outdated?

Learners these days are perhaps luckier: they have plenty of choices for learning a deep topic such as deep learning.  Just check out my own "Top 5 List".  cs231n, cs224d and even Silver's class are great contenders to be your second class.

But I still recommend NNML.  There are four reasons:

1. It is deeper and tougher than other classes.  As I explained before, NNML is tough, not so much mathematically (the math in Socher's and Silver's classes is also non-trivial) as conceptually.  Energy-based models and the different ways to train RNNs are some examples.
2. Many concepts in ML/DL can be seen in different ways.  For example, bias/variance is a trade-off for frequentists, but it is seen as a "frequentist illusion" by Bayesians.  The same can be said about concepts such as backprop and gradient descent.  Once you think about them, they are tough concepts.  So one reason to take a class is not just to learn a concept, but to learn to look at things from a different perspective.  In that sense, NNML fits the bill perfectly.  I found myself thinking about Hinton's statements during many long promenades.
3. Hinton's perspective.  Prof. Hinton has been mostly on the losing side of ML during the last 30 years, but he persisted.  From his lectures, you get a feeling for how and why he started a certain line of research, and perhaps ultimately for how you would research something yourself in the future.
4. Prof. Hinton's delivery is humorous.  Check out his view in Lecture 10 about why physicists worked on neural networks in the early 80s.  (Note: he was a physicist before working on neural networks.)

# Conclusion and What's Next?

All in all, Prof. Hinton's "Neural Networks for Machine Learning" is a must-take class.  All of us, beginners and experts included, will benefit from the professor's perspective and the breadth of the subject.

I do recommend that you first take Ng's class if you are an absolute beginner, and perhaps some Calculus I or II, plus some linear algebra, probability and statistics.  It would make the class more enjoyable (and perhaps doable) for you.  In my view, both Karpathy's and Socher's classes are perhaps easier second classes than Hinton's.

If you finish this class, make sure you check out other fundamental classes.  Check out my post "Learning Deep Learning - My Top 5 List", and you will have plenty of ideas for what's next.  A special mention here perhaps is Daphne Koller's Probabilistic Graphical Models, which I found equally challenging, and which may give you some insight into very deep topics such as the deep belief network.

Another suggestion for you: maybe you can take the class again. That's what I plan to do about half a year from now.  As I mentioned, I don't understand every single nuance in the class.  But I think understanding will come on my 6th or 7th pass through the material.

Arthur Chan

[1] To me, this makes a lot of sense for both the course's preparer and the students, because students can take more time to really go through the homework, and the course's preparer can monetize the class for an indefinite period of time.

History:

(20170410) First writing
(20170411) Fixed typos. Smooth up writings.
(20170412) Fixed typos
(20170414) Fixed typos.

If you like this message, subscribe the Grand Janitor Blog's RSS feed. You can also find me (Arthur) at twitter, LinkedInPlus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum.  Also check out my awesome employer: Voci.

# Introduction

Let me preface this article: after I wrote my top five list of deep learning resources, one oft-asked question has been "What are the math prerequisites for learning deep learning?"  My first answer is calculus and linear algebra, but then I qualify it: certain techniques from calculus and linear algebra are more useful than others.  E.g. you should already know gradients, differentiation, partial differentiation and Lagrange multipliers; you should know matrix differentiation and preferably the trace trick, eigendecomposition and such.  If your goal is to understand machine learning in general, then good skills in integration and some knowledge of analysis help.  E.g. the 1-2 star problems of Chapter 2 of PRML [1] require some knowledge of special functions such as gamma and beta.  Having some math will help you get through these questions more easily.

Nevertheless, I find that people who want to learn math first before approaching deep learning miss the point.  Many engineering topics were not motivated by pure mathematical pursuit.  More often than not, an engineering field is motivated by a physical observation, and mathematics is more like an aid to imagining and creating a new solution.  Take deep learning: if you listen to Hinton, he often says that he first tries to come up with an idea and makes it work mathematically later.  His insistence on working on neural networks in the era of kernel methods stems more from his observations of the brain.  "If the brain can do it, how come we can't?" should be a question you ask every day when you run a deep learning algorithm.  I think these observations are fundamental to deep learning, and you should go through the arguments for why people thought neural networks were worthwhile in the first place.  Reading the classic papers by Hubel and Wiesel helps.  Understanding the history of neural networks helps.  Once you read these materials, you will quickly grasp the big picture of much of the development of deep learning.

That said, I think there are certain topics which are fundamental in deep learning, and they are not necessarily very mathematical.  For example, I would name back propagation [2] as a very fundamental concept which you want to get good at.  Now, you may think that's silly.  "I know backprop already!"  Yes, backprop is probably in every single machine learning class, and that easily gives you the illusion that you have mastered the material.  But you can always learn more about a fundamental concept, and back propagation is important both theoretically and practically.  You will encounter back propagation whether you are a user of deep learning tools, a writer of a deep learning framework or an innovator of new algorithms.  So a thorough understanding of backprop is very important, and one course is not enough.

This very long digression finally brings me to a great introductory book, Michael Nielsen's Neural Networks and Deep Learning (NNDL).  The reason I think Nielsen's book is important is that it offers an alternative discussion of back propagation as an algorithm.  So I will use the rest of the article to explain why I appreciate the book so much, and why I recommend it to nearly all beginning or intermediate learners of deep learning.

# First Impression

I first learned about "Neural Networks and Deep Learning" (NNDL) while going through TensorFlow's tutorial.  My first thought was "ah, another blogger trying to cover neural networks", i.e. I didn't think it was promising.  At that time, there were already plenty of articles about deep learning.  Unfortunately, they often repeated the same topics without bringing anything new.

# Synopsis

Don't make my mistake!  NNDL is a great introductory book which balances the theory and practice of deep neural networks.  The book has 6 chapters:

1. Using neural nets to recognize handwritten digits - the basics of neural networks, and a basic implementation in Python (network.py)
2. How the backpropagation algorithm works - various explanations of back propagation
3. Improving the way neural networks learn - standard improvements to simple back propagation, and another implementation in Python (network2.py)
4. A visual proof that neural nets can compute any function - the universal approximation theorem without the math, plus fun interactive games in which you approximate functions yourself
5. Why are deep neural networks hard to train? - the practical difficulties of using back propagation, vanishing gradients
6. Deep learning - the convolutional neural network (CNN), the final implementation based on Theano (network3.py), and recent advances in deep learning (circa 2015).

The accompanying Python scripts are the gems of the book. network.py and network2.py run in plain old Python (with NumPy).  You need Theano for network3.py, but I think the strength of the book really lies in network.py and network2.py (Chapters 1 to 3), because if you want to learn CNNs, Karpathy's lectures probably give you more bang for your buck.

# Why I Like Nielsen's Treatment of Back Propagation

Reading Nielsen's exposition of neural networks was the sixth time I learned the basic formulation of back propagation [see footnote 3].  So what's the difference between his treatment and my other reads?

Forget about my first two reads, because I didn't care about neural networks enough to know why back propagation is so named.  But my later reads pretty much gave me the same impression of neural networks: "A neural network is merely a stack of logistic functions.  So how do you train the system?  Oh, just differentiate the loss function; the rest is technicalities."  Usually the book guides you to verify certain formulae in the text.  Of course, you are also guided to deduce that the "error" is actually "propagating backward" through the network.  Let us call this the network-level view.  In the network-level view, you really don't care about how individual neurons operate.  All you care about is seeing the neural network as yet another machine learning algorithm.

The problem with the network-level view is that it doesn't quite explain a lot of phenomena in back propagation.  Why is it so slow sometimes?  Why do certain initialization schemes matter?  Nielsen does an incredibly good job of breaking down the standard equations into 4 fundamental equations (BP1 to BP4 in Chapter 2).  Once you interpret them, you will realize "Oh, saturation is really a big problem in back propagation" and "Oh, of course you have to initialize the weights of a neural network with non-zero values, or else nothing propagates forward or backward!"  These insights, while not mathematical in nature and understandable with college calculus, give a deeper understanding of back propagation.
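
For reference, here is a sketch of the four equations as I understand them from Chapter 2.  Treat the notation as my assumption: $z^l$ is the weighted input of layer $l$, $a^l = \sigma(z^l)$ its activation, $\delta^l = \partial C / \partial z^l$ its error, $C$ the cost, $L$ the output layer and $\odot$ the elementwise product.

$$\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) && \text{(BP1)} \\
\delta^l &= \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) && \text{(BP2)} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j && \text{(BP3)} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j && \text{(BP4)}
\end{aligned}$$

Both insights can be read off directly.  A saturated neuron has $\sigma'(z) \approx 0$, so BP1 and BP2 make its error, and hence its gradients, tiny.  And if every entry of $w^{l+1}$ is zero, BP2 gives $\delta^l = 0$, so no error reaches the earlier layers.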

Another valuable part of Nielsen's explanation is that it comes with an accessible implementation.  His first implementation (network.py) is 74 lines of idiomatic Python.  By adding print statements to his code, you will quickly grasp how a lot of these daunting equations are implemented in practice.  For example, as an exercise, you can try to identify how he implements BP1 to BP4 in network.py.  It's true that there are other books and implementations of neural networks, but the description and the implementation don't always come together.  Nielsen's presentation is a rare exception.
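
As a warm-up for that exercise, here is a minimal sketch of backprop for a small sigmoid network with a quadratic cost, written with plain Python lists.  It is not Nielsen's code: the function names and the nested-list layout (`weights[l][j][k]` connects input `k` to output `j` of the l-th weight matrix) are my own assumptions for illustration.  The comments mark where each of BP1 to BP4 from Chapter 2 shows up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def feedforward(weights, biases, x):
    """Return the weighted inputs (zs) and activations (acts) of every layer."""
    acts, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, acts[-1])) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        acts.append([sigmoid(zj) for zj in z])
    return zs, acts


def backprop(weights, biases, x, y):
    """Gradients of C = 0.5 * sum((a - y)^2) for a single training example."""
    zs, acts = feedforward(weights, biases, x)
    # BP1: error of the output layer.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(acts[-1], y, zs[-1])]
    grad_b = [None] * len(weights)
    grad_w = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta[:]                                   # BP3
        grad_w[l] = [[d * a for a in acts[l]] for d in delta]  # BP4
        if l > 0:
            # BP2: push the error back through the transposed weights.
            delta = [sum(weights[l][j][k] * delta[j]
                         for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_b, grad_w
```

If you call `backprop` with all-zero weights, the gradients of the first layer's weights come out as zero, which is the "nothing propagates back" insight in action.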

# Other Small Things I Like

• Nielsen correctly points out that the Del symbol in machine learning is more of a convenient device than the Del operator in its usual mathematical sense.
• In Chapter 4, Nielsen discusses the universal approximation property of neural networks.  Unlike standard textbooks, which point you to a bunch of papers with daunting math, Nielsen created JavaScript widgets which let you approximate functions yourself (!), which I think is a great way to learn the intuition behind the theorem.
• He points out that it's important to distinguish the activation from the weighted input.  In fact, this is one thing which can confuse you when reading a derivation of back propagation, because different textbooks use different symbols for the activation and the weighted input.

There are many such insightful comments in the book; I encourage you to read and discover them.

# Things I don't like

• There are many exercises in the book.  Unfortunately, there are no answer keys.  In a way, this makes Nielsen more of an old-style author who encourages readers to think.  I guess this is something I don't always like, because spending forever on one single problem doesn't always give you better understanding.
• Chapter 6 gives the final implementation in Theano.  Unfortunately, there is not much introductory material on Theano within the book.  I think this is annoying but forgivable: as Nielsen points out, it's hard to introduce Theano in an introductory book.  I would think anyone interested in Theano should probably go through the standard Theano tutorials here and here.

# Conclusion

All in all, I highly recommend Neural Networks and Deep Learning to any beginning or intermediate learner of deep learning.  If this is the first time you are learning back propagation, NNDL is a great general introductory book.  If you are like me and already know a thing or two about neural networks, NNDL still has a lot to offer.

Arthur

[1] In my view, PRML's problems have 3 ratings: 1-star, 2-star and 3-star.  1-star problems usually require college-level calculus and patient manipulation; 2-star problems require some creative thought in problem solving, or knowledge beyond basic calculus.  3-star problems are longer-form questions which can contain multiple 2-star questions in one.  For your reference, I solved around 100 out of the 412 questions.  Most of them are 1-star questions.

[2] The other important concept in my mind is gradient descent, and it is still an active research topic.

[3] The 5 earlier reads: I "learnt" it once back at HKUST, read it in Mitchell's book, read it in Duda and Hart, learnt it again from Ng's lecture, and read it again in PRML.  My 7th was learning it from Karpathy's lecture; he presents the material in yet another way.  So it's worth your time to look at them all.


# Some Thoughts on Learning Machine Learning/Data Science

I have been refreshing myself on various aspects of machine learning and data science.  For the most part it has been a very nice experience.  What I like most is that I am finally able to grok many of the machine learning jargon terms people talk about.  They gave me a lot of trouble, even as a mere practitioner of machine learning, because most people just assume you have some understanding of what they mean.

Here is a little secret: these jargon terms range from very shallow to very deep.  For instance, "lasso" just means setting the exponent of the regularization term to 1, i.e. penalizing the absolute values of the weights.  I always think people just don't want to say the mouthful "set the exponent of the regularization term to 1", so they came up with "lasso".
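
To make that concrete, here is a sketch for linear regression.  The symbols are my own assumption: $w$ are the weights, $(x_i, y_i)$ the training pairs, $\lambda$ the regularization strength and $q$ the exponent.

$$\min_w \; \sum_i \left(y_i - w^T x_i\right)^2 + \lambda \sum_j |w_j|^q$$

Setting $q = 1$ gives the lasso; setting $q = 2$ gives ridge regression.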

Then there is the bias-variance trade-off.  Now here is a concept which is very hard to explain well.  What opened my mind is what Andrew Ng said in his Coursera lecture: "just forget the terms bias and variance".  Then he moves on to talk about over- and under-fitting, which is a much easier concept to understand.  And then he leads you to think: when a model underfits, we have an estimator with "huge bias", and when the model overfits, the estimator allows too much "variance".  Now that's a much easier way to understand it.  Over- and under-fitting can be visualized.  Anyone who understands polynomial regression understands what overfitting is.  That easily leads you to a eureka moment: "Oh, complex models can easily overfit!"  That's actually the key to understanding the whole phenomenon.

Not only are people getting better at explaining different concepts; several important ideas are also enunciated better.  E.g. reproducibility is huge, and it should be huge in machine learning as well.  Yet even now you see junior scientists at the entry level ignore all the important measures for making sure their work is reproducible.  That's a pity.  In speech recognition, for example, I remember there was a dark time when training a broadcast news model was so difficult, despite the fact that we knew people had done it before.  How much time do people waste repeating other people's work?

Nowadays, perhaps I would just tell younger scientists to take Johns Hopkins' "Reproducible Research".  No kidding.  Pay \$49 to finish that class.

Anyway, that's my rambling for today.  Before I go: I have been actively engaged in Facebook's Deep Learning group.  It turns out many of the forum users would love to hear more about how to learn deep learning.  Perhaps I will write up more in the future.

Arthur

# Some Speculations On Why Microsoft Tay Collapsed

Microsoft's Tay, following Google's AlphaGo, was meant to be yet another highly intelligent A.I. program which fulfills humanity's long-standing dream: a machine which can truly converse.  But as you know, Tay failed spectacularly.  To me, this is a highly unusual event.  Part of it is that Microsoft's other conversational agent, Xiaoice, was extremely successful in China.  The other part is that MSR is one of the leading sites in applying deep learning to various machine learning problems.  You would think that a major P.R. problem such as Tay confirming "Donald Trump is the hope" and purportedly supporting genocide would have been weeded out before launch.

I read many posts in the past week which attempted to describe why Tay failed; sadly, they offered me no insight.  Some even came from respected magazines.  E.g. in the New Yorker's "I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter", the author concluded at the end,

"If there is a lesson to be learned, it is that consciousness wants conscience. Most consumer-tech companies have, at one time or another, launched a product before it was ready, or thought that it was equipped to do something that it ended up failing at dismally. "

While I always love the prose of the New Yorker, there is really no machine which can mimic or model human consciousness (yet).  In fact, no one really knows how "consciousness" works, and it's tough even to define what "consciousness" is.  It's also worthwhile to mention that chatbot technology is not new.  Google had released similar technology and got great press.  (See here)  So the New Yorker's piece reflects how little the public understands the technology.

As a result, I decided to write a postmortem of Tay myself, and offer some thoughts on why this problem could occur and how one could actively avoid such problems.

Since I am writing this piece for a general audience (say, my Facebook friends), it contains only a small amount of technical detail.  If you are interested, I also list several more technical articles in the reference section.

## How does a Chatbot work?  The Pre-Deep Learning Version

By now, all of us have used a chat bot or two.  There is obviously Siri, which is perhaps the first program to put speech recognition and dialogue systems in the national spotlight.  If you are familiar with the history of computing, you probably know ELIZA [1], which is the first example of using a rule-based approach to respond to users.

What does that mean?  In such a system, a natural language parser is usually used to parse the human's input, and then an answer is produced from pre-defined, mostly hand-written rules.  It's a simple approach, but when it's done correctly, it creates an illusion of intelligence.
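
To make the idea concrete, here is a toy sketch of a rule-based responder.  It is not ELIZA's actual script: the three rules, the fallback line and the `respond` function are all made up for illustration, and a regular expression stands in for the parser.

```python
import re

# Hypothetical hand-written rules: (pattern, response template).
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What is on your mind?"),
]
FALLBACK = "Please tell me more."


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Reuse the captured part of the user's own words in the reply.
            return template.format(*match.groups())
    return FALLBACK


print(respond("I am tired."))
print(respond("What is the weather?"))
```

Echoing the user's own words back is what creates the illusion; anything the rules don't anticipate falls through to the same canned line.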

The rule-based approach can go quite far.  E.g. the ALICE language is a pretty popular tool for creating intelligent-sounding bots. (Its history is shown here.)  There are many existing tools which help programmers create dialogue.  Programmers can also import existing dialogues into their own systems.

The problem with the rule-based approach is obvious: the responses are rigid.  If someone uses the system for a while, they will easily notice they are talking with a machine.  In a way, you can say the illusion is easily dispelled by human observation.

Another issue with the rule-based approach is that producing a large-scale chat bot taxes programmers.  Even with convenient languages such as AIML (Artificial Intelligence Markup Language), it would take a programmer a long, long time to come up with a chat bot, not to mention one which can answer a wide variety of questions.

## Converser as a Translator

Before we go on to look at chat bots in the time of deep learning, it is important to ask how we can model conversation.  Of course, you can think of it as ... well ... we first parse the sentence, generate entities and their grammatical relationships, then based on those relationships, we come up with an answer.

This approach of decomposing a sentence into its elements is very natural to human beings.  In a way, this is also how the rule-based approach arose in the first place.  But we just discussed the weakness of the rule-based approach: it is hard to program and hard to generalize.

So here is a more convenient way to think.  You could simply ask, "Hey, now I have an input sentence, what is the best response?"  It turns out this is very similar to the formulation of statistical machine translation: "If I have an English sentence, what would be the best French translation?"  As it turns out, a converser can be built with the same principles and technology as a translator.  So all the powerful technology developed for statistical machine translation (SMT) can be used to make a conversation bot.  This technology includes the IBM models, phrase-based models and syntax-based models [2].  And the training is very similar.
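
In symbols, a sketch of that formulation looks like the noisy-channel form commonly used in SMT.  The names are my own: $q$ is the input sentence and $r$ a candidate response.

$$\hat{r} = \arg\max_r P(r \mid q) = \arg\max_r P(q \mid r)\, P(r)$$

Here $P(q \mid r)$ plays the role of the translation model and $P(r)$ that of the language model.  Replace $q$ with an English sentence and $r$ with a French one, and you have translation.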

In fact, this is how many chat bots were made just before deep learning arrived.  Some methods simply use an existing translation system trained on input-response pairs, e.g. [3].

The good thing about using a statistical approach is that it generalizes much better than the rule-based approach.  Also, as the program is based on machine learning, all you have to do is to prepare (carefully) a bunch of training data.  Then an existing machine learning program helps you come up with a system automatically.  It frees the programmer from long and tedious tweaking of the bot.

## How does a Chatbot work?  The Deep Learning Version

Now, given what we have discussed, how does Microsoft's chat bot Tay work?  Since we don't know Tay's implementation, we can only speculate:

1. Tay is smart, so it doesn't sound like a purely rule-based system.  So let's assume it is based on the aforementioned "converser-as-translator" paradigm.
2. It's Microsoft, so there has got to be some deep neural network.  (Microsoft is one of the first sites to have picked up the modern "deep" neural network paradigm.)
3. What's the data?  Well, given that Tay was built for millennials, the people who trained Tay must have used dialogue between teenagers.  If I were a researcher at Microsoft [4], maybe I would use data collected from Microsoft Messenger or Skype.  Since Microsoft has age data for all its users, the data can easily be segmented and bundled into training sets.

So let's piece everything together.  Very likely, Tay is a neural-network (NN)-based program which can intelligently translate a user's natural language input into a response.  The program's training is based on chat data.  So my speculation is that the data is exactly where things went wrong.  Before I conclude: the neural network in question is likely to be a Long Short-Term Memory (LSTM) network.  I believe Google's researchers were the first to advocate such an approach [5] (it made headlines last year, and the bot is known for its philosophical undertone).  Microsoft published a couple of papers on how LSTMs can be used to model conversation [6].  There is also existing bot-building software online, e.g. Andrej Karpathy's char-RNN.  So it's likely that Tay is based on such an approach. [7]

## What goes wrong then?

Oh well, Tay is just a machine learning program, so her behavior is really governed by the training material.  Since the training data is likely to be chat data, we can only conclude that the data must contain some offensive speech, given the political landscape of the world.  So one reasonable hypothesis is that the researchers who prepared the training material hadn't really filtered out hate speech and other sensitive topics.  I guess one potential explanation for not doing so is that filtering would reduce the amount of training data.  But given the data owned by Microsoft, that doesn't make sense.  Say, 20% of 1 billion conversations is still 200 million, which is more than enough to train a good chatterbot.  So I tend to think the issue is a human oversight.

And then, as a simple fix, you can also give the robot a list of keywords.  E.g. you can just program a simple regular expression match for "Hitler", then make sure there is a special rule to respond to the user with "No comment".  At least the consequence wouldn't be as huge as a takedown.  Then again, it's another indication that there were oversights in the development.  Had more time been spent on testing the program, this kind of issue would have been noticed and rooted out.
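
Such a guard can be sketched in a few lines.  Everything here is hypothetical: the two-word blocklist, the `guarded_reply` wrapper and the `generate_reply` callback, which stands in for whatever model produces the bot's answer.  It says nothing about how Tay was actually built.

```python
import re

# Hypothetical blocklist; a real deployment would need a far longer, curated list.
BLOCKED = re.compile(r"\b(hitler|genocide)\b", re.IGNORECASE)


def guarded_reply(user_input, generate_reply):
    """Answer "No comment" if the input or the generated reply hits the blocklist."""
    if BLOCKED.search(user_input):
        return "No comment"
    reply = generate_reply(user_input)
    # Check the model's own output too, not only what the user typed.
    if BLOCKED.search(reply):
        return "No comment"
    return reply


# Usage with a stand-in "model" that just echoes the input.
print(guarded_reply("What do you think of Hitler?", lambda text: text))
print(guarded_reply("Nice weather today", lambda text: text))
```

Checking the generated reply as well as the input matters, because a learned model can produce a blocked word even when the user never typed it.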

## Conclusion

In this piece, I came up with a couple of hypotheses for why Microsoft's Tay failed.  In the end, I echo the title of the New Yorker's piece, "I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter" ... at least partially. Tay is perhaps one of the smartest chatterbots, backed by one of the strongest research organizations in the world and trained on tons of data. But it was not destroyed by Twitter or trolls. More likely, it was destroyed by human oversight and a lack of testing. In this sense, its failure is not too different from why much software fails.

Reference/Footnote

[1] Weizenbaum, Joseph "ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine", Communications of the ACM 9 (1): 36–45,

[2] Philipp Koehn, Statistical Machine Translation

[3] Alan Ritter, Colin Cherry, and William Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583–593. Association for Computational Linguistics.

[4] Woa! I could only dream! But I prefer to work on speech recognition, instead of chatterbot.

[5] Oriol Vinyals and Quoc Le, A Neural Conversational Model.

[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models

[7] A more technical point here: using an LSTM, a type of recurrent neural network (RNN), also resolves one issue of classical models such as the IBM models: their language model is usually an n-gram, which has limited long-range prediction capability.