Categories
Andrew Ng course review deep learning deep neural network Geoff Hinton Machine Learning review

Review of Ng’s deeplearning.ai Course 2: Improving Deep Neural Networks

(My Reviews on Course 2 and Course 3.)

In your life, there are times you think you know something, yet genuine understanding seems to elude you. It’s always frustrating, isn’t it? For example, why do seemingly simple concepts such as gradients or regularization still throw us off, even though we have been learning them since Day 1 of our study of machine learning?

In programming, there’s a term called “grok”. Grokking something usually means that you not only know the term, but also have an intuitive understanding of the concept. Or as in “Zen and the Art of Motorcycle Maintenance” [1], you just try to dive deep into a concept, as if it is a journey. For example, if you really think about speech recognition, you would realize that the frame independence assumption [2] is very important, because it simplifies the problem in both search and parameter estimation. Yet it certainly introduces a modeling error. These small things, which are not mentioned in classes or lectures, are the things you need to grok.

That brings us to Course 2 of deeplearning.ai.  What are you grokking in this Course?  After you take Course 1, should you take Course 2?  My answer is yes and here is my reasoning.

Really, What is Gradient Descent?

Gradient descent is a seemingly simple subject – say you want to find the minimum of a convex function, so you follow the gradient downhill and after many iterations, you eventually hit the minimum. Sounds simple, right?

Of course, you then start to realize that functions are normally not convex, that they are n-dimensional, and that there can be plateaus. Or you follow the gradient, but it happens to point in a wrong direction, so you zigzag as you try to descend. It’s a little bit like descending a real mountain, except that you can’t really see n-dimensional space!
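To make the zigzagging concrete, here is a minimal numpy sketch. The function, the starting point and the learning rate are my own choices for illustration, not something taken from the course: a 2-D quadratic that is steep in one direction and flat in the other.

```python
import numpy as np

def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent; returns every iterate so the path can be inspected."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x.copy())
    return np.array(path)

# Illustrative function: f(x, y) = 0.5 * (x**2 + 25 * y**2).
# It is a narrow valley, steep along y and flat along x.
def grad_f(p):
    return np.array([p[0], 25.0 * p[1]])

path = gradient_descent(grad_f, x0=[10.0, 1.0], lr=0.07, steps=50)

# Count how often the y coordinate changes sign between consecutive steps.
sign_flips = int(np.sum(np.sign(path[1:, 1]) != np.sign(path[:-1, 1])))
print("sign flips in y:", sign_flips)
print("final point:", path[-1])
```

With this learning rate the y coordinate overshoots the valley floor on every step, so it bounces from one wall to the other, while the x coordinate only creeps toward the minimum. Even on a convex bowl, the gradient does not point straight at the minimum.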

That explains the early difficulty of deep learning development – stochastic gradient descent (SGD) was just too slow for DNNs back in the 2000s. That resulted in very interesting research on the restricted Boltzmann machine (RBM), which was stacked and used to initialize DNNs. This technique, pretraining, was a prominent subject of Hinton’s NNML after Lecture 8, and it is still used in some recipes in speech recognition as well as financial prediction.

But we are not doing RBM any more! In fact, research in RBM is not as fervent as in 2008. [3] Why? It has to do with people simply understanding more about SGD and being able to run it better. Part of it is initialization, e.g. Glorot’s and He’s initialization. Part of it is how gradient descent is done – ADAM is our current best.
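For reference, the two initialization schemes can be sketched in a few lines of numpy. The function names and the (fan_out, fan_in) weight shape are my own conventions for this sketch; the scaling rules are the commonly quoted ones, a uniform limit of sqrt(6 / (fan_in + fan_out)) for Glorot and a standard deviation of sqrt(2 / fan_in) for He.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform init: keeps the activation scale roughly stable for tanh-like units."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He normal init: variance 2 / fan_in, usually paired with ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_tanh = glorot_uniform(512, 256, rng)
W_relu = he_normal(512, 256, rng)
print(W_tanh.std(), W_relu.std())
```

The point of both rules is the same: pick the weight scale from the layer sizes so that signals neither blow up nor die out as they pass through many layers.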

So how do you learn this stuff? Before Ng’s deeplearning.ai class, I would say knowledge like this was spread out over courses such as cs231n or cs224n. But as I mentioned in my Course 1 review, those are really courses with specific applications in mind. Or you can read Michael Nielsen’s Neural Networks and Deep Learning. Of course, Nielsen’s work is a book, so it really depends on whether you have the patience to work through the details while reading. (Also see my review of the book.)

Now you don’t have to. The one-stop shop is Course 2. Course 2 covers the material I just mentioned, such as initialization and gradient descent, as well as deeper concepts such as regularization and batch normalization. That makes me recommend that you keep on taking the course after you finish Course 1. If you take the class, and are also willing to read Sebastian Ruder’s Review of SGD or Gabriel Goh’s Why Momentum Really Works, you will be well ahead of the game.

As a note, I also like how Andrew breaks down many of the SGD algorithms as smoothing algorithms. That was a new insight for me, even after having used SGD many times.
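The smoothing in question is the exponentially weighted average, which is also the running average that momentum-style updates keep of the gradient. Below is a small sketch of it; the function name, the default beta = 0.9 and the noisy test signal are my own assumptions for illustration.

```python
import numpy as np

def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        # Dividing by (1 - beta**t) removes the bias toward 0 caused by starting at v = 0.
        out.append(v / (1.0 - beta ** t) if bias_correction else v)
    return np.array(out)

rng = np.random.default_rng(0)
noisy = rng.normal(size=1000)   # stand-in for noisy per-minibatch gradients
smooth = ewa(noisy)
print(noisy.std(), smooth.std())
```

Seen this way, a noisy gradient sequence goes in and a much calmer one comes out, which is why the update direction stops jumping around from one minibatch to the next.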

Is it hard?

Nope. As far as the math goes, Course 1 is probably the toughest. Of course, even in Course 1, you will finish the coursework faster if you don’t overthink the problem. Most notebooks have the derived results for you. On the other hand, if you do want to derive the formulae, you need decent skill in matrix calculus.

Is it Necessary to Understand These Details?; Also Top-Down vs Bottom-Up learning, which is Better?

A legitimate question here is: in the current state of deep learning, where so many toolkits already implement techniques such as ADAM, do I really need to dig so deep?

I do think there are always two views in learning. One is top-down, which in deep learning perhaps means reading a bunch of papers, learning the concepts and seeing if you can wrap your head around them. The fast.ai class is one example. And 95% of the current AI enthusiasts are following such a path.

What’s the problem with the top-down approach? Let me go back to my first paragraph: do you really grok something when you do it top-down? I frequently can’t. In my work life, I have also heard senior people say that top-down is the way to go. Yet when I went ahead to check whether they truly understood an implementation, they frequently couldn’t give a satisfactory answer. That happens to a lot of senior technical people who later turn to management. Literally, they have lost their touch.

On the other hand, every time I pop up an editor and write an algorithm, I gain tremendous understanding! For example, I was once asked to write forward inference in C, and you had better know what you are doing when you write in C! In fact, I have come to the opinion these days that you have to implement an algorithm once before you can claim you understand it.

So how come there are two sides to this opinion? One of my speculations is that back in the 80s/90s, students were often taught to get a program right on the first writing. That created the mindset that you have to think up a perfect program before you start to write one. Of course, in ML, such a mindset is highly impractical because the ML development process is really experimental. You can’t always assume you can perfect the settings before you try something.

Another equally dangerous mindset is to say “if you are too focused on details, then you miss the big picture and won’t come up with something new!” I heard this a lot when I first did research, and it’s close to the most BS-ty thing I’ve heard. If you want to come up with something new, the first thing you should learn is all the details of existing works. The so-called “big picture” and “details” are always interconnected. That’s why in the AIDL forum, we never see young kids who say “Oh I have this brand new idea, which is completely different from all previous works!” go anywhere. That’s because you always learn how to walk before you run. And knowing the details has no downsides.

Perhaps this is my long-winded reason why Ng’s class is useful for me, even after I have read much of the literature. I distrust people who only talk about theory but don’t show any implementation.

Conclusion

This concludes my review of Course 2. Many people, after they take Course 1, just decide to take Course 2. I don’t blame them, but you always want to ask if your time is well-spent.

To me though, taking Course 2 is not just about understanding more about deep learning. It is also my hope to grok some of the seemingly simple concepts in the field. I hope that my review is useful, and I will keep you all posted when my Course 3 review is done.

Arthur

Footnotes:
[1] As Pirsig said – it’s really not about motorcycle maintenance.

[2] Strictly speaking, it is the conditional frame independence assumption. But practitioners in ASR frequently just call it the frame independence assumption.

[3] Also see HODL’s interview with Ruslan Salakhutdinov; his is a first-hand account of the rise and fall of RBM.

Categories
backpropagation deep learning deep neural network Machine Learning

Review of Ng’s deeplearning.ai Course 1: Neural Networks and Deep Learning

Credit:: Damien Kühn CC

(See my reviews on Course 2 and Course 3.)

As you all know, Prof. Ng has a new specialization on Deep Learning. I have written about the course extensively yet informally, including two “Quick Impressions” before and after I finished Course 1 to 3 of the specialization. I also wrote three posts just on Heroes of Deep Learning, including Prof. Geoffrey Hinton, Prof. Yoshua Bengio, and Prof. Pieter Abbeel and Dr. Yuanqing Lin. And Waikit and I started a study group, Coursera deeplearning.ai (C. dl-ai), focused on just the specialization. This is my full review of Course 1 after I finished watching all the videos. I will describe what the course is about, and why you want to take it. There are already a few very good reviews (from Arvind and Gautam). I will write based on my experience as the admin of AIDL, as well as a deep learning learner.

The Most Frequently Asked Question in AIDL

If you don’t know, AIDL is one of the most active Facebook groups on the matter of A.I. and deep learning. So what is the most frequently asked question (FAQ) in our group then? Well, nothing fancy:

How do I start deep learning?

In fact, we get asked that question daily and I have personally answered it more than 500 times. Eventually I decided to create an FAQ, which basically points back to “My Top-5 List”, a list of resources for beginners.

The Second Most Important Class

That brings us to the question: what should be the most important class to take? Oh well, for 90% of the learners these days, I would first recommend Andrew Ng’s “Machine Learning”, which is good for both beginners and more experienced practitioners (like me). Lucky for me, I took it around 2 years ago and have benefited from the class ever since.

But what’s next? What would be a good second class? That’s always the question on my mind. Karpathy’s cs231n comes to mind, or maybe Socher’s cs224[dn] is another choice. But they are too specialized in their subfields. E.g. if you view them from the study of general deep learning, the material in both classes on model architecture is incomplete.

Or you can think of a general class such as Hinton’s NNML. But the class confuses even PhD friends I know. Indeed, asking beginners to learn the restricted Boltzmann machine is just too much. The same can be said for Koller’s PGM. Hinton’s and Koller’s classes, to be frank, are quite advanced. It’s better to take them if you already know the basics of ML.

That narrows us to several choices which you might already consider: the first is fast.ai by Jeremy Howard, the second is the deep learning specialization from Udacity. But in my view, those classes also seem to miss something essential. E.g., fast.ai adopts a top-down approach. But that’s not how I learn. I always love to approach a technical subject from the ground up. E.g. if I want to study string search, I would want to rewrite some classic algorithms such as KMP. And for deep learning, I always think you should start with a good implementation of back-propagation.

That’s why for a long time, the Top-5 List picked cs231n and cs224d as the second and third classes. They are the best I can think of after researching ~20 DL classes. Of course, deeplearning.ai changes my belief that either cs231n or cs224d should be the best second class.

Learning Deep Learning by Program Verification

So what is so special about deeplearning.ai? Just like Andrew’s Machine Learning class, deeplearning.ai follows an approach that I would call program verification. What that means is that instead of guessing whether your algorithm is right just by staring at the code, deeplearning.ai gives you an opportunity to come up with an implementation of your own, provided that its output matches the official one.

Why is it important then? First off, let me say that not everyone believes this is the right approach. E.g. back when I started, many well-intentioned senior scientists told me that such a matching approach is not really good experimentally. Because, supposing your experiment has randomness, you should simply run your experiment N times and calculate the variance. Matching would remove this experimental aspect of your work.

So I certainly understand the point of what the scientists said. But then, in practice, it is a huge pain in the neck to verify whether your program is correct. That’s why in most of my work I adopt the matching approach. You need to learn a lot about the numerical properties of an algorithm this way. But once you follow this approach, you will also get ML tasks done efficiently.

But can you learn in another way? Nope, you have got to have some practical experience in implementation. Many people would advocate learning by just reading papers, or just by running pre-prepared programs. I always think that’s missing the point – you lose a lot of understanding if you skip an implementation.
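As a sketch of what matching looks like in practice: write your own version, feed it the same fixed inputs as a reference version, and compare the outputs up to a numerical tolerance. The two sigmoid functions and the tolerances below are hypothetical stand-ins that I picked for illustration; they are not the course's actual grader.

```python
import numpy as np

def sigmoid_reference(z):
    """Stand-in for the 'official' implementation."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_mine(z):
    """My own version, written differently: sigmoid(z) = 0.5 * (1 + tanh(z / 2))."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

rng = np.random.default_rng(0)   # fixed seed, so the inputs are the same on every run
z = rng.normal(size=(4, 3))

# Exact equality is too strict for floating point, so compare within a tolerance.
matches = np.allclose(sigmoid_mine(z), sigmoid_reference(z), rtol=1e-7, atol=1e-9)
print("matches reference:", matches)
```

Fixing the seed is what makes matching possible when there is randomness, and choosing the tolerance is where you end up learning about the numerical properties of the algorithm.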

What do you Learn in Course 1?

For the most part, implementing the feed-forward (FF) algorithm and the back-propagation (BP) algorithm from scratch. Since most of us are just using frameworks such as TF or Keras, such from-scratch implementation experience is invaluable. The nice thing about the class is that the mathematical formulation of BP is fine-tuned so that it is suitable for implementing in Python numpy, the course’s designated language.
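To give a flavor of it, here is a compact sketch of FF and BP for a network with one hidden layer, checked against numerical gradients. This is my own minimal version, not the course's notebook code: I assume one example per column of X, a tanh hidden layer, a sigmoid output and the cross-entropy cost.

```python
import numpy as np

def forward(X, params):
    """Feed-forward pass; X has shape (n_x, m), one example per column."""
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return A2, (X, A1, A2)

def cost(A2, Y):
    """Average cross-entropy over the m examples."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(Y, params, cache):
    """Back-propagation: gradients of the cost w.r.t. W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    X, A1, A2 = cache
    m = Y.shape[1]
    dZ2 = A2 - Y                          # sigmoid + cross-entropy simplifies to this
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # derivative of tanh is 1 - tanh**2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def numerical_grad(X, Y, params, idx, eps=1e-6):
    """Centered finite difference of the cost w.r.t. every entry of params[idx]."""
    P = params[idx]
    g = np.zeros_like(P)
    for i in np.ndindex(P.shape):
        old = P[i]
        P[i] = old + eps
        c_plus = cost(forward(X, params)[0], Y)
        P[i] = old - eps
        c_minus = cost(forward(X, params)[0], Y)
        P[i] = old                        # restore the original value
        g[i] = (c_plus - c_minus) / (2 * eps)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 5))
Y = (rng.random(size=(1, 5)) > 0.5).astype(float)
params = [rng.normal(size=(4, 3)) * 0.5, np.zeros((4, 1)),
          rng.normal(size=(1, 4)) * 0.5, np.zeros((1, 1))]

A2, cache = forward(X, params)
grads = backward(Y, params, cache)
max_err = max(np.max(np.abs(g - numerical_grad(X, Y, params, i)))
              for i, g in enumerate(grads))
print("largest gap between BP and numerical gradients:", max_err)
```

The gradient check at the end is the same matching idea again: if BP is right, the analytic and the numerical gradients agree to many digits.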

Wow, Implementing Back Propagation from scratch?  Wouldn’t it be very difficult?

Not really. In fact, many members finish the class in less than a week. So the key here: when many of us call it a from-scratch implementation, in fact it is highly guided. All the tough matrix differentiation is done for you. There are also strong hints on what numpy functions you should use. At least for me, the homework is very simple. (Also see Footnote [1])

Do you need to take Ng’s “Machine Learning” before you take this class?

That’s preferable but not mandatory. Although without knowing the more classical view of ML, you won’t be able to understand some of the ideas in the class, e.g. the difference in how bias and variance are viewed. In general, all good-old machine learning (GOML) techniques are still used in practice. Learning them doesn’t seem to have any downsides.

You may also notice that both “Machine Learning” and deeplearning.ai cover neural networks. So will the material be duplicated? Not really. deeplearning.ai guides you through the implementation of multi-layer deep neural networks, which IMO requires a more careful and consistent formulation than a simple network with one hidden layer. So doing both won’t hurt, and in fact it’s likely that you will have to implement a certain method multiple times in your life anyway.

Wouldn’t this class be too Simple for Me?

So another question you might ask: if the class is so simple, does it even make sense to take it? The answer is a resounding yes. I am quite experienced in deep learning (~4 years by now) and I have been learning machine learning since college. I still found the course very useful, because it offers many useful insights which only industry experts know. And of course, when a luminary such as Andrew speaks, you do want to listen.

In my case, I also wanted to take the course so that I can write reviews about it and my colleagues at Voci can ask me questions. But even with that in mind, I still learned several new things through listening to Andrew.

Conclusion

That’s what I have so far. Follow us on Facebook AIDL; I will post reviews of the later courses in the future.

Arthur

[1] So what is a true from-scratch implementation? Perhaps one where you write everything in C, even the matrix manipulation part?

If you like this message, subscribe to the Grand Janitor Blog’s RSS feed. You can also find me (Arthur) at Twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

History:
Nov 29, 2017: revised the text once. Mostly rewriting the clunky parts.
Oct 16, 2017: fixed typos and misc. changes.
Oct 14, 2017: first published

 

Categories
attention deep learning deep neural network neural machine translation recurrent neural network

Some Useful Links on Neural Machine Translation

Some good resources for NNMT

Tutorial:

a bit special: Tensor2Tensor uses a novel architecture instead of pure RNN/CNN decoder/encoder.   It gives a surprisingly large amount of gain.  So it’s likely that it will become a trend in NNMT in the future.

Important papers:

  • Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation by Cho et al. (link) – Very innovative and smart paper by Kyunghyun Cho. It also introduces GRU.
  • Sequence to Sequence Learning with Neural Networks by Ilya Sutskever (link) – By Google’s researchers, and perhaps the first to show that an NMT system is comparable to the traditional pipeline.
  • Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (link)
  • Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau (link) – The paper which introduces attention.
  • Neural Machine Translation by Minh-Thang Luong (link)
  • Effective Approaches to Attention-based Neural Machine Translation by Minh-Thang Luong (link) – On how to improve the attention approach based on local attention.
  • Massive Exploration of Neural Machine Translation Architectures by Britz et al (link)
  • Recurrent Convolutional Neural Networks for Discourse Compositionality by Kalchbrenner and Blunsom (link)

Important Blog Posts/Web page:

Summarization:

Usage in Dialogue System:

Others: (Unsorted, and seems less important)

Categories
deep learning deep neural network

Quick Impression on deeplearning.ai’s “Heroes of Deep Learning” with Prof. Yoshua Bengio

Quick Impression on deeplearning.ai’s “Heroes of Deep Learning”. This time is the interview of Prof. Yoshua Bengio. As always, don’t post any copyrighted material here at the forum!

* Out of the ‘Canadian Mafia’, Prof. Bengio is perhaps the least known of the three. Prof. Hinton and Prof. LeCun have their own courses, and as you know they work for Google and Facebook respectively. Whereas Prof. Bengio does work for MS, but the role is more that of a consultant.

* You may know him as one of the coauthors of the book “Deep Learning”. But then again, who really understands that book, especially Part III?

* Whereas Prof. Hinton strikes me as an eccentric polymath, Prof. Bengio is more a conventional scholar. He was influenced by Hinton in his early study of AI, which was mostly expert-system based.

* That explains why everyone seems to leave his interview out, even though I found it very interesting.

* He named several of his group’s contributions, and most of what he named were fundamental results: Glorot and Bengio 2010 on what is now widely called Xavier initialization, attention in machine translation, his early work on language modeling using neural networks, and of course the GAN from Goodfellow. All are rather technical results. But once you think about these ideas, they are about understanding, rather than trying to beat the current records.

* Then he said a few things about early deep learning research which surprised me. First is on depth. As it turns out, the benefit of depth was not as clear in the early 2000s. That’s why when I graduated from my Master’s (2003), I had never heard of the revival of neural networks.

* And then there was the doubt on using ReLU, which is the present-day staple of convnets. But the reason makes so much sense: ReLU is not smooth at all points of R, so would that cause a problem? Anyone who knows some calculus would rationally have such a doubt.

* His idea on learning deep learning is also quite on point – he believes you can learn DL in 5-6 months if you have the right training, i.e. a good computer science and math education. Then you can just pick up DL by taking courses and reading proceedings from ICML.

* Finally, there is his current research on the fusion of neural networks and neuroscience. I found this part fascinating. Is backprop really used in the brain as well?
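
On the ReLU point above, a small numpy sketch shows where the doubt comes from. The function names and the choice of returning 0 as the derivative at z = 0 are my own conventions for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; at z == 0 it is undefined, and this sketch simply returns 0 there."""
    return (z > 0).astype(float)

h = 1e-6
slope_left = (relu(0.0) - relu(-h)) / h    # slope approaching 0 from the left
slope_right = (relu(h) - relu(0.0)) / h    # slope approaching 0 from the right
print(slope_left, slope_right)
print(relu_grad(np.array([-1.0, 0.0, 2.0])))
```

The two one-sided slopes disagree at exactly one point, z = 0. Everywhere else the derivative is well defined, so an implementation just picks a value for that single point and moves on.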

That’s what I have. Hope you enjoy!

Categories
deep learning deep neural network

Quick Impression on deeplearning.ai (After Finishing Coursework)

Following experienced guys like Arvind Nagaraj and Gautam Karmakar, I just finished all the coursework for deeplearning.ai. I haven’t finished all the videos yet. But it’s a good idea to write another “impression” post.

* It took me about 10 days of clock time to finish all the coursework. The actual work only took me around 5-6 hours. I guess my experience speaks for many veteran members at AIDL.
* Python numpy has its quirks. But if you know R or Matlab/Octave, you are good to go.
* The assignments of Course 1 guide you to build an NN “from scratch”. Those of Course 2 guide you to implement several useful initialization/regularization/optimization algorithms. They are quite cute – you mostly just fill in the right code in Python numpy.
* I quoted “from scratch” because you actually don’t need to write your own matrix routines. So this “from scratch” is quite different from that of people who try to write an NN package “from scratch using C”, in which case you probably need to write a bit of code for matrix manipulation, and derive a set of formulae for your codebase. So Ng’s course gives you a taste of what these programs feel like. In that regard, perhaps the next best thing is Michael Nielsen’s NNDL book.
* Course 3 is quiz-only. So it is by far the easiest to finish. Just like Arvind and Gautam, I think it is the most intriguing course within the series (so far), because it gives you a lot of big-picture advice on how to improve an ML system. Some of this advice is new to me.

Anyway, that’s what I have. Once I watch all the videos, I will also come up with a full review. Before that, go check out our study group “Coursera deeplearning.ai”.

Thanks,
Arthur Chan​

https://www.facebook.com/groups/DeepLearningAISpecialization/

Categories
artificial intelligence deep learning deep neural network DNN Geoff Hinton

Quick Impression on deeplearning.ai Heroes of Deep Learning – Geoffrey Hinton

So I was going through deeplearning.ai. You know we started a new FB group on it? We haven’t made it public yet, but yes, we are very excited.
 
Now one thing you might notice about the class is that there are these optional lectures in which Andrew Ng interviews luminaries of deep learning. Those lectures, in my view, are very different from the course lectures. Most of the topics mentioned are research topics, and beginners would find them very perplexing. So I think these lectures deserve separate sets of notes. I still call it a “quick impression” because usually I will do around 1-2 layers of literature search before I’d say I grok a video.
 
* Sorry I couldn’t post the video because it is copyrighted by Coursera, but it should be very easy for you to find it. Of course, respect our forum rules and don’t post the video here.
 
* This is a very interesting 40-min interview of Prof. Geoffrey Hinton. Perhaps it should also be seen as optional material after you finish his class NNML on Coursera.

* The interview is at research level. So that means you would understand more if you took NNML or read part of Part III of the book “Deep Learning”.

* There is some material you may have heard from Prof. Hinton before, including how he became a NN/brain researcher, how he came up with backprop and why he was not the first one to come up with it.

* There is also some which is new to me, like why his and Rumelhart’s paper was so influential. Oh, it has to do with his first experience on marriage relationship (Lecture 2 of NNML).
 
* The role of Prof. Ng in the interview is quite interesting. Andrew is also a giant in deep learning, but Prof. Hinton is more the founder of the field. So you can see that Prof. Ng was trying to understand several of Prof. Hinton’s thoughts, such as 1) Does back-propagation appear in the brain? 2) The idea of capsules, which are a distributed representation of a feature vector, and allow a kind of what Hinton called “agreement”. 3) Unsupervised learning such as VAE.
 
* On Prof. Hinton’s favorite ideas, and not to my surprise: 1) Boltzmann machine, 2) Stacking RBM to SBN, 3) variational methods. I frankly don’t fully understand Pt. 3. But then L10 to L14 of NNML are all about Pt. 1 and 2. Unfortunately, not everyone loves to talk about Boltzmann machines – they are not as hot as GAN, and are perceived as not useful at all. But if you want to understand the origin of deep learning, and one way to pre-train your DNN, you should take NNML.
 
* Prof. Hinton’s advice on research is also very entertaining – he suggests you don’t always read up on the literature first – which according to him is good for creative researchers.

* The part I like most is Prof. Hinton’s view of why computer science departments are not catching up on teaching deep learning. As always, his words are penetrating. He said, “And there’s a huge sea change going on, basically because our relationship to computers has changed. Instead of programming them, we now show them, and they figure it out.”

* Indeed, when I first started out at work, thinking as an MLer was not regarded as cool – programming was cool. But things are changing. And we at AIDL are embracing the change.
 
Enjoy!
 
Arthur Chan
Categories
deep learning deep neural network DNN

Quick Impression on deeplearning.ai

(Also see my full review of Course 1 and Course 2 here.)

Fellows, as you all know by now, Prof. Andrew Ng has started a new Coursera Specialization on Deep Learning. So many of you came to me today and asked for my take on the class. As a rule, I usually don’t comment on a class unless I know something about it. (Search for my “Learning Deep Learning – Top 5 Lists” for more details.) But I’d like to make an exception for the Good Professor’s class.

 
So here is my quick take after browsing through the specialization curriculum:
 
* Only Course 1 to 3 are published now. They are short classes, more like 2-4 weeks. It feels like the Data Science Specialization, so it feels good for beginners. Assume that Course 4 and 5 are long: 4 weeks. So we are talking about 17 weeks of study.

* Unlike Ng’s standard ML class, Python is the default language. That’s good in my view because close to 80-90% of practitioners are using Python-based frameworks.

* Course 1-3 have around 3 weeks of curriculum overlapping with “Intro to Machine Learning” Lecture 2-3. Course 1’s goal seems to be implementing an NN from scratch. Course 2 is on regularization. Course 3 is on different methodologies of deep learning and it’s short, only 2 weeks long.

* Course 4 and 5 are about CNN and RNN.

* So my general impression here is that it is more a comprehensive class, comparable with Hugo Larochelle’s lectures, as well as Hinton’s lectures. Yet the latter two classes are known to be more difficult. Hinton’s class in particular is known to confuse even PhDs. So that shows one of the values of this new DL class: it is a great transition from “Intro to ML” to more difficult classes such as Hinton’s.

* But how does it compare with other similar courses such as Udacity’s DL nanodegree then? I am not sure yet, but the price seems to be more reasonable if you go through the Coursera route. Assuming we are talking about 5 months of study, you are paying $245.

* I also found that many existing beginner classes put too much emphasis on running scripts, but avoid linking more fundamental concepts such as bias/variance with DL, or going deep to describe models such as Convnet and RNN. cs231n does a good job on Convnet, and cs224n teaches you RNN. But they seem to be more difficult than Ng’s or Udacity’s classes. So again, Ng’s class sounds like a great transition class.

* My current take: 1) I am going to take the class myself. 2) It’s very likely this new deeplearning.ai class will change my class recommendations on the Top-5 list.
 
Hope this is helpful for all of you.
 
Arthur Chan
 
Categories
deep learning deep neural network DNN Machine Learning Natural Language Processing

What is the Difference between Deep Learning and Machine Learning?

AIDL member Bob Akili asked (rephrased):

What is the Difference between Deep Learning and Machine Learning?

Usually I don’t write a full blog message to answer members’ questions. But what is “deep” is such a fundamental concept in deep learning, yet there are many well-meaning but incorrect answers floating around. So I think it is a great idea to answer the question clearly and hopefully dispel some of the misconceptions as well. Here is a cleaned up and expanded version of my comment to the thread.

Deep Learning is Just a Subset of Machine Learning

First of all, deep learning is just a subset of the techniques of machine learning. You may have heard from many “Deep Learning Consultants”-types: “deep learning is completely different from Machine Learning”. But then when we are talking about “deep learning” these days, we are really talking about “neural networks which have more than one layer”. Since the neural network is just one type of ML technique, it doesn’t make any sense to call DL “different” from ML. It might work for marketing purposes, but the thought is clearly misleading.

Deep Learning is a kind of Representation Learning

So now we know that deep learning is a kind of machine learning. We still can’t quite answer why it is special. So let’s be more specific: deep learning is a kind of representation learning. What is representation learning? Representation learning is the opposite of another school of thought/practice: feature engineering. In feature engineering, humans are supposed to hand-craft features to make the machine work better. If you have done Kaggle before, this should be obvious to you: sometimes you just want to manipulate the raw inputs and create new features to represent your data.

Yet in some domains which involve high-dimensional data such as images, speech or text, hand-crafting features was found to be very difficult. E.g. using HOG-type approaches to do computer vision usually takes 4-5 years of a PhD student’s time. So here we come back to representation learning – can a computer automatically learn good features?

What is a “Deep” Technique?

Now we come to the part on why deep learning is “deep”. Usually we call a method “deep” when we are optimizing a nested function in the method. So for example, if you express such a function as a graph, you would find that it has multiple layers. The term “deep” really describes such “nestedness”. That should explain why we typically call any artificial neural network (ANN) with more than 1 hidden layer “deep”. Hence the general saying, “deep learning is just neural networks with more layers”.
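
The “nestedness” can be written down directly. In the sketch below, the helper names `layer` and `compose`, the layer sizes and the ReLU nonlinearity are my own choices for illustration; the point is only that a network with several layers is the nested function f3(f2(f1(x))).

```python
import numpy as np
from functools import reduce

def layer(W, b):
    """One layer as a function: an affine map followed by a ReLU."""
    return lambda x: np.maximum(0.0, W @ x + b)

def compose(*fs):
    """compose(f1, f2, f3)(x) computes f3(f2(f1(x)))."""
    return lambda x: reduce(lambda acc, f: f(acc), fs, x)

rng = np.random.default_rng(0)
f1 = layer(rng.normal(size=(4, 3)), np.zeros(4))
f2 = layer(rng.normal(size=(4, 4)), np.zeros(4))
f3 = layer(rng.normal(size=(2, 4)), np.zeros(2))

deep = compose(f1, f2, f3)   # three nested functions, i.e. three layers
x = rng.normal(size=3)
print(deep(x))
```

Training then means optimizing the parameters of every function in the nest at once, which is what back-propagation makes tractable.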

(Another appropriate term is “hierarchical”. See footnote [4] for more detail.)

This is also the moment Karpathy in cs231n will show you the multi-layer CNN, in which features are automatically learned from the simplest to more complex ones. Eventually your last layer can just differentiate them using a linear classifier, as there is a “deep” structure that learns the right features (last layer). Note the key term here is “automatic”: all these Gabor-filter-like features are not hand-made. Rather, they are results of back-propagation [3].

Is there Anything which is “Deep” but not a Neural Network?

Actually, there are plenty: deep Boltzmann machine? deep belief network? deep Gaussian process? They are still discussed in unsupervised learning using neural networks, but I always found that knowledge of graphical models is more important for understanding them.

So is Deep Learning also a Marketing Term?

Yes and no. It depends on who you talk to. If you talk with ANN researchers/practitioners, they would just tell you “deep learning is just a neural network which has more than 1 hidden layer”. Indeed, if you think from their perspective, the term “deep learning” could just be a short-form. Yet as we just said, you can also call other methods “deep”. So the adjective is not totally void of meaning. But many people would also tell you that because “deep learning” has become such a marketing term, it can now mean many different things. I will say more in the next section.

Also, the term “deep learning” has been around for decades.  Check out Prof. Schmidhuber’s thread for more details.

“No Way! X is not Deep but it is also taught in Deep Learning Class, You made a Horrible Mistake!”

I said all this with much authority, and I know some of you will jump in and argue:

“What about word2vec? It is nothing deep at all, but people still call it deep learning!!!”  “What about all the wide architectures, such as ‘wide-deep learning’?”  “Arthur, you are making a HORRIBLE MISTAKE!”

Indeed, the term “deep learning” is being abused these days.   More learned people, on the other hand, are usually careful about calling certain techniques “deep learning”.  For example, in the cs224d 2015/2016 lectures, Dr. Richard Socher was quite cautious about calling word2vec “deep”.  His supervisor, Prof. Chris Manning, who is an authority in NLP, is known to dispute whether deep learning is always useful in NLP, simply because some recent advances in NLP are not really due to deep learning [1][2].

I think these cautions make sense.  Part of it is that calling everything “deep learning” just blurs what really should be credited for a certain technical improvement.  The other part is that we shouldn’t see deep learning as the only type of ML we want to study.  There are many ML techniques, and some of them are more interesting and more practical than deep learning.  For example, deep learning is not known to work well in small-data scenarios.  Would I just yell at my boss and say “Because I can’t use deep learning, I can’t solve this problem!”?  No, I would just test out random forests, support vector machines, GMMs and all the nifty methods I have learned over the years.

Misleading Claim About Deep Learning (I) – “Deep Learning is about Machine Learning Methods which use a lot of Data!”

So now we come to the arena of misconceptions.  I am going to discuss two claims about deep learning which many people have been drumming up.   Neither of them is the right answer to the question “What is the Difference between Deep Learning and Machine Learning?”

The first one you have probably heard all the time: “Deep Learning is about ML methods which use a lot of data”.   Or people would tell you “Oh, deep learning just uses a lot of data, right?”  This sounds about right: deep learning these days does use a lot of data.  So what’s wrong with the statement?

Here is the answer: while deep learning does use a lot of data, other techniques used tons of data before deep learning too!  E.g. speech recognition before deep learning, i.e. HMM+GMM, could use up to 10k hours of speech.  The same goes for SMT.  And you can do SVM+HOG on ImageNet.  More data is always better for those techniques as well.  So if you say “deep learning uses more data”, you forget that the older techniques can also use more data.

What you can claim is that “deep learning is a more effective way to utilize data”.  That’s very true, because once you get into either GMM or SVM, you run into scalability issues.  GMM scales badly when the amount of data is around 10k hours.  SVM (with an RBF kernel in particular) is super tough/slow to use when you have ~1 million data points.
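One rough way to see the SVM scalability problem is back-of-the-envelope arithmetic on the kernel (Gram) matrix, which has one entry per pair of training points. The sketch below assumes a dense matrix of 8-byte floats; the function name is made up, and real solvers such as LIBSVM cache parts of the matrix instead of storing all of it, so treat this as an illustration of the quadratic growth rather than a statement about any particular implementation.

```python
def kernel_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory needed for a dense n x n kernel matrix, in bytes."""
    return n_points * n_points * bytes_per_entry

for n in (10_000, 100_000, 1_000_000):
    gib = kernel_matrix_bytes(n) / 2**30
    print(f"{n:>9,d} points -> {gib:,.1f} GiB")
```

At 1 million points the dense matrix would take 8 * 10^12 bytes, i.e. about 8 TB, which is why a kernel SVM at that scale is painful.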

Misleading Claim About Deep Learning (II) – “Deep Learning is About Using GPU and Having Data Center!”

This particular claim is different from the previous “Data Requirement” claim, but we can debunk it in a similar manner.   Why is it wrong?  Again, before deep learning, people already used GPUs to do machine learning.  For example, you can use a GPU to speed up GMM.   Before deep learning was hot, you needed a cluster of machines to train an acoustic model or language model for speech recognition.  You also needed tons of RAM to train a language model for SMT.   So calling GPU/Data Center/RAM/ASIC/FPGA a differentiator of deep learning is just misleading.

You can say, though, that “Deep learning has changed the computational model from a distributed network model to a more single-machine-centric paradigm (in which each machine has one GPU).  But later approaches also tried to combine CPU and GPU processing”.

Conclusion and “What you say is Just Your Opinion! My Theory makes Equal Sense!”

Indeed, you should always take what you read on-line with a grain of salt.   Being critical is a good thing, and having your own opinion is good.  But you should also try to avoid equivocating on an issue.  Meaning: sometimes an issue has only one side, but you insist there are two equally valid answers.   If you do so, you are perhaps making a logical error in your thinking.   And a lot of people who make claims such as “deep learning is learning which uses more data and a lot of GPUs” are probably making such thinking errors.

That said, I suggest you read several good sources to judge my answer.  They are:

  1. Chapter 1 of Deep Learning.
  2. Shakir’s Machine Learning Blog on a Statistical View of Deep Learning.  In particular, part VI, “What is Deep?”
  3. Tombone’s post on Deep Learning vs Machine Learning vs Pattern Recognition

In any case, I hope that this article helps you.  I thank Bob for asking the question.  Armaghan Rumi Naik debunked many misconceptions in the original thread; his understanding of machine learning is clearly above mine, and he was able to point out mistakes from other commenters.  The thread is worth your reading time.

Footnotes

[1] See “Last Words: Computational Linguistics and Deep Learning”.
[2] Generally, whether DL is useful in NLP is a widely disputed topic. Take a look at Yoav Goldberg’s view on some recent GAN results on language generation. AIDL Weekly #18 also gave an exposé on the issue.
[3] Perhaps another useful term is “hierarchical”.  In the case of ConvNet the term is right on.  As Eric Heitzman comments at AIDL:
“(deep structure) They are *not* necessarily recursive, but they *are* necessarily hierarchical since layers always form a hierarchical structure.”  After Eric’s comment, I think both “deep” and “hierarchical” are fair terms to describe methods in “deep learning”. (Of course, “hierarchical learning” is a much poorer marketing term.)
[4] In an earlier draft, I used the term “recursive” to describe the term “deep”, which, as Eric Heitzman at AIDL pointed out, is not entirely appropriate.  “Recursive” gives people the feeling that the function is self-recursive, or $latex f(f( \ldots f(f(*))))$, but actual functions are more “nested”, like $latex f_1(f_2( \ldots f_{n-1}(f_n(*))))$. As a result, I removed the term “recursive” and just call the function a “nested function”.
Of course, you should be aware that my description is not too mathematically rigorous either. (I guess it is a fair wordy description though.)

History:
20170709 at 6: fix some typos.

20170711: fix more typos.

20170711 at 7:05 p.m.: I got feedback from Eric Heitzman, who pointed out that the term “recursive” can be deceiving.  Thus I wrote footnote [4].

If you like this message, subscribe to the Grand Janitor Blog’s RSS feed. You can also find me (Arthur) at Twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum.  Also check out my awesome employer: Voci.


A Review of Hinton’s Coursera “Neural Networks for Machine Learning”

[Figure: Cajal’s drawing of chick cerebellum cells, from Estructura de los centros nerviosos de las aves, Madrid, 1905]

For me, finishing Hinton’s deep learning class, Neural Networks for Machine Learning (NNML), was a long overdue task.  As you may know, the class was first launched back in 2012.  I was not so convinced by deep learning back then.  Of course, my mind changed at around 2013, but by then the class was archived.  It was not until 2 years later that I decided to take Andrew Ng’s class on ML, and finally I was able to go through Hinton’s class once.  But only last October, when the class relaunched, did I decide to take it again, i.e. watch all the videos a second time, finish all the homework and get passing grades for the course.  As you will see from my journey, this class is hard.  Some videos I watched 4-5 times before grokking what Hinton said.  Some assignments made me take long walks to think them through.  Finally I made it through all 20 assignments, and even bought a certificate for bragging rights.  It was a refreshing, thought-provoking and satisfying experience.

So this piece is my review of the class: why you should take it, and when.  I also discuss one question which has been floating around forums from time to time: given all the deep learning classes available now, is Hinton’s class outdated?   Or is it still the best beginner class?  I will chime in on the issue at the end of this review.

The Old Format Is Tough

I admire people who could finish this class in Coursera’s old format.  NNML is well known to be much harder than Andrew Ng’s Machine Learning, as multiple reviews have said (here, here).  Many of my friends who have PhDs cannot quite follow what Hinton said in the last half of the class.

No wonder: when Karpathy reviewed it in 2013, he noted that there was an influx of non-MLers working on the course.  For newcomers, it must have been bewildering to try to understand topics such as energy-based models, which many people have a hard time following.   Or what about the deep belief network (DBN), which people these days still mix up with the deep neural network (DNN)?  And quite frankly, I still don’t grok some of the proofs in lecture 15 after going through the course, because deep belief networks are difficult material.

The old format only allowed 3 attempts per quiz, with tight deadlines, and you only had one chance to finish the course.  One homework requires deriving the matrix form of backprop from scratch.  All of this makes the class unsuitable for busy individuals (like me).  It is more for second- to third-year graduate students, or even experienced practitioners who have plenty of time (but who does?).
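For readers who have not seen what “the matrix form of backprop” looks like, here is a minimal NumPy sketch. It is not the course’s assignment: the network (one tanh hidden layer, a linear output, squared-error loss), the function names and the sizes are all my own assumptions. The finite-difference check at the end is the usual way to convince yourself a derivation is right.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)      # hidden activations, shape (n, h)
    Y_hat = H @ W2 + b2           # linear output layer, shape (n, k)
    return H, Y_hat

def loss(params, X, Y):
    _, Y_hat = forward(params, X)
    return 0.5 * np.sum((Y_hat - Y) ** 2) / X.shape[0]

def backward(params, X, Y):
    """Gradients of the loss w.r.t. (W1, b1, W2, b2), in matrix form."""
    W1, b1, W2, b2 = params
    H, Y_hat = forward(params, X)
    dY = (Y_hat - Y) / X.shape[0]   # dL/dY_hat
    dW2 = H.T @ dY
    db2 = dY.sum(axis=0)
    dH = dY @ W2.T
    dZ = dH * (1.0 - H ** 2)        # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dZ
    db1 = dZ.sum(axis=0)
    return dW1, db1, dW2, db2

def numerical_grad(params, X, Y, index, eps=1e-6):
    """Central finite-difference gradient for params[index]."""
    P = params[index]
    grad = np.zeros_like(P)
    for i in np.ndindex(P.shape):
        old = P[i]
        P[i] = old + eps
        up = loss(params, X, Y)
        P[i] = old - eps
        down = loss(params, X, Y)
        P[i] = old                  # restore the original value
        grad[i] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
params = [rng.normal(size=(3, 4)), np.zeros(4),
          rng.normal(size=(4, 2)), np.zeros(2)]
analytic = backward(params, X, Y)
max_err = max(np.max(np.abs(a - numerical_grad(params, X, Y, i)))
              for i, a in enumerate(analytic))
print("largest gradient mismatch:", max_err)
```

If the printed mismatch is tiny, the matrix expressions in `backward` agree with the numerical derivative of the loss.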

The New Format Is Easier, but Still Challenging

I took the class last October, when Coursera had changed most classes to the new format, which allows students to re-take.  [1]  It strips out some of the difficulty, but it’s more suitable for busy people.   That doesn’t mean you can go easy on the class: for the most part, you still need to review the lectures, work out the Math, draft pseudocode, etc.   The homework which requires you to derive backprop is still there.  The upside: you can still have all the fun of deep learning. 🙂 The downside: you shouldn’t expect to get through the class without spending 10-15 hours/week.

Why the Class is Challenging –  I: The Math

Unlike Ng’s class and cs231n, NNML is not easy for beginners without a background in calculus.   The Math is not too difficult: mostly differentiation with the chain rule, intuition about what a Hessian is, and more importantly, vector differentiation.  But if you have never learned it, the class will be over your head.  Take at least Calculus I and II before you join, and know some basic equations from the Matrix Cookbook.
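As a taste of the kind of vector differentiation meant here, take the quadratic form f(x) = x^T A x. Its gradient is (A + A^T) x and its Hessian is A + A^T, a standard identity of the sort collected in the Matrix Cookbook. The NumPy sketch below, with made-up function names and a random A, checks both against finite differences; it is only an illustration, not material from the course.

```python
import numpy as np

def f(A, x):
    return x @ A @ x              # the quadratic form x^T A x

def grad_f(A, x):
    return (A + A.T) @ x          # d/dx (x^T A x) = (A + A^T) x

def hessian_f(A):
    return A + A.T                # constant for a quadratic form

def numerical_grad(func, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

num_g = numerical_grad(lambda v: f(A, v), x)
# Row i of the Hessian is the gradient of the i-th gradient component.
num_H = np.array([numerical_grad(lambda v, i=i: grad_f(A, v)[i], x)
                  for i in range(3)])

grad_err = np.max(np.abs(num_g - grad_f(A, x)))
hess_err = np.max(np.abs(num_H - hessian_f(A)))
print(grad_err, hess_err)
```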

Why the Class is Challenging – II:  Energy-based Models

Another reason why the class is difficult is that the last half of the class is all based on so-called energy-based models, i.e. models such as the Hopfield network (HopfieldNet), the Boltzmann machine (BM) and the restricted Boltzmann machine (RBM).  Even if you are used to the math of supervised learning methods such as linear regression, logistic regression or even backprop, the Math of RBM can still throw you off.   No wonder: many of these models have physical origins, such as the Ising model.  Deep learning research also frequently uses ideas from Bayesian networks, such as explaining away.  If you have no basic background in either physics or Bayesian networks, you will feel quite confused.
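To give a flavor of what “energy-based” means, here is a minimal sketch of a Hopfield network in NumPy. It assumes the textbook setup: units in {-1, +1}, Hebbian weights with a zero diagonal, no bias terms, and energy E(s) = -1/2 s^T W s. The function names and the single 8-unit pattern are made up for illustration. The thing to watch is that each asynchronous update never increases the energy, so the corrupted state slides down into the stored pattern.

```python
import numpy as np

def energy(W, s):
    """Hopfield energy E(s) = -1/2 * s^T W s (no bias terms)."""
    return -0.5 * s @ W @ s

def store(patterns):
    """Hebbian weights: sum of outer products, with a zero diagonal."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, s, sweeps=5):
    """Asynchronous updates; returns the final state and the energy trace."""
    s = np.array(s, dtype=float)
    trace = [energy(W, s)]
    for _ in range(sweeps):
        for i in range(len(s)):
            # Align unit i with the sign of its total input.
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
            trace.append(energy(W, s))
    return s, trace

pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=float)
W = store([pattern])
noisy = pattern.copy()
noisy[[0, 3]] *= -1               # corrupt two of the eight units
restored, trace = recall(W, noisy)
print(restored, trace[0], trace[-1])
```

Boltzmann machines and RBMs build on the same energy function but make the updates stochastic, which is where the math gets harder.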

In my case, I spent quite some time Googling and reading through the relevant literature, which powered me through some of the quizzes.  But I don’t pretend I understand those topics, because they can be deep and unintuitive.

Why the Class is Challenging – III: Recurrent Neural Network

If you learn RNN these days, it is probably from Socher’s cs224d or by reading Mikolov’s thesis.  LSTM would easily be your only thought on how to resolve exploding/vanishing gradients in RNN.  Of course, there are other ways: echo state networks (ESN) and Hessian-free methods.  They are seldom talked about these days.   Again, their formulation is quite different from your standard methods such as backprop and gradient descent.  But learning them gives you breadth, and makes you think about whether the status quo is the right thing to do.
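If the exploding/vanishing gradient problem itself feels abstract, the following NumPy sketch may help. It makes a strong simplifying assumption: a purely linear recurrence h_t = W h_{t-1}, with W chosen as a scaled rotation matrix so the effect is easy to see. Real RNNs have nonlinearities, and the function name and numbers are made up, but the mechanism is the same: backprop through time multiplies the gradient by W^T once per time step.

```python
import numpy as np

def backprop_norms(W, steps):
    """Norm of a gradient vector pushed back through `steps` time steps
    of the linear recurrence h_t = W h_{t-1}."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])   # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g               # one step of backprop through time
        norms.append(np.linalg.norm(g))
    return norms

theta = 0.3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

shrinking = backprop_norms(0.5 * rotation, steps=20)   # vanishing case
growing = backprop_norms(1.5 * rotation, steps=20)     # exploding case
print(shrinking[-1], growing[-1])
```

With a scale of 0.5 the gradient norm shrinks by half at every step, and with 1.5 it grows by half, so after only 20 steps the two differ by many orders of magnitude.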

But is it Good?

You bet! Let me qualify that statement in the next section.

Why is it good?

Suppose you just want to use some of the fancier tools in ML/DL.  I guess you can just go through Andrew Ng’s class, test out a bunch of implementations, then call yourself an expert.  That’s what many people do these days.  In fact, Ng’s Coursera class is designed to give you a taste of ML, and indeed, you should be able to wield many ML tools after the course.

That said, you should realize your understanding of ML/DL is still … rather shallow.  Maybe you are thinking “Oh, I have a bunch of data, let’s throw it into Algorithm X!” or “Oh, we just want to use XGBoost, right? It always gives you the best results!”   You should realize that performance numbers aren’t everything.  It’s important to understand what’s going on with your model.   You can easily make costly, short-sighted and ill-informed decisions when you lack understanding.  It has happened to many of my peers, to me, and sadly even to some of my mentors.

Don’t make the mistake!  Always seek better understanding!  Try to grok.  If you only do Ng’s neural network assignment, by now you would still wonder how it can be applied to other tasks.   Go for Hinton’s class, feel perplexed by what the Prof said, and iterate.  Then you will start to build up a better understanding of deep learning.

Another, more technical note: if you want to learn deep unsupervised learning, I think this should be the first course as well.   Prof. Hinton teaches you the intuition behind many of these machines, and you will also have a chance to implement them.   For models such as the Hopfield net and RBM, it’s quite doable if you know basic Octave programming.

So it’s good, but is it outdated?

Learners these days are perhaps luckier: they have plenty of choices for learning a deep topic such as deep learning.   Just check out my own “Top 5 List”.   cs231n, cs224d and even Silver’s class are great contenders to be the second class.

But I still recommend NNML.  There are four reasons:

  1. It is deeper and tougher than other classes.  As I explained before, NNML is tough, not exactly mathematically (the Math in Socher’s and Silver’s classes is also non-trivial), but conceptually.  Energy-based models and the different ways to train RNN are some of the examples.
  2. Many concepts in ML/DL can be seen in different ways.  For example, bias/variance is a trade-off for frequentists, but it’s seen as a “frequentist illusion” by Bayesians.    The same can be said about concepts such as backprop and gradient descent.  Once you think about them, they are tough concepts.    So one reason to take a class is not just to be taught a concept, but to be allowed to look at things from a different perspective.  In that sense, NNML fits the bill perfectly.  I found myself thinking about Hinton’s statements during many long promenades.
  3. Hinton’s perspective – Prof. Hinton has been mostly on the losing side of ML during the last 30 years.   But he persisted.  From his lectures, you get a feeling of how/why he started a certain line of research, and perhaps ultimately how you would research something yourself in the future.
  4. Prof. Hinton’s delivery is humorous.   Check out his view in Lecture 10 about why physicists worked on neural networks in the early 80s.  (Note: he was a physicist before working on neural networks.)

Conclusion and What’s Next?

All in all, Prof. Hinton’s “Neural Networks for Machine Learning” is a must-take class.  All of us, beginners and experts included, will benefit from the professor’s perspective and the breadth of the subject.

I do recommend that you first take Ng’s class if you are an absolute beginner, and perhaps some Calculus I or II, plus some Linear Algebra, Probability and Statistics.  It will make the class more enjoyable (and perhaps doable) for you.  In my view, both Karpathy’s and Socher’s classes are perhaps an easier second class than Hinton’s.

If you finish this class, make sure you check out other fundamental classes.  Check out my post “Learning Deep Learning – My Top 5 List”, and you will have plenty of ideas for what’s next.   A special mention here is perhaps Daphne Koller’s Probabilistic Graphical Models, which I found equally challenging, and which may give you some insight into very deep topics such as the Deep Belief Network as well.

Another suggestion for you: maybe you can take the class again. That’s what I plan to do about half a year later.  As I mentioned, I don’t understand every single nuance in the class.  But I think understanding will come on my 6th or 7th time going through the material.

Arthur Chan

[1] To me, this makes a lot of sense for both the course’s preparers and the students, because students can take more time to really go through the homework, and the preparers can monetize their class for an indefinite period of time.

History:

(20170410) First writing
(20170411) Fixed typos. Smooth up writings.
(20170412) Fixed typos
(20170414) Fixed typos.



Some Quick Impression of Browsing “Deep Learning”

(Edited from a post I wrote back on Feb 14 at AIDL)
I have had some leisure lately to browse “Deep Learning” by Goodfellow et al. for the first time. Since it is known as the bible of deep learning, I decided to write a short post of afterthoughts.  They are in point form and not too structured.

  • If you want to learn the zen of deep learning, “Deep Learning” is the book. In a nutshell, “Deep Learning” is an introductory-style textbook on nearly every contemporary field in deep learning. It has a thorough chapter covering backprop, and perhaps the best introductory material on SGD, computational graphs and ConvNets. So the book is very suitable for those who want to further their knowledge after going through 4-5 introductory DL classes.
  • Chapter 2 is supposed to go through the basic Math, but it’s unlikely to cover everything the book requires. PRML Chapter 6 seems to be a good preliminary before you start reading the book. If you don’t feel comfortable with matrix calculus, perhaps you want to read “Matrix Algebra” by Abadir as well.
  • There are three parts to the book.  Part 1 is all about the basics: math, basic ML, backprop, SGD and such. Part 2 is about how DL is used in real-life applications.  Part 3 is about research topics such as E.M. and graphical models in deep learning, or generative models. All three parts deserve your time. The Math and general ML in Part 1 may be better replaced by a more technical text such as PRML. But the rest of the material is deeper than the popular DL classes. You will also find relevant citations easily.
  • I enjoyed Parts 1 and 2 a lot, mostly because they are deeper and filled me in with interesting details. What about Part 3? While I don’t quite grok all the Math, Part 3 is strangely inspiring. For example, I noticed a comparison of graphical models and NN. There is also a discussion of how E.M. is used in latent models. Of course, there is an extensive survey of generative models. It covers difficult models such as the deep Boltzmann machine, the spike-and-slab RBM and many variations. Reading Part 3 makes me want to learn classical machine learning techniques, such as mixture models and graphical models, better.
  • So I will say you will enjoy Part 3 if you are
    1. a DL researcher in unsupervised learning and generative models, or
    2. someone who wants to squeeze out the last bit of performance through pre-training, or
    3. someone who wants to compare other deep methods, such as mixture models or graphical models, with NN.

Anyway, that’s what I have now. Maybe I will summarize in a blog post later on, but enjoy these random thoughts for now.

Arthur

You might also like the resource page and my top-five list.   Also check out Learning machine learning – some personal experience.