
Review of Ng's deeplearning.ai Course 2: Improving Deep Neural Networks

(Review of deeplearning.ai Course 1 can be found here.)

In your life, there are times you think you know something, yet genuine understanding seems to elude you. It's always frustrating, isn't it? For example, why can seemingly simple concepts such as gradients or regularization still throw us off, even though we have been learning them since Day 1 of our study of machine learning?

In programming, there's a term called "grok". Grokking something usually means that you not only know the term, but also have an intuitive understanding of the concept. Or, as in "Zen and the Art of Motorcycle Maintenance" [1], you try to dive deep into a concept, as if it is a journey... For example, if you really think about speech recognition, you would realize the frame independence assumption [2] is very important, because it simplifies the problem in both search and parameter estimation. Yet it certainly introduces a modeling error. These small things, which are not mentioned in classes or lectures, are the things you need to "grok".

That brings us to Course 2 of deeplearning.ai.  What are you grokking in this Course?  After you take Course 1, should you take this Course 2?  My answer is yes and here is my reasoning.

Really, What is Gradient Descent?

Gradient descent is a seemingly simple subject - say you want to find the minimum of a convex function, so you follow the negative gradient downhill, and after many iterations you eventually hit the minimum. Sounds simple, right?

Of course, you soon realize that functions are normally not convex, that they are n-dimensional, and that there can be plateaus. Or you follow the gradient, but it happens to point in a wrong direction! So you zigzag as you try to descend. It's a little bit like descending a real mountain, except you can't really see n-dimensional space!
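
To make the zigzagging concrete, here is a minimal sketch in plain Python. The function f(x, y) = 0.5 * (x^2 + 25 * y^2), the starting point and the learning rate are illustrative assumptions, not something taken from the course. The bowl is much steeper along y than along x, so a step size that is comfortable for x overshoots in y.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2), a long narrow convex bowl."""
    return x, 25.0 * y


def gradient_descent(x, y, lr, steps):
    """Plain gradient descent; returns every point visited."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


# lr=0.07 is an assumed value, chosen so that the steep y direction overshoots.
path = gradient_descent(x=10.0, y=1.0, lr=0.07, steps=50)

# Count how often y jumps to the other side of the valley floor (y = 0).
sign_flips = sum(1 for (_, y0), (_, y1) in zip(path, path[1:]) if y0 * y1 < 0)
print("final point:", path[-1])
print("times y changed sign:", sign_flips)
```

With these numbers each step multiplies y by 1 - 0.07 * 25 = -0.75, so the iterate jumps across the valley on every step, while x only shrinks by a factor of 0.93 per step. That is the zigzag, even on a perfectly convex function.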

That explains the early difficulty of deep learning development. Stochastic gradient descent (SGD) was just too slow for DNNs back in 2000. That resulted in very interesting research on restricted Boltzmann machines (RBM), which were stacked to initialize a DNN (a prominent subject of Hinton's NNML after Lecture 8), or pretraining, which is still being used in some recipes in speech recognition as well as financial prediction.

But we are not doing RBM any more! In fact, research in RBM is not as fervent as in 2008. [3] Why? It has to do with people simply understanding more about SGD and being able to run it better. It has to do with initialization, e.g. Glorot's and He's initialization. It has to do with how gradient descent is done - Adam is our current best.
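
As a rough illustration of what these initialization schemes do, here is a sketch in plain Python. The scaling rules are the commonly quoted ones (Glorot uniform with limit sqrt(6 / (fan_in + fan_out)), He normal with standard deviation sqrt(2 / fan_in)); the function names, layer sizes and seed are hypothetical choices made only for this example.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, std**2), std = sqrt(2 / fan_in), usually paired with ReLU."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def sample_std(matrix):
    """Standard deviation over all entries of a nested-list matrix."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
w_glorot = glorot_uniform(256, 128, rng)  # assumed layer: 256 inputs, 128 outputs
w_he = he_normal(256, 128, rng)
print("Glorot std:", sample_std(w_glorot))
print("He std:", sample_std(w_he))
```

The point of both rules is the same: the scale of the weights shrinks as the layer gets wider, so that signals neither blow up nor die out as they pass through many layers.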

So how do you learn this stuff? Before Ng's deeplearning.ai class, I would say knowledge like this was spread out over courses such as cs231n or cs224n. But as I mentioned in the Course 1 review, those are really courses with specific applications in mind. Or you can read Michael Nielsen's Neural Networks and Deep Learning. Of course, Nielsen's is a book, so it really depends on whether you have the patience to work through the details while reading. (Also see my review of the book.)

Of course, now you don't have to. The one-stop shop is Course 2. Course 2 actually covers the material I just mentioned, such as initialization and gradient descent, as well as deeper concepts such as regularization in deep learning and batch normalization. That makes me recommend that you keep on taking the course after you finish Course 1. If you take the class, and are also willing to read Sebastian Ruder's review of SGD or Gabriel Goh's Why Momentum Really Works, you will be much ahead of the game.

As a note, I also like how Andrew breaks down many of the SGD algorithms as smoothing algorithms. That's a new insight for me, even after I have used SGD many times.
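
The smoothing view can be sketched in a few lines of plain Python. This assumes the exponentially weighted average form of momentum, v = beta * v + (1 - beta) * g, with the usual bias correction; the function names and the toy gradient sequence are made up for illustration.

```python
def ewma(values, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average of a sequence."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        # Early averages are biased towards 0 because v starts at 0;
        # dividing by (1 - beta**t) undoes that.
        out.append(v / (1.0 - beta ** t) if bias_correction else v)
    return out


def momentum_step(w, v, g, lr=0.1, beta=0.9):
    """One momentum update: smooth the gradient first, then step along the average."""
    v = beta * v + (1.0 - beta) * g
    return w - lr * v, v


# A gradient that flips sign every step, like the zigzag in a narrow valley.
noisy_grads = [1.0, -1.0] * 10
smoothed = ewma(noisy_grads, bias_correction=False)
print("largest raw gradient:     ", max(abs(g) for g in noisy_grads))
print("largest smoothed gradient:", max(abs(v) for v in smoothed))
```

The oscillating component mostly cancels inside the average, while a gradient component that keeps the same sign would pass through almost unchanged. Seen this way, momentum is a low-pass filter on the gradient.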

Is it hard?

Nope. As Math goes, Course 1 is probably the toughest. Of course, even in Course 1, you will finish the coursework faster if you don't overthink the problems. Most notebooks have the derived results for you. On the other hand, if you do want to derive the formulae, you need decent skill in matrix calculus.

Is it Necessary to Understand These Details?; Also Top-Down vs Bottom-Up learning, which is Better?

A legitimate question here is: in the current state of deep learning, where we have so many toolkits which already implement techniques such as Adam, do I really need to dig so deep?

I do think there are always two views of learning. One is top-down, which in deep learning is perhaps to read a bunch of papers, learn the concepts and see if you can wrap your head around them. The fast.ai class is one example. And I would guess 95% of the current AI enthusiasts are following such paths.

What's the problem with the top-down approach? Let me go back to my first paragraph: do you really grok something when you do it top-down? I frequently can't. In my work life, I have also heard senior people say that top-down is the way to go. Yet when I went ahead to check whether they truly understood an implementation, they frequently didn't. That happens to a lot of technical people who later turn more to management. Literally, they lose their skills and touch.

On the other hand, every time I pop up an editor and write an algorithm, I gain tremendous understanding! For example, I was once asked to write forward inference in C, where you had better know what you are doing at the level of memory allocation. In fact, I have come to the opinion these days that you have to implement an algorithm once before you can claim you understand it.

So how come there are two sides of the opinion then? One of my speculations is that back in the 80s/90s, students were often taught to get a program right in the first writing. That creates the mindset that you have to think up a perfect program before you start to write one. Of course, in ML practice such a mindset is highly impractical, because the ML development process is really an experiment. You can't always assume you have perfected the settings before you try something.

Another equally dangerous mindset is to say "if you are too focused on details, then you miss the big picture and won't come up with something new!". I heard this a lot when I first did research, and it's close to the most BS-ty thing I've heard. If you want to come up with something new, the first thing you should learn is all the details of existing works. The so-called "big picture" and "details" are always interconnected. That's why in the AIDL forum, we never see young kids who say "Oh I have this brand new idea, which is completely different from all previous works!" go anywhere. That's because you always learn how to walk before you run. And knowing the details has no downsides.

Perhaps this is my long-winded reasoning for why Ng's class is useful for me, even after I have read much of the literature. I distrust people who only talk about theory but don't show any implementation.

Conclusion

This concludes my review of Course 2. Many people, after they take Course 1, just decide to take Course 2. I don't blame them, but you always want to ask if your time is well spent.

To me though, taking Course 2 is not just about understanding more about deep learning. It is also my hope to grok some of the seemingly simple concepts in the field. I hope that my review is useful, and I will keep you all posted when my Course 3 review is done.

Arthur

Footnotes:
[1] As Pirsig said - it's really not about motorcycle maintenance.

[2] Strictly speaking, it is the conditional frame independence assumption. But practitioners in ASR frequently just call it the frame independence assumption.

[3] Also see HODL's interview with Ruslan Salakhutdinov; his account of the rise and fall of RBM is first-hand.

Some Useful Links on Neural Machine Translation

Some good resources for NNMT

Tutorial:

a bit special: Tensor2Tensor uses a novel architecture instead of pure RNN/CNN decoder/encoder.   It gives a surprisingly large amount of gain.  So it's likely that it will become a trend in NNMT in the future.

Important papers:

  • Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation by Cho et al. (link) - Very innovative and smart paper by Kyunghyun Cho. It also introduces GRU.
  • Sequence to Sequence Learning with Neural Networks by Ilya Sutskever et al. (link) - By Google's researchers, and perhaps it shows for the first time that an NMT system is comparable to the traditional pipeline.
  • Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (link)
  • Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau (link) - The paper which introduced attention.
  • Neural Machine Translation by Minh-Thang Luong (link)
  • Effective Approaches to Attention-based Neural Machine Translation by Minh-Thang Luong (link) - On how to improve the attention approach based on local attention.
  • Massive Exploration of Neural Machine Translation Architectures by Britz et al (link)
  • Recurrent Convolutional Neural Networks for Discourse Compositionality by Kalchbrenner and Blunsom (link)

Important Blog Posts/Web page:

Others: (Unsorted, and seems less important)

Usage in Chatbot and Summarization (again unsorted, and again perhaps less important.....)

A Review on Hinton's Coursera "Neural Networks for Machine Learning"

Cajal's drawing chick cerebellum cells, from Estructura de los centros nerviosos de las aves, Madrid, 1905

For me, finishing Hinton's deep learning class, or Neural Networks for Machine Learning (NNML), is a long overdue task. As you know, the class was first launched back in 2012. I was not so convinced by deep learning back then. Of course, my mind changed around 2013, but the class was archived. Not until 2 years later did I decide to take Andrew Ng's class on ML, and finally I was able to loop through Hinton's class once. But only last October, when the class relaunched, did I decide to take it again, i.e. watch all the videos a second time, finish all the homework and get passing grades for the course. As you will read through my journey, this class is hard. Some videos I watched 4-5 times before grokking what Hinton said. Some assignments made me take long walks to think them through. Finally I made it through all 20 assignments, and even bought a certificate for bragging rights. It's a refreshing, thought-provoking and satisfying experience.

So this piece is my review of the class, why you should take it and when. I also discuss one question which has been floating around forums from time to time: given all these deep learning classes now, is Hinton's class outdated? Or is it still the best beginner class? I will chime in on the issue at the end of this review.

The Old Format Is Tough

I admire people who could finish this class in Coursera's old format. NNML is well known to be much harder than Andrew Ng's Machine Learning, as multiple reviews said (here, here). Many of my friends who have PhDs cannot quite follow what Hinton said in the last half of the class.

No wonder: when Karpathy reviewed it in 2013, he noted that there was an influx of non-MLers working on the course. For newcomers, it must be bewildering to understand topics such as energy-based models, which many people have a hard time following. Or what about the deep belief network (DBN), which people these days still mix up with the deep neural network (DNN)? And quite frankly, I still don't grok some of the proofs in lecture 15 after going through the course, because deep belief networks are difficult material.

The old format only allows 3 trials per quiz, with tight deadlines, and you only have one chance to finish the course. One homework requires deriving the matrix form of backprop from scratch. All of these make the class unsuitable for busy individuals (like me). It is more for second- to third-year graduate students, or even experienced practitioners who have plenty of time (but who does?).

The New Format Is Easier, but Still Challenging

I took the class last October, when Coursera had changed most classes to the new format, which allows students to re-take. [1] It strips out some of the difficulty of the task, but it's more suitable for busy people. That doesn't mean you can go easy on the class: for the most part, you would need to review the lectures, work out the Math, draft pseudocode etc. The homework which requires you to derive backprop is still there. The upside: you can still have all the fun of deep learning. 🙂 The downside: you shouldn't expect to get through the class without spending 10-15 hours/week.

Why the Class is Challenging -  I: The Math

Unlike Ng's and cs231n, NNML is not too easy for beginners without a background in calculus. The Math is still not too difficult - mostly differentiation with the chain rule, intuition on what a Hessian is, and more importantly, vector differentiation - but if you have never learned it, the class would be over your head. Take at least Calculus I and II before you join, and know some basic equations from the Matrix Cookbook.

Why the Class is Challenging - II:  Energy-based Models

Another reason why the class is difficult is that the last half of the class is all based on so-called energy-based models, i.e. models such as the Hopfield network (HopfieldNet), Boltzmann machine (BM) and restricted Boltzmann machine (RBM). Even if you are used to the math of supervised learning methods such as linear regression, logistic regression or even backprop, the Math of RBM can still throw you off. No wonder: many of these models have their physical origin in models such as the Ising model. Deep learning research also frequently uses ideas from Bayesian networks, such as explaining away. If you have no basic background in either physics or Bayesian networks, you would feel quite confused.
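
To give a flavor of what "energy-based" means, here is a toy Hopfield network in plain Python. It is a sketch under simple assumptions (states in {-1, +1}, no biases, Hebbian storage of a single pattern, sequential updates), not code from the course; all names are hypothetical.

```python
def store(patterns):
    """Hebbian weights: w[i][j] = sum over patterns of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def energy(w, s):
    """Hopfield energy E(s) = -1/2 * sum_ij w[i][j] * s[i] * s[j]."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))


def recall(w, s, sweeps=5):
    """Update one unit at a time; each update can only keep or lower the energy."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


pattern = [1, -1, 1, -1, 1, -1]
noisy = [1, -1, 1, -1, 1, 1]  # the stored pattern with its last unit flipped
w = store([pattern])
recalled = recall(w, noisy)
print("energy before:", energy(w, noisy))
print("energy after: ", energy(w, recalled))
print("recovered stored pattern:", recalled == pattern)
```

The stored pattern sits at a low point of the energy function, and recall is just rolling downhill from a corrupted state. Boltzmann machines and RBMs keep the same kind of energy function but make the units stochastic, which is where the Math gets harder.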

In my case, I spent quite some time Googling and reading through the relevant literature. That powered me through some of the quizzes, but I don't pretend I understand those topics, because they can be deep and unintuitive.

Why the Class is Challenging - III: Recurrent Neural Network

If you learn RNN these days, it is probably from Socher's cs224d or by reading Mikolov's thesis. LSTM would easily be your only thought on how to resolve exploding/vanishing gradients in RNN. Of course, there are other ways: echo state networks (ESN) and Hessian-free methods. They are seldom talked about these days. Again, their formulation is quite different from your standard methods such as backprop and gradient descent. But learning them gives you breadth, and makes you think about whether the status quo is the right thing to do.
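
Why gradients explode or vanish in an RNN can be seen in the simplest possible case. The sketch below assumes a scalar linear recurrence h_t = w * h_{t-1}, which is an illustrative simplification; in a real RNN the scalar w becomes a weight matrix times the derivative of the nonlinearity.

```python
def backprop_factor(w, steps):
    """d h_T / d h_0 for the recurrence h_t = w * h_{t-1}: the chain rule gives w**steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= w  # one more time step back means one more multiplication by w
    return factor


for w in (0.9, 1.0, 1.1):
    print("w =", w, "-> gradient factor over 50 steps:", backprop_factor(w, 50))
```

A recurrent weight slightly below 1 makes the gradient from 50 steps back almost disappear, and one slightly above 1 makes it blow up. LSTM, ESN and Hessian-free optimization are different answers to this same problem.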

But is it Good?

You bet! Let me explain why in the next section.

Why is it good?

Suppose you just want to use some of the fancier tools in ML/DL. I guess you can just go through Andrew Ng's class, test out a bunch of implementations, then claim yourself an expert - that's what many people do these days. In fact, Ng's Coursera class is designed to give you a taste of ML, and indeed, you should be able to wield many ML tools after the course.

That said, you should realize your understanding of ML/DL is still... rather shallow. Maybe you are thinking "Oh, I have a bunch of data, let's throw it into Algorithm X!" or "Oh, we just want to use XGBoost, right? It always gives you the best results!" You should realize that a performance number isn't everything. It's important to understand what's going on with your model. You easily make costly, short-sighted and ill-informed decisions when you lack understanding. It happens to many of my peers, to me, and sadly even to some of my mentors.

Don't make the mistake! Always seek better understanding! Try to grok. If you only do Ng's neural network assignment, by now you would still wonder how it can be applied to other tasks. Go for Hinton's class, feel perplexed by what the Prof said, and iterate. Then you will start to build up a better understanding of deep learning.

Another more technical note: if you want to learn deep unsupervised learning, I think this should be the first course as well. Prof. Hinton teaches you the intuition behind many of these machines, and you will also have a chance to implement them. For models such as the Hopfield net and RBM, it's quite doable if you know basic Octave programming.

So it's good, but is it outdated?

Learners these days are perhaps luckier: they have plenty of choices for learning a deep topic such as deep learning. Just check out my own "Top 5-List". cs231n, cs224d and even Silver's class are great contenders to be the second class.

But I still recommend NNML.  There are four reasons:

  1. It is deeper and tougher than other classes. As I explained before, NNML is tough, not exactly mathematically (Socher's and Silver's Math are also non-trivial), but conceptually. Energy-based models and the different ways to train RNNs are some of the examples.
  2. Many concepts in ML/DL can be seen in different ways. For example, bias/variance is a trade-off for a frequentist, but it's seen as a "frequentist illusion" by a Bayesian. The same thing can be said about concepts such as backprop and gradient descent. Once you think about them, they are tough concepts. So one reason to take a class is not just to be taught a concept, but to be allowed to look at things from a different perspective. In that sense, NNML perfectly fits the bill. I found myself thinking about Hinton's statements during many long promenades.
  3. Hinton's perspective - Prof. Hinton has been mostly on the losing side of ML during the last 30 years. But he persisted. From his lectures, you get a feeling of how/why he started a certain line of research, and perhaps ultimately how you would research something yourself in the future.
  4. Prof. Hinton's delivery is humorous. Check out his view in Lecture 10 about why physicists worked on neural networks in the early 80s. (Note: he was a psychologist before working on neural networks.)

Conclusion and What's Next?

All in all, Prof. Hinton's "Neural Networks for Machine Learning" is a must-take class. All of us, beginners and experts included, will benefit from the professor's perspective and the breadth of the subject.

I do recommend you first take Ng's class if you are an absolute beginner, and perhaps some Calculus I or II, plus some Linear Algebra, Probability and Statistics. It would make the class more enjoyable (and perhaps doable) for you. In my view, both Karpathy's and Socher's classes are perhaps an easier second class than Hinton's.

If you finish this class, make sure you check out other fundamental classes. Check out my post "Learning Deep Learning - My Top 5 List", and you will have plenty of ideas for what's next. A special mention here is perhaps Daphne Koller's Probabilistic Graphical Models, which I found equally challenging, and which will perhaps give you some insights on very deep topics such as the Deep Belief Network.

Another suggestion for you: maybe you can take the class again. That's what I plan to do about half a year later - as I mentioned, I don't understand every single nuance in the class. But I think understanding will come on my 6th or 7th time going through the material.

Arthur Chan

[1] To me, this makes a lot of sense for both the course's preparer and the students, because students can take more time to really go through the homework, and the course's preparer can monetize their class for an indefinite period of time.

History:

(20170410) First writing
(20170411) Fixed typos. Smooth up writings.
(20170412) Fixed typos
(20170414) Fixed typos.

If you like this message, subscribe to the Grand Janitor Blog's RSS feed. You can also find me (Arthur) at twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

Some Quick Impression of Browsing "Deep Learning"

(Redacted from a post I wrote back in Feb 14 at AIDL)
I have had some leisure lately to browse "Deep Learning" by Goodfellow for the first time. Since it is known as the bible of deep learning, I decided to write a short afterthought post. The thoughts are in point form and not too structured.

  • If you want to learn the zen of deep learning, "Deep Learning" is the book. In a nutshell, "Deep Learning" is an introductory-style textbook on nearly every contemporary field in deep learning. It has a thorough chapter covering backprop, and perhaps the best introductory material on SGD, computational graphs and Convnets. So the book is very suitable for those who want to further their knowledge after going through 4-5 introductory DL classes.
  • Chapter 2 is supposed to go through the basic Math, but it's unlikely to cover everything the book requires. PRML Chapter 6 seems to be a good preliminary before you start reading the book. If you don't feel comfortable with matrix calculus, perhaps you want to read "Matrix Algebra" by Abadir as well.
  • There are three parts to the book. Part 1 is all about the basics: math, basic ML, backprop, SGD and such. Part 2 is about how DL is used in real-life applications. Part 3 is about research topics such as E.M. and graphical models in deep learning, or generative models. All three parts deserve your time. The Math and general ML in Part 1 may be better replaced by a more technical text such as PRML. But the rest of the material is deeper than the popular DL classes. You will also find relevant citations easily.
  • I enjoyed Parts 1 and 2 a lot, mostly because they are deeper and fill me in with interesting details. What about Part 3? While I don't quite grok all the Math, Part 3 is strangely inspiring. For example, I noticed a comparison of graphical models and NN. There is also how E.M. is used in latent models. Of course, there is an extensive survey of generative models. It covers difficult models such as the deep Boltzmann machine, spike-and-slab RBM and many variations. Reading Part 3 makes me want to learn classical machine learning techniques, such as mixture models and graphical models, better.
  • So I will say you will enjoy Part 3 if you are:
    1. a DL researcher in unsupervised learning and generative models, or
    2. someone who wants to squeeze out the last bit of performance through pre-training, or
    3. someone who wants to compare other deep methods, such as mixture models or graphical models, with NN.

Anyway, that's what I have now. Maybe I will summarize in a blog post later on, but enjoy these random thoughts for now.

Arthur

You might also like the resource page and my top-five list.   Also check out Learning machine learning - some personal experience.

Reading Michael Nielsen's "Neural Networks and Deep Learning"

Introduction

Let me preface this article: after I wrote my top five list of deep learning resources, one oft-asked question is "What are the Math prerequisites to learn deep learning?" My first answer is Calculus and Linear Algebra, but then I will qualify that certain techniques of Calculus and Linear Algebra are more useful. E.g. you should already know gradients, differentiation, partial differentiation and Lagrange multipliers; you should know matrix differentiation and preferably the trace trick, eigen-decomposition and such. If your goal is to understand machine learning in general, then having good skills in integration and knowledge of analysis helps. E.g. the 1-2 star problems of Chapter 2 of PRML [1] require some knowledge of advanced functions such as gamma and beta. Having some Math would help you go through these questions more easily.

Nevertheless, I find that people who want to learn Math first before approaching deep learning miss the point. Many engineering topics were not motivated by pure mathematical pursuit. More often than not, an engineering field is motivated by a physical observation. Mathematics is more like an aid to imagine and create a new solution. Take the case of deep learning: if you listen to Hinton, he would often say he tries to first come up with an idea and make it work mathematically later. His insistence on working on neural networks at the time of kernel methods stems more from his observation of the brain. "If the brain can do it, how come we can't?" should be a question you ask every day when you run a deep learning algorithm. I think these observations are fundamental to deep learning. And you should go through the arguments of why people think neural networks are worthwhile in the first place. Reading classic papers from Hubel and Wiesel helps. Understanding the history of neural networks helps. Once you read these materials, you will quickly grasp the big picture of much of the development of deep learning.

That said, I think there are certain topics which are fundamental in deep learning. They are not necessarily very mathematical. For example, I will name back propagation [2] as a very fundamental concept which you want to get good at. Now, you may think that's silly. "I know backprop already!" Yes, backprop is probably in every single machine learning class. It will easily give you the illusion that you have mastered the material. But you can always learn more about a fundamental concept. And back propagation is important theoretically and practically. You will encounter back propagation either as a user of deep learning tools, a writer of a deep learning framework or an innovator of new algorithms. So a thorough understanding of backprop is very important, and one course is not enough.

This very long digression finally brings me to the great introductory book, Michael Nielsen's Neural Networks and Deep Learning (NNDL). The reason why I think Nielsen's book is important is that it offers an alternative discussion of back propagation as an algorithm. So I will use the rest of the article to explain why I appreciate the book so much and recommend nearly all beginning or intermediate learners of deep learning to read it.

First Impression

I first learned about "Neural Networks and Deep Learning" (NNDL) from going through Tensorflow's tutorial. My first thought was "ah, another blogger tries to cover neural networks", i.e. I didn't think it was promising. At that time, there were already plenty of articles about deep learning. Unfortunately, they often repeated the same topics without bringing anything new.

Synopsis

Don't make my mistake! NNDL is a great introductory book which balances the theory and practice of deep neural networks. The book has 6 chapters:

  1. Using neural nets to recognize handwritten digits - the basics of neural networks, a basic implementation in python (network.py)
  2. How the backpropagation algorithm works - various explanations of back propagation
  3. Improving the way neural networks learn - standard improvements of the simple back propagation, another implementation in python (network2.py)
  4. A visual proof that neural nets can compute any function - the universal approximation theorem without the Math, plus fun games in which you can approximate functions yourself
  5. Why are deep neural networks hard to train? - practical difficulties of using back propagation, vanishing gradients
  6. Deep learning - convolutional neural networks (CNN), the final implementation based on Theano (network3.py), recent advances in deep learning (circa 2015).

The accompanying python scripts are the gems of the book. network.py and network2.py can run in plain old python (plus numpy). You need Theano for network3.py, but I think the strength of the book really lies in network.py and network2.py (Chapters 1 to 3), because if you want to learn CNN, Karpathy's lectures probably give you more bang for your buck.

Why I like Nielsen's Treatment of Back Propagation?

Reading Nielsen's exposition of neural networks is the sixth time I have learned the basic formulation of back propagation [see footnote 3]. So what's the difference between his treatment and my other reads then?

Forget about my first two reads, because I didn't care about neural networks enough to know why back propagation is so named. But my later reads pretty much gave me the same impression of neural networks: "a neural network is merely a stacking of logistic functions. So how do you train the system? Oh, just differentiate the loss function, the rest is technicalities." Usually the books will guide you to verify certain formulae in the text. Of course, you will be guided to deduce that the "error" is actually "propagating backward" through the network. Let us call this the network-level view. In a network-level view, you really don't care about how individual neurons operate. All you care about is seeing the neural network as yet another machine learning algorithm.

The problem with the network-level view is that it doesn't quite explain a lot of phenomena about back propagation. Why is it so slow sometimes? Why do certain initialization schemes matter? Nielsen does an incredibly good job of breaking down the standard equations into 4 fundamental equations (BP1 to BP4 in Chapter 2). Once you interpret them, you will realize "Oh, saturation is really a big problem in back propagation" and "Oh, of course you have to initialize the weights of a neural network with non-zero values. Or else nothing propagates/back propagates!" These insights, while not mathematical in nature and understandable with college calculus, give a deeper understanding of back propagation.
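
For reference, here are the four equations as I recall them from Chapter 2 of NNDL, written in Nielsen's notation: z^l is the weighted input of layer l, a^l = σ(z^l) its activation, δ^l the error of layer l, C the cost, L the last layer and ⊙ the elementwise product. Treat this as a summary sketch and check it against the book.

```latex
\begin{align}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \tag{BP1} \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{BP2} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j \tag{BP3} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j \tag{BP4}
\end{align}
```

Both insights can be read off directly. The factor σ'(z) appears in BP1 and BP2, and it is close to zero when a sigmoid neuron saturates, so the error, and with it every gradient in BP3 and BP4, shrinks. And if every weight is zero, the (w^{l+1})^T term in BP2 zeroes out the error of every hidden layer, so only the output layer receives a gradient.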

Another valuable part of Nielsen's explanation is that it comes with an accessible implementation. His first implementation (network.py) is 74 lines of idiomatic python. By adding print statements to his code, you will quickly grasp how a lot of these daunting equations are implemented in practice. For example, as an exercise, you can try to identify how he implements BP1 to BP4 in network.py. It's true that there are books and implementations about neural networks, but the description and implementation don't always come together. Nielsen's presentation is a rare exception.

Other Small Things I Like

  • Nielsen correctly points out that the Del symbol in machine learning is more like a convenient device than its more usual meaning as the Del operator in Math.
  • In Chapter 4, Nielsen mentions universal approximation of neural networks. Unlike a standard textbook, which points you to a bunch of papers with daunting math, Nielsen created a javascript demo which allows you to approximate functions (!), which I think is a great way to learn the intuition behind the theorem.
  • He points out that it's important to differentiate the activation and the weighted input. In fact, this point is one thing which can confuse you when reading a derivation of back propagation, because different textbooks use different symbols for activation and weighted input.
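
The distinction in the last bullet can be sketched in plain Python. Following Nielsen's notation, z is the weighted input and a = sigmoid(z) is the activation; the function names and the toy weights below are hypothetical and are not taken from network.py.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, evaluated at the weighted input z (not at a)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def layer_forward(weights, biases, a_prev):
    """Return both quantities for one layer: z = w . a_prev + b and a = sigmoid(z)."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + b for row, b in zip(weights, biases)]
    a = [sigmoid(v) for v in z]
    return z, a


# Toy layer with two neurons; the second one has a large bias and saturates.
z, a = layer_forward(weights=[[1.0, -1.0], [0.0, 0.0]], biases=[0.0, 10.0], a_prev=[0.5, 0.5])
for z_j, a_j in zip(z, a):
    print("z =", z_j, " a =", a_j, " sigmoid'(z) =", sigmoid_prime(z_j))
```

Backprop needs both: the activations of the previous layer feed the weight gradients, while the derivative of the nonlinearity is taken at z. Keeping the two under separate names also makes saturation visible, since the second neuron's activation is almost 1 while its derivative is almost 0.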

There are many such insightful comments in the book; I encourage you to read and discover them.

Things I don't like

  • There are many exercises in the book. Unfortunately, there are no answer keys. In a way, this makes Nielsen more of an old-style author who encourages readers to think. I guess this is something I don't always like, because spending time thinking about one single problem forever doesn't always give you better understanding.
  • Chapter 6 gives the final implementation in Theano. Unfortunately, there is not much introductory material on Theano within the book. I think this is annoying but forgivable; as Nielsen pointed out, it's hard to introduce Theano in an introductory book. I would think anyone interested in Theano should probably go through the standard Theano tutorials here and here.

Conclusion

All in all, I highly recommend Neural Networks and Deep Learning to any beginning or intermediate learner of deep learning. If this is the first time you are learning back propagation, NNDL is a great general introductory book. If you are like me, and already know a thing or two about neural networks, NNDL still has a lot to offer.

Arthur

[1] In my view, PRML's problem sets have 3 ratings: 1-star, 2-star and 3-star. 1-star problems usually require college-level calculus and patient manipulation; 2-star problems require some creative thought in problem solving or knowledge beyond basic calculus. 3-star problems are more long-form questions, and one of them could contain multiple 2-star questions. For your reference, I solved around 100 out of the 412 questions. Most of them were 1-star questions.

[2] The other important concept in my mind is gradient descent, and it is still an active research topic.

[3] The 5 times before: I "learnt" it once back at HKUST, read it in Mitchell's book, read it in Duda and Hart, learnt it again from Ng's lecture, and read it again in PRML. My 7th was to learn it from Karpathy's lecture; he presents the material in yet another way. So it's worth your time to look at all of them.

If you like this message, subscribe to the Grand Janitor Blog's RSS feed. You can also find me (Arthur) at twitter, LinkedIn, Plus, Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

Some Speculations On Why Microsoft Tay Collapsed

Microsoft's Tay, following Google's AlphaGo, was meant to be yet another highly intelligent A.I. program, one which would fulfill humanity's long-standing dream: a machine which can truly converse. But as you know, Tay failed spectacularly. To me, this is a highly unusual event. Part of it is that another Microsoft conversational agent, Xiaoice, was extremely successful in China. The other part is that MSR is one of the leading sites in applying deep learning to various machine learning problems. You would think that a major P.R. problem, such as Tay confirming "Donald Trump is the hope" and purportedly supporting genocide, would have been weeded out before launch.

I read many posts in the past week that attempted to explain why Tay failed; sadly, they offered me no insights. Some were even published in respected magazines, e.g. The New Yorker's "I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter", at the end of which the author concluded,

"If there is a lesson to be learned, it is that consciousness wants conscience. Most consumer-tech companies have, at one time or another, launched a product before it was ready, or thought that it was equipped to do something that it ended up failing at dismally. "

While I always love the prose of The New Yorker, there is really no machine which can mimic or model human consciousness (yet). In fact, no one really knows how "consciousness" works, and it's also tough to define what "consciousness" is. It's also worth mentioning that chatbot technology is not new. Google had released similar technology and got great press. (See here.) So The New Yorker's piece reflects how little the public understands the technology.

As a result, I decided to write a postmortem of Tay myself, and offer some thoughts on why this problem could occur and how one could actively avoid such problems.

Since I am writing this piece for a general audience (say, my Facebook friends), it contains only a small amount of technical detail. If you are interested, I also list several more technical articles in the reference section.

How does a Chatbot work?  The Pre-Deep Learning Version

By now, all of us have used a chat bot or two. There is obviously Siri, which is perhaps the first program to put speech recognition and dialogue systems in the national spotlight. If you are familiar with the history of computing, you probably know ELIZA [1], which is the first example of using a rule-based approach to respond to users.

What does that mean? In such a system, a natural language parser is usually used to parse the human's input, and the system then comes up with an answer using pre-defined, mostly manually written rules. It's a simple approach, but when it's done correctly, it creates an illusion of intelligence.
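To make the idea concrete, here is a toy sketch of a rule-based responder. It is not ELIZA's actual script: I swap the natural language parser for plain regular expressions, and the rules, the fallback line and the function name are all made up for illustration.

```python
import re

# Hypothetical hand-written rules: (pattern, response template).
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE),
     "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."

def respond(user_input):
    """Return the response of the first rule whose pattern matches."""
    text = user_input.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Captured text is echoed back into the template, if it has a slot.
            return template.format(*match.groups())
    return FALLBACK
```

For example, respond("I am tired.") echoes the captured phrase back as a question, while any input that matches no rule gets the fallback line. Every new behavior needs another hand-written rule, which is exactly where the approach starts to strain.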

The rule-based approach can go quite far, e.g. the ALICE language is a pretty popular tool for creating intelligent-sounding bots. (See the history here.) There are many existing tools which help programmers create dialogue. Programmers can also import existing dialogues into their own systems.

The problem with the rule-based approach is obvious: the responses are rigid. So if someone uses the system for a while, they will easily notice they are talking to a machine. In a way, you can say the illusion is easily dispelled by human observation.

Another issue with the rule-based approach is that producing a large-scale chat bot taxes programmers. Even with convenient languages such as AIML (Artificial Intelligence Markup Language, the language used by ALICE), it would take a programmer a long, long time to come up with a chat bot, not to mention one which can answer a wide variety of questions.

Converser as a Translator

Before we go on to look at chat bots in the age of deep learning, it is important to ask how we can model conversation. Of course, you can think of it as ... well ... we first parse the sentence, generate entities and their grammatical relationships, then, based on those relationships, come up with an answer.

This approach of decomposing a sentence into its elements is very natural to human beings. In a way, this is also how the rule-based approach arose in the first place. But we just discussed the weakness of the rule-based approach: it is hard to program and hard to generalize.

So here is a more convenient way to think. You could simply ask, "Hey, now I have an input sentence, what is the best response?" It turns out this is very similar to the formulation of statistical machine translation: "If I have an English sentence, what would be the best French translation?" So a converser can be built with the same principles and technology as a translator, and all the powerful technology developed for statistical machine translation (SMT) can be used to make a conversation bot. This technology includes the IBM models, phrase-based models and syntax-based models [2]. The training is very similar too.
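In symbols, the analogy can be sketched with the usual noisy-channel decomposition from SMT. The notation is my own assumption: q stands for the input sentence and r for a candidate response.

```latex
\hat{r} \;=\; \arg\max_{r} P(r \mid q) \;=\; \arg\max_{r} P(q \mid r)\, P(r)
```

Here P(q | r) plays the role of the translation model and P(r) the role of the language model over responses. Read q as the source sentence and r as its translation, and you get back the translation problem.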

In fact, this is how many chat bots were made just before deep learning arrived. Some methods simply use an existing translation system, trained on input-response pairs instead of sentence pairs in two languages, e.g. [3].

The good thing about a statistical approach is that it generalizes much better than the rule-based approach. Also, as the program is based on machine learning, all you have to do is (carefully) prepare a bunch of training data. An existing machine learning program then helps you come up with a system automatically. It relieves the programmer of long and tedious tweaking of the bot.

How does a Chatbot work?  The Deep Learning Version

Now, given what we have discussed, how does Microsoft's chat bot Tay work? Since we don't know Tay's implementation, we can only speculate:

  1. Tay is smart, so it doesn't sound like a purely rule-based system. So let's assume it is based on the aforementioned "converser-as-translator" paradigm.
  2. It's Microsoft, so there has got to be some deep neural network. (Microsoft is one of the first sites which picked up the modern "deep" neural network paradigm.)
  3. What's the data? Well, given that Tay is built for millennials, whoever trained Tay must have used dialogue between teenagers. If I did research for Microsoft [4], maybe I would use data collected from Microsoft Messenger or Skype. Since Microsoft has age data for all its users, the data can easily be segmented and bundled into training sets.

So let's piece everything together. Very likely, Tay is a neural-network (NN)-based program which can intelligently translate a user's natural language input into a response. The program's training is based on chat data, and my speculation is that the data is exactly where things went wrong. Before I conclude: the neural network in question is likely to be a Long Short-Term Memory (LSTM) network. I believe Google's researchers were the first to advocate such an approach [5] (it made headlines last year, and the bot is known for its philosophical undertone). Microsoft has published a couple of papers on how LSTMs can be used to model conversation [6]. There is also existing bot-building software online, e.g. Andrej Karpathy's char-RNN. So it's likely that Tay is based on such an approach. [7]

 

What goes wrong then?

Well, given that Tay is just a machine learning program, her behavior is really governed by the training material. Since the training data is likely to be chat data, we can only conclude that the data must contain some offensive speech, given the political landscape of the world. So one reasonable hypothesis is that the researchers who prepared the training material didn't really filter out hate speech and other sensitive topics. I guess one potential reason for not doing so is that filtering would reduce the amount of training data. But given the amount of data owned by Microsoft, that doesn't make sense. Say you keep only 20% of 1 billion conversations: that is still 200 million, which is more than enough to train a good chatterbot. So I tend to think the issue is human oversight.

And then, as a simple fix, you can also give the robot a list of keywords, e.g. you can just program a simple regular expression match for "Hitler", then make sure there is a special rule which responds to the user with "No comment". At least the consequence wouldn't be as huge as a takedown. Then again, this is another indication that there were oversights in development. If you only spend more time testing the program, this kind of issue will be noticed and rooted out.
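The keyword rule could be sketched like this. Again, it is only an illustration: the blocklist, the function name and the idea of also checking the bot's own output are my assumptions, and generate_reply stands in for whatever learned model actually produces the response.

```python
import re

# Hypothetical blocklist; a real deployment would need a longer, curated list.
BLOCKED = re.compile(r"\b(hitler|genocide)\b", re.IGNORECASE)

def guarded_reply(user_input, generate_reply):
    """Apply a canned rule before and after calling the learned model."""
    if BLOCKED.search(user_input):
        return "No comment"
    reply = generate_reply(user_input)
    # Also check the model's output, in case it brings up the topic itself.
    if BLOCKED.search(reply):
        return "No comment"
    return reply
```

It is crude, and a determined troll can get around a keyword list, but it is cheap and it would have caught the most embarrassing cases.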

Conclusion

In this piece, I came up with a couple of hypotheses on why Microsoft's Tay failed. In the end, I echo the title of The New Yorker's piece, "I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter" ... at least partially. Tay is perhaps one of the smartest chatter bots, backed by one of the strongest research organizations in the world and trained on tons of data. But it was not destroyed by Twitter or trolls. More likely, it was destroyed by human oversight and lack of testing. In this sense, its failure is not too different from why much software fails.

Reference/Footnote

[1] Weizenbaum, Joseph "ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine", Communications of the ACM 9 (1): 36–45,

[2] Philipp Koehn, Statistical Machine Translation

[3] Alan Ritter, Colin Cherry, and William Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583–593. Association for Computational Linguistics.

[4] Woa! I could only dream! But I prefer to work on speech recognition, instead of chatterbot.

[5] Oriol Vinyals, Quoc Le, A Neural Conversational Model.

[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models

[7] A more technical point here: using an LSTM, a type of recurrent neural network (RNN), also resolves one issue of classical models such as the IBM models, whose language model is usually an n-gram model with limited long-range prediction capability.

Learning ASR Through Coding

In a way, speech recognition is not that different from many other skills. You need a lot of practice to really grasp how certain things are done. E.g. if you have never written a Viterbi algorithm, it's probably hard for you to convince anybody that you know the search aspect of ASR. And if you have never written an estimation algorithm, then your knowledge of training will be shaky.
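If you have never written one, a textbook Viterbi search over a small discrete HMM is a good first exercise. The sketch below is only that: a real ASR decoder adds beam pruning, lexicon and language model handling, and much more. The function name and the dictionary-based model layout are my own choices.

```python
import math

def _log(p):
    # Work in the log domain so that long utterances do not underflow.
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(observations, states, start, trans, emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM.

    start[s], trans[p][s] and emit[s][o] are plain probabilities.
    """
    # best[t][s]: best log score of any path ending in state s at time t.
    best = [{s: _log(start[s]) + _log(emit[s][observations[0]]) for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        best.append({})
        backpointer.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + _log(trans[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + _log(emit[s][observations[t]])
            backpointer[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return best[-1][last], path
```

Try it on a two-state toy model first and check the result by hand; then think about what changes when the states are triphone HMM states and the observations are acoustic frames.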

Of course, this might be too hard for many people. Who has time to write a decoder or a trainer? Fair enough. I guess the next best choice is to study the implementations of open source speech recognizers and try to modify them to fit your goal. In the process, you will start to build up understanding.

Which recognizers?

Let me say one thing to learners these days: you guys are lucky. When I tried to learn ASR coding back in 2000, you had to join a speech lab and get an HTK license before you could do any tracing and modification. Now you have many choices: HTK, Sphinx, Kaldi, Julius, the RWTH recognizer, etc. So which recognizers should you learn?

I will name three of them, HTK, Sphinx and Kaldi. Why?

Why HTK?

You want to learn HTK because it has a well-designed and coherent interface. It also has some of the best training technology: its ML training is assumption-free and takes care of small issues such as silences/short pauses and multiple pronunciations. It also has large-vocabulary MMIE training. All of this work is very nice.

HTK also has a well-written tutorial. If you own either the TIMIT or the RM corpus, you can usually train the whole thing just by following the instructions. While going through the tutorial, you gain valuable understanding of data structures commonly used in speech recognition.

Though I mainly worked on Sphinx, there were around 2-3 years of my life when I used HTK on a day-to-day basis. The manual itself is good literature which can teach you a lot of things. I believe many designers of speech recognizers actually learned from the HTK source code as well.

Why Sphinx?

"Because you work on Sphinx!"  True, I am biased in this case.   But I do have a legitimate reason to like Sphinx and claim that knowledge of Sphinx is more useful.

If you compare the development histories of HTK and Sphinx, you will notice that HTK's very nice interface stems from the design effort of its Entropic stage, whereas Sphinx as a whole is more the work of PhD students, faculty and staff. In other words, the Sphinx tools are more "hacky" than HTK. So as a project, you will find that Sphinx seems more incoherent, e.g. there are many recognizers, written in C or Java. The system itself seems to have a steep learning curve.

Very true, those are weaknesses. But one thing I like about Sphinx is that it is fertile ground for any enthusiast to play with. The free BSD license gives people a chance to incorporate any part of the code into their projects. As a result, historically, many companies have used Sphinx in their own code.

Before we go, you may ask "Which Sphinx?" If you ask 5 guys from the CMU Sphinx project, they will give you 5 different answers. But let me just offer my point of view, which I think is more related to learning. Nick, the current maintainer-at-large, and I once chatted, and he believed that the current Sphinx project should only support the triple Sphinx4/pocketsphinx/SphinxTrain. I support that view. As a project, we should only support and maintain a focused number of components.

Though if you are an enthusiast, I highly recommend that you study more. Other than the triple, you will find that Sphinx2 and Sphinx3 have their own interesting parts. Not all of them were transferred to Sphinx4 or pocketsphinx, but they are nonetheless fun code to read, e.g. how were triphones implemented in the different Sphinxes? With all computation these days, I don't full triphone expansion works for real-time system. I believe in that aspect, 2 and 3 are very interesting.

Why Kaldi?

I am very excited about Kaldi. You can think of it as the "new HTK with the Sphinx license". The technology is strong and new, e.g. there is a branch which has all the deep-neural-network-based training. The recognizer is based on WFSTs. Best of all, all components are under very liberal licenses. So you can surely do many amazing things with it.

The only reason why I don't recommend it more is that it is still relatively new. Open source toolkits have strange lives: if they are supported by funding, they can live forever. If they are not, their fate is quite unpredictable. Take the MITLM toolkit: there was a year or so when the maintainer had left and there was no new maintainer. I am sure that during that time users needed to patch a thing or two. It is certainly a very interesting toolkit (because of its automatic optimization of mKN smoothing weights). But sometimes it's hard to predict what will happen.

In a way, the development of Kaldi is rare: someone decided to share the best technology of our time with everybody. (WFST, SGMM and DNN are all examples.) I can only wish that the project goes on. If I could, maybe I would contribute a thing or two.

Arthur

 

"The Grand Janitor Blog V2" Started

I moved "The Grand Janitor Blog" to WordPress.   Nothing much, Blogger is simply too constraining.  I don't like the theme.  I can't really customize a thing.  I can't put an ad there if I want to sell something.   So it was really annoying and it's time to change.

But then what's new with V2? First of all, I might blog more about how machine learning influences speech recognition. It's not news that machine learning is the foundation of speech recognition. It has always been like that. Many experts who work in speech recognition have deep knowledge of pattern recognition. When you look at their papers, you can sense that they have studied a certain machine learning method in great depth. So they can come up with creative ideas to improve the bottom line, which is the only thing I care about. I don't really care about the thousand APIs wrapped around a certain recognizer. I only care about the guts inside the decoder and the trainer. Those components are what really matter, but they are also the components which are most misunderstood.

So why now? It's obvious that the latest development of DBN-DNN (the "next big thing") is one factor. I was told in school (10+ years ago) that GMM was the state of the art. But things are rapidly changing: the work of Prof. Hinton has provided a theoretical basis for making DBN-DNN training practically feasible. Enthusiasts, some rather sophisticated, are gathering around the Kaldi forum.

As for me, I would describe myself as a recovering ASR programmer. What does that mean? It means I need to grok ASR from theory to implementation. That's tough. I found myself studying again, dusting off my "Advanced Calculus" and trying to read, and think creatively about, texts such as "Connectionist Speech Recognition: A Hybrid Approach" by Bourlard and Morgan. (It's a highly entertaining technical text!) Perhaps more in the future. But when you try to drill a certain skill in your life, there has got to be a point where you need to go back to the basics. Re-think all the things you thought you knew. Re-prove all the proofs you thought you understood. That takes time and patience, but in the end it is also how you come up with new ideas.

As for the readers, sorry for never getting back to your suggested blog topics. You might be interested in a code trace of a certain part of Sphinx. You might be interested in how certain parts of the program work. I have kept a list of them and will probably write something up when I have time. No promises though; I have been very busy. And to be frank, everyone who works in ASR is busy. That perhaps explains why there are not many actively maintained blogs on speech recognition.

Of course, I will keep on posting on other diverse topics such as programming and technology.   I am still a geek.  I don't think anyone can change that. 🙂

In any case, feel free to connect with me and have fun with speech recognition!

Cheers

Arthur Chan, "The Grand Janitor"