Reading Michael Nielsen’s “Neural Networks and Deep Learning”

Introduction

Let me preface this article: after I wrote my top-five list of deep learning resources, one oft-asked question was “What are the math prerequisites for learning deep learning?” My first answer is calculus and linear algebra, but I would then qualify that certain techniques from each are more useful than others. For example, you should already know gradients, differentiation, partial differentiation and Lagrange multipliers; you should also know matrix differentiation and preferably the trace trick, eigen-decomposition and such. If your goal is to understand machine learning in general, then good integration skills and some knowledge of analysis help. For example, the 1-star and 2-star problems in Chapter 2 of PRML [1] require some knowledge of special functions such as the gamma and beta functions. Having some math will help you get through these questions more easily.

Nevertheless, I find that people who want to learn math first before approaching deep learning miss the point. Many engineering topics were not motivated by pure mathematical pursuit. More often than not, an engineering field is motivated by a physical observation, and mathematics is more like an aid to imagine and create a new solution. Take deep learning: if you listen to Hinton, he often says that he first tries to come up with an idea and makes it work mathematically later. His insistence on working on neural networks in the era of kernel methods stemmed more from his observations of the brain. “If the brain can do it, how come we can’t?” should be a question you ask every day when you run a deep learning algorithm. I think these observations are fundamental to deep learning, and you should go through the arguments for why people think neural networks are worthwhile in the first place. Reading the classic papers by Wiesel and Hubel helps. Understanding the history of neural networks helps. Once you have read these materials, you will quickly grasp the big picture behind much of the development of deep learning.

That said, I think certain topics are fundamental in deep learning, and they are not necessarily very mathematical. For example, I would name back propagation [2] as a very fundamental concept which you want to get good at. Now, you may think that’s silly. “I know backprop already!” Yes, backprop is probably taught in every single machine learning class, which easily gives you the illusion that you have mastered the material. But you can always learn more about a fundamental concept, and back propagation is important both theoretically and practically. You will encounter back propagation whether you are a user of deep learning tools, a writer of a deep learning framework or an innovator of new algorithms. So a thorough understanding of backprop is very important, and one course is not enough.

This very long digression finally brings me to a great introductory book, Michael Nielsen’s Neural Networks and Deep Learning (NNDL). The reason I think Nielsen’s book is important is that it offers an alternative discussion of back propagation as an algorithm. So I will use the rest of the article to explain why I appreciate the book so much and why I recommend it to nearly all beginning and intermediate learners of deep learning.

First Impression

I first learned about “Neural Networks and Deep Learning” (NNDL) while going through TensorFlow’s tutorial. My first thought was “ah, another blogger trying to cover neural networks”, i.e. I didn’t think it was promising. At that time, there were already plenty of articles about deep learning. Unfortunately, they often repeated the same topics without bringing anything new.

Synopsis

Don’t make my mistake! NNDL is a great introductory book which balances the theory and practice of deep neural networks. The book has 6 chapters:

  1. Using neural networks to recognize digits – the basics of neural networks, a basic implementation in Python (network.py)
  2. How the backpropagation algorithm works – various explanations of back propagation
  3. Improving the way neural networks learn – standard improvements to simple back propagation, another implementation in Python (network2.py)
  4. A visual proof that neural nets can compute any function – the universal approximation theorem without the math, plus fun interactive games in which you approximate functions yourself
  5. Why are deep neural networks hard to train? – practical difficulties of using back propagation, vanishing gradients
  6. Deep Learning – convolutional neural networks (CNN), the final implementation based on Theano (network3.py), recent advances in deep learning (circa 2015).

The accompanying Python scripts are the gems of the book. network.py and network2.py run in plain old Python (plus NumPy). You need Theano for network3.py, but I think the strength of the book really lies in network.py and network2.py (Chapters 1 to 3), because if you want to learn CNNs, Karpathy’s lectures probably give you more bang for your buck.

Why I Like Nielsen’s Treatment of Back Propagation

Reading Nielsen’s exposition of neural networks was the sixth time I learned the basic formulation of back propagation [see footnote 3]. So what’s the difference between his treatment and my other reads?

Forget about my first two reads, because I didn’t care about neural networks enough to know why back propagation is so named. But my later reads pretty much gave me the same impression of neural networks: “A neural network is merely a stack of logistic functions. So how do you train the system? Oh, just differentiate the loss function; the rest is technicalities.” Usually the books guide you to verify certain formulae in the text. Of course, you will be guided to deduce that the “error” is actually “propagating backward” through the network. Let us call this the network-level view. In a network-level view, you really don’t care about how individual neurons operate. All you care about is seeing the neural network as yet another machine learning algorithm.

The problem with the network-level view is that it doesn’t quite explain a lot of phenomena about back propagation. Why is it so slow sometimes? Why do certain initialization schemes matter? Nielsen does an incredibly good job of breaking down the standard equations into 4 fundamental equations (BP1 to BP4 in Chapter 2). Once you interpret them, you will realize “Oh, saturation is really a big problem in back propagation” and “Oh, of course you have to initialize the weights of a neural network with non-zero values, or else nothing propagates forward or backward!” These insights, while not mathematical in nature and understandable with college calculus, give a deeper understanding of back propagation.
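To make these two observations concrete, here is a minimal sketch in plain Python. It is not taken from Nielsen’s code; the function names and the numbers are my own, chosen only for illustration. It shows that the sigmoid’s derivative, which appears as a factor in the error term, shrinks toward zero once a neuron saturates, and that with all-zero outgoing weights the error passed back to an earlier layer is exactly zero.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid; it is largest at z = 0 and tiny for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagated_error(weights, delta_next, z):
    """Error of one neuron computed from the errors of the next layer.

    weights[k] is the weight from this neuron to neuron k in the next layer,
    delta_next[k] is that neuron's error, and z is this neuron's weighted input.
    This is the idea behind the second fundamental equation, for a single neuron.
    """
    weighted = sum(w * d for w, d in zip(weights, delta_next))
    return weighted * sigmoid_prime(z)


# Saturation: the same error signal is scaled by a much smaller slope.
slope_active = sigmoid_prime(0.0)      # neuron in its sensitive region
slope_saturated = sigmoid_prime(10.0)  # neuron pushed deep into saturation

# Zero initialization: nothing reaches the earlier layer.
delta_zero_init = backpropagated_error([0.0, 0.0], [0.3, -0.2], z=0.0)
delta_nonzero_init = backpropagated_error([0.5, -0.4], [0.3, -0.2], z=0.0)

print(slope_active, slope_saturated)
print(delta_zero_init, delta_nonzero_init)
```

The saturated slope is several thousand times smaller than the active one, which is one way to see why learning can be so slow.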

Another valuable part of Nielsen’s explanation is that it comes with an accessible implementation. His first implementation (network.py) is 74 lines of idiomatic Python. By adding print statements to his code, you will quickly grasp how a lot of these daunting equations are implemented in practice. For example, as an exercise, you can try to identify how he implements BP1 to BP4 in network.py. It’s true that there are many books and implementations about neural networks, but the description and the implementation don’t always come together. Nielsen’s presentation is a rare exception.
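As a warm-up for that exercise, here is a rough sketch of the four equations on the smallest possible network: one input, one hidden neuron and one output neuron, all scalars, with a quadratic cost. This is my own toy code, not an excerpt from network.py, and the parameter values are arbitrary. The analytic gradients are compared against finite differences as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cost(params, x, y):
    """Quadratic cost of a 1-1-1 sigmoid network on one training example."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return 0.5 * (a2 - y) ** 2


def backprop(params, x, y):
    """Return [dC/dw1, dC/db1, dC/dw2, dC/db2] using the four equations."""
    w1, b1, w2, b2 = params
    # Forward pass, keeping weighted inputs (z) and activations (a) apart.
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    # BP1: error at the output layer.
    delta2 = (a2 - y) * sigmoid_prime(z2)
    # BP2: error at the hidden layer, pulled back through the weight w2.
    delta1 = w2 * delta2 * sigmoid_prime(z1)
    # BP3: the gradient for a bias is the error itself.
    # BP4: the gradient for a weight is (input activation) * (error).
    return [x * delta1, delta1, a1 * delta2, delta2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up, x, y) - cost(down, x, y)) / (2 * eps))
    return grads


params = [0.8, -0.3, 1.5, 0.2]  # arbitrary w1, b1, w2, b2
analytic = backprop(params, x=0.7, y=1.0)
numeric = numerical_gradient(params, x=0.7, y=1.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic)
print(max_gap)
```

Once the scalar version is clear, the matrix form in network.py should read as the same four steps with dot products and transposes in place of the multiplications.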

Other Small Things I Like

  • Nielsen correctly points out that the Del symbol in machine learning is more of a convenient notational device than the Del operator in its usual mathematical sense.
  • In Chapter 4, Nielsen discusses the universal approximation property of neural networks. Unlike a standard textbook, which points you to a bunch of papers with daunting math, Nielsen created a JavaScript demo which allows you to approximate functions yourself (!), which I think is a great way to learn the intuition behind the theorem.
  • He points out that it’s important to distinguish the activation from the weighted input. In fact, this is one thing which can confuse you when reading a derivation of back propagation, because different textbooks use different symbols for the activation and the weighted input.

There are many such insightful comments in the book, and I encourage you to read it and discover them.

Things I Don’t Like

  • There are many exercises in the book. Unfortunately, there are no answer keys. In a way, this makes Nielsen more of an old-style author who encourages readers to think. I guess this is something I don’t always like, because spending forever thinking about one single problem doesn’t always give you a better understanding.
  • Chapter 6 gives the final implementation in Theano. Unfortunately, there is not much introductory material on Theano within the book. I think this is annoying but forgivable; as Nielsen pointed out, it is hard to introduce Theano within an introductory book. I would think anyone interested in Theano should probably go through the standard Theano tutorials here and here.

Conclusion

All in all, I highly recommend Neural Networks and Deep Learning to any beginning or intermediate learner of deep learning. If this is the first time you are learning back propagation, NNDL is a great general introductory book. If you are like me and already know a thing or two about neural networks, NNDL still has a lot to offer.

Arthur

[1] In my view, PRML’s problem sets have 3 ratings: 1-star, 2-star and 3-star. A 1-star problem usually requires college-level calculus and patient manipulation; a 2-star problem requires some creative thought in problem solving or knowledge beyond basic calculus. 3-star problems are more long-form questions and can contain multiple 2-star questions in one. For your reference, I solved around 100 out of the 412 questions. Most of them are 1-star questions.

[2] The other important concept in my mind is gradient descent, and it is still an active research topic.

[3] The five earlier reads: I learnt it once back at HKUST, read it in Mitchell’s book, read it in Duda and Hart, learnt it again from Ng’s lectures, and read it again in PRML. My 7th was to learn it from Karpathy’s lectures; he presents the material in yet another way. So it’s worth your time to look at them.

If you like this message, subscribe to the Grand Janitor Blog’s RSS feed. You can also find me (Arthur) on Twitter, LinkedIn, Plus and Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum. Also check out my awesome employer: Voci.

Some Thoughts on Learning Machine Learning/Data Science

I have been refreshing myself on various aspects of machine learning and data science. For the most part it has been a very nice experience. What I like most is that I am finally able to grok much of the machine learning jargon people talk about. This jargon gave me a lot of trouble even as a mere practitioner of machine learning, because most people just assume you have some understanding of what they mean.

Here is a little secret: these terms range from very shallow to very deep. For instance, “lasso” just means using a regularization term with exponent 1, i.e. penalizing the sum of the absolute values of the weights. I always think people just don’t want to say the mouthful “set the exponent of the regularization term to 1”, so they came up with “lasso”.
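To show how little is hiding behind the word, here is a tiny sketch in plain Python. The function name, the weights and the strength `lam` are made up for illustration. The penalty added to the training loss is the same formula in both cases; only the exponent changes.

```python
def penalty(weights, lam, exponent):
    """Regularization term: lam * sum(|w| ** exponent) over all weights."""
    return lam * sum(abs(w) ** exponent for w in weights)


weights = [0.5, -2.0, 0.0]  # hypothetical model weights

lasso_penalty = penalty(weights, lam=0.1, exponent=1)  # exponent 1: "lasso"
ridge_penalty = penalty(weights, lam=0.1, exponent=2)  # exponent 2: "ridge"

print(lasso_penalty, ridge_penalty)
```

With exponent 1 the penalty grows linearly in each weight, which is why lasso tends to push small weights all the way to zero.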

Then there is the bias-variance trade-off. Now here is a concept which is very hard to explain well. What opened my mind is what Andrew Ng said in his Coursera lecture: “just forget the term bias and variance”. Then he moves on to talk about over- and under-fitting, which is a much easier concept to understand. And then he leads you to think: when a model underfits, we have an estimator that has “huge bias”, and when the model overfits, the estimator allows too much “variance”. Over- and under-fitting can be visualized. Anyone who understands polynomial regression would understand what overfitting is. That easily leads you to a eureka moment: “Oh, complex models can easily overfit!” That’s actually the key to understanding the whole phenomenon.
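If you would rather see it in numbers than in a plot, here is a small sketch in plain Python. The data is invented: the true relationship is y = x, with a fixed alternating “noise” of plus or minus 0.5 so that the run is repeatable. The simple model is a least-squares line; the complex model is the degree-5 polynomial that passes through all six training points exactly. All names are my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line (the simple model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Polynomial through every training point, in Lagrange form (the complex model)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
ys = [x + e for x, e in zip(xs, noise)]  # noisy samples of y = x

simple = fit_line(xs, ys)
complex_model = fit_interpolating_polynomial(xs, ys)

train_error_simple = mean_squared_error(simple, xs, ys)
train_error_complex = mean_squared_error(complex_model, xs, ys)

# A held-out point between two training points; the true value there is 0.5.
held_out_x, held_out_y = 0.5, 0.5
held_out_error_simple = abs(simple(held_out_x) - held_out_y)
held_out_error_complex = abs(complex_model(held_out_x) - held_out_y)

print(train_error_simple, train_error_complex)
print(held_out_error_simple, held_out_error_complex)
```

The complex model wins on the training points and loses badly on the held-out point: it has fit the noise, which is the “too much variance” side of the trade-off.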

Not only are people getting better at explaining different concepts; several important ideas are also enunciated better. For example, reproducibility is huge, and it should be huge in machine learning as well. Yet even now you see junior scientists in entry-level positions ignore all the important measures that make their work reproducible. That’s a pity. In speech recognition, for example, I remember there was a dark time when training a broadcast news model was very difficult, despite the fact that we knew people had done it before. How much time do people waste repeating other people’s work?

Nowadays, perhaps I would just tell younger scientists to take Johns Hopkins’ “Reproducible Research” class. No kidding. Pay $49 to finish that class.

Anyway, that’s my rambling for today. Before I go: I have been actively engaged in Facebook’s Deep Learning group. It turns out many of the forum users would love to hear more about how to learn deep learning. Perhaps I will write more about it in the future.

Arthur

Some Speculations On Why Microsoft Tay Collapsed

Microsoft’s Tay, following Google’s AlphaGo, was meant to be yet another highly intelligent A.I. program, one which would fulfill humanity’s long-standing dream: a machine which can truly converse. But as you know, Tay failed spectacularly. To me, this is a highly unusual event. Part of it is that another Microsoft conversational agent, Xiaoice, was extremely successful in China. The other part is that MSR is one of the leading sites in applying deep learning to various machine learning problems. You would think that a major P.R. problem, such as Tay confirming “Donald Trump is the hope” and purportedly supporting genocide, would have been weeded out before launch.

I read many posts in the past week which attempted to describe why Tay failed; sadly, they offered me no insights. Some were even written for respected magazines, e.g. in The New Yorker’s “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter”, the author concluded at the end:

“If there is a lesson to be learned, it is that consciousness wants conscience. Most consumer-tech companies have, at one time or another, launched a product before it was ready, or thought that it was equipped to do something that it ended up failing at dismally.”

While I always love the prose of The New Yorker, there is really no machine which can mimic or model human consciousness (yet). In fact, no one really knows how “consciousness” works, and it is also tough to define what “consciousness” is. It’s also worthwhile to mention that chatbot technology is not new. Google had released similar technology and got great press. (See here.) So The New Yorker’s piece reflects how little the public understands the technology.

As a result, I decided to write a Tay postmortem myself, and to offer some thoughts on why this problem could occur and how one could actively avoid such problems.

Since I am trying to write this piece for a general audience (say, my Facebook friends), it contains only a small amount of technical detail. If you are interested, I also list several more technical articles in the reference section.

How does a Chatbot work?  The Pre-Deep Learning Version

By now, all of us have used a chat bot or two. There is obviously Siri, which is perhaps the first program to put speech recognition and dialogue systems in the national spotlight. If you are familiar with the history of computing, you probably know ELIZA [1], an early example of using a rule-based approach to respond to users.

What does it mean? In such a system, a natural language parser is usually used to parse the human’s input, and the system then comes up with an answer using some pre-defined, mostly manually written rules. It’s a simple approach, but when it’s done correctly, it creates an illusion of intelligence.
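Here is a toy sketch of what such a system looks like, in the spirit of ELIZA but not a reproduction of it. The rules, the templates and the fallback line are all invented for illustration, and regular expressions stand in for a real natural language parser.

```python
import re

# Each rule pairs a pattern with a response template. All rules are hand-written.
RULES = [
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
FALLBACK = "Please go on."


def respond(utterance):
    """Return the response of the first rule that matches, else a stock reply."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Strip trailing punctuation so the captured text fits the template.
            groups = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*groups)
    return FALLBACK


print(respond("I feel tired today."))
print(respond("My mother called yesterday"))
print(respond("What is the capital of France?"))
```

The last call falls through to the stock reply, which is exactly where the illusion tends to break: anything the rule writer did not anticipate gets the same canned answer.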

The rule-based approach can go quite far. For example, the ALICE language (AIML) is a pretty popular tool for creating intelligent-sounding bots. (History as shown in here.) There are many existing tools which help programmers create dialogues. Programmers can also import existing dialogues into their own systems.

The problem with the rule-based approach is obvious: the responses are rigid. So if someone uses the system for a while, they will easily notice that they are talking to a machine. In a way, you can say the illusion is easily dispelled by human observation.

Another issue with the rule-based approach is that producing a large-scale chat bot taxes programmers. Even with convenient languages such as AIML (Artificial Intelligence Markup Language, the language behind ALICE), it would take a programmer a long, long time to come up with a chat bot, not to mention one which can answer a wide variety of questions.

Converser as a Translator

Before we go on to look at chat bots in the time of deep learning, it is important to ask how we can model conversation. Of course, you can think of it as … well … first parsing the sentence and generating entities and their grammatical relationships, then coming up with an answer based on those relationships.

This approach of decomposing a sentence into its elements is very natural to human beings. In a way, this is also how the rule-based approach arose in the first place. But we just discussed the weaknesses of the rule-based approach, namely that it is hard to program and hard to generalize.

So here is a more convenient way to think. You could simply ask, “Hey, now I have an input sentence, what is the best response?” It turns out this is very similar to the formulation of statistical machine translation: “If I have an English sentence, what would be the best French translation?” As it turns out, a converser can be built with the same principles and technology as a translator. So all the powerful technology developed for statistical machine translation (SMT) can be used to make a conversation bot. This technology includes the IBM models, phrase-based models and syntax-based models [2], and the training is very similar.
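In symbols, one common way to write this question is the noisy-channel formulation used in classical SMT. The notation here is my own shorthand rather than that of any particular system: q is the input sentence, r ranges over candidate responses, and the chosen response is the most probable one given the input.

```latex
\hat{r} \;=\; \arg\max_{r} P(r \mid q) \;=\; \arg\max_{r} P(q \mid r)\, P(r)
```

The second form follows from Bayes’ rule, since P(q) does not depend on r. In translation terms, P(q | r) plays the role of the translation model and P(r) the role of the language model over responses; replace “response” with “French sentence” and “input” with “English sentence” and you have the translation problem back.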

In fact, this is how many chat bots were made just before deep learning arrived. Some methods simply use an existing translation system to “translate” an input into a response, trained on input-response pairs, e.g. [3].

The good thing about using a statistical approach is that it generalizes much better than the rule-based approach. Also, as the program is based on machine learning, all you have to do is (carefully) prepare a bunch of training data. Existing machine learning programs will then help you come up with a system automatically. This frees the programmer from long and tedious tweaking of the bot.

How does a Chatbot work?  The Deep Learning Version

Now, given what we have discussed, how does Microsoft’s chat bot Tay work? Since we don’t know Tay’s implementation, we can only speculate:

  1. Tay is smart, so it doesn’t sound like a purely rule-based system. So let’s assume it is based on the aforementioned “converser-as-translator” paradigm.
  2. It’s Microsoft, so there has got to be some deep neural network. (Microsoft was one of the first sites to pick up the modern “deep” neural network paradigm.)
  3. What’s the data? Well, given that Tay was built for millennials, the people who trained Tay must have used dialogues between teenagers. If I did research for Microsoft [4], maybe I would use data collected from Microsoft Messenger or Skype. Since Microsoft has the age data for all users, the data can easily be segmented and bundled for training.

So let’s piece everything together. Very likely, Tay is a neural network (NN)-based program which can intelligently translate a user’s natural language input into a response. The program’s training is based on chat data, and my speculation is that the data is exactly where things went wrong. Before I conclude, note that the neural network in question is likely to be a long short-term memory (LSTM) network. I believe Google’s researchers were the first to advocate such an approach [5] (it made headlines last year, and the bot is known for its philosophical undertone). Microsoft has published a couple of papers on how LSTMs can be used to model conversation [6]. There is also existing software online which can be used to build such bots, e.g. Andrej Karpathy’s char-rnn. So it’s likely that Tay is based on such an approach. [7]

 

What goes wrong then?

Well, Tay is just a machine learning program, so her behavior is really governed by the training material. Since the training data is likely to be chat data, we can only conclude that the data must have contained some offensive speech, given the political landscape of the world. So one reasonable hypothesis is that the researchers who prepared the training material didn’t really filter out hate speech and other sensitive topics. I guess one potential reason for not doing so is that filtering would reduce the amount of training data. But given the amount of data owned by Microsoft, that doesn’t make sense. Say only 20% of 1 billion conversations survive filtering: that is still 200 million, which is more than enough to train a good chatterbot. So I tend to think the issue is human oversight.

And then, as a simple fix, you can also give the bot a list of keywords. For example, you can just program a simple regular expression match for “Hitler”, then make sure there is a special rule which responds to the user with “No comment”. At least the consequence wouldn’t be as huge as a takedown. Then again, this is another indication that there were oversights in the development. Had more time been spent on testing the program, this kind of issue would have been noticed and rooted out.
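A minimal sketch of such a guard in Python might look like the following. It is purely illustrative: the keyword list is a stand-in (a real deployment would need a far longer, carefully maintained list), and `echo_model` is a hypothetical placeholder for the learned response generator, since we know nothing about Tay’s actual code. The check is applied both to what the user says and to what the model is about to say.

```python
import re

# Illustrative blocklist only; a production list would be much larger.
BLOCKED = re.compile(r"\b(hitler|genocide)\b", re.IGNORECASE)
CANNED_REPLY = "No comment"


def guarded_reply(user_input, generate_reply):
    """Wrap a response generator with a keyword filter on input and output."""
    if BLOCKED.search(user_input):
        return CANNED_REPLY
    reply = generate_reply(user_input)
    if BLOCKED.search(reply):
        return CANNED_REPLY
    return reply


def echo_model(text):
    """Hypothetical stand-in for the learned chat model."""
    return "You said: " + text


print(guarded_reply("What do you think of Hitler?", echo_model))
print(guarded_reply("Nice weather today", echo_model))
```

Filtering the output as well as the input matters here, because a model trained on unfiltered chat data can produce an offensive reply to a perfectly innocent prompt.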

Conclusion

In this piece, I came up with a couple of hypotheses for why Microsoft’s Tay failed. In the end, I echo the title of The New Yorker’s piece, “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter” … at least partially. Tay is perhaps one of the smartest chatterbots, backed by one of the strongest research organizations in the world and trained on tons of data. But it was not destroyed by Twitter or trolls. More likely, it was destroyed by human oversight and a lack of testing. In this sense, its failure is not too different from why much software fails.

Reference/Footnote

[1] Weizenbaum, Joseph “ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine”, Communications of the ACM 9 (1): 36–45,

[2] Philipp Koehn, Statistical Machine Translation

[3] Alan Ritter, Colin Cherry, and William Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583–593. Association for Computational Linguistics.

[4] Whoa! I could only dream! But I prefer to work on speech recognition rather than chatterbots.

[5] Oriol Vinyals and Quoc V. Le, A Neural Conversational Model.

[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models

[7] A more technical point here: using an LSTM, a type of recurrent neural network (RNN), also resolves one issue of classical models such as the IBM models, whose language model is usually an n-gram model with limited long-range prediction capability.