Categories
ANN deep learning Language Modeling SMT

Some Speculations On Why Microsoft Tay Collapsed

Microsoft’s Tay, following Google AlphaGo, was meant to be yet another highly intelligent A.I. program which fulfill human’s long standing dream: a machine which can truly converse.   But as you know, Tay fails spectacularly.  To me, this is a highly unusual event, part of it is that Microsoft’s another conversation agent, Xiaoice, was extremely successful in China.   The other part is MSR, is one of the leading sites on using deep learning in various machine learning problems.   You would think that a major P.R. problem such as Tay confirming “Donald Trump is the hope”,  and purportedly support genocide should be weeded out before launch.

As I read many posts in the past week attempted to describe why Tay fails, sadly they offer me no insights.  Some even written from respected magazines, e.g. in New Yorkers‘ “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter” at the end the author concluded,

“If there is a lesson to be learned, it is that consciousness wants conscience. Most consumer-tech companies have, at one time or another, launched a product before it was ready, or thought that it was equipped to do something that it ended up failing at dismally. “

While I always love the prose from New Yorkers, there is really no machine which can mimic/model  human consciousness (yet).   In fact, no one really knows how “consciousness” works, it’s also tough to define what “consciousness” is.   And it’s worthwhile to mention that chatbot technology is not new.   Google had released similar technology and get great press.  (See here)  So the New Yorkers’ piece reflect how much the public does not understand technology.

As a result, I decided to write a Tay’s postmortem myself, and offer some thoughts on why this problem could occur and how one could actively avoid such problems.

Since I try to write this piece for general audience, (say my facebook friends), the piece contains only small amount of technicalities.   If you are interested, I also list several more technical articles in the reference section.

How does a Chatbot work?  The Pre-Deep Learning Version

By now,  all of us use a chat bot or two, there is obviously Siri, which perhaps is the first program which put speech recognition and dialogue system in the national spotlight.  If you are familiar with history of computing, you would probably know ELIZA [1], which is the first example of using rule-based approach to respond to users.

What does it mean?  In such system, usually a natural language parser is used to parse human’s input, then come up with an answer with some pre-defined and mostly manually rules.    It’s a simple approach, but when it’s done correctly.   It creates an illusion of intelligence.

Rule-base approach can go quite far.  e.g. The ALICE language is a pretty popular tool to create intelligent sounding bot. (History as shown in here.)   There are many existing tools which help programmers to create dialogue.   Programmer can also extract existing dialogues into the own system.

The problem of rule-based approach is obvious: the response is rigid.  So if someone use the system for a while, they will easily notice they are talking with a machine.  In a way, you can say the illusion can be easily dispersed by human observation.

Another issue of rule-based approach is it taxes programmers to produce a large scale chat bot.   Even with convenient languages such as AIML (ALICE Markup Language), it would take a programmer a long long time to come up with a chat-bot, not to say one which can answer a wide-variety of questions.

Converser as a Translator

Before we go on to look at chat bot in the time of deep learning.  It is important to ask how we can model conversation.   Of course, you can think of it as … well… we first parse the sentence, generate entities and their grammatical relationships,  then based on those relationships, we come up with an answer.

This approach of decomposing a sentence to its element, is very natural to human beings.   In a way, this is also how the rule-based approach arise in the first place.  But we just discuss the weakness of rule-based approach, namely, it is hard to program and generalize.

So here is a more convenient way to think, you could simply ask,  “Hey, now I have an input sentence, what is the best response?”    It turns out this is very similar to the formulation of statistical machine translation.   “If I have an English sentence, what would be the best French translation?”    As it turns out, a converser can be built with the same principle and technology as a translator.    So all powerful technology developed for statistical machine translation (SMT) can be used on making a conversation bot.   This technology includes I.B.M. models, phrase-based models, syntax model [2]   And the training is very similar.

In fact, this is how many chat bots was made just before deep-learning arrived.    So some method simply use an existing translator to translate input-response pair.    e.g. [3]

The good thing about using a statistical approach, in particular, is that it generalizes much better than the rule-based approach.    Also, as the program is based on machine learning, all you have to do is to prepare (carefully) a bunch of training data.   Then existing machine learning program would help you come up with a system automatically.   It eases the programmer from long and tedious tweaking of the bot.

How does a Chatbot work?  The Deep Learning Version

Now given what we discuss, then how does Microsoft’s chat bot Tay works?   Since we don’t know Tay’s implementation, we can only speculate:

  1. Tay is smart, so it doesn’t sound like a purely rule-based system.  so let’s assume it is based on the aforementioned “converser-as-translator” paradigm.
  2. It’s Microsoft, there got to be some deep neural network.  (Microsoft is one of the first sites picked up the modern “deep” neural network” paradigm.)
  3. What’s the data?  Well,  given Tay is built for millennials, the guy who train Tay must be using dialogue between teenagers.  If I research for Microsoft [4],  may be I would use data collected from Microsoft Messenger or Skype.   Since Microsoft has all the age data for all users, the data can easily be segmented and bundled into training.

So let’s piece everything together.  Very likely,  Tay is a neural-network (NN)-based program which can intelligently translate an user’s natural language input to a response.    The program’s training is based on chat data.   So my speculation is the data is exactly where things goes wrong.   Before I conclude, the neural network in question is likely to be an Long-Short Term Model (LSTM).    I believe Google’s researchers are the first advocate such approach [5] (headlined last year and the bot is known for its philosophical undertone.) Microsoft did couple of papers on how LSTM can be used to model conversation.  [6].    There are also several existing bot building software on line e.g. Andrej Karpathy ‘s char-RNN.    So it’s likely that Tay is based on such approach. [7]

 

What goes wrong then?

Oh well, given that Tay is just a machine learning program.  Her behavior is really governed by the training material.   Since the training data is likely to be chat data, we can only conclude the data must contain some offensive speech, given the political landscape of the world.   So one reasonable hypothesis is the researcher who prepares the training material hadn’t really filter out topics related to hate speech and sensitive topics.    I guess one potential explanation of not doing that is that filtering would reduce the amount of training data.     But then given the data owned by Microsoft,  it doesn’t make sense.  Say 20% of 1 billion conversation is still a 200 million, which is more than enough to train a good chatterbot.  So I tend to think the issue is a human oversight. 

And then, as a simple fix,  you can also give the robots a list of keywords, e.g. you can just program  a simple regular expression match of “Hitler”,  then make sure there is a special rule to respond the user with  “No comment”.   At least the consequence wouldn’t be as huge as a take down.     That again, it’s another indication that there are oversights in the development.   You only need to spend more time in testing the program, this kind of issues would be noticed and rooted out.

Conclusion

In this piece, I come up with couple of hypothesis why Microsoft Tay fails.   At the end, I echo with the title of New Yorker’s piece: “I’ve Seen the Greatest A.I. Minds of My Generation Destroyed by Twitter” …. at least partially. Tay is perhaps one of the smartest chatter bots, backed by one of the strongest research organization in the world, trained by tons of data. But it is not destroyed by Twitter or trolls. More likely, it is destroyed by human oversights and lack of testing. In this sense, it’s failure is not too different from why many software fails.

Reference/Footnote

[1] Weizenbaum, Joseph “ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine”, Communications of the ACM 9 (1): 36–45,

[2] Philip Koehn, Statistical Machine Translation

[3] Alan Ritter, Colin Cherry, and William Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583–593. Association for Computational Linguistics.

[4] Woa! I could only dream! But I prefer to work on speech recognition, instead of chatterbot.

[5] Oriol Vinyal, Le Quoc, A Neural Conversational Model.

[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models

[7] A more technical point here: Using LSTM, a type of recurrent neural network (RNN), also resolved one issue of the classical models such as IBM models because the language model is usually n-gram which has limited long-range prediction capability.

Categories
Dan Jurafsky Dependency Parsing Dragomir Radev HMM Language Modeling Machine Learning Natural Language Processing Parsing POS tagging Programming Python SMT Word Sense Disambiguation

Radev’s Coursera Introduction to Natural Language Processing – A Review

As I promised earlier, I was going to review Prof.  Dragomir Radev’s introductory class on natural language processing.   Few words about Prof. Radev: from his Wikipedia entry, Prof Radev is an award winning Professor, who co-found North American Computational Linguistics Olympiad (NACLO), which is the equivalent of USAMO in computational linguistics. He was also the coach of U.S. coach of International Language Olympiad 2011 and helped the team won several medals [1].    I think these are great contributions to the speech and language community.  In late 90s, when I was still in undergraduate, there was not much recognition of computational language processing as an important computation skill.    With competition in high-school or college level, there will be a generation of young minds who would aspire to build intelligent conversation agent, robust speech recognizer and versatile question and answering machine.   (Or else everyone would think Linux kernel is the only cool hack in town. 🙂 )

The Class

So how about the class?  I got to say I am more than surprised and happy with it.   I was searching for an intro NLP class, so the two natural choices was Prof. Jurafsky’ and Manning’ s and Prof.  Collin’s Natural Language Processing.   Both classes received great praise and comments and few of my friends recommend to take both.   Unfortunately, there was no class offering recently so I could only watch the material off-line.

Then there comes the Radev’s class,  it is as Prof. Radev explains: “more introductory” than Collin’s class and “more focused on Linguistics and resources” than Jurafsky and Manning.   So it is good for two types of learners:

  1. Those who just started out in NLP.
  2. Those who want to gather useful resources and start projects on NLP.

I belong to both types.   My job requires me to have more comprehensive knowledge of language and speech processing.

The Syllabus and The Lectures

The class itself is a brief survey of many important topics of NLP.   There are the basics:  parsing, tagging, language modeling.  There are the advanced topics such as summarization, statistical machine translation (SMT), semantic analysis and dialogue modeling.   The lectures, except occasionally mistakes, are quite well done and filled with interesting examples.

My only criticism is perhaps the length of videos, I would hope that most videos I watch would be less than 10 minutes.    That makes it easier to rotate with my other daily tasks.

The material is not too difficult to absorb for newcomers.   For starter, advanced topic such as  SMT is not covered in too much detail mathematically.  (So no need to derive EM on IBM models.)  That I think it’s quite appropriate for first time learners like me.

One more unique feature of the lectures: it fills with interesting NACLO problems.    While NACLO is more a high-school level competition, most of the problems are challenging even for experienced practitioners.  And I found them quite stimulating.

The Prerequisites and The Homework Assignments

To me, the fun part is the homework.   There were 3 of them, they focus on,

  1. Nivre’s Dependency Parser,
  2. Language Modeling and POS Tagging,
  3. Word Sense Disambiguation

All homework are based on python.   If you know what you are doing, they are not that difficult to do.   For me, I spent around 12-14 hours on each.   (Those are usually weekends.) Just like Ng’s Machine Learning class,   you need to match numbers with  the golden reference.   I think that’s the right approach to learn any machine learning task the first time.   Blindly come up with a system and hope it works never get you anywhere.

The homework does speak about an issue of the class, i.e. you do need to know the basics of Machine Learning .  Also, if you never had any programming experience would find the homework very difficult.   This probably described many linguistic students but never take any computer science classes.  [3]    You can still “power it through” and pass.  But it can be unnecessarily hard.

So I will recommend you first take the Ng’s class or perhaps the latest Machine Learning specialization from Guestrin and Fox first.   Those are the classes which would give you some basics of programming as well as basic concept of Machine Learning.

If you didn’t take any machine learning class, one way to go through more difficult classes like this is to read forum messages.   There are many nice people in the course was answering various questions.   To be frank, if the forum doesn’t exist, then it will take me around 3 times more time to finish all assignments.

Final Word

All-in-all, I highly recommend Prof. Radev’s class to anyone who is interested in NLP.    As I mentioned though, the class does require prerequisite such as basics of programming and machine learning.   So  I would recommend any learners to first take the Ng’s class before taking this one.

In any case, I want to thank Prof. Radev and all teaching staffs who prepare this wonderful course.   I also thank to many classmates who help me through the homework.

Arthur

Postscript at 2017 April

After I wrote this review, Coursera had since upgraded to the new format.  It’s a pity none of the NLP classes, including Prof. Radev’s survive.   To bad for NLP lovers!

There is also a seismic shift in the field of NLP toward deep learning. While deep learning does not dominate evaluations like in computer vision or speech recognition, it is perhaps the most actively researched direction right now.  So if you are curious about what’s new, consider to take the latest Standford cs224n 2017 or Oxford’s Deep Learning for NLP.

[1] http://www.eecs.umich.edu/eecs/about/articles/2010/Radev-Linguistics.html

[2] Week 1 Lecture 1 Introduction

[3] One anecdote:  In the forum, some students was asking why you can’t just sum all data points of a class together and pour into scikit-learn’s fit().    I don’t blame the student because she started late and lacks of prerequisite.   She later finished all assignment and I really admire her determination.

Categories
acoustic modeling data collection Language Modeling MFE MMIE MPE open source speech recognition Sphinx Thought

How to Ask Questions in the Sphinx Forum?

Many go to different open source toolkits to look for a ready-to-use speech recognizer, and seldom get what they want.   Many feel disappointed and curse that developers of open source speech recognizer just couldn’t catch up with commercial product.   Few know why and few decide to write about the reason.

People in the field blame Hollywood for lion share of the problem.  Indeed, many people believe ASR should work similarly to scenes of Space Odyssey 2001 or Star Trek.   We are far far away from there.   You may say SIRI is getting close.  True.   But when you look closer, SIRI doesn’t always get what you say right, her strength lies on the very intelligent response system.

Unlike compilers such as GCC, speech recognition toolkit such as the CMU Sphinx project HTK are toolkits.   The mathematical models these toolkits provided were trained and fit to certain group of samples. Whereas, applications such as Google Voice or SIRI gather 100 or even 1000 times more data when they train a model.   This is the fundamental reason why you don’t get the premium recognition rate you think you entitled to.

Many people (me included) saw that as a problem.  Unfortunately, to collect clean transcribed data has always been a problem.   Voxforge is the only attempt I am aware of to resolve the issue.    They are still growing up but it will be a while they can collect enough data to rival with commercial applications.

* * *
Now what does that tell you when you ask questions in CMU Sphinx or other speech recognition forum?   For users who expect out-of-the-box super performance, I would say “Sorry, we are not there yet.”  In fact, speech recognition, in general, is probably not in performance shown in the original Star Trek yet (that will require accent adaptation and very good noise cancellation since the characters seem to be able to use the recognizer any time they like).

How about many users who have a little bit (or much) programming background? I would say one thing important.  As a programmer, you probably get used to look at the code, understand what it’s done, do something cute and feel awesome from time-to-time.  You can’t do that if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst.  For example, when you feel the recognition rate is bad, what is your evidence?  What is your data set?  What is the size of your data set? If you have a set, can you share the set?   If you don’t have numerical measure, have you at least use pencil or paper to mark down at least some results and some mistakes? Report them when you ask questions, then you will get useful answers back.

If you go to look at programming forum, many ask questions with the source such that people can repeat the problem easily.    Some even go further to pinpoint location of the problem.    This is probably what you want to do if you get stuck.

* * *

Before I end this post, let’s also bring up the issue of how usually ASR problem is solved?  Like…… if you see performance is bad, what should you do?

Some speech recognition problems can be solved readily.  For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly.  If you see completely crappy speech recognition performance, then I will first check if the front-end of decoder match exactly as the front-end used to train the models.

For the rest,  the strength of the model is really the issue.   So most of your time should spend on learning and understanding techniques of model improvement.    For example, do you want to collect data and boost up your acoustic model?  Or if you know more about the domain, can you crawl some text on the web and help your language model?   Those are the first ideas you should think about.

There are also an exoteric group of people in the world who ask a different question, “Can we use a different estimation algorithm to make the better?”  That is the basis of MMIE, MPE and MFE.   If you found yourself mathematically proficient (perhaps need to be very proficient……), then learning those techniques and implement some of them would help boosting up the performance as well.   What I mentioned such as MMIE are just the basics,  each site has their own specialized technique and you might want to know.

Of course, you normally don’t have to think so deep.   Adding more data is usually the first step of ASR improvement.    If you start to think something advance and if you can,  please try to put your implementation somewhere public such that everyone in the world can try it out.   These are something small to do, but I believe if we keep on doing something small right, there will be a day we can make open source speech recognizers as the commercial ones.

Arthur