Thoughts From Your Humble Curators
This week we cover F8, point you to various ICLR 2018 resources, and analyze a now-classic paper on text summarization.
As always, if you like our newsletter, feel free to subscribe and forward it to your colleagues!
This newsletter is published by Waikit Lau and Arthur Chan. We also run Facebook's most active A.I. group with 136,000+ members and host an occasional "office hour" on YouTube. To help defray our publishing costs, you may donate via link. Or you can donate by sending Eth to this address: 0xEB44F762c58Da2200957b5cc2C04473F609eAA65. Join our community for real-time discussions with this iOS app here: https://itunes.apple.com/us/app/expertify/id969850760
F8 2018 AI Summary
Our focus, of course, is AI. Here are a couple of interesting pieces of news:
"We salute our friends at DeepMind for doing awesome work," Facebook CTO Mike Schroepfer said in today’s keynote. "But we wondered: Are there some unanswered questions? What else can you apply these tools to."
When you lose an ML competition, you open-source your code. It costs you little: while your competitor wins, you become the one who nurtures the next generation of enthusiasts. Smart move, Facebook. Here's the GitHub.
A read on "A Neural Attention Model for Abstractive Sentence Summarization"
This is a read on the paper "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.
- Here are the arXiv page, video, and GitHub repo.
- The paper was written in 2015 and is something of a classic on NN-based summarization. It was published slightly later than the classic papers on NN-based translation, such as those by Cho or Bahdanau. We assume you have a basic understanding of NN-based translation and attention.
- If you haven't worked on summarization, you can broadly think of the techniques as extractive or abstractive. Given the text you want to summarize, "extractive" means you only use words from the input text, whereas "abstractive" means you can use any words you like, including words that do not appear in the input text.
- This is why summarization is seen as a problem similar to translation: you can think of it as a "translation" from the original text to the summary.
- Section 2 gives a fairly nice mathematical background on summarization. One thing to note: the video also brings up the noisy-channel formulation. But as Rush said, the point of the paper is to do away with the noisy channel completely and learn a direct mapping instead.
- The next nuance you want to look at is the type of LM and encoder used. Both can be found in Section 3. For example, the paper uses the feed-forward NNLM proposed by Bengio. Rush mentioned that he tried an RNNLM, but at that time he got only a small gain. It feels like he could probably get better results if an RNNLM were used.
- Then there is the type of encoder, with a nice comparison between bag-of-words and attention models. Since there are word embeddings, the "bag-of-words" is actually all the input words embedded down to a certain size and averaged. The attention model, on the other hand, is what we know today: it contains a weight matrix P which maps between the context and the input.
- Here is an insightful note from Rush: "Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, P, between the input and the summary."
- Section 4 is more about decoding. In Section 2 a Markov assumption was made, which simplifies decoding quite a lot. The authors use beam search, so you can use tricks such as path combination.
- Another cute thing is that the authors also come up with a method that makes the summarization more extractive. It uses a log-linear model to also weigh features such as unigram, bigram, and trigram matches with the input. See Section 5.
- Why would the authors want to make the summarization more extractive? That probably has to do with the metric: ROUGE usually favors words which are extracted from the input text.
- Another point, raised by a reader at AIDL-LD, is that a summary usually contains proper nouns, which can only be found in the input text. Once again, making the summarizer more extractive is appropriate.
- Here are several interesting commentaries about the paper: mathyouth, Denny Britz
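
To make the bag-of-words versus attention comparison in the notes above concrete, here is a toy sketch in plain Python. It is our own illustration, not the authors' code: the function names (bow_encode, attention_encode) are made up, the context is assumed to be already flattened into a single vector, and the local smoothing window the paper applies to the input embeddings is left out.

```python
import math


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i], computed dimension by dimension."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bow_encode(x_emb):
    """Bag-of-words encoder: a uniform distribution over the input words."""
    m = len(x_emb)
    p = [1.0 / m] * m
    return weighted_sum(p, x_emb), p


def attention_encode(x_emb, ctx, P):
    """Attention encoder (simplified, hypothetical shapes).

    x_emb: M input word embeddings, each of dimension D
    ctx:   embedded summary context, a vector of dimension K
    P:     D x K weight matrix relating context and input
    """
    # Project the context into the input embedding space: P @ ctx
    p_ctx = [sum(row[k] * ctx[k] for k in range(len(ctx))) for row in P]
    # One score per input word: x_i . (P @ ctx)
    scores = [sum(x[d] * p_ctx[d] for d in range(len(p_ctx))) for x in x_emb]
    # The learned soft alignment that replaces the uniform distribution
    p = softmax(scores)
    return weighted_sum(p, x_emb), p


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy input embeddings
    ctx = [1.0, 0.0]                           # toy context embedding
    print(bow_encode(x)[1])
    print(attention_encode(x, ctx, [[2.0, 0.0], [0.0, 0.0]])[1])
```

With an all-zero P every score is zero, so the attention weights fall back to the uniform bag-of-words distribution. That is exactly the sense of Rush's remark: the only difference between the two encoders is where the distribution over input words comes from.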