A Read on "Dynamic Routing Between Capsules"

(Also check out Keran's discussion - very exciting! I might go to write a blurb on the new capsule paper which seems to be written by Hinton and friends.)

As you know, this is Hinton's new capsule algorithm. I have wanted to delve into this idea for a while. So here is a write-up. It's tl;dr, and I doubt I completely grok the idea anyway.

* The first mention of "capsule" is perhaps in the paper "Transforming Auto-encoders" which Hinton and students coauthored.

* It's important to understand what capsules are trying to solve before you delve into the details. If you look at Hinton's papers and talks, the capsule is really an idea that improves upon the Convnet, and Hinton has two major complaints about Convnets.

* First, the general setting of a Convnet assumes that one filter is used across different locations. This is also known as "location invariance". In this setting, the exact location of a feature doesn't matter. That has a lot to do with robust estimation of the feature parameters. Weight sharing also drastically simplifies backprop.

* But then location invariance also removes one important piece of information about an image: the apparent location of a feature.

* The second assumption is max pooling. As you know, pooling usually removes a high percentage of the information in a layer. In early architectures, pooling was usually the key to shrinking the size of a representation. Of course, later architectures have changed, but pooling is still an important component.

* So the design of capsules has a lot to do with tackling the problems of max pooling. Instead of losing information, can we "route" it correctly so that it is put to optimal use? That's the thesis.

* Generally a "capsule" represents the properties of a certain entity in an image, "such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc". Notice that these properties are not hard-wired but automatically discovered.

* Then there is the question of how low-level information is "routed" to the higher level. The mechanism in the current implementation is intriguing:

* First, your goal is to calculate a softmax of the form

c_{ij} = exp(b_{ij}) / Sum_k exp(b_{ik}), where b_{ij} is the log prior that lower-level capsule i should be coupled to higher-level capsule j, and c_{ij} is the resulting coupling coefficient. b_{ij} is something you can train.

* Then what you do is iteratively estimate b_{ij}. This appears in Procedure 1. The 4 steps are:

a, calculate the coupling coefficients c as the softmax of b,
b, compute the prediction vectors from each capsule i, then form a weighted sum,
c, squash the weighted sum,
d, update b based on the agreement (dot product) between the squashed value and the prediction vector.

* So why the squash function? My guess is that it normalizes the value computed in step b. According to Hinton, a good function is

v_j = |s_j|^2 / (1 + |s_j|^2) * s_j / |s_j|

* The rest of the architecture actually looks very much like a Convnet. The first layer was a Convnet with ReLU activation.

* Would this work? The authors say yes. Not only does it reach state-of-the-art performance on MNIST, it can also tackle more difficult tasks such as CIFAR-10 and SVHN. In fact, the authors found that on these tasks capsules already do about as well as Convnets did when Convnets were first used to tackle them.

* It also works well for two tasks called affNIST and MultiMNIST. The first is MNIST put through affine transforms; the second is MNIST regenerated with many overlapping digits. This is quite impressive, because you would normally need a lot of data augmentation and object-detection effort to get these cases working.

* The part I have doubts about: is this model more complex than a Convnet? If so, it's possible that we are just fitting a more complex model to get better results.

* A nice thing about the implementation: it's in TensorFlow, so we can expect to inspect it and play with it in the future.
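
The four routing steps and the squash function described above can be sketched in plain Python. This is only a toy sketch and not the authors' TensorFlow implementation: the names `squash`, `softmax` and `route`, and the list-of-lists layout `u_hat[i][j]`, are my own assumptions, and the prediction vectors are taken as given rather than computed from learned weight matrices.

```python
import math

def squash(s):
    """v = |s|^2 / (1 + |s|^2) * s / |s|, so the output length is always below 1."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for higher capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for _ in range(num_iters):
        # step a: coupling coefficients are a softmax over the higher capsules
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # step b: weighted sum of the prediction vectors
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            # step c: squash the weighted sum
            v.append(squash(s_j))
        # step d: raise b where a prediction agrees with the squashed output
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c
```

In this sketch, lower capsules whose predictions agree on a higher capsule end up with larger coupling coefficients for it, which is the "routing by agreement" idea.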

That's what I have so far. Have fun!

A Read on "Searching for Activation Functions"

(First published at AIDL-LD and AIDL-Weekly)

Perhaps the most interesting paper last week is the Swish function. Here are some notes:

* Swish is extraordinarily simple. It's just
swish(x) = x * sigmoid(x).
* Derivative? swish'(x) = swish(x) + sigmoid(x) * (1 - swish(x)). Simple calculus.
* Can you tune it? Yes, there is a tunable version in which the parameter is trainable. It's called Swish-Beta, which is x * sigmoid(Beta * x).
* So here's the interesting part: why is it a "self-gating function"? If you understand LSTM, essentially it introduced a multiplication sign. The multiplier strengthens the gradient and helps with the vanishing/exploding gradient problem. E.g. the input gate and the forget gate give you weights for "how much you want to consider the input" and "how much you want to forget".
* So Swish is not too different: there is the activation function, but it is weighted by the input itself. Thus the term self-gating. In a nutshell, in plain English, "because we multiply".
* It's all good, but does it work? The experimental results look promising. It works on CIFAR-10 and CIFAR-100. On ImageNet, Inception-v2 and v3 do better when Swish replaces ReLU.
* It's worthwhile to point out that the latest Inception is v4. So the ImageNet number does not beat SOTA even within Google, not to mention the best number in ImageNet 2016. But that shouldn't matter: if something consistently improves several models on ImageNet, it is a very good sign that it is working.
* Of course, looking at the activation function, it introduces a multiplication. So it does increase computation compared with a simple ReLU. And that seems to be the complaint I have heard.
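
To make the formulas above concrete, here is a small sketch of Swish, Swish-Beta and the derivative in plain Python. The function names and the `beta` argument are just my naming for illustration; the derivative function covers only the plain case (Beta = 1) given in the notes.

```python
import math

def sigmoid(x):
    # Two branches so that math.exp never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish-Beta: x * sigmoid(beta * x). beta=1 gives plain Swish."""
    return x * sigmoid(beta * x)

def swish_grad(x):
    """Derivative of plain Swish: swish(x) + sigmoid(x) * (1 - swish(x))."""
    return swish(x) + sigmoid(x) * (1.0 - swish(x))
```

A quick sanity check is to compare `swish_grad(x)` with a finite difference such as `(swish(x + h) - swish(x - h)) / (2 * h)` for a small `h`.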

That's what I have. Enjoy!

A Read on "Unsupervised Machine Translation Using Monolingual Corpora Only"

(First published on AIDL-LD and AIDL Weekly.)

"This is an impressive paper by FAIR authors which claims that one only needs monolingual corpora to train a usable translation model. So how does it work? Here are some notes.

* For starters, indeed you don't need a parallel corpus, but you still need a bidirectional dictionary to generate translations. You also need monolingual corpora in both languages. That's why the title is about monolingual corpora (plural) and not a monolingual corpus (singular).

* Then there is the issue of how you actually create a translation. It's actually much simpler than you might think. First imagine there is a latent language to which both your source and target languages are mapped.

* How do you train? Let's just use the source language as an example first. What you can do is create an encoder-decoder architecture which translates your source into the latent space, then translates it back. Using the BLEU score, you can then set up an optimization criterion.

* Now this doesn't quite do the translation yet. So you apply the same procedure to both the source and the target language. Don't you now have a common latent space? In actual translation, what you need to do is first map the source language into the common latent space, then map it out to the target language.

* Many of you might recognize that such an encoder-decoder scheme, which maps a language to itself, is very similar to an autoencoder. Indeed, the authors actually use a version of the autoencoder, the denoising autoencoder (dA), to train the model.

* The final interesting idea I spotted is iterative training. In this case, you can imagine first training an initial translator, then using its output as the truth to retrain another one. The authors found tremendous gains in BLEU score in the process.

* The results are stunning if you consider that no parallel corpus is involved. The BLEU score is around 10 points lower, but do remember: deep learning has pretty much improved BLEU scores by 7-8 absolute points anyway over the classical phrase-based translation models."
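
To give a feel for the "denoising" part of the denoising autoencoder mentioned above, here is a sketch of a noise function applied to a sentence before the encoder sees it. The notes only say that a denoising autoencoder is used; the particular noise below (random word dropping plus a local shuffle) is my assumption of a typical choice, and the names `add_noise`, `p_drop` and `k` are hypothetical.

```python
import random

def add_noise(tokens, p_drop=0.1, k=3, rng=None):
    """Corrupt a sentence: drop some words, then shuffle the rest locally."""
    rng = rng or random.Random()
    # Drop each word independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # Local shuffle: add a random offset in [0, k + 1) to each position and
    # sort by the result, so no word moves more than k positions.
    keys = [i + rng.random() * (k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]
```

The autoencoder is then trained to reconstruct the clean sentence from `add_noise(sentence)`, which keeps it from learning a trivial word-by-word copy.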

A Read on "Non-Autoregressive Neural Machine Translation"

(First published at AIDL-LD and AIDL Weekly.)

This is the second of the two papers from Salesforce, "Non-Autoregressive Neural Machine Translation". Unlike the "Weighted Transformer", I don't believe it improves SOTA results. But it does introduce a cute idea into purely attention-based NNMT. I would suggest you read my previous post before you read on.

Okay. The key idea introduced in the paper is fertility. It addresses one of the issues of the purely attention-based model introduced in "Attention is all you need". If you are doing translation, a translated word can 1) be expanded into multiple words, or 2) move to a totally different position in the sentence.

In the older world of statistical machine translation, or what we call the IBM Models, the latter is handled by "Model 2", which decides the "absolute alignment" of a source/target language pair. The former is handled by the fertility model, or "Model 3". Of course, in the world of NNMT, these two models were thought to be obsolete. Why not just use an RNN in the encoder/decoder structure to solve the problem?

(Btw, there are 5 models in total in the original IBM Models. If you are into SMT, you should probably read up on them.)

But then in the world of purely attention-based NNMT, ideas such as absolute alignment and fertility become important again, because you don't have memory within your model. So in the original "Attention is all you need" paper, there is already the idea of "positional encoding".

So the new Salesforce paper actually introduces another layer which reintroduces fertility. Instead of feeding the output of the encoder directly into the decoder, it first feeds it to a fertility layer, which decides how fertile each word should be. E.g. a fertility of 2 means that the word should be copied twice; 0 means that the word shouldn't be copied at all.
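
The copying rule is easy to show with a toy example. This is just an illustration of the idea on plain tokens, not the paper's actual layer; the function name `expand_by_fertility` and the example fertilities are made up.

```python
def expand_by_fertility(tokens, fertilities):
    """Copy each source token as many times as its fertility says."""
    if len(tokens) != len(fertilities):
        raise ValueError("need exactly one fertility per token")
    expanded = []
    for token, fertility in zip(tokens, fertilities):
        # A fertility of 0 drops the token; 2 copies it twice.
        expanded.extend([token] * fertility)
    return expanded

print(expand_by_fertility(["a", "b", "c"], [1, 0, 2]))  # ['a', 'c', 'c']
```

The expanded sequence is what fixes the length of the decoder input up front, which is why all output positions can then be decoded in parallel.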

I think the cute thing about the paper is twofold. One is that it is an obvious extension of the whole idea of attention-based NNMT. The other is that Socher's group is reintroducing classical SMT ideas into NNMT.

The result, though, is not as good as standard NNMT. As you can see in Table 1, there is still some degradation with the non-autoregressive approach. That's perhaps why, when the Google Research Blog mentioned the Salesforce results, it said "*towards* non-autoregressive translation". It implies that the results are not yet satisfying.

A Read on "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning"

(First published at AIDL-LD and AIDL Weekly.)

This is a note on the CheXNet paper. As you know, it is the widely circulated paper from Stanford which purportedly outperforms humans at chest X-ray diagnosis.

* BUT, after I read it in detail, my impression is slightly different from what you get by just reading the popular news, including the description on GitHub.

* Since the ML part is not very interesting, I will just briefly go through it. It's a 121-layer DenseNet, which basically means each layer has feed-forward connections from every previous layer. Given the data size, it's likely a full training.

* There was not much justification of the choice of architecture. My guess: the team first tried transfer learning, but decided to move on to full training to get better performance. A DenseNet would be a manageable setup.

* Then there was a fairly standard experimental comparison using AUC. In a nutshell, CheXNet did perform better than the radiologists on pneumonia detection, and better than previously published results in every one of the 14 classes of ChestX-ray14, which is known to be the largest of the similar databases.

* Now here are the caveats the popular news hasn't mentioned:
1, First of all, the humans weren't allowed to access the previous medical records of a patient.
2, Only frontal images were shown to the human doctors. But prior work showed that some diagnoses need the lateral view as well.

* That's why on p.3 of the article, the authors note:
"We thus expect that this setup provides a conservative estimate of human radiologist performance."

which should make you realize that maybe it will still take a while for deep learning to "replace radiologists".
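
Since the comparison above is reported in AUC, here is a minimal sketch of what that number measures: the chance that a randomly picked positive case gets a higher score than a randomly picked negative one. This is a brute-force version for illustration only; the function name `auc` and the toy labels and scores are made up, not data from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 means the scores are no better than random ranking, and 1.0 means every positive case is ranked above every negative one.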

A Read on "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm"

(First published at AIDL Weekly and AIDL-LD.)

Kade Gibson already posted this paper and gave a good summary. I want to analyze it in more detail, so I started a separate thread.

* As you know the story, AlphaZero no longer just plays Go; it now plays Chess and Shogi as well. By itself this is a significant event, because most SOTA board game engines are specific to one game. General game-playing engines are seen as novelties, not the norm.

* Another note: most Chess and Shogi engines are based on alpha-beta search. But AlphaZero uses Monte-Carlo Tree Search, which simulates board positions. Positions are ordered by scores from a board NN, and states are visited in an order determined by visit counts and by the value of the board according to the NN. So you can see this is not just AlphaZero beating more games; it is more of a paradigm shift for both the computer Chess and Shogi communities.

* As you know, AlphaZero beat the strongest program of 2016, Stockfish. But one analysis caught my eye: in chess, the DeepMind researchers also fixed the first few moves of AlphaZero so that it follows the 12 most-played openings for black and white. If you are into chess, these include the Queen's Gambit, several Sicilian Defences, the French and the KID. They show that AlphaZero can beat Stockfish in multiple types of situations, and that the opening doesn't matter too much.

* But then, would AlphaZero beat all computer players, such as Shredder or Komodo? No one knows the answer yet.

* One more thing: AlphaZero doesn't assume zero knowledge either. As Denny Britz points out in his tweet, AlphaZero was provided with perfect knowledge of the rules. So intriguing rules such as castling, threefold repetition and the 50-move draw rule are all provided to the machine. As Britz points out, maybe in the future we want to focus on how to let the machine figure out the rules by itself.

That's what I have. Hope you enjoy it.

A Read on "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on the paper "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

* The paper was written in 2015, and is more of a classic paper on NN-based summarization. It was published slightly later than the classic papers on NN-based translation, such as those written by Cho or Bahdanau. We assume you have some basic understanding of NN-based translation and attention.

* There is a GitHub repository and a video for the paper.

* If you haven't worked on summarization, you can broadly think of the techniques as extractive or abstractive. Given the text you want to summarize, "extractive" means you just use words from the input text, whereas "abstractive" means you can use any words you like, even words which are not in the input text.

* So this is why summarization is seen as a problem similar to translation: you just think of it as a "translation" from the original text to the summary.

* Section 2 is a fairly nice mathematical background on summarization. One thing to note: the video also brings up the noisy-channel formulation. But as Rush said, their paper does away with the noisy channel completely and does a direct mapping instead.

* The next nuance you want to look at is the type of LM and the encoder used. That can all be found in Section 3. E.g. it uses the feed-forward NNLM proposed by Bengio. Rush mentioned that he tried an RNNLM, but at that time he got only a small gain. It feels like he could probably get better results if an RNNLM were used.

* Then there is the type of encoder. There is a nice comparison between the bag-of-words and attention models. Since there are word embeddings, the "bag-of-words" is actually all the input words embedded down to a certain size. The attention model, on the other hand, is what we know today, which contains a weight matrix P that maps the context to the input.

* Here is an insightful note from Rush: "Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, P, between the input and the summary."

* Section 4 is more about decoding. In Section 2, a Markov assumption was made, which simplifies the decoding quite a lot. The authors use beam search, so you can use tricks such as path combination.

* Another cute thing is that the authors also come up with a method which makes the summarization more extractive. For that they use a log-linear model which also weighs features such as unigram to trigram matches. See Section 5.

* Why would the authors want to make the summarization more extractive? That probably has to do with the metric. ROUGE usually favors words which are extracted from the input text.

* We will stop at this point. Here are several interesting commentaries about the paper.

Denny Britz':…/neural-attention-model-for-abstractive…
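
To make Rush's remark about replacing the uniform distribution with a learned soft alignment concrete, here is a toy sketch of the two encoders. It is a simplification under my own assumptions: the function names are made up, and the alignment scores are passed in as plain numbers, whereas in the paper they come from the learned matrix P and the current context.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(embeddings, weights):
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]

def bow_encode(embeddings):
    """Bag-of-words: every input word gets the same (uniform) weight."""
    n = len(embeddings)
    return weighted_average(embeddings, [1.0 / n] * n)

def attention_encode(embeddings, scores):
    """Attention: the uniform weights are replaced by a softmax over scores."""
    return weighted_average(embeddings, softmax(scores))
```

With all scores equal, `attention_encode` gives the same vector as `bow_encode`; as one score grows, the output moves toward that word's embedding.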

Quick Impression on Waikit Lau's Crypto and Blockchain forum at MIT

* Hosted by Coach Wei. Our own Waikit Lau presented on blockchain, cryptocurrency and ICOs. Brian Subiranna and I were invited as guest panelists for the Q&A session. Brian is planning to create a class on blockchain at MIT.

* About the forum: Coach Wei has been running the MIT-Tsinghua Summit since Dec 30 last year. This talk is part of that series:…/mit-tsinghua-innovation-summit-…/

* And Waikit, as you might know, has been a successful serial entrepreneur and angel, and is involved in several ICOs. He also co-admins the AIDL forum.

* In my view, Waikit gave a great presentation on the excitement of blockchain and cryptos. His ~50 min presentation has a couple of gists:
- The rise of protocol coins such as Ethereum.
- The potential business opportunity is comparable to the development of HTTP. Waikit used the metaphor that blockchain can be seen as TCP/IP, whereas applications built on top of blockchain can be thought of as HTTP.
- The current ambiguity about how ICOs should be regulated. Or more generally: should cryptos be seen as a commodity or a security?

* Gems from the Q&A session. The crowd had many sharp questions for the panelists. Here is a summary:

Q. Is there any value in blockchain without cryptocurrency?
A. Generally yes, according to the panelists. E.g. most chains can exist without the idea of mining. Mining probably makes more sense when parties within the network don't trust each other.

Q. What is the current state of decentralized exchanges?
A. Panelists: Still under development. A lot of things need to happen to motivate a large one.

Q. Would quantum computing be a threat to blockchain?
A. Panelists (Arthur): It could be, yet current quantum computing still has several technical roadblocks to clear before it is usable for applications, e.g. creating stabilized inputs for the QC. There are also counter-technologies such as quantum cryptography being developed. So we can't quite say QC would just kill blockchain even if it is developed.

Q. Should a chain be controlled by one or multiple parties?
A. Yet another issue which is hard to predict. Looking at the development of Ethereum and Bitcoin, having a benevolent dictator seems to make the Ethereum community more unified. But the fact that Bitcoin's community is segmented, so that the buyers/users now have more say, might motivate the adoption of speed-up algorithms.

Q. Would speeding up mining speed up transactions?
A. Unlikely. What would likely improve transaction speed are technologies such as Lightning.

That's what I have.  You can find the original thread at


Quick Impression of Course 5 of

Hey hey! As you know, Course 5 is just out. As always, I checked out the class to give you a quick impression of what it is about. So far, this new 3-week class looks very exciting. Here is my takeaway. Remember, I haven't started the class yet, but this is likely to give you a sense of the scope and extent of the class.

* Course 5 is mostly focused on sequence models. That includes the more mysterious models such as RNN, GRU and LSTM. You will go through standard ideas such as vanishing gradients, which were actually first discovered in RNNs, then go through GRU and LSTM afterward.

* The sequence of coverage is nice. Covering GRU first and then LSTM doesn't quite follow the historical order (Hochreiter & Schmidhuber first came up with LSTM in '97; Cho had the idea of GRU in 2014). But such an order makes more sense for pedagogical purposes. I don't want to spoil it, but this is also how Socher approaches the subject in cs224n. That makes me believe this is likely a course which will teach you RNNs well.

* Week 1 will be all about RNNs, then Weeks 2 and 3 will be about word vectors and end-to-end structures. Would one week be enough for each topic? Not at all. But Andrew seems to give all the essentials of each topic: word2vec/GloVe for word vectors, and the standard encoder-decoder structure for the end-to-end scheme. Most examples are based on SMT, which I think is appropriate. Image captioning and speech recognition are other possible applications, but they usually have details which are tough to cover in a first class.

* Would this class be everything you need on NLP? Very unlikely. You still need to take cs224n to get good, or even the Oxford class. But just as Course 4 is a good intro to computer vision, this will be a good intro to NLP, and in general to any topic which requires sequence modeling, such as speech recognition, stock analysis or DNA sequence analysis.

The course link can be found at

Hope this "Quick Impression" helps you!

Comparing and Udacity's nanodegree

I would think of it this way:

"For the most part MOOC certificates don't mean too much in real life. What matters is whether you can actually solve problems. So a MOOC is really there to stimulate you to learn, and the certificate serves as a motivation tool.

As for OP's question: I have never taken the Udacity nanodegree. From what I heard, though, I would say the nanodegree requires about as much effort as taking 1 to 2 of Ng's specializations. It's also tougher because you need to finish a course within a specified period of time. But the upside is that there are human graders who give you feedback.

As for which path to go, I think it depends solely on your finances. Let's push it to an extreme: if you think purely in terms of credentials and opportunities, maybe an actual PhD/Master's degree will give you the most, but the downside is that it can cost you multiple years of salary. One tier down would be the online ML degree from Georgia Tech, but it will still cost you up to $5k. Then there is taking cs231n or cs224d from Stanford online; again, that will cost you $4k/class. So that's why you would consider taking a MOOC. And as I said, which price tag you choose depends on how motivated you are and how much feedback you want to get."