Editorial
Thoughts From Your Humble Curators
We were out last week. The hottest news this week is all about Prof. Hinton’s capsule models!
Prof. Hinton and his students just released an arXiv paper on how the idea of capsules can be used, and specifically how it can deliver state-of-the-art results on MNIST as well as on its more difficult cousins, affNIST and MultiMNIST, which distort MNIST with affine transforms and heavy overlapping. So we dedicate this issue to capsules. We provide our own analysis in the Paper/Thesis Review section, highlight a popular link from Wired, and cover the latest developments. So far, we know of two implementations which attempt to reproduce the results.
Other than that, check out other interesting links, such as ex-Google Brain Resident David Ha’s work on evolution strategies, and our piece on “Unsupervised Machine Translation Using Monolingual Corpora”.
Join our community for real-time discussions with this iOS app here: https://itunes.apple.com/us/app/expertify/id969850760
As always, if you like our newsletter, feel free to subscribe or share it with your colleagues.
We will host our own show at the AI World events: “Attack of the AI Startups”. If you live in the New England area, feel free to join!
Sponsor
Attack of the AI Startups
We will be hosting an AIDL Meetup at the AI World Conference in Boston on Dec 12 at 6:15pm where some cutting-edge AI companies will present. We got FREE tickets for you all! Come join us in person if you can!!
Attack of the AI Startups – https://aiworld.com/sessions/mlai/ at AI World – aiworld.com
All attendees need to register to attend. To register, please go to: https://aiworld.com/live-registration/ To receive your FREE expo pass (thru Nov 15), use priority code: AIWMLAIX
To receive a $200 discount off of your 1, 2 or 3 day VIP conference pass, use priority code: AIWMLAI200
AI World is the industry’s largest independent event focused on the state of the practice of enterprise AI and machine learning. AI World is designed to help business and technology executives cut through the hype, and learn how advanced intelligent technologies are being successfully deployed to build competitive advantage, drive new business opportunities, reduce costs and accelerate innovation efforts.
The 3-day conference and expo brings together the entire applied AI ecosystem, including innovative enterprises, industry thought leaders, startups, investors, developers, independent researchers and leading solution providers. Join 100+ speakers, 75+ sponsors and exhibitors, and thousands of attendees.
News
Sony AIBO is back! And Powered by Deep Learning!
Deep learning is revitalizing one of the most iconic products: Sony’s AIBO. The new AIBO, priced at ~$1,730, will have sensors and run deep learning algorithms to recognize images and sounds.
Course 4 of deeplearning.ai is here!
Alright, ML nerds, Course 4 of deeplearning.ai is finally here! Course 4 looks good: it’s all about image classification with Convnets, object detection, and fun exercises such as transfer learning and face verification.
Wired’s Coverage of Hinton’s Capsules Theory
Here is a more popular account of Prof. Hinton’s capsules theory. Notice that Wired assumes the OpenReview paper is also from Hinton and his students, but at this point we don’t have any confirmation yet.
Blog Posts

Nvidia’s Progressive Growing of GAN
Here is a very impressive result you probably saw last week – GANs can now generate highly realistic celebrity images (trained on the CelebA-HQ dataset). So how does it work?
It turns out it has a lot to do with how the training is done. The authors start with a GAN whose generator and discriminator work at a tiny 4×4 resolution. Training then progressively doubles the resolution, eventually reaching 1024×1024. Doing so, the authors claim, helps the network generate fine details (we sketch this schedule in code after the list below).
What else is in the recipe? There are two things we spotted:
- The authors used facial landmarks to crop the images when building the CelebA-HQ dataset.
- Then there are various normalization schemes (pixelwise feature normalization in the generator and an equalized learning rate, among others).
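To make the “progressive growing” idea concrete, here is a minimal, purely illustrative Python sketch of such a growing schedule. It is our own toy reconstruction, not NVIDIA’s training code; the function name, the number of images per phase, and the linear fade-in rule are all assumptions on our part.

```python
# Toy sketch of a progressive-growing schedule (our own illustration, NOT
# NVIDIA's code). Training proceeds in phases; each phase doubles the
# resolution, and newly added layers are faded in linearly during the first
# half of each phase via a blending factor alpha.

def progressive_schedule(start_res=4, final_res=1024, images_per_phase=600_000):
    """Yield (resolution, fade_in_alpha) pairs over the course of training."""
    res = start_res
    while res <= final_res:
        for seen in range(0, images_per_phase, 100_000):  # coarse steps for the demo
            if res == start_res:
                alpha = 1.0  # nothing new to fade in at the lowest resolution
            else:
                alpha = min(1.0, seen / (images_per_phase / 2))
            yield res, alpha
        res *= 2

if __name__ == "__main__":
    for resolution, alpha in progressive_schedule(final_res=32):
        print(f"training at {resolution}x{resolution}, fade-in alpha = {alpha:.2f}")
```

The real training loop would, of course, also grow the generator and discriminator networks themselves at each phase boundary.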
The State of ML and Data Science Report from Kaggle
This is a great survey by Kaggle of data scientists and ML experts. What are their demographics? What tools do they use? And what is their favorite machine learning algorithm? All of these questions are answered in the report.
A Visual Guide to Evolution Strategies
We often learn a lot just by reading what Google Brain Resident David Ha shares in his Facebook and LinkedIn feeds. He has impressive experience in different sub-fields of machine learning, and his seemingly casual experiments on reinforcement learning are some of the most interesting to read and understand.
This time, David has written an extensive guide on evolution strategies, comparing various methods such as genetic algorithms (GA), covariance matrix adaptation evolution strategy (CMA-ES), REINFORCE, and OpenAI’s latest strategy. It’s certainly eye-opening for us. As with many of David’s works, the code is released. So enjoy!
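To get a feel for how simple the core of these methods can be, here is a tiny NumPy sketch of a basic evolution strategy on a toy objective, roughly in the spirit of the OpenAI-style update David discusses. The objective, population size, noise scale, and learning rate below are our own illustrative choices.

```python
import numpy as np

def fitness(w):
    # Toy objective: negative squared distance to the point (3, -2); the maximum is 0.
    return -np.sum((w - np.array([3.0, -2.0])) ** 2)

def simple_es(num_iters=300, pop_size=50, sigma=0.1, lr=0.03):
    """Estimate the argmax of `fitness` by perturbing parameters with Gaussian
    noise and moving in the direction of reward-weighted noise."""
    w = np.zeros(2)                               # current parameter estimate
    for _ in range(num_iters):
        noise = np.random.randn(pop_size, 2)      # one perturbation per population member
        rewards = np.array([fitness(w + sigma * n) for n in noise])
        a = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # standardized rewards
        w = w + lr / (pop_size * sigma) * noise.T.dot(a)         # gradient-like update
    return w

if __name__ == "__main__":
    print(simple_es())   # should land near [3, -2]
```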
Arthur’s Full Review of deeplearning.ai Course 2
Here is Arthur’s review of deeplearning.ai Course 2. This time he focuses on why learning the details of deep learning could be a good thing for beginners in DL.
Open Source
Implementation of CapsNet
Here is one of the first attempts to reimplement CapsNet.
Paper/Thesis Review
What We Know About Capsules So Far
We wrote the previous piece on Monday in our Literature Discussion group, but it triggered a very interesting discussion from which we learned a couple of things:
- Hinton and his students may have submitted another paper to ICLR 2018. It very likely involves EM as the routing mechanism.
- There are already two available implementations of CapsNet. (See the Implementation Section.)
Capsules
This is Hinton’s newly invented capsules algorithm. Here is our write-up: consider it a TL;DR, though we doubt we completely grok the idea anyway.
- The first mention of “capsules” is perhaps in the paper “Transforming Auto-encoders”, which Hinton and his students coauthored.
- It’s important to understand what capsules try to solve before you delve into the details. If you look at Hinton’s papers and talks, the capsule is really an idea that improves upon Convnets. Hinton has two major complaints.
- First, the general setting of a Convnet assumes that one filter is used across different locations. This is also known as “location invariance”. In this setting, the exact location of a feature doesn’t matter. That has a lot to do with robust parameter estimation, and weight sharing also drastically simplifies backprop.
- But location invariance also throws away an important piece of information about an image: the feature’s apparent location.
- The second complaint concerns max pooling. As you know, pooling usually removes a high percentage of the information from the previous layer. In early architectures, pooling was usually the key to shrinking the size of a representation. Later architectures have changed, of course, but pooling is still an important component.
- So the design of capsules has a lot to do with tackling the problems of max pooling: instead of losing information, can we “route” values from the previous layer so that they are put to optimal use?
- Generally, a “capsule” represents a certain entity of an image, “such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc”. Notice that these are not hard-wired but discovered automatically.
- Then there is the question of how low-level information is “routed” to a higher level. The mechanism in the current implementation is intriguing (we sketch it in NumPy after this list):
- First, the coupling coefficients are given by a softmax of the form c_{ij} = exp(b_{ij}) / Sum_k exp(b_{ik}), where b_{ij} couples the output of lower-level capsule i to a higher-level capsule j.
- Then what you do is iteratively estimate b_{ij}. This appears in Procedure 1 of the paper. The four steps are:
a. calculate the softmax weights c_{ij} from the logits b_{ij};
b. compute the prediction vectors from each capsule i, then form their weighted sum s_j;
c. squash the weighted sum;
d. update the logits b_{ij} based on the agreement between the squashed output and the prediction vectors.
- So why the squash function? Our guess is that it normalizes the value computed in step b. According to Hinton, a good choice is v_j = (|s_j|^2 / (1 + |s_j|^2)) * (s_j / |s_j|).
- The rest of the architecture actually looks very much like a Convnet: the first layer is a convolutional layer with ReLU activation.
- Would this work? The authors say yes. Not only does it reach the state-of-the-art benchmark on MNIST, it can also tackle more difficult tasks such as CIFAR-10 and SVHN. In fact, the authors found that on both tasks they already achieve results comparable to those of the first Convnets used to tackle them.
- It also works well on two tasks called affNIST and MultiMNIST. The first is MNIST put through affine transforms; the second is MNIST regenerated with heavy overlaps. This is quite impressive, because you would normally need a lot of data augmentation and object-detection machinery to get these cases working.
- The part we have some doubts about: is this model more complex than a Convnet? It’s possible that we are just fitting a more complex model to get better results.
- A nice thing about the implementation: it’s in TensorFlow, so we can play with it in the near future.
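To make the routing procedure above a bit more concrete, here is a rough NumPy sketch of our reading of routing-by-agreement, including the squash function. It is not either of the released implementations; the toy shapes and the way the prediction vectors are fed in are our own assumptions (in the paper they come from learned transformation matrices applied to the lower-level capsule outputs).

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * (s / |s|): keeps the direction, squeezes the length into [0, 1).
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors with shape (num_lower, num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                       # routing logits b_ij
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over upper capsules
        s = (c[:, :, None] * u_hat).sum(axis=0)                # weighted sum s_j
        v = squash(s)                                          # squashed output v_j
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)           # agreement update
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u_hat = rng.normal(size=(8, 3, 4))    # 8 lower capsules, 3 upper capsules, 4-dim poses
    print(dynamic_routing(u_hat).shape)   # -> (3, 4)
```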
Have fun!
Unsupervised Machine Translation Using Monolingual Corpora
This is an impressive paper by FAIR authors which claims that one only needs monolingual corpora to train a usable translation model. So how does it work? Here are some notes.
- For starters, you indeed don’t need a parallel corpus, but you still need a bidirectional dictionary to generate translations. You also need monolingual corpora in both languages. That’s why the title speaks of monolingual corpora (plural) rather than a monolingual corpus (singular).
- Then there is the issue of how you actually create a translation. It’s simpler than you might think: first, imagine there is a latent space to which both your source and target languages are mapped.
- How do you train? Let’s use the source language as an example first. What you can do is create an encoder-decoder architecture which maps your source sentence into the latent space and then maps it back. Using BLEU score, you can then set up an optimization criterion.
- Now, this alone doesn’t quite do the translation. But apply the same procedure to both the source and target languages, and don’t you now have a common latent space? Once you have trained such a common latent space, actual translation just means first mapping the target language into the common latent space and then mapping it back out into the source language.
- Many of you might recognize that such an encoder-decoder scheme, which maps a language back to itself, is very similar to an autoencoder. Indeed, the authors use a version of an autoencoder, a denoising autoencoder (dA), to train the model (see the sketch after this list for the kind of corruption such a setup might use).
- The final interesting idea we spotted is iterative training. Here, you can imagine first training an initial translator, then using its output as the ground truth to retrain another one. The authors found tremendous gains in BLEU score from this process.
- The results are stunning if you consider that no parallel corpus is involved. The BLEU score is around 10 points lower, but do remember: deep learning has pretty much improved BLEU scores by an absolute 7-8 points over classical phrase-based translation models anyway.
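As a small illustration of the denoising part, here is a toy Python sketch of the kind of corruption such a setup might apply to a sentence before asking the model to reconstruct it: word dropout plus light local shuffling. The function name and the parameter values are our own guesses, not the paper’s exact settings.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, rng=random):
    """Corrupt a tokenized sentence: drop words at random, then lightly shuffle
    word order so that each word moves only a few positions."""
    # Word dropout: remove each token with probability drop_prob (keep at least one).
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: jitter each position by up to shuffle_window and re-sort.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda pair: pair[0])]

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(add_noise(sentence))   # e.g. ['the', 'sat', 'cat', 'on', 'mat']
```

The denoising autoencoder is then trained to map the corrupted sentence back to the original one.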
Member Ben Davis also wrote a fairly good summary of the paper. Check it out in our thread.