All posts by grandjanitor

A Read on "The Consciousness Prior" By Prof. Yoshua Bengio

I just read Prof. Yoshua Bengio's "The Consciousness Prior". I know many of you, like Stuart Gray, were quite unhappy that there are no experimental results. Yet this is an interesting paper and good food for thought for all of us. Here are some notes:

* The consciousness mentioned in the paper is much less about what we would think of as qualia and more about access to different representations.

* The terminology is not too difficult to understand. Suppose there is a representation of the brain's state at the current time, h_t. A representation RNN, F, is used to model such a representation.

* The protagonist here is the consciousness RNN, C, which is used to model a consciousness state. What is a *consciousness state* then? It is actually a low-dimensional vector derived from the representation h_t.

* Now one thing to notice is that Bengio believes the consciousness RNN, C, should by itself include some kind of attention mechanism. That means the kind of attention used in NNMT these days should be involved. In a nutshell, C should "pay attention" to only the important details when it updates the consciousness vector.

* I think the idea so far is already fairly interesting. In fact, it suggests one interesting thought: what if we just initialize the consciousness vector randomly instead? In that case, a new representation of the brain appears. As a result, this mechanism mimics how human brains explore different scenarios conjured up by imagination.

* Bengio's proposal also encompasses a training method built on what he calls a verifier network, V. The goal of the network is to match the current representation h_t with a previous consciousness state c_{t-k} (states?). The training, as he envisions it, can be done with a variational autoencoder (VAE) or a GAN.

* So far the idea doesn't quite match the human way of thinking. Humans seem to create high-level concepts, like symbols, to simplify our thinking. Bengio addresses this difficulty by suggesting we can use another network, which he calls U, to generate what we mean from the consciousness state. Perhaps we can call it the generation network. This network could well be implemented with a memory-augmented-network style of architecture, which distinguishes keys from values. In this case, we can map the consciousness state to more concrete symbols which a symbolic logic or knowledge representation framework can use. ... Or which we humans can understand.

* This all sounds good, but as you may have heard from many readers of the paper, there are no experimental results. So this is really a theoretical paper.

* To be fair though, the good professor has outlined how each of the above 4 networks could actually be implemented. He also mentions how the idea could be tested in practice, e.g. he believes one good arena is reinforcement learning tasks.
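To make the "attention picks a low-dimensional state out of h_t" idea above a bit more concrete, here is a toy sketch. The paper gives no implementation, so everything here is my own assumption: the scoring rule, the parameter matrix `W_q` and the top-k selection are hypothetical and only illustrate the shape of the computation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def conscious_state(h_t, c_prev, W_q, k):
    """Toy 'consciousness' step: attend over the elements of h_t, keep the top k."""
    # Hypothetical scoring rule (not from the paper): a query computed from the
    # previous conscious state is combined elementwise with h_t.
    scores = (W_q @ c_prev) * h_t
    attn = softmax(scores)
    selected = np.argsort(attn)[-k:][::-1]   # indices of the k most attended elements
    return selected, h_t[selected]

rng = np.random.default_rng(0)
h_t = rng.normal(size=16)        # high-dimensional representation produced by F
c_prev = rng.normal(size=4)      # previous low-dimensional conscious state
W_q = rng.normal(size=(16, 4))   # hypothetical attention parameters
idx, c_t = conscious_state(h_t, c_prev, W_q, k=4)
```

Replacing `c_prev` with a random vector changes which elements get selected, which is the "imagination" thought from the notes above in miniature.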

All in all, this is an interesting paper. It's a pity that the details are scanty at this point, but it's still well worth your time to read.

A Read on "Dynamic Routing Between Capsules"

(Also check out Keran's discussion - very exciting! I might write a blurb on the new capsule paper, which seems to be written by Hinton and friends.)

As you know, this is Hinton's new invention, the capsule algorithm. I have wanted to delve into this idea for a while. So here is a write-up. It's a tl;dr, and I doubt I completely grok the idea anyway.

* The first mention of "capsule" is perhaps in the paper "Transforming Auto-encoders", which Hinton coauthored with his students.

* It's important to understand what capsules are trying to solve before you delve into the details. If you look at Hinton's papers and talks, the capsule is really an idea which improves upon the Convnet. Hinton has two major complaints about Convnets.

* First, the general setting of a Convnet assumes that one filter is used across different locations. This is also known as "location invariance". In this setting, the exact location of a feature doesn't matter. That has a lot to do with robust estimation of feature parameters. Weight sharing also drastically simplifies backprop.

* But then location invariance also removes one important piece of information about an image: the apparent location.

* The second complaint is max pooling. As you know, pooling usually removes a high percentage of information from a layer. In early architectures, pooling was usually the key to shrinking the size of a representation. Of course, later architectures have changed, but pooling is still an important component.

* So the design of capsules has a lot to do with tackling the problems of max pooling. Instead of losing information, can we "route" this information correctly so that it is put to optimal use? That's the thesis.

* Generally a "capsule" represents a certain entity in an image, with properties "such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc". Notice that these are not hard-wired but automatically discovered.

* Then there is how the low-level information can be "routed" to a higher level. The mechanism in the current implementation is intriguing:

* First, your goal is to calculate a softmax of the form

c_{ij} = exp(b_{ij}) / Sum_k exp(b_{ik}), where b_{ij} is the routing logit from a lower-level capsule i to a higher-level capsule j. This is something you can train.

* Then what you do is iteratively estimate b_{ij}. This appears in Procedure 1. The 4 steps are:

a, compute the softmax weights c_{ij} from b_{ij},
b, compute the prediction vectors from each capsule i, then form their weighted sum s_j,
c, squash the weighted sum to get v_j,
d, update b_{ij} based on the agreement (dot product) between the squashed output v_j and the prediction vector from capsule i.

* So why the squash function? My guess is that it normalizes the value computed in step b. According to Hinton, a good function is

v_j = |s_j|^2 / (1 + |s_j|^2) * s_j / |s_j|

* The rest of the architecture actually looks very much like a Convnet. The first layer is a convolutional layer with ReLU activation.

* Would this work? The authors say yes. Not only does it reach state-of-the-art results on MNIST, it can also tackle more difficult tasks such as CIFAR-10 and SVHN. In fact, the authors note that the results are already comparable to what Convnets achieved when they were first used to tackle these tasks.

* It also works well on two tasks called affNIST and MultiMNIST. The first is MNIST put through affine transforms, the second is MNIST regenerated with heavily overlapping digits. This is quite impressive, because you would otherwise need a lot of data augmentation and object-detection effort to get these cases working.

* The part I have doubts about: is this model more complex than a Convnet? If so, it's possible that we are just fitting a more complex model to get better results.

* A nice thing about the implementation: it's in TensorFlow, so we can expect to play with it in the future.
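The routing loop and the squash function described above can be sketched in a few lines of NumPy. This is my own minimal sketch, not the authors' code: the array layout of `u_hat`, the helper names and the default of 3 iterations are assumptions, and the learned transformation that produces the prediction vectors is left out.

```python
import numpy as np

def squash(s, eps=1e-9):
    # v = |s|^2 / (1 + |s|^2) * s / |s|, applied along the last axis
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iters=3):
    """u_hat[i, j, :] is the prediction vector from lower capsule i for higher capsule j."""
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))                # routing logits b_ij
    for _ in range(num_iters):
        c = softmax(b, axis=1)                       # (a) coupling weights c_ij
        s = np.einsum("ij,ijd->jd", c, u_hat)        # (b) weighted sum s_j
        v = squash(s)                                # (c) squashed output v_j
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # (d) agreement update
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))   # 6 lower capsules, 3 higher capsules, 4-d vectors
v, c = route(u_hat)
```

The squash keeps the direction of s_j and maps its length into [0, 1), so the length of v_j can be read as a probability that the entity is present.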

That's what I have so far. Have fun!

A Read on "Searching for Activation Functions"

(First published at AIDL-LD and AIDL-Weekly)

Perhaps the most interesting paper last week is the Swish function. Here are some notes:

* Swish is extraordinarily simple. It's just
swish(x) = x * sigmoid(x).
* Derivative? swish'(x) = swish(x) + sigmoid(x) * (1 - swish(x)). Simple calculus.
* Can you tune it? Yes, there is a tunable version in which the parameter is trainable. It's called Swish-Beta, which is x * sigmoid(Beta * x).
* So here's the interesting part: why it is a "self-gating function". If you understand LSTM, it essentially introduced a multiplication. The multiplier strengthens the gradient and effectively mitigates the vanishing gradient problem, e.g. the input gate and forget gate give you weights of "how much you want to consider the input" and "how much you want to forget".
* So Swish is not too different: there is the activation function, but it is weighted by the input itself. Thus the term self-gating. In a nutshell, in plain English, "because we multiply".
* It's all good, but does it work? The experimental results look promising. It works on CIFAR-10 and CIFAR-100. On ImageNet, it improves on Inception-v2 and v3 when Swish replaces ReLU.
* It's worthwhile to point out that the latest Inception is v4. So the ImageNet number is not beating SOTA even within Google, not to mention the best number in ImageNet 2016. But that shouldn't matter: if something consistently improves several models on ImageNet, it is a very good sign that it is working.
* Of course, looking at the activation function, it introduces a multiplication. So it does increase computation compared with a simple ReLU. And that seems to be the complaint I heard.
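For those who want to check the formulas above, here is a small plain-Python sketch of Swish, Swish-Beta and the derivative, with a finite-difference check. The function names are mine, and the analytic derivative is only the Beta = 1 case quoted in the notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    # beta = 1 gives plain Swish; other values give Swish-Beta
    return x * sigmoid(beta * x)

def swish_grad(x):
    # swish'(x) = swish(x) + sigmoid(x) * (1 - swish(x)), for beta = 1 only
    return swish(x) + sigmoid(x) * (1.0 - swish(x))

def numeric_grad(f, x, h=1e-6):
    # central finite difference, used only to sanity-check the formula
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    assert abs(swish_grad(x) - numeric_grad(swish, x)) < 1e-6
```

Note the extra multiplication and the exponential per activation, which is where the cost over ReLU comes from.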

That's what I have. Enjoy!

A Read on "Unsupervised Machine Translation Using Monolingual Corpora Only"

(First published on AIDL-LD and AIDL Weekly.)

"This is an impressive paper by FAIR authors which claims that one only need to use monolingual corpora to train a usable translation model. So how does it work? Here are some notes.

* For starters, indeed you don't need a parallel corpus, but you still need a bidirectional dictionary to generate translations. You also need monolingual corpora in both languages. That's why the title is about monolingual corpora (plural) but not a monolingual corpus (singular).

* Then there is the issue of how you actually create translations. It's actually much simpler than you might think. First, imagine there is a latent language to which both your source and target languages are mapped.

* How do you train? Let's just use the source language as an example first. What you can do is create an encoder-decoder architecture which translates your source into the latent space, then translates it back. With a reconstruction loss between the input and the round-trip output, you can then set up an optimization criterion. (BLEU on round-trip translations is used for model selection.)

* Now this doesn't quite do the translation. So you apply the same procedure to both the source and target languages. Don't you now have a common latent space? In actual translation, what you need to do is first map the source language into the common latent space, then map it out to the target language.

* Many of you might recognize that such an encoder-decoder scheme, which maps a language to itself, is very similar to an autoencoder. Indeed, the authors actually use a version of the autoencoder, the denoising autoencoder (dA), to train the model.

* The final interesting idea I spotted is the idea of iterative training. In this case, you first train an initial translator, then use its output as the truth and retrain another one. The authors found a tremendous gain in BLEU score in the process.

* The results are stunning if you consider that no parallel corpus is involved. The BLEU score is around 10 points lower, but do remember: deep learning has improved BLEU scores by an absolute 7-8 points anyway over the classical phrase-based translation models."
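The "denoising" part of the denoising autoencoder mentioned above means the encoder sees a corrupted sentence and the decoder has to reconstruct the clean one. Below is a rough sketch of one way to corrupt a sentence: drop some words and shuffle the rest locally. The function name, the drop probability of 0.1 and the window of 3 are my assumptions for illustration, so check the paper for the exact noise model.

```python
import random

def add_noise(words, p_drop=0.1, k=3, rng=random):
    """Corrupt a sentence: drop words at random, then shuffle locally."""
    # Drop each word independently with probability p_drop.
    kept = [w for w in words if rng.random() >= p_drop]
    # Local shuffle: jitter each position by a value in [0, k + 1) and re-sort,
    # so no word moves more than k positions from where it started.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)
    return [kept[i] for i in order]

rng = random.Random(0)
sentence = "the cat sat on the mat".split()
noisy = add_noise(sentence, rng=rng)
# Training pair for the autoencoder: input = noisy, target = sentence.
```

The point of the noise is to stop the model from learning a trivial word-by-word copy.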

A Read on "Non-Autoregressive Neural Machine Translation"

(First published at AIDL-LD and AIDL Weekly.)

This is the second of the two papers from Salesforce, "Non-Autoregressive Neural Machine Translation". Unlike the "Weighted Transformer", I don't believe it improves SOTA results. But it introduces a cute idea into purely attention-based NNMT. I would suggest you read my previous post before you read on.

Okay. The key idea introduced in the paper is fertility. It addresses one of the issues with the purely attention-based model introduced in "Attention is all you need". If you are doing translation, a translated word can 1) be expanded into multiple words, or 2) move to a totally different word location.

In the older world of statistical machine translation, there are what we call the IBM models. The latter issue is handled by "Model 2", which decides the "absolute alignment" of a source/target language pair. The former is handled by the fertility model, or "Model 3". Of course, in the world of NNMT, these two models were thought to be obsolete. Why not just use an RNN in the encoder/decoder structure to solve the problem?

(Btw, there are 5 models in total in the original IBM Models. If you are into SMT, you should probably learn them.)

But then in the world of purely attention-based NNMT, ideas such as absolute alignment and fertility become important again, because you don't have memory within your model. So in the original "Attention is all you need" paper, there is already the idea of "positional encoding".

So the new Salesforce paper actually introduces another layer which reintroduces fertility. Instead of feeding the output of the encoder directly into the decoder, it first feeds it to a fertility layer which decides how many times each word should be copied, e.g. a fertility of 2 means that the word should be copied twice, and 0 means the word shouldn't be copied at all.
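The copying step is simple enough to show directly. This is only a sketch of the copy-by-fertility idea on plain tokens: in the paper the fertilities are predicted by the model and the copies are encoder inputs rather than strings, and the function name and example sentence here are made up.

```python
def expand_by_fertility(tokens, fertilities):
    """Repeat each source token as many times as its fertility says."""
    if len(tokens) != len(fertilities):
        raise ValueError("need exactly one fertility per token")
    out = []
    for tok, fert in zip(tokens, fertilities):
        out.extend([tok] * fert)   # fertility 0 drops the token entirely
    return out

source = ["we", "accept", "totally"]
decoder_input = expand_by_fertility(source, [1, 2, 0])
# decoder_input is ["we", "accept", "accept"]
```

Once the fertilities are fixed, the length of the decoder input is known up front, which is what lets the decoder produce all output positions in parallel.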

I think the cute thing about the paper is twofold. One is that it is an obvious expansion of the whole idea of attention-based NNMT. The other is that Socher's group is reintroducing classical SMT ideas back into NNMT.

The result though does not work as well as standard NNMT. As you can see in Table 1, there is still some degradation with the non-autoregressive approach. That's perhaps why, when the Google Research Blog mentioned the Salesforce results, it said "*towards* non-autoregressive translation". It implies that the results are not yet satisfying.

A Read on "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning"

(First published at AIDL-LD and AIDL Weekly.)

This is a note on the CheXNet paper. As you know, it is the widely circulated paper from Stanford which purportedly outperforms humans on chest X-ray diagnosis.

* BUT, after I read it in detail, my impression is slightly different from what you get by just reading the popular news, including the description on GitHub.

* Since the ML part is not very interesting, I will just briefly go through it: it's a 121-layer DenseNet, which basically means each layer has feed-forward connections from every previous layer. Given the data size, it's likely a full training.

* There was not much justification of the architecture choice. My guess: the team first tried transfer learning, but decided to move on to full training to get better performance. A DenseNet would be a manageable setup for that.

* Then there was a fairly standard experimental comparison using AUC. In a nutshell, CheXNet did perform better than the human radiologists on pneumonia detection, and it beat previously published results on every one of the 14 classes of ChestX-ray14, which is known to be the largest of the similar databases.

* Now here are the caveats the popular news didn't mention:
1, First of all, the humans weren't allowed to access the previous medical records of a patient.
2, Only frontal images were shown to the human doctors, but prior work did show that some diagnoses need the lateral view as well.

* That's why on p.3 of the article, the authors note:
"We thus expect that this setup provides a conservative estimate of human radiologist performance."

which should make you realize that maybe it will still take a while for deep learning to "replace radiologists".

A Read on "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm"

(First published at AIDL Weekly and AIDL-LD.)

Kade Gibson already posted this paper and gave a good summary. I want to analyze it in more detail, so I started a separate thread.

* As you know the story, AlphaZero is no longer just playing Go; it now plays Chess and Shogi as well. By itself this is a significant event, because most SOTA board game engines are specific to one game. General game-playing engines are seen as novelties, not the norm.

* Another note: most Chess and Shogi engines are based on alpha-beta search. AlphaZero instead uses Monte-Carlo Tree Search, which simulates board positions. Positions are ordered by scores from a board NN. States are visited in an order determined by visit counts and the value of the board according to the NN. So you can see this is not just AlphaZero beating more games; it is more of a paradigm shift for both the computer Chess and Shogi communities.

* As you know, AlphaZero beat Stockfish, the strongest program of 2016. But one analysis caught my eye: in chess, the DeepMind researchers also fixed the first few moves so that play follows the 12 most-played openings for black and white. If you are into chess, these include the Queen's Gambit, several Sicilian Defences, the French and the KID. They show that AlphaZero can beat Stockfish in multiple types of situations, and that the opening doesn't matter too much.

* But then, would AlphaZero beat all computer players, such as Shredder or Komodo? No one knows the answer yet.

* One more thing: AlphaZero doesn't assume zero knowledge either. As Denny Britz points out in his tweet, AlphaZero was provided with perfect knowledge of the rules. So intricate rules such as castling, threefold repetition or the 50-move draw rule are all provided to the machine. As Britz points out, maybe in the future we want to focus on how to let the machine figure out the rules itself.

That's what I have. Hope you enjoy it.

A Read on "Deep Neural Networks for Acoustic Modeling in Speech Recognition" by Hinton et al.

* This is the now-classic paper in deep learning, in which people confirmed for the first time that deep learning can improve ASR significantly. It is important in the fields of both deep learning and ASR. It's also one of the first papers I read on deep learning, back in 2012-3.

* Many people know the origin of deep learning from image recognition, e.g. many kids would tell you stories about ImageNet, AlexNet and the history from then on. But the first important application of deep learning is perhaps speech recognition.

* So what was going on in ASR before deep learning then? For the most part, if you could come up with a technique that cut a state-of-the-art system's WER by 10% relative, your PhD thesis was good. If your technique could consistently beat previous techniques in multiple systems, you usually got a fairly good job at a research institute in the Big 4.

* The only technique I recall giving better than 10% relative improvement is discriminative training. It got ~15% in many domains. That happened back in 2003-2004. In ASR, the term "discriminative training" has very complicated connotations, so I am not going to explain much. This just gives you the context of how powerful deep learning is.

* You might be curious what "relative improvement" is, e.g. suppose your original WER is 18% and you improve it to 17%; then your relative improvement is 1%/18% = 5.56%. So a 10% improvement really means you go down to 16.2%. (Yes, ASR is that tough.)

* So here comes replacing GMM with DNN. These days it sounds like a no-brainer, but back then it was a huge deal. Many people in the past tried to stuff in various ML techniques to replace the GMM, but no one could successfully beat the GMM-HMM. So this is innovative.

* Then there is how the system is set up. The ancestor of this work traces back to Bourlard and Morgan's "Connectionist Speech Recognition", in which the authors tried to come up with a context-independent HMM system by replacing VQ scores with a shallow neural network. At that time, the units were chosen to be CI states.

* Hinton's and perhaps Deng's thinking is interesting: the units were chosen to be context-dependent states. Now that's a new change, and it reflects how modern HMM systems are trained.

* Then there is how the network is really trained. You can see the early DLers' stress on pre-training, because training was very expensive at that point. (I suspect it wasn't using GPUs.)

* Then there is the use of cross-entropy to train the model. Later on, in other systems, many people just use a sentence-level criterion to do training. So in this sense, the paper is dated.

* None of this is trivial work. But the result is stellar: we are talking about an 18%-33% relative gain (p.14). To ASR people, that's unreal.

* The paper also foresees some future uses of DNNs, such as bottleneck features and articulatory features. You probably know the former already. The latter is more esoteric in ASR, so I am not going to talk about it much.
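Since "relative improvement" trips up many people outside ASR, here is the arithmetic from the WER example above as a tiny sketch. The helper names are mine.

```python
def relative_improvement(old_wer, new_wer):
    """Fraction of the original error rate that was removed."""
    return (old_wer - new_wer) / old_wer

def wer_after(old_wer, rel):
    """WER you end up with after a given relative improvement."""
    return old_wer * (1.0 - rel)

# 18% -> 17% is only about a 5.56% relative improvement.
small_gain = relative_improvement(18.0, 17.0)
# A 10% relative improvement on 18% lands at about 16.2%.
thesis_level = wer_after(18.0, 0.10)
# The 18%-33% relative gains quoted from the paper, applied to an 18% baseline.
dnn_range = (wer_after(18.0, 0.18), wer_after(18.0, 0.33))
```

The 18% baseline in the last line is just the running example from the notes, not a number from the paper.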

Anyway, that's what I have. Enjoy the reading!

A Read on "Regularized Evolution for Image Classifier Architecture Search"

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on "Regularized Evolution for Image Classifier Architecture Search" which is the paper version of AmoebaNet, the latest result in AutoML (Or this page:…/using-evolutionary-automl…)

* If you recall, Google already has several past results on how to use RL and evolution strategies (ES) to discover model architectures, e.g. NASNet is one example.

* So what's new? The key idea is the so-called regularized evolution strategy. What does it mean?

* Basically it is a tweak of the more standard tournament selection, commonly used as the means of selecting individuals out of a population.

* Tournament selection is not too difficult to describe:
- Choose random individuals from the population.
- Choose the best candidate according to a certain optimization criterion.

You can also use a probabilistic scheme to decide whether to use the second or third best candidate. You might also think of it as throwing away the worst-N-candidate.

* The AutoML paper calls this original method, by Miller and Goldberg (1995), the non-regularized evolution method.

* What is "regularized" then? Instead of throwing away the worst-N candidates, the authors propose to throw away the oldest-trained candidate.

* Now you won't see a justification of why this method is better until the "Discussion" section. Okay, let's go with the authors' intended flow. As it turns out, the regularized method is better than the non-regularized method, e.g. on CIFAR-10, the evolved model is ~10% relatively better than either man-made models or NASNet. On ImageNet, it performs better than Squeeze-and-Excitation Net as well as NASNet. (Squeeze-and-Excitation Net is the ILSVRC 2017 winner.)

* One technicality when you read the paper is the G-X datasets: they are actually gray-scale versions of the normal X data, e.g. G-CIFAR-10 is the gray-scale version of CIFAR-10. The authors' intention is probably twofold: 1) to scale the problem down, 2) to avoid overfitting to only the standard test sets of the problems.

* Now, these are all great. But how come the "regularized" approach is better then? How would the authors explain it?

* I don't want to come up with a hypothesis. So let me just quote the last paragraph here: "Under regularized evolution, all models have a short lifespan. Yet, populations improve over longer timescales (Figures 1d, 2c,d, 3a–c). This requires that its surviving lineages remain good through the generations. This, in turn, demands that the inherited architectures retrain well (since we always train from scratch, the weights are not heritable). On the other hand, non-regularized tournament selection allows models to live infinitely long, so a population can improve simply by accumulating high-accuracy models. Unfortunately, these models may have reached their high accuracy by luck during the noisy training process. In summary, only the regularized form requires that the architectures remain good after they are retrained."

* And also: "Whether this mechanism is responsible for the observed superiority of regularization is conjecture. We leave its verification to future work."
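To see how small the difference between the two variants described above is, here is a toy sketch of tournament selection with both removal rules. It is my own illustration on a made-up bit-string problem, not the paper's code: the function names, the population and sample sizes and the toy fitness are assumptions, and whether the discarded model comes from the sample or the whole population is also my simplification. In the real setting "fitness" is a noisy train-and-validate run of an architecture.

```python
import collections
import random

def evolve(fitness, mutate, init, cycles, pop_size=20, sample_size=5,
           regularized=True, rng=random):
    """Tournament-selection evolution; returns the best (arch, score) seen."""
    population = collections.deque()          # oldest individual on the left
    for _ in range(pop_size):
        arch = init(rng)
        population.append((arch, fitness(arch)))
    best = max(population, key=lambda p: p[1])
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])      # tournament winner
        child = mutate(parent[0], rng)
        entry = (child, fitness(child))
        population.append(entry)
        if regularized:
            population.popleft()                      # discard the oldest
        else:
            population.remove(min(population, key=lambda p: p[1]))  # discard the worst
        if entry[1] > best[1]:
            best = entry
    return best

# Toy problem (hypothetical): an "architecture" is 10 bits, fitness counts the ones.
def init(rng):
    return [rng.randint(0, 1) for _ in range(10)]

def fitness(arch):
    return sum(arch)

def mutate(arch, rng):
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

best_arch, best_score = evolve(fitness, mutate, init, cycles=200, rng=random.Random(0))
```

With a deterministic fitness like this one the two rules behave similarly. The argument quoted from the paper only bites when evaluation is noisy and a lucky model could otherwise live forever.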

A Read on "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on the paper "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

* The paper was written in 2015 and is more of a classic paper on NN-based summarization. It was published slightly later than the classic papers on NN-based translation, such as those written by Cho or Bahdanau. We assume you have some basic understanding of NN-based translation and attention.

* There is a GitHub repository and a video for the paper.

* If you haven't worked on summarization, you can broadly think of the techniques as extractive or abstractive. Given the text you want to summarize, "extractive" means you just use the words from the input text, whereas "abstractive" means you can use any words you like, even words which are not in the input text.

* So this is why summarization is seen as a similar problem to translation: you just think of it as a "translation" from the original text to the summary.

* Section 2 is a fairly nice mathematical background on summarization. One thing to note: the video also brings up the noisy-channel formulation. But as Rush said, their paper does away with the noisy channel completely and does a direct mapping.

* The next nuance to look at is the type of LM and the encoder used. That can all be found in Section 3, e.g. it uses the feed-forward NNLM proposed by Bengio. Rush mentioned that he tried an RNNLM, but at that time he got only a small gain. It feels like he could probably get better results if an RNNLM were used.

* Then there is the type of encoder. There is a nice comparison between bag-of-words and attention models. Since there are word embeddings, the "bag-of-words" is actually all the input words embedded down to a certain size. The attention model, on the other hand, is what we know today: it contains a weight matrix P which maps the context to the input.

* Here is an insightful note from Rush: "Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, P, between the input and the summary."

* Section 4 is more on decoding. In Section 2 a Markov assumption was made, which simplifies decoding quite a lot. The authors used beam search, so you can use tricks such as path combination.

* Another cute thing is that the authors also come up with a method to make the summarization more extractive. For that they use a log-linear model that also weighs features such as unigram to trigram matches. See Section 5.

* Why would the authors want to make the summarization more extractive? That probably has to do with the metric. ROUGE usually favors words which are extracted from the input text.

* We will stop at this point. Here are several interesting commentaries about the paper.

Denny Britz':…/neural-attention-model-for-abstractive…
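Going back to the encoder comparison in the notes above, Rush's remark about "replacing the uniform distribution in bag-of-words with a learned soft alignment, P" is easy to see in code. This is a simplified sketch under my own assumptions: the shapes and function names are made up, and the local smoothing of the input embeddings that the paper applies is left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bow_encoder(x_emb):
    """Bag-of-words: every input word gets the same weight 1/M."""
    p = np.full(len(x_emb), 1.0 / len(x_emb))
    return p @ x_emb

def attention_encoder(x_emb, ctx_emb, P):
    """Attention: weights over input words depend on the summary context."""
    scores = x_emb @ P @ ctx_emb      # one score per input word
    p = softmax(scores)               # learned soft alignment
    return p @ x_emb

rng = np.random.default_rng(0)
x_emb = rng.normal(size=(7, 5))       # 7 input words, 5-d embeddings
ctx_emb = rng.normal(size=6)          # embedded summary context (hypothetical size)
P = rng.normal(size=(5, 6))           # alignment matrix between input and context
enc = attention_encoder(x_emb, ctx_emb, P)
```

If P is all zeros, every score is equal and the attention encoder collapses back to the bag-of-words encoder, which is exactly the "uniform distribution" in the quote.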