
AIDL Weekly Issue 2 - Gamalon/Batch Renormalization/TF 1.0/Oxford Deep NLP

Issue 2  


Thoughts From Your Humble Curators

How do you create a good A.I. newsletter? What first comes to everybody's mind is to simply aggregate a lot of links. This is very common in deep learning resource lists, say "Cool List of XX in Deep Learning". In our experience, you usually have to sift through 100-200 links and decide for yourself which ones are useful.

We believe there is a better way: in AIDL Weekly, we only choose important news and always provide detailed analysis of each item. For example, here we take a look at the newsworthy Gamalon, which is reported to use a ground-breaking method to outperform deep learning and recently won a defense contract. What is the basis of its technology? We cover this in a deep dive in the "News" section.

Or you can take a look at the exciting development of batch renormalization, which tackles the current shortcomings of batch normalization. Anyone who does normalization in training will likely benefit from the paper.

Last week, we also saw the official release of TensorFlow 1.0 as well as the 2017 official TensorFlow Summit. We prepared two good links so that you can follow along. If you love deep learning with NLP, you might also want to check out the new course from Oxford.

As always, check out our FB group and our YouTube channel, and of course subscribe to this newsletter.

Artificial Intelligence and Deep Learning Weekly




Member's Question

Question from an AIDL Member

Q: (Rephrased) "I am trying to learn the following languages, (...) to intermediate level, and the following languages, (...) to professional level. Would this be helpful for my career in Data Science/Machine Learning? I have a mind to work on deep learning."

This is a variation of a frequently asked question. In a nutshell: "How much programming should I learn if I want to work on deep learning?" The question itself shows some misconceptions about programming and machine learning, so we include it in this issue. This is my (Arthur's) take:

  1. First things first: usually you decide which package to work on first. If the package uses language X, then you go and learn language X. E.g. if I want to hack the Linux kernel, I would need to know C, learn the Linux system calls, and perhaps pick up some assembly language. Learning programming is more like a means to achieve a goal. Echoing J.T. Bowlin's point, a programming language is like a natural language: you can always learn more, but beyond a certain point it becomes unnecessary.
  2. Then you ask what language should be used to work on deep learning. I would say mathematics, because once you understand the Greek symbols, you can translate all these symbols to code (approximately). So if you ask me what you need to learn to hack TensorFlow, "mathematics" would be the first answer. Yes, the package is written in Python/C++/C, but they wouldn't even be close to my top-5 answers, because if you don't know what backprop is, knowing how a C++ destructor works can't make you an expert in TF.
  3. The final thing is that you mentioned the term "level". What does this "level" mean? Is it like a chess or Go rating, where someone with a higher rating will have a better career in deep learning? That might work for competitive programming... but real-life programming doesn't work that way. Real-life programming means you can read and write complex programs. E.g. in C++, you use a class instead of repeating a function implementation many times, to reduce the amount of code you write. The same goes for templates. That's why classes and templates are important concepts and people debate their usage a lot. How can you give "levels" to such skills?

Lastly, I would say that if you seriously want to focus on one language, consider Python, but always learn a new programming language every year. Also pick up some side projects: both your job and your side projects will usually give you ideas about which language you should learn more.


©2017-2019 Artificial Intelligence and Deep Learning Weekly


AIDL Weekly Issue 1 - First AIDL Weekly

Issue 1  


Thoughts From Your Humble Curators

When Waikit Lau and I (Arthur Chan) started the Facebook group Artificial Intelligence and Deep Learning (AIDL) last April, we had no idea it would become a group with 9000+ members that is still growing fast. (We added 1k members in the last 7 days alone.)

We suspect this is just the beginning of the long, winding road toward a new layer of intelligence that can be applied everywhere. The question is: how do we start? That was the first thing we realized back in late 2015: facing literally tens of thousands of links, tutorials, etc., it was like drinking from a firehose, and we had a hard time picking out the gems.

We decided to start our little AIDL group to see if we could get a community to help make sense of the velocity of information. In less than one year, AIDL became the most active A.I. and deep learning group on Facebook. We hope to summarize, analyze, educate and disseminate, and I think we have done a good job so far. This has resulted in conversations flourishing in the group. We strive to have discussions one level deeper than others. For example, forum members, including us, fact-check several pieces of news related to deep learning. This gives us a better edge in the rapidly changing field of A.I.

This newsletter follows exactly the same philosophy as our forum. We hope to summarize, analyze, educate and disseminate. We will keep an eye on the latest and most salient developments and present them in a coherent fashion to your mailbox.

We sincerely hope that AIDL will be helpful to your career or studies. Please share our newsletter here with your friends. Also check out our YouTube channel here.


Your Humble Curators, Arthur and Waikit



Resources on Speech Recognition

Unlike other deep learning topics, there are no ready-made video courses available on speech recognition.   So here is a list of other resources that you may find useful.


If you want to learn from online resources:

Useful E2E Speech Recognition Lecture

Important papers:

  • "Deep Neural Networks for Acoustic Modeling in Speech Recognition" by G. Hinton et al.
  • "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves

Resources on Understanding Heaps

Some assorted links for understanding heaps in user-land,

Resources on CUDA programming

Here is a list of resources for CUDA programming, in particular, in C.


Perhaps the best beginner's guide is the series written by Mark Harris, which currently spans 10 articles.  It starts from simple HelloWorld-type examples, then goes deeper and deeper into important topics such as data transfer optimization and shared memory.  The final 3 articles focus on optimizing real-life applications such as matrix transpose and the finite-difference method.

  1. An Easy Introduction to CUDA C and C++
  2. How to Implement Performance Metrics in CUDA C/C++
  3. How to Query Device Properties and Handle Errors in CUDA C/C++
  4. How to Optimize Data Transfers in CUDA C/C++
  5. How to Overlap Data Transfers in CUDA C/C++
  6. An Even Easier Introduction to CUDA
  7. Unified Memory for CUDA Beginners
  8. An Efficient Matrix Transpose in CUDA C/C++
  9. Finite Difference Methods in CUDA C/C++, Part 1
  10. Finite Difference Methods in CUDA C/C++, Part 2


A very important document on the internals of Nvidia chips as well as the CUDA programming model is the CUDA C Programming Guide.

In version 9, the document has around 90 pages of content, with the remaining 210 pages being appendices.  I found it very helpful to read through the content and look up the appendices from time to time.

The next useful document is the CUDA C Best Practices Guide.  You will find a lot of performance tuning tips in the guide.

If you want to profile a CUDA application, you must use nvprof and the Visual Profiler; you can find their manuals here.  Two other very good links to read are here and this one by Mark Harris.

If you want to read a very good textbook, consider reading "Professional CUDA C Programming", which I think is the best book on the topic.   You will learn what the author calls "profile-based programming", which is perhaps the best way to proceed in CUDA programming.



Inline PTX Assembly

cuBLAS:  indispensable for linear algebra.  The original Nvidia documentation is good, but you may also find this little gem on "cuBLAS by example" useful.

Resources on ResNet


youtube video:


Quite related:

  • Convolutional Neural Networks at Constrained Time Cost: an interesting predecessor of the paper.
  • Highway networks

Unprocessed but Good:

  • Multigrid tutorial
  • (Talk about ResNet, Wide ResNet and ResNeXt)
  • Wide Residual Networks
  • Aggregated Residual Transformations for Deep Neural Networks
  • Deep Networks with Stochastic Depth
  • Highway network
  • Ablation study
  • It's implemented in TF
  • Wider or Deeper: Revisiting the ResNet Model for Visual Recognition
  • Deep Residual Learning and PDEs on Manifold
  • Is it really because of ensemble?
  • Multi-level Residual Networks from Dynamical Systems View
  • Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units
  • TinyImageNet
  • Predict Cortical Representation

Another summary:

A Read on "ImageNet Training in Minutes"

Yes, you read it right: ImageNet training in 24 mins. In particular, an AlexNet structure in 24 mins and ResNet-50 in 60 mins. In terms of AlexNet, in fact, You's work breaks Facebook's previous record: 1 hour for AlexNet training. Last time I checked, my slightly optimized training on one single GPU took ~7 days. Of course, I'm curious how these ideas work. So this post is a summary.

* For the most part, this is not GPU work. It is mostly a CPU platform, accelerated by the Intel Knights Landing (KNL) accelerator. Such an accelerator is very suitable for HPC platforms, and there are a couple of supercomputers in the world built with 2,000 to 10,000 such CPUs.

* The gist of why KNL is good: it divides the processors on the chip and matches them with memory well. So unlike many clusters you might encounter with 8 to 16 processors, its memory bandwidth is much wider. Memory bandwidth is usually a huge bottleneck in training speed.

* Another important line of thought here is "Can you load in more data per batch?", because that allows the calculation to be parallelized much more easily. The previous work of the first author, You, already allowed the ImageNet batch size to go from the standard 256-512 to something like 8192. This thought has been around for a while, perhaps since Alex Krizhevsky. You's previous idea is based on adaptive calculation of the learning rate per layer, or Layer-wise Adaptive Rate Scaling (LARS).

* You then combined LARS with another insight from FB researchers: a slow warmup of the learning rate. That results in his current work, which is literally 60% faster than the previous work.
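
To make the two ideas above a bit more concrete, here is a minimal Python sketch of a LARS-style per-layer rate combined with a linear warmup. It is not the authors' code. The formula local_lr = trust_coef * ||w|| / (||g|| + weight_decay * ||w||) is how I read the LARS paper, while the function names and the default values of trust_coef, weight_decay, warmup_steps and base_lr are assumptions picked only for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005):
    """Per-layer rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # assumption: fall back to the global rate alone
    return trust_coef * w_norm / denom


def warmup_lr(step, warmup_steps, base_lr):
    """Ramp the global rate linearly up to base_lr, then hold it."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


def lars_step(weights, grads, step, warmup_steps=5, base_lr=1.0,
              trust_coef=0.001, weight_decay=0.0005):
    """One SGD update for a single layer: global warmup rate times local LARS rate."""
    lr = warmup_lr(step, warmup_steps, base_lr) * lars_local_lr(
        weights, grads, trust_coef, weight_decay)
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    layer_w = [3.0, 4.0]
    layer_g = [0.3, 0.4]
    for step in range(7):
        print(step, warmup_lr(step, 5, 1.0), lars_step(layer_w, layer_g, step))
```

With weight decay set to 0, multiplying a layer's gradient by 10 leaves the update unchanged in this sketch. That is the intuition behind the per-layer scaling: the step size follows the size of the weights rather than the size of the gradient, which is what makes very large batches easier to train.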

Given what we know, it's conceivable that training can be even faster in the future. What has been blocking people seems to be 1) the number of CPUs within a system and 2) how large a batch can be loaded. And I bet that after FB reads You's paper, there will be another batch of improvements as well. How about that? Don't you love competition in deep learning?

A Read on "The Consciousness Prior" By Prof. Yoshua Bengio

Here are my thoughts after reading Prof. Yoshua Bengio's "The Consciousness Prior". I know many of you, like Stuart Gray, were quite unhappy that there are no experimental results. Yet this is an interesting paper and good food for thought for all of us. Some notes:

* The consciousness mentioned in the paper is much less what one would think of as qualia, and more about access to the different representations.

* The terminology is not too difficult to understand. Suppose there is a representation of the brain at the current time, h_t. A representation RNN, F, is used to model this representation.

* The protagonist here is the consciousness RNN, C, which is used to model a consciousness state. What is a "consciousness state" then? It is actually a low-dimensional vector derived from the representation h_t.

* Now one thing to notice is that Bengio believes the consciousness RNN, C, should by itself include some kind of attention mechanism. That is, the kind of attention used in NMT (neural machine translation) these days should be involved. In a nutshell, C should "pay attention" only to the important details within this consciousness vector when it updates itself.

* I think the idea so far is already fairly interesting. In fact, the idea alone prompts one interesting thought: what if we initialize the consciousness vector randomly instead? In that case, a new representation of the brain appears. As a result, this mechanism mimics how the human brain explores different scenarios conjured up by imagination.

* Bengio's thought also encompasses a training method, which he calls the verifier network, V. The goal of the network is to match the current representation h_t with a previous consciousness state c_{t-k} (or states?). The training, as he envisions it, can use a variational autoencoder (VAE) or a GAN.

* So far the idea doesn't quite echo the human way of thinking. Humans seem to create high-level concepts, like symbols, to simplify our thinking. Bengio addresses this difficulty by suggesting we can use another network to generate what we mean from the consciousness state. He calls it U; perhaps we can call it a generation network. This network may well be implemented with a memory-augmented-network style of architecture, which distinguishes key/value pairs. In this case, we can map the consciousness state to more concrete symbols that a symbolic logic or knowledge representation framework can use... or that we humans can understand.

* This all sounds good, but as you may have heard from many readers of the paper, there are no experimental results. So this is really a theoretical paper.

* To be fair though, the good professor has outlined how each of the above 4 networks can actually be implemented. He also mentions how such an idea can be tested in practice. E.g. he believes one good arena is reinforcement learning tasks.
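
Since the paper itself has no code, here is a purely illustrative Python toy of the F and C pieces described above. Everything in it is my assumption, not Bengio's design: representation_step is a hypothetical stand-in for F, conscious_step is a hypothetical stand-in for C, and "attention" is reduced to simply keeping the k largest-magnitude elements of h_t. It only shows the shape of the idea: a high-dimensional h_t goes in, and a low-dimensional conscious state comes out.

```python
import math
import random


def representation_step(x_t, h_prev):
    """Toy stand-in for F: mix the previous state h_{t-1} with the input x_t."""
    return [math.tanh(0.5 * h + 0.5 * x) for h, x in zip(h_prev, x_t)]


def conscious_step(h_t, k=2, noise=0.0, rng=None):
    """Toy stand-in for C: attend to the k strongest elements of h_t.

    Returns a dict {index: value}, i.e. a low-dimensional view of h_t.
    noise > 0 randomly perturbs the attention scores, loosely mimicking
    the "random initialization" thought experiment (an assumption).
    """
    rng = rng or random.Random(0)
    scores = [abs(h) + noise * rng.random() for h in h_t]
    top_k = sorted(range(len(h_t)), key=lambda i: scores[i], reverse=True)[:k]
    return {i: h_t[i] for i in sorted(top_k)}


if __name__ == "__main__":
    h = [0.0, 0.0, 0.0, 0.0]
    for x in ([2.0, 0.0, -3.0, 0.1], [0.0, 1.0, -3.0, 0.0]):
        h = representation_step(x, h)
        print(conscious_step(h, k=2))
```

The verifier V and the generation network U are left out on purpose, since the paper gives even less detail about them.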

All in all, this is an interesting paper. It's a pity that the details are scanty at this point, but it's still quite worth your time to read.

A Read on "Dynamic Routing Between Capsules"

(Also check out Keran's discussion - very exciting! I might go on to write a blurb on the new capsule paper, which seems to be written by Hinton and friends.)

As you know, this is Hinton's new invention, the capsule algorithm. I have wanted to delve into this idea for a while, so here is a write-up. It's tl;dr, but I doubt I completely grok the idea anyway.

* The first mention of "capsule" is perhaps in the paper "Transforming Auto-encoders", which Hinton coauthored with his students.

* It's important to understand what capsules are trying to solve before you delve into the details. If you look at Hinton's papers and talks, the capsule is really an idea that improves upon the convnet. There are two major complaints from Hinton.

* First, the general setting of a convnet assumes that one filter is used across different locations. This is also known as "location invariance". In this setting, the exact location of a feature doesn't matter. That has a lot to do with robust estimation of the feature parameters. Weight sharing also drastically simplifies backprop.

* But location invariance also removes one important piece of information about an image: the apparent location.

* The second complaint is about max pooling. As you know, pooling usually removes a high percentage of information from a layer. In early architectures, pooling was usually the key to shrinking the size of a representation. Of course, later architectures have changed, but pooling is still an important component.

* So the design of capsules has a lot to do with tackling the problems of max pooling. Instead of losing information, can we "route" this information correctly so that it is put to optimal use? That's the thesis.

* Generally, a "capsule" represents a certain entity in an image, with properties "such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc". Notice that they are not hard-wired but automatically discovered.

* Then there is the question of how the low-level information can be "routed" to a higher level. The mechanism in the current implementation is intriguing:

* First, your goal is to calculate a softmax of the form

c_{ij} = exp(b_{ij}) / Sum_k exp(b_{ik})

where b_{ij} is the routing logit from a lower-level capsule i to a higher-level capsule j, and c_{ij} is the resulting softmax weight. This is something you can train.

* Then what you do is iteratively estimate b_{ij}. This appears in Procedure 1. The 4 steps are:

a, calculate the softmax weights c from b,
b, compute the prediction vectors from each capsule i, then form a weighted sum,
c, squash the weighted sum,
d, update b based on the squashed value and the prediction vectors.

* So why the squash function? My guess is that it normalizes the value computed in step b. According to Hinton, a good function is

v_j = ||s_j||^2 / (1 + ||s_j||^2) * s_j / ||s_j||
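
Here is a small plain-Python sketch of the four routing steps and the squash function above, as I understand them. It is not the authors' implementation: the names squash, softmax and route are mine, the prediction vectors u_hat are simply passed in as nested lists instead of being computed from learned weight matrices, and a real version would use a tensor library rather than loops.

```python
import math


def squash(s):
    """v = ||s||^2 / (1 + ||s||^2) * s / ||s||, so the output norm stays below 1."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(row):
    """Softmax over one row of logits, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, num_iters=3):
    """u_hat[i][j] is the prediction vector from lower capsule i for upper capsule j.

    Returns (v, c): the squashed upper-capsule outputs and the softmax
    weights used in the last iteration.
    """
    num_lower, num_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_upper for _ in range(num_lower)]
    for _ in range(num_iters):
        c = [softmax(row) for row in b]                      # step a
        v = []
        for j in range(num_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                   for d in range(dim)]                      # step b
            v.append(squash(s_j))                            # step c
        for i in range(num_lower):
            for j in range(num_upper):                       # step d: agreement
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


if __name__ == "__main__":
    # Both lower capsules agree on upper capsule 0 and disagree on upper capsule 1.
    u_hat = [[[1.0, 0.0], [1.0, 0.0]],
             [[1.0, 0.0], [-1.0, 0.0]]]
    v, c = route(u_hat)
    print("outputs:", v)
    print("weights:", c)
```

In the toy input, the two lower capsules agree about upper capsule 0 and cancel each other out on upper capsule 1, so the weights drift toward capsule 0 over the iterations. That is the "routing by agreement" intuition.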

* The rest of the architecture actually looks very much like a convnet. The first layer is a convolutional layer with ReLU activation.

* Would this work? The authors say yes. Not only does it reach state-of-the-art results on MNIST, it can also tackle more difficult tasks such as CIFAR-10 and SVHN. In fact, the authors found that on both tasks capsules already achieve better results than convnets did when they were first used to tackle these tasks.

* It also works well on two tasks called affNIST and MultiMNIST. The first is MNIST put through affine transformations; the second is MNIST regenerated with many overlapping digits. This is quite impressive, because you would need a lot of data augmentation and object detection effort to get these cases working.

* The part I have doubts about: is this model more complex than a convnet? If so, it's possible that we are just fitting a more complex model to get better results.

* A nice thing about the implementation: it's in TensorFlow, so we can expect to play with it in the future.

That's what I have so far. Have fun!

A Read on "Searching for Activation Functions"

(First published at AIDL-LD and AIDL-Weekly)

Perhaps the most interesting paper last week is the one on the Swish function. Here are some notes:

* Swish is extraordinarily simple. It's just
swish(x) = x * sigmoid(x).
* Derivative? swish'(x) = swish(x) + sigmoid(x) * (1 - swish(x)). Simple calculus.
* Can you tune it? Yes, there is a tunable version in which the parameter is trainable. It's called Swish-Beta, which is x * sigmoid(Beta * x).
* So here's the interesting part: why is it a "self-gating function"? If you understand LSTM, essentially it introduced a multiplication. The multiplier strengthens the gradient and effectively resolves the vanishing/exploding gradient problem. E.g. the input gate and forget gate give you weights for "how much you want to consider the input" and "how much you want to forget".
* So Swish is not too different: there is the activation function, but it is weighted by the input itself. Thus the term self-gating. In a nutshell, in plain English, "because we multiply".
* It's all good, but does it work? The experimental results look promising. It works on CIFAR-10 and CIFAR-100. On ImageNet, it beats the ReLU baselines on Inception-v2 and v3 when Swish replaces ReLU.
* It's worthwhile to point out that the latest Inception is v4. So the ImageNet number does not beat the SOTA even within Google, not to mention the best number in ImageNet 2016. But that shouldn't matter: if something consistently improves some models on ImageNet, it is a very good sign that it is working.
* Of course, looking at the activation function, it introduces a multiplication. So it does increase computation compared with a simple ReLU. And that seems to be the complaint I heard.
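
If you want to convince yourself of the formulas above, here is a tiny Python sketch (standard library only) that implements Swish, Swish-Beta and the closed-form derivative, and compares the derivative against a finite-difference estimate. The function names and the numeric_grad helper are mine, not from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """swish(x) = x * sigmoid(beta * x); beta = 1 gives the plain version."""
    return x * sigmoid(beta * x)


def swish_grad(x):
    """Closed-form derivative for beta = 1: swish(x) + sigmoid(x) * (1 - swish(x))."""
    return swish(x) + sigmoid(x) * (1.0 - swish(x))


def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
        print(x, swish(x), swish_grad(x), numeric_grad(swish, x))
```

Note the two extremes of the tunable version: with Beta = 0 it is just x / 2, and as Beta grows it gets closer and closer to ReLU.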

That's what I have. Enjoy!