GTC 2019 Write-Up Part 1: Keynotes


Inside every serious deep learning house, there is a cluster of machines.  And inside each machine, there is at least one GPU card.  Since Voci is a serious deep learning house, we ended up owning many GPU cards.

* * *

By now, no one would disagree that deep learning has reinvigorated the ASR industry.  Back in 2013, Voci was one of the earliest startups to adopt deep learning.  It was a time when Hinton's seminal paper [1] was still fresh.  Some brave souls at Voci, including Haozheng Li, Edward Lin and John Kominek, decided to just jump into this then-radically new approach.  My hybrid role, as part researcher and part software maintainer, also started then.  We did several other things at Voci, but none of them is as powerful as deep learning.

* * *

But I digress.  Where were we?

Yep, Voci has a lot of GPU cards.  At first you might have the impression that a GPU is more like a "parallelizable CPU".  But in reality, because a GPU is specifically made for high-performance computing applications such as graphics rendering, it has a very different design from a CPU.  If you are a C programmer, you can pick up the ideas of Compute Unified Device Architecture (or CUDA, as Nvidia loves acronyms).  But then your intuition, which was developed from years of programming CPUs (Intel or Intel-like), would be completely wrong.

We realized all this at Voci.  That's why part of our focus is to understand how GPUs work, and that's why both my boss, John Kominek, and I decided to travel to Silicon Valley and attend GTC 2019, which is short for GPU Technology Conference.

This article is Part I of my impressions of the keynote by Jensen Huang.  I will also take a look at the poster session as well as the various booths.  But I will leave the more technical ideas for the next post.

Huang's Keynotes


We love Jensen Huang!  He walks around the stage, enthusiastically explaining to an audience of ten thousand what's new with Nvidia.  So let's round up the top announcements.

  1. CUDA-X: More like a convergence of different technologies within Nvidia.  CUDA, as we know it now, is more like a programming language, whereas CUDA-X is more of an architecture term within Nvidia, which encompasses various technologies such as RTX, HPC, AI etc.
  2. Mellanox Acquisition: Once you look at it, the strength of Nvidia against its competitors is not just about the GPU cards.  Nvidia also provides an infrastructure that enables customers to build systems out of GPU cards.  So of course, the first thing you want to think about is how to use multiple machines, each with separate cards.  That explains the Mellanox deal (Infiniband?).  That also explains why Huang spent the lion's share of his time talking about data centers: how different containers talk to each other and how that generates traffic.  In a way, it is not just about the card; it is about the card and all its peripherals.  In fact, it is about the machines and their ecosystem.
  3. The T4 GPU: The way Nvidia markets it, the T4 is suitable for data centers which focus on AI.  Current benchmarks say a T4 is slower than a V100 but has higher energy efficiency.  So this year's big news on the server side is that AWS has now adopted the T4 in their GPU instances.
  4. Automatic Mixed Precision (AMP): What about news for us techies?  Well, the most interesting part is perhaps that AMP is now available to make use of Tensor Cores.  So why is precision so important?  Well, once you create a production system for either training or inference, the first thing you will realize is that it takes a lot of GPU memory.  How do you reduce it?  Reducing precision is one way to go.  But when you reduce precision, it's possible that the quality of your task (training or inference) will degrade.  So it's a tricky problem.  A couple of years ago, researchers figured out a couple of methods to deal with it.  You can implement them yourself, but Nvidia decided to offer them out of the box, built around Tensor Cores.
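To make the precision problem in item 4 a bit more concrete, here is a minimal sketch of loss scaling, one widely used trick in mixed-precision training.  It is not Nvidia's AMP implementation: it only uses Python's `struct` half-precision format to mimic FP16 rounding, and the names `to_fp16`, `LOSS_SCALE` and `tiny_grad`, as well as their values, are made up for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 1024.0  # hypothetical scale factor, a power of two
tiny_grad = 1e-8     # hypothetical gradient, too small for FP16 to represent

# Without scaling, the gradient underflows to zero once stored in FP16.
naive = to_fp16(tiny_grad)

# With loss scaling, the scaled gradient survives the FP16 round trip
# and is unscaled afterwards in higher precision.
scaled = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(naive, recovered)
```

The point of the sketch is only that a small gradient vanishes in half precision unless it is scaled up first, which is why reducing precision naively can hurt training quality.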

Oh, FYI, keynotes feel like a party. 



At a large conference like GTC, you can learn many interesting aspects of your technology.  Unlike a purely academic conference, GTC also has a trade-show side.  So what does it feel like?  Here are some impressions:

  1. All GPU peripherals: Once you get a GPU card, perhaps the bigger problem is how to install it and make it usable.  Do you think it's easy to do so?  It should be plug and play, right?  Nope.  In reality, working with physical GPU cards is a difficult technical problem.  Part of the issue is heat dissipation.  If you don't trust me, try to put a few consumer-grade GPU cards into the same box: you can use it as a heater in Boston!
    That's perhaps why so many vendors other than Nvidia are trying to get into the game of building GPU-based servers.  They probably make up one third of the booths at the show.
  2. Self-Driving Car/LiDAR: I don't envy my colleagues in the SDC industry.  When will we actually see Level 4 self-driving?  Anyway, people do want to see SDCs in the near future.  That's why you see all the SDC vendors show up at the conference.
  3. The Ecosystem: Finally, you also see demonstrations of various clouds which use GPUs.


Finally, here is a picture of donuts:

There are more than 100 vendors showcasing their AI products.  If you go to look at all the booths, you are going to get very hungry.

Wait for Part II!!!

Arthur Chan


[1] The paper was actually jointly written by researchers from Google, IBM and Microsoft back then.  Notice that these researchers were from separate (rival) groups and they seldom wrote joint papers, not to mention ones with ground-breaking results.

Resources on Speech Recognition

Unlike other deep learning topics, there are no ready-made video courses available on speech recognition.  So here is a list of other resources that you may find useful.


If you want to learn from online resources:

Useful E2E Speech Recognition Lecture

Important papers:

  • "Deep Neural Networks for Acoustic Modeling in Speech Recognition" by G. Hinton et al.
  • Supervised Sequence Labelling with Recurrent Neural Networks by Alex Graves

Resources on Understanding Heaps

Some assorted links for understanding heaps in user-land.

Resources on CUDA programming

Here is a list of resources for CUDA programming, in particular, in C.


Perhaps the best beginner's guide is the series written by Mark Harris, currently spanning 10 articles.  It starts from a simple HelloWorld-type example, but goes deeper and deeper into important topics such as data transfer optimization and shared memory.  The final 3 articles focus on optimizing real-life applications such as matrix transpose and the finite-difference method.

  1. An Easy Introduction to CUDA C and C++
  2. How to Implement Performance Metrics in CUDA C/C++
  3. How to Query Device Properties and Handle Errors in CUDA C/C++
  4. How to Optimize Data Transfers in CUDA C/C++
  5. How to Overlap Data Transfers in CUDA C/C++
  6. An Even Easier Introduction to CUDA
  7. Unified Memory for CUDA Beginners
  8. An Efficient Matrix Transpose in CUDA C/C++
  9. Finite Difference Methods in CUDA C/C++, Part 1
  10. Finite Difference Methods in CUDA C/C++, Part 2


A very important document on the internals of Nvidia chips as well as the CUDA programming model is the CUDA C Programming Guide.

In version 9, the document has around 90 pages of main content, with the remaining 210 pages being appendices.  I found it very helpful to read through the main content and look up the appendices from time to time.

The next useful document is the CUDA C Best Practices Guide.  You will find a lot of performance tuning tips in the guide.

If you want to profile a CUDA application, you must use nvprof and the Visual Profiler; you can find their manuals here.  Two other very good links to read are here and this one by Mark Harris.

If you want to read a very good textbook, consider reading "Professional CUDA C Programming", which I think is the best book on the topic.  You will learn what the authors call "profile-based programming", which is perhaps the best way to proceed in CUDA programming.



Inline PTX Assembly

cuBLAS: indispensable for linear algebra.  The original Nvidia documentation is good, but you may also find this little gem, "cuBLAS by example", useful.

Resources on ResNet


youtube video:


Quite related:

  • Convolutional Neural Networks at Constrained Time Cost (an interesting predecessor of the paper)
  • Highway networks

Unprocessed but Good:

  • multigrid tutorial
  • (Talk about ResNet, Wide ResNet and ResNeXt)
  • Wide Residual Networks
  • Aggregated Residual Transformations for Deep Neural Networks
  • Deep Networks with Stochastic Depth
  • Highway network:
  • Ablation study:
  • It's implemented in TF:
  • Wider or Deeper: Revisiting the ResNet Model for Visual Recognition:
  • Deep Residual Learning and PDEs on Manifold:
  • Is it really because of ensemble?
  • Multi-level Residual Networks from Dynamical Systems View
  • Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units
  • TinyImageNet
  • Predict Cortical Representation

Another summary:

A read on "ImageNet Training in Minutes"

Yes, you read it right: ImageNet training in 24 minutes. In particular, an AlexNet structure in 24 minutes and ResNet-50 in 60 minutes. For AlexNet, in fact, You's work breaks Facebook's previous record: 1 hour for AlexNet training. Last time I checked, my slightly optimized training with a single GPU took ~7 days. Of course, I'm curious how these ideas work, so this post is a summary.

* For the most part, this is not GPU work. It is mostly a CPU platform, accelerated by Intel Knights Landing (KNL) accelerators. Such accelerators are very suitable for HPC platforms, and there are a couple of supercomputers in the world which were built with 2000 to 10000 such CPUs.

* The gist of why KNL is good: it divides the on-chip processors and the memory well. So unlike many clusters you might encounter, with 8 to 16 processors, the memory bandwidth is much wider. Memory bandwidth is usually a huge bottleneck in training speed.

* Another important line of thought here is "Can you load more data per batch?", because that allows the calculation to be parallelized much more easily. The first author You's previous work already allowed the ImageNet batch size to go from the standard 256-512 to something like 8192. This thought has been around for a while, perhaps since Alex Krizhevsky. You's previous idea is based on adaptive calculation of the learning rate per layer, or Layer-wise Adaptive Rate Scaling (LARS).

* You then combined LARS with another insight from FB researchers: a slow warmup of the learning rate. That results in his current work. And it is literally 60% faster than the previous work.
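
For readers who prefer code, the two ingredients above can be sketched roughly as follows. This is my own minimal sketch, not the authors' code: it assumes the commonly quoted LARS form, where a layer's local learning rate is proportional to the ratio of its weight norm to its gradient norm, and a simple linear warmup. All function names and default values (`trust_coef`, `weight_decay`, `warmup_steps`) are hypothetical.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005):
    """Layer-wise rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # fall back to the global rate for degenerate layers
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)

def warmup_lr(step, base_lr, warmup_steps):
    """Ramp the global learning rate linearly up to base_lr, then hold it."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def lars_sgd_step(weights, grads, step, base_lr=1.0, warmup_steps=5,
                  trust_coef=0.001, weight_decay=0.0005):
    """One plain SGD update for a single layer, using warmup times LARS rate."""
    lr = (warmup_lr(step, base_lr, warmup_steps)
          * lars_local_lr(weights, grads, trust_coef, weight_decay))
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

layer_weights = [3.0, 4.0]   # hypothetical layer with two weights
layer_grads = [0.6, 0.8]
print(lars_sgd_step(layer_weights, layer_grads, step=0))
```

The design point is that the step size of each layer is tied to the size of its own weights, so a layer with unusually large gradients does not blow up when the batch size (and hence the global learning rate) is pushed very high.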

Given what we know, it's conceivable that training can be even faster in the future. What has been blocking people seems to be 1) the number of CPUs within a system and 2) how large a batch size can be loaded. And I bet after FB reads You's paper, there will be another batch of improvements as well. How about that? Don't you love competition in deep learning?

A Read on "The Consciousness Prior" By Prof. Yoshua Bengio

Here are some notes after reading Prof. Yoshua Bengio's "The Consciousness Prior". I know many of you, like Stuart Gray, were quite unhappy that there are no experimental results. Yet this is an interesting paper and good food for thought for all of us. Here are some notes:

* The consciousness mentioned in the paper is much less what we would think of as qualia and more about access to different representations.

* The terminology is not too difficult to understand. Suppose there is a representation of the brain at the current time, h_t; a representation RNN, F, is used to model such a representation.

* The protagonist here is the consciousness RNN, C, which is used to model a consciousness state. What is a "consciousness state" then? It is actually a low-dimensional vector derived from the representation h_t.

* Now one thing to notice is that Bengio believes that the consciousness RNN, C, should by itself include some kind of attention mechanism. That means the kind of attention used in neural machine translation (NMT) these days should be involved. In a nutshell, C should "pay attention" to only the important details within this consciousness vector when it updates itself.

* I think the idea so far is already fairly interesting. In fact, it leads to one interesting thought: what if we just initialize the consciousness vector randomly instead? In that case, a new representation of the brain appears. As a result, this mechanism mimics how human brains explore different scenarios conjured up by imagination.

* Bengio's thought also encompasses a training method which he calls the verifier network, V. The goal of the network is to match the current representation h_t with a previous consciousness state c_{t-k} (states?). The training, as he envisions it, can use a variational autoencoder (VAE) or a GAN.

* So far the idea doesn't quite echo the human way of thinking. Humans seem to create high-level concepts, like symbols, to simplify thinking. Bengio addresses this difficulty by suggesting we can just use another network to generate what we mean from the consciousness state. He calls it U; perhaps we can call it the generation network. This network may well be implemented with a memory-augmented-network style of architecture which distinguishes key/value pairs. In this case, we can map the consciousness state to more concrete symbols which a symbolic logic or knowledge representation framework can use. ... Or we humans can also understand this consciousness representation.

* This all sounds good, but as you may have heard from many readers of the paper, there are no experimental results. So this is really a theoretical paper.

* To be fair though, the good professor has outlined how each of the above 4 networks can actually be implemented. He also mentions how such ideas can be tested in practice, e.g. he believes one good arena is reinforcement learning tasks.
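
Since the paper has no code, here is only my own toy reading of the "attention picks a low-dimensional state out of h_t" idea from the notes above. It is a sketch under heavy assumptions, not Bengio's formulation: the attention scores are simply given as input, the selection is a hard top-k, and the names `conscious_state`, `attention_scores` and `k` are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conscious_state(h_t, attention_scores, k=2):
    """Keep only the k elements of h_t with the highest attention weight.

    Returns (index, value) pairs, i.e. a small "conscious" summary of the
    much larger representation h_t.
    """
    weights = softmax(attention_scores)
    top = sorted(range(len(h_t)), key=lambda i: weights[i], reverse=True)[:k]
    return [(i, h_t[i]) for i in sorted(top)]

h_t = [0.5, -1.2, 3.0, 0.1]      # hypothetical full representation
scores = [0.1, 2.0, 1.5, -3.0]   # hypothetical attention scores
print(conscious_state(h_t, scores, k=2))
```

The only thing the sketch tries to show is the shape of the idea: a large state goes in, and a handful of attended elements come out.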

All in all, this is an interesting paper; it's a pity that the details are scanty at this point. But it's still quite worth your time to read.

A Read on "Dynamic Routing Between Capsules"

(Also check out Keran's discussion - very exciting! I might go on to write a blurb on the new capsule paper, which seems to be written by Hinton and friends.)

As you know, this is Hinton's new invention, the capsule algorithm. I have wanted to delve into this idea for a while. So here is a write-up. It's a tl;dr, and I doubt I completely grok the idea anyway.

* The first mention of "capsule" is perhaps in the paper "Transforming Auto-encoders" which Hinton and students coauthored.

* It's important to understand what capsules are trying to solve before you delve into the details. If you look at Hinton's papers and talks, the capsule is really an idea which improves upon the ConvNet. There are two major complaints from Hinton.

* First, the general setting of a ConvNet assumes that one filter is used across different locations. This is also known as "location invariance". In this setting, the exact location of a feature doesn't matter. That has a lot to do with robust estimation of feature parameters. Weight sharing also drastically simplifies backprop.

* But location invariance also removes one important piece of information about an image: the apparent location.

* The second assumption is max pooling. As you know, pooling usually removes a high percentage of the information from a layer. In early architectures, pooling was usually the key to shrinking the size of a representation. Of course, later architectures have changed, but pooling is still an important component.

* So the design of capsules has a lot to do with tackling the problems of max pooling. Instead of losing information, can we "route" this information correctly so that it is put to optimal use? That's the thesis.

* Generally a "capsule" represents a certain entity in an image, with properties "such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc". Notice that these are not hard-wired; they are automatically discovered.

* Then there is the question of how the low-level information can be "routed" to the higher level. The mechanism in the current implementation is intriguing:

* First, your goal is to calculate a softmax of the form

c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}

where b_{ij} is the logit that couples a lower-level capsule i to a higher-level capsule j. This is something you can train.

* Then what you do is iteratively estimate b_{ij}. This appears in Procedure 1. The 4 steps are:

a. calculate the softmax weights from b,
b. compute the prediction vectors from each capsule i, then form their weighted sum,
c. squash the weighted sum,
d. update b based on the agreement between the squashed value and the prediction vectors.

* So why the squash function? My guess is that it normalizes the weighted sum computed in step b. According to Hinton, a good function is

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}

* The rest of the architecture actually looks very much like a ConvNet. The first layer is a convolutional layer with ReLU activation.

* Would this work? The authors say yes. Not only does it reach state-of-the-art results on MNIST, it can also tackle more difficult tasks such as CIFAR-10 and SVHN. In fact, the authors found that on both tasks they already achieve better results than ConvNets did when they were first used to tackle these tasks.

* It also works well on two tasks called affNIST and MultiMNIST. The first is MNIST put through affine transforms; the second is MNIST regenerated with many overlapping digits. This is quite impressive, because you would otherwise need a lot of data augmentation and object detection effort to get these cases working.

* The part I have doubts about: is this model more complex than a ConvNet? If so, it's possible that we are just fitting a more complex model to get better results.

* A nice thing about the implementation: it's in TensorFlow, so we can expect to play with it in the future.
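
While waiting for the real thing, the four routing steps and the squash function described above can be sketched in a few lines of plain Python. This is only my toy sketch, not the authors' TensorFlow code: the prediction vectors are passed in directly rather than computed from learned weights, and the names `route`, `u_hat` and `iterations` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def squash(s):
    """Shrink vector s so its length is ||s||^2 / (1 + ||s||^2), same direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0 for _ in s]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for higher capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]   # routing logits start at zero
    c, v = [], []
    for _ in range(iterations):
        # a. softmax over the logits of each lower-level capsule
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # b. weighted sum of the prediction vectors
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            # c. squash the weighted sum
            v.append(squash(s_j))
        # d. raise the logit where prediction and output agree (dot product)
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Two lower capsules agree on higher capsule 0 and disagree on capsule 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
outputs, couplings = route(u_hat)
print(outputs, couplings)
```

In this toy example the two lower capsules agree about the first higher-level capsule, so the routing should shift their coupling towards it over the iterations.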

That's what I have so far. Have fun!

A Read on "Searching for Activation Functions"

(First published at AIDL-LD and AIDL-Weekly)

Perhaps the most interesting paper last week is the Swish function. Here are some notes:

* Swish is extraordinarily simple. It's just
swish(x) = x * sigmoid(x).
* Derivative? swish'(x) = swish(x) + sigmoid(x) * (1 - swish(x)). Simple calculus.
* Can you tune it? Yes, there is a tunable version in which the parameter is trainable. It's called Swish-Beta, which is x * sigmoid(Beta * x).
* So here's the interesting part: why is it a "self-gating function"? If you understand LSTM, it essentially introduces a multiplication. The multiplier strengthens the gradient and effectively resolves the vanishing/exploding gradient problem. E.g. the input gate and forget gate give you weights for "how much you want to consider the input" and "how much you want to forget".
* So Swish is not too different: there is the activation function, but it is weighted by the input itself. Thus the term self-gating. In a nutshell, in plain English, "because we multiply".
* It's all good, but does it work? The experimental results look promising. It works on CIFAR-10 and CIFAR-100. On ImageNet, Inception-v2 and v3 do better when Swish replaces ReLU.
* It's worthwhile to point out that the latest Inception is v4. So the ImageNet number is not beating SOTA even within Google, not to mention the best number in ImageNet 2016. But that shouldn't matter: if something consistently improves some models on ImageNet, it is a very good sign that it is working.
* Of course, looking at the activation function, it introduces a multiplication. So it does increase computation when compared with a simple ReLU. And that seems to be the complaint I heard.
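
If you want to check the formulas above yourself, here is a small sketch in plain Python. The helper names (`swish_grad`, `numeric_grad`) are mine, and the finite-difference check is just a sanity test of the derivative for Beta = 1, not anything from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """Swish-Beta: x * sigmoid(beta * x); beta = 1 gives plain Swish."""
    return x * sigmoid(beta * x)

def swish_grad(x):
    """Analytic derivative for beta = 1: swish(x) + sigmoid(x) * (1 - swish(x))."""
    return swish(x) + sigmoid(x) * (1.0 - swish(x))

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only to sanity-check the formula."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(x, swish(x), swish_grad(x), numeric_grad(swish, x))
```

The analytic and numeric columns should agree closely, and you can also see the extra multiplication that the complaint about computation refers to.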

That's what I have. Enjoy!

A read on "Unsupervised Machine Translation Using Monolingual Corpora Only"

(First published on AIDL-LD and AIDL Weekly.)

"This is an impressive paper by FAIR authors which claims that one only needs monolingual corpora to train a usable translation model. So how does it work? Here are some notes.

* For starters, indeed you don't need a parallel corpus, but you still need a bidirectional dictionary to generate translations. You also need monolingual corpora in both languages. That's why the title is about monolingual corpora (plural) and not a monolingual corpus (singular).

* Then, there is the issue of how you actually create a translation. It's actually much simpler than you might think. First imagine there is a latent language to which both your source and target languages are mapped.

* How do you train? Let's just use the source language as an example first. What you can do is create an encoder-decoder architecture which translates your source into the latent space, then translates it back. Using the BLEU score, you can then set up an optimization criterion.

* Now this doesn't quite do the translation yet. So you apply the same procedure to both the source and target languages. Don't you now have a common latent space? In actual translation, what you need to do is first map the source language into the common latent space, then map it out to the target language.

* Many of you might recognize that such an encoder-decoder scheme, which maps a language to itself, is very similar to an autoencoder. Indeed, the authors actually use a version of the autoencoder, the denoising autoencoder (dA), to train the model.

* The final interesting idea I spotted is iterative training. You can imagine that you first train an initial translator, then use its output as the truth and retrain another one. The authors found a tremendous gain in BLEU score in the process.

* The results are stunning if you consider that no parallel corpus is involved. The BLEU score is around 10 points lower, but do remember: deep learning has pretty much improved BLEU scores by an absolute 7-8 points anyway over the classical phrase-based translation models."
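
To give a feel for the "denoising" part of the denoising autoencoder mentioned above, here is a small sketch of how a sentence might be corrupted before the model is asked to reconstruct it. The specific noise (randomly dropping words and lightly shuffling the rest) is my assumption of a typical scheme, and the names `add_noise`, `p_drop` and `k` and their defaults are hypothetical.

```python
import random

def add_noise(words, p_drop=0.1, k=3, rng=None):
    """Corrupt a sentence: drop each word with probability p_drop, then
    shuffle locally so that no surviving word moves more than k positions."""
    rng = rng or random.Random()
    kept = [w for w in words if rng.random() >= p_drop]
    # Add a random offset in [0, k + 1) to each position and sort by it:
    # words can only swap places with close neighbours.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]

sentence = "the cat sat on the mat near the door".split()
print(add_noise(sentence, rng=random.Random(0)))
```

The autoencoder is then trained to recover the clean sentence from the noisy one, which stops it from simply copying its input word by word.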