All posts by grandjanitor

AIDL Weekly Issue 4: K for Kaggle, Jetson TX2 and DeepStack

Thoughts From Your Humble Curators

Three big pieces of news last week:

  1. Google acquired Kaggle.
  2. The Jetson TX2 was out.
  3. Just like its rival Libratus, DeepStack made headlines for beating human poker pros.

In this Editorial, though, we want to bring to your attention a little paper titled "Stopping GAN Violence: Generative Unadversarial Networks". After one minute of reading, you would quickly notice that it is a fake paper. But to our dismay, there are newsletters that treated the paper as a serious one. It's obvious that those "editors" hadn't really read the original paper.

It is another proof point that the current deep learning space is over-hyped. Something similar happened with Rocket AI. You can get a chuckle out of it, but if over-done, the hype could also over-correct when expectations aren't met.

Perhaps more importantly, as a community we should spend more conscious effort to fact-check and research a source before we share it. We at AIDL Weekly follow this philosophy religiously, and all sources we include are carefully checked - that's why our newsletter stands out in the crowd of AI/ML/DL newsletters.

If you like what we are doing, check out our FB group and our YouTube channel.

And of course, please share this newsletter with friends so they can subscribe too.

Artificial Intelligence and Deep Learning Weekly

Blog Posts

Open Source


Member's Question

Question from an AIDL Member

Q. (Rephrased from a question asked by Flávio Schuindt) I've been studying classification problems with deep learning and now I can understand it quite well. Activation functions, regularizers, cost functions, etc. Now, I think it's time to step forward. What I am really trying to do now is enter the deep learning image segmentation world. It's a more complicated problem than classification (object occlusion, lighting variations, etc.). My first question is: how can I approach this kind of problem? [...]

A. You have hit one of the toughest (but hottest) problems in deep-learning-based image processing. Many people confuse problems such as image detection/segmentation with image classification. Here are some useful notes.

  1. First of all, have you watched Karpathy's 2016 cs231n lectures 8 and 13? Those lectures should be your starting points for working on segmentation. Notice that image localization/detection/segmentation are 3 different things. Localization and detection find bounding boxes, and their techniques/concepts can be helpful for "instance segmentation". "Semantic segmentation" requires a downsampling/upsampling architecture. (See below.)
  2. Is your problem more a "semantic segmentation" problem or an "instance segmentation" problem? (See cs231n lecture 13.) The former comes up with regions of different meaning; the latter comes up with instances.
  3. Are you identifying something which always appears? If that's the case, you don't have to use clunky detection techniques; treat it as a localization problem, and you can solve it by backprop with a simple loss function (as described in cs231n lecture 8). If the object might or might not appear, then a detection-type pipeline might be necessary.
  4. If you do need a detection-type pipeline, do standard segment-proposal techniques work for your domain? This is crucial, because at least at the beginning of your segmentation research, you will have to find segment proposals.
  5. Lastly, if you decide this is really a semantic segmentation problem, then most likely your major task is to adapt an existing pre-trained network. Very likely your goal is transfer learning. Of course, check my point 2 and see if this is really the case.
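To make the downsampling/upsampling idea in point 1 concrete, here is a toy numpy sketch. This is not a real network: max pooling and nearest-neighbour interpolation stand in for learned convolution/deconvolution layers. It only illustrates how a semantic-segmentation architecture shrinks the input into coarse features and then restores a per-pixel output at the original resolution:

```python
import numpy as np

def downsample(x):
    # 2x2 max pooling with stride 2 (stand-in for the "encoder" half)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample(x):
    # nearest-neighbour upsampling (stand-in for the "decoder" half)
    return x.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)   # a fake 8x8 "image"
feat = downsample(downsample(img))               # 8x8 -> 2x2 coarse features
out = upsample(upsample(feat))                   # 2x2 -> back to 8x8

# A semantic-segmentation net must emit one prediction per input pixel
assert out.shape == img.shape
```

In a real architecture (e.g. a fully convolutional network) the downsampling path learns features and the upsampling path learns to place class labels back onto every pixel; the shape bookkeeping above is the part people usually get wrong first.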


AIDL Weekly Issue 3



Thoughts From Your Humble Curators

What a week in Machine Learning! Last week we saw Waymo's high-profile lawsuit against Uber, as well as perhaps the first API against online trolling, from Jigsaw. Both events got a lot of media coverage, and both are featured in our News section along with our analysis.

In exciting news: the GTX 1080 Ti arrived yesterday and is featured in this issue. Its spec is more impressive than the $1.2k Titan X, yet it costs only $699.

In other news, you might have heard of DeepCoder in the last few weeks, and how it purportedly steals and integrates code from other repos. Well, those claims are overblown. We feature a piece from Stephen Merity which debunks the hype.

One must-see this week is Kleiner Perkins' Mike Abbott's interview with Prof. Fei-Fei Li of Stanford. The discussion of how A.I. startups can compete with the larger incumbents is definitely worth watching.

As always, check out our FB group and our YouTube channel.

And of course, please share this newsletter with friends so they can subscribe too.





AIDL Weekly Issue 2 - Gamalon/Batch Renormalization/TF 1.0/Oxford Deep NLP



Thoughts From Your Humble Curators

How do you create a good A.I. newsletter? What first comes to everybody's mind is to simply aggregate a lot of links. This is very common in deep learning resource lists, say "Cool List of XX in Deep Learning". Our experience is that you usually have to sift through 100-200 links and decide which ones are useful.

We believe there is a better way: in AIDL Weekly, we only choose important news and always provide detailed analysis of each item. For example, here we take a look at the newsworthy Gamalon, which is known for using a ground-breaking method to outperform deep learning and recently won a defense contract. What is the basis of its technology? We cover this in a deep dive in the "News" section.

Or you can take a look at the exciting development of batch renormalization, which tackles batch normalization's current shortcomings. Anyone who uses normalization in training will likely benefit from the paper.

Last week, we also saw the official release of Tensorflow 1.0 as well as the official 2017 Tensorflow summit. We prepared two good links so that you can follow along. If you love deep learning with NLP, you might also want to check out the new course from Oxford.

As always, check out our FB group and our YouTube channel, and of course subscribe to this newsletter.





Member's Question

Question from an AIDL Member

Q: (Rephrased) I am trying to learn the following languages (...) to intermediate level, and the following languages (...) to professional level. Would this be helpful for my career in Data Science/Machine Learning? I have a mind to work on deep learning.

This is a variation of a frequently asked question, in a nutshell: "How much programming should I learn if I want to work on deep learning?" The question itself shows misconceptions about programming and machine learning, so we include it in this issue. This is my (Arthur's) take:

  1. First things first: usually you decide which package to work on, and if the package uses language X, then you go learn language X. E.g. if I want to hack the Linux kernel, I need to know C and learn Linux system calls, and perhaps some assembly language. Learning programming is a means to achieve a goal. Echoing J.T. Bowlin's point, a programming language is like a natural language: you can always learn more, but there is a point where it becomes unnecessary.
  2. Then you ask what language should be used to work on deep learning. I will say mathematics, because once you understand the Greek symbols, you can translate all those symbols to code (approximately). So if you ask me what you need to learn to hack Tensorflow, "mathematics" would be the first answer. Yes, the package is written in Python/C++/C, but those wouldn't even be close to my top-5 answers, because if you don't know what backprop is, knowing how a C++ destructor works can't make you an expert in TF.
  3. The final thing is the term "level" you mentioned. What does this "level" mean? Is it like a chess or go rating, where someone with a higher rating will have a better career in deep learning? That might work for competitive programming... but real-life programming doesn't work that way. Real-life programming means you can read and write complex programs. E.g. in C++, you use a class instead of repeating a function implementation many times, to reduce duplication. The same goes for templates. That's why classes and templates are important concepts, and people debate their usage a lot. How can you assign "levels" to such skills?
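To illustrate point 2 - that the math translates almost directly into code - here is the gradient-descent update w ← w − η·∂L/∂w for a single logistic neuron with squared-error loss, written out in numpy. This is a hypothetical toy example, not taken from any particular package:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature, one logistic neuron, squared-error loss
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
w, lr = 0.0, 0.5

for _ in range(200):
    p = sigmoid(w * x)                       # forward pass
    # chain rule, verbatim: dL/dw = sum((p - y) * p * (1 - p) * x)
    grad = np.sum((p - y) * p * (1 - p) * x)
    w -= lr * grad                           # gradient descent step

final_loss = np.sum((sigmoid(w * x) - y) ** 2)
```

If you can read the two commented lines as equations, you already know the part of the "language" that matters for deep learning; the Python syntax around them is the easy bit.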

Lastly, I would say if you seriously want to focus on one language, consider Python, but always learn a new programming language yearly. Also pick up some side-projects; both your job and your side-projects will usually give you ideas about which language you should learn next.


©2017-2019 Artificial Intelligence and Deep Learning Weekly


AIDL Weekly Issue 1 - First AIDL Weekly



Thoughts From Your Humble Curators

When Waikit Lau and I (Arthur Chan) started the Facebook group Artificial Intelligence and Deep Learning (AIDL) last April, we had no idea it would become a group with 9000+ members, and still growing fast. (We added 1k members in the last 7 days alone.)

We suspect this is just the beginning of a long, curvy road toward a new layer of intelligence that can be applied everywhere. The question is: how do we start? That was the first thing we realized back in late 2015: facing literally tens of thousands of links, tutorials etc., it was like drinking from a firehose, and we had a hard time picking out the gems.

We decided to start our little AIDL group to see if a community could help make sense of the velocity of information. In less than one year, AIDL became the most active A.I. and deep learning group on Facebook. We summarize, analyze, educate and disseminate, and I think we have done a good job so far; this has resulted in conversations flourishing in the group. We strive to have discussions one level deeper than others. For example, forum members, including us, have fact-checked several pieces of news related to deep learning. This gives us an edge in the rapidly changing field of A.I.

This newsletter follows exactly the same philosophy as our forum. We hope to summarize, analyze, educate and disseminate. We will keep an eye on the latest and most salient developments and present them in a coherent fashion to your mailbox.

We sincerely hope that AIDL will be helpful to your career or studies. Please share our newsletter with your friends. Also check out our Youtube channel here.


Your Humble Curators, Arthur and Waikit







GTC 2019 Write-Up Part 1: Keynotes


Inside every serious deep learning house, there is a cluster of machines. And inside each machine, there is at least one GPU card. Since Voci is a serious deep learning house, we end up owning many GPU cards.

* * *

By now, no one would disagree that deep learning has reinvigorated the ASR industry. Back in 2013, Voci was one of the earliest startups to adopt deep learning. It was the time when Hinton's seminal paper [1] was still fresh. Some brave souls at Voci, including Haozheng Li, Edward Lin and John Kominek, decided to just jump to this then-radically-new approach. My hybrid role, part researcher and part software maintainer, also started then. We did several other things at Voci, but none of them was as powerful as deep learning.

* * *

But I digress.  Where were we?

Yep, Voci has a lot of GPU cards. At first you might have the impression that a GPU is like a "parallelizable CPU". The reality is that, because a GPU is specifically made for high-performance computing applications such as graphics rendering, it has a very different design from a CPU. If you are a C programmer, you can pick up the ideas of the Compute Unified Device Architecture (or CUDA, as Nvidia loves acronyms). But the intuition you developed from years of programming CPUs (Intel or Intel-like) would be completely wrong.

We realized all this at Voci; that's why part of our focus is understanding how GPUs work, and that's why both my boss, John Kominek, and I decided to travel to Silicon Valley to attend GTC 2019, short for GPU Technology Conference.

This article is Part I of my impressions, covering the keynote by Jensen Huang; I will also take a look at the poster session as well as the various booths. I will leave the more technical ideas to the next post.

Huang's Keynote


We love Jensen Huang! He walks around the stage, enthusiastically explaining to the ten-thousand-strong audience what's new at Nvidia. Let's round up the top announcements:

  1. CUDA-X: More like a convergence among different technologies within Nvidia. CUDA, as we know it now, is more like a programming language, whereas CUDA-X is an architecture term within Nvidia which encompasses various technologies, such as RTX, HPC and AI.
  2. Mellanox Acquisition: Once you look at it, the strength of Nvidia against its competitors is not just the GPU cards. Nvidia also provides infrastructure that enables customers to build systems out of GPU cards. So of course, the first thing you want to think about is how to use multiple machines, each with separate cards. That explains the Mellanox deal (InfiniBand). It also explains why Huang spent the lion's share of his time talking about data centers: how different containers talk to each other and how that generates traffic. In a way, it is not just about the card; it is about the card and all its peripherals. In fact, it is about the machines and their ecosystem.
  3. The T4 GPU: The way Nvidia markets it, the T4 is suitable for data centers which focus on AI. Current benchmarks say a T4 is slower than a V100 but has higher energy efficiency. So this year's big news on the server side is that AWS has adopted the T4 in its GPU instances.
  4. Automatic Mixed Precision (AMP): What about news for us techies? The most interesting part is perhaps that AMP is now supported directly on Tensor Cores. Why is precision so important? Once you create a production system for either training or inference, the first thing you realize is that it takes a lot of GPU memory. How do you reduce it? Reducing precision is one way to go. But when you reduce precision, the quality of your task (training or inference) may degrade. So it's a tricky problem. A couple of years ago, researchers figured out a few methods. You can implement them yourself, but Nvidia decided to build the support in directly.
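To see why reduced precision both helps and hurts, here is a small numpy illustration of the trade-off AMP is designed to manage. This is a generic sketch, not Nvidia's actual Tensor Core implementation: halving the precision halves the memory, but tiny gradient values can be lost entirely (which is why mixed-precision schemes keep some computations in float32 and use loss scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
weights32 = rng.standard_normal((1024, 1024)).astype(np.float32)
weights16 = weights32.astype(np.float16)

# Half precision uses exactly half the memory...
print(weights32.nbytes, weights16.nbytes)  # 4194304 2097152

# ...but a tiny gradient update underflows to zero in float16
tiny_grad16 = np.float16(1e-8)
tiny_grad32 = np.float32(1e-8)
print(tiny_grad16)  # 0.0 -- the update is silently lost
```

The float16 representable range bottoms out around 6e-8, so any smaller update vanishes; float32 keeps it. That vanishing is exactly the "quality degradation" the article mentions.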

Oh, FYI, the keynote feels like a party.



At a large conference like GTC, you can learn about many interesting aspects of your technology. Unlike a pure academic conference, GTC also has the aspect of being a trade show. So how does it feel? Here are some impressions:

  1. All the GPU peripherals: Once you get a GPU card, perhaps the bigger problem is how to install it and make it usable. Do you think that's easy? It should be plug and play, right? Nope; in reality, working with GPU hardware is a difficult technical problem. Part of the issue is heat dissipation. If you don't trust me, try to put a few consumer-grade GPU cards into the same box; you can use it as a heater in Boston!
    That's perhaps why so many vendors other than Nvidia are trying to get into the game of building GPU-based servers. They made up probably one third of the booths at the show.
  2. Self-driving cars/LiDAR: I don't envy my colleagues in the SDC industry. When will we actually see Level 4 self-driving? Anyway, people do want to see SDCs in the near future. That's why you see all the SDC vendors show up at the conference.
  3. The ecosystem: Finally, you also see demonstrations from the various clouds which use GPUs.


Finally, here is a picture of donuts. There are more than 100 vendors showcasing their AI products. If you go to look at all the booths, you are going to get very hungry.

Wait for Part II!!!

Arthur Chan


[1] The paper was actually jointly written by researchers from Google, IBM and Microsoft. Notice that these researchers were from separate (rival) groups who seldom wrote joint papers, let alone ones with ground-breaking results.

Resources on Speech Recognition

Unlike other deep learning topics, there are no ready-made video courses available on speech recognition. So here is a list of other resources that you may find useful.


If you want to learn from online resources:

Useful E2E Speech Recognition Lecture

Important papers:

  • "Deep Neural Networks for Acoustic Modeling in Speech Recognition" by G. Hinton et al.
  • "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves

Resources on Understanding Heaps

Some assorted links for understanding heaps in user-land:

Resources on CUDA programming

Here is a list of resources for CUDA programming, in particular, in C.


Perhaps the best beginner's guide is the series written by Mark Harris, currently spanning 10 articles. They start from simple Hello-World-type examples but go deeper and deeper into important topics such as data transfer optimization and shared memory. The final three articles focus on optimizing real-life applications such as matrix transpose and the finite-difference method.

  1. An Easy Introduction to CUDA C and C++
  2. How to Implement Performance Metrics in CUDA C/C++
  3. How to Query Device Properties and Handle Errors in CUDA C/C++
  4. How to Optimize Data Transfers in CUDA C/C++
  5. How to Overlap Data Transfers in CUDA C/C++
  6. An Even Easier Introduction to CUDA
  7. Unified Memory for CUDA Beginners
  8. An Efficient Matrix Transpose in CUDA C/C++
  9. Finite Difference Methods in CUDA C/C++, Part 1
  10. Finite Difference Methods in CUDA C/C++, Part 2
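As a taste of what the "performance metrics" article in the list covers: effective bandwidth is just bytes moved divided by elapsed time. A quick back-of-the-envelope calculation, with made-up numbers for a hypothetical SAXPY-like kernel, can be done in a few lines of Python:

```python
def effective_bandwidth_gbs(n_elements, bytes_per_element, reads, writes, seconds):
    """Effective bandwidth in GB/s: (bytes read + bytes written) / time."""
    total_bytes = n_elements * bytes_per_element * (reads + writes)
    return total_bytes / seconds / 1e9

# Hypothetical kernel: 2^20 float32 values, 2 reads + 1 write each, 0.05 ms
bw = effective_bandwidth_gbs(2**20, 4, reads=2, writes=1, seconds=5e-5)
print(round(bw, 1))  # 251.7 (GB/s)
```

Comparing this number against the card's theoretical peak bandwidth tells you whether a memory-bound kernel has room left to optimize, which is the central move in the profile-based workflow the articles teach.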


A very important document on the internals of Nvidia chips as well as the CUDA programming model is the CUDA C Programming Guide.

In version 9, the document has around 90 pages of content, with the remaining 210 pages being appendices. I found it very helpful to read through the content and look up the appendices from time to time.

The next useful document is the CUDA Best Practices Guide. You will find a lot of performance tuning tips in it.

If you want to profile a CUDA application, use nvprof and the Visual Profiler; you can find their manuals here. Two other very good links to read are here and this one by Mark Harris.

If you want a very good textbook, consider "Professional CUDA C Programming", which I think is the best book on the topic. You will learn what the author calls "profile-based programming", which is perhaps the best way to proceed in CUDA programming.



Inline PTX Assembly

CuBLAS: indispensable for linear algebra. The original Nvidia documentation is good, but you may also find the little gem "cuBLAS by example" useful.

Resources on ResNet


youtube video:


Quite related:

  • Convolutional Neural Networks at Constrained Time Cost (an interesting predecessor of the paper)
  • Highway Networks

Unprocessed but good:

  • Multigrid tutorial
  • A talk about ResNet, Wide ResNet and ResNeXt
  • Wide Residual Networks
  • Aggregated Residual Transformations for Deep Neural Networks
  • Deep Networks with Stochastic Depth
  • Ablation study
  • A TensorFlow implementation
  • Wider or Deeper: Revisiting the ResNet Model for Visual Recognition
  • Deep Residual Learning and PDEs on Manifold
  • Is it really because of ensembles?
  • Multi-level Residual Networks from Dynamical Systems View
  • Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units
  • TinyImageNet
  • Predict Cortical Representation
Another summary:

A read on "ImageNet Training in Minutes"

Yes, you read it right: ImageNet training in 24 minutes. In particular, an AlexNet structure in 24 minutes and ResNet-50 in 60 minutes. For AlexNet, You's work breaks the previous Facebook record of 1 hour for AlexNet training. Last time I checked, my slightly-optimized training with a single GPU took ~7 days. Of course, I'm curious how these ideas work, so this post is a summary.

* For the most part, this is not GPU work. This is mostly a CPU platform accelerated by the Intel Knights Landing (KNL) processor. Such accelerators are very suitable for HPC platforms, and there are a couple of supercomputers in the world built with 2,000 to 10,000 such CPUs.

* The gist of why KNL is good: it divides the on-chip processors and the memory well. So unlike many clusters you might encounter with 8 to 16 processors, memory bandwidth is much wider. That is usually a huge bottleneck in training speed.

* Another important line of thought here is "can you load more data per batch?", because that allows the calculation to be parallelized much more easily. The first author You's previous work already allowed the ImageNet batch size to go from the standard 256-512 to something like 8192. This thought has been around for a while, perhaps since Alex Krizhevsky. You's previous idea is based on adaptive calculation of the learning rate per layer, or Layer-wise Adaptive Rate Scaling (LARS).

* You then combined LARS with another insight from FB researchers: a slow warmup of the learning rate. That results in the current work, which is literally 60% faster than the previous record.
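The LARS rule described above can be sketched in a few lines: each layer gets its own trust ratio ||w|| / ||∇w||, scaled by a global learning rate. This is a simplified sketch of the idea in You's paper, ignoring weight decay and momentum:

```python
import numpy as np

def lars_update(w, grad, global_lr=1.0, trust=0.001):
    # Layer-wise Adaptive Rate Scaling: scale this layer's step
    # by the ratio of weight norm to gradient norm
    local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(grad) + 1e-12)
    return w - global_lr * local_lr * grad

w = np.ones(4)
g = np.full(4, 100.0)   # a huge gradient no longer blows up the weights
w_new = lars_update(w, g)
print(np.linalg.norm(w_new - w))  # step size is ~ trust * ||w||, i.e. 0.002
```

Because the step size is tied to each layer's own weight scale, layers with disproportionately large gradients are tamed automatically, which is what makes very large batch sizes trainable.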

Given what we know, it's conceivable that training can get even faster in the future. What has been blocking people seems to be 1) the number of CPUs within a system and 2) how large a batch size can be loaded. And I bet after FB reads You's paper, there will be another round of improvement as well. How about that? Don't you love competition in deep learning?