
GTC 2019 Write-Up Part 1: Keynotes


Inside every serious deep learning house, there is a cluster of machines. And inside each machine, there is one GPU card. Since Voci is a serious deep learning house, we ended up owning many GPU cards.

* * *

By now, no one would disagree that deep learning has reinvigorated the ASR industry. Back in 2013, Voci was one of the earliest startups to adopt deep learning. It was a time when Hinton's seminal paper [1] was still fresh. Some brave souls at Voci, including Haozheng Li, Edward Lin and John Kominek, decided to just jump into this then-radically new approach. My hybrid role, part researcher and part software maintainer, also started then. We did several other things at Voci, but none of them was as powerful as deep learning.

* * *

But I digress.  Where were we?

Yep, Voci has a lot of GPU cards. At first you might have the impression that a GPU is just a "parallelizable CPU". But in reality, because a GPU is made specifically for high-performance computing applications such as graphics rendering, it has a very different design from a CPU. If you are a C programmer, you can pick up the ideas of Compute Unified Device Architecture (or CUDA, as Nvidia loves acronyms). But the intuition you developed from years of programming CPUs (Intel or Intel-like) would be completely wrong.

We realized all this at Voci. That's why part of our focus is to understand how GPUs work, and that's why my boss, John Kominek, and I decided to travel to Silicon Valley and attend GTC 2019, short for GPU Technology Conference.

This article is Part I of my impressions. It covers the keynote by Jensen Huang, and I will also take a look at the poster session as well as the various booths. I will leave the more technical ideas for the next post.

Huang's Keynotes


We love Jensen Huang! He walks around the stage, enthusiastically explaining to an audience of ten thousand what's new with Nvidia. So let's round up the top announcements.

  1. CUDA-X: More like a convergence among different technologies within Nvidia. CUDA, as we know it now, is more like a programming language, whereas CUDA-X is more of an architecture term within Nvidia, which encompasses various technologies such as RTX, HPC, AI etc.
  2. Mellanox Acquisition: Once you look at it, the strength of Nvidia against its competitors is not just about the GPU cards. Nvidia has also put in place an infrastructure that enables customers to build systems out of many GPU cards. So of course, the first thing you want to think about is how to use multiple machines, each with separate cards. That explains the Mellanox deal (InfiniBand?). It also explains why Huang spent the lion's share of his time talking about the data center: how different containers talk to each other and how that generates traffic. In a way, it is not just about the card, it is about the card and all its peripherals. In fact, it is about the machines and their ecosystem.
  3. The T4 GPU: The way Nvidia markets it, the T4 is suitable for data centers which focus on AI. Current benchmarks say a T4 is slower than a V100 but has higher energy efficiency. So this year's big news on the server side is that AWS has now adopted the T4 in its GPU instances.
  4. Automatic Mixed Precision (AMP): What about news for us techies? Well, the most interesting part is perhaps that AMP is now available for Tensor Cores. So why is precision so important? Well, once you create a production system for either training or inference, the first thing you will realize is that it takes a lot of GPU memory. How do you reduce it? Reducing precision is one way to go. But when you reduce precision, the quality of your task (training or inference) may degrade. So it's a tricky problem. A couple of years ago, researchers figured out a couple of methods. You could implement them yourself, but Nvidia decided to support them on Tensor Cores directly.
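
To make the precision trade-off in item 4 concrete, here is a minimal sketch using only Python's standard library. It is not Nvidia's AMP implementation. It only illustrates two facts: a half-precision value takes half the bytes of a single-precision one, and a very small gradient underflows to zero in half precision unless it is scaled first (loss scaling, one ingredient of mixed-precision training). The helper `to_fp16`, the constant `LOSS_SCALE` and the gradient value are my own assumptions for illustration.

```python
import struct

# Bytes per value: half precision ("e") vs single precision ("f").
BYTES_FP16 = struct.calcsize("<e")
BYTES_FP32 = struct.calcsize("<f")


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumed constant scale for illustration; real systems often tune it dynamically.
LOSS_SCALE = 2.0 ** 16

# A made-up gradient magnitude that is too small for half precision.
tiny_grad = 1e-8

# Stored directly in half precision, the gradient underflows to zero.
naive = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor, so the value
# survives in half precision. We unscale in higher precision afterwards.
scaled = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(BYTES_FP16, BYTES_FP32)
print(naive, recovered)
```

The memory saving is the easy part. The underflow is why reducing precision naively can hurt quality, and why having this handled for you is attractive.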

Oh, FYI, keynotes feel like a party. 



In a large conference like GTC, you can learn many interesting aspects of your technology. Unlike a pure academic conference, GTC also has the aspect of being a trade show. So how does it feel? Here are some impressions:

  1. All GPU peripherals: Once you get a GPU card, perhaps the bigger problem is how to install it and make it usable. Do you think it's easy to do so? It should be plug and play, right? Nope. In reality, working with hardware GPU cards is a difficult technical problem. Part of the issue is heat dissipation. If you don't trust me, try to put a few consumer-grade GPU cards into the same box; you can use it as a heater in Boston!
    That's perhaps why so many vendors other than Nvidia try to get into the game of building GPU-based servers. They are probably one third of the booths in the show.
  2. Self-Driving Car/LiDAR: I don't envy my colleagues in the SDC industry. When will we actually see Level 4 self-driving? Anyway, people do want to see SDC in the near future. That's why you see all the SDC vendors show up at the conference.
  3. The Ecosystem: Finally, you also see demonstrations of various clouds which use GPUs.


Finally, here is a picture of donuts: there are more than 100 vendors showcasing their AI products. If you go to look at all the booths, you are going to get very hungry.

Wait for Part II!!!

Arthur Chan


[1] The paper was actually jointly written by researchers from Google, IBM and Microsoft back then. Notice that these researchers were from separate (rival) groups, and they seldom wrote joint papers, let alone one with ground-breaking results.

Review of Ng's Course 2: Improving Deep Neural Networks

(My Reviews on Course 2 and Course 3.)

In your life, there are times you think you know something, yet genuine understanding seems to elude you. It's always frustrating, isn't it? For example, why can seemingly simple concepts such as gradients or regularization still throw us off, when we have been learning them since Day 1 of our machine learning journey?

In programming, there's a term called "grok". Grokking something usually means that you not only know the term, but also have an intuitive understanding of the concept. Or as in "Zen and the Art of Motorcycle Maintenance" [1], you just try to dive deep into a concept, as if it is a journey... For example, if you really think about speech recognition, you will realize the frame independence assumption [2] is very important, because it simplifies the problem in both search and parameter estimation. Yet it certainly introduces a modeling error. These small things, which are not mentioned in classes or lectures, are the things you need to grok.
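
For readers who have not met the frame independence assumption, here is how it is usually written for an HMM-based recognizer. The symbols are my own choice for illustration: $O = (o_1, \dots, o_T)$ is the sequence of acoustic frames and $Q = (q_1, \dots, q_T)$ is the hidden state sequence.

```latex
P(O \mid Q) = \prod_{t=1}^{T} P(o_t \mid q_t)
```

Because the likelihood factorizes over frames, its logarithm is a sum of per-frame terms, which is what lets both search and parameter estimation proceed frame by frame. The modeling error comes from the fact that neighboring frames of real speech are clearly correlated.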

That brings us to Course 2. What are you grokking in this course? After you take Course 1, should you take Course 2? My answer is yes, and here is my reasoning.

Really, What is Gradient Descent?

Gradient descent is a seemingly simple subject - say you want to find the minimum of a convex function, so you follow the gradient downhill and, after many iterations, you eventually hit the minimum. Sounds simple, right?

Of course, you then start to realize that functions are normally not convex, that they are n-dimensional, and that there can be plateaus. Or you follow the gradient, but it happens to point in the wrong direction! So you end up zigzagging when you try to descend. It's a little bit like descending a real mountain, except that you can't really see n-dimensional space!
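
Zigzagging shows up even on a convex function, as long as the bowl is stretched. The sketch below is plain gradient descent on a toy function of my own choosing, f(x, y) = 0.5 * (x^2 + 25 * y^2). The starting point, learning rate and step count are also assumptions picked to make the effect visible, not recommendations.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2), a stretched convex bowl."""
    return x, 25.0 * y


def gradient_descent(x, y, lr=0.07, steps=100):
    """Take `steps` plain gradient steps and return every visited point."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


path = gradient_descent(10.0, 1.0)

# The steep y direction overshoots and flips sign on every step (the zigzag),
# while the shallow x direction creeps slowly towards the minimum at (0, 0).
for x, y in path[:4]:
    print(f"x = {x:.3f}, y = {y:+.3f}")
print("final point:", path[-1])
```

With this learning rate the steep direction keeps bouncing across the valley while the shallow direction crawls, which is the kind of behavior that makes plain gradient descent slow.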

That explains the early difficulty of deep learning development - stochastic gradient descent (SGD) was just too slow for DNNs back in 2000. That resulted in very interesting research on restricted Boltzmann machines (RBMs), which were stacked and used to initialize DNNs. This was a prominent subject of Hinton's NNML after Lecture 8. It is also known as pretraining, which is still used in some recipes in speech recognition as well as financial prediction.

But we are not doing RBM any more! In fact, research on RBMs is not as fervent as it was in 2008. [3] Why? Because people simply understand SGD better and can run it better. Part of it has to do with initialization, e.g. Glorot's and He's initialization. Part of it has to do with how gradient descent is done - Adam is our current best.
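
To give a feel for what Glorot's and He's initialization actually prescribe, here is a small sketch of their normal-distribution variants as I understand them: weights are drawn with zero mean and a variance of 2 / (fan_in + fan_out) for Glorot and 2 / fan_in for He. The function names, layer sizes and random seed are hypothetical and chosen only for illustration.

```python
import math
import random


def glorot_std(fan_in, fan_out):
    """Standard deviation for Glorot (Xavier) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialization (aimed at ReLU layers)."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, std**2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = init_weights(64, 64, he_std(64), rng)

# The empirical second moment should sit close to the target variance 2 / fan_in.
flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)

print(glorot_std(512, 512), he_std(512))
print(empirical_var, 2.0 / 64)
```

The point of both schemes is to keep the scale of activations and gradients roughly stable from layer to layer, so that SGD has something sensible to work with from the first step.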

So how do you learn this stuff? Before Ng's class, I would say knowledge like this was spread out over courses such as cs231n or cs224n. But as I mentioned in my review of Course 1, those are really courses with specific applications in mind. Or you can read Michael Nielsen's Neural Networks and Deep Learning. Of course, Nielsen's work is a book, so it really depends on whether you have the patience to work through the details while reading. (Also see my review of the book.)

Now you don't have to. The one-stop shop is Course 2. It covers the material I just mentioned, such as initialization and gradient descent, as well as deeper concepts such as regularization and batch normalization. That's why I recommend you keep going with Course 2 after you finish Course 1. If you take the class, and are also willing to read Sebastian Ruder's review of SGD or Gabriel Goh's Why Momentum Really Works, you will be much ahead of the game.

As a note, I also like how Andrew breaks down many of the SGD algorithms as smoothing algorithms. That was a new insight for me, even after having used SGD many times.
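
The smoothing view can be sketched in a few lines. The first function is a bias-corrected exponentially weighted average. The second is a single Adam update, which keeps one such average of the gradient and another of the squared gradient. The hyperparameter values are common defaults, but treat them as assumptions here, and the zigzagging gradient sequence is made up for illustration.

```python
import math


def ewma(values, beta=0.9):
    """Bias-corrected exponentially weighted average of a sequence."""
    smoothed, v = [], 0.0
    for t, g in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * g
        smoothed.append(v / (1.0 - beta ** t))  # undo the bias towards zero
    return smoothed


def adam_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g        # smoothed gradient
    v = beta2 * v + (1.0 - beta2) * g * g    # smoothed squared gradient
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# A made-up gradient sequence: a steady component of 1.0 plus a +/-5.0 zigzag.
grads = [1.0 + 5.0 * (-1) ** t for t in range(1, 51)]
smooth = ewma(grads)
print(grads[-2:], smooth[-2:])

# The very first Adam step has size close to lr, whatever the gradient scale.
w1, m1, v1 = adam_step(w=0.0, g=4.0, m=0.0, v=0.0, t=1)
print(w1)
```

The raw gradients jump between -4 and 6, while the smoothed ones settle near the steady component of 1.0. That is the sense in which momentum-style methods filter out the zigzag and keep the consistent direction.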

Is it hard?

Nope. As far as the math goes, Course 1 is probably the toughest. Of course, even in Course 1, you will finish the coursework faster if you don't overthink the problems. Most notebooks have the derived results for you. On the other hand, if you do want to derive the formulae, you need decent skill in matrix calculus.

Is it Necessary to Understand These Details?; Also Top-Down vs Bottom-Up learning, which is Better?

A legitimate question here is: in the current state of deep learning, where we have so many toolkits which already implement techniques such as Adam, do I really need to dig so deep?

I do think there are always two views of learning. One is top-down, which in deep learning perhaps means reading a bunch of papers, learning the concepts and seeing if you can wrap your head around them. The class is one example. And 95% of the current AI enthusiasts are following such paths.

What's the problem with the top-down approach? Let me go back to my first paragraph: do you really grok something when you learn it top-down? I frequently can't. In my work life, I have also heard senior people say that top-down is the way to go. Yet when I went ahead to check whether they truly understood an implementation, they frequently couldn't give a satisfactory answer. That happens to a lot of senior technical people who later turn more to management. Literally, they lose their touch.

On the other hand, every time I pop up an editor and write an algorithm, I gain tremendous understanding! For example, I was once asked to write a forward inference in C, and you had better know what you are doing when you write in C! In fact, I have come to the opinion these days that you have to implement an algorithm once before you can claim you understand it.
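
To show how little code a forward inference needs, and how many small decisions it still forces on you, here is a minimal sketch in Python rather than C. It is not the code I wrote back then. It assumes a plain feed-forward network with ReLU hidden layers and a softmax output, and the tiny weights are made up for illustration.

```python
import math


def dense(weights, bias, x):
    """Affine layer: one dot product plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, x):
    """Run x through ReLU hidden layers, then a softmax output layer."""
    for weights, bias in layers[:-1]:
        x = relu(dense(weights, bias, x))
    weights, bias = layers[-1]
    return softmax(dense(weights, bias, x))


# Hypothetical 2-2-2 network, each layer given as (weights, bias).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # output layer
]
probs = forward(layers, [2.0, 1.0])
print(probs)
```

Even in this toy version you have to decide on the weight layout, where the nonlinearity goes and how to keep the softmax from overflowing, which are exactly the details a top-down reading tends to skip.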

So how come there are two sides to this, then? One of my speculations is that back in the 80s/90s, students were often taught to get a program right on the first writing. That creates the mindset that you have to think up a perfect program before you start to write one. Of course, in ML such a mindset is highly impractical, because the ML development process is really experimental. You can't always assume you can perfect the settings before you try something.

Another equally dangerous mindset is to say "if you are too focused on the details, then you miss the big picture and won't come up with something new!". I heard this a lot when I first did research, and it's close to the most BS-ty thing I've heard. If you want to come up with something new, the first thing you should learn is all the details of existing work. The so-called "big picture" and "details" are always interconnected. That's why in the AIDL forum, we never see young kids who say "Oh I have this brand new idea, which is completely different from all previous works!" go anywhere. That's because you always learn how to walk before you run. And knowing the details has no downsides.

Perhaps this is my long-winded reasoning for why Ng's class is useful for me, even after I have read a lot of literature. I distrust people who only talk about theory but don't show any implementation.


This concludes my review of Course 2. Many people, after taking Course 1, just decide to take Course 2. I don't blame them, but you always want to ask if your time is well spent.

To me though, taking Course 2 is not just about understanding more about deep learning. It is also my hope to grok some of the seemingly simple concepts in the field. I hope that my review is useful, and I will keep you all posted when my review of Course 3 is done.


[1] As Pirsig said - it's really not about motorcycle maintenance.

[2] Strictly speaking, it is the conditional frame independence assumption. But practitioners in ASR frequently just call it the frame independence assumption.

[3] Also see HODL's interview with Ruslan Salakhutdinov. His is a first-hand account of the rise and fall of RBM.

Links of My Reviews

Since I started to re-learn machine learning, I have written several review articles on various classes, books and resources. Here is a collection of links:

For the Not-So-Uninitiated: Review of Ng's Coursera Machine Learning Class

One Algorithm to rule them all - Reading "The Master Algorithm"

Radev's Coursera Introduction to Natural Language Processing - A Review

Learning Deep Learning - My Top-Five List

Learning Machine Learning - Some Personal Experience

A Review on Hinton's Coursera "Neural Networks and Machine Learning"

Reading Michael Nielsen's "Neural Networks and Deep Learning"