A Read on "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning"

(First published at AIDL-LD and AIDL Weekly.)

This is a note on the CheXNet paper. As you know, it is the widely circulated paper from Stanford which purportedly outperforms humans on chest X-ray diagnosis.

* BUT, after I read it in detail, my impression is slightly different from what you get by just reading the popular news, including the description on github.

* The ML part is not very interesting, so I will just go through it briefly: it's a 121-layer DenseNet, which basically means each layer has feed-forward connections from all previous layers. Given the data size, it's likely a full training.

* There was not much justification of the choice of architecture. My guess: the team first tried transfer learning, but decided to move on to full training to get better performance. DenseNet would be a manageable setup for that.

* Then there was a fairly standard experimental comparison using AUC. In a nutshell, CheXNet did perform better than the radiologists on pneumonia detection, and it beat previously published results on every one of the 14 classes of ChestX-ray14, which was the largest publicly available dataset of its kind at the time.
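
If you have not met AUC before, here is a minimal sketch of how it can be computed from labels and scores. This only illustrates the metric; it is not the paper's evaluation code, and the function name `auc` and the toy numbers are my own assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise ranking.

    AUC equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example
    (ties count as half a win).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = pneumonia, 0 = no pneumonia.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.3, 0.6, 0.7, 0.8]
toy_auc = auc(toy_labels, toy_scores)  # 5 of the 6 positive/negative pairs are ranked correctly
```

An AUC of 0.5 means the scores rank no better than chance, and 1.0 means every positive is ranked above every negative.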

* Now here are the caveats the popular news hadn't mentioned:
1, First of all, the humans weren't allowed to access the previous medical records of a patient.
2, Only frontal images were shown to the human doctors. But prior work did show that some diagnoses need the lateral view as well.

* That's why on p.3 of the article, the authors note:
"We thus expect that this setup provides a conservative estimate of human radiologist performance."

which should make you realize that maybe it will still take a while for deep learning to "replace radiologists".

A Read on "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm"

(First published at AIDL Weekly and AIDL-LD.)

Kade Gibson already posted this paper and gave a good summary. I want to analyze it in more detail, so I started a separate thread.

* As you know from the story, AlphaZero is no longer just playing Go; it is now playing Chess and Shogi too. By itself this is a significant event, because most state-of-the-art board game engines are specific to one game. General game-playing engines are seen as novelties rather than the norm.

* Another note: most Chess and Shogi engines are based on alpha-beta search. AlphaZero instead uses Monte-Carlo Tree Search, which simulates board positions. Positions are ordered by scores from a board NN, and states are entered according to their visit counts and the value of the board according to the NN. So you can see this is not just AlphaZero beating up more games; it is more of a paradigm shift for both the computer Chess and Shogi communities.

* As you know, AlphaZero beat Stockfish, the strongest program of 2016. But one analysis caught my eye: in chess, the DeepMind researchers also fixed the first few moves of AlphaZero so that it followed the 12 most-played openings for black and white. If you are into chess, these include the Queen's Gambit, several Sicilian Defences, the French and the KID. They show that AlphaZero can beat Stockfish in multiple types of situations, and that the opening doesn't matter too much.

* But then, would AlphaZero beat all computer players, such as Shredder or Komodo? No one knows the answer yet.

* One more thing: AlphaZero doesn't assume zero knowledge either. As Denny Britz points out in his tweet, AlphaZero was provided with perfect knowledge of the rules. So intriguing rules such as castling, threefold repetition or the 50-move draw rule are all provided to the machine. Perhaps, as Britz points out, we may want to focus in the future on how to let the machine figure out the rules itself.

That's what I have. Hope you enjoy it.

A Read on "Deep Neural Networks for Acoustic Modeling in Speech Recognition" by Hinton et al.

* This is the now-classic paper in deep learning, in which people confirmed for the first time that deep learning can improve ASR significantly. It is important in the fields of both deep learning and ASR. It's also one of the first papers I read on deep learning, back in 2012-3.

* Many people know the origin of deep learning from image recognition, e.g. many kids would tell you stories about ImageNet, AlexNet and the history from then on. But the first important application of deep learning is perhaps speech recognition.

* So what was going on with ASR before deep learning then? For the most part, if you could come up with a technique that cut a state-of-the-art system's WER by 10% relative, your PhD thesis was good. If your technique could consistently beat previous techniques in multiple systems, you usually got a fairly good job in a research institute in the Big 4.

* The only technique I recall giving better than 10% relative improvement is discriminative training. It got ~15% in many domains. That happened back in 2003-2004. In ASR, the term "discriminative training" has very complicated connotations, so I am not going to explain much. This just gives you the context of how powerful deep learning is.

* You might be curious what "relative improvement" is. E.g. suppose your original WER is 18% and you improve it to 17%; then your relative improvement is 1%/18% = 5.56%. So a 10% relative improvement really means you go down to 16.2%. (Yes, ASR is that tough.)
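
The relative-improvement arithmetic can be written out as a tiny sketch. The function names are my own, and the numbers are the ones from the example in the text.

```python
def relative_improvement(old_wer, new_wer):
    """Relative WER reduction, as a fraction of the old WER."""
    return (old_wer - new_wer) / old_wer


def wer_after_relative_gain(old_wer, relative_gain):
    """WER you end up with after a given relative reduction."""
    return old_wer * (1.0 - relative_gain)


# 18% -> 17% absolute is only about a 5.56% relative improvement.
gain = relative_improvement(18.0, 17.0)

# A 10% relative improvement on an 18% WER lands at about 16.2%.
target = wer_after_relative_gain(18.0, 0.10)
```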

* So here comes replacing the GMM with a DNN. These days, it sounds like a no-brainer. But back then, it was a huge deal. Many people in the past tried to stuff in various ML techniques to replace the GMM, but no one could successfully beat the GMM-HMM. So this is innovative.

* Then there is how the system is set up. The ancestor of this work traces back to Bourlard and Morgan's "Connectionist Speech Recognition", in which the authors tried to come up with a context-independent (CI) HMM system by replacing VQ scores with a shallow neural network. At that time, the units were chosen to be CI states.

* Hinton's and perhaps Deng's thinking is interesting: the units were chosen to be context-dependent states. Now that's a new change, and it reflects how modern HMM systems are trained.

* Then there is how the network is really trained. Here you can see the early DLers' stress on pre-training, because training was very expensive at that point. (I suspect it wasn't using GPUs.)

* Then there is the use of entropy to train the model. Later on, in other systems, many people just used a sentence-based entropy to do the training. So in this sense, the paper is dated.

* None of this is trivial work. But the result is stellar: we are talking about 18%-33% relative gain (p.14). To ASR people, that's unreal.

* The paper also foresees some future uses of DNNs, such as bottleneck features and articulatory features. You probably know the former already. The latter is more esoteric in ASR, so I am not going to talk about it much.

Anyway, that's what I have. Enjoy the reading!

A Read on "Regularized Evolution for Image Classifier Architecture Search"

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on "Regularized Evolution for Image Classifier Architecture Search", which is the paper version of AmoebaNet, the latest result in AutoML. (Or see this page: https://research.googleblog.com/…/using-evolutionary-automl…)

* If you recall, Google already has several past results on how to use RL and evolution strategy (ES) to discover model architectures, e.g. NASNet is one of the examples.

* So what's new? The key idea is the so-called regularized evolution strategy. What does it mean?

* Basically it is a tweak of the more standard tournament selection, commonly used as the means of selecting individuals out of a population. (https://en.wikipedia.org/wiki/Tournament_selection)

* Tournament selection is not too difficult to describe:
- Choose random individuals from the population.
- Choose the best candidate according to certain optimizing criterion.

You can also use a probabilistic scheme to decide whether to use the second or third best candidate. You might also think of it as throwing away the worst N candidates.

* The AutoML paper calls this original method, by Miller and Goldberg (1995), the non-regularized evolution method.

* What is "regularized" then? Instead of throwing away the worst N candidates, the authors propose to throw away the oldest-trained candidate.
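
To make the difference concrete, here is a minimal sketch of tournament selection with both removal rules. It is a toy under my own assumptions, not the paper's code: the names `evolve`, `toy_fitness` and `toy_mutate` are hypothetical, the "architecture" is just a tuple of integers, and the noisy fitness stands in for "train the model and measure its accuracy". In the non-regularized branch I remove the worst member of the sampled group, which is one common way to implement it.

```python
import collections
import random


def evolve(fitness, mutate, initial, cycles, sample_size, regularized=True, seed=0):
    rng = random.Random(seed)
    # Each entry is (architecture, fitness). The left end of the deque is the oldest.
    population = collections.deque((arch, fitness(arch, rng)) for arch in initial)
    history = list(population)
    for _ in range(cycles):
        # Tournament: draw a random sample and use its best member as the parent.
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda ind: ind[1])
        child_arch = mutate(parent[0], rng)
        child = (child_arch, fitness(child_arch, rng))  # always "trained" from scratch
        population.append(child)
        history.append(child)
        if regularized:
            population.popleft()  # regularized: discard the oldest individual
        else:
            worst = min(sample, key=lambda ind: ind[1])
            population.remove(worst)  # non-regularized: discard the worst of the sample
    return max(history, key=lambda ind: ind[1]), population


def toy_fitness(arch, rng):
    # Closer to a sum of 10 is better, plus some "training noise".
    return -abs(sum(arch) - 10) + rng.gauss(0.0, 0.1)


def toy_mutate(arch, rng):
    i = rng.randrange(len(arch))
    child = list(arch)
    child[i] += rng.choice([-1, 1])
    return tuple(child)


start = [(0, 0, 0)] * 10
best, final_population = evolve(toy_fitness, toy_mutate, start, cycles=200, sample_size=3)
```

Notice that with `regularized=True` a lucky high score cannot keep an individual alive forever; an architecture only persists if its descendants keep scoring well after being re-evaluated.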

* Now you won't see a justification of why this method is better until the "Discussion" section. Okay, let's go with the authors' intended flow. As it turns out, the regularized method is better than the non-regularized method. E.g. on CIFAR-10, the evolved model is ~10% relatively better than either the man-made models or NASNet. On ImageNet, it performs better than Squeeze-and-Excite Net as well as NASNet. (Squeeze-and-Excite Net is the ILSVRC 2017 winner.)

* One technicality when you read the paper is the G-X datasets: they are actually the gray-scale versions of the normal X data, e.g. G-CIFAR-10 is the gray-scale version of CIFAR-10. The authors' intention is probably twofold: 1) to scale the problem down, 2) to avoid overfitting to only the standard test sets of the problems.

* Now, these are all great. But how come the "regularized" approach is better? How would the authors explain it?

* I don't want to come up with a hypothesis. So let me just quote the last paragraph here: "Under regularized evolution, all models have a short lifespan. Yet, populations improve over longer timescales (Figures 1d, 2c,d, 3a–c). This requires that its surviving lineages remain good through the generations. This, in turn, demands that the inherited architectures retrain well (since we always train from scratch, the weights are not heritable). On the other hand, non-regularized tournament selection allows models to live infinitely long, so a population can improve simply by accumulating high-accuracy models. Unfortunately, these models may have reached their high accuracy by luck during the noisy training process. In summary, only the regularized form requires that the architectures remain good after they are retrained."

* And also: "Whether this mechanism is responsible for the observed superiority of regularization is conjecture. We leave its verification to future work."

A Read on "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

(First appeared in AIDL-LD and AIDL Weekly.)

This is a read on the paper "A Neural Attention Model for Abstractive Sentence Summarization" by A.M. Rush, Sumit Chopra and Jason Weston.

* Video: https://vimeo.com/159993537 . Github: https://github.com/facebookarchive/NAMAS

* The paper was written in 2015, and is more of a classic paper on NN-based summarization. It was published slightly later than the classic papers on NN-based translation, such as those written by Cho or Bahdanau. We assume you have some basic understanding of NN-based translation and attention.

* If you haven't worked on summarization, you can broadly think of the techniques as either extractive or abstractive. Given the text you want to summarize, "extractive" means you just use the words from the input text, whereas "abstractive" means you can use any words you like, even words which are not in the input text.

* So this is why summarization is seen as a similar problem to translation: you just think of it as a "translation" from the original text to the summary.

* Section 2 is a fairly nice mathematical background of summarization. One thing to note: the video also brings up the noisy-channel formulation. But as Rush said, their paper does away with the noisy channel completely and does a direct mapping instead.

* The next nuance you want to look at is the type of LM and the encoder used. That can all be found in Section 3. E.g. it uses the feed-forward NNLM proposed by Bengio. Rush mentioned that he tried RNNLM, but at that time he only got a small gain. It feels like he could probably get better results if RNNLM were used.

* Then there is the type of encoder. There is a nice comparison between the bag-of-words and attention models. Since there are word embeddings, the "bag-of-words" is actually all the input words embedded down to a certain size. The attention model, on the other hand, is what we know today: it contains a weight matrix P which maps the context to the input.

* Here is an insightful note from Rush: "Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, P, between the input and the summary."
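
Rush's remark can be illustrated with a small sketch that puts the two encoders side by side. This is a simplification under my own assumptions, not the NAMAS code: the vectors are toy numbers, the function names are hypothetical, and I leave out details such as the local smoothing of the input embeddings. The only difference between the two encoders is where the weights over the input words come from.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, weights):
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]


def bag_of_words_encoder(input_vectors):
    # Uniform distribution over the input word embeddings.
    n = len(input_vectors)
    weights = [1.0 / n] * n
    return weighted_sum(input_vectors, weights), weights


def attention_encoder(input_vectors, context_vector, P):
    # Learned soft alignment: score_i = x_i . (P y_c), then normalize with softmax.
    projected = [dot(row, context_vector) for row in P]
    scores = [dot(x, projected) for x in input_vectors]
    weights = softmax(scores)
    return weighted_sum(input_vectors, weights), weights


# Toy embeddings for three input words and one summary context (made-up numbers).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [0.5, -0.5]
P = [[2.0, 0.0], [0.0, 2.0]]

bow_vec, bow_w = bag_of_words_encoder(inputs)
att_vec, att_w = attention_encoder(inputs, context, P)
```

If P is all zeros, every score is equal and the attention weights fall back to the uniform bag-of-words weights, which is exactly the sense in which attention "replaces the uniform distribution".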

* Section 4 is more on decoding. In Section 2, a Markov assumption was made, and this simplifies the decoding quite a lot. The authors use beam search, so you can use tricks such as path combination.

* Another cute thing is that the authors also come up with a method to make the summarization more extractive. For that, they use a log-linear model to also weigh features such as unigrams to trigrams. See Section 5.

* Why would the authors want to make the summarization more extractive? That probably has to do with the metric. ROUGE usually favors words which are extracted from the input text.

* We will stop at this point. Here are several interesting commentaries about the paper.

mathsyouth's: https://github.com/mathsyouth/awesome-text-summarization
Denny Britz's: https://github.com/…/neural-attention-model-for-abstractive…


Some Notes on Building a DL-Machine (Installing CUDA 8.0 on an Ubuntu 14.04)

I mostly just followed this link by Slav Ivanov. The build is for a friend, so nothing fancy.  The only difference is that my friend got a Titan X instead of a 1080, and he requires Ubuntu 14.04.

As a rule of Linux installation, you can't always follow the instructions as if they were cast in stone.   So what did I do differently?  Here is what I did:

  1. sudo apt-get update
  2. sudo apt-get --assume-yes install tmux build-essential gcc g++ make binutils
    sudo apt-get --assume-yes install software-properties-common
    sudo apt-get --assume-yes install git

    (Notice that unlike Slavv, I didn't do an upgrade, because an upgrade seems to easily screw up the CUDA 8.0 installation later on.)

  3.  So this is a major difference: Slavv suggested installing CUDA directly. No, no, no.  What you should do is to make sure the driver of your graphics card is installed first.  And Ubuntu/Nvidia has good support for it.  Following this thread, I found that installing the Titan X requires updating the driver to nvidia-367.  So I just did an apt-get install nvidia-367.
  4. At this point, if you reboot, you will notice that 14.04 recognizes the display card. Usually what that means is the display is in the right resolution.   (If the driver is not installed properly, you will find a display with oversized icons, etc.)
  5. So now you can test your setting by typing nvidia-smi.  Normally a screen would look like this one.  If you are running within a GUI, there should be at least one process running on the GPU.
  6. All good: you now have the driver of the display card, so now you can really follow Slavv's procedure:
    wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
    sudo apt-get update
    sudo apt-get install cuda-toolkit-8.0
  7. This is the point where I stopped, and I left it to my friend to install more software.   Usually, installing the display card driver and CUDA are the toughest steps in a Linux software build.  So the rest should be quite smooth.

Arthur Chan

Acknowledgement: Thanks for the great post by Slav Ivanov!

Quick Impression on Waikit Lau's Crypto and Blockchain forum at MIT

* Hosted by Coach Wei. Our own Waikit Lau presented on blockchain, cryptocurrency and ICOs. Brian Subiranna and I were invited as guest panelists at the Q&A forum. Brian is planning to create a class on blockchain at MIT.

* About the forum: Coach Wei launched the MIT-Tsinghua Summit on Dec 30 last year, and this talk is part of the series: https://www.linkedin.com/…/mit-tsinghua-innovation-summit-…/

* And Waikit, as you might know, has been a successful serial entrepreneur and angel, and has been involved in several ICOs. He also co-admins the AIDL forum.

* In my view, Waikit gave a great presentation on the excitement of blockchain and cryptos. His ~50 min presentation has a couple of gists:
- The rise of protocol coins such as Ethereum.
- The potential business opportunity is comparable to the development of HTTP. Waikit uses the metaphor that blockchain can be seen as TCP/IP, whereas applications built on top of blockchain can be thought of as HTTP.
- The current ambiguity of how ICOs should be regulated. Or more generally: should cryptos be seen as a commodity or a security?

* Gems from the Q&A session. The crowd had many sharp questions for the panel. Here is a summary:

Q. Is there any value in blockchain without cryptocurrency?
A. Generally yes, according to the panel. E.g. most chains can exist without the idea of mining. Mining probably makes more sense when parties within the network don't trust each other.

Q. What is the current state of decentralized exchanges?
A. Panel: Still under development. A lot of things need to happen to motivate a large one.

Q. Would quantum computing be a threat to blockchain?
A. Panel (Arthur): It could be, yet current quantum computing still has several technical roadblocks to solve before it is usable for applications, e.g. creating stabilized inputs for the QC. There are also counter-technologies such as quantum cryptography being developed. So we can't quite say QC would just kill blockchain even if it is developed.

Q. Should a chain be controlled by one or multiple parties?
A. Yet another issue which is hard to predict. Looking at the development of Ethereum and Bitcoin, having a benevolent dictator seems to make the Ethereum community more unified. But the fact that Bitcoin's community is segmented, so that the buyers/users now have more say, might motivate adoption of speed-up algorithms.

Q. Would speeding up mining speed up transactions?
A. Unlikely. What would likely improve transaction speed are technologies such as Lightning.

That's what I have.  You can find the original thread at



Quick Impression of Course 5 of deeplearning.ai

Hey Hey! As you know, Course 5 is just out. As always, I checked out the class to give you a quick impression of what it is about. So far, this new 3-week class looks very exciting. Here is my takeaway. Remember, I haven't started the class yet. But this is likely to give you a sense of the scope and extent of the class.

* Course 5 is mostly focused on sequence models. That includes the more mysterious models such as RNN, GRU and LSTM. You will go through standard ideas such as vanishing gradients, which were actually first discovered in RNNs, then go through GRU and LSTM afterward.

* The sequence of coverage is nice. Covering GRU first and then LSTM doesn't quite follow the historical order (Hochreiter & Schmidhuber first came up with LSTM in '97, Cho had the idea of GRU in 2014). But such an order makes more sense for pedagogical purposes. I don't want to spoil it, but this is also how Socher approaches the subject in cs224n. That makes me believe this is likely a course which would teach you RNN well.

* Week 1 will be all about RNN, then Weeks 2 and 3 will be about word vectors and end-to-end structures. Would one week be enough for each topic? Not at all. But Andrew seems to give all the essentials of each topic: word2vec/GloVe for word vectors, and the standard encoder-decoder structure for end-to-end schemes. Most examples are based on machine translation, which I think is appropriate. Image captioning or speech recognition are other possible applications, but they usually have details which are tough to cover in a first class.

* Would this class be everything you need for NLP? Very unlikely. You still need to take cs224n, or even the Oxford class, to get good. But just as Course 4 is a good intro to computer vision, this will be a good intro to NLP, and in general to any topic which requires sequence modeling, such as speech recognition, stock analysis or DNA sequence analysis.
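
If vanishing gradients are new to you, here is a minimal sketch of the effect on a one-unit RNN. This is my own toy setup, not material from the course: the recurrence, the weight 0.5 and the constant inputs are all assumptions picked to make the effect easy to see.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


# The same recurrence unrolled over 5 steps and over 50 steps.
short_range = gradient_through_time(0.5, [0.1] * 5)
long_range = gradient_through_time(0.5, [0.1] * 50)
```

Every step multiplies the gradient by a factor smaller than 0.5 here, so the signal from 50 steps back is practically zero. Gated units such as GRU and LSTM were designed to ease this problem.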

The course link can be found at https://www.coursera.org/learn/nlp-sequence-models

Hope this "Quick Impression" helps you!


Review of Ng's deeplearning.ai Course 4: Convolutional Neural Networks

(You can find my reviews on previous courses here: Course 1, Course 2 and Course 3.)

Time flies. I finished Course 4 around a month ago and finally have a chance to write a full review. Course 4 is different from the first three deeplearning.ai courses, which focused on fundamental understanding of deep learning topics such as back propagation (Course 1), tuning hyperparameters (Course 2) and deciding which improvement strategy is the best (Course 3). Course 4 is more about an important application of deep learning: computer vision.

Focusing on computer vision subjects Course 4 to a distinct challenge as a course: how does Course 4 measure up against other existing computer vision classes? Would it be comparable with the greats such as Stanford cs231n? To answer these questions, I will compare Course 4 and cs231n in this article. My goal is to answer how you would choose between the two classes in your learning process.

Convolutional Neural Network In the Context of Deep Learning

Convolutional neural networks (CNNs) have a very special place in deep learning. For the most part, you can think of a CNN as an interesting special case of a vanilla feed-forward network with tied parameters. Computationally, you can parallelize it much better than techniques such as recurrent neural networks. Of course, it is prominent in image classification (since LeNet-5). But it is also frequently used in sequence modeling such as speech recognition and text classification (check out cs224n for details). More importantly, I guess, image classification is also used as a template of development in many other newer applications. That makes learning CNN sort of mandatory for students of deep learning.

Learning Deep-Learning-based Computer Vision before deeplearning.ai

Interestingly enough, there is a rather standard option for learning deep-learning-based computer vision online. Yes! You guessed it right! It is cs231n, which used to be taught by the then Stanford PhD candidate Andrej Karpathy in 2015/16. [1] To recap, cs231n is not only a good class for computer vision, it is also a good class for learning the basics of deep learning. Also, as the now famous Dr. Karpathy said, it probably has one of the best explanations of back-propagation. My only criticism of the class (as I mentioned in earlier reviews) is that as a first class, it is too focused on image recognition. But as a first class of deep-learning-based computer vision, I think it was the best.

Course 4: Convolutional Neural Networks Briefly

Would Course 4 change my opinion about cs231n then? I guess we should look at it in perspective. Comparing Course 4 with cs231n is comparing apples and oranges. Course 4 is a month-long class which is suitable for absolute beginners. If you look into it, Course 4 is basically a quick introductory class. Week 1 focuses on what CNN is; Weeks 2 and 3 talk about two prominent applications, image classification and image detection; whereas Week 4 is about fun stuff such as face verification and neural style transfer.

Many people I know finished the class within 3 days of its start. cs231n, on the other hand, is a semester-long course which contains ~18 hours of video to watch, with more substantial (and difficult) homework problems. It is more suitable for people who already have at least one or two full machine learning courses under their belt.

So my take is that Course 4 can be a good first class of deep-learning-based computer vision, but it is not a replacement for cs231n. If you only take Course 4, you will find that there is still a lot in computer vision you don't grok. My advice is that you should then audit cs231n afterward, or else your understanding will still have holes.

What if I already took cs231n? Would Course 4 still help me?

Absolutely. While Course 4 is much shorter, remember that a lot of deep learning concepts are obscure. It doesn't hurt to learn the same thing in different ways. And Course 4 offers different perspectives on several topics:

  • For starters, Course 4, just like all other deeplearning.ai courses, has homework which requires code verification at every step.  As I argued in an earlier review, that's a huge plus for learning.
  • Then there is the treatment of individual topics.  I found Ng's treatment of image detection refreshing: the more conventional view (which cs231n took) was to start from RCNN and its two faster variants, then bring up YOLO.   But Andrew just decided to go with YOLO instead.   Notice that neither of the classes gave a detailed description of the algorithm.  (Reading the paper is probably the best.)  But YOLO is indeed more practical than the RCNN variants.
  • In Week 4, applications such as face verification and Siamese networks were actually new to me.   Andrew also gives a very nice explanation of why neural style transfer really works.
  • As always, even a new note on old topics matters.  E.g. this is the first time I became aware that the convolution in deep learning is different from the convolution in signal processing. (See Week 1.)   I also found Andrew's notes on various image classification papers to be gems.  Even if you have read those papers, I do suggest you listen to him again.
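
On the last point about convolution, a small 1-D sketch shows the difference. The function names and numbers are my own and are not taken from the course notebooks. What deep learning libraries usually call "convolution" slides the kernel as-is (cross-correlation), while signal processing flips the kernel first.

```python
def cross_correlate(signal, kernel):
    """What most deep learning frameworks compute under the name 'convolution'."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def convolve(signal, kernel):
    """Signal-processing convolution ('valid' part): flip the kernel, then slide."""
    return cross_correlate(signal, kernel[::-1])


signal = [1, 2, 3, 4]
kernel = [1, 0, -1]

dl_style = cross_correlate(signal, kernel)  # [-2, -2]
dsp_style = convolve(signal, kernel)        # [2, 2]
```

For a symmetric kernel the two agree, and since a CNN learns its kernels anyway, the flip makes no practical difference to what the network can represent.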


Since I admin an unofficial forum for the course, I learned that there are fairly obvious problems with the course. For example, back in December when I took the course, there was one homework for which you needed to submit an algorithm which wouldn't match the notebook. There was also a period of time when submission was very slow, and I needed to fix the file downloading to straighten it out. I do think those are frustrating issues. Hopefully, by the time you read this article, the staff has already fixed them. [2]

To be fair, even the great NNML by Hinton has glitches here and there in its homework. So I am not entirely surprised that glitches happen in deeplearning.ai. Of course, I would still highly recommend the class.


There you have it - I reviewed Course 4 of deeplearning.ai. Unlike earlier parts of the specialization, Course 4 has a very obvious competitor: cs231n. And I don't quite see Course 4 as the one course you can take to master computer vision. My belief is you need to go through both Course 4 and cs231n to have a reasonable understanding.

But as a first class of DL-based computer vision, I still think Course 4 has tremendous value. So once again, I highly recommend you all take the class.

As a final note, I was able to catch up on reviews for all classes in deeplearning.ai. Now all eyes are on Course 5, which currently (as of Jan 23) is set to launch on Jan 31. Before that, do check out our forums AIDL and Coursera deeplearning.ai for more discussion!

Arthur Chan

First published at http://thegrandjanitor.com/2018/01/24/review-of-ngs-deeplearning-ai-course-4-convolutional-neural-networks/

If you like this message, subscribe to the Grand Janitor Blog's RSS feed. You can also find me (Arthur) at Twitter, LinkedIn, Plus and Clarity.fm. Together with Waikit Lau, I maintain the Deep Learning Facebook forum.  Also check out my awesome employer: Voci.


[1] Funny enough, while I went through all cs231n 2016 videos a while ago, I never wrote a review about the course.

[2] As a side note, I think it has to do with Andrew and the staff probably rushing to create the class. That's why I was actually relieved when I learned that Course 5 would be released in January. Hopefully this gives more time for the staff to perfect the class.



Some Notes on OpenMined.org

Yesterday I watched the introduction video of OpenMined by Andrew Liam Trask. Wow, it's so interesting. Some notes:

* First of all, unlike similar projects, there is no ICO. It's a genuine open source project with a vibrant community.

* The idea is quite easy to explain. It's a marketplace between data scientists and miners who want to provide data. The key question Trask tries to answer is how to protect the privacy of the users' data while at the same time allowing data scientists to train models securely.

* At first I thought it was just federated learning combined with deep learning. But Trask has augmented the idea with homomorphic encryption and smart contracts. I am still in the process of learning the why of the last two concepts, but briefly, homomorphic encryption allows the model to be sent securely to miners without getting stolen, whereas having smart contracts would genuinely allow an open marketplace.

* What if a miner tries to come up with fake data? This is actually an FAQ on OpenMined.org. As it turns out, a data scientist can also specify a test set in the smart contract. This ensures that data uploaded by a miner improves the model.

* Another question I asked their Slack community is how everyone is paid without a coin. Currently the project relies on USD, ETH and BTC. Fair enough.