We don't want to spoil a joke, but Commit Strip did a great job of telling you what most "AI" companies are really selling these days.
CNTK 2.0 is now at general availability. While not as popular as its peers such as TensorFlow or Theano, it is still a powerful toolkit, and some third-party studies show that it has much better performance (5x-10x) on LSTM training than other toolkits.
2.0 is more a culmination of the several release candidates that preceded it, but the key features seem to be Keras backend support and a Java frontend. Both are interesting: supporting Keras means CNTK 2.0 can easily replace existing engines such as TensorFlow and Theano, while Java support allows CNTK 2.0 to compete with Java frameworks such as Deeplearning4j. In a nutshell, these features make CNTK 2.0 stand out among the dozens, if not hundreds, of deep learning toolkits.
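If you want to try the Keras route, switching backends is mostly a one-line configuration change. A minimal sketch, assuming you have both keras and cntk installed (the toy model is just a placeholder):

```python
# Select the CNTK backend before Keras is imported; you can also set
# the "backend" field in ~/.keras/keras.json instead.
import os
os.environ["KERAS_BACKEND"] = "cntk"

from keras.models import Sequential
from keras.layers import LSTM, Dense

# The same Keras code that ran on TensorFlow or Theano now runs on CNTK.
model = Sequential([
    LSTM(128, input_shape=(50, 32)),   # 50 timesteps, 32 features
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```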
Looking at Edwin Chen's profile, you know that he has gone deep into several fields which require sophisticated mathematics: ASR at MSR, quantitative trading, and ML at Google. One of his many interesting articles was "Winning the Netflix Prize: A Summary," on the techniques used by the winning Netflix Prize team, and I (Arthur) greatly enjoy his writing.
This time Edwin Chen helps us explore the idea of LSTM, which is always non-trivial to understand. How do you approach such a concept? Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" is a modern classic, but it's hard to get much insight from it into what an LSTM actually is.
The article you should probably read first is Chris Olah's "Understanding LSTM Networks," which gives you a nice visualization of how LSTM "evolved" from a vanilla RNN. It also makes the equations in the standard LSTM literature much easier to read.
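For reference, here are the standard LSTM equations in roughly the notation Olah uses, where $x_t$ is the input, $h_t$ the hidden state, $C_t$ the cell state, and $\odot$ elementwise multiplication (this is my own summary, so do check it against the post itself):

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) &&\text{forget gate}\\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) &&\text{input gate}\\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) &&\text{candidate cell state}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{cell update}\\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) &&\text{output gate}\\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$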
Then there is Richard Socher's lecture on LSTM. Socher's approach is to first teach the gated recurrent unit (GRU). He remarks that while GRU was developed later, it has a much more logical structure, whereas it is never obvious why LSTM needs all of its gates in the first place, so they can feel artificial.
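You can see Socher's point by writing the GRU next to the LSTM above: it gets by with only two gates and no separate cell state (again my own summary, in the same notation):

$$
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) &&\text{update gate}\\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) &&\text{reset gate}\\
\tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t])\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$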
So what is the merit of Chen's article, then? He chose not to avoid the complexity of LSTM, but to discover, together with his readers, how the different cell gates behave. That is the gist of the very long series of visualizations he puts in the article.
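If you want to try this kind of gate-watching yourself, the key observation is that every gate activation is just an intermediate value you can log. Here is a single LSTM step in plain NumPy (my own sketch, not Chen's code) that returns the gates alongside the new state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4n, n + m) and b has shape (4n,),
    where n is the hidden size and m the input size. The gate
    activations are returned so they can be plotted over time."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0*n:1*n])   # forget gate
    i = sigmoid(z[1*n:2*n])   # input gate
    o = sigmoid(z[2*n:3*n])   # output gate
    g = np.tanh(z[3*n:4*n])   # candidate cell state
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c, (f, i, o)
```

Run this over a character sequence with trained weights, collect the gate tuples, and you get exactly the kind of heatmaps Chen shows.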
I found his take interesting, and I would recommend his article, along with Karpathy's, Olah's, and Socher's treatments, to anyone who wants to understand LSTM.
About a year ago, I heard that Google could train AlexNet on ImageNet in 1 day. For an amateur like me, that is fairly amazing: my not-so-optimized setup at home would take around 5.5 days. I have around 3-4 ideas on how to speed it up, but 2-3 days is probably my limit. My guess is that in a large company, with other experts' help, I could also bring the time down to around 1 day.
Yes, training a neural network from scratch is a super painful process. Spreading computation across GPUs is already tough, not to mention spreading it across machines. Google was probably the first group to come up with ideas that get parallelization working across multiple machines with multiple GPUs each, yet their techniques are not widespread. For example, Tim Dettmers actually advises beginners to stay away from multi-GPU systems, because making deep learning work on them is still difficult.
This makes Facebook's result quite amazing. The key insight here is that Facebook was able to reduce the number of minibatches per machine by using a large minibatch. But then wouldn't you need to retune the learning rate? They found there is a simple linear relationship between batch size and learning rate:
Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
This is surprising. In fact, according to the paper, Alex Krizhevsky, of AlexNet fame, was the first person to try this rule, but he couldn't quite get it working with a large batch (a 1% absolute accuracy loss when the batch size goes from 128 to 1024). The authors, on the other hand, argue that you can't amplify the learning rate by the full factor of k in the first few epochs; you need to warm up to the amplified rate gradually. That sounds like a breakthrough insight.
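In code, the rule plus warmup is only a few lines. A sketch in Python, using the paper's ResNet-50 baseline of learning rate 0.1 at batch size 256 and a 5-epoch warmup (the paper ramps the rate per iteration; this ramps per epoch for brevity, and the function name is mine):

```python
def scaled_lr(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling rule with gradual warmup: ramp from base_lr up to
    base_lr * (batch_size / base_batch) over the first few epochs."""
    target_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

# With a minibatch of 8192 (32x the baseline), the rate warms up
# from 0.1 toward 3.2 over the first 5 epochs.
for epoch in range(7):
    print(epoch, round(scaled_lr(epoch, batch_size=8192), 3))
```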
Lo and behold, that's what they did: they were able to train a ResNet-50 on ImageNet in one hour using 32 machines, each with 8 GPUs. Their actual implementation has more to it, and I won't spoil it here, but it's definitely one of the papers you want to check out.
DeepMind doesn't seem to have stopped after AlphaGo, and here's a very interesting result from their visual reasoning work. It covers both the relation network (RN) and the visual interaction network (VIN). For me, RN is immensely interesting. Why? For the most part, the DeepMind authors suggest that instead of imposing a function with a fixed scope of objects, you should always model pairwise relationships (the function g in the paper). As the authors suggest, this formulation bakes the relational constraint directly into the model, in a similar spirit to convolutional neural networks in computer vision. Again, this is a surprisingly simple thought, but DeepMind sees good results which surpass human performance.
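Concretely, the whole RN module boils down to one equation from the paper: given a set of objects $O = \{o_1, \dots, o_n\}$,

$$
\mathrm{RN}(O) = f_{\phi}\Big(\sum_{i,j} g_{\theta}(o_i, o_j)\Big),
$$

where $g_\theta$ scores each pair of objects and $f_\phi$ aggregates the summed relations; in the paper both are simple MLPs. Because the same $g_\theta$ is shared across all pairs, the module can't help but learn a relation function, much as weight sharing forces a CNN to learn translation-invariant filters.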