Review of Ng’s deeplearning.ai Course 3: Structuring Machine Learning Projects

What do you actually do as an ML Engineer?

Let me digress a bit: I know many of my readers are young college students who are looking for careers in data science or machine learning. But what do people actually do in the business of machine learning or AI? I think this is a legit question because I was very confused when I first started out.

Well, it really depends on whether you sit closer to the development side or the research side of your team. Terms like “research” and “development” can mean different things depending on the job title. But you can think of “researchers” as the people who try to get a new technique working – usually the criterion is whether it beats the status quo on some measure such as accuracy. “Developers”, on the other hand, are the people who come up with a production implementation. Many ML jobs really fall somewhere on the spectrum between “developer” and “researcher”. For example, I am usually known for my skills as an architect, which usually means I have knowledge of both sides. My usual description of my skills is “50% development and 50% research”. There are also people who are highly specialized on either side. But I will focus more on the research side in this article.

So, What do you actually do as an ML Researcher then?

Now I can see a lot of you jump up and say “OH I WANT TO BE A RESEARCHER!” Yes, because doing research is fun, right? You just need to train some models and beat the baseline and write a paper. BOOM! In fact, if you are good, you just need to ask people to do your research. Woohoo, you are happy and done, right?

Well, in reality, good researchers are usually fairly good coders themselves. Especially in an applied field such as machine learning, my guess is that out of 100 researchers in an institute, there is perhaps 1 person who is really “thinking staff”, i.e. someone who does nothing other than come up with new theories or write proposals. Just like you, I admire the life of a pure academician. But in our time, you usually have to be both very smart and very lucky to be one of them. (There is a long explanation for this, but it is out of the scope of this article.)

“Okay, okay, got it….. so can we start to have some fun now? We just need to do some coding, right?” Not really. The first step, before you can work on fun stuff such as modeling or implementing a new algorithm, is to clean up the data. Say you work on fraudulent transaction detection: the first thing is to load a giant table somewhere so that you can query it and get the training data. Then you want to clean the data and massage it so that it can be fed into an ML engine. Notice that these tasks can be non-trivial by themselves.

Course 3: Structuring Machine Learning Projects

So there you are: after you code and clean up your data, you finally have some time to do machine learning. Notice that the time you have left after all these “chores” is actually quite limited. That makes how to use your time effectively a very important topic. And this is why you want to take Course 3: Andrew teaches you the basics of how to allocate time and resources in a deep learning task. For example, how large should your train/validation/test sets be? When should you stop your development? What is human performance? What if there is a mismatch between your train set and test set? If you are stuck, should you tune your hyperparameters more? Or should you regularize?
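To make the last two questions a bit more concrete, here is a minimal Python sketch of the kind of reasoning involved: compare the gap between human-level error and train error (avoidable bias) with the gap between train error and dev error (variance), and work on the larger one. The function name `diagnose`, the error rates and the lists of suggested remedies are my own illustrative assumptions, not code from the course.

```python
def diagnose(human_error, train_error, dev_error):
    """Return which gap is larger, plus a few commonly suggested remedies."""
    avoidable_bias = train_error - human_error  # gap to human-level error
    variance = dev_error - train_error          # gap between train and dev error
    if avoidable_bias >= variance:
        return "bias", ["train a bigger model", "train longer", "try a better optimizer"]
    return "variance", ["get more data", "add regularization"]


# Made-up error rates, purely for illustration.
problem, suggestions = diagnose(human_error=0.01, train_error=0.08, dev_error=0.10)
print(problem, suggestions)
```

In this made-up case the train error is far from human-level error, so regularizing would probably not be the first thing to try. If the train error were already close to human level, the same comparison would point towards the variance side instead.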

In a way, Course 3 is reminiscent of Week 6 and Week 11 of “Machine Learning”: basically, what you learn is how to make good “meta-decisions” on all the projects you will work on in your lifetime. I also think it’s the right stuff for your ML career.

One final note: as you might have noticed in my last two reviews, I usually try to compare deeplearning.ai with other classes. But Course 3 is quite unique, and you might only find similar material in machine learning courses which focus on theory. Ng’s treatment stands out in two ways. First, what he gives is practical and easy-to-understand advice. Second, his advice is focused on deep learning, even though the underlying principles are similar. Working on deep learning usually implies special circumstances, such as being close to human performance, or simply having low train and test set performance. Those scenarios did appear in the past, but only in cutting-edge ML evaluations which involved the best ML teams. So you don’t normally hear about them in a course, but now Andrew tells you all. Isn’t that worth the price of $49? 🙂

Conclusion

So here you have it. This is my review of Course 3 of deeplearning.ai. Surprisingly even to me, I actually wrote more than I expected for this two-week course. Perhaps the main reason is that I really wish this course had been there, say, 3 years ago. It would have changed the course of some projects I developed.

Maybe it’s too late for me….. but if you are early in deep learning, do recognize the importance of Course 3, or of any advice you hear that is similar to what Course 3 teaches. It will save you much time – not just on one ML task, but on the many ML tasks you will work on in your career.

Arthur Chan

Review of Ng’s deeplearning.ai Course 2: Improving Deep Neural Networks

(My Reviews on Course 2 and Course 3.)

In your life, there are times when you think you know something, yet genuine understanding seems to elude you. It’s always frustrating, isn’t it? For example, why do seemingly simple concepts such as gradients or regularization still throw us off, when we have been learning them since Day 1 of our study of machine learning?

In programming, there’s a term called “grok”. Grokking something usually means that not only do you know the term, but you also have an intuitive understanding of the concept. Or, as in “Zen and the Art of Motorcycle Maintenance” [1], you just try to dive deep into a concept, as if it were a journey…… For example, if you really think about speech recognition, you will realize that the frame independence assumption [2] is very important, because it simplifies the problem in both search and parameter estimation. Yet it certainly introduces a modeling error. These small things, which are not mentioned in classes or lectures, are the things you need to grok.

That brings us to Course 2 of deeplearning.ai. What are you grokking in this Course? After you take Course 1, should you take Course 2? My answer is yes and here is my reasoning.

Really, What is Gradient Descent?

Gradient descent is a seemingly simple subject – say you want to find the minimum of a convex function, so you follow the gradient downhill, and after many iterations you eventually hit the minimum. Sounds simple, right?

Of course, it stops being simple once you realize that functions are normally not convex, that they are n-dimensional, and that there can be plateaus. Or you follow the gradient, but it happens to point in a wrong direction, so you end up zigzagging as you try to descend. It’s a little bit like descending a real mountain, except that you can’t really see an n-dimensional space!
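Here is a minimal Python sketch of the simple convex case, and of how zigzagging can show up even there. The bowl-shaped function, the two learning rates and all the names are my own assumptions for illustration. The bowl is much steeper along y than along x, so a step size that is fine for x overshoots in y.

```python
def gradient_descent(grad, start, learning_rate, num_steps):
    """Take fixed-size steps against the gradient and record every point visited."""
    point = list(start)
    path = [tuple(point)]
    for _ in range(num_steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
        path.append(tuple(point))
    return path


def bowl_grad(point):
    """Gradient of the hypothetical bowl f(x, y) = (x - 3)^2 + 10 * (y + 1)^2."""
    x, y = point
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]


# Small step: a steady walk down to the minimum at (3, -1).
smooth_path = gradient_descent(bowl_grad, [0.0, 0.0], learning_rate=0.05, num_steps=500)
# Larger step: y overshoots the valley floor on every step, i.e. it zigzags.
zigzag_path = gradient_descent(bowl_grad, [0.0, 0.0], learning_rate=0.09, num_steps=500)

print("final point (small step):", smooth_path[-1])
print("first y values (large step):", [round(y, 3) for _, y in zigzag_path[:5]])
```

With the larger step, the y values jump back and forth across -1 while x creeps slowly towards 3. This is still a friendly convex function in two dimensions; real networks are neither.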

That explains the early difficulty of deep learning development – stochastic gradient descent (SGD) was just too slow for DNNs back in 2000. That led to some very interesting research on the restricted Boltzmann machine (RBM), which was stacked and used to initialize a DNN. It is a prominent subject of Hinton’s NNML after Lecture 8. This kind of pretraining is still being used in some recipes in speech recognition as well as financial prediction.

But we are not doing RBM any more! In fact, research in RBM is not as fervent as it was in 2008. [3] Why? It has to do with people simply understanding SGD better and being able to run it better. It has to do with initialization, e.g. Glorot’s and He’s initialization. It also has to do with how gradient descent is done – ADAM is our current best.
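For the initialization part, a rough sketch in plain Python follows, assuming the normal-distribution variants of both schemes: Glorot scales the weights by sqrt(2 / (fan_in + fan_out)), He by sqrt(2 / fan_in). The function names and the layer sizes are made up for illustration; in practice you would use whatever your toolkit provides.

```python
import math
import random


def glorot_std(fan_in, fan_out):
    """Standard deviation for Glorot (Xavier) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialization (aimed at ReLU layers)."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a fan_out x fan_in weight matrix from a zero-mean Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Hypothetical layer with 256 inputs and 128 outputs.
weights = init_weights(256, 128, he_std(256))
print("Glorot std:", glorot_std(256, 128), "He std:", he_std(256))
```

The point in both cases is that the scale of the random weights depends on the layer size, so that signals neither blow up nor die out as they pass through many layers.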

So how do you learn this stuff? Before Ng’s deeplearning.ai class, I would say knowledge like this was spread out over courses such as cs231n or cs224n. But as I mentioned in my review of Course 1, those are really courses with specific applications in mind. Or you can read Michael Nielsen’s Neural Networks and Deep Learning. Of course, Nielsen’s work is a book, so it really depends on whether you have the patience to work through the details while reading. (Also see my review of the book.)

Now you don’t have to. The one-stop shop is Course 2. Course 2 actually covers the material I just mentioned, such as initialization and gradient descent, as well as deeper concepts such as regularization and batch normalization. That is why I recommend you keep on taking the course after you finish Course 1. If you take the class, and are also willing to read Sebastian Ruder’s Review of SGD or Gabriel Goh’s Why Momentum Really Works, you will be well ahead of the game.

As a note, I also like how Andrew breaks down many of the SGD algorithms as smoothing algorithms. That was a new insight for me, even after I had used SGD many times.
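A minimal sketch of that smoothing idea, assuming the exponentially weighted average with bias correction as the building block (momentum and ADAM both keep running averages of this kind over the gradients). The function name and the toy “gradient” sequence are my own, for illustration only.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Smooth a sequence with an exponentially weighted average, bias-corrected."""
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = beta * running + (1.0 - beta) * value
        # Dividing by (1 - beta^step) corrects the bias from starting at zero.
        smoothed.append(running / (1.0 - beta ** step))
    return smoothed


# A toy zigzagging "gradient": it flips sign on every step.
noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(50)]
print([round(v, 3) for v in exponentially_weighted_average(noisy)[:5]])
```

The raw sequence keeps swinging between +1 and -1, while the smoothed one quickly shrinks towards zero. That is roughly why averaging the gradients damps the zigzag direction while still letting a consistent direction through.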

Is it hard?

Nope. As far as math goes, Course 1 is probably the toughest. Of course, even in Course 1, you will finish the coursework faster if you don’t overthink the problems. Most notebooks have the derived results for you. On the other hand, if you do want to derive the formulae, you do need decent skill in matrix calculus.

Is it Necessary to Understand These Details?; Also Top-Down vs Bottom-Up learning, which is Better?

A legitimate question here is: in the current state of deep learning, where we have so many toolkits which already implement techniques such as ADAM, do I really need to dig so deep?

I do think there are always two views of learning. One is top-down, which in deep learning perhaps means reading a bunch of papers, learning the concepts and seeing if you can wrap your head around them. The fast.ai class is one example. And 95% of current AI enthusiasts are following such a path.

What’s the problem with the top-down approach? Let me go back to my first paragraph: do you really grok something when you learn it top-down? I frequently can’t. In my work life, I have also heard senior people say that top-down is the way to go. Yet when I went ahead and checked whether they truly understood an implementation, they frequently couldn’t give a satisfactory answer. That happens to a lot of senior technical people who later move more into management. Literally, they have lost their touch.

On the other hand, every time I pop up an editor and write an algorithm, I gain tremendous understanding! For example, I was once asked to write forward inference in C, and you had better know what you are doing when you write in C! In fact, I have come to the opinion these days that you have to implement an algorithm once before you can claim you understand it.

So how come there are two sides to this then? One of my speculations is that back in the 80s/90s, students were often taught to get a program right the first time they wrote it. That creates the mindset that you have to think up a perfect program before you start to write one. Of course, in ML such a mindset is highly impractical, because the ML development process is really experimental. You can’t always assume you can perfect the settings before you try something.

Another equally dangerous mindset is to say “if you are too focused on details, then you miss the big picture and won’t come up with something new!”. I heard this a lot when I first did research, and it’s close to the most BS-ty thing I’ve heard. If you want to come up with something new, the first thing you should learn is all the details of existing work. The so-called “big picture” and “details” are always interconnected. That’s why in the AIDL forum, we never see the young kids who say “Oh I have this brand new idea, which is completely different from all previous works!” go anywhere. You always learn how to walk before you run. And knowing the details has no downsides.

Perhaps this is my long-winded reasoning for why Ng’s class is useful for me, even after I have read much of the literature. I distrust people who only talk about theory but don’t show any implementation.

Conclusion

This concludes my review of Course 2. Many people, after they take Course 1, just decide to take Course 2. I don’t blame them, but you always want to ask whether your time is well spent.

To me though, taking Course 2 is not just about understanding more about deep learning. It is also my hope to grok some of the seemingly simple concepts in the field. I hope my review is useful, and I will keep you all posted when my review of Course 3 is done.

Arthur

Footnotes:
[1] As Pirsig said – it’s really not about motorcycle maintenance.

[2] Strictly speaking, it is the conditional frame independence assumption. But practitioners in ASR frequently just call it the frame independence assumption.

[3] Also see HODL’s interview with Ruslan Salakhutdinov. His account of the rise and fall of RBM is first-hand.