Category Archives: Debugging

Experience in Real-Life Machine Learning

I have been refreshing myself on the general topic of machine learning, motivated partly by job requirements and partly by my own curiosity.   That's why you saw my review post on Andrew Ng's famed class.   I have also been taking Dragomir Radev's NLP class, as well as the Machine Learning Specialization by Emily Fox and Carlos Guestrin [1].   It's tough to find time to learn while working, but so far I have managed to learn something from each class and apply it in my job.

So, one question you might ask is: how applicable are online, or even university, machine learning courses in real life?     Short answer: the two are quite different.  Let me try to answer this question with an example that came up recently.

The task was gender detection from voice.  It came up at work, and I was asked to improve the company's existing detector.   I spent the majority of my time dividing the data set, which has around 1 million data points, into train/validation/test sets.   Furthermore, from the beginning of the task I decided to create a series of datasets of increasing size: for example, 2k, 5k, 10k, and so on up to 1 million.     This simple exercise, done mostly in Python, took me close to a week.
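The progressive-sizing idea can be sketched in a few lines of Python.  This is only an illustration of the trick, not the actual code I used at work; the function name and sizes are made up for the example:

```python
import random

def progressive_subsets(data, sizes, seed=42):
    """Shuffle once, then take nested subsets of increasing size.

    Because each subset is a prefix of the same shuffled list, the 2k set
    is contained in the 5k set, the 5k set in the 10k set, and so on,
    which keeps experiments at different sizes comparable.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

# Toy demo with 10k points instead of 1 million
subsets = progressive_subsets(range(10000), [2000, 5000, 10000])
print([len(subsets[n]) for n in sorted(subsets)])  # [2000, 5000, 10000]
```

Shuffling only once (with a fixed seed) matters: it makes the small sets genuine subsets of the big ones, so a method that looks weak on 5k points was trained on exactly the same data it would have seen first in the 1-million run.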

Training, aka the fun part, was comparatively short and anticlimactic.  I just chose a couple of well-known methods in the field and tested them on the progressively sized data sets.  Since prototyping a system this way is so fast, I was able to weed out the weaker methods very early and come up with a better system, with a high relative performance gain.  Before I submitted the system to my boss, I also worked out an analysis of why the system doesn't reach 100%.   No surprise: it turns out the volume of the speech matters, and some individuals simply don't fit the stereotypes for their sex.    Overall the task went well, because we got better performance and we also know why certain things don't work.   That kind of knowledge is valuable in practice.
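To give a flavor of how weak methods get weeded out on progressively sized sets, here is a deliberately toy sketch.  Everything in it is invented for illustration: the "pitch" feature, the synthetic data, and the midpoint-threshold classifier are stand-ins, not the methods or data from my actual task:

```python
import random

def make_data(n, seed=0):
    """Synthetic 'mean pitch in Hz' data: two overlapping clusters.
    Label 0 is centered near 120 Hz, label 1 near 210 Hz."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        pitch = rng.gauss(120 if label == 0 else 210, 30)
        data.append((pitch, label))
    return data

def train_threshold(train):
    """Learn a midpoint threshold between the two class means."""
    c0 = [p for p, y in train if y == 0]
    c1 = [p for p, y in train if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2.0

def accuracy(threshold, test_set):
    correct = sum(1 for p, y in test_set if (p > threshold) == bool(y))
    return correct / float(len(test_set))

# Evaluate on progressively larger training sets; a method that already
# lags on a 200-point subset rarely catches up at 1 million points.
test_set = make_data(2000, seed=99)
for n in (200, 1000, 5000):
    th = train_threshold(make_data(n, seed=n))
    print(n, round(accuracy(th, test_set), 3))
```

The point of the sketch is the loop at the bottom: each candidate method is scored at several sizes, and the cheap small-set runs are usually enough to discard the losers before the expensive full-set run.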

One twist: after finishing the system, I found that the method with the best classification performance didn't have the best speed.   So I chose a cheaper but still rather effective method.    It hurts my heart to see the best method go unused, but that's the way it is sometimes.

Eventually, as one of the architects of the system, I also spent time making sure the integration was correct.   That took coding, much of it in C/C++/Python.  Since there were a couple of bugs in some existing code, I spent about a week tracing code with gdb.

The whole thing took me about three months.  Around 80% of my time was spent on data preparation and coding.  The machine learning you do in class does happen, but it only took me around two weeks to determine the best model, and I could have made those two weeks even shorter by using more cores.  Compared to the other tasks, the machine learning you do in class, which usually comes in the very nice form "Here is a training set, go train and evaluate it with the evaluation set", seldom appears in real life.  Most of the time, you are the one who prepares the training and evaluation sets.

So if you happen to work on machine learning, expect to work on tasks such as web crawling and scraping if you do text processing, to listen to thousands of waveforms if you do speech or music processing, and to watch videos you might not enjoy if you classify videos.   That's machine learning in real life.   If you also happen to be the one who decides which algorithm to use, yes, you will have some fun.   If you happen to design a new algorithm, you will have a lot of fun.  But most of the time, practitioners need to handle issues that can just be .... mundane.   Tasks such as web crawling are certainly not as enjoyable as applying advanced mathematics to a problem.   But they are incredibly important, and they will take up most of your time, or your organization's as a whole.

Perhaps that's why you have heard the term "data munging", or in Bill Howe's class, "data jujitsu".   It is a well-known skill, but not well advertised and unlikely to be seen as important.    In real life, though, such data-processing skill is crucial.   For example, in my case, if I hadn't had progressively sized datasets, prototyping could have taken a long time, and I might have needed 4 to 5 times more experimental time to determine what the best method is.    Of course, debugging is also slower if all you have is one huge data set.

In short, data scientists and machine learning practitioners spend the majority of their time as data janitors.   I think that has been a well-known phenomenon for a long time.  But now that machine learning has become a thing, there is more awareness [2].  I think this is a good thing, because it helps with scheduling and division of labor if you want to manage a group of engineers on a machine learning task.

[1] I might do a review at a certain point.
[2] e.g. This NYT article.

How to Compile a Debugged Version of Python

Most of the time, you shouldn't need to care about the internals of Python.   It is usually thought of as a tool and assumed to be bug-free.

Of course, there are moments when you should question those assumptions.  Sometimes the interpreter itself fails: it could segfault, or it could be too slow.

A more common scenario is that you write a C extension for Python and things don't work.  So what do you do?  You can stare at the source code and hope you find the issue.  Or you can just debug Python itself: simply treat python as the program and your script as its arguments.  Most of the time, when the interpreter loads your custom extension, it just links in the library you wrote, so breakpoints in your C code behave normally.
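As a sketch, assuming your script is called myscript.py and your extension contains a C function my_c_function (both names hypothetical), a session might look like this:

```
$ gdb --args python myscript.py
(gdb) break my_c_function     # breakpoint in your extension's C code;
                              # gdb will offer to make it pending until
                              # the shared library is loaded
(gdb) run
(gdb) backtrace               # inspect the C-level stack at the stop
```

The CPython source tree also ships a Misc/gdbinit file with helper macros (such as pystack) for looking at the Python-level stack from inside gdb, which is handy once you are stopped deep in interpreter code.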

If you want to do it right, you also want a debug build of Python.   So the question is how to compile one.  Is it difficult?  The answer is that compiling Python is surprisingly simple.   In fact, I think it is a very joyful activity.  (That's just me......)

Anyway, the following is the procedure I used.  I use Python 2.7.5 as an example; it is the old standard, so to speak.  The tip is now Python 2.7.6, and I believe the same procedure works.

1, Download Python Source Code

For Python2.7.5, you can do it by

wget http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tgz

2, Configure Python in Debug Mode

./configure --with-pydebug --prefix=$PWD/my_installation_path

3, Compile and Install the debug version

make OPT='-g'

make install
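Once the installed python runs, you can check that you really got a debug build.  This is a small sketch using the standard sysconfig module; on a --with-pydebug build both checks should come out positive:

```python
import sys
import sysconfig

# Py_DEBUG is 1 for a --with-pydebug build, 0 for a normal release build.
print("Py_DEBUG:", sysconfig.get_config_var('Py_DEBUG'))

# Debug builds also expose sys.gettotalrefcount(), which release builds lack.
print("debug build:", hasattr(sys, 'gettotalrefcount'))
```

Run it with the python you just installed, not the system one, or you will be checking the wrong interpreter.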

Notes:

  • Remember, valgrinding a debug build of Python will generate tons of messages (because valgrind thinks a lot of memory was never freed).  So remember to use the right suppression file. See http://svn.python.org/projects/python/trunk/Misc/README.valgrind
  • You also need to reinstall all the libraries your application depends on.
  • And remember, .pyc files from different versions may not be compatible with each other. So make sure you use the correct version of python to rerun your applications.

Arthur