Lessons Learned From The Higgs Boson Kaggle Challenge

15th November 2014
A little while ago I participated in a competition on Kaggle called the Higgs Boson Machine Learning Challenge, and I wanted to share some thoughts about my approach to the competition and some of the lessons I learned that may be useful to other machine learning practitioners.
If you're not familiar with Kaggle, it's the self-described "Home of Data Science". Kaggle hosts machine learning competitions put together by various companies and organizations. The competitions are typically well defined and time-bound, and usually offer some monetary award for the winning entries. Individuals and teams from all over the world can sign up and compete, submitting their best results to a public leaderboard to see where they stand. At the end of the competition, each competitor's best models are evaluated on a different draw from the test data, and the resulting "private" leaderboard determines your final rank.
The competition that I took part in was the largest that Kaggle has ever hosted, with nearly 1,800 teams participating. The event was sponsored by CERN with backing from Google and several other organizations. Our task was to analyze a set of simulated particle collision data containing features characterizing events detected by ATLAS (a particle physics experiment at the Large Hadron Collider at CERN). The objective of the analysis was to classify events as either a signal, indicating the tau-tau decay of a Higgs boson, or background noise. It sounds complicated, but in truth very little physics knowledge is required (although it certainly helps when deriving interesting features from the data).
Since this was my first exposure to Kaggle, I pretty much had to learn as I went along. Now that I've done it once, I feel confident that I would be even more successful with less ramp-up time in future competitions. Here are the most useful insights that I gained during this process.
Be prepared to code
It goes without saying that in order to be successful in a Kaggle competition, you need to have a fair amount of data science and machine learning knowledge already. But another skill set you'll probably need that may not be quite as obvious is the ability to program. While I'm sure that some competitors make do with software packages like RapidMiner or KNIME, the impression I got from reading the forums and listening to other participants is that almost everyone uses either R or Python along with each language's vast array of open-source software (I personally used Python). There are a number of challenges, such as generating submission files or implementing custom evaluation metrics, that are simply easier to handle in code, not to mention things like the transparency and reproducibility of your solution.
Get a submission pipeline going early on
One aspect of the competition that I underestimated initially was just setting up a pipeline to be able to generate a valid submission file. Each competition has its own format and specifics in terms of what a "solution" looks like, and you need to get it right in order to even see what your score is. It's a pretty good idea to get this set up right at the start and submit a solution with just random guessing, so you can verify that it works. In fact, you can extend this methodology to other aspects of the "machine learning pipeline", which leads to...
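To make that concrete, here's a minimal sketch of a random-guess submission writer, assuming the Higgs challenge's EventId/RankOrder/Class CSV format; the helper name and the signal-fraction threshold are my own illustrative choices, not part of the official specification:

```python
import csv
import random

def write_random_submission(event_ids, path="submission.csv"):
    """Write a valid-format submission with random guesses, just to
    verify the end-to-end pipeline before any real modeling."""
    random.seed(0)
    # Assign each event a random confidence score, then rank events by it.
    scores = {eid: random.random() for eid in event_ids}
    ranked = sorted(event_ids, key=lambda eid: scores[eid])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["EventId", "RankOrder", "Class"])
        for rank, eid in enumerate(ranked, start=1):
            # Arbitrarily label the top ~15% of ranks as signal ("s").
            label = "s" if rank > 0.85 * len(ranked) else "b"
            writer.writerow([eid, rank, label])
    return path
```

A file like this scores terribly, of course, but submitting it confirms that your formatting, ranking, and upload steps all work before you invest in real models.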
Use a modular approach to running experiments
Coming from a software engineering background, I had a stronger tendency to write "good" code than some of the other competitors, and while it requires more thought and effort early on, I think this pays off in the long run. In particular, I focused on modularizing and parameterizing various stages of the pipeline such as data loading, data visualization, feature scaling & imputation, model training, cross-validation, hyperparameter optimization, and submission file generation. Using this approach, I was able to run lots of different experiments very quickly with little effort. Check out the "higgs" folder of my GitHub repo for Kaggle competitions to see a concrete example (this is a relatively simple pipeline but it gets the point across).
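Here is a toy sketch of what I mean by a modular pipeline; the stage names, the synthetic data, and the placeholder nearest-centroid "model" are all illustrative stand-ins, not my actual competition code:

```python
import numpy as np

def load_data(n=200, seed=0):
    # Synthetic stand-in for the real data-loading stage.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

def scale_features(X):
    # Standardize each column to zero mean and unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def train_model(X, y):
    # Placeholder model: per-class feature centroids.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    # Nearest-centroid classification.
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def run_experiment():
    # Each stage is a swappable function, so trying a different scaler
    # or model is a one-line change rather than a rewrite.
    X, y = load_data()
    X = scale_features(X)
    model = train_model(X, y)
    return (predict(model, X) == y).mean()
```

The payoff is that an "experiment" becomes a short composition of named stages, which makes it cheap to vary one stage while holding the rest fixed.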
Explore the data before building models
One of the tendencies that machine learning newcomers seem to have is to just load up their favorite ML library and start throwing algorithms at the data to see what sticks (I admit that I'm guilty of this myself!). In order to understand the characteristics of the data you're working with, it's important to do some exploratory analysis before jumping to the modeling stage. This can lead to valuable insights that may affect modeling decisions, for example detecting the presence of outliers that need to be dealt with. A few things I like to try are calculating basic stats on each feature (mean, median, standard deviation, etc.), computing histograms or kernel density estimates for each of the features, generating a correlation matrix for the data set, and experimenting with transforms like PCA or even some manifold learning algorithms like Isomap.
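A hedged sketch of that kind of exploratory pass using NumPy is below; the helper name is mine, and for the real Higgs data you would also want to mask out its sentinel-coded missing values before computing any of these statistics:

```python
import numpy as np

def summarize(X, names):
    """Basic per-feature statistics plus a feature-by-feature
    correlation matrix for a numeric 2-D array."""
    stats = {}
    for j, name in enumerate(names):
        col = X[:, j]
        stats[name] = {
            "mean": float(col.mean()),
            "median": float(np.median(col)),
            "std": float(col.std()),
            "min": float(col.min()),
            "max": float(col.max()),
        }
    # rowvar=False treats columns (features) as the variables.
    corr = np.corrcoef(X, rowvar=False)
    return stats, corr
```

Even this crude summary surfaces useful facts quickly: wildly different feature scales suggest standardization, and strongly correlated pairs hint at redundancy that PCA or feature selection could address.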
Prefer simple models over complex models (at first)
It's generally a good idea to see how far you can get with simpler models like linear regression, logistic regression, or Naive Bayes before trying to build more complex models. One advantage that simple models have is they usually train really fast, so you can run lots of experiments quickly. They also tend to be more interpretable than complex non-linear models, so attempting to optimize a linear model may lead to valuable insights that you can leverage with more complex models if needed.
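For instance, a logistic regression baseline trains in seconds and exposes its learned weights directly; this sketch uses toy data and assumes scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: the target depends on the first two features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Interpretability: the coefficients show which features drive predictions.
weights = dict(zip(["f0", "f1", "f2", "f3"], model.coef_[0]))
```

Here the weights on the irrelevant features land near zero, which is exactly the kind of cheap insight that can guide feature selection for a bigger model later.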
Feature engineering is usually the difference between good and great
This sentiment was reiterated over and over again by many of the experts in the competition. Often the difference between good models and great models is the choice of features used. There's no hard and fast rule for deriving features; it's very much a creative process. Curiosity and persistence are critical.
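As one purely hypothetical illustration of the mechanics (the specific combinations below are arbitrary, not features that worked in the competition), deriving features often just means computing ratios, differences, and transforms of the raw columns:

```python
import numpy as np

def add_derived_features(X):
    """Append a few illustrative derived features to a numeric matrix.
    The choices here are hypothetical examples, not a recipe."""
    ratio = X[:, 0] / (X[:, 1] + 1e-9)   # ratio of two raw measurements
    diff = X[:, 0] - X[:, 1]             # their difference
    log0 = np.log1p(np.abs(X[:, 0]))     # compress a heavy-tailed feature
    return np.column_stack([X, ratio, diff, log0])
```

In the Higgs competition the valuable derived features came from physics reasoning about the collision events, which is where domain knowledge (or curiosity and persistence) earns its keep.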
Trust your CV results, NOT the public leaderboard
Another common thread that was very evident from the post-competition forum discussion is how important it is to basically ignore the public leaderboard scores and trust that your cross-validation scores are a good indication of the model's ability to generalize to unseen data. A number of teams overfit to the public leaderboard in the Higgs competition by making hundreds of submissions and essentially tuning their model parameters to optimize their leaderboard score. This gives the illusion of progress throughout the competition, as your public score appears to keep climbing, but the final evaluation uses a different subset of the test data. It was obvious which teams had overfit because their private scores came in much lower than their public scores.
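In code, trusting cross-validation means something like this sketch with scikit-learn's k-fold utilities (toy data, and the estimator choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in data for the real training set.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)

# k-fold CV averages performance over several held-out splits, giving a
# far less noisy generalization estimate than any single public split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
cv_mean, cv_std = scores.mean(), scores.std()
```

The standard deviation across folds is as important as the mean: if a "leaderboard improvement" is smaller than your fold-to-fold noise, it probably isn't real.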
Always ensemble at the end
Finally, one technique that I didn't use that was very popular among the top competitors is creating an ensemble of many models together for the final submission. Ensembles have been shown to generalize better on average than any single model, so this is almost a no-brainer. There are a number of different ensembling techniques and most of the top libraries implement at least a few ensemble algorithms. One caveat with ensembles though is to wait until you've optimized a single model as much as possible before trying to ensemble since it's very computationally expensive and time-consuming to train a large ensemble.
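A simple starting point is scikit-learn's soft-voting ensemble, which averages predicted class probabilities across diverse base models; the data and the particular base models below are illustrative choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data with a non-linear signal component.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Soft voting averages predicted probabilities across the base models,
# so diverse models can cancel out each other's mistakes.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```

Ensembling works best when the base models make different kinds of errors, which is why mixing model families (linear, tree-based, etc.) tends to beat averaging many copies of the same one.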
Hopefully you found this advice useful and will be able to apply it to future machine learning projects and Kaggle competitions.