January 14, 2012
One of my first runs in a Kaggle competition has finally ended with the closing of “Don’t Get Kicked!” It was an excellent learning experience, and I had a lot of fun in the process (at the expense of sleep). Plus, I did pretty well, ranking 36th out of 588 teams!
The task was to predict bad buys or “lemons” among thousands of cars purchased through online auctions. Contestants were given information about each car, like purchase price, make and model, trim level, odometer reading, date of purchase, state of origin, and so on. This totaled about 40 variables (plus the lemon status) on roughly 70,000 cars, and the test data had the same information on about 40,000 new cars (minus the lemon status). Contestants submitted predictions, in the form of probabilities, about which of these 40,000 cars would turn out to be lemons. Each set of predictions was scored instantly against the true lemon status (hidden from the competitors) using the Gini coefficient.
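For readers unfamiliar with the metric, here is a minimal Python sketch of one common way to compute a Gini coefficient for binary outcomes: rescale the AUC (the share of lemon/non-lemon pairs ranked in the right order) so that a perfect ranking scores 1 and a random one scores about 0. This is an illustration only. The competition's exact formula and tie handling may differ, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (lemon, non-lemon) pairs where the lemon gets the higher score.

    Tied pairs count as half a win (an assumption of this sketch).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient for binary labels, rescaled from AUC to the range [-1, 1]."""
    return 2.0 * auc(labels, scores) - 1.0


# Made-up example: 1 = lemon, 0 = good buy.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]

print(gini(labels, scores))  # 8 of the 9 pairs are ranked correctly
print(gini(labels, labels))  # a perfect ranking
```

Only the ordering of the scores matters here, not their absolute values, so the metric rewards ranking the likeliest lemons highest.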
I submitted 30 sets of predictions over the course of the competition, and I played around with several methods. Below I’ve plotted my results over time, along with the method behind each submission. The dashed grey line along the top is the score of the winning submission.
Notice that big jump around October 10th? Right around that time, I had a groundbreaking insight: I should read the competition directions more carefully. It turned out that I had been submitting predictions as 0s and 1s, not as predicted (continuous) probabilities between 0 and 1.
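To see why this matters under a rank-based score, here is a small Python sketch comparing continuous probabilities with the same predictions rounded to 0s and 1s. It assumes the Gini is computed as 2 × AUC − 1 with tied pairs counted as half, which may not match the competition's exact scoring. The numbers and the 0.5 cutoff are invented for illustration.

```python
def gini(labels, scores):
    """Gini as 2*AUC - 1, counting tied lemon/non-lemon pairs as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 2.0 * wins / (len(pos) * len(neg)) - 1.0


# Made-up data: 1 = lemon, 0 = good buy.
labels = [0, 0, 1, 0, 1, 1]
probs = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]

# Rounding to hard 0/1 calls throws away the ordering within each group.
hard = [1 if p >= 0.5 else 0 for p in probs]

gini_probs = gini(labels, probs)
gini_hard = gini(labels, hard)
print(gini_probs, gini_hard)
```

In this toy case the rounded predictions score lower because the lemon with probability 0.35 becomes indistinguishable from the three good buys. With only two distinct values, the submission can no longer rank cars within each group.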
Once I fixed that, things got a lot better. Here’s the same plot, excluding the erroneous submissions:
My best scores resulted from combining sets of predictions from different types of models: boosted trees, random forests, and additive models. I used a weighted average of the predictions from each method, choosing the weights using cross-validation. Everything was done in R: boosting with the gbm package, random forests with the randomForest package, and additive models with the
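The blending idea can be sketched in a few lines. The Python below is not the original R code. It assumes you already have held-out (for example, out-of-fold) predictions from three models, and it searches a coarse grid of weights that sum to 1 for the blend with the best Gini. The function names, the grid step, and the data are all hypothetical.

```python
from itertools import product


def gini(labels, scores):
    """Gini as 2*AUC - 1, counting tied pairs as half a win (sketch assumption)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 2.0 * wins / (len(pos) * len(neg)) - 1.0


def blend(weights, model_preds):
    """Weighted average of the models' predictions, one value per car."""
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*model_preds)]


def best_weights(labels, model_preds, steps=10):
    """Grid-search three weights (summing to 1) that maximize held-out Gini."""
    best_w, best_score = None, float("-inf")
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w = (i / steps, j / steps, (steps - i - j) / steps)
        score = gini(labels, blend(w, model_preds))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Made-up held-out predictions from three hypothetical models.
labels = [0, 1, 0, 1, 0, 1, 0, 0]
boosted = [0.20, 0.70, 0.30, 0.40, 0.10, 0.90, 0.50, 0.20]
forest = [0.30, 0.60, 0.20, 0.55, 0.25, 0.70, 0.35, 0.40]
additive = [0.15, 0.50, 0.45, 0.60, 0.20, 0.65, 0.30, 0.10]

weights, score = best_weights(labels, [boosted, forest, additive])
print(weights, score)
```

Because the grid includes the corners where a single model gets all the weight, the chosen blend never scores worse on the held-out data than the best individual model. Whether that gain carries over to new data is what cross-validation is meant to check.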
My basic strategy was to create lots of plots and descriptive statistics (just to wrap my head around the data), then create useful transformations and combinations of variables that could be fed into models as predictors. Over time, I was including more and more predictors in my models:
By the end I had created about 150 predictors, but only about 50 proved to be useful. As you can see below, boosting was superior to most other methods given the same predictors.
This was a blast and I learned a lot from my mistakes, but I’m ready for a break. I look forward to seeing what the winners did!