Follow-up to the StumbleUpon Challenge

     A few weeks ago I wrote about my first experience with a data science competition on Kaggle.com, the StumbleUpon Evergreen Classifier Challenge. The competition officially ended last night, and I finished in 16th place (out of ~600)! I learned a ton from reading other folks' blog posts and the forums, so I thought I'd do my part and share my final solution as well:

     In the end, I used an ensemble of six different models. I used only the boilerplate text, with a TF-IDF vectorizer and various types of lemmatization/stemming. Along the way, I tried a couple of extra features, a part-of-speech tagger and a genre tagger, both built with nltk, but neither ended up improving the final score. Two things that did help were the domain of the URL (e.g. www.bleacherreport.com) and, separately, using only the first 5/10/15/20 words of the boilerplate. The final ensemble included:

  1. Logistic Regression with Word 1,2-Grams
  2. Logistic Regression with Character 3-Grams
  3. Multinomial Naive Bayes with Word 1,2-Grams and chi2 Feature Selection
  4. Random Forest with Character 1,2-Grams and chi2 Feature Selection
  5. Logistic Regression with the URL Domain
  6. Logistic Regression with the First 5/10/15/20 Words
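     As a rough illustration of the inputs to models 5 and 6, here is a minimal sketch of pulling the domain out of a URL and truncating the boilerplate to its first few words. The helper names (url_domain, first_n_words, truncated_views) are made up for this example and are not taken from FinalModel.py, and how the four truncated copies were actually fed into the model is an assumption left open here.

```python
from urllib.parse import urlparse


def url_domain(url):
    """Return the host part of a URL, e.g. 'www.bleacherreport.com'."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # Bare hosts such as 'www.example.com/page' have no scheme, so
        # urlparse leaves netloc empty; retry with a leading '//'.
        parsed = urlparse("//" + url)
    return parsed.netloc.lower()


def first_n_words(text, n):
    """Keep only the first n whitespace-separated words of the text."""
    return " ".join(text.split()[:n])


def truncated_views(text, sizes=(5, 10, 15, 20)):
    """Build one truncated copy of the text per cutoff (hypothetical helper)."""
    return {n: first_n_words(text, n) for n in sizes}


domain = url_domain("http://www.bleacherreport.com/articles/123")
views = truncated_views("one two three four five six seven")
```

     Each of these outputs is just a string, so it can go through a vectorizer in the same way as the full boilerplate text.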

     To combine the models, I started by splitting the training set into a 5-fold CV loop. Within each fold, I trained each model separately on 4/5 of the data and used it to predict the remaining 1/5. After the CV loop, this gave me an out-of-fold prediction from every model for every example in the training set. Then, I used Ridge Regression to fit the combination of those predictions to the training labels.
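     The stacking step described above can be sketched roughly as follows. This is not the code from FinalModel.py: the data is synthetic, the two base models are stand-ins for the six listed earlier, and the function name out_of_fold_predictions and the Ridge alpha are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB


def out_of_fold_predictions(models, X, y, n_splits=5, seed=0):
    """Return an (n_samples, n_models) array of held-out predictions."""
    oof = np.zeros((X.shape[0], len(models)))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, held_out_idx in folds.split(X):
        for j, model in enumerate(models):
            # Fit on 4/5 of the rows, then predict the remaining 1/5.
            model.fit(X[train_idx], y[train_idx])
            oof[held_out_idx, j] = model.predict_proba(X[held_out_idx])[:, 1]
    return oof


# Synthetic stand-in data; the real inputs were TF-IDF text features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

oof = out_of_fold_predictions(base_models, X, y)

# Rows are samples and columns are models, which is the layout Ridge expects.
blender = Ridge(alpha=1.0).fit(oof, y)
```

     Because every row of oof comes from a model that never saw that row during training, the Ridge weights are fit on honest predictions rather than on overfit ones.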

     Finally, I re-trained each of the models on the complete training set and applied them to the test set. Then, I used the previously trained Ridge model to combine their outputs on the test set. The best score I was able to achieve on the private leaderboard was 0.88562.
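     Here is a compact sketch of that last stage, again on synthetic data with two stand-in models rather than the real six. It uses scikit-learn's cross_val_predict as a shorthand for the fold loop (for classifiers it stratifies the folds, which may differ from the original script), and the variable names and alpha value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the competition's train/test split.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

# Step 1: out-of-fold predictions on the training set, one column per model.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
blender = Ridge(alpha=1.0).fit(oof, y_train)

# Step 2: refit every base model on the complete training set and
# predict the test set, keeping the same column order as in step 1.
test_preds = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# Step 3: reuse the already-fitted blender; it is never refit on test data.
final_scores = blender.predict(test_preds)
```

     The column order of test_preds has to match the order used when the blender was fit, since each Ridge weight belongs to one specific model.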

      (Note: In order to cross-validate all of this, I originally had a much longer script that split the training set into a 5-fold CV loop and performed the entire routine above using 4/5 as the training set and 1/5 as the test set. This gave a CV score of around 0.8835.)


Source Code: FinalModel.py




  1. Interesting! Do you have any plans to post your code?

  2. Sure, it's posted at the end (right above the share this post part), hope it's useful!

  3. Hi Lauren, did you forget to transpose the final X array before Ridge regression?

  4. Oh, yeah, great catch! I fixed it in the link, thanks!