
My First Kaggle.com Competition

     One of the coolest parts of data science is getting to tackle brand-new questions you wouldn't ordinarily come across. A great place to find such challenges is the data science competitions at Kaggle. A while back, I started playing around with some of their 'learning' competitions, which are a good starting point since there are tons of tutorials and starter code. Lately, however, I've wanted to take on a full competition from scratch, which has brought me to an especially cool challenge: the StumbleUpon Evergreen Classifier Challenge.

     The goal of this competition is to classify a bunch of webpages as "evergreen" or "ephemeral". We're given quite a bit of information about each webpage, but the most useful information seems to come from the boilerplate text. Basically, each webpage has a url, a title, and a body (i.e. a bunch of text), and we want to build some sort of document classifier from this. Our plan is the following:

  • 1. Preprocess the Data
  • 2. Run Feature Selection
  • 3. Train a Classifier

1. Preprocessing the Data

     The first thing we want to do is clean up the data a bit. There are a few things we want to do. First, we want to turn the text into a list of words. To do this, we're going to use the tokenizer in the nltk package, which simply converts the text of each webpage into a list of words, e.g.:

'The cat swims while the cats swim'
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']

We also want to do something to recognize similar words (e.g. 'cats' vs. 'cat'). Fortunately, the nltk package also has several ways to do this, one of which is the WordNet Lemmatizer. For example:
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']
['The', 'cat', 'swim', 'while', 'the', 'cat', 'swim']

The scikit-learn example page provides a quick example for combining these two steps into the format we'll need later:

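           from nltk import word_tokenize
           from nltk.stem import WordNetLemmatizer

           class LemmaTokenizer(object):
               # Callable that tokenizes a document, then lemmatizes each token
               def __init__(self):
                   self.wnl = WordNetLemmatizer()
               def __call__(self, doc):
                   return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]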
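To see why the class is written as a callable, here is a toy stand-in that follows the same pattern without needing nltk or its data files. The name `ToyLemmaTokenizer` and its hand-written lemma table are made up for illustration; the table is not WordNet and only covers the example sentence.

```python
import re


class ToyLemmaTokenizer:
    """Callable tokenizer with a hand-written lemma table (NOT WordNet)."""

    # Hypothetical lookup table, just enough for the example sentence.
    LEMMAS = {"cats": "cat", "swims": "swim"}

    def __call__(self, doc):
        # Split on runs of letters, then map each token to its lemma if known.
        tokens = re.findall(r"[A-Za-z]+", doc)
        return [self.LEMMAS.get(t, t) for t in tokens]


tokenize = ToyLemmaTokenizer()
tokens = tokenize("The cat swims while the cats swim")
print(tokens)
```

Because the object can be called like a function on a single document string, it can be handed to anything that expects a tokenizer function.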

The next thing we want to do is use a vectorizer to put this into the format we need for our classifier. We're going to use a Tf-Idf vectorizer, which transforms the list of words for each webpage into a row of a matrix, where each element weights the number of occurrences of a word in that page by a (log-scaled) inverse of the number of pages the word appears in. This weights the words so that common words that appear everywhere don't swamp the rarer words that appear preferentially in "ephemeral" or "evergreen" webpages. We can combine all the steps so far:

           from sklearn.feature_extraction.text import TfidfVectorizer

           vect = TfidfVectorizer(stop_words='english', min_df=3, max_df=1.0,
                                  strip_accents='unicode', analyzer='word', ngram_range=(1, 2),
                                  use_idf=True, smooth_idf=True, sublinear_tf=True,
                                  tokenizer=LemmaTokenizer())
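To make the weighting concrete, here is a minimal pure-Python sketch of what those settings compute, as I understand the scikit-learn documentation: sublinear tf (1 + ln tf), smoothed idf (ln((1 + n) / (1 + df)) + 1), and L2 normalization of each row. The function name `tfidf_rows` and the toy corpus are made up, and the sketch ignores stop words, `min_df`, and bigrams.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear tf: 1 + ln(count), multiplied by the term's idf.
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalize the row (fall back to 1.0 for an empty document).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy corpus (made up): "recipe" appears in every page, "score" in only one.
docs = [
    ["recipe", "chicken", "recipe"],
    ["recipe", "score", "game"],
    ["recipe", "chicken", "oven"],
]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second page, "recipe" and "score" each occur once, but "score" ends up with the larger weight because it appears in fewer pages.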


2. Running Feature Selection

      Now we have a big matrix (in this case, around 10,000 x 160,000) where each row is a webpage, and each column represents a specific word or two-word phrase (feature). Most of these features, however, are not particularly informative. We can improve our model by eliminating the features that are least informative and are likely to lead to overfitting (A really great reference for this step is here).

      There are a number of ways to evaluate the 'usefulness' of a feature -- we're going to use a simple one, the Chi-Squared score. A Chi-Squared test is typically used to evaluate the independence of events. For example, if we have some feature where P(evergreen | feature) = P(evergreen), then we know that the feature is not particularly informative, whereas if P(evergreen | feature) is very different from the base probability P(evergreen), then the feature is very informative. Chi-Squared feature selection will help us to identify the most informative features. This is also very easily implemented using scikit-learn:

           from sklearn.feature_selection import SelectPercentile, chi2
           FS = SelectPercentile(score_func=chi2, percentile=k)  # k = percentage of features to keep

To figure out what percentage of features to keep, we will use a cross-validation loop in the next step.
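To make the independence idea concrete, here is a small pure-Python sketch that scores a word by the textbook chi-squared statistic on a 2x2 table (word present or absent vs. evergreen or ephemeral) and then keeps the top-scoring percentage of words. The helper names and the toy data are made up. As far as I know, scikit-learn's `chi2` works on the feature values themselves rather than on a presence/absence table, so its numbers will differ from these.

```python
def chi2_score(present, labels):
    """Chi-squared statistic for one binary feature against binary labels.

    present[i] is truthy if the word occurs in page i;
    labels[i] is 1 for evergreen and 0 for ephemeral.
    """
    n = len(labels)
    observed = {(f, c): 0 for f in (True, False) for c in (0, 1)}
    for f, c in zip(present, labels):
        observed[(bool(f), c)] += 1

    score = 0.0
    for f in (True, False):
        for c in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            col_total = observed[(True, c)] + observed[(False, c)]
            # Count we would expect if the word and the label were independent.
            expected = row_total * col_total / n
            if expected > 0:
                score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def keep_top_percentile(scores, percentile):
    """Return the feature names whose scores fall in the top `percentile` percent."""
    n_keep = max(1, round(len(scores) * percentile / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Toy data (made up): 4 evergreen pages followed by 4 ephemeral pages.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "recipe": [1, 1, 1, 1, 0, 0, 0, 1],  # mostly in evergreen pages
    "the":    [1, 1, 0, 0, 1, 1, 0, 0],  # equally common in both classes
    "score":  [0, 0, 0, 0, 1, 1, 1, 0],  # mostly in ephemeral pages
    "page":   [1, 0, 1, 0, 1, 0, 1, 0],  # equally common in both classes
}
scores = {word: chi2_score(col, labels) for word, col in features.items()}
selected = keep_top_percentile(scores, 50)
print(scores, selected)
```

Words that are equally common in both classes score zero, while words that lean toward either class score high, so a word that signals "ephemeral" is just as useful as one that signals "evergreen".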


3. Training a Classifier

      Now we're ready to train the classifier. There are many different options here, but we're going to start off with a Multinomial Naive-Bayes classifier. There are two main parameters in our model. First, we have to decide what percentage of features to keep in the feature selection step (k). Second, we have to decide which smoothing parameter to use in our Naive-Bayes classifier (alpha). To do this, we'll run a 5-fold cross-validation loop over a range of values for k and alpha, and select the best combination. Scikit-learn provides an easy way to perform cross-validation:

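           from sklearn.model_selection import StratifiedKFold

           # sklearn >= 0.18 API; older releases used StratifiedKFold(trainlabels, n_folds = 5)
           kf = StratifiedKFold(n_splits=5)
           for train, cv in kf.split(trainset, trainlabels):
               X_train, X_cv = trainset[train], trainset[cv]
               y_train, y_cv = trainlabels[train], trainlabels[cv]
               ## Train Model and Calculate AUC Score for this Fold, for Each k, Alpha ##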
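The step left as a comment in the loop above might be filled in along these lines. This is only a sketch, not the code from ModelNB.py: the function name `grid_search_cv`, the candidate values for k and alpha, and the synthetic matrix standing in for the tf-idf features are all assumptions, and it requires numpy and a recent scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB


def grid_search_cv(X, y, k_values, alpha_values, n_splits=5, seed=0):
    """Return (best_k, best_alpha, best_mean_auc) from a stratified CV grid search."""
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(kf.split(X, y))  # reuse the same folds for every (k, alpha)
    best = (None, None, -1.0)
    for k in k_values:
        for alpha in alpha_values:
            fold_aucs = []
            for train, cv in folds:
                # Fit feature selection on the training fold only, so the
                # held-out fold does not influence which features are kept.
                fs = SelectPercentile(score_func=chi2, percentile=k)
                X_train = fs.fit_transform(X[train], y[train])
                X_cv = fs.transform(X[cv])
                model = MultinomialNB(alpha=alpha).fit(X_train, y[train])
                probs = model.predict_proba(X_cv)[:, 1]  # P(class 1)
                fold_aucs.append(roc_auc_score(y[cv], probs))
            mean_auc = float(np.mean(fold_aucs))
            if mean_auc > best[2]:
                best = (k, alpha, mean_auc)
    return best


# Synthetic stand-in for the tf-idf matrix: non-negative values, with the
# first five columns shifted upward for class 1 so they carry some signal.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
X = rng.random((100, 40))
X[:, :5] += y[:, None]

k_grid = [10, 50, 100]
alpha_grid = [0.01, 0.1, 1.0]
best_k, best_alpha, best_auc = grid_search_cv(X, y, k_grid, alpha_grid)
print(best_k, best_alpha, round(best_auc, 3))
```

Averaging the AUC over the five folds before comparing combinations keeps a single lucky fold from deciding which k and alpha win.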

After we do this, we can use our best k and best alpha to fit the classifier to the full training set, and then apply it to the test set. And we're done! This particular model gives a score of about 0.869; however, when combined in an ensemble with other models, I've been able to get up to about 0.883. The next step is to incorporate some of the non-text features into our model and see how much higher we can get!


Source Code: ModelNB.py




  1. Hello,
    thank you for this code.
    A question, I think lines 60 and 64 should be :
    FS=SelectPercentile(score_func=chi2,percentile=kToTest [k[0]])
    model = MultinomialNB(alpha=alphaToTest[alpha[0]])

    What do you think ?

    Does it change the final score ?

  2. Oh, great catch! Completely missed that, I rewrote all the variables before posting to make it more clear and messed that part up. Thanks Andre!

    (The original submission was with the right k and alpha though, so the score should be the same. Thanks!)
