(This was originally posted as a guest post at the very awesome Insight Data Science blog. I highly recommend checking it out!)

It's more than a little frightening to make any major career transition in life, and switching from academia into industry is certainly one of the scarier. For me, graduate school, while full of frustrations and stressful moments, felt surprisingly safe. In academia, it's usually very clear what is expected of you, there's no real risk of being fired or your company not succeeding, and you're typically free to set your own schedule (albeit, a long and not necessarily ideal one). In its own way, graduate school can feel almost comfortable, while switching careers into an unknown territory can be very intimidating.

On the other hand, there are a number of real upsides to making the transition. Data science has the potential to be an extremely rewarding career, with the possibility of making real, day-to-day impacts on the lives of hundreds, thousands, even millions of people. There's nothing more satisfying than building some app, or website, or policy, or program, that you can point to and say, "I made that happen!" And you can do this, not on a five, six, seven year time-scale, but in weeks or even days (or hours!).

The hardest part is taking that leap of faith necessary to get your foot in the door. Although applying to graduate school was very stressful, it was also fairly objective. Getting the perfect job in industry is a completely different story - the good grades, test scores, and research results can certainly matter, but much of the job decision will come down to how you present yourself in your resume and through the interviews. 

Figuring out how to navigate through the transition process is very tricky, so where do you start? The first (and probably the trickiest) step is to learn what you need to be focusing on in the first place. For example, I had a fairly strong background in programming prior to Insight, but I learned all of it on my own through many different graduate school projects. In order to show that I had the right skills in interviews, I had to go back and learn many of the CS fundamentals that come with having studied them in a university setting: algorithms, dynamic programming, etc. Similarly, I had lots of experience with using data, making data usable in the first place, running statistical tests, and so on, but I had no way of knowing the right language to use in interviews or how to highlight the most important and relevant parts of work that I'd done.

 For me, learning what to focus on in order to be successful was by far the most helpful benefit of participating in the Insight program. Surrounding myself with people who have been through the same process and people who were going through the same process alongside me was crucial to my confidence in my decision. If you're considering making the transition, and it really is an awesome one, I highly encourage you to seek out as many folks as you can, and ask a lot of questions.

Here are a few other questions I get asked a lot today, and my answers for those of you ready to take the first scary step:

What was your project at Insight? 

I had an absolute blast making my web-app, Pick Your Fiction. The idea behind the app is simple: sometimes, I want to read a book that features certain things - say, a fantasy novel with elves. But, there are many things I don't necessarily like in fantasy novels (gothic, angst-ridden, teenage vampires?). What I'd really like to be able to do is to say, "These are the things that make me happy, these are the things that make me sad, find me a book tailored to those preferences."

With this in mind, I developed Pick Your Fiction. Here, you can enter a title of a book you love (or leave it blank) and things that you like and/or don't like in books, and it will recommend a book based on your input. For example, suppose I want a book about elves:

However, what if I want a book that doesn't have dragons in it?

There are also a few additional options - you can control how important it is to you that the books are popular (for example, if you've read all the Harry Potter books, you may also have read other very popular young adult fantasy novels, and might want something less well known), or that the books are of the same genre as your original title (i.e., say you want an adult version of a children's story, or a science fiction version of a historical novel). The app will also provide a few suggestions for you based upon features that tend to be divisive within a subject.

How does all of this work? The app primarily relies upon a very large number of customer reviews scraped from a well-known website, as well as the description/summary provided for each book. These reviews compose a significant database of words for each of the books. The app starts off by comparing the similarity between books using Python with nltk for tokenizing + stemming + Tfidf weighting + cosine similarity. This similarity metric is then weighted based upon the similarity in genre and the popularity of the book, where each of these weights is controlled by the user.
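As a rough illustration of the similarity step, here is a minimal sketch of Tf-Idf weighting followed by cosine similarity, written in plain Python rather than with nltk. The three "books" and their review words are made up, the text is assumed to be tokenized already (no stemming is done here), and the function names are hypothetical rather than taken from the app itself.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already-tokenized review text for three books.
reviews = {
    "book_a": "elves forest magic quest elves".split(),
    "book_b": "elves magic dragon quest".split(),
    "book_c": "spaceship robot laser planet".split(),
}
titles = list(reviews)
vectors = dict(zip(titles, tfidf_vectors([reviews[t] for t in titles])))

def most_similar(title):
    """Rank the other books by cosine similarity to `title`."""
    scores = {
        other: cosine(vectors[title], vectors[other])
        for other in titles if other != title
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `most_similar("book_a")` ranks `book_b` first, since the two share several weighted words while `book_c` shares none. The genre and popularity weights described above would then be applied on top of these raw similarity scores.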

The final scores are then boosted based upon the frequencies of the user-added features relative to their baseline frequencies. Finally, the additional suggestions are chosen based upon those words that appear in as close to 50% of the closest-matching books as possible, but where this value is high or low relative to the typical frequency. All of the information is stored in a MySQL database, and the front-end is built using Flask + Bootstrap + jQuery, and hosted on AWS.

How did you pick your project? 

This is a really tricky question, and was actually one of the hardest parts of the program for me. There is a huge amount of freedom in deciding what to work on, and there's essentially no limit to what you can do. There are, however, a few key things to think about when deciding. Obviously, the project has to be doable (and doable within a 2-3 week timespan!), and, ideally, it should involve using techniques that highlight your ability to work with data. Projects that involve cleaning data and using interesting techniques/software are good, since interviews will often involve a significant amount of time spent talking about your project. In my case, the project was heavy on natural language processing and I ended up talking about that quite a bit.

The most important thing, however, is to make sure that it's a project that you're enthusiastic about and will enjoy working on. The reason for this is that you're going to spend an inordinate amount of time talking about your project, and it shows right away if it's a project you care strongly about. In my case, Pick Your Fiction was an app I was extremely excited about - I loved testing out different people's requests, talking about all of the many issues I faced along the way, discussing what I would do if I were to try and monetize and grow the app, etc. I was always extremely happy when interviewers asked about my app, because I loved talking about it, and I genuinely thought it was a useful/interesting thing.

A few weeks ago, John Joo posted an excellent list of things to do to prepare for Insight in the weeks/months leading up to the program. However, what if you're still several years out? What can you do throughout your Ph.D. to prepare for a future career in data science? 

The best thing you can do for yourself, by far, is to work on some pet project that involves data. It doesn't have to be anything mind-blowing or enormous, just something that you can point to as evidence of your skills in data science, and also your interest and enthusiasm for the data science industry. As I mentioned in the previous question, the single best thing you can do to convince someone you're capable of being a great data scientist is to show them with something tangible. And perhaps even more importantly, steering interviews to a project that you've done yourself means you'll know all the answers to the questions they're likely to ask!

There are a bunch of other ways to get involved in data science as well. A big one for me was the data science competitions over at kaggle.com. This was another case of having a massive community from which you can learn how to do things. I started off with the basic tutorial competitions, which come with a ton of different sample code snippets, and from there started to develop a better intuition for which techniques worked and which didn't. (These projects are also great things to talk about in interviews, and they typically provide good data sets to play around with).

When you're ready to start making the transition to a career in data science (or any new career for that matter), your first step should be to reach out to people who have been in your position before and ask them about their experiences directly. For me, I talked with many different alumni from the Insight program, as well as friends from my graduate program who had graduated prior to me, and other people in the alumni network. One of the biggest benefits of doing this is simply the exposure to the field - there were a huge number of different phrases and keywords that I'd heard for the very first time in my first week of Insight, that are now just everyday occurrences.  Knowing that many of my colleagues and peers had made the same transition made my decision that much less scary, and helped steer me towards Insight and my new role at Khan Academy. Good luck taking that first step!
     A few weeks ago I talked about my first experience with a data science competition on Kaggle.com, the StumbleUpon Evergreen Classifier Challenge. This competition officially ended last night and I was able to manage 16th place (out of ~600)! I learned a ton from reading other folks' blog posts and the forums, and thought I'd do my part and share my final solution as well:

     In the end, I used an ensemble of six different types of model. I used only the boilerplate text, with a TfIdf vectorizer and various types of lemmatization/stemming. Along the way, I added in a couple of extra features, including a part-of-speech tagger, as well as a genre tagger, both using nltk, but neither of these ended up improving the final score. Two things that did end up helping were the domain of the URL (e.g. www.bleacherreport.com), as well as separately using only the first 5/10/15/20 words in the boilerplate. In the end, the final ensemble included:

  • 1. Logistic Regression with Word 1,2-Grams
  • 2. Logistic Regression with Character 3-Grams
  • 3. Multinomial Naive-Bayes with Word 1,2-Grams and chi2 Feature Selection
  • 4. Random Forest with Character 1,2-Grams and chi2 Feature Selection
  • 5. Logistic Regression with the URL Domain
  • 6. Logistic Regression with the First 5/10/15/20 Words

     In order to combine all of the models together, I started out by splitting the training set into a 5-fold CV loop. Within each fold, I trained each of the models separately (thereby generating a prediction on 1/5 of the data, using 4/5 of the data). After the CV loop, this resulted in a prediction for each of the different models on the complete training set. Then, I used Ridge Regression to fit the combination of the predictions to the training set.

     Finally, I re-trained each of the different models on the complete training set and applied them to the test set. Then, I used the previously-trained Ridge model to combine the outputs on the test set. The best score I was able to achieve on the private leaderboard ended up being 0.88562.
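The stacking procedure described above can be sketched roughly as follows. This is not the contents of FinalModel.py: the data is synthetic, only two placeholder base models stand in for the six listed earlier, and the Ridge penalty (`alpha=1.0`) is an assumed value. The point is just the order of operations: out-of-fold predictions, then the Ridge fit, then re-training on the full training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data; the real inputs were text features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

def make_models():
    """Fresh, unfitted base models (placeholders for the six in the post)."""
    return [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ]

# Step 1: out-of-fold predictions for every base model on the training set.
n_models = len(make_models())
oof = np.zeros((len(y_train), n_models))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, held_idx in folds.split(X_train, y_train):
    for j, model in enumerate(make_models()):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof[held_idx, j] = model.predict_proba(X_train[held_idx])[:, 1]

# Step 2: fit the Ridge combiner on the out-of-fold predictions.
blender = Ridge(alpha=1.0)
blender.fit(oof, y_train)

# Step 3: re-train each base model on the full training set, predict the test set.
test_preds = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in make_models()
])

# Step 4: combine the test predictions with the already-fitted Ridge model.
ensemble_scores = blender.predict(test_preds)
ensemble_auc = roc_auc_score(y_test, ensemble_scores)
```

Because the combiner only ever sees predictions made on data the base models were not trained on, it learns weights that reflect how each model generalizes rather than how well it memorizes.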

      (Note: In order to cross-validate all of this, I originally had a much-longer script that split the training set into a 5-fold cv loop, and performed the entire above routine using 4/5 as the training set and 1/5 as the test set. This gave a CV-score of around 0.8835.)


Source Code: FinalModel.py
     One of the coolest parts about data science is getting to tackle brand-new questions you wouldn't ordinarily have come across. A great place for finding such challenges is through data science competitions at Kaggle. A while back, I started playing around with some of their 'learning' competitions, which are a great place to start since there are tons of tutorials and plenty of starter code to help you along. However, lately I've wanted to take on a full competition from scratch, which has brought me to an especially cool challenge, the StumbleUpon Evergreen Classifier Challenge.

     The goal of this competition is to classify a bunch of webpages as "evergreen" or "ephemeral". We're given quite a bit of information about each webpage, but the most useful information seems to come from the boilerplate text. Basically, each webpage has a url, a title, and a body (i.e. a bunch of text), and we want to build some sort of document classifier from this. Our plan is the following:

  • 1. Preprocess the Data
  • 2. Run Feature Selection
  • 3. Train a Classifier

1. Preprocessing the Data

     The first thing we want to do is clean up the data a bit. There are a few things we want to do: First, we want to turn the text into a list of words. To do this, we're going to use the tokenizer in the nltk package. This will simply convert the text for each webpage into a list of words, for example:
'The cat swims while the cats swim'
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']

We also want to do something to recognize similar words (i.e. 'cats' vs. 'cat'). Fortunately, the nltk package also has a bunch of ways to do this, one of which is the WordNet Lemmatizer. For example:      
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']
['The', 'cat', 'swim', 'while', 'the', 'cat', 'swim']

The scikit-learn example page provides a quick example for combining these two steps into the format we'll need later:

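           from nltk import word_tokenize
           from nltk.stem import WordNetLemmatizer

           class LemmaTokenizer(object):
                def __init__(self):
                    self.wnl = WordNetLemmatizer()
                def __call__(self, doc):
                    # Tokenize the document, then lemmatize each token
                    return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]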

The next thing we want to do is to use a vectorizer to put this in the format we need for our classifier. We're going to use a Tf-Idf Vectorizer, which will transform the list of words in each webpage into a matrix, where each element weights the number of occurrences of a word in a page by the inverse of the number of pages in which that word appears. This will weight the different words such that common words that appear everywhere don't swamp out rare words that appear preferentially in "ephemeral" or "evergreen" webpages. We can combine all the steps so far:

           from sklearn.feature_extraction.text import TfidfVectorizer

           vect = TfidfVectorizer(stop_words = 'english', min_df = 3, max_df = 1.0,
               strip_accents = 'unicode', analyzer = 'word', ngram_range = (1,2), use_idf = True,
               smooth_idf = True, sublinear_tf = True, tokenizer = LemmaTokenizer())
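For reference, here is my understanding of what the weighting looks like with the options passed above (sublinear term frequency, smoothed idf, and scikit-learn's default L2 normalization of each row). Here tf(t, d) is the number of times term t appears in page d, df(t) is the number of pages containing t, and n is the total number of pages. Treat this as a summary of the scikit-learn documentation rather than something derived in this post:

```latex
\begin{align*}
\mathrm{tf}'(t,d) &= 1 + \ln \mathrm{tf}(t,d) \qquad \text{for } \mathrm{tf}(t,d) > 0, \\
\mathrm{idf}(t)   &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \\
w(t,d)            &= \frac{\mathrm{tf}'(t,d)\,\mathrm{idf}(t)}
                          {\sqrt{\sum_{t'} \bigl(\mathrm{tf}'(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}.
\end{align*}
```

A word that appears on nearly every page has df(t) close to n, so its idf is close to 1, while a word that appears on only a handful of pages gets a noticeably larger weight.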


2. Running Feature Selection

      Now we have a big matrix (in this case, around 10000 x 160000) where each row is a webpage, and each column represents a specific word (feature). Most of these words, however, are not particularly informative. We can improve our model by eliminating those words that are least informative and are likely to lead to overfitting (A really great reference for this step is here).

      There are a number of ways to evaluate the 'usefulness' of a feature -- we're going to use a simple one, the Chi-Squared score. A Chi-Squared test is typically used to evaluate the independence of events. For example, if we have some feature where P(evergreen | feature) = P(evergreen), then we know that the feature is not particularly informative, whereas if P(evergreen | feature) is very different from the base probability P(evergreen), then the feature is very informative. Chi-Squared feature selection will help us to identify the most informative features. This is also very easily implemented using scikit-learn:

           from sklearn.feature_selection import SelectPercentile, chi2
           FS = SelectPercentile(score_func = chi2, percentile = k)

In order to figure out which percentage of features to keep, we will use a CV-loop in the next step.
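To make the Chi-Squared score a bit more concrete, here is a small sketch of the textbook statistic for a single word, computed from a 2x2 table of (word present / absent) against (evergreen / ephemeral). The counts are made up, and this is the standard Pearson form rather than a reproduction of scikit-learn's internal `chi2` computation, which works from feature values summed per class.

```python
def chi_squared_2x2(present_pos, present_neg, absent_pos, absent_neg):
    """Pearson chi-squared statistic for a 2x2 table of feature vs. class.

    Assumes every row and column total is non-zero.
    """
    observed = [[present_pos, present_neg], [absent_pos, absent_neg]]
    total = present_pos + present_neg + absent_pos + absent_neg
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if the word and the class were independent.
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical page counts: (word present/absent) x (evergreen/ephemeral).
uninformative = chi_squared_2x2(20, 20, 30, 30)  # P(evergreen | word) = P(evergreen)
informative = chi_squared_2x2(30, 10, 20, 40)    # word strongly favours evergreen
```

The first word scores exactly zero because seeing it tells us nothing about the class, while the second scores about 16.7, so it would be kept ahead of the first.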


3. Training a Classifier

      Now we're ready to train the classifier. There are many different options here, but we're going to start off using a Multinomial Naive-Bayes classifier. There are going to be two main parameters in our model: First, we have to decide which percentage of features to keep in the feature selection step (k). Second, we have to decide how much smoothing to use in our Naive-Bayes classifier (alpha). To do this, we'll run a 5-fold cross-validation loop over a range of values of k and alpha, and select the pair that gives the best score. Scikit-learn provides an easy way to perform cross-validation:

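           from sklearn.model_selection import StratifiedKFold

           # Newer scikit-learn API; older releases used StratifiedKFold(trainlabels, n_folds = 5)
           kf = StratifiedKFold(n_splits = 5)
           for train, cv in kf.split(trainset, trainlabels):
                X_train, X_cv = trainset[train], trainset[cv]
                y_train, y_cv = trainlabels[train], trainlabels[cv]
                ## Train Model and Calculate AUC Score for this Fold, for Each k, Alpha ##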
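One way the body of that loop could be filled in is sketched below. This is not the contents of ModelNB.py: the data is a synthetic matrix of non-negative counts standing in for the Tf-Idf matrix, the grids of k and alpha are illustrative guesses, and the helper name `mean_cv_auc` is made up for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative "word count" data standing in for the Tf-Idf matrix.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=300)
counts = rng.poisson(1.0, size=(300, 40)).astype(float)
counts[:, :5] += 2.0 * labels[:, None]  # the first five columns carry the signal

def mean_cv_auc(k, alpha, n_splits=5):
    """Mean AUC over the CV folds for one (k, alpha) pair."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    aucs = []
    for train, cv in folds.split(counts, labels):
        # Fit the feature selector on the training fold only, to avoid leakage.
        selector = SelectPercentile(score_func=chi2, percentile=k)
        X_train = selector.fit_transform(counts[train], labels[train])
        X_cv = selector.transform(counts[cv])
        model = MultinomialNB(alpha=alpha).fit(X_train, labels[train])
        aucs.append(roc_auc_score(labels[cv], model.predict_proba(X_cv)[:, 1]))
    return float(np.mean(aucs))

# Illustrative grids; the values actually searched are not given in the post.
grid = [(k, alpha) for k in (10, 25, 50, 100) for alpha in (0.01, 0.1, 1.0)]
scores = {params: mean_cv_auc(*params) for params in grid}
best_k, best_alpha = max(scores, key=scores.get)
```

Note that the feature selection is refit inside every fold. Selecting features once on the full training set and then cross-validating would let information from the held-out fold leak into the model and make the CV score look better than it should.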

After we do this, we can use our best k and best alpha to fit the classifier to the full training set, and then apply it to the test set. And we're done! This particular model gives a score of about 0.869; however, when combined in an ensemble with other models, I've been able to get up to about 0.883. The next step is to incorporate some of the non-text features into our model and see how much higher we can get!


Source Code: ModelNB.py
     In Part I, we laid out our plan for classifying populations that have undergone significant amounts of selection. Our plan is to simulate a very large number of populations experiencing a wide variety of different demographic scenarios. We'll then train a classifier on these simulated populations, in hopes of distinguishing between populations with or without selection. We laid out a number of potential pitfalls and difficulties. In Part II, we carried out this plan for the simpler case when the population size remained constant. Now, we want to attempt the full project.

     First, we generate the training and test sets. We're going to start out with about 30,000 populations (we'll add more later). For now, we have the following scenarios:

  • 1. Constant Populations
  • 2. Exponentially Growing Populations
  • 3. Logistically Growing Populations
  • 4. Bottleneck Populations
  • 5. "Alike" Populations

     For "Alike" populations, we use our selected populations to infer the time-varying population size that would be the most consistent with the population, if we assumed there was no selection. In other words, we choose the population size that looks as close as possible to what we see. Check out the research section for more information about how we get this.

     We'll now use these 30,000 populations to train our classifier. For now, we'll just use logistic regression, and make a histogram of our results:

Histogram of the logistic regression prediction for neutral
populations (blue) and selected populations (red).

     This doesn't do nearly as well as the simple case did earlier. The distribution of outputs for the neutral populations is far more spread out, which is potentially very problematic if we try to generalize to more complicated scenarios. It is also the case that certain demographic scenarios perform much better than others -- this isn't very confidence inspiring, since it implies that the model may only be picking out differences between these specific neutral populations and the selected ones, as opposed to a general signal of selection, which is what we're hoping for.

     What do we do next then? Well, at the moment, we're only using a very tiny portion of the total information we have about the populations. We're only using the site-frequency spectrum, which summarizes all of the data into just 50 numbers. However, in practice, we actually have the complete distribution of mutations in all individuals throughout the whole genome. Thus, the next step is to come up with better ways to use this data.
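For readers who haven't met the site-frequency spectrum before, here is a minimal sketch of how it can be computed from a sample. The 0/1 matrix is a made-up toy sample (rows are individuals, columns are sites), and the function name is hypothetical; it is only meant to show how much of the raw data gets collapsed into a short list of counts.

```python
def site_frequency_spectrum(haplotypes):
    """Count sites by how many sampled individuals carry the mutation.

    `haplotypes` is a list of equal-length 0/1 lists, one per individual.
    Entry i-1 of the result is the number of sites mutated in exactly i
    individuals, for i = 1 .. n-1 (sites carried by 0 or all n are skipped).
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):
        carriers = sum(site)
        if 0 < carriers < n:
            sfs[carriers - 1] += 1
    return sfs

# Toy sample: 4 individuals, 5 sites (hypothetical data).
sample = [
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
sfs = site_frequency_spectrum(sample)  # [2, 1, 0]
```

Everything about where the mutations sit along the genome, and which individuals share them, is thrown away in this summary, which is exactly the information the next step tries to recover.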

     There are many potential candidates: we can look at the distance between neighboring mutations, or the clustering of common/rare mutations. We can look at statistics like Tajima's D in sliding windows along the genome, etc.

     Many of these statistics show only small differences between neutral and selected populations; however, taken together, they could provide the key for distinguishing them. One such example is plotted below: this is a histogram of the difference between the distance to the nearest mutation from a mutation that appears in only one individual vs. a mutation that appears in an intermediate number of individuals. In other words, it measures the typical clustering of rare vs. common mutations:

Histogram of the Average Distance to the Nearest Mutation
for a Singleton vs. an Intermediate
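A statistic of this kind could be computed along the following lines. The positions and carrier counts are made up, and the frequency band used to define an "intermediate" mutation (25% to 75% of the sample) is an assumption for this sketch, since the exact cutoffs aren't stated here.

```python
def nearest_distances(positions):
    """Distance from each mutation to its nearest neighbour.

    `positions` must be sorted and contain at least two mutations.
    """
    distances = []
    for i, pos in enumerate(positions):
        gaps = []
        if i > 0:
            gaps.append(pos - positions[i - 1])
        if i < len(positions) - 1:
            gaps.append(positions[i + 1] - pos)
        distances.append(min(gaps))
    return distances

def singleton_vs_intermediate(positions, counts, n, low=0.25, high=0.75):
    """Mean nearest-neighbour distance for singletons minus intermediates.

    `counts[i]` is the number of individuals (out of n) carrying mutation i.
    The (low, high) band defining "intermediate" is an assumed choice.
    """
    distances = nearest_distances(positions)
    single = [d for d, c in zip(distances, counts) if c == 1]
    middle = [d for d, c in zip(distances, counts) if low <= c / n <= high]
    return sum(single) / len(single) - sum(middle) / len(middle)

# Hypothetical mutation positions (sorted) and carrier counts for n = 10.
positions = [100, 110, 400, 900, 950, 2000]
counts = [1, 1, 5, 4, 1, 6]
statistic = singleton_vs_intermediate(positions, counts, n=10)
```

In this toy sample the singletons sit much closer to their neighbours than the intermediate-frequency mutations do, so the statistic comes out negative. One such number per population then becomes an extra feature for the classifier.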

     Although the distributions are fairly similar, there is a noticeable shift between the two. By including many such statistics, we can potentially improve our model and hopefully, figure out how to classify populations that have undergone selection. Stay tuned for Part IV in the future, in which we (hopefully) discover a model that solves everything!
     In Part I, we laid out our plan for classifying populations based upon whether they experienced significant amounts of selection. Our goal is to simulate a very large set of populations under a variety of demographic scenarios and use machine learning algorithms to classify them.

     However, before we move on to the full-scale question, we can first look at a simpler question: distinguishing constant-sized populations with selection from those without. This will be very helpful, as we already know how to do this using simple statistics such as Tajima's D, and will ensure that we're on the right track.

     First, we simulate a bunch of populations with a range of population sizes and mutation rates, some with selection and some without. In this case, we have around 1000 neutral populations and around 1000 selected populations. We've divided this set of populations 50/50 into a training set and a test set. The first thing we'll do is look at a simple statistic that is commonly used to detect selection: Fu and Li's D. As we saw in Part I, constant neutral populations will tend to have a Fu and Li's D near zero, while selected populations will often have a significantly negative value. If we calculate this statistic for our test set and make a histogram we find:

Histogram of Fu and Li's D for neutral populations (blue)
and selected populations (red).

     We see that the neutral populations are distributed around zero, while the selected populations are skewed towards more-negative values of Fu and Li's D. (Note: The actual shape of the distribution depends strongly upon the specific range of parameters we have chosen -- if we preferentially choose very small or very large selection coefficients, the distribution will be closer to neutral).

     In order to compare the predictions using Fu and Li's D with subsequent models, we can compare the false positive and false negative rates at different cutoff points. In general, there is a tradeoff between these two rates (precision-recall tradeoff). For the cutoff point shown above, there is a 1% false positive rate and a 9.3% false negative rate.
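As a small sketch of how these two rates come out of a single cutoff: a population is called "selected" when its statistic falls below the cutoff, and the two error rates are then just fractions of each group on the wrong side. The values below are made up for illustration and are not the simulated populations from this post.

```python
def error_rates(neutral_values, selected_values, cutoff):
    """Call a population 'selected' when its statistic falls below `cutoff`.

    Returns (false positive rate, false negative rate).
    """
    false_pos = sum(1 for v in neutral_values if v < cutoff)
    false_neg = sum(1 for v in selected_values if v >= cutoff)
    return false_pos / len(neutral_values), false_neg / len(selected_values)

# Hypothetical statistic values for a handful of populations.
neutral = [0.4, -0.2, 0.1, -1.6, 0.8]
selected = [-2.1, -1.8, -0.3, -2.5, -1.9]

strict = error_rates(neutral, selected, cutoff=-2.0)   # (0.0, 0.6)
loose = error_rates(neutral, selected, cutoff=-0.25)   # (0.2, 0.0)
```

Moving the cutoff from -2.0 to -0.25 removes all of the false negatives but starts mislabeling neutral populations, which is the tradeoff described above.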

     Now, we can attempt to perform the same analysis using machine learning techniques. Instead of relying on our understanding of population genetics, we will simply feed features from our training set into a classification algorithm, and see how well this algorithm performs on the same test set as above. To begin, we will use only two features: the number of segregating sites (total number of sites that have a mutation) and the number of singletons (number of sites that have a mutation in exactly one individual). These are the same two features that Fu and Li's D uses. We'll then train a simple logistic regression classifier on the training set, and apply this to the test set. Doing so, we find:

Histogram of the logistic regression prediction for neutral
populations (blue) and selected populations (red).

     We see that the vast majority of neutral populations are clustered around zero, while the vast majority of selected populations are clustered around one. For the cutoff point shown in the histogram, there is a 1% false positive rate and a 10% false negative rate. Thus, the logistic regression algorithm performs about as well as the summary statistic approach.

     As a side note, one thing that is always very helpful is to look at when our model is getting it wrong. We mentioned previously that in certain parameter regimes, selected populations appear neutral. In the case of very strong purifying (negative) selection, when Ne s >> 1, selection does not lead to a significant deviation in the relative branch lengths. Thus, we expect that as Ne s becomes large, our prediction should be less and less accurate. Plotting our prediction as a function of Ne s:

Plot of our prediction vs. Log(Ne s) for negatively selected populations.

     This plot confirms our expectations: as Ne s becomes larger, populations begin to appear more and more neutral, such that our prediction falls off. This is consistent with what we expect, and provides some confirmation that the model is working as expected.

     So far, we have shown that we can recreate a simple model using the same features as Fu and Li's D. However, the main advantage of using these techniques is that we are not restricted to models that we can predict analytically: instead, we can incorporate much more complicated features into our model. Thus, we want to repeat our analysis using additional features. In this case, we'll now include the complete site frequency spectrum. Doing so, we now get:

Histogram of the logistic regression prediction for neutral
populations (blue) and selected populations (red).

     Now, the cutoff point gives a 1% false positive rate and a 7.4% false negative rate. Thus, we have significantly improved our model by incorporating additional information. So far, everything is working as expected. However, we now want to move on to the much more complicated situation in which populations may also experience arbitrary demographic changes.
      In the past few decades, as the cost of DNA sequencing has decreased dramatically, an enormous amount of sequence data has emerged. A central goal in biology is to extract as much information about the evolutionary process as we can from this data. Already, there have been tremendous advances as a consequence of sequence data: we have improved our understanding of the relatedness between species, the patterns of migration and population structure in humans, the evolution of antibiotic resistance, the evolution of disease, and far more.

     One interesting question that is often asked is whether a population experienced significant amounts of selection. Several different "tests of neutrality" have been developed to answer this question. Early tests of neutrality, such as Tajima's D, relied on the fact that under a neutral model, different summary statistics describing a sample are unbiased estimators of a single quantity, Θ. Thus, if a population is evolving under neutrality, the estimated values of Θ from different estimators should be consistent with one another. Tajima's D and other statistics provide a mathematical framework to evaluate whether there is a statistically significant deviation from the null hypothesis.

     When a population is under directional selection, less-fit individuals tend to be out-competed in the population, which leads to an overall reduction in diversity, and an excess of rare mutations relative to more common ones. This leads to negative values for Tajima's D (for more detailed information about this and other tests of neutrality, I find the online lecture notes by Prof. Wolfgang Stephan to be very helpful). Thus, negative values of Tajima's D can sometimes be indicative of directional selection. However, it is important to note that this is only one possible explanation -- this can also be caused by simple demographic effects, such as population expansion.
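To make this concrete, here is a sketch of Tajima's D computed from a site-frequency spectrum, using the standard constants from Tajima's original definition. The two example spectra are made up, and returning 0.0 when there are no segregating sites is a convenience chosen for this sketch (the statistic is undefined in that case).

```python
import math

def tajimas_d(sfs):
    """Tajima's D from an unfolded site-frequency spectrum.

    `sfs[i-1]` is the number of sites mutated in exactly i of n samples,
    so len(sfs) == n - 1.
    """
    n = len(sfs) + 1
    seg_sites = sum(sfs)
    if seg_sites == 0:
        return 0.0  # undefined without segregating sites; 0.0 is a convention here
    # Mean number of pairwise differences (pi), one estimator of theta.
    pairs = n * (n - 1) / 2
    pi = sum(count * i * (n - i) for i, count in enumerate(sfs, start=1)) / pairs
    # Watterson's estimator, a second estimator of theta.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = seg_sites / a1
    # Constants for the variance of (pi - theta_w).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_w) / math.sqrt(variance)

n = 10
neutral_like = [60 / i for i in range(1, n)]  # expected neutral shape, theta / i
rare_excess = [40] + [1] * (n - 2)            # many singletons, few common variants
```

For the neutral-shaped spectrum the two estimators of Θ agree and D is zero up to rounding, while the spectrum with an excess of singletons gives a negative D, the pattern described above for both directional selection and population expansion.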

     To illustrate this, consider two populations: One population is under selection but has a constant population size, while the other is neutral but growing exponentially. If we calculate Tajima's D for these two populations, or another feature often used for detecting selection/growth, a Bayesian skyline plot, we see very similar signals for each:

Common features used to identify pop. growth / selection for two populations, one experiencing
exponential growth but no selection, the other experiencing selection but no growth.

     Thus, we see that the signal for selection can look very similar to that for population growth. An open question is whether we can develop more sophisticated methods to distinguish between these scenarios. Currently, our analytical understanding of these effects is still incomplete, making it very difficult to develop statistical tests incorporating both effects.

     This leads to my current research idea: instead of relying on deriving a statistical test from first-principles, can we adopt traditional machine learning approaches to address this question? The plan is to generate a very large set of simulated populations, evolving under a wide variety of demographic scenarios, with and without selection. We'll then use machine learning algorithms to classify populations that have undergone selection vs. those that have not.

     There are a number of potential problems with this idea: First, although our goal is to differentiate any demographic model from selection, we will only be able to show that we can differentiate the specific demographic scenarios that we have tested from selection. Thus, it is not at all clear that the results will generalize to cover the complete space of all demographic models. We will have to be very careful in our interpretation of the results and in our selection of appropriate training / test sets.

     A second major problem is the following: Although all neutral populations should 'look' like neutral populations, selected populations do not always have a noticeable signature of selection. In fact, there are many regimes of selection under which we do not expect to see any signal at all -- for example, in the very strong background selection regime. Thus, although we expect the false positive rate to be very low (i.e. neutral populations should never be labeled as selected), there are many cases in which we expect a significant number of false negatives (i.e. selected populations can be labeled as neutral). Furthermore, we can make this false negative rate arbitrarily high or low simply by including more or fewer populations that are in this region. Thus, there are a number of potential caveats that have to be addressed.