And other spurious correlations...


One of the true joys of being a data scientist is digging into a new data set -- exploring a new field, figuring out how different things interact and discovering correlations. Each field has its own unique quirks -- different factors that end up having enormous influence on what you see in the data. And there’s one particularly enjoyable way to learn about these quirks: making the most absurd conclusions you possibly can.

Today, we’ll forget for a moment that correlation doesn’t imply causation, and discover some of the most baffling things that affect how difficult math is.


Disclaimer: None of the things I’m about to say are truly causal -- all of these statements are merely a result of confounding factors and spurious correlations -- studying math on rainy days is excellent for you, I promise.



--------------------------------------------------------------------------------------------------------------------------


We all know that rainy and cold days feel dreary, dark, and more frustrating. But did you know that math is actually more difficult the colder it gets? Yep:



[Figure: Slide1.png (accuracy on the coldest vs. warmest days)]




If you look at accuracy across all math problems on Khan Academy, you’ll see that accuracy is almost 5% lower on the coldest days than the warmest days. This is a mind-bogglingly huge effect. Why does it happen? Is math really more difficult when it’s cold?


Of course not. What we’re really seeing is that seasonality has a huge effect on who is doing math problems. If we look at accuracy throughout the year, we see:



[Figure: Slide5.png (accuracy throughout the year)]



The reason for these huge shifts is that there are many different motivations for using Khan Academy. Some folks use Khan Academy for their own enrichment -- enthusiastic about learning new things and reviewing things they've learned in the past -- and these users are likely to stay active on Khan Academy throughout the entire year, including the summer and the holidays. A less motivated user, however, may be less inclined to stay active when they're not currently in school.


Here’s another fun fact: did you know that people are noticeably more accurate during football games? Afternoons during which there is a nationally-televised NFL game have an almost 1.5% higher accuracy rate:



[Figure: Slide4.png (accuracy during NFL-game afternoons vs. other afternoons)]
Fun Fact: If you zoom in far enough, all two-bar plots look extremely impressive.

Of course, as before, this is just because afternoon NFL games are all on Sunday (or Saturday in January!), and accuracy is far higher on the weekends than on weekdays:



[Figure: Slide2.png (accuracy on weekends vs. weekdays)]



Similarly, users are more accurate during baseball games than basketball games (summer vs. winter), ice cream is absolutely awesome for your math abilities, ice skating is disastrous, and holidays are fantastic.


This ends up having significant implications for data science -- it’s very easy to reach highly misleading conclusions whenever you do anything that involves time. Testing out a new feature that affects more and less engaged users differently can produce wildly different results depending upon the time of day, time of week, or even time of year that you launch it.

This might be obvious in any field when you launch something around the holidays or late at night, but for education in particular, the timing of back-to-school and school breaks is hugely important.


--------------------------------------------------------------------------------------------------------------------------


Quick question: What age group do you think is the most accurate on Khan Academy? The answer is 97-year-olds. In fact, 97-year-olds tend to answer over 85% of questions correctly, which is vastly higher than the average accuracy.


Why is this? It’s the same reason that the ‘best’ and ‘worst’ states in the U.S. are also the smallest ones -- smaller sample sizes have far higher variance. Only 17 users claim to be 97 years old, while younger ages typically have hundreds of thousands of users each. Thus, while younger ages tend to be very close to the overall average, higher ages can vary wildly. Incidentally, the least accurate users are 99-year-olds.
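
If you want to convince yourself of just how strong this effect is, a quick simulation does the trick. Here's a minimal sketch -- the 75% baseline accuracy and the group sizes are made up for illustration, not taken from the real data:

           import numpy as np

           rng = np.random.default_rng(0)
           true_accuracy = 0.75        # hypothetical baseline accuracy, identical for every group
           group_sizes = [17, 100000]  # a tiny age group vs. a typical one

           for n in group_sizes:
               # Simulate 1,000 groups of n users, each answering one question correctly
               # with probability true_accuracy, and look at the spread of the observed
               # group-level accuracy.
               observed = rng.binomial(n, true_accuracy, size=1000) / n
               print(f"n={n:>6}: min={observed.min():.3f}, max={observed.max():.3f}")

           # The n=17 groups swing wildly around 0.75, while the n=100,000 groups barely
           # budge -- the 'best' and 'worst' groups are almost always the small ones.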


Another fun question: Which city has the highest mission completion rate in the world? You’re probably thinking this is another sample size trick, so let’s change it up slightly and ask: of cities with at least 100 purported users, which city has the highest mission completion rate?  


That would be Antarctic Great Wall Station, Antarctica. The average user from Antarctica has completed a staggering 2.3 entire missions.



[Figure: BlogPost1.jpg]



What’s causing this? Well, we’re all liars. When you select a city on Khan Academy, you choose from a dropdown menu of real cities -- so if you want to pick something ‘fun’, your options are somewhat limited. Antarctica is a pretty great choice.


In fact, 132 users claim to be from Antarctic Great Wall Station, Antarctica, which is pretty interesting when you consider that the fount of all true knowledge, Wikipedia, claims that the summer population is only 40 (winter: 14).


Users who choose this location also happen to be far more engaged, and far more accurate, than the average user. Other cities come pretty close: Nowhere Else, Tasmania, Australia is strangely popular too. In fact, since selecting a city is purely optional (and requires deliberately editing your profile), merely choosing one at all makes you far more accurate.


In conclusion,



  1. Calculus is impossible on rainy days.
  2. Watching football makes you far more accurate.
  3. Antarcticans are math experts.
  4. 97-year-olds are excellent at math; 99-year-olds, not as much.

Have any good spurious correlations you’d like to share, or curious about this data and how it was collected? Leave a comment below! 


See laurennicolaisen.github.io! (Still a work in progress)

KhanCraft was built as part of Hackweek at Khan Academy -- the basic idea is to provide students with an opportunity to build something awesome using their math skills. Each of the different blocks corresponds with a different skill in a grade-level mission, and each of the different designs corresponds with a sub-section of the mission. 

Thus, for example, if you are in 3rd Grade and you complete the first design, you'll have completed the entire Multiplication and Division topic. If you complete the final spaceship design, you'll have completed the entire mission!

This is definitely a long way from complete, and I'd love to hear any ideas for improving it in the future. If you have any suggestions / comments, or if you want to submit any designs, e-mail me at lauren.nicolaisen@gmail.com!

Curious about the designs? Although any design will include a variety of problems from throughout the mission, each design focuses on a particular topic, listed below. Note that this only includes a portion of the total exercises available on Khan Academy.

3rd Grade
      • Penguin: Multiplication and Division
      • Puppy/Tree: Fractions
      • Halloween: Measurement and Geometry
      • Robot: Addition and Subtraction (+ Above)
      • Underwater: Expressions and Patterns (+ Above)
      • Spaceships: All of the Above

4th Grade
      • Penguin: Multiplication and Division
      • Puppy/Tree: Measurement and Data
      • Halloween: Fractions
      • Robot: Geometry (+ Above)
      • Underwater: Factors, Multiples, and Patterns (+ Above)
      • Spaceships: All of the Above

5th Grade
      • Penguin: Place Value and Decimals
      • Puppy/Tree: Fractions
      • Halloween: Measurement and Data
      • Robot: Geometry (+ Above)
      • Underwater: Algebraic Thinking (+ Above)
      • Spaceships: All of the Above

6th Grade
      • Penguin: Geometry
      • Puppy/Tree: Data and Statistics
      • Halloween: Ratios, Rates, and Percentages
      • Robot: Negative Numbers (+ Above)
      • Underwater: Variables and Expressions (+ Above)
      • Spaceships: All of the Above

7th Grade
      • Penguin: Statistics and Probability
      • Puppy/Tree: Geometry
      • Halloween: Variables and Expressions
      • Robot: Negative Numbers (+ Above)
      • Underwater: Rates and Proportional Relationships (+ Above)
      • Spaceships: All of the Above

8th Grade
      • Penguin: Numbers and Equations
      • Puppy/Tree: Geometry
      • Halloween: Systems of Equations
      • Robot: Data and Modeling (+ Above)
      • Underwater: Relationships and Functions (+ Above)
      • Spaceships: All of the Above

Much of this project was based on an earlier Hackathon project, KhanQuest, built by Charles Marsh, Joel Burget, Zach Gotsch, Desmond Branch, Aria Toole, and Michelle Todd. Their project is awesome, and they provided a ton of help in getting this all set up. I highly recommend their very awesome blog posts here and here.


(This was originally posted as a guest post at the very awesome Insight Data Science blog. I highly recommend checking it out!)

It's more than a little frightening to make any major career transition in life, and switching from academia into industry is certainly one of the scarier ones. For me, graduate school, while full of frustrations and stressful moments, felt surprisingly safe. In academia, it's usually very clear what is expected of you, there's no real risk of being fired or of your company not succeeding, and you're typically free to set your own schedule (albeit a long and not necessarily ideal one). In its own way, graduate school can feel almost comfortable, while switching careers into unknown territory can be very intimidating.


On the other hand, there are a number of real upsides to making the transition. Data science has the potential to be an extremely rewarding career, with the possibility of making real, day-to-day impacts on the lives of hundreds, thousands, even millions of people. There's nothing more satisfying than building some app, or website, or policy, or program, that you can point to and say, "I made that happen!" And you can do this, not on a five, six, seven year time-scale, but in weeks or even days (or hours!).


The hardest part is taking that leap of faith necessary to get your foot in the door. Although applying to graduate school was very stressful, it was also fairly objective. Getting the perfect job in industry is a completely different story - good grades, test scores, and research results certainly matter, but much of the decision will come down to how you present yourself in your resume and in interviews.


Figuring out how to navigate through the transition process is very tricky, so where do you start? The first (and probably the trickiest) step is to learn what you need to be focusing on in the first place. For example, I had a fairly strong background in programming prior to Insight, but I learned all of it on my own through many different graduate school projects. In order to show that I had the right skills in interviews, I had to go back and learn many of the CS fundamentals that come with having studied them in a university setting: algorithms, dynamic programming, etc. Similarly, I had lots of experience with using data, making data usable in the first place, running statistical tests, and so on, but I had no way of knowing the right language to use in interviews or how to highlight the most important and relevant parts of work that I'd done.


For me, learning what to focus on in order to be successful was by far the most helpful benefit of participating in the Insight program. Surrounding myself with people who had already been through the process, and with people going through it alongside me, was crucial to my confidence in my decision. If you're considering making the transition (and it really is an awesome one), I highly encourage you to seek out as many folks as you can and ask a lot of questions.


Here are a few other questions I get asked a lot today, and my answers for those of you ready to take the first scary step:




What was your project at Insight? 


I had an absolute blast making my web-app, Pick Your Fiction. The idea behind the app is simple: sometimes, I want to read a book that features certain things - say, a fantasy novel with elves. But, there are many things I don't necessarily like in fantasy novels (gothic, angst-ridden, teenage vampires?). What I'd really like to be able to do is to say, "These are the things that make me happy, these are the things that make me sad, find me a book tailored to those preferences."





With this in mind, I developed Pick Your Fiction. Here, you can enter the title of a book you love (or leave it blank), along with things that you like and/or don't like in books, and it will recommend a book based on your input. For example, suppose I want a book about elves:






However, what if I want a book that doesn't have dragons in it?






There are also a few additional options - you can control how important it is to you that the books are popular (for example, if you've read all the Harry Potter books, you may also have read other very popular young adult fantasy novels, and might want something less well known), or that the books are of the same genre as your original title (i.e., say you want an adult version of a children's story, or a science fiction version of a historical novel). The app will also provide a few suggestions for you based upon features that tend to be divisive within a subject.


How does all of this work? The app primarily relies upon a very large number of customer reviews scraped from a well-known website, as well as the description/summary provided for each book. Together, these reviews make up a sizable corpus of words for each book. The app starts off by computing the similarity between books in Python, using nltk for tokenizing + stemming, followed by TfIdf weighting + cosine similarity. This similarity metric is then weighted based upon the similarity in genre and the popularity of the book, where each of these weights is controlled by the user.
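
To make that pipeline a little more concrete, here's a minimal sketch of the core similarity step, assuming each book's reviews and description have already been collapsed into a single blob of text (the book_texts dictionary below is a made-up stand-in for the real database, and the lemmatization step is omitted):

           from sklearn.feature_extraction.text import TfidfVectorizer
           from sklearn.metrics.pairwise import cosine_similarity

           # Hypothetical input: one concatenated blob of reviews + description per book.
           book_texts = {
               "Book A": "elves quest forest magic ancient kingdom",
               "Book B": "vampires romance brooding high school angst",
               "Book C": "elves dragons war siege kingdom magic",
           }

           titles = list(book_texts.keys())
           vect = TfidfVectorizer(stop_words="english")
           tfidf = vect.fit_transform([book_texts[t] for t in titles])

           # Cosine similarity between every pair of books.
           sims = cosine_similarity(tfidf)
           print(dict(zip(titles, sims[titles.index("Book A")])))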


The final scores are then boosted based upon the frequencies of the user-added features relative to their baseline frequencies. Finally, the additional suggestions are chosen based upon those words that appear in as close to 50% of the closest-matching books as possible, but where this value is high or low relative to the typical frequency. All of the information is stored in a MySQL database, and the front-end is built using Flask + Bootstrap + jQuery, and hosted on AWS.
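
The exact boosting formula isn't spelled out here, but the flavor is something like the toy calculation below -- made-up numbers and a made-up weighting, just to illustrate the idea of rewarding over-represented 'liked' words and penalizing over-represented 'disliked' ones:

           # Toy illustration of frequency-based boosting (not the app's actual formula).
           similarity = 0.42                      # base similarity score for one candidate book
           liked = {"elves": (0.018, 0.004)}      # word -> (frequency in this book, baseline frequency)
           disliked = {"dragons": (0.001, 0.006)}

           score = similarity
           for word, (freq, baseline) in liked.items():
               score *= 1.0 + (freq - baseline) / baseline                    # reward over-represented likes
           for word, (freq, baseline) in disliked.items():
               score *= 1.0 / (1.0 + max(freq - baseline, 0.0) / baseline)    # penalize over-represented dislikes

           print(round(score, 3))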




How did you pick your project? 


This is a really tricky question, and was actually one of the hardest parts of the program for me. There is a huge amount of freedom in deciding what to work on, and there's essentially no limit to what you can do. There are, however, a few key things to think about when deciding. Obviously, the project has to be doable (and doable within a 2-3 week timespan!), and, ideally, it should involve using techniques that highlight your ability to work with data. Projects that involve cleaning data and using interesting techniques/software are good, since interviews will often involve a significant amount of time spent talking about your project. In my case, the project was heavy on natural language processing and I ended up talking about that quite a bit.


The most important thing, however, is to make sure that it's a project that you're enthusiastic about and will enjoy working on. The reason for this is that you're going to spend an inordinate amount of time talking about your project, and it shows right away if it's a project you care strongly about. In my case, Pick Your Fiction was an app I was extremely excited about - I loved testing out different people's requests, talking about all of the many issues I faced along the way, discussing what I would do if I were to try and monetize and grow the app, etc. I was always extremely happy when interviewers asked about my app, because I loved talking about it, and I genuinely thought it was a useful/interesting thing.




A few weeks ago, John Joo posted an excellent list of things to do to prepare for Insight in the weeks/months leading up to the program. However, what if you're still several years out? What can you do throughout your Ph.D. to prepare for a future career in data science? 


The best thing you can do for yourself, by far, is to work on some pet project that involves data. It doesn't have to be anything mind-blowing or enormous, just something that you can point to as evidence of your skills in data science, and also your interest and enthusiasm for the data science industry. As I mentioned in the previous question, the single best thing you can do to convince someone you're capable of being a great data scientist is to show them with something tangible. And perhaps even more importantly, steering interviews to a project that you've done yourself means you'll know all the answers to the questions they're likely to ask!


There are a bunch of other ways to get involved in data science as well. A big one for me was the data science competitions over at kaggle.com. This was another case of having a massive community from which you can learn how to do things. I started off with the basic tutorial competitions, which come with a ton of different sample code snippets, and from there started to develop a better intuition for which techniques worked and which didn't. (These projects are also great things to talk about in interviews, and they typically provide good data sets to play around with).


When you're ready to start making the transition to a career in data science (or any new career for that matter), your first step should be to reach out to people who have been in your position before and ask them about their experiences directly. For me, I talked with many different alumni from the Insight program, as well as friends from my graduate program who had graduated prior to me, and other people in the alumni network. One of the biggest benefits of doing this is simply the exposure to the field - there were a huge number of different phrases and keywords that I'd heard for the very first time in my first week of Insight, that are now just everyday occurrences.  Knowing that many of my colleagues and peers had made the same transition made my decision that much less scary, and helped steer me towards Insight and my new role at Khan Academy. Good luck taking that first step!
     A few weeks ago I talked about my first experience with a data science competition on Kaggle.com, the StumbleUpon Evergreen Classifier Challenge. This competition officially ended last night and I was able to manage 16th place (out of ~600)! I learned a ton from reading other folks' blog posts and the forums, and thought I'd do my part and share my final solution as well:

     In the end, I used an ensemble of six different types of model. I used only the boilerplate text, with a TfIdf vectorizer and various types of lemmatization/stemming. Along the way, I added a couple of extra features, including a part-of-speech tagger and a genre tagger, both using nltk, but neither of these ended up improving the final score. Two things that did help were the domain of the URL (i.e. www.bleacherreport.com) and separately using only the first 5/10/15/20 words in the boilerplate. The final ensemble included the following (a sketch of one such base model follows the list):


  1. Logistic Regression with Word 1,2-Grams
  2. Logistic Regression with Character 3-Grams
  3. Multinomial Naive-Bayes with Word 1,2-Grams and chi2 Feature Selection
  4. Random Forest with Character 1,2-Grams and chi2 Feature Selection
  5. Logistic Regression with the URL Domain
  6. Logistic Regression with the First 5/10/15/20 Words
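
     For concreteness, a single base model from this list might look roughly like the snippet below. This is just a sketch, not the actual competition script: train_text, train_labels, and test_text are placeholders for the boilerplate text and labels after loading.

           from sklearn.feature_extraction.text import TfidfVectorizer
           from sklearn.linear_model import LogisticRegression
           from sklearn.pipeline import make_pipeline

           # Base model 2: logistic regression on character 3-grams of the boilerplate text.
           char_lr = make_pipeline(
               TfidfVectorizer(analyzer="char", ngram_range=(3, 3), sublinear_tf=True),
               LogisticRegression(),
           )
           char_lr.fit(train_text, train_labels)
           test_pred = char_lr.predict_proba(test_text)[:, 1]   # probability of "evergreen"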

     In order to combine all of the models together, I started out by splitting the training set into a 5-fold CV loop. Within each fold, I trained each of the models separately (thereby generating a prediction on 1/5 of the data, using 4/5 of the data). After the CV loop, this resulted in a prediction for each of the different models on the complete training set. Then, I used Ridge Regression to fit the combination of the predictions to the training set.

     Finally, I re-trained each of the different models on the complete training set and applied them to the test set. Then, I used the previously-trained Ridge model to combine the outputs on the test set. The best score I was able to achieve on the private leaderboard ended up being 0.88562.

      (Note: In order to cross-validate all of this, I originally had a much longer script that split the training set into a 5-fold CV loop, and performed the entire above routine using 4/5 as the training set and 1/5 as the test set. This gave a CV score of around 0.8835.)
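
     Put together, the stacking step looks roughly like the sketch below. It uses the newer scikit-learn model_selection API and treats models as a list of the six (unfitted) base models, assuming for simplicity that they all share a single feature matrix X (in reality each model had its own features); y is the training labels and X_test the test set:

           import numpy as np
           from sklearn.model_selection import StratifiedKFold
           from sklearn.linear_model import Ridge

           oof_preds = np.zeros((X.shape[0], len(models)))

           # Out-of-fold predictions: each base model predicts the 1/5 it never saw.
           skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
           for train_idx, cv_idx in skf.split(X, y):
               for j, model in enumerate(models):
                   model.fit(X[train_idx], y[train_idx])
                   oof_preds[cv_idx, j] = model.predict_proba(X[cv_idx])[:, 1]

           # Fit the ridge combiner on the out-of-fold predictions.
           combiner = Ridge(alpha=1.0).fit(oof_preds, y)

           # Refit every base model on the full training set, predict the test set,
           # and blend the outputs with the already-trained ridge weights.
           test_preds = np.column_stack(
               [m.fit(X, y).predict_proba(X_test)[:, 1] for m in models])
           final_prediction = combiner.predict(test_preds)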

     

Source Code: FinalModel.py
     One of the coolest parts about data science is getting to tackle brand-new questions you wouldn't ordinarily have come across. A great place to find such challenges is the data science competitions at Kaggle. A while back, I started playing around with some of their 'learning' competitions, which are a great place to start since they come with tons of tutorials and starter code. However, lately I've wanted to take on a full competition from scratch, which has brought me to an especially cool challenge, the StumbleUpon Evergreen Classifier Challenge.

     The goal of this competition is to classify a bunch of webpages as "evergreen" or "ephemeral". We're given quite a bit of information about each webpage, but the most useful information seems to come from the boilerplate text. Basically, each webpage has a url, a title, and a body (i.e. a bunch of text), and we want to build some sort of document classifier from this. Our plan is the following:


  1. Preprocess the Data
  2. Run Feature Selection
  3. Train a Classifier


1. Preprocessing the Data


     The first step is to clean up the data a bit. To start, we want to turn the text into a list of words. To do this, we're going to use the tokenizer in the nltk package, which will simply convert the text for each webpage into a list of words, i.e.:
     

'The cat swims while the cats swim'
becomes
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']

We also want to do something to recognize similar words (i.e. 'cats' vs. 'cat'). Fortunately, the nltk package also has a bunch of ways to do this, one of which is the WordNet Lemmatizer. For example:      
['The', 'cat', 'swims', 'while', 'the', 'cats', 'swim']
becomes
['The', 'cat', 'swim', 'while', 'the', 'cat', 'swim']

The scikit-learn example page provides a quick example for combining these two steps into the format we'll need later:

           from nltk import word_tokenize
           from nltk.stem import WordNetLemmatizer

           class LemmaTokenizer(object):
               def __init__(self):
                   self.wnl = WordNetLemmatizer()
               def __call__(self, doc):
                   return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

The next thing we want to do is use a vectorizer to put this in the format we need for our classifier. We're going to use a Tf-Idf Vectorizer, which will transform the list of words in each webpage into a matrix, where each element weights the number of occurrences of a word in a page by the inverse of its total number of occurrences across all pages. This weighting ensures that common words that appear everywhere don't swamp out rare words that appear preferentially in "ephemeral" or "evergreen" webpages. We can combine all the steps so far:

           from sklearn.feature_extraction.text import TfidfVectorizer

           vect = TfidfVectorizer(stop_words = 'english', min_df = 3, max_df = 1.0,
               strip_accents = 'unicode', analyzer = 'word', ngram_range = (1,2), use_idf = 1,
               smooth_idf = 1, sublinear_tf = 1, tokenizer = LemmaTokenizer())
           vect.fit_transform(trainset)


     

2. Running Feature Selection


      Now we have a big matrix (in this case, around 10000 x 160000) where each row is a webpage, and each column represents a specific word (feature). Most of these words, however, are not particularly informative. We can improve our model by eliminating those words that are least informative and are likely to lead to overfitting (A really great reference for this step is here).

      There are a number of ways to evaluate the 'usefulness' of a feature -- we're going to use a simple one, the Chi-Squared score. A Chi-Squared test is typically used to evaluate the independence of events. For example, if we have some feature where P(evergreen | feature) = P(evergreen), then we know that the feature is not particularly informative, whereas if P(evergreen | feature) is very different from the base probability P(evergreen), then the feature is very informative. Chi-Squared feature selection will help us to identify the most informative features. This is also very easily implemented using scikit-learn:

           from sklearn.feature_selection import SelectPercentile, chi2

           FS = SelectPercentile(score_func = chi2, percentile = k)
           FS.fit_transform(trainset, trainlabels)

In order to figure out which percentage of features to keep, we will use a CV-loop in the next step.

     

3. Training a Classifier


      Now we're ready to train the classifier. There are many different options here, but we're going to start off using a Multinomial Naive-Bayes classifier. There are going to be two main parameters in our model: First, we have to decide which percentage of features to keep in the feature selection step (k). Second, we have to decide which alpha to use in our Naive-Bayes classifier (alpha). To do this, we'll run a 5-fold cross-validation loop for a range of different k's and alpha's, and we'll select the best k and alpha. Scikit-learn provides an easy way to perform cross-validation:

           from sklearn.cross_validation import StratifiedKFold  # older scikit-learn API

           kf = StratifiedKFold(trainlabels, n_folds = 5, indices = True)
           for train, cv in kf:
               X_train, X_cv, y_train, y_cv = \
                   trainset[train], trainset[cv], trainlabels[train], trainlabels[cv]
               ## Train Model and Calculate AUC Score for this Fold, for Each k, Alpha ##
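
     To make that placeholder concrete, the grid over k and alpha might look something like the sketch below -- the k and alpha grids here are just hypothetical values, and X_train, X_cv, y_train, y_cv are the per-fold splits from the loop above:

           from sklearn.feature_selection import SelectPercentile, chi2
           from sklearn.naive_bayes import MultinomialNB
           from sklearn.metrics import roc_auc_score

           k_grid = [5, 10, 20, 50]         # percentage of features to keep (hypothetical grid)
           alpha_grid = [0.01, 0.1, 1.0]    # Naive-Bayes smoothing parameter (hypothetical grid)
           scores = {}                      # (k, alpha) -> per-fold AUC scores; define before the CV loop

           # Inside the CV loop above, for each fold's X_train, X_cv, y_train, y_cv:
           for k in k_grid:
               FS = SelectPercentile(score_func = chi2, percentile = k)
               X_tr = FS.fit_transform(X_train, y_train)
               X_c = FS.transform(X_cv)
               for alpha in alpha_grid:
                   clf = MultinomialNB(alpha = alpha).fit(X_tr, y_train)
                   auc = roc_auc_score(y_cv, clf.predict_proba(X_c)[:, 1])
                   scores.setdefault((k, alpha), []).append(auc)

           # After the CV loop: keep the (k, alpha) pair with the best average AUC.
           best_k, best_alpha = max(scores, key = lambda p: sum(scores[p]) / len(scores[p]))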

After we do this, we can use our best k and best alpha to fit the classifier to the full training set, and then apply it to the test set. And we're done! This particular model gives a score of about 0.869; however, when combined in an ensemble with other models, I've been able to get up to about 0.883. The next step is to incorporate some of the non-text features into our model and see how much higher we can get!

     

Source Code: ModelNB.py
     In Part I, we laid out our plan for classifying populations that have undergone significant amounts of selection. Our plan is to simulate a very large number of populations experiencing a wide variety of different demographic scenarios. We'll then train a classifier on these simulated populations, in hopes of distinguishing between populations with or without selection. We laid out a number of potential pitfalls and difficulties. In Part II, we carried out this plan for the simpler case when the population size remained constant. Now, we want to attempt the full project.

     First, we generate the training and test sets, starting with about 30,000 populations (we'll add more later). For now, we have the following scenarios:


  1. Constant Populations
  2. Exponentially Growing Populations
  3. Logistically Growing Populations
  4. Bottleneck Populations
  5. "Alike" Populations

     For "Alike" populations, we use our selected populations to infer the time-varying population size that would be the most consistent with the population, if we assumed there was no selection. In other words, we choose the population size that looks as close as possible to what we see. Check out the research section for more information about how we get this.

     We'll now use these 30,000 populations to train our classifier. For now, we'll just use logistic regression, and make a histogram of our results:

[Figure: Histogram of the logistic regression prediction for neutral populations (blue) and selected populations (red).]

     This doesn't do nearly as well as the simple case did earlier. The distribution of outputs for the neutral populations is far more spread out, which is potentially very problematic if we try to generalize to more complicated scenarios. It is also the case that certain demographic scenarios perform much better than others -- this isn't very confidence-inspiring, since it implies that the model may only be picking out differences between these specific neutral populations and the selected ones, rather than a general signal of selection, which is what we're hoping for.

     What do we do next then? Well, at the moment, we're only using a very tiny portion of the total information we have about the populations. We're only using the site-frequency spectrum, which summarizes all of the data into just 50 numbers. However, in practice, we actually have the complete distribution of mutations in all individuals throughout the whole genome. Thus, the next step is to come up with better ways to use this data.

     There are many potential candidates: we can look at the distance between neighboring mutations, or the clustering of common/rare mutations. We can look at statistics like Tajima's D in sliding windows along the genome, etc.

     Many of these statistics show only small differences between neutral and selected populations; however, taken together, they could provide the key for distinguishing them. One such example is plotted below: this is a histogram of the difference between the distance to the nearest mutation from a mutation that appears in only one individual vs. a mutation that appears in an intermediate number of individuals. In other words, it measures the typical clustering of rare vs. common mutations:


[Figure: Histogram of the Average Distance to the Nearest Mutation for a Singleton vs. an Intermediate]
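
     As a rough illustration of how a statistic like this can be computed for a single simulated population, here's a sketch. The inputs are hypothetical: positions is the sorted array of mutation positions along the genome, counts gives how many individuals carry each mutation, and the 25-75% band used to define 'intermediate' frequency is an arbitrary choice for the example:

           import numpy as np

           def mean_nearest_distance(positions, mask):
               """Average distance to the nearest neighboring mutation, taken over
               the mutations selected by `mask`. `positions` must be sorted."""
               pos = np.asarray(positions, dtype=float)
               gaps = np.diff(pos)
               nearest = np.minimum(np.concatenate(([np.inf], gaps)),
                                    np.concatenate((gaps, [np.inf])))
               return nearest[mask].mean()

           def singleton_vs_intermediate_shift(positions, counts, n_individuals):
               # counts[i] = number of individuals carrying mutation i (hypothetical format).
               counts = np.asarray(counts)
               singletons = counts == 1
               # 'Intermediate' frequency taken here as 25-75% of individuals (an arbitrary choice).
               intermediate = (counts >= 0.25 * n_individuals) & (counts <= 0.75 * n_individuals)
               return (mean_nearest_distance(positions, singletons)
                       - mean_nearest_distance(positions, intermediate))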

     Although the distributions are fairly similar, there is a noticeable shift between the two. By including many such statistics, we can potentially improve our model and hopefully, figure out how to classify populations that have undergone selection. Stay tuned for Part IV in the future, in which we (hopefully) discover a model that solves everything!
     In Part I, we laid out our plan for classifying populations based upon whether they experienced significant amounts of selection. Our goal is to simulate a very large set of populations under a variety of demographic scenarios and use machine learning algorithms to classify them.

     However, before we move on to the full-scale question, we can first look at a simpler question: distinguishing constant-sized populations with selection from those without. This will be very helpful, as we already know how to do this using simple statistics such as Tajima's D, and will ensure that we're on the right track.

     First, we simulate a bunch of populations with a range of population sizes and mutation rates, some with selection and some without. In this case, we have around 1000 neutral populations and around 1000 selected populations. We've divided this set of populations 50/50 into a training set and a test set. The first thing we'll do is look at a simple statistic that is commonly used to detect selection: Fu and Li's D. As we saw in Part I, constant neutral populations will tend to have a Fu and Li's D near zero, while selected populations will often have a significantly negative value. If we calculate this statistic for our test set and make a histogram we find:

[Figure: Histogram of Fu and Li's D for neutral populations (blue) and selected populations (red).]

     We see that the neutral populations are distributed around zero, while the selected populations are skewed towards more-negative values of Fu and Li's D. (Note: The actual shape of the distribution depends strongly upon the specific range of parameters we have chosen -- if we preferentially choose very small or very large selection coefficients, the distribution will be closer to neutral).

     In order to compare the predictions using Fu and Li's D with subsequent models, we can compare the false positive and false negative rates at different cutoff points. In general, there is a tradeoff between these two rates as the cutoff moves (the same tradeoff traced out by an ROC curve). For the cutoff point shown above, there is a 1% false positive rate and a 9.3% false negative rate.

     Now, we can attempt to perform the same analysis using machine learning techniques. Instead of relying on our understanding of population genetics, we will simply feed features from our training set into a classification algorithm, and see how well this algorithm performs on the same test set as above. To begin, we will use only two features: the number of segregating sites (total number of sites that have a mutation) and the number of singletons (number of sites that have a mutation in exactly one individual). These are the same two features that Fu and Li's D uses. We'll then train a simple logistic regression classifier on the training set, and apply this to the test set. Doing so, we find:

[Figure: Histogram of the logistic regression prediction for neutral populations (blue) and selected populations (red).]

     We see that the vast majority of neutral populations are clustered around zero, while the vast majority of selected populations are clustered around one. For the cutoff point shown in the histogram, there is a 1% false positive rate and a 10% false negative rate. Thus, the logistic regression algorithm performs about as well as the summary statistic approach.
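
     Concretely, extracting the two features and training the classifier might look like the sketch below, where train_genotypes and test_genotypes are hypothetical lists of 0/1 mutation matrices (individuals as rows, sites as columns), one per simulated population, and train_labels marks selected populations with 1:

           import numpy as np
           from sklearn.linear_model import LogisticRegression

           def features(genotype_matrix):
               """genotype_matrix: individuals x sites, entries 0/1 (hypothetical format)."""
               carriers = genotype_matrix.sum(axis=0)       # individuals carrying each mutation
               n_segregating = int((carriers > 0).sum())    # sites with any mutation at all
               n_singletons = int((carriers == 1).sum())    # sites mutated in exactly one individual
               return [n_segregating, n_singletons]

           X_train = np.array([features(g) for g in train_genotypes])
           X_test = np.array([features(g) for g in test_genotypes])

           clf = LogisticRegression().fit(X_train, train_labels)   # labels: 0 = neutral, 1 = selected
           pred = clf.predict_proba(X_test)[:, 1]
           # Histogram `pred` separately for neutral and selected test populations, then
           # pick a cutoff that trades off false positives against false negatives.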

     As a side note, one thing that is always very helpful is to look at when our model is getting it wrong. We mentioned previously that in certain parameter regimes, selected populations appear neutral. In the case of very strong purifying (negative) selection, when Ne s >> 1, selection does not lead to a significant deviation in the relative branch lengths. Thus, we expect that as Ne s becomes large, our prediction should be less and less accurate. Plotting our prediction as a function of Ne s:

[Figure: Plot of our prediction vs. Log(Ne s) for negatively selected populations.]

     This plot confirms our expectations: as Ne s becomes larger, populations begin to appear more and more neutral, such that our prediction falls off. This is consistent with what we expect, and provides some confirmation that the model is working as expected.

     So far, we have shown that we can recreate a simple model using the same features as Fu and Li's D. However, the main advantage of using these techniques is that we are not restricted to models that we can predict analytically: instead, we can incorporate much more complicated features into our model. Thus, we want to repeat our analysis using additional features. In this case, we'll now include the complete site frequency spectrum. Doing so, we now get:

[Figure: Histogram of the logistic regression prediction for neutral populations (blue) and selected populations (red).]

     Now, the cutoff point gives a 1% false positive rate and a 7.4% false negative rate. Thus, we have significantly improved our model by incorporating additional information. So far, everything is working as expected. However, we now want to move on to the much more complicated situation in which populations may also experience arbitrary demographic changes.