Machine Learning to Detect Selection, Part II: The Simple Case

     In Part I, we laid out our plan for classifying populations based upon whether they experienced significant amounts of selection. Our goal is to simulate a very large set of populations under a variety of demographic scenarios and use machine learning algorithms to classify them.

     However, before we move on to the full-scale question, we can first look at a simpler one: distinguishing constant-sized populations with selection from those without. This is a useful starting point because we already know how to solve it using simple summary statistics such as Tajima's D, so it will confirm that we're on the right track.
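For concreteness, here is a sketch of Tajima's D computed from a 0/1 genotype matrix (sites × samples, with 1 marking the derived allele). The constants follow the standard Tajima (1989) definitions; the helper name and the matrix convention are ours for illustration, not taken from the actual analysis pipeline:

```python
import numpy as np

def tajimas_d(genotypes):
    """Tajima's D from a 0/1 genotype matrix of shape
    (num_sites, num_samples), where 1 marks the derived allele."""
    n = genotypes.shape[1]
    counts = genotypes.sum(axis=1)
    seg = (counts > 0) & (counts < n)
    S = np.count_nonzero(seg)        # number of segregating sites
    if S == 0:
        return 0.0
    # Mean pairwise differences (pi): a site with derived count k
    # differs in k * (n - k) of the n * (n - 1) / 2 sample pairs.
    k = counts[seg]
    pi = np.sum(2.0 * k * (n - k)) / (n * (n - 1))
    # Standard normalization constants (Tajima 1989).
    a1 = np.sum(1.0 / np.arange(1, n))
    a2 = np.sum(1.0 / np.arange(1, n) ** 2)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))
```

An excess of rare variants pulls the statistic negative; an excess of intermediate-frequency variants pushes it positive.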

     First, we simulate a bunch of populations spanning a range of population sizes and mutation rates, some with selection and some without. In this case, we have around 1000 neutral populations and around 1000 selected populations, which we've divided 50/50 into a training set and a test set. The first thing we'll do is look at a simple statistic that is commonly used to detect selection: Fu and Li's D. As we saw in Part I, constant-sized neutral populations will tend to have a Fu and Li's D near zero, while selected populations will often have a significantly negative value. If we calculate this statistic for our test set and make a histogram, we find:

Histogram of Fu and Li's D for neutral populations (blue)
and selected populations (red).

     We see that the neutral populations are distributed around zero, while the selected populations are skewed towards more-negative values of Fu and Li's D. (Note: The actual shape of the distribution depends strongly upon the specific range of parameters we have chosen -- if we preferentially choose very small or very large selection coefficients, the distribution will be closer to neutral).
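Fu and Li's D is built from two counts that are easy to read off a simulated sample: the total number of segregating sites and the number of singletons. A minimal sketch, assuming a 0/1 genotype matrix of shape (sites, samples) with a known ancestral state (the helper name is ours):

```python
import numpy as np

def site_counts(genotypes):
    """Segregating sites and singletons from a 0/1 genotype matrix
    of shape (num_sites, num_samples), where 1 is the derived allele."""
    n = genotypes.shape[1]
    derived = genotypes.sum(axis=1)
    # A site segregates if the derived allele is neither absent nor fixed.
    segregating = np.count_nonzero((derived > 0) & (derived < n))
    # A singleton carries the derived allele in exactly one individual.
    singletons = np.count_nonzero(derived == 1)
    return segregating, singletons
```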

     In order to compare the predictions using Fu and Li's D with subsequent models, we can compare the false positive and false negative rates at different cutoff points. In general, there is a tradeoff between these two rates (the same tradeoff an ROC curve traces out): a more lenient cutoff catches more selected populations but misclassifies more neutral ones. For the cutoff point shown above, there is a 1% false positive rate and a 9.3% false negative rate.
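The two error rates at a given cutoff are simple to compute from predicted scores and true labels, and sweeping the cutoff traces out the full tradeoff curve. A sketch (for Fu and Li's D itself, where selection pushes the statistic negative, one would use -D as the score so that higher means "more selected"):

```python
import numpy as np

def error_rates(scores, labels, cutoff):
    """False positive / false negative rates at one cutoff.
    labels: 1 = selected, 0 = neutral; score > cutoff => call 'selected'."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    called_selected = scores > cutoff
    fpr = np.mean(called_selected[labels == 0])   # neutral called selected
    fnr = np.mean(~called_selected[labels == 1])  # selected called neutral
    return fpr, fnr
```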

     Now, we can attempt to perform the same analysis using machine learning techniques. Instead of relying on our understanding of population genetics, we will simply feed features from our training set into a classification algorithm, and see how well this algorithm performs on the same test set as above. To begin, we will use only two features: the number of segregating sites (total number of sites that have a mutation) and the number of singletons (number of sites that have a mutation in exactly one individual). These are the same two features that Fu and Li's D uses. We'll then train a simple logistic regression classifier on the training set, and apply this to the test set. Doing so, we find:

Histogram of the logistic regression prediction for neutral
populations (blue) and selected populations (red).

     We see that the vast majority of neutral populations are clustered around zero, while the vast majority of selected populations are clustered around one. For the cutoff point shown in the histogram, there is a 1% false positive rate and a 10% false negative rate. Thus, the logistic regression algorithm performs about as well as the summary statistic approach.
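As an illustration of this pipeline, here is a sketch using scikit-learn's LogisticRegression. The numbers below are toy stand-ins invented for the example (in the real analysis the features come from the simulated populations), chosen so that "selected" populations carry an excess of singletons:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pop = 1000  # populations per class; made-up toy size

# Toy features per population: (segregating sites, singletons).
# The 0.25 vs 0.45 singleton fractions are illustrative only.
seg_neu = rng.poisson(100, n_pop)
seg_sel = rng.poisson(100, n_pop)
sing_neu = rng.binomial(seg_neu, 0.25)  # neutral: fewer singletons
sing_sel = rng.binomial(seg_sel, 0.45)  # selected: singleton excess

X = np.column_stack([
    np.concatenate([seg_neu, seg_sel]),
    np.concatenate([sing_neu, sing_sel]),
])
y = np.concatenate([np.zeros(n_pop), np.ones(n_pop)])  # 0 = neutral, 1 = selected

clf = LogisticRegression(max_iter=1000).fit(X, y)
prob_selected = clf.predict_proba(X)[:, 1]  # the quantity histogrammed above
```

In the real analysis, the model is fit on the training set and the histogrammed predictions come from the held-out test set.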

     As a side note, it is always very helpful to look at where our model gets things wrong. We mentioned previously that in certain parameter regimes, selected populations appear neutral. In the case of very strong purifying (negative) selection, when Ne s >> 1, selection does not lead to a significant deviation in the relative branch lengths. Thus, we expect our prediction to become less and less accurate as Ne s grows large. Plotting our prediction as a function of Ne s:

Plot of our prediction vs. Log(Ne s) for negatively selected populations.

     This plot confirms our expectations: as Ne s becomes larger, the populations appear more and more neutral, and our prediction degrades accordingly. This consistency provides some confirmation that the model is working as expected.

     So far, we have shown that we can recreate a simple model using the same features as Fu and Li's D. However, the main advantage of using these techniques is that we are not restricted to models that we can predict analytically: instead, we can incorporate much more complicated features into our model. Thus, we want to repeat our analysis using additional features. In this case, we'll now include the complete site frequency spectrum. Doing so, we now get:

Histogram of the logistic regression prediction for neutral
populations (blue) and selected populations (red).

     Now, the cutoff point gives a 1% false positive rate and a 7.4% false negative rate. Thus, we have significantly improved our model by incorporating additional information. So far, everything is working as expected. However, we now want to move on to the much more complicated situation in which populations may also experience arbitrary demographic changes.
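For reference, the full site frequency spectrum feature vector can be read off a 0/1 genotype matrix (sites × samples, with 1 marking the derived allele) in a couple of lines; the helper name is ours:

```python
import numpy as np

def sfs(genotypes):
    """Unfolded site frequency spectrum: the number of sites whose
    derived allele appears in exactly k of n samples, for k = 1..n-1."""
    n = genotypes.shape[1]
    counts = genotypes.sum(axis=1)
    counts = counts[(counts > 0) & (counts < n)]  # segregating sites only
    return np.bincount(counts, minlength=n)[1:n]
```

The vector's sum is the number of segregating sites and its first entry is the singleton count, so the earlier two-feature model uses only a coarse projection of this richer representation.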

