Machine Learning to Detect Selection, Part III: The Full Case
First, we generate the training and test sets. We'll start by generating about 30,000 populations (we'll add more later). For now, we have the following scenarios:
1. Constant Populations
2. Exponentially Growing Populations
3. Logistically Growing Populations
4. Bottleneck Populations
5. "Alike" Populations
For "Alike" populations, we use our selected populations to infer the time-varying population size that would be most consistent with the data, if we assumed there was no selection. In other words, we choose the neutral population-size history that looks as close as possible to what we see in the selected population. Check out the research section for more information about how we compute this.
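To make the first four scenarios concrete, here's a sketch of what each one's population-size history N(t) might look like. All parameter values (sizes, growth rates, bottleneck timing) are hypothetical and purely for illustration; they are not the settings used in the actual simulations:

```python
import math

def constant(t, n0=10_000):
    """Constant population size."""
    return n0

def exponential(t, n0=1_000, rate=0.01):
    """Exponential growth from an initial size n0."""
    return n0 * math.exp(rate * t)

def logistic(t, n0=1_000, carrying_capacity=10_000, rate=0.01):
    """Logistic growth: exponential at first, saturating at the carrying capacity."""
    k = carrying_capacity
    return k / (1 + (k - n0) / n0 * math.exp(-rate * t))

def bottleneck(t, n0=10_000, crash_size=500, t_start=100, t_end=200):
    """Size crashes to crash_size during [t_start, t_end], then recovers."""
    return crash_size if t_start <= t <= t_end else n0
```

The "Alike" scenario has no closed form here, since its size history is inferred from the selected populations themselves.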
We'll now use these 30,000 populations to train our classifier. For now, we'll just use logistic regression and make a histogram of our results:
[Figure: histogram of classifier output for neutral populations (blue) and selected populations (red).]
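This training step can be sketched with a minimal from-scratch logistic regression on synthetic, SFS-like feature vectors. Everything here (the `fake_sfs` generator, the skew parameter, the learning rate) is a hypothetical stand-in for the real pipeline, which isn't shown in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data: each "population" is summarized by a
# 50-bin site-frequency spectrum. The "selected" class gets a slight excess of
# rare variants relative to the "neutral" class. Purely illustrative numbers.
def fake_sfs(n, skew):
    base = 1.0 / np.arange(1, 51)                 # neutral-like 1/k expectation
    weights = base * np.exp(-skew * np.arange(50) / 50.0)
    weights = weights / weights.sum()
    counts = rng.multinomial(500, weights, size=n)
    return counts / 500.0

X = np.vstack([fake_sfs(300, skew=0.0), fake_sfs(300, skew=1.0)])
y = np.concatenate([np.zeros(300), np.ones(300)])

# Plain batch gradient descent on the logistic loss.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # predicted P(selected)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 5.0 * grad_w
    b -= 5.0 * grad_b

accuracy = np.mean((p > 0.5) == y)
```

A histogram of `p` for each class would give a plot analogous to the one above.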
This doesn't do nearly as well as the simple case from earlier. The distribution of outputs for the neutral populations is far more spread out, which is worrying if we try to generalize to more complicated scenarios. Certain demographic scenarios also perform much better than others. That isn't very confidence-inspiring: it suggests the model may only be picking out differences between these specific neutral populations and the selected ones, rather than a general signal of selection, which is what we're hoping for.
So what do we do next? At the moment, we're only using a tiny portion of the total information we have about the populations: the site-frequency spectrum, which summarizes all of the data into just 50 numbers. In practice, however, we have the complete distribution of mutations across all individuals throughout the whole genome. The next step, then, is to come up with better ways to use this data.
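To make those "50 numbers" concrete, here is one way a site-frequency spectrum could be computed from a 0/1 genotype matrix. The layout (individuals as rows, sites as columns) is an assumption for illustration:

```python
import numpy as np

def site_frequency_spectrum(genotypes):
    """Count how many sites have each derived-allele count.

    genotypes: (n_individuals, n_sites) 0/1 matrix.
    Returns an array of length n_individuals - 1, where entry k-1 is the
    number of sites whose derived allele appears in exactly k individuals
    (absent and fixed sites are ignored).
    """
    n = genotypes.shape[0]
    counts = genotypes.sum(axis=0)               # derived-allele count per site
    segregating = counts[(counts > 0) & (counts < n)]
    return np.bincount(segregating, minlength=n)[1:n]

# Toy example: 4 individuals, 5 sites.
g = np.array([
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
])
sfs = site_frequency_spectrum(g)
```

For 51 sampled individuals this would yield exactly the 50 numbers mentioned above, though the post doesn't state the actual sample size.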
There are many potential candidates: we can look at the distance between neighboring mutations, the clustering of common vs. rare mutations, or statistics like Tajima's D computed in sliding windows along the genome.
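As a concrete example of one candidate, here's a sketch of Tajima's D for a single window, following the standard definition (Tajima 1989); how windows are chosen and how the data is stored are assumptions here:

```python
import numpy as np

def tajimas_d(genotypes):
    """Tajima's D for one window of a 0/1 genotype matrix (individuals x sites)."""
    n, _ = genotypes.shape
    counts = genotypes.sum(axis=0)
    counts = counts[(counts > 0) & (counts < n)]   # segregating sites only
    s = len(counts)
    if s == 0:
        return 0.0
    # Average pairwise diversity pi and Watterson's estimator theta_W.
    pi = np.sum(2.0 * counts * (n - counts) / (n * (n - 1)))
    a1 = np.sum(1.0 / np.arange(1, n))
    theta_w = s / a1
    # Standard normalizing constants for the variance of (pi - theta_W).
    a2 = np.sum(1.0 / np.arange(1, n) ** 2)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    return (pi - theta_w) / np.sqrt(var)

# Toy window: 4 individuals, one singleton and one doubleton site.
g = np.array([[1, 1], [0, 1], [0, 0], [0, 0]])
d = tajimas_d(g)
```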
Many of these statistics show only small differences between neutral and selected populations; taken together, however, they could provide the key to distinguishing them. One such example is plotted below: histograms of the distance to the nearest mutation from a singleton (a mutation that appears in only one individual) vs. from a mutation that appears in an intermediate number of individuals. In other words, it measures the typical clustering of rare vs. common mutations:
[Figure: histograms of the distance to the nearest mutation for a Singleton vs. an Intermediate.]
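One way to turn this comparison into a single per-population feature is to take the mean nearest-mutation distance for singletons minus the same mean for intermediate-frequency mutations. The sketch below does exactly that; the 25-75% definition of "intermediate" is an assumption:

```python
import numpy as np

def singleton_vs_intermediate(positions, counts, n):
    """Mean nearest-neighbor distance for singletons minus the same mean for
    mutations carried by 25-75% of the n individuals."""
    order = np.argsort(positions)
    pos = np.asarray(positions, dtype=float)[order]
    cnt = np.asarray(counts)[order]
    gaps = np.diff(pos)
    left = np.concatenate([[np.inf], gaps])    # gap to the previous mutation
    right = np.concatenate([gaps, [np.inf]])   # gap to the next mutation
    dists = np.minimum(left, right)            # distance to nearest mutation
    singles = dists[cnt == 1]
    inter = dists[(cnt >= 0.25 * n) & (cnt <= 0.75 * n)]
    return singles.mean() - inter.mean()

# Toy example: 6 mutations in a sample of 10 individuals. The intermediate
# mutations sit in tight pairs, while the singletons are more isolated.
positions = [0, 10, 11, 30, 31, 60]
counts = [1, 5, 5, 1, 5, 1]
stat = singleton_vs_intermediate(positions, counts, n=10)
```

A positive value means singletons are, on average, more isolated than intermediate-frequency mutations.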
Although the distributions are fairly similar, there is a noticeable shift between the two. By including many such statistics, we can potentially improve our model and, hopefully, figure out how to classify populations that have undergone selection. Stay tuned for Part IV, in which we (hopefully) discover a model that solves everything!