# Machine Learning to Detect Selection, Part III: The Full Case

In Part I, we laid out our plan for classifying populations that have undergone significant amounts of selection. Our plan is to simulate a very large number of populations experiencing a wide variety of different demographic scenarios. We'll then train a classifier on these simulated populations, in hopes of distinguishing between populations with or without selection. We laid out a number of potential pitfalls and difficulties. In Part II, we carried out this plan for the simpler case when the population size remained constant. Now, we want to attempt the full project.

First, we generated the training and test sets. We're going to start out by generating about 30,000 populations (we'll add more later). For now, we have the following scenarios:

1. Constant Populations
2. Exponentially Growing Populations
3. Logistically Growing Populations
4. Bottleneck Populations
5. "Alike" Populations

For "Alike" populations, we use our selected populations to infer the time-varying population size that would be the most consistent with the population, if we assumed there was no selection. In other words, we choose the population size that looks as close as possible to what we see. Check out the research section for more information about how we get this.

We'll now use these 30,000 populations to train our classifier. For now, we'll just use logistic regression, and make a histogram of our results:

**Histogram of the logistic regression prediction for neutral populations (blue) and selected populations (red).**
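The post doesn't say which logistic regression implementation was used, so here is a minimal from-scratch sketch by gradient descent, taking a per-population feature vector (such as the 50-bin site-frequency spectrum) as input; the learning rate and epoch count are arbitrary illustrative choices:

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Minimal logistic regression trained by batch gradient descent.

    X: list of feature vectors (e.g. site-frequency spectra);
    y: labels, 0 (neutral) or 1 (selected).
    Returns the fitted weights and intercept.
    """
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + math.exp(-z))  # sigmoid
            err = p - yi                # gradient of the log-loss
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    """Predicted probability that a population is under selection."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(-z))
```

The histogram above would then be built from `predict` scores on the held-out test populations.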

This doesn't do nearly as well as the simple constant-size case from Part II. The distribution of outputs for the neutral populations is far more spread out, which is potentially very problematic if we try to generalize to more complicated scenarios. Certain demographic scenarios also perform much better than others. That isn't very confidence-inspiring: it implies that the model may only be picking out differences between these specific neutral populations and the selected ones, rather than the general signal of selection we're hoping for.

So what do we do next? At the moment, we're only using a tiny portion of the total information we have about the populations: the site-frequency spectrum, which summarizes all of the data into just 50 numbers. In practice, however, we have the complete distribution of mutations across all individuals throughout the whole genome. The next step is to come up with better ways to use this data.
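For concreteness, the site-frequency spectrum can be computed directly from a 0/1 genotype matrix; a sample of 51 individuals would yield the 50 numbers mentioned above. This sketch assumes the ancestral/derived state of each site is known:

```python
def site_frequency_spectrum(genotypes):
    """Unfolded site-frequency spectrum from a 0/1 genotype matrix.

    genotypes[i][j] is 1 if individual j carries the derived allele at
    site i.  Returns sfs, where sfs[k-1] counts the sites at which the
    derived allele appears in exactly k of the n sampled individuals
    (k = 1 .. n-1; monomorphic sites are ignored).
    """
    n = len(genotypes[0])
    sfs = [0] * (n - 1)
    for site in genotypes:
        k = sum(site)
        if 0 < k < n:  # keep only polymorphic sites
            sfs[k - 1] += 1
    return sfs
```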

There are many potential candidates: the distance between neighboring mutations, the clustering of common versus rare mutations, or statistics like Tajima's D computed in sliding windows along the genome.
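As one example, Tajima's D compares pairwise nucleotide diversity against Watterson's estimator of the mutation rate; an excess of rare variants drives it negative. A sketch following the standard definition (Tajima, 1989), which a sliding-window analysis would simply call once per window of sites:

```python
import math

def tajimas_d(allele_counts, n):
    """Tajima's D for one window of segregating sites.

    allele_counts: derived-allele count (1 .. n-1) at each segregating
    site in the window; n: number of sampled chromosomes.
    """
    S = len(allele_counts)
    if S == 0:
        return 0.0
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Pairwise nucleotide diversity, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    theta_w = S / a1  # Watterson's estimator
    return (pi - theta_w) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

A window full of singletons gives a negative D, while intermediate-frequency variants push it positive, which is exactly the kind of contrast a classifier could exploit.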

Many of these statistics show only small differences between neutral and selected populations; taken together, however, they could provide the key for distinguishing them. One such example is plotted below: a histogram of the difference between the average distance to the nearest mutation for mutations that appear in only one individual (singletons) versus mutations that appear in an intermediate number of individuals. In other words, it measures the typical clustering of rare versus common mutations:
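This singleton-versus-intermediate statistic can be computed directly from mutation positions and allele counts. A rough sketch, where the 30–70% frequency band defining "intermediate" is my illustrative choice, not necessarily the one used for the plot:

```python
def nearest_distance_gap(positions, counts, n, lo_frac=0.3, hi_frac=0.7):
    """Mean nearest-neighbor distance for singletons minus the same
    quantity for intermediate-frequency mutations.

    positions: sorted genomic positions of the mutations;
    counts: derived-allele count of each mutation (out of n individuals).
    Returns None if either class is empty.
    """
    def nearest(i):
        # Distance to the closest flanking mutation.
        d = []
        if i > 0:
            d.append(positions[i] - positions[i - 1])
        if i < len(positions) - 1:
            d.append(positions[i + 1] - positions[i])
        return min(d)

    singles = [nearest(i) for i, k in enumerate(counts) if k == 1]
    inters = [nearest(i) for i, k in enumerate(counts)
              if lo_frac <= k / n <= hi_frac]
    if not singles or not inters:
        return None
    return sum(singles) / len(singles) - sum(inters) / len(inters)
```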

**Histogram of the average distance to the nearest mutation for a singleton vs. an intermediate-frequency mutation.**

Although the distributions are fairly similar, there is a noticeable shift between the two. By including many such statistics, we can potentially improve our model and, hopefully, figure out how to classify populations that have undergone selection. Stay tuned for Part IV, in which we (hopefully) discover a model that solves everything!
