# Machine Learning to Detect Selection, Part I: The Problem

In the past few decades, as the cost of DNA sequencing has decreased dramatically, an enormous amount of sequence data has emerged. A central goal in biology is to extract as much information about the evolutionary process as we can from this data. Already, there have been tremendous advances as a consequence of sequence data: we have improved our understanding of the relatedness between species, the patterns of migration and population structure in humans, the evolution of antibiotic resistance, the evolution of disease, and far more.

One question that is often asked is whether a population has experienced a significant amount of selection. Several different "tests of neutrality" have been developed to answer it. Early tests of neutrality, such as Tajima's D, rely on the fact that under a neutral model, different summary statistics describing a sample are unbiased estimators of a single quantity, Θ, the population-scaled mutation rate. Thus, if a population is evolving neutrally, the values of Θ obtained from different estimators should be consistent with one another. Tajima's D and related statistics provide a mathematical framework for evaluating whether the estimators deviate from one another by a statistically significant amount, i.e. whether the neutral null hypothesis can be rejected.

When a population is under directional selection, less-fit individuals tend to be out-competed, which leads to an overall reduction in diversity and an excess of rare mutations relative to more common ones. This produces negative values of Tajima's D (for more detailed information about this and other tests of neutrality, I find the online lecture notes by Prof. Wolfgang Stephan to be very helpful). Thus, negative values of Tajima's D can sometimes be indicative of directional selection. However, it is important to note that selection is only one possible explanation -- a negative D can also be caused by purely demographic effects, such as population expansion.
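To make the statistic concrete, here is a minimal sketch of Tajima's D using the standard formulas from Tajima (1989). The input format (a list of 0/1 haplotypes, with 1 marking a derived allele) and the function name `tajimas_d` are assumptions made for illustration, not part of any particular library.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a sample of aligned 0/1 haplotypes (assumed input format).

    Returns NaN when the sample has no segregating sites.
    """
    n = len(haplotypes)
    if n < 4:
        # For n < 4 the variance term is zero and D is undefined.
        raise ValueError("need at least four sequences")

    # Number of derived alleles at each site; keep only segregating sites.
    counts = [sum(column) for column in zip(*haplotypes)]
    segregating = [k for k in counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        return float("nan")

    # Estimator 1: mean number of pairwise differences (pi).
    # A site with k derived copies differs in k * (n - k) of the pairs.
    n_pairs = n * (n - 1) / 2
    pi = sum(k * (n - k) for k in segregating) / n_pairs

    # Estimator 2: Watterson's estimator, S / a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = s / a1

    # Constants for the variance of (pi - theta_w) under neutrality.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Ten sequences, ten sites, every mutation carried by a single sequence:
# an excess of rare variants, so D comes out negative.
singletons = [[1 if i == j else 0 for j in range(10)] for i in range(10)]
print(tajimas_d(singletons))
```

Both estimators target the same Θ under neutrality, so D is close to zero on average in that case. Rare variants inflate the number of segregating sites more than they inflate pairwise diversity, which is what pushes D below zero.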

To illustrate this, consider two populations: one is under selection but has a constant population size, while the other is neutral but growing exponentially. If we calculate Tajima's D for these two populations, or look at another feature often used for detecting selection or growth, the Bayesian skyline plot, we see very similar signals for each:

**Figure: Common features used to identify population growth / selection for two populations, one experiencing exponential growth but no selection, the other experiencing selection but no growth.**

Thus, we see that the signal for selection can look very similar to that for population growth. An open question is whether we can develop more sophisticated methods to distinguish between these scenarios. Currently, our analytical understanding of these effects is still incomplete, making it very difficult to develop statistical tests incorporating both effects.

This leads to my current research idea: instead of deriving a statistical test from first principles, can we adopt traditional machine learning approaches to address this question? The plan is to generate a very large set of simulated populations, evolving under a wide variety of demographic scenarios, with and without selection. We'll then use machine learning algorithms to classify populations that have undergone selection vs. those that have not.
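As a rough sketch of how such a simulated data set might be organised, the snippet below stores each population with its demographic scenario, its selection label, and a feature vector. The class `SimulatedPopulation`, the helper `split_by_scenario`, and the scenario names are all hypothetical, and the feature values are placeholders rather than simulation output. Keeping the scenario label alongside each record makes it possible to hold out entire demographic scenarios for testing instead of splitting populations at random.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SimulatedPopulation:
    scenario: str                 # hypothetical name of the demographic model
    selected: bool                # class label: simulated with selection or not
    features: Tuple[float, ...]   # summary statistics, e.g. Tajima's D


def split_by_scenario(
    populations: Iterable[SimulatedPopulation],
    held_out: Iterable[str],
) -> Tuple[List[SimulatedPopulation], List[SimulatedPopulation]]:
    """Put every population from a held-out scenario into the test set."""
    held = set(held_out)
    train, test = [], []
    for pop in populations:
        (test if pop.scenario in held else train).append(pop)
    return train, test


# Placeholder records: the feature values here are dummies, not real results.
populations = [
    SimulatedPopulation("constant_size", False, (0.0,)),
    SimulatedPopulation("constant_size", True, (0.0,)),
    SimulatedPopulation("exponential_growth", False, (0.0,)),
    SimulatedPopulation("bottleneck", False, (0.0,)),
    SimulatedPopulation("bottleneck", True, (0.0,)),
]
train, test = split_by_scenario(populations, held_out=["bottleneck"])
print(len(train), len(test))
```

Any classifier could then be fitted on `train` and evaluated on `test`; the point of the split is only that the test scenarios were never seen during training.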

There are a number of potential problems with this idea. First, although our goal is to differentiate *any* demographic model from selection, we will only be able to show that we can differentiate the specific demographic scenarios that we have tested from selection. Thus, it is not at all clear that the results will generalize to cover the complete space of all demographic models. We will have to be very careful in our interpretation of the results and in our selection of appropriate training / test sets.

A second major problem is the following: although all neutral populations should 'look' like neutral populations, selected populations do not always have a noticeable signature of selection. In fact, there are many regimes of selection under which we do not expect to see any signal at all -- for example, the very strong background selection regime. Thus, although we expect the false positive rate to be very low (i.e. neutral populations should rarely be labeled as selected), there are many cases in which we expect a significant number of false negatives (i.e. selected populations labeled as neutral). Furthermore, we can make this false negative rate arbitrarily higher or lower simply by including more or fewer populations from such a regime. Thus, there are a number of potential caveats that have to be addressed.
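The dependence of the false negative rate on the composition of the data set is simple arithmetic: the overall rate is the average of the per-regime rates, weighted by how many selected populations come from each regime. The sketch below illustrates this. The function name `overall_false_negative_rate` is hypothetical, and the per-regime rates (0.1 for a regime with a clear signal, 1.0 for a regime with no signal) are made-up numbers chosen only to show the effect.

```python
def overall_false_negative_rate(regimes):
    """Weighted average of per-regime false negative rates.

    regimes: list of (number_of_selected_populations, false_negative_rate).
    """
    total = sum(count for count, _ in regimes)
    if total == 0:
        raise ValueError("need at least one selected population")
    return sum(count * rate for count, rate in regimes) / total


# Illustrative, made-up rates: a detectable regime vs. an undetectable one.
DETECTABLE_FNR = 0.1
UNDETECTABLE_FNR = 1.0

mostly_detectable = [(900, DETECTABLE_FNR), (100, UNDETECTABLE_FNR)]
mostly_undetectable = [(100, DETECTABLE_FNR), (900, UNDETECTABLE_FNR)]

print(overall_false_negative_rate(mostly_detectable))    # about 0.19
print(overall_false_negative_rate(mostly_undetectable))  # about 0.91
```

The classifier is identical in both cases; only the mix of simulated populations changes. This suggests reporting false negative rates per selection regime rather than as a single pooled number.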
