

     DNA sequence data contains a vast amount of information about the evolutionary history of populations. By looking at just a small subset of modern-day populations, we can infer a great deal about their histories: their mutation rates and population sizes, the rates and types of migration, whether selection played a prominent role, and more. Each of these scenarios leads to specific patterns that we can, in principle, observe in our data. In practice, however, doing so is often extremely difficult: our understanding of the patterns we'd expect to see is incomplete, different scenarios can lead to very similar or even indistinguishable patterns, and even when inference is theoretically possible, there is significant noise in the data.

Sequence data contains an enormous amount of
information about the evolutionary history of populations.

     My research has focused on understanding the effects of selection on these patterns. To understand the effects that we expect to see under selection, it is helpful to first consider what we expect to see in neutral populations. To do this, we make use of a genealogical tree (below). Here, we have four individuals, labeled A through D. We trace their ancestral lineages backwards in time (upwards on the tree). At some point, Individual C and Individual D will share a common ancestor, at which point they 'coalesce'. We then continue to trace the remaining lineages backwards in time until all individuals have coalesced. Next, we incorporate neutral mutations into the framework by assuming that they occur as a Poisson process along the branch lengths. This process of describing the ancestry of a sample backwards-in-time is known as coalescent theory and provides an elegant mathematical framework for calculating probabilities of gene trees. In a neutral (random-mating, non-recombining, etc.) population, these probabilities are well-described, and we are able to calculate the complete probability of any gene tree or tree-related statistic. 
A gene tree for a sample of four individuals.
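
As a rough illustration of how neutral mutations are added to a gene tree, the sketch below places a Poisson number of mutations on each branch, with mean equal to the mutation rate times the branch length. The branch lengths, the mutation rate, and the function names are all hypothetical choices made for this example; they are not taken from the figure or from any of the papers.

```python
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def drop_mutations(branch_lengths, mutation_rate, rng):
    """Place a Poisson number of neutral mutations on each branch.

    The mean on a branch is mutation_rate * branch length (in generations).
    """
    return {
        branch: poisson_count(mutation_rate * length, rng)
        for branch, length in branch_lengths.items()
    }

# Hypothetical branch lengths (in generations) for a four-sample tree in which
# C and D coalesce first, then join B, then join A.
example_tree = {
    "A": 1500.0, "B": 1000.0, "C": 300.0, "D": 300.0,
    "CD": 700.0, "BCD": 500.0,
}

# Assumed per-generation mutation rate; purely illustrative.
mutations = drop_mutations(example_tree, 1e-3, random.Random(1))
```

Mutations that land on the external branches (A, B, C, D) appear in a single sampled individual, while those on internal branches (CD, BCD) are shared by several.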

     One major result of neutral coalescent theory is that the expected lengths of branches are proportional to the population size. This makes sense: in a randomly-mating, neutral population, the probability of any two individuals sharing a parent in the previous generation is simply 1/N, so the expected time until two individuals share a parent is N generations. If a sample contains n individuals, then the expected time to the first event is simply N/(n choose 2). Thus, in a neutral population, there is a specific relationship between the branch lengths that we expect to hold. However, many of the demographic scenarios discussed above lead to a distortion in the relative branch lengths. The simplest example is to imagine a population that is growing in time. In this case, the population size in the distant past is smaller than the population size in the recent past, such that the branch lengths in the distant past are short relative to the branch lengths in the recent past. This implies that there will be an excess of rare mutations (mutations that only appear in a small number of individuals) relative to more common mutations. 
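
The neutral expectation described above, N/(n choose 2) generations while n lineages remain, can be tabulated directly. The sketch below does this for a sample of four, and also draws random waiting times under the usual large-N approximation in which each waiting time is exponentially distributed. The function names and the value N = 1000 are assumptions made for illustration.

```python
import random
from math import comb

def expected_waiting_times(n, N):
    """Expected time (generations) spent with k lineages, for k = n down to 2."""
    return {k: N / comb(k, 2) for k in range(n, 1, -1)}

def simulate_waiting_times(n, N, rng):
    """Draw one set of waiting times, each exponential with mean N / C(k, 2).

    This is the continuous-time approximation, which assumes N is large.
    """
    return {k: rng.expovariate(comb(k, 2) / N) for k in range(n, 1, -1)}

expected = expected_waiting_times(4, 1000)
one_draw = simulate_waiting_times(4, 1000, random.Random(42))

# The expected time to the common ancestor of the whole sample is the sum
# of the expected waiting times.
expected_tmrca = sum(expected.values())
```

With four lineages the first coalescence is expected after N/6 generations, the next after N/3, and the last after N, so the deepest branches are the longest under neutrality. Demographic or selective effects show up as departures from these ratios.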

In an expanding population, branch lengths in the distant past are short relative to those in the recent past.

     However, there are other scenarios that can lead to a similar effect. Suppose a population is experiencing fairly strong purifying (negative) selection. Individuals that carry deleterious mutations will tend to die out of the population more quickly than average. This means that, if you trace an individual's ancestry backwards in time, each ancestor is likely to carry fewer deleterious mutations than its descendant, since we know that the ancestor left a descendant in the present. In other words, individuals tend to be descended from more-fit ancestors. Because the more-fit classes make up only part of the population, the lineages become concentrated in a smaller pool of potential ancestors, and the rate of coalescence increases as we go further back in time. Thus, the branch lengths in the distant past are short relative to the branch lengths in the recent past, and there is once again an excess of rare mutations. In the end, the signal of selection in this regime is very similar to that of an expanding population, and it can be virtually impossible to distinguish between the two effects.

In a selected population, branch lengths in the distant past are short relative to those in the recent past.

     My research has focused on better understanding this effect. In the strong purifying selection regime (Ne s >> 1), the main strategy for understanding the effects of selection is the structured coalescent (originally developed by Hudson and Kaplan). In this case, the population is subdivided into classes based upon the fitnesses of individuals. We then trace the ancestry of individuals backwards in time as before, this time allowing lineages to jump between fitness classes. In order for two lineages to coalesce, they must co-exist in the same fitness class, in which case they coalesce with probability equal to the inverse of the size of the class. This provides a mathematical framework for calculating various statistics. We used this framework in The Structure of Genealogies in the Presence of Purifying Selection: A "Fitness-Class Coalescent" and The Structure of Allelic Diversity in the Presence of Purifying Selection to better understand the effects of selection on various genealogical statistics.
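
To make the two ingredients of this description concrete (lineages jumping between fitness classes, and coalescence only within a shared class), here is a deliberately simplified sketch for a pair of lineages. It is not the model from the papers: the single constant `jump_prob`, the restriction to one-step moves towards the fittest class, and all names are assumptions made for illustration, whereas the real transition probabilities depend on the mutation rate and the selection coefficient.

```python
import random

def simulate_pair(start_a, start_b, class_sizes, jump_prob, rng,
                  max_generations=1_000_000):
    """Return the generation at which two lineages coalesce, or None.

    Classes are indexed by number of deleterious mutations (0 is the fittest).
    class_sizes[k] is the number of individuals in class k.
    jump_prob is a hypothetical per-generation probability that a lineage
    steps back into the next-fitter class.
    """
    a, b = start_a, start_b
    for generation in range(1, max_generations + 1):
        # Going backwards in time, each lineage may move to a fitter class.
        if a > 0 and rng.random() < jump_prob:
            a -= 1
        if b > 0 and rng.random() < jump_prob:
            b -= 1
        # Coalescence is possible only within a shared class, and then it
        # happens with probability 1 / (size of that class).
        if a == b and rng.random() < 1.0 / class_sizes[a]:
            return generation
    return None

# Hypothetical class sizes: the fittest class is a small part of the population.
sizes = [50, 300, 650]
t = simulate_pair(2, 1, sizes, jump_prob=0.05, rng=random.Random(7))
```

Because lineages drift towards the smaller, fitter classes as we go back in time, the per-generation chance of coalescence rises, which is the distortion described above.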

     More recently, we have made the additional assumption that we may treat each ancestral lineage independently, which is reasonable provided selection is sufficiently strong (Ne s >> 1). When this is the case, instead of jointly considering the paths of multiple lineages through the population, we can simply calculate the probability that each lineage is in a particular class at a particular time. This drastically simplifies the analytical framework, and allows us to describe a population using a time-dependent effective population size Ne(t), which we calculate in Distortions in Genealogies Due to Purifying Selection. Thus, we find that a population experiencing strong purifying selection is virtually indistinguishable from a neutral population that is evolving with this time-varying population size. This result has a number of significant advantages. It is extremely simple, so we can incorporate it directly into the neutral coalescent framework to calculate virtually any statistic describing a genealogy. Furthermore, we can incorporate the effects of selection into any neutral method of inference or estimation simply by using the appropriate time-dependent population size. However, this result also implies a significant drawback: since a strongly-selected population appears equivalent to a time-varying neutral population, it will be very difficult or even impossible to determine which effect is causing the distortions we see.
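
The practical appeal of a time-dependent population size is that it plugs straight into neutral calculations. As a minimal sketch, the function below computes the distribution of the coalescence time for a pair of lineages when the chance of coalescing t generations ago is 1/Ne(t). The particular Ne(t) used in the example is a made-up placeholder, not the expression derived in the paper.

```python
def pairwise_coalescence_pmf(ne_at, max_generations):
    """Probability that two lineages first coalesce exactly t generations ago.

    ne_at(t) returns the effective population size t generations in the past.
    The returned list holds the probabilities for t = 1 .. max_generations.
    """
    pmf = []
    still_separate = 1.0  # probability of no coalescence before generation t
    for t in range(1, max_generations + 1):
        p = 1.0 / ne_at(t)
        pmf.append(still_separate * p)
        still_separate *= 1.0 - p
    return pmf

def constant_size(t):
    """Neutral reference: the same size in every generation."""
    return 1000.0

def shrinking_into_past(t):
    """Placeholder Ne(t) that declines going back in time, then levels off."""
    return max(100.0, 1000.0 - t)

neutral = pairwise_coalescence_pmf(constant_size, 2000)
distorted = pairwise_coalescence_pmf(shrinking_into_past, 2000)
```

Comparing `neutral` and `distorted` shows coalescence concentrated at more ancient times being compressed when Ne(t) is smaller in the past, whatever the underlying cause of that smaller size.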

     In a recent paper, Distortions in Genealogies Due to Purifying Selection and Recombination, we have shown that we can extend this result to also incorporate recombination, provided we make a further assumption that we may treat each site as independent.