Theoretical Population Genetics and Applications

Modeling ancestry proportion in admixed populations

Individuals in admixed populations differ in the proportion of ancestry they carry from each ancestral population. Under a model of instantaneous population merging, we computed the distribution of the ancestry proportions in the extant population. Since the ancestry proportions are easily measured, our results can be used to infer the admixture time in real populations, and we applied our method to infer the time of European gene flow into Ashkenazi Jews.
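The link between admixture time and the spread of ancestry proportions can be illustrated with a small simulation. The sketch below is not the distribution derived in this project. It assumes a much simpler, commonly used approximation in which ancestry along a single haploid chromosome switches as a two-state Markov process, with switch rates of (1 - m)t and mt per Morgan after a single admixture pulse t generations ago with proportion m from population A. The function names and parameter values are hypothetical.

```python
import random
import statistics


def ancestry_proportion(length_morgans, m, t, rng):
    """Fraction of one haploid chromosome derived from population A.

    Assumption: ancestry along the chromosome is a two-state Markov
    process (a simplified tract model, not the authors' derivation).
    """
    # Ancestry at the left end is drawn from the stationary distribution.
    in_a = rng.random() < m
    pos, total_a = 0.0, 0.0
    while pos < length_morgans:
        # Rate (per Morgan) of leaving the current ancestry state.
        rate = (1.0 - m) * t if in_a else m * t
        tract = min(rng.expovariate(rate), length_morgans - pos)
        if in_a:
            total_a += tract
        pos += tract
        in_a = not in_a
    return total_a / length_morgans


def simulate_proportions(t, n=2000, length_morgans=2.0, m=0.5, seed=1):
    """Ancestry proportions of n independent chromosomes."""
    rng = random.Random(seed)
    return [ancestry_proportion(length_morgans, m, t, rng) for _ in range(n)]


if __name__ == "__main__":
    for t in (5, 20, 50):
        props = simulate_proportions(t)
        print(t, statistics.mean(props), statistics.variance(props))
```

In this toy model the mean proportion stays near m while the variance across chromosomes shrinks as t grows, which is the kind of signal that makes the admixture time identifiable from measured ancestry proportions.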

Estimating the time of origin of rare mutations

Dating the origin of a mutation is a classical problem with medical and historical implications, but existing methods are ineffective for recent, rare mutations. A promising approach is to use the lengths of the segments shared between carriers, in addition to the mutation’s frequency. However, exact theoretical approaches have proved intractable. We developed a new algorithm based on advanced machine learning methods. In the context of Ashkenazi Jewish genetics, we plan to determine whether rare Ashkenazi mutations originated before or after the founder event.

The effect of crossover interference on patterns of genetic variation

Recombination is often modeled as a simple Poisson process along the genome, leading to random switches in the ancestry of a given lineage. In reality, the molecular process is far more complex: crossover interference is known to constrain the distance between nearby crossovers in a single meiosis. Only recently, with the availability of the genomes of thousands of individuals, has it become possible to characterize interference reliably. We developed a new probabilistic framework that enables the study of crossover interference over multiple meioses, even if crossovers from individual meioses cannot be distinguished, which significantly expands the amount and type of data that can be examined.
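To make the contrast between the Poisson model and interference concrete, the sketch below simulates crossover positions in a single meiosis as a renewal process with gamma-distributed distances. The gamma model is a standard way to represent interference, but it is an assumption here and not necessarily the framework developed in this project. The shape parameter nu, the mean distance of 1 Morgan, and the very long hypothetical chromosome are illustrative choices only; nu = 1 recovers the Poisson model.

```python
import random
import statistics


def simulate_crossovers(length_morgans, nu, rng):
    """Crossover positions (in Morgans) under a gamma renewal model.

    Assumptions: inter-crossover distances are gamma distributed with
    shape nu and mean 1 Morgan, and the process starts at position 0.
    nu = 1 gives a Poisson process; nu > 1 gives positive interference.
    """
    positions = []
    pos = rng.gammavariate(nu, 1.0 / nu)  # (shape, scale), mean = 1
    while pos < length_morgans:
        positions.append(pos)
        pos += rng.gammavariate(nu, 1.0 / nu)
    return positions


def distance_cv(nu, length_morgans=5000.0, seed=1):
    """Coefficient of variation of distances between adjacent crossovers."""
    rng = random.Random(seed)
    pos = simulate_crossovers(length_morgans, nu, rng)
    dists = [b - a for a, b in zip(pos, pos[1:])]
    return statistics.stdev(dists) / statistics.mean(dists)


if __name__ == "__main__":
    for nu in (1.0, 4.0):
        print(nu, distance_cv(nu))
```

Under interference the distances between crossovers are more regular than under the Poisson model, so their coefficient of variation falls below 1 (for a gamma distribution it equals 1/sqrt(nu)).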

Markovian approximations of the coalescent with recombination

Under the coalescent with recombination, the time to the most recent common ancestor at each genomic position depends on the history of the entire upstream sequence. Sequentially Markov coalescent models (SMC and SMC’) have relaxed this constraint to enable the development of extremely fast population-genetic inference methods, capable of analyzing entire genome sequences. However, theory is lacking for key properties of the Markovian models. Using a new model of coalescence at two fixed points along a pair of sequences, we derived the joint distribution of tree heights under SMC’, as well as other results. We found that SMC’ is an excellent approximation of the complete model. [1]

[1] Wilton et al., Genetics, 2015
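The sequential construction behind SMC and SMC’ can be sketched for a pair of sequences. The code below is a minimal illustration and not the derivation in [1]. It assumes time is measured in coalescent units, that rho is the scaled recombination rate per unit of genomic length, and that the tree height changes only at recombination points. Under SMC the detached lineage re-coalesces with the other sequence's lineage only; under SMC’ it may also rejoin its own former branch, which leaves the height unchanged. All function names are hypothetical.

```python
import random


def simulate_pair_heights(total_length, rho, model="SMC'", seed=0):
    """Return (segment_length, tree_height) pairs along two sequences."""
    rng = random.Random(seed)
    t = rng.expovariate(1.0)  # height at the left end, Exp(1)
    pos, segments = 0.0, []
    while pos < total_length:
        # Two branches of length t, each recombining at rate rho/2.
        seg = min(rng.expovariate(rho * t), total_length - pos)
        segments.append((seg, t))
        pos += seg
        if pos >= total_length:
            break
        u = rng.uniform(0.0, t)  # time of the recombination event
        if model == "SMC":
            # Old branch is deleted: coalesce with the other lineage, rate 1.
            t = u + rng.expovariate(1.0)
        else:
            # SMC': below t, two lineages are available (total rate 2).
            s = u + rng.expovariate(2.0)
            if s < t:
                # Half the time the lineage rejoins its own old branch,
                # in which case the height does not change.
                if rng.random() < 0.5:
                    t = s
            else:
                # Above t only the single ancestral lineage remains.
                t = t + rng.expovariate(1.0)
    return segments


def mean_height(segments):
    """Length-weighted mean tree height along the sequence."""
    total = sum(length for length, _ in segments)
    return sum(length * t for length, t in segments) / total


if __name__ == "__main__":
    for model in ("SMC", "SMC'"):
        print(model, mean_height(simulate_pair_heights(20000.0, 1.0, model)))
```

In both variants the height at any single position should follow the standard coalescent distribution for two lineages, with mean 1; the models differ in how heights at nearby positions are correlated, which is what the joint distribution at two loci captures.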

The joint density of the time to the common ancestor (t1 and t2) at two genomic loci separated by distance rho. Each panel shows the difference between the densities under the complete model (the coalescent with recombination) and one of the Markovian approximations, SMC or SMC’. From Wilton et al., Genetics, 2015.

Simulation results for the inference of the time of admixture between two populations. We first recorded, for each individual, the proportion of the genome coming from each ancestral population. We then used our newly developed expression for the distribution of ancestry proportions to infer the admixture time. The inferred time is plotted vs the true, simulated time.