Efficiently inferring the demographic history of many populations with allele count data

The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed “basal Eurasian” admixture event in human history. We implement and release our method in a new open-source software package momi2.


Introduction
All natural populations undergo evolutionary processes of migration, size changes, and divergence, and the history of these demographic events shape their present genetic diversity. Thus, inferring demographic history is of central concern in evolutionary and population genetics, both for its intrinsic interest (e.g., in dating the out-of-Africa migration of modern humans (Schaffner et al., 2005;Gutenkunst et al., 2009)) and also for biological applications (such as distinguishing the effects of natural selection from demography (Beaumont and Nichols, 1996;Boyko et al., 2008)). However, genetic sequence data and the space of possible demographic models are both very high dimensional objects, leading to numerous statistical and computational challenges when inferring demographic history from genetic data.
Demographic history can be inferred by fitting the observed value of the SFS to its expected value in a composite likelihood framework. The expected SFS can be efficiently computed when the demographic history is a tree, and in previous work we developed a method momi to compute the SFS of hundreds of populations related by a tree . However, natural populations are often related by a more complex history that is not tree-like, as gene flow (the exchange of migrants between populations) adds extra edges to the topology associated with the demographic history. In this case, computing the expected SFS is much more computationally demanding, and existing methods for computing the exact expected SFS can scale to only a handful of populations (Gutenkunst et al., 2009;Jouganous et al., 2017).
In this article, we extend our previous algorithm momi to handle discrete (or pulse) migration events between populations, in which case demographies are described by general directed acyclic graphs (DAGs). Our new method momi2 can compute the exact expected SFS with admixture for more populations than previously possible, and uses novel insights from a stochastic process known as the Lookdown Construction (Donnelly and Kurtz, 1996;Donnelly et al., 1999). In addition, momi2 utilizes automatic differentiation (Corliss et al., 2002;Bhaskar et al., 2015) to compute gradients of the SFS, which we use to efficiently search the parameter space during optimization. Finally, momi2 can efficiently compute linear functionals of the SFS, which we exploit to compute the expected values of a number of standard population genetic summary statistics under complex demography.
The rest of this paper as organized as follows. In Section 2 we provide some background and survey related work. Section 3 describes our method, first with an illustrative example in Section 3.1, and then with formulas and pseudocode in Section 3.2. Finally, in Section 4, we apply our method to an 8-population SFS, including ancient and contemporary human populations, to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We defer all proofs to the Appendix.

Background
Suppose a sample of = ( 1 , … ,  ) genomes have been sampled from  "demes" or populations. The positions in the genome where the samples are not all identical are called segregating sites. In most organisms mutations are rare; most sites are not segregating. It is therefore reasonable to assume, as we do from now on, that each position in the genome has experienced at most a single mutation in its history, and that each individual can be labeled as having the "ancestral" or "derived" (mutant) allele at each segregating site. In population genetics, this simplifying assumption is known as the infinite sites model.
The sample frequency spectrum (SFS) is a -dimensional array [f ] ∈ ℤ ( 1 +1)×⋯×(  +1) whose entry f counts the number of segregating sites with exactly copies of the derived allele and − copies of the ancestral allele, where = ( 1 , … ,  ) ∈ ℕ  0 with 0 ≤ ≤ for each = 1, … , . Note we only consider segregating sites with 2 alleles, so f = f = 0 by definition. Compared to the full data set (i.e., the complete genetic sequences of all = 1 + ⋯ + genomes), the SFS [f ] is a compressed, lowdimensional summary which nevertheless preserves much of the signal about the various population size changes, divergence times, and admixture events that occurred over the course of the populations' history.

Demographic events
The expected multipopulation SFS can be obtained by integrating over random genealogies formed by a backwards-in-time stochastic process known as the structured coalescent (Kingman, 1982;Takahata, 1988;Notohara, 1990). Before getting to the technical details, we first review the basic dynamics of this process in order to build intuition for how the data have power to infer population splits, size changes, and gene flow events. See Durrett (2008) for a more detailed introduction.
Informally, the topology and branch lengths of genealogies are affected by a demographic history in two ways: 1. Two lineages may not coalesce into a common ancestor until they reside in the same population, and the time until this occurs is affected by migration patterns and population split times.
2. At any particular point in time, two members within the same population are more likely to have a common parent if the population size is small; so, for example, residents of a small village will typically be more closely related than residents of a large city.
Regarding the second point, we define the scaled effective population size ( ) such that the rate at which any two lineages find a common ancestor at time is 1∕ ( ). Under the simplest random mating model (the Wright Fisher model, cf. Durrett (2008)), the census population size exactly equals , where is the number of generations per unit time; more generally, scales with the number of breeding individuals in the population. Thus, estimating allows us to infer size change events such as bottlenecks, exponential growth, and population crashes.
If we could observe the true distribution of genealogies, then we could directly infer demographic history from the waiting times between coalescence events, following the principles listed in items 1 and 2 above. However, since genealogies are never directly observed, we must make inferences about demographic history indirectly using mutation data.

Likelihoods and the site frequency spectrum
Consider a genome with positions and mutation rate per position. So at any given position in the genome, mutations arise on the tree there as a Poisson point process with rate . The chance of 2 or more mutations at a single position is ( 1 2 ), and taking the limit → ∞ we arrive at the aforementioned infinite sites approximation (Kimura, 1969;Durrett, 2008), which assumes that each segregating site was caused by a single mutation, so that each allele may be labeled as ancestral or derived.
The observed segregating sites are not independent, because trees at neighboring positions are correlated. Unfortunately, even in the simplest case of a single population with constant size, an analytic expression for the likelihood of mutation data at a set of linked (non-independent) sites is not known (Bhaskar et al., 2015). Therefore, SFS data are generally used with composite likelihood methods. Recall that f is the total number of segregating sites with derived allele count pattern = ( 1 , … ,  ). Define = 1 [f ], i.e., is the expected frequency of per unit mutation rate. Equivalently, is the expected branch length subtending leaves in a random coalescent tree (i.e., with 1 descendants in population 1, 2 in population 2, and so on). A commonly used composite likelihood is the Poisson random field model (Sawyer and Hartl, 1992), which assumes that the total number of segregating sites is Poisson with rate ∑ , and that the patterns at the observed sites are independent with sampling probabilities proportional to ; this yields a log-likelihood of where ‖f ‖ 1 ∶= ∑ f and ‖ ‖ 1 ∶= ∑ . Demographic history can then be inferred by searching for the parameter values that maximize.

Existing work and our contribution
The composite log-likelihood in (1) requires us to compute , the expected branch length subtending . Let be a random genealogical tree sampled under the demography, and ( ) be the total length of all branches in subtending , so that The integral (2) is difficult since the support of is at least as large as the number of labeled binary trees with leaves, a quantity which grows faster than exponentially in the sample size . Consequently, a number of different methods have been proposed for evaluating (or approximating) the integral (2). Sampling-based methods include Markov chain Monte Carlo (Griffiths and Tavaré, 1997;Nielsen, 2000), importance sampling (Stephens and Donnelly, 2000;De Iorio and Griffiths, 2004), simulation (Excoffier and Foll, 2011;Excoffier et al., 2013), and ABC (Wegmann et al., 2010). The main advantage of these methods is their flexibility: since, as mentioned above, computing the likelihood of the data is trivial conditional on , these methods can be used on a rich class of models. However, the high dimension of makes it impractical to compute by sampling unless  is small. In particular, since the support of grows like (  ), Monte Carlo methods will assign zero mass to configurations that are actually observed in the data.
A second approach, implemented in the software (Gutenkunst et al., 2009), computes by numerically solving PDEs arising from the Wright-Fisher diffusion (Ewens, 2004), which is dual to the coalescent process described above. For  populations, this involves numerically solving a -dimensional integral. The initial method in Gutenkunst et al. (2009) could handle up to  = 3 populations; subsequent improvements (Lukić and Hey, 2012;Jouganous et al., 2017) extended this to  = 4 and then  = 5 populations by using spectral representations or alternative basis functions for solving the PDEs.
The third approach for computing , which includes our method, integrates over the sample allele frequencies "backwards-in-time", exploiting conditional independence relationships to reduce computation. This involves considering the alleles of the sample's ancestors at different points in the demographic history, and integrating out these random variables via inference algorithms for probabilistic graphical models (Pearl, 1982;Felsenstein, 1981;Lauritzen and Spiegelhalter, 1988;Koller and Friedman, 2009). Bryant et al. (2012) and Chen (2012) computed the SFS for finite-and infinite-sites models using this backward-in-time approach under the coalescent. De Maio et al. (2015) and Kamm et al. (2017) substantially lowered the computational burden of this approach by replacing the coalescent with the continuous-time Moran model (Durrett, 2008), a stochastic process which induces the same sampling distribution as the coalescent, but using a much smaller state space. However, until now these Moran-based approaches have been limited to analyzing tree-shaped demographies without admixture between populations.
The main contribution of this paper is to extend our previous Moran-based method  to allow for demographies defined on general directed acyclic graphs (DAGs), thus allowing for admixture between populations. We describe and implement an algorithm for computing the expected infinite-sites SFS under the multipopulation Moran model with size changes, exponential growth, population splits, and point admixture events (i.e., instantaneous migration "pulses"). This substantially enlarges the space of demographies for which the expected SFS can be accurately computed.
Additionally, our algorithm computes not only individual SFS entries, but also linear functionals of the expected SFS. Specifically, our method computes rank-1 tensor products of the SFS in the same time as a single entry (general linear functionals are sums of these rank-1 products). A number of widely-used statistics in population genetics can be expressed as SFS functionals, and to our knowledge our method is the first to compute expectations of these statistics under complex demography.
We demonstrate our method by using it to infer the history of eight human subpopulations that have undergone multiple admixture events. We complement our theoretical contributions with an open-source, Shorthand for (0) , ( ) respectively "  Event tree  user-friendly software implementation that will enable practitioners to deploy our method. The software uses automatic differentiation (Corliss et al., 2002;Bhaskar et al., 2015;Maclaurin et al., 2015) to compute derivatives of the SFS, leading to efficient optimization and parameter inference. Our package, called momi2, is available for download at https://github.com/popgenmethods/momi2.

Method
In this section, we describe the algorithm implemented in momi2 for computing the expected SFS under complex demographies. We begin in Subsection 3.1 with an illustrative example that highlights the novel aspects of our work. Then in Subsection 3.2 we provide pseudocode for our algorithm, and state the formulas used by our algorithm as Propositions. The proofs of these Propositions, and the proof for the correctness of our algorithm, requires substantial additional notation, and we defer this to Appendix A.2. For ease of reference, the symbols we use are summarized in Table 1.    We now present the formulas used by Algorithm 1, in a series of lemmas that also prove Theorem 1. We start with a formula to compute ⇠ v from`v ,0 µ and the partial SFS at the child events C T (v).
The corresponding event tree  is a junction tree of the DAG in  (b); each vertex of  corresponds to a set of vertices from .

Example
Consider the model depicted in Figure 1. This model has 3 sampled (leaf) populations, related as follows.
Populations 1 and 2 are sisters, with Population 3 an outgroup to them; however a pulse of migrants from 3 to 2 occurs after the split between populations 1 and 2. At the leaves, we observe ( 1 , 2 , 3 ) = (2, 2, 1) samples, with ( 1 , 2 , 3 ) = (1, 1, 0) copies of the derived (blue star) allele. We wish to compute the expected number of mutations with pattern ( 1 , 2 , 3 ); to do this, we will integrate over the unobserved variables ( 4 , 5 , 6 , 7 , 8 ) which represent allele counts at certain internal positions within the demography. The random variables 1 , … , 8 are related to each other by the DAG  in Figure 1b, with an edge from population to if alleles may pass directly from into .
In the realization of Figure 1, the hidden blue allele counts are ( 4 , 5 , 6 , 7 , 8 ) = (3, 2, 0, 0, 0). The blue mutation occurs in the edge above 4 , spreading to 3 of the lineages. One copy of the blue allele moves on to 1 , while 2 copies move on to 5 . However, due to the admixture, 2 only inherits 1 blue allele from 5 , inheriting a red allele from the 2 red alleles at 6 . Under the infinite-sites assumption described above, the mutation observed at this site arose at a single point in the genealogical tree depicted in Figure 1. To compute the expected number of mutations with ( 1 , 2 , 3 ) = (1, 1, 0), we may condition on the population (i.e., the edge in Figure 1) on which it arose: We call 1 [# mutations at with = ], the "truncated SFS"; this only concerns events within a single population and can be computed using the method momi1. We refer the interested reader to  for further details.
The second term ℙ( 1 = 1, 2 = 1, 3 = 0 | mutation at with = ) gives the conditional likelihood of observing the data given a mutation and its allele count at population . In momi1, we computed this term in the case where  is a tree without admixture using the sum-product (also known as belief propagation) algorithm (Felsenstein, 1981;Koller and Friedman, 2009).
The main result of this work is to extend our previous dynamic program to the case where  is given by a DAG due to admixture. We will use a dynamic program that is essentially a kind of junction tree algorithm (Koller and Friedman, 2009). This algorithm works by decomposing a DAG graph into a tree in such a way that vertices in the tree correspond to collections of nodes in the original graph. Belief propagation is then applied to the tree decomposition.
We illustrate our algorithm using the example demography in Figure 1c. We call the tree decomposition  an event tree, because each internal node corresponds to either an admixture or split event. 1 We construct the event tree  , and compute the conditional likelihoods ℙ( 1 = 1, 2 = 1, 3 = 0 | mutation at with = ), as follows: 1. We initially start with a collection of 3 singleton sets of leaf populations: {{1}, {2}, {3}}. For = 1, 2, 3, we also keep track of conditional likelihoods of the data beneath ; since we are at the leaves, these are simply the Kronecker delta functions In addition, we compute [ℙ( 2 | 5 = , 6 = )] , , the conditional likelihood of the data beneath {5, 6} given the allele counts 5 , 6 . We obtain this by applying Lemmas 1 and 2 to the previous conditional likelihood [ℙ( 2 | 2 = )] . Specifically, we apply Lemma 1 to "lift" the conditional likelihood at population 2 to the time point immediately below the admixture event; then, we apply Lemma 2 to obtain ℙ( 2 | 5 , 6 ), the conditional likelihood at {5, 6} immediately above the admixture event.
3. The next event has the alleles from 6 splitting off from population 3. To process this event, we merge the clusters containing the relevant populations ({5, 6} and {3}); then we remove 3,6 and replace them with their parent population 7, to obtain {5, 7} as the parent cluster of {5, 6} and {3} in  . After this stage, our collection of sets becomes {{1}, {5, 7}}.
5. The final event has populations 4 and 7 merging into population 8. We replace the cluster {4, 7} with its parent cluster {8} at the root of  .
To compute ℙ( 1 , 2 , 3 | 8 ) from the child likelihoods ℙ( 1 , 2 , 3 | 4 , 7 ), we first apply Lemma 1 to lift the likelihoods immediately below the split event, and then apply Lemma 4, which computes the conditional likelihood at the parent of a split when the children belong to the same cluster ({4, 7} in this case).
Remark. The observant reader may have noticed that in Figure 1, population 8 has only 8 = 5 alleles, despite its children having 4 + 7 = 7 > 5 alleles in total. This is due to the fact that there are only 5 alleles at the leaves, so there are at most 5 ancestors in any population at any point; thus, when populations 4 and 7 merge into population 8, we can drop two of the tracked lineages. To show this formally, we use a stochastic process called the lookdown construction, which is a version of the Moran model with a countably infinite number of lineages. However, this analysis requires a great deal of additional notation and is not essential to the remainder of the text, so we defer it to Appendix A.1.

Algorithms and formulas
We now describe the algorithm to construct the event tree  . We assume that the population graph  has two types of topological events: 1. Population split: two child populations , split from each other; their parent population is . Looking backward in time, , merge and become the population .
2. Population admixture: a single child population , inherits from exactly two parent populations , , with the probability that an allele comes from ( , respectively) being (1 − , respectively).
Note that more complicated events, such as trifurcating splits or symmetric pulse migrations, may be expressed as a succession of these 2-way split and admixture events.
We provide pseudocode to construct the event tree  in Algorithm 1. In words, we initially start with  equal to a collection of singleton sets corresponding to the leaves of . Processing each split or admixture event back in time, we merge all blocks containing the child population(s) of . Then we remove the child population(s) from this merged block, and add in the parent population(s). Within  , the new merged block is the parent of the blocks removed at this stage.
We now describe Algorithm 2, the dynamic program to compute the joint SFS. We need a bit more notation. For population , let be the number of samples with ancestry in ; we will be keeping track of lineages within population . Let denote the amount of time between the top and bottom events of , and for let ( ) denote the scaled population size of at time ∈ [0, ] above the base of . Let 0 ≤ ( ) ≤ be the allele count within the lineages of at time above its bottom, so (0) = in our earlier notation. At event in  , let = { 1 , … , | | } be the corresponding block of populations in . We define the conditional likelihood at as where ( ) = ( is the vector of allele counts in populations at times , and Leaves( ) is the observed data at the leaves beneath .
In addition, we define the "partial SFS" as the expected branch length at or below subtending leaves. Note that Root( ) gives the desired final result, and corresponds to equation (3) in the previous subsection.
For the remainder of the subsection, we will fix Leaves() = , and drop the dependence on in , , and . Algorithm 2 defines a dynamic program (DP) over the conditional likelihoods , and partial SFS . The DP takes input vectors 1 , … ,  corresponding to the leaf populations Leaves() = {1, … , }. If the inputs 1 , … ,  are set to indicator vectors corresponding to the observed counts 1 , … ,  , the DP of Algorithm 2 will return the corresponding SFS entry, as stated in the theorem below: = (0, … , 1, … , 0) ∈ ℝ +1 denotes the vector with 1 at coordinate and 0 elsewhere.
We now present the formulas used by Algorithm 2, in a series of lemmas also used to prove Theorem 1. We start with a formula to "lift" ℙ(… | … , (0) , …) up to ℙ(… | … , ( ) , …). That is, this formula transforms a likelihood conditioned on (0) , the allele count at the bottom of , into a likelihood conditioned on ( ) , the allele count at the top of .
if 1 ≠ 2 then 15: where ( ) ∈ ℝ ( +1)×( +1) is the transition rate matrix of the Moran model with lineages; in particular, ( ) = ( ( ) ) 0≤ , ≤ and To process event , we first use Lemma 1 to lift up the conditional likelihoods at the child populations, up to the time of . We then apply one of Lemma 2, Lemma 3, or Lemma 4, to obtain the conditional likelihood return Root( ) ⊳ Return partial SFS at the root event 17: end procedure at from the lifted child likelihoods, depending on whether is an admixture or split event, and whether the child populations of fall into a single cluster or two independent clusters.
We first consider admixture events; Lemma 2 describes how to compute the conditional likelihood in this case. Let the child population be and the parent populations be 1 , 2 . Each of the lineages in independently inherits from 1 with probability 1 , or from 2 with probability 2 = 1 − 1 . So the number of lineages inheriting from 1 is Binomial( , 1 ). Then, given that 1 alleles are inherited from 1 and 2 = − 1 inherited from 2 , the particular alleles inherited from 1 or 2 are chosen by sampling without replacement.
Lemma 2. (Admixture event) Let be an admixture event, with child population and parent populations 1 , 2 . Let ′ be the child event in  . Suppose each lineage in comes from 1 with probability 1 , and from 2 with probability 2 = 1 − 1 . For the population cluster at , let ∩ be a vector of allele counts on ⧵ { 1 , 2 }. Then the conditional likelihood of allele counts ∩ + 1 1 + 2 2 at is given by We next consider a split event , with parent population and child populations 1 , 2 . We first consider the case where has 2 distinct children in  , i.e., 1 , 2 fall into 2 distinct blocks beneath . Denote the corresponding child events as ′ 1 , ′ 2 respectively. Then the conditional likelihood at is given by a convolution of the conditional likelihoods at ′ 1 , ′ 2 , as described in Lemma 3: Lemma 3. (Population split, 2 Now consider the case where the population split has just 1 child event ′ . That is, the child populations 1 , 2 fall into the same cluster beneath . Then Lemma 4 describes how to obtain the conditional likelihood at from the conditional likelihood at ′ . This involves summing over the dimensions corresponding to 1 , 2 within the conditional likelihood ′ , 1 1 + 2 2 . In addition, note that we may have < 1 + 2 (recall is the number of samples with ancestry in ). That is, after merging 1 , 2 backwards in time, we may be keeping track of more alleles than originally sampled, allowing us to "drop" some extraneous non-ancestral lineages, as illustrated in the root population of Figure 1. This is done by multiplying the pseudoinverse of a matrix whose entries are hypergeometric probabilities.

Then the conditional likelihood at is given by
with + denoting the Moore-Penrose pseudoinverse of .
Finally, having computed the conditional likelihoods at event , we wish to compute the partial SFS at the event. This is given by the recursive formula in Lemma 5, which involves the conditional likelihood at , the expected number of mutations arising within each parent population , and the partial SFS at the child events.

Lemma 5. For an event , let new =
⧵ ∪ ′ ≺ ′ be the populations newly formed at (i.e., formed by a population split or admixture at ). For ∈ new , let ( ) be the truncated SFS in population , which can be computed by the formulas in . Then the partial SFS at is given by

Normalizing constant and other linear functionals
To compute the probability ‖ ‖ 1 of a mutation having observed allele counts , we need not just , but also the normalizing constant ‖ ‖ 1 = ∑ the expected total branch length. Computing ‖ ‖ 1 directly is inefficient because of the ( ∏  =1 ) possible entries . Instead, we can use Algorithm 2 to compute ‖ ‖ 1 , and many more statistics of the SFS, in the same time as a single entry:

Corollary 1 says that for any rank-tensor
can be computed in calls to DP( 1 , … ,  ). 2 In particular, the expected total branch length ‖ ‖ 1 is given by with the vector with 1 at every coordinate.
Beyond the applications we explore here, we expect this result to be useful in a number of related settings. A number of population genetic statistics can be expressed as f ⊙ , including Watterson's estimator̂ of the mutation rate (Watterson, 1975), Fay and Wu's statistic for positive selection (Fay and Wu, 2000), and Patterson's 2 , 3 , 4 statistics for assessing topology (Patterson et al., 2012). Corollary 1 allows us to compute their expected values [f ⊙ ] = ⊙ , and to construct test statistics from the deviance f ⊙ − ⊙ under an appropriate null model. An even wider class of population genetic statistics can be written as nonlinear functions of SFS-tensor products like ( 1 f ⊙ 1 , 1 f ⊙ 2 , …); this class includes Tajima's statistic for selection (Tajima, 1989), the statistic for population structure (Holsinger and Weir, 2009), and Patterson's statistic for introgression (Patterson et al., 2012). These statistics may be viewed as plug-in estimators for ( ⊙ 1 , ⊙ 2 , …), which we can compute with Corollary 1. Note that these estimators are biased due to the nonlinear function , but the bias can be estimated via block jackknife, and will typically be small since 1 f ⊙ → ⊙ almost surely as the number of independent SNPs grows.
Another interesting linear statistic of the SFS that can be computed with Corollary 1 is [ MRCA ], the expected time of the most recent common ancestor. In particular, let be any leaf population; for simplicity assume is sampled at the present (i.e. is not archaic). Then To see this, note that MRCA is proportional to the expected number of mutations hitting an arbitrary lineage in , and if a mutation has configuration = ( 1 , … ,  ) derived copies, then the chance of hitting the lineage is (Zeng et al., 2006).

Application
We tested our method on a demographic inference problem in human genetics that is currently of interest. Lazaridis et al. (2014) showed that genetic variation in present-day Europeans suggests an admixture model involving three ancestral meta-populations: Ancient North Eurasian (ANE), Western Hunter Gatherers (WHG), and Early European Farmers (EEF). They also showed that EEF contains ancestry from a source that is an outgroup to all non-African populations, and yet shares much of the genetic drift common to non-African populations; they dubbed this ancestry component as "Basal Eurasian" ancestry. Later work (Lazaridis et al., 2016) showed that Basal Eurasian ancestry is shared by ancient and contemporary Middle Eastern populations, and is correlated with a decrease in Neanderthal ancestry, implying that Basal Eurasian ancestry contains lower levels of Neanderthal admixture when compared with non-Basal ancestry. The results from Lazaridis et al. (2014Lazaridis et al. ( , 2016 were based on several related methods for modeling covariances in population allele frequencies, most notably qpGraph and qpAdm (Patterson et al., 2012;Haak et al., 2015). These methods are computationally efficient and robust, but are unable to infer the timing of demographic events.
We applied momi2 to estimate the strength and timing of basal Eurasian admixture into early European farmers, and the split time of the basal Eurasian lineage. To do this, we built a demographic model relating 12 samples from 8 populations, shown in Figure 2. These samples consisted of the Altai Neanderthal ; the 45,000 year old Ust'Ishim man from Siberia (Fu et al., 2014); 3 present-day populations (Mbuti, Sardinian, Han) with 3, 2, and 2 samples respectively; and 3 ancient samples representing the European ancestry components identified by Lazaridis et al. (2014): a 7,500 year old sample from the Linearbandkeramik (LBK) culture (representing EEF), an 8,000 year old sample from the Loschbour rock shelter in Luxembourg (representing WHG), and the 24,000 year old Mal'ta boy ("MA1") from Siberia (representing ANE). After data cleaning, our dataset consisted of 2.4 × 10 6 autosomal transversion SNPs. See Appendix A.3 for more details about the data.
To construct the topology of the model in Figure 2, we first obtained a tree by neighbor joining (Saitou and Nei, 1987), then added 3 extra admixture events reflecting prior knowledge, as well as a recent Neanderthal population decline starting at the Mbuti-Eurasian split. We inferred split times, population sizes (including the Neanderthal decline), and admixture times and proportions by maximizing a composite likelihood , given by the product of the likelihoods at every SNP: where ℙ( ) ∝ and was computed by momi2. The low-coverage of the MA1 sample and the deep divergence of the Neanderthal sample may cause bias in SFS entries where only these samples contain derived alleles; we thus excluded these entries and corrected the normalizing constant appropriately (see Appendix A.3).
We used automatic differentiation to compute the gradient ∇ log , which we used to search for the optimum of log . We constructed nonparametric bootstrap confidence intervals by splitting the genome into 100 equally sized blocks, resampling these blocks to create 300 bootstrap datasets, and re-inferring the demography for each bootstrap dataset. We also used 300 parametric bootstraps to assess how well we could infer the demography under simulated data; for each parametric bootstrap dataset, we used msprime (Kelleher et al., 2016) to simulate ten 300 Mb chromosomes from our initial point estimate, and re-inferred the demography. Note the nonparametric bootstrap is better able to account for model misspecification, and we use it for all confidence intervals reported below.

Pulse Probability
Figure 2: Inferred model and bootstraps for the 11 population demography described in Section 4. In the foreground (blue) is our point estimate from maximum composite likelihood; in the background (gray) are 300 bootstrap reestimates, which were created by splitting the data into 100 equally sized contiguous blocks, resampling these blocks with replacement, and refitting the model. The y-axis is linear below 5 × 10 4 , then follows a logarithmic scale above 5 × 10 4 .
Our inferred demography, along with nonparametric bootstrap re-estimates, are shown in Figure 2 and Table 2. Our parametric bootstrap estimates are shown in Figure 3. We inferred a pulse of 0.094 (95% CI of 0.049-0.174) from the ghost Basal Eurasian population to EEF ancestry (LBK), substantially less than the 0.44 inferred by (Lazaridis et al., 2014). This admixture was inferred to occur 33.7 kya (95% CI of 10.8-41.1 kya), shortly after the Loschbour-LBK split at 37.7 kya (95% CI of 32.2-42.3 kya). The split time of the ghost Basal Eurasian lineage from other Eurasians was inferred at 79.8 kya (95% CI of 67.4-101 kya). Other parameters were broadly in line with previous estimates, such as a Mbuti-Eurasian split of 96 kya, a Han-European split of 50 kya, a Neanderthal split of 696 kya, and Eurasians deriving 0.03 of their ancestry from Neanderthal Green et al., 2010;Meyer et al., 2016).
Inferring the optimal demography from start to finish took 2.5 hours on a laptop with 4 CPU cores, and used 2 GB RAM. The 300 bootstraps were run separately on a high-performance compute cluster. To our knowledge, no other method can infer this demographic model using the full SFS. The moments software package (Jouganous et al., 2017) is capable of computing the SFS for up to 5 populations, less than the 8 populations here, though it can scale to more individuals per population than momi2. While the fastsimcoal2 1.56e+03 −1.32e+04 1.07e+04 964 3.49e+04 Table 2: Estimated parameters of the demography in Figure 2, along with nonparametric bootstrap estimates of the bias and standard deviation, and 95% bootstrap quantiles. We use (A,B) to denote the ancestor of A and B; and to denote the height and size at vertex ; and A→B and A→B to denote respectively the time and strength of an admixture arrow from A to B.
software package (Excoffier et al., 2013) is capable of handling demographies of this size and larger, it doesn't compute the full, exact SFS, and also does not include an option for the ascertainment scheme we use here (excluding mutations private to Neanderthal and MA1).
We also used our model to produce estimates of the human mutation rate, which we estimated as 1.22×10 −8 per base per generation, with a bootstrap-quantile 95% CI of (1.12, 1.32) × 10 −8 , closely matching previous estimates of the human mutation rate (Scally, 2016). To obtain this estimate, we compared the observed nucleotide diversity with that expected under the inferred demography, adjusting by the empirical transition to transversion ratio (Appendix A.3). Estimating the mutation rate was possible here because we did not use a prespecified mutation rate to estimate the model in Figure 2, instead using the known ages of the Ust'Ishim, LBK, Loschbour, and MA1 samples to calibrate dates. M b u t i B a s a l E u r a s i a n L B K S a r d i n i a n L o s c h b o u r M A 1 H a n U s t I s h i m N e a n d e r t h a l 0 10 4 5 × 10 4   Figure 2. The inferred demography (blue tree) was used to simulate 300 bootstrap replicates. Re-running our method on each replicate produced a new inferred demography which is plotted in gray.

A.1 Model and notation
In this section, we formally define the stochastic process underlying our model, and introduce some additional notation needed for the proofs in Section A.2.
The main stochastic process we use is the lookdown construction of the Moran model (Donnelly and Kurtz, 1996;Donnelly et al., 1999), a variant of the Moran model where copying only occurs in one "direction" (as in Figure 4). We will make use of certain conditional independence properties that result from this oneway copying. However, we note that there is a simple coupling between the lookdown and standard Moran models, and these two models generate data with the same distribution.
We now describe our lookdown model in more detail. Within each of the  leaf populations we consider a countably infinite number of lineages, each with a unique label in ℤ + . We arbitrarily assign the ( 1 , … ,  ) sampled lineages to the lowest labels {1, 2, … , tot }, where tot ≡ ∑ 

=1
, and arbitrarily assign the remaining unsampled lineages to labels { tot + 1, tot + 2, …}. Each lineage extends infinitely backwards in time, and at each admixture event, each lineage randomly chooses the parent population it extends into. In addition, each lineage has an allele, which changes through time due to mutation and copying events. Similar to the usual Moran model, copying between a pair of lineages in population occurs at rate 1 ( ) , where ( ) is the scaled population size of . However, copying only occurs in one direction, from lineages with lower labels to lineages with higher labels (Figure 4). So if two lineages are labeled and respectively, with < , then the copying event → occurs at rate 1 ( ) , whereas the reverse copying event ← never happens (has rate 0). By contrast, in the standard (non-lookdown) Moran model, both → and ← copying events happen at rate 1 2 ( ) . However, under both models, two lineages going backwards in time coalesce at rate 1 ( ) , as in the coalescent. Thus, the sample genealogies under both models follow the same multipopulation coalescent distribution.
Denote the labeled lineages in population , and their alleles at height , by where  , ,( ) = (label ,( ) , allele , ,( ) ) ∈ ℤ + × {0, 1} is a pair, consisting of the th lowest label at , along with its corresponding allele at height ∈ [0, ). Note that we have ordered the lineages by their labels, so that label ,( ) < label ,( +1) . Also note that we measure the time from the bottom of , so that allele ,0,( ) denotes the th allele at the bottom of , and allele , ,( ) the th allele at the top of . Let ( ) , = ∑ =1 { , ,( ) derived} denote the number of derived alleles among the lowest lineages at , . Let = ∑ ⪰ be the number of samples in leaves below . During computation, we will only need to keep track of the lowest lineages in , so we denote ( ) ≡ ( ) , to be the number of derived alleles among the first lineages at , .

A.2 Proofs
The following two lemmas will be useful for several of the proofs below: Lemma 6. The distribution of  , is invariant to finite permutations of the labels within any population ∈ . Furthermore, the labels are independent of the alleles.
Proof. By construction, none of the lineages within the populations in are ancestral to each other. Thus the sample genealogy of any finite subsample of the lineages is the multipopulation coalescent, because going backwards in time, coalescence (copying) between each pair of lineages occurs at rate 1 ( ) at vertex time . The invariance to permutation, and independence of alleles and labels, follows from the coalescent.
with the second equality because higher lineages cannot copy to lower lineages (so Leaves( ) ⟂  , , |  , , ), and the third equality because of the exchangeability and independence from Lemma 6 (so given ( ) , the alleles of  , , are ordered by a uniform permutation independent of ( ) , , and the labels of  , , are independent of ( ) , ( ) , ).

A.2.1 Proof of Theorem 1
Lemmas 2,4,3 give formulas for the partial likelihood , using terms like ′ , ′ ′ , where ′ ∈ Children( ) and is an admixture, 1-cluster split, or 2-cluster split, respectively. The terms ′ , ′ ′ in turn can be computed from the partial likelihoods ′ , ′ using Lemma 1; then Lemma 5 provides a formula for in terms of , and the partial SFS ′ . Algorithm 2 traverses the event tree  in a depth-first search, applying Lemmas 5,1,2,4,3 to compute , , at each event from their values at the children of . The input to Algorithm 2 are the likelihoods { }, at the leaves, and the output is the partial SFS Root( ) at the root. Thus, since it follows that DP( 1 , … ,  ) = Root( ) = .

A.2.2 Proof of Lemma 5
Without loss of generality assume = 1. Then is the expected number of mutations at or below with observed counts Leaves( ) . We split into two parts: the mutations within new = ⧵ ∪ ′′′ ≺ ′ (i.e. the populations formed by a split or join at ); and the mutations that occur in ∪ ′ ≺ ′ , the populations that arise strictly below .
The first part, of mutations at the new populations of , is given by since ( ) is the expected number of mutations arising at with = , and , = ℙ( Leaves( ) | = ). For the second part, of mutations strictly below , we split into three cases: either is a leaf event, or has a single child event, or has two child events. In the first case, if is a leaf event, then no mutations can occur below . In the second case, if has a single child event ′ , then the expected number of mutations strictly below is simply ′ by definition. Finally, if has two child events ′ 1 , ′ 2 , then ′ 1 and ′ 2 share no leaves, so a mutation underneath ′ 1 will have no derived alleles in Leaves( ′ 2 ) and vice versa. Thus, the number of mutations strictly below is

A.2.3 Proof of Lemma 1
Define a "quasi-lookdown" Moran model  * , which is identical to , except within the lowest lineages of , where we allow copying in both directions at rate 1 2 ( ) (as in the non-lookdown Moran model). For an event with populations = ( 1 , 2 , …) and corresponding times = ( 1 , 2 , …), let ( ) = 2 , …) be the corresponding allele counts. Next, define ⪯ , = { ′ } ( ′ , )⪯( , ) as the sample path of allele counts below , , where ( ′ , ) ⪯ ( , ) if either is above ′ , or = ′ and ≤ component-wise. It will suffice to show ℙ  ( ⪯ , ) = ℙ  * ( ⪯ , ), because then for = + − , follows from a coupling argument. Let  ⪯ , = { ′ , } ( ′ , )⪯( , ) the sample path of  below , . We can map the partial sample paths of  * ⪯ , onto those of  ⪯ , as follows: moving from the bottom to the top of , whenever a lower label is copied over by a higher label, swap the labels of the lineages above the copying. Then the relabeled sample path has the same distribution as the lookdown construction, since the allele with the higher label is always copied over, and the rate of copying between pairs of lineages is 1 ( ) . Since this relabeling also leaves ( ) unchanged, we have ℙ  ( ⪯ , ) = ℙ  * ( ⪯ , ).

A.2.6 Proof of Lemma 3
Notice that , −1 + −2 + = ∑ 1 , 2 ∶ 1 + 2 = ℙ( with the first equality from the Markov property of the Moran process, and the second equality following from the exchangeability of the alleles at vertex (Lemma 6).  Table 3: Populations and samples used for the example application in Section 4. We only used 1 random allele for MA1 due to low coverage. The ages of the samples are given in thousands of years ago (kya).
Similarly, (8) expresses , as a product of ′ 1 , 1 and ′ 2 , 2 , where ′ , is a multilinear function of { ∶ ∈ Leaves( ′ )} by induction, and furthermore Leaves( ′ 1 ) ∩ Leaves( ′ 2 ) = ∅ and Leaves( ′ 1 ) ∪ Leaves( ′ 2 ) = Leaves( ). Thus , is a multilinear function of 1 , … , . Thus, each of the operations (6), (7), (8), (9) in Algorithm 2 preserves multilinearity of , , and thus each , is a multilinear function of its leaf vectors by induction. Finally, a similar induction argument shows that each is a multilinear function of { ∶ ∈ Leaves( )}, since (10) expresses as a sum of multilinear functions by induction. Table 3 gives the populations and samples we used for our example application in Section 4. All samples except for MA1 were taken from the SGDP dataset . We added on the additional lowcoverage MA1 sample (Raghavan et al., 2014) to represent the Ancient North Eurasian (ANE) component of European ancestry. Due to the low-coverage of MA1, we only sampled a single random allele from it at each site, using a GATK pileup (McKenna et al., 2010) and restricting ourselves to reads with quality ≥ 30.

A.3 Application supplement A.3.1 Data
We ascertained SNPs at SGDP filter level 1, then used the genotype calls at filter level 0 at the ascertained SNPs. When ascertaining SNPs, we limited ourselves to sites that were polymorphic among the samples excluding MA1 and Neanderthal. We excluded MA1 during ascertainment due to its low coverage. We also excluded the Neanderthal sample during ascertainment because it had substantially fewer new mutations than expected based on its age; it was unclear whether this was due to changes in the mutation rate on that lineage since its deep split with modern humans, or whether this was an artifact of the SNP calling strategy used by SGDP. We used Corollary 1 to correct the normalizing constant of the SFS due to excluding MA1 and Neanderthal during ascertainment; in particular, we normalized the SFS by the total branch length of the subtree excluding MA1 and Neanderthal.
To avoid biases in ancient DNA caused by deamination (Dabney et al., 2013), we removed all transitions (i.e. A↔G and C↔T mutations), keeping only the transversions. We used Chimp as a proxy for the ancestral allele, removing all sites where the Chimp allele was missing. After data cleaning, we were left with 2,444,888 autosomal transversion SNPs that were segregating among the samples excluding MA1 and Neanderthal.
while the expected value of the nucleotide diversity per unit mutation rate is This yields a mutation rate estimatê for each population, given bŷ =̂ . We then averaged over̂ to obtain our combined estimatê = 1  ∑  =1̂ . To obtain the per-site mutation rate we need to divide by the number of bases after our data cleaning process. We approximated this as follows: at SGDP filter level 1, there are about 2.13 × 10 9 sites per individual; multiply this by 0.93 due to excluding the sex chromosomes; multiply this by 0.32 due to only using transversions; finally, multiply by 0.93 to account for excluding sites missing the Chimp allele, and multiply by 1.1 to account for using genotype calls at filter level 0. The latter 2 factors (accounting for sites missing the Chimp allele and for adding in genotype calls at filter level 0) we estimated by observing how the number of heterozygotes per individual changed after these data cleaning steps.