Modelling mitochondrial site polymorphisms to infer the number of segregating units and mutation rate

We present a mathematical model of mitochondrial inheritance evolving under neutral evolution to interpret the heteroplasmies observed at some sites. A comparison of the levels of heteroplasmies transmitted from mother to her offspring allows us to estimate the number Nx of inherited mitochondrial genomes (segregating units). The model demonstrates the necessity of accounting for both the multiplicity of an unknown number Nx, and the threshold θ, below which heteroplasmy cannot be detected reliably, in order to estimate the mitochondrial mutation rate μm in the maternal line of descent. Our model is applicable to pedigree studies of any eukaryotic species where site heteroplasmies are observed in regions of the mitochondria, provided neutrality can be assumed. The model is illustrated with an analysis of site heteroplasmies in the first hypervariable region of mitochondrial sequence data sampled from Adélie penguin families, providing an estimate Nx and μm. This estimate of μm was found to be consistent with earlier estimates from ancient DNA analysis.


INTRODUCTION
It has recently been found that estimates of the human mitochondrial mutation rate from pedigree data are many times higher than those estimated from sequence divergence at putatively neutral sites (e.g. Parsons et al. 1997;Howell et al. 2003;Santos et al. 2005). One possible reason for this is that heteroplasmy (variation among different organelle genomes within the same individual) may be maintained for many generations, after the origin by mutation of a new variant, provided that there is not a very small bottleneck in the number of organelle genomes during transmission from parent to offspring (reviewed by Birky 1991). The persistence of heteroplasmic mutations for many generations after their origin may lead to an inflated contribution to a pedigree-based estimate of mutation rate, if they are treated as mutations that have become fixed within individuals. We propose a method for estimating the mutation rate and size of the transmission bottleneck for maternally transmitted mitochondrial genomes. We illustrate this using the dataset of Millar et al. (2008) on the Adélie penguin. Millar et al. (2008) sequenced a segment of 344 base pairs in a fast evolving region of first hypervariable region (a region they argued was not under strong selective pressure) from 1931 Adélie penguins (508 families) at Antarctic nesting sites. This was a follow-up to the study of Lambert et al. (2002) of ancient and contemporary samples of the same region of mitochondrial DNA. In the data reported in Millar et al. (2008), a low proportion of the sites in individual birds exhibited heteroplasmies, where two bases were called on the electropherogram at that site, each with a signal greater than qZ0.23 of the total signal. Signals below the detection threshold q, could not be reliably distinguished from background noise and were not reported there. Fifty-five mothers (out of 508) exhibited site heteroplasmies (above q), with seven having two heteroplasmic sites. Hence, the proportion of sites with an observed heteroplasmy across all mothers waŝ b Z 48 C 2 !7 508 !344 Z 3:55 !10 K4 : ð1:1Þ In all but three cases, the observed site heteroplasmies persisted above the threshold in both mother and chicks. Site heteroplasmies in a chick were only checked at sites where a heteroplasmy had been observed in its mother. Following Lambert et al. (2002) and Millar et al. (2008), we define mutation rate as the rate at which a base substitution is incorporated into all mitochondrial genomes of an individual. Previous studies linking mutation rate with observed site heteroplasmies include two studies (Brandstätter et al. 2004;Santos et al. 2005) of human populations of up to three generations, where the mutation rate was estimated from assuming that each observed site heteroplasmy represented a transitional substitution. We find most of the substitutions giving rise to an observed site heteroplasmy are subsequently lost, and those that eventually become fixed may persist as an observed site heteroplasmy for many generations and be oversampled.
For our study, we follow the maternal ancestry of an individual, tracking the trajectory of a site substitution introduced into the germ line by some ancestor. At each generation, the population of mitochondrial genomes will pass through a bottleneck when the oocyte segregates a small number of its mother's genomes. Lambert et al. (2002) argued that this region of the mt-genome is not under strong selective pressure, so our model assumes random drift among the N x segregating genomes.
This model should be applicable for any region of the mt-genome assumed neutral under selection for any eukaryote species.
One contribution of 11 to a Special Feature on 'Whole organism perspectives on understanding molecular evolution'.

MATERIAL AND METHODS
Our model focuses on an individual site of the mitochondrial genomes in the maternal ancestry of one individual. When a heteroplasmy (comprising two different nucleotides, generically X, Y) is observed at a site, we presume, this is a consequence of a somatic substitution X/Y or Y/X at that site in a maternal ancestor some g generations prior, inherited by a daughter, which has persisted for g generations.
Consider the maternal ancestry of an individual A 0 , with A 1 its mother and A 2 its maternal grandmother, etc. Suppose a nucleotide substitution X/Y has occurred at a site in one genome of a maternal line ancestor A g ( gR1), which is inherited by A gK1 . Our model shows that the probability of an additional substitution inherited at that site among A gK1 , ., A 0 is less than 10 K2 , so we will neglect this possibility. Suppose this site heteroplasmy persists for (exactly) kR1 generations, inherited by A gK1 , ., A gKk , but not by A gKkK1 . If kRg, then A 0 will be heteroplasmic at that site, although not necessarily observable. If k!g, A gKkK1 , ., A 0 are not heteroplasmic at that site, with their genomes either containing all X or all Y.
We propose a model where each oocyte recruits N x mitochondrial genomes independently from the population of its mother's genomes. (The number N x is sometimes referred to as the number of segregating units. For the Adélie penguins, we estimated N x to have a 95% confidence interval (CI ) of 25!N x !69.) The observed levels of most site heteroplasmies from the blood samples of a mother and her chicks closely agree, which reflects a close agreement in the germ line. We will assume that the variation, after accounting for measurement uncertainty, is due to sampling at the recruiting bottleneck and that the proportions in the blood sample estimate the inherited proportions.
If A 1 (the mother of A 0 ) contains a site heteroplasmy with the nucleotides Y and X appearing in proportions f and 1Kf, then the probability (using a binomial selection with replacement) that A 0 inherits i genomes with allele Y and N x Ki genomes of allele X is If A 1 had inherited j copies of allele Y and N x Kj of allele X from her mother A 2 , we assume she exhibits the proportions fZj/N x and 1KfZ1Kj/N x of the alleles in the genomes available for inheritance. Hence, is the probability that A 0 inherits i copies of allele Y, and N x Ki copies of allele X, given her mother had inherited j and N x Kj corresponding copies. Let be the matrix of these probabilities, where the N x K1 rows and columns are indexed by i, j2{1,2, ., N x K1}. Given that A g has a somatic mutation in a descendant of one of her N x founding genomes, the probability that A gK1 inherits more than one mutated genome is very small. Hence, we will assume A gK1 is heteroplasmic at that site, with proportion 1/N x of its genomes containing Y. If the heteroplasmy is lost g generations later, then A 0 has either all X or all Y at that site. Let h X , Y be the proportion of cases where the heteroplasmy persists and h X and h Y be the proportions where the mutation is lost or fixed. The neutral model predicts that as g increases In a simulation study of 10 7 heteroplasmic site histories, we followed the introduction of one mutation until the site heteroplasmy was lost, for N x Z20 and for N x Z40. We found (table 1) for qZ0.23 that h X z((N x K1)/N x ) and h Y z1/N x . Table 1 also gives the average numbers of generations that the site heteroplasmies persist, and are observable. We note that over all histories, the average numbers of generations a site heteroplasmy was observable were almost identical for N x Z20 and 40, but the average variation in the levels of a heteroplasmy at a site between a mother and her chick differed significantly.
For each A k in the ancestry, let n k be the number of its founding genomes with nucleotide Y at that site. We have assumed that n gK1 Z1. If for k!gK1, n k Z0 or N x , the heteroplasmy is lost. Suppose 1%n kC1 Zj!N x , then the probability that n k Zi, (1%i!N x ) is In lemma 1 of the electronic supplementary material, we show that which is the ith entry of the leading column of P ð gKkK1Þ N x . Assuming that the probability that A g introduces a somatic mutation into the germ line at a selected site is a, then summing over all generations gR1, the probability that A 0 has a site heteroplasmy with n 0 Zi (1%i!N x ) is where ðQ N x Þ i;1 is the first entry in the ith row of As ðQ N x Þ i;1 is the expected number of generations that a new site heteroplasmy persists with i copies (figure 1), the expected number of generations a heteroplasmy is observable at that site is ðQ N x Þ i;1 : Most site heteroplasmies never reach the detection threshold, those that do, usually persist for many more than obs N x generations (table 1).
In figure 2, we plot the values of (Q 20 ) i,1 and (Q 40 ) i,1 , noting that these two distributions are almost identical, and that ðQ N x Þ i;1 is closely approximated by 2/i within the observed region. (The limit ðQ N x Þ i;1 / 2=i as N x /N was noted by Fisher and Wright ( Ewens 2004, eqn (1.56)).) We show in lemma 2 of the electronic supplementary material, that assuming Q N x ðiÞ zð2=iÞ, the probability b that a site has an observable heteroplasmy is closely approximated by 2a ln(q K1 K1). Assuming neutral evolution, 1/N x of the substitutions entering the germ line will become fixed in the maternal line of descent, so that the mutation rate can be estimated as where t is the generation time.

RESULTS
We have shown that majority of heteroplasmies cannot be observed, leading to undersampling; but that those that do may persist for many generations, leading to oversampling. These two effects do not balance. We have demonstrated that the estimate of the mutation rate from the density of observed heteroplasmic sites is dependent both on the number of segregating units N x and on the detection threshold q. For the Adélie penguin data of Millar et al. (2008), with qZ0.23, tZ6.46 andbZ 3:55 !10 K4 , we used Bayesian analysis (see 'Estimating N x from motherchick comparisons' of the electronic supplementary material) to obtain a maximum-likelihood estimate of N x Z36.5, with 95% CI of 25.0%N x %66.9. Using equation (2.4), Millar et al. (2008) estimated the median value to be m m Z0.55 s s K1 Myr K1 (CI 0.29%m m %0.88), a value not significantly different from the ancient evolutionary rate of kZ0.86 s s K1 Myr K1 (CI 0.53%k%1.17) estimated by Lambert et al. (2002) for the same region. In their study, Millar et al. showed that if they had assumed that each observed heteroplasmy represented a mutation in transition, the estimate of m would have been increased by a factor of nearly 100.

DISCUSSION
This model has assumed neutral evolution for the regions of mitochondria genome under analysis. Whether specific regions are under selective pressure is outside the scope of this study; however, Rand et al. (1994), for example, addressed this issue. Our model is applicable wherever the assumption of neutrality can be assumed for pedigree studies of any species where multiple copies of the mitochondrial genome are inherited, and a sufficient number of heteroplasmies are observed.
Accounting for the effects modelled here may illuminate the apparent discrepancies reported between molecular substitution rates and those estimated from pedigree studies, such as for human studies, e.g. Parsons et al. (1997), Howell et al. (2003 and Santos et al. (2005). As these studies do not report observational thresholds, our model cannot be applied directly to their data. Improvement in the accuracy of determining the relative expression levels of site heteroplasmies (with q/ 1=N x ) might lead to a direct estimate of N x . However, it is likely that N x may vary across the sample, in which case the distribution would not clump around the i/N x values. We have shown (by simulation) that the model is robust against moderate variations in N x , provided we take N x to be an idealized harmonic mean of the individual values in the sample. Table 1. Site heteroplasmy histories: results from 10 7 simulations for N x Z20 and 40 segregating units. (In more than 99.9% of the histories, the introduced mutation is either lost or fixed within 200 generations, and no site heteroplasmy survived more than 520 generations. Statistics are presented for the cases HX n (X/Y lost and never observed at qZ0.23), HX o (lost but observable for at least one generation) and HY (the mutation X/Y is ultimately fixed). We note that the proportions fixed ( HY ) is close to its expected value of 1/N x . g av records the mean number of generations the site heteroplasmy survives, and g av (obs) records the mean number of generations the site heteroplasmy is in the observable range. Note as N x is doubled, the g av (obs) values double, but the proportions are halved so the mean number of generations for which a site heteroplasmy is observable is approximately constant, close to 2.30 in both cases. In the final row, we note the average mean square difference between the mother/chick pairs over all observed heteroplasmic sites differs for N x Z20 and N x Z40.) N x Z20 N x Z40 category proportion g av g av (obs) proportion g av g av (