## Abstract

We present a mathematical model of mitochondrial inheritance evolving under neutral evolution to interpret the heteroplasmies observed at some sites. A comparison of the levels of heteroplasmies transmitted from mother to her offspring allows us to estimate the number *N*_{x} of inherited mitochondrial genomes (segregating units). The model demonstrates the necessity of accounting for both the multiplicity of an unknown number *N*_{x}, and the threshold *θ*, below which heteroplasmy cannot be detected reliably, in order to estimate the mitochondrial mutation rate *μ*_{m} in the maternal line of descent. Our model is applicable to pedigree studies of any eukaryotic species where site heteroplasmies are observed in regions of the mitochondria, provided neutrality can be assumed. The model is illustrated with an analysis of site heteroplasmies in the first hypervariable region of mitochondrial sequence data sampled from Adélie penguin families, providing an estimate *N*_{x} and *μ*_{m}. This estimate of *μ*_{m} was found to be consistent with earlier estimates from ancient DNA analysis.

## 1. Introduction

It has recently been found that estimates of the human mitochondrial mutation rate from pedigree data are many times higher than those estimated from sequence divergence at putatively neutral sites (e.g. Parsons *et al*. 1997; Howell *et al*. 2003; Santos *et al*. 2005). One possible reason for this is that heteroplasmy (variation among different organelle genomes within the same individual) may be maintained for many generations, after the origin by mutation of a new variant, provided that there is not a very small bottleneck in the number of organelle genomes during transmission from parent to offspring (reviewed by Birky 1991). The persistence of heteroplasmic mutations for many generations after their origin may lead to an inflated contribution to a pedigree-based estimate of mutation rate, if they are treated as mutations that have become fixed within individuals. We propose a method for estimating the mutation rate and size of the transmission bottleneck for maternally transmitted mitochondrial genomes. We illustrate this using the dataset of Millar *et al*. (2008) on the Adélie penguin.

Millar *et al*. (2008) sequenced a segment of 344 base pairs in a fast evolving region of first hypervariable region (a region they argued was not under strong selective pressure) from 1931 Adélie penguins (508 families) at Antarctic nesting sites. This was a follow-up to the study of Lambert *et al*. (2002) of ancient and contemporary samples of the same region of mitochondrial DNA. In the data reported in Millar *et al*. (2008), a low proportion of the sites in individual birds exhibited heteroplasmies, where two bases were called on the electropherogram at that site, each with a signal greater than *θ*=0.23 of the total signal. Signals below the detection threshold *θ*, could not be reliably distinguished from background noise and were not reported there. Fifty-five mothers (out of 508) exhibited site heteroplasmies (above *θ*), with seven having two heteroplasmic sites. Hence, the proportion of sites with an observed heteroplasmy across all mothers was(1.1)In all but three cases, the observed site heteroplasmies persisted above the threshold in both mother and chicks. Site heteroplasmies in a chick were only checked at sites where a heteroplasmy had been observed in its mother.

Following Lambert *et al*. (2002) and Millar *et al*. (2008), we define mutation rate as the rate at which a base substitution is incorporated into all mitochondrial genomes of an individual. Previous studies linking mutation rate with observed site heteroplasmies include two studies (Brandstätter *et al*. 2004; Santos *et al*. 2005) of human populations of up to three generations, where the mutation rate was estimated from assuming that each observed site heteroplasmy represented a transitional substitution. We find most of the substitutions giving rise to an observed site heteroplasmy are subsequently lost, and those that eventually become fixed may persist as an observed site heteroplasmy for many generations and be oversampled.

For our study, we follow the maternal ancestry of an individual, tracking the trajectory of a site substitution introduced into the germ line by some ancestor. At each generation, the population of mitochondrial genomes will pass through a bottleneck when the oocyte segregates a small number of its mother's genomes. Lambert *et al*. (2002) argued that this region of the mt-genome is not under strong selective pressure, so our model assumes random drift among the *N*_{x} segregating genomes.

This model should be applicable for any region of the mt-genome assumed neutral under selection for any eukaryote species.

## 2. Material and methods

Our model focuses on an individual site of the mitochondrial genomes in the maternal ancestry of one individual. When a heteroplasmy (comprising two different nucleotides, generically `X`, `Y`) is observed at a site, we presume, this is a consequence of a somatic substitution `X`→`Y` or `Y`→`X` at that site in a maternal ancestor some *g* generations prior, inherited by a daughter, which has persisted for *g* generations.

Consider the maternal ancestry of an individual *A*_{0}, with *A*_{1} its mother and *A*_{2} its maternal grandmother, etc. Suppose a nucleotide substitution `X`→`Y` has occurred at a site in one genome of a maternal line ancestor *A*_{g} (*g*≥1), which is inherited by *A*_{g−1}. Our model shows that the probability of an additional substitution inherited at that site among *A*_{g−1}, …, *A*_{0} is less than 10^{−2}, so we will neglect this possibility. Suppose this site heteroplasmy persists for (exactly) *k*≥1 generations, inherited by *A*_{g−1}, …, *A*_{g−k}, but not by *A*_{g−k−1}. If *k*≥*g*, then *A*_{0} will be heteroplasmic at that site, although not necessarily observable. If *k*<*g*, *A*_{g−k−1}, …, *A*_{0} are not heteroplasmic at that site, with their genomes either containing all `X` or all `Y`.

We propose a model where each oocyte recruits *N*_{x} mitochondrial genomes independently from the population of its mother's genomes. (The number *N*_{x} is sometimes referred to as the *number of segregating units*. For the Adélie penguins, we estimated *N*_{x} to have a 95% confidence interval (CI) of 25<*N*_{x}<69.)

The observed levels of most site heteroplasmies from the blood samples of a mother and her chicks closely agree, which reflects a close agreement in the germ line. We will assume that the variation, after accounting for measurement uncertainty, is due to sampling at the recruiting bottleneck and that the proportions in the blood sample estimate the inherited proportions.

If *A*_{1} (the mother of *A*_{0}) contains a site heteroplasmy with the nucleotides `Y` and `X` appearing in proportions *ϕ* and 1−*ϕ*, then the probability (using a binomial selection with replacement) that *A*_{0} inherits *i* genomes with allele `Y` and *N*_{x}−*i* genomes of allele `X` isIf *A*_{1} had inherited *j* copies of allele `Y` and *N*_{x}−*j* of allele `X` from her mother *A*_{2}, we assume she exhibits the proportions *ϕ*=*j*/*N*_{x} and 1−*ϕ*=1−*j*/*N*_{x} of the alleles in the genomes available for inheritance. Hence,(2.1)is the probability that *A*_{0} inherits *i* copies of allele `Y`, and *N*_{x}−*i* copies of allele `X`, given her mother had inherited *j* and *N*_{x}−*j* corresponding copies. Let(2.2)be the matrix of these probabilities, where the *N*_{x}−1 rows and columns are indexed by *i*,*j*∈{1,2, …, *N*_{x}−1}.

Given that *A*_{g} has a somatic mutation in a descendant of one of her *N*_{x} founding genomes, the probability that *A*_{g−1} inherits more than one mutated genome is very small. Hence, we will assume *A*_{g−1} is heteroplasmic at that site, with proportion 1/*N*_{x} of its genomes containing `Y`. If the heteroplasmy is lost *g* generations later, then *A*_{0} has either all `X` or all `Y` at that site. Let *h*_{X},_{Y} be the proportion of cases where the heteroplasmy persists and *h*_{X} and *h*_{Y} be the proportions where the mutation is lost or fixed. The neutral model predicts that as *g* increases

In a simulation study of 10^{7} heteroplasmic site histories, we followed the introduction of one mutation until the site heteroplasmy was lost, for *N*_{x}=20 and for *N*_{x}=40. We found (table 1) for *θ*=0.23 that *h*_{X}≈((*N*_{x}−1)/*N*_{x}) and *h*_{Y}≈1/*N*_{x}. Table 1 also gives the average numbers of generations that the site heteroplasmies persist, and are observable. We note that over all histories, the average numbers of generations a site heteroplasmy was observable were almost identical for *N*_{x}=20 and 40, but the average variation in the levels of a heteroplasmy at a site between a mother and her chick differed significantly.

For each *A*_{k} in the ancestry, let *n*_{k} be the number of its founding genomes with nucleotide `Y` at that site. We have assumed that *n*_{g−1}=1. If for *k*<*g*−1, *n*_{k}=0 or *N*_{x}, the heteroplasmy is lost. Suppose 1≤*n*_{k+1}=*j*<*N*_{x}, then the probability that *n*_{k}=*i*, (1≤*i*<*N*_{x}) isIn lemma 1 of the electronic supplementary material, we show thatwhich is the *i*th entry of the leading column of . Assuming that the probability that *A*_{g} introduces a somatic mutation into the germ line at a selected site is *α*, then summing over all generations *g*≥1, the probability that *A*_{0} has a site heteroplasmy with *n*_{0}=*i* (1≤*i*<*N*_{x}) is(2.3)where is the first entry in the *i*th row ofAs is the expected number of generations that a new site heteroplasmy persists with *i* copies (figure 1), the expected number of generations a heteroplasmy is observable at that site isMost site heteroplasmies never reach the detection threshold, those that do, usually persist for many more than generations (table 1).

In figure 2, we plot the values of (*Q*_{20})_{i,1} and (*Q*_{40})_{i,1}, noting that these two distributions are almost identical, and that is closely approximated by 2/*i* within the observed region. (The limit as *N*_{x}→∞ was noted by Fisher and Wright (Ewens 2004, eqn (1.56)).)

We show in lemma 2 of the electronic supplementary material, that assuming , the probability *β* that a site has an observable heteroplasmy is closely approximated by 2*α* ln(*θ*^{−1}−1). Assuming neutral evolution, 1/*N*_{x} of the substitutions entering the germ line will become fixed in the maternal line of descent, so that the mutation rate can be estimated as(2.4)where *t* is the generation time.

## 3. Results

We have shown that majority of heteroplasmies cannot be observed, leading to undersampling; but that those that do may persist for many generations, leading to oversampling. These two effects do not balance. We have demonstrated that the estimate of the mutation rate from the density of observed heteroplasmic sites is dependent both on the number of segregating units *N*_{x} and on the detection threshold *θ*.

For the Adélie penguin data of Millar *et al*. (2008), with *θ*=0.23, *t*=6.46 and , we used Bayesian analysis (see ‘Estimating N_{x} from mother–chick comparisons’ of the electronic supplementary material) to obtain a maximum-likelihood estimate of *N*_{x}=36.5, with 95% CI of 25.0≤*N*_{x}≤66.9. Using equation (2.4), Millar *et al*. (2008) estimated the median value to be *μ*_{m}=0.55 s s^{−1}Myr^{−1} (CI 0.29≤*μ*_{m}≤0.88), a value not significantly different from the ancient evolutionary rate of *k*=0.86 s s^{−1}Myr^{−1} (CI 0.53≤*k*≤1.17) estimated by Lambert *et al*. (2002) for the same region. In their study, Millar *et al*. showed that if they had assumed that each observed heteroplasmy represented a mutation in transition, the estimate of *μ* would have been increased by a factor of nearly 100.

## 4. Discussion

This model has assumed neutral evolution for the regions of mitochondria genome under analysis. Whether specific regions are under selective pressure is outside the scope of this study; however, Rand *et al*. (1994), for example, addressed this issue. Our model is applicable wherever the assumption of neutrality can be assumed for pedigree studies of any species where multiple copies of the mitochondrial genome are inherited, and a sufficient number of heteroplasmies are observed.

Accounting for the effects modelled here may illuminate the apparent discrepancies reported between molecular substitution rates and those estimated from pedigree studies, such as for human studies, e.g. Parsons *et al*. (1997), Howell *et al*. (2003) and Santos *et al*. (2005). As these studies do not report observational thresholds, our model cannot be applied directly to their data.

Improvement in the accuracy of determining the relative expression levels of site heteroplasmies (with ) might lead to a direct estimate of *N*_{x}. However, it is likely that *N*_{x} may vary across the sample, in which case the distribution would not clump around the *i*/*N*_{x} values. We have shown (by simulation) that the model is robust against moderate variations in *N*_{x}, provided we take *N*_{x} to be an idealized harmonic mean of the individual values in the sample.

## Acknowledgments

We gratefully acknowledge support from the New Zealand Centres of Research Excellence Fund. We thank the following people who assisted in the development of this paper: J. Esins, D. M. Lambert, C. D. Millar, D. Penny, and K. Schliep. We thank the editor and referees for their very helpful comments in improving the presentation.

## Footnotes

↵† Present address: School of Information Technologies, University of Sydney, Sydney, New South Wales 2006, Australia.

↵‡ Present address: Summit plc, Abingdon, Oxfordshire OX14 4RY, UK.

One contribution to a Special Feature on ‘Whole organism perspectives on understanding molecular evolution’.

- Received February 8, 2009.
- Accepted February 27, 2009.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Copyright © 2009 The Royal Society