The properties of the distribution of deleterious mutational effects on fitness (DDME) are of fundamental importance for evolutionary genetics. Since it is extremely difficult to determine the nature of this distribution directly, several methods that make various assumptions about the DDME have been developed for estimating its parameters. We apply a newly developed method to DNA sequence polymorphism data from two Drosophila species and compare estimates of the parameters of the distribution of the heterozygous fitness effects of amino acid mutations for several different distribution functions. The results exclude normal and gamma distributions, because these predict too few effectively lethal mutations, and power-law distributions, because these predict too many. Only the lognormal distribution appears to fit both the diversity data and the frequency of lethals. This DDME arises naturally in complex systems when independent factors contribute multiplicatively to an increase in fitness-reducing damage. Several important parameters, such as the fraction of effectively neutral non-synonymous mutations and the harmonic mean of non-neutral selection coefficients, are robust to the form of the DDME. Our results suggest that the majority of non-synonymous mutations in Drosophila are under effective purifying selection.
Recent advances in evolutionary genetics have led to a number of approaches for estimating the distribution of deleterious mutational effects on fitness (DDME) of non-synonymous mutations, using data on between-species sequence divergence and/or within-species sequence diversity (Bustamante et al. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer et al. 2003; Loewe et al. 2006). All assume a specific type of distribution of selection coefficients, which is then used to fit the data. Previous investigations have used a variety of distributions, including the normal, exponential and gamma distributions (Bustamante et al. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer et al. 2003; Loewe et al. 2006). The latter is widely used because of its convenient two-parameter form, which allows a wide range of curve shapes. However, none of these distributions has a special status, since there is currently no basis for a rational choice.
Here, we propose rejection of a candidate DDME if it cannot explain (i) DNA sequence diversity data in two related species with different effective population sizes, and (ii) the frequency of dominant, effectively lethal mutations caused by amino acid mutations. We find that a lognormal DDME satisfies these conditions much better than a gamma distribution or a power law.
2. Material and methods
We define the DDME as the genome-wide distribution of the heterozygous selection coefficient, s, associated with a new deleterious, non-synonymous mutation. We use diversity data from 17 loci of Drosophila miranda and 14 loci of Drosophila pseudoobscura, two closely related species of fruitfly with similar habitats, but significantly different effective population sizes (Ne) as estimated from their silent nucleotide site diversities, πS (Loewe et al. 2006). The similarity of the two species means that they probably share the same DDME, so that the larger Ne of D. pseudoobscura compared with D. miranda causes a larger fraction of sites to experience effective purifying selection. This results in a smaller increase in non-synonymous diversity, πA, than in πS. Assuming a given type of DDME for mutations affecting amino acid sites and a fraction of completely neutral, non-synonymous mutations (cn), we can calculate the expectation of πA for each species. By equating these to the pair of observed mean values of πA, we estimate the parameters of the DDME, assuming that it can be described by two parameters. Our method assumes approximate mutation–selection-drift equilibrium and independence among non-synonymous polymorphisms, but is not affected by the details of the frequency distributions of variants. It should, therefore, provide robust estimates of the parameters of the DDME for a fixed value of cn. Statistical accuracy is assessed by computing 1000 bootstraps. To improve the analysis, we used the diversity index, DI (the ratio of the values πA/πS for the two species), to eliminate the 12.2% of all bootstraps with DI≤1, since these cannot be explained by any plausible model. Further details are described in the electronic supplementary material and by Loewe et al. (2006).
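The bootstrap filter on DI can be sketched as follows. This is only an illustration under stated assumptions: loci are resampled with replacement within each species, DI is taken as πA/πS in the smaller-Ne species divided by πA/πS in the larger-Ne species, and the per-locus values are invented placeholders, not the D. miranda or D. pseudoobscura data. The function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def diversity_index(pi_a_small, pi_s_small, pi_a_large, pi_s_large):
    """pi_A/pi_S of the smaller-Ne species over pi_A/pi_S of the larger-Ne species.

    The direction of the ratio is an assumption made for this sketch.
    """
    return (pi_a_small / pi_s_small) / (pi_a_large / pi_s_large)

def bootstrap_di(loci_small, loci_large, n_boot=1000, seed=0):
    """Resample loci with replacement and split the replicates by DI.

    Each locus is a (pi_A, pi_S) tuple. Replicates with DI <= 1 are set aside,
    mirroring the exclusion rule described in the text.
    """
    rng = random.Random(seed)
    kept, rejected = [], []
    for _ in range(n_boot):
        sample_small = [rng.choice(loci_small) for _ in loci_small]
        sample_large = [rng.choice(loci_large) for _ in loci_large]
        di = diversity_index(
            mean([pi_a for pi_a, _ in sample_small]),
            mean([pi_s for _, pi_s in sample_small]),
            mean([pi_a for pi_a, _ in sample_large]),
            mean([pi_s for _, pi_s in sample_large]),
        )
        (kept if di > 1 else rejected).append(di)
    return kept, rejected

# Invented (pi_A, pi_S) pairs, for illustration only.
toy_small = [(0.0008, 0.004), (0.0012, 0.005), (0.0005, 0.003), (0.0010, 0.006)]
toy_large = [(0.0015, 0.020), (0.0030, 0.025), (0.0010, 0.018), (0.0040, 0.022)]
kept, rejected = bootstrap_di(toy_small, toy_large)
print(f"kept {len(kept)} of {len(kept) + len(rejected)} bootstrap replicates")
```

Only the replicates in `kept` would be passed on to the fitting step, which is not reproduced here because it depends on the expected-diversity formulae given in the supplementary material.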
To test whether a DDME that is compatible with observed diversity data also satisfies our second criterion requires an estimate of the rate at which non-synonymous mutations with effectively lethal (i.e. lethal or sterile) heterozygous effects arise. The difficulty is that most lethal mutations are recessive and it is hard to study those that are not. While point mutations, indels and transposable elements (TEs) can all induce dominant, effectively lethal mutations, we are only concerned with non-synonymous mutations. It is virtually impossible to estimate the rate of spontaneous dominant lethal mutations, and most of these are probably due to chromosome breaks (Ashburner 1989). However, the results of ethylmethane sulphonate (EMS) mutagenesis, which mainly but not exclusively induces point mutations, suggest that dominant female sterile mutations arise in D. melanogaster at about 1/500th of the rate for recessive lethal mutations (Ashburner 1989). Molecular analyses of two of the genes concerned show that the majority of the mutational lesions involved are non-synonymous mutations (Timinszky et al. 2002; Venkei & Szabad 2005).
The approximate overall frequency of such mutations can be assessed as follows. Recent data suggest that spontaneous recessive lethal mutations arise at a rate of about 0.045 per zygote per generation in D. melanogaster, but many of these are probably due to TE insertions (Charlesworth et al. 2004). The rate of TE insertions is about 0.2 per zygote per generation (Maside et al. 2000). Assuming that 25% of the genome is coding sequence (Misra et al. 2002) and 25% of all genes are vital (Oh et al. 2003) gives an estimate of 0.2×0.25×0.25=0.0125 recessive lethal TE insertions per zygote per generation. If 25% of non-TE mutations are indels (Charlesworth et al. 2004), this suggests that the recessive lethal mutation rate due to base substitutions is around 0.75×(0.045−0.0125)=0.024. The rate of mutation to dominant female sterile non-synonymous mutations in Drosophila is thus about 0.024/500≈5×10⁻⁵. If dominant lethal and dominant male sterility mutations arise at a similar rate, the net rate of mutation to effectively lethal dominant mutations is about 2×10⁻⁴. There is 27.8 Mb of exon sequence in D. melanogaster (Misra et al. 2002); about 70% of these sites can generate non-synonymous mutations without shifting the reading frame. With a mutation rate of 1.5×10⁻⁹ (Loewe et al. 2006), this results in a total mutation rate for non-synonymous mutations of U=0.058 per zygote per generation. The fraction λ of non-synonymous mutations that are dominant effective lethals is thus about 3.4×10⁻³. The precise value of this parameter is not important; however, the fact that effectively lethal non-synonymous mutations occur at a detectable, but low rate means that any candidate DDME that predicts either no or many such mutations can be rejected.
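The chain of estimates above can be retraced in a few lines. All input values are those quoted in the text; the variable names are illustrative. Two steps are our reading rather than explicit statements: the factor of 2 that converts a per-haploid-genome rate into a per-zygote rate, and the interpretation of the net dominant rate as three classes (female sterile, male sterile, lethal) arising at similar rates, which the text rounds to 2×10⁻⁴.

```python
# Inputs quoted in the text.
recessive_lethal_rate = 0.045   # spontaneous recessive lethals per zygote per generation
te_insertion_rate = 0.2         # TE insertions per zygote per generation
coding_fraction = 0.25          # fraction of the genome that is coding sequence
vital_gene_fraction = 0.25      # fraction of genes that are vital
indel_fraction = 0.25           # fraction of non-TE mutations that are indels
dominant_to_recessive = 1 / 500 # dominant female steriles relative to recessive lethals
exon_sites = 27.8e6             # exon sequence in D. melanogaster (bp)
nonsyn_site_fraction = 0.7      # sites that can generate non-synonymous mutations
mutation_rate = 1.5e-9          # mutation rate, taken here as per site per generation

# Recessive lethals caused by TE insertions, then by base substitutions.
te_lethals = te_insertion_rate * coding_fraction * vital_gene_fraction
substitution_lethals = (1 - indel_fraction) * (recessive_lethal_rate - te_lethals)

# Dominant female steriles, and three dominant classes at similar rates (assumption).
dominant_female_sterile = substitution_lethals * dominant_to_recessive
dominant_effective_lethal = 3 * dominant_female_sterile

# Non-synonymous mutation rate per zygote; the factor 2 assumes a diploid zygote.
U = 2 * exon_sites * nonsyn_site_fraction * mutation_rate

# Fraction of dominant effective lethals, with the rounded and unrounded numerator.
lambda_quoted = 2e-4 / U
lambda_unrounded = dominant_effective_lethal / U

print(f"TE lethals            {te_lethals:.4f}")
print(f"substitution lethals  {substitution_lethals:.4f}")
print(f"dominant net rate     {dominant_effective_lethal:.2e}")
print(f"U                     {U:.4f}")
print(f"lambda (rounded)      {lambda_quoted:.2e}")
print(f"lambda (unrounded)    {lambda_unrounded:.2e}")
```

Carrying the unrounded numerator through gives a somewhat smaller λ than the quoted 3.4×10⁻³, which is in keeping with the statement that only the order of magnitude of this parameter matters.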
Table 1 gives an overview of the various types of DDME that were tested against the data. Figure 1 plots the best-fitting estimates for visual inspection. Table 2 reports the distributional parameters, as well as scaled measures of selection intensities, for the DDMEs that can be fitted to the diversity data (which is not the case for the normal distribution with cn=0 or 2.5%). The gamma distribution fits the diversity data well (Loewe et al. 2006). The number of dominant effective lethals predicted by the best gamma DDME is, however, very small, because the diversity data can only be fitted by a gamma distribution with a relatively small width (table 2), so that the right-hand end of the distribution falls off quickly (figure 1). Inspection of the results for the gamma distribution shows that 6.5% of the 802 bootstraps that could be fitted to the data for cn=0 lead to potentially realistic genomic lethal rates (Uλ between 10⁻⁵ and 0.004), but most results gave unacceptably low values. Thus, the gamma DDME does not easily predict plausible numbers of these mutations.
Recent searches for general principles have frequently uncovered or attempted to fit power laws (Reed & Hughes 2002; Mitzenmacher 2004), and so a power law such as the Pareto distribution (table 1) might be a good candidate DDME. While we found that a Pareto DDME can fit the diversity data reasonably well, it failed to predict plausible fractions of lethals (table 2). In contrast, a lognormal DDME can fit both observed diversities and fractions of lethals. This result seems to be relatively insensitive to different assumptions about cn, and the bootstraps for Uλ largely overlap the range of values that are consistent with the data. For a lognormal DDME and cn=0, a much larger fraction of bootstraps (60% out of 610) predicts a plausible genomic lethal rate than with a gamma DDME. The fact that this fraction is not higher is probably due to the limited size of our dataset and the correspondingly noisy statistics.
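The contrast between the three families comes down to how much probability mass each places in its right-hand tail. The sketch below computes the fraction of mutations at or beyond a lethal threshold for each family. It rests on several assumptions: the threshold is taken as s≥1, the Pareto law is written in its standard form with a minimum value and an exponent (which may differ from the parameterization in table 1), the gamma distribution is restricted to integer shape so that the tail has a closed form, and all parameter values are hypothetical rather than the fitted estimates in table 2.

```python
import math

def gamma_tail_integer_shape(threshold, shape, scale):
    """P(s >= threshold) for a gamma distribution with integer shape (Erlang)."""
    x = threshold / scale
    return math.exp(-x) * sum(x ** n / math.factorial(n) for n in range(shape))

def lognormal_tail(threshold, mu, sigma):
    """P(s >= threshold) when ln(s) is normal with mean mu and s.d. sigma."""
    z = (math.log(threshold) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

def pareto_tail(threshold, s_min, alpha):
    """P(s >= threshold) for a Pareto law with minimum s_min and exponent alpha."""
    if threshold <= s_min:
        return 1.0
    return (s_min / threshold) ** alpha

LETHAL_THRESHOLD = 1.0  # assumed definition of an effectively lethal heterozygous effect

# Hypothetical parameters, chosen only to show the ordering of tail weights.
tails = {
    "gamma": gamma_tail_integer_shape(LETHAL_THRESHOLD, shape=2, scale=5e-5),
    "lognormal": lognormal_tail(LETHAL_THRESHOLD, mu=math.log(1e-4), sigma=3.0),
    "pareto": pareto_tail(LETHAL_THRESHOLD, s_min=1e-6, alpha=0.3),
}
for name, fraction in tails.items():
    print(f"{name:10s} fraction at or beyond threshold: {fraction:.3g}")
```

With parameters of this kind the gamma tail is vanishingly small, the Pareto tail is comparatively heavy and the lognormal lies between them, which is the qualitative pattern reported above.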
Is there a theoretical reason for the relatively good performance of the lognormal distribution? The central limit theorem, which states that a variable affected by independent additive effects of several other variables is approximately normally distributed, implies that a variable controlled multiplicatively by several independent factors approximately follows the lognormal distribution, because its logarithm is a sum of independent terms (Koch 1969; Mitzenmacher 2004; NIST/SEMATECH 2005). Darwinian fitness must be affected by many different factors operating at different levels of biological organization. We suggest that the extent of a reduction in fitness caused by a deleterious mutation, as measured by s, is a function of the amount of damage that it causes at several independent functional levels. If the total amount of damage were a multiplicative function of the amounts of damage at each level, a lognormal distribution of s would result (NIST/SEMATECH 2005).
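The multiplicative argument can be checked with a small simulation. The sketch below assumes ten independent functional levels, each contributing a damage factor drawn uniformly between 0.5 and 1; both the number of levels and the factor distribution are arbitrary choices made for illustration and carry no biological meaning. If the product is close to lognormal, its logarithm should be nearly symmetric while the product itself is right-skewed.

```python
import math
import random

def simulate_products(n_levels=10, n_mutations=20000, seed=1):
    """Multiply independent per-level damage factors for each simulated mutation."""
    rng = random.Random(seed)
    products = []
    for _ in range(n_mutations):
        total = 1.0
        for _ in range(n_levels):
            total *= rng.uniform(0.5, 1.0)  # hypothetical per-level factor
        products.append(total)
    return products

def skewness(values):
    """Standardized third central moment (population form)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum(((v - m) / sd) ** 3 for v in values) / n

products = simulate_products()
log_products = [math.log(p) for p in products]

print(f"skewness of product:      {skewness(products):.2f}")
print(f"skewness of log(product): {skewness(log_products):.2f}")
```

Increasing `n_levels` should bring the skewness of the logarithm closer to zero, as the central limit theorem predicts for a sum of independent terms.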
Regardless of the true nature of the DDME, the results presented in table 2 are encouraging in that some of the more important parameters derived from its properties are relatively invariant with respect to the type of distribution and are also fairly well bounded by the bootstrap procedure. In particular, the harmonic mean of Nes for effectively deleterious mutations is fairly similar for the different distributions and its bootstrap intervals are bounded well above 1. As noted previously, this parameter is close to the mean selection coefficient associated with polymorphic mutations that are not effectively neutral, i.e. those with Nes>0.5 (Loewe et al. 2006). It plays an important role in processes such as background selection and Muller's ratchet (Charlesworth & Charlesworth 2000). Similarly, the proportion of effectively neutral mutations is consistently estimated to be less than 20% and usually less than 10% (table 2). These conclusions are consistent with results from other methods (Bustamante et al. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer et al. 2003; Loewe et al. 2006).
It is likely that larger diversity datasets and more accurate estimates of the fraction of dominant effective lethals will lead to more precise estimates in the future. It is, of course, possible that a mixture of distributions with widely different means contributes to the overall DDME and mimics the results we have obtained by fitting the lognormal. For practical purposes, it is preferable to use a single distribution that matches the data successfully.
The results strengthen the conclusion that most amino acid mutations segregating in natural populations of Drosophila have sufficiently large deleterious effects on fitness that they behave quasi-deterministically (Nes>1), although large Ne values imply that selection against deleterious mutations in Drosophila is mostly weak (Loewe et al. 2006; mean s≈10⁻⁴). Despite the very small selection coefficients associated with most mutations, our best estimates for the lognormal DDME predict appreciable numbers of mutations with detectable fitness effects; we expect 11% or 7% of all non-synonymous mutations to have heterozygous effects in the range s=0.01 to 0.1, assuming cn=0% or 2.5%, respectively. Recent experiments on isolating EMS mutations in specific Drosophila genes by TILLING have shown that a substantial fraction of non-synonymous mutations can have drastic homozygous fitness effects (Winkler et al. 2005). Together with earlier findings that EMS-induced mutations with drastic homozygous fitness effects typically have small, but detectable fitness losses when heterozygous (Simmons et al. 1978), this is consistent with the predictions of the lognormal DDME, as is the fact that many human dominant Mendelian disorders are caused by single amino acid changes (Yampolsky et al. 2005). The hypothesis that there is a substantial minority of amino acid mutations with experimentally detectable heterozygous fitness effects can be tested in model organisms such as yeast or Drosophila, by measuring the fitness effects of induced non-synonymous mutations of known identity.
We thank Jay Taylor for helpful discussions and the Royal Society and the Leverhulme foundation for support. We have no competing financial interests.