## Abstract

The properties of the distribution of deleterious mutational effects on fitness (DDME) are of fundamental importance for evolutionary genetics. Since it is extremely difficult to determine the nature of this distribution, several methods using various assumptions about the DDME have been developed for the purpose of parameter estimation. We apply a newly developed method to DNA sequence polymorphism data from two *Drosophila* species and compare estimates of the parameters of the distribution of the heterozygous fitness effects of amino acid mutations for several different distribution functions. The results exclude normal and gamma distributions, since these predict too few effectively lethal mutations, and power-law distributions, since these predict too many lethals. Only the lognormal distribution appears to fit both the diversity data and the frequency of lethals. This DDME arises naturally in complex systems when independent factors contribute multiplicatively to an increase in fitness-reducing damage. Several important parameters, such as the fraction of effectively neutral non-synonymous mutations and the harmonic mean of non-neutral selection coefficients, are robust to the form of the DDME. Our results suggest that the majority of non-synonymous mutations in *Drosophila* are under effective purifying selection.

## 1. Introduction

Recent advances in evolutionary genetics have led to a number of approaches for estimating the distribution of deleterious mutational effects on fitness (DDME) of non-synonymous mutations, using data on between-species sequence divergence and/or within-species sequence diversity (Bustamante *et al*. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer *et al*. 2003; Loewe *et al*. 2006). All assume a specific type of distribution of selection coefficients, which is then used to fit the data. Previous investigations have used a variety of distributions, including the normal, exponential and gamma distributions (Bustamante *et al*. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer *et al*. 2003; Loewe *et al*. 2006). The gamma distribution is widely used because of its convenient two-parameter form, which allows a wide range of curve shapes. However, none of these distributions has a special status, since there is currently no basis for a rational choice.

Here, we propose rejection of a candidate DDME if it cannot explain (i) DNA sequence diversity data in two related species with different effective population sizes, and (ii) the frequency of dominant, effectively lethal mutations caused by amino acid mutations. We find that a lognormal DDME satisfies these conditions much better than a gamma distribution or a power law.

## 2. Material and methods

We define the DDME as the genome-wide distribution of the heterozygous selection coefficient, *s*, associated with a new deleterious, non-synonymous mutation. We use diversity data from 17 loci of *Drosophila miranda* and 14 loci of *Drosophila pseudoobscura*, two closely related species of fruitfly with similar habitats, but significantly different effective population sizes (*N*_{e}) as estimated from their silent nucleotide site diversities, *π*_{S} (Loewe *et al*. 2006). The similarity of the two species means that they probably share the same DDME, so that the larger *N*_{e} of *D. pseudoobscura* compared with *D. miranda* causes a larger fraction of sites to experience effective purifying selection. This results in a smaller increase in non-synonymous diversity, *π*_{A}, than in *π*_{S}. Assuming a given type of DDME for mutations affecting amino acid sites and a fraction of completely neutral, non-synonymous mutations (*c*_{n}), we can calculate the expectation of *π*_{A} for each species. By equating these to the pair of observed mean values of *π*_{A}, we estimate the parameters of the DDME, assuming that it can be described by two parameters. Our method assumes approximate mutation–selection-drift equilibrium and independence among non-synonymous polymorphisms, but is not affected by the details of the frequency distributions of variants. It should, therefore, provide robust estimates of the parameters of the DDME for a fixed value of *c*_{n}. Statistical accuracy was assessed with 1000 bootstrap replicates. To improve the analysis, we used the diversity index, DI (the ratio of the values of *π*_{A}/*π*_{S} for the two species), to eliminate the 12.2% of all bootstraps with DI≤1, since these cannot be explained by any plausible model. Further details are described in the electronic supplementary material and by Loewe *et al*. (2006).
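The forward part of this calculation can be illustrated with a minimal sketch. It is not the procedure of the supplementary material: the diffusion-theory expression for the diversity of a semidominant deleterious mutation relative to a neutral one, the lognormal parameters, the two *N*_{e} values, the orientation of DI and all function names are assumptions made for illustration only. The sketch shows why a shared DDME leads to a lower *π*_{A}/*π*_{S} in the species with the larger *N*_{e}, and hence to DI>1.

```python
import math
import random


def relative_diversity(S):
    """Diversity of a semidominant deleterious mutation relative to a neutral
    one, for scaled selection S = 4*Ne*s >= 0 (assumed diffusion result)."""
    if S < 1e-4:
        return 1.0 - S / 6.0          # series expansion, avoids cancellation
    e = math.exp(-S)                  # underflows harmlessly to 0.0 for large S
    return 2.0 * (1.0 - (S + 1.0) * e) / (S * (1.0 - e))


def expected_ratio(ne, mu, sigma, c_n, rng, n_draws=20000):
    """Monte Carlo estimate of expected pi_A / pi_S for a lognormal DDME,
    assuming equal mutation rates at silent and amino acid sites."""
    total = 0.0
    for _ in range(n_draws):
        s = rng.lognormvariate(mu, sigma)     # heterozygous selection coefficient
        total += relative_diversity(4.0 * ne * s)
    return c_n + (1.0 - c_n) * total / n_draws


# Hypothetical values, NOT estimates from the paper.
NE_SMALL, NE_LARGE = 5e5, 2e6         # stand-ins for the two species
MU, SIGMA, C_N = math.log(1e-4), 3.0, 0.0

# The same seed gives both species the same draws of s (shared DDME).
ratio_small = expected_ratio(NE_SMALL, MU, SIGMA, C_N, random.Random(0))
ratio_large = expected_ratio(NE_LARGE, MU, SIGMA, C_N, random.Random(0))
di = ratio_small / ratio_large        # assumed orientation of the diversity index

print(f"pi_A/pi_S (small Ne) = {ratio_small:.4f}")
print(f"pi_A/pi_S (large Ne) = {ratio_large:.4f}")
print(f"DI = {di:.2f}")
```

Estimating the two DDME parameters would additionally require solving for the values of `MU` and `SIGMA` at which both expected ratios match the observed ones, which is omitted here.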

## 3. Results

To test whether a DDME that is compatible with observed diversity data also satisfies our second criterion requires an estimate of the rate at which non-synonymous mutations with effectively lethal (i.e. lethal or sterile) heterozygous effects arise. The difficulty is that most lethal mutations are recessive and it is hard to study those that are not. While point mutations, indels and transposable elements (TEs) can all induce dominant, effectively lethal mutations, we are only concerned with non-synonymous mutations. It is virtually impossible to estimate the rate of spontaneous dominant lethal mutations, and most of these are probably due to chromosome breaks (Ashburner 1989). However, the results of ethylmethane sulphonate (EMS) mutagenesis, which mainly but not exclusively induces point mutations, suggest that dominant female sterile mutations arise in *D. melanogaster* at about 1/500th of the rate for recessive lethal mutations (Ashburner 1989). Molecular analyses of two of the genes concerned show that the majority of the mutational lesions involved are non-synonymous mutations (Timinszky *et al*. 2002; Venkei & Szabad 2005).

The approximate overall frequency of such mutations can be assessed as follows. Recent data suggest that spontaneous recessive lethal mutations arise at a rate of about 0.045 per zygote per generation in *D. melanogaster*, but many of these are probably due to TE insertions (Charlesworth *et al*. 2004). The rate of TE insertions is about 0.2 per zygote per generation (Maside *et al*. 2000). Assuming that 25% of the genome is coding sequence (Misra *et al*. 2002) and 25% of all genes are vital (Oh *et al*. 2003) gives an estimate of 0.0125 recessive lethal TE-insertions per zygote per generation. If 25% of non-TE mutations are indels (Charlesworth *et al*. 2004), this suggests that the recessive lethal mutation rate due to base substitutions is around 0.75×(0.045−0.0125)=0.024. At 1/500th of this rate (0.024/500), the rate of mutation to dominant female sterile non-synonymous mutations in *Drosophila* is thus about 5×10^{−5}. If dominant lethal and dominant male sterility mutations arise at a similar rate, the net rate of mutation to effectively lethal dominant mutations is about 2×10^{−4}. There is 27.8 Mb of exon sequence in *D. melanogaster* (Misra *et al*. 2002); about 70% of these sites can generate non-synonymous mutations without shifting the reading frame. With a mutation rate of 1.5×10^{−9} (Loewe *et al*. 2006), this results in a total mutation rate for non-synonymous mutations of *U*=0.058 per zygote per generation. The fraction *λ* of non-synonymous mutations that are dominant effective lethals is thus about 3.4×10^{−3}. The precise value of this parameter is not important; however, the fact that effectively lethal non-synonymous mutations occur at a detectable, but low rate means that any candidate DDME that predicts either no or many such mutations can be rejected.
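The chain of estimates above can be retraced step by step. The short sketch below uses only the figures quoted in the text. Two steps are our reading rather than statements of the source: the factor of 3 (female sterile, male sterile and lethal classes at equal rates) behind the rounded value of 2×10^{−4}, and the factor of 2 (diploid zygote) needed to reach *U*=0.058. Variable names are illustrative.

```python
# Recessive lethals caused by TE insertions (per zygote per generation).
te_insertions = 0.2
coding_fraction = 0.25
vital_fraction = 0.25
te_lethals = te_insertions * coding_fraction * vital_fraction        # 0.0125

# Recessive lethals caused by base substitutions.
recessive_lethal_total = 0.045
indel_fraction = 0.25
base_sub_lethals = (1 - indel_fraction) * (recessive_lethal_total - te_lethals)

# Dominant female steriles arise at about 1/500th of the recessive lethal rate.
dominant_female_sterile = base_sub_lethals / 500                     # about 5e-5

# Assumption: three classes of dominant effective lethals at similar rates.
# This gives about 1.5e-4, which the text rounds to 2e-4.
dominant_effective_lethal = 3 * dominant_female_sterile

# Total non-synonymous mutation rate U.
exon_bp = 27.8e6
nonsyn_fraction = 0.70
mutation_rate = 1.5e-9
ploidy = 2                     # assumption: two genome copies per zygote
U = ploidy * exon_bp * nonsyn_fraction * mutation_rate               # about 0.058

# Fraction of non-synonymous mutations that are dominant effective lethals.
lam_rounded = 2e-4 / U                           # about 3.4e-3, as in the text
lam_unrounded = dominant_effective_lethal / U    # about 2.5e-3

print(f"recessive lethal base substitutions: {base_sub_lethals:.4f}")
print(f"U = {U:.3f}")
print(f"lambda = {lam_rounded:.2e} (rounded input), {lam_unrounded:.2e} (unrounded)")
```

The gap between the two values of *λ* reflects rounding only, consistent with the remark that the precise value of this parameter is not important.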

Table 1 gives an overview of the various types of DDME that were tested against the data. Figure 1 plots the best-fitting estimates for visual inspection. Table 2 reports the distributional parameters, as well as scaled measures of selection intensities, for the DDMEs that can be fitted to the diversity data (which is not the case for the normal distribution with *c*_{n}=0 or 2.5%). The gamma distribution fits the diversity data well (Loewe *et al*. 2006). The number of dominant effective lethals predicted by the best gamma DDME is, however, very small, because the diversity data can only be fitted by a gamma distribution with a relatively small width (table 2), so that the right-hand end of the distribution falls off quickly (figure 1). Inspection of the results for the gamma distribution shows that 6.5% of the 802 bootstraps that could be fitted to the data for *c*_{n}=0 lead to potentially realistic genomic lethal rates (*Uλ* between 10^{−5} and 0.004), but most results gave unacceptably low values. Thus, the gamma DDME does not easily predict plausible numbers of these mutations.

Recent searches for general principles have frequently uncovered or attempted to fit power laws (Reed & Hughes 2002; Mitzenmacher 2004), and so a power law such as the Pareto distribution (table 1) might be a good candidate DDME. While we found that a Pareto DDME can fit the diversity data reasonably well, it failed to predict plausible fractions of lethals (table 2). In contrast, a lognormal DDME can fit both observed diversities and fractions of lethals. This result seems to be relatively insensitive to different assumptions about *c*_{n}, and the bootstraps for *Uλ* largely overlap the range of values that are consistent with the data. For a lognormal DDME and *c*_{n}=0, a much larger fraction of bootstraps (60% out of 610) predicts a plausible genomic lethal rate than with a gamma DDME. The fact that this fraction is not higher is probably due to the limited size of our dataset and the correspondingly noisy statistics.
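To make the second criterion concrete, the following sketch shows how the fraction of effectively lethal mutations could be read off the upper tail of a candidate DDME. The lethality threshold (*s*≥1), the median *s*, and the Pareto scale and shape are hypothetical values chosen for illustration. They are not the fitted values of table 2, and the function names are our own.

```python
import math
from statistics import NormalDist


def lognormal_survival(s, mu, sigma):
    """P(S >= s) when ln S is normal with mean mu and standard deviation sigma."""
    z = (math.log(s) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pareto_survival(s, s_min, alpha):
    """P(S >= s) for a Pareto law with scale s_min and shape alpha."""
    return 1.0 if s <= s_min else (s_min / s) ** alpha


def sigma_for_lethal_fraction(median_s, lethal_s, lam):
    """Lognormal sigma that places a fraction lam of mutations at or above lethal_s."""
    z = NormalDist().inv_cdf(1.0 - lam)
    return (math.log(lethal_s) - math.log(median_s)) / z


LETHAL_S = 1.0      # assumption: s >= 1 treated as effectively lethal
MEDIAN_S = 1e-4     # hypothetical median selection coefficient
LAMBDA = 3.4e-3     # fraction of dominant effective lethals from the text

mu = math.log(MEDIAN_S)
sigma = sigma_for_lethal_fraction(MEDIAN_S, LETHAL_S, LAMBDA)

lethal_fraction = lognormal_survival(LETHAL_S, mu, sigma)
# Mass with heterozygous effects between s = 0.01 and s = 0.1.
detectable_band = lognormal_survival(0.01, mu, sigma) - lognormal_survival(0.1, mu, sigma)
# A power-law tail with hypothetical parameters, for contrast only.
pareto_lethal = pareto_survival(LETHAL_S, s_min=1e-5, alpha=0.3)

print(f"lognormal sigma = {sigma:.2f}")
print(f"lognormal lethal fraction = {lethal_fraction:.2e}")
print(f"lognormal mass in 0.01-0.1 = {detectable_band:.3f}")
print(f"Pareto lethal fraction = {pareto_lethal:.2e}")
```

Multiplying a lethal fraction obtained this way by *U* gives a genomic lethal rate that can be compared with the plausible range of *Uλ* quoted above.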

## 4. Discussion

Is there a theoretical reason for the relatively good performance of the lognormal distribution? The central limit theorem, which states that a variable affected by independent additive effects of several other variables is normally distributed, implies that a variable controlled multiplicatively by several independent factors follows the lognormal distribution (Koch 1969; Mitzenmacher 2004; NIST/SEMATECH 2005). Darwinian fitness must be affected by many different factors operating at different levels of biological organization. We suggest that the extent of a reduction in fitness caused by a deleterious mutation, as measured by *s*, is a function of the amount of damage that it causes at several independent functional levels. If the total amount of damage were a multiplicative function of the amounts of damage at each level, a lognormal distribution of *s* would result (NIST/SEMATECH 2005).
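The multiplicative argument can be checked with a small simulation. In this sketch the number of functional levels (20) and the uniform distribution of the per-level factor are arbitrary assumptions with no empirical basis. The point is only that the product of many independent positive factors is strongly right-skewed, whereas its logarithm, being a sum of independent terms, is close to symmetric, as expected for a lognormal variable.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)


def simulate_damage(n_levels, rng):
    """Total damage as the product of independent per-level damage factors."""
    damage = 1.0
    for _ in range(n_levels):
        damage *= rng.uniform(0.5, 1.5)   # hypothetical per-level factor
    return damage


rng = random.Random(1)
samples = [simulate_damage(20, rng) for _ in range(20000)]

raw_skew = skewness(samples)
log_skew = skewness([math.log(x) for x in samples])

print(f"skewness of damage:      {raw_skew:.2f}")
print(f"skewness of log(damage): {log_skew:.2f}")
```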

Regardless of the question of the true nature of the DDME, the results presented in table 2 are encouraging in that some of the more important parameters derived from its properties are relatively invariant with respect to the type of distribution and are also fairly well bounded by the bootstrap procedure. In particular, the harmonic mean of *N*_{e}*s* for effectively deleterious mutations is fairly similar for the different distributions and its bootstrap intervals are bounded well above 1. As noted previously, this parameter is close to the mean selection coefficient associated with polymorphic mutations that are not effectively neutral, i.e. have *N*_{e}*s*>0.5 (Loewe *et al*. 2006). It plays an important role in processes such as background selection and Muller's ratchet (Charlesworth & Charlesworth 2000). Similarly, the proportion of effectively neutral mutations is consistently estimated to be less than 20% and usually less than 10% (table 2). These conclusions are consistent with results from other methods (Bustamante *et al*. 2003; Nielsen & Yang 2003; Piganeau & Eyre-Walker 2003; Sawyer *et al*. 2003; Loewe *et al*. 2006).

It is likely that larger diversity datasets and more accurate estimates of the fraction of dominant effective lethals will lead to more precise estimates in the future. It is, of course, possible that a mixture of distributions with widely different means contributes to the overall DDME and mimics the results we have obtained by fitting the lognormal. For practical purposes, however, it is preferable to use a single distribution that matches the data successfully.

The results strengthen the conclusion that most amino acid mutations segregating in natural populations of *Drosophila* have sufficiently large deleterious effects on fitness that they behave quasi-deterministically (*N*_{e}*s*>1), although large *N*_{e} values imply that selection against deleterious mutations in *Drosophila* is mostly weak (Loewe *et al*. 2006; mean *s*≈10^{−4}). Despite the very small selection coefficients associated with most mutations, our best estimates for the lognormal DDME predict appreciable numbers of mutations with detectable fitness effects; we expect 11% or 7% of all non-synonymous mutations to have heterozygous effects in the range from *s*=0.01 to *s*=0.1, assuming *c*_{n}=0% or 2.5%, respectively. Recent experiments on isolating EMS mutations in specific *Drosophila* genes by TILLING have shown that a substantial fraction of non-synonymous mutations can have drastic homozygous fitness effects (Winkler *et al*. 2005). Together with earlier findings that EMS-induced mutations with drastic homozygous fitness effects typically have small, but detectable fitness losses when heterozygous (Simmons *et al*. 1978), this is consistent with the predictions of the lognormal DDME, as is the fact that many human dominant Mendelian disorders are caused by single amino acid changes (Yampolsky *et al*. 2005). The hypothesis that there is a substantial minority of amino acid mutations with experimentally detectable heterozygous fitness effects can be tested in model organisms such as yeast or *Drosophila*, by measuring the fitness effects of induced non-synonymous mutations of known identity.

## Acknowledgments

We thank Jay Taylor for helpful discussions and the Royal Society and the Leverhulme foundation for support. We have no competing financial interests.

## Footnotes

The electronic supplementary material is available at http://dx.doi.org/10.1098/rsbl.2006.0481 or via http://www.journals.royalsoc.ac.uk.

- Received February 1, 2006.
- Accepted March 23, 2006.

- © 2006 The Royal Society