We have used a polymorphism dataset on introns and coding sequences of X-linked loci in Drosophila americana to estimate the strength of selection on codon usage and/or biased gene conversion (BGC), taking into account a recent population expansion detected by a maximum-likelihood method. Drosophila americana was previously thought to have a stable demographic history, so this evidence for a recent population expansion means that previous estimates of selection need revision. There was evidence for natural selection or BGC favouring GC over AT variants in introns, which was stronger in GC-rich than in GC-poor introns. By comparing introns and coding sequences, we found evidence for selection on codon usage bias, which is much stronger than the forces acting on GC versus AT basepairs in introns.
In bacteria, yeast, Drosophila and plants, there is evidence for selection on codon usage at synonymous coding sites, probably because of selection on translational efficiency and/or accuracy. Several population genetic studies of Drosophila have used polymorphism data to estimate the intensity of selection on codon usage [2–7]. In addition, genome evolution is affected by the process of biased gene conversion (BGC), which tends to favour GC over AT basepairs in the meiotic products of GC/AT heterozygotes, and acts in a similar way to directional selection. Its effects and strength can be inferred from polymorphism data on non-coding sequences [9,10].
Here, we present results on the nature and intensity of selection and/or BGC on non-coding and synonymous sites, using polymorphism data on X-linked loci of Drosophila americana, a close relative of Drosophila virilis. The virilis group diverged from the Drosophila melanogaster group about 62 Ma and has somewhat different patterns of codon usage and base composition [12,13], making it of special interest for studies of these genomic features. Drosophila americana has been used in evolutionary genetic studies for several decades [14–18]. It has a well-defined ecology, independent of human activity, and might thus be expected to have a relatively stable demographic history, which is advantageous for estimating the parameters of natural selection from polymorphism data.
This paper presents, to our knowledge, the first analysis of a species in the virilis group to detect both selection on codon usage and BGC from polymorphism data, using a population genetic method that allows for a recent population size change, whereas a previous study of selection on codon usage assumed demographic equilibrium. We provide evidence for a recent population expansion, and for selection on codon usage at synonymous sites, as well as selection or BGC favouring GC over AT in GC-rich introns.
2. Material and methods
For DNA extractions, we used males from 14 D. americana isofemale lines from the HI99 population on the south bank of the Missouri River (http://www.biology.uiowa.edu/mcallister/HI.html), provided by Bryant McAllister. About 85 per cent of genomes from this population have a fusion between the X and chromosome 4 [15,16]. Because genes located near the fusion region or in inversions may suffer from hitchhiking effects of the rearrangements, regions affected by the X/4 fusion or known segregating inversions were excluded.
Details of DNA extraction, amplification, sequencing and alignment of sequences are provided in the electronic supplementary material. The resulting dataset contains sequences for 32 introns sampled from 18 loci, including 12 short introns and 20 long introns (electronic supplementary material, figure S1). We also obtained the coding sequences of 15 X-linked genes, and retrieved four additional X-linked coding sequences from Maside & Charlesworth, in order to compare synonymous sites and introns. Sequences were deposited in GenBank (accession numbers JN246676–JN246926).
Using the codon preference table for D. virilis from Betancourt et al., we assigned preferred (P) and unpreferred (U) alternatives to each synonymous site in both species, and then used parsimony to determine whether the synonymous site change within D. americana was P > P, U > U, P > U or U > P. Similarly, we obtained the counts and frequencies of AT > TA, GC > CG, GC > AT and AT > GC polymorphic changes for each intron in the D. americana intron dataset to test for selection or BGC favouring GC over AT basepairs [9,10].
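The classification step described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in this study: the function names are hypothetical, the three-codon preferred set is a placeholder standing in for the full D. virilis table of Betancourt et al., and the ancestral state is assumed to have been inferred already by parsimony.

```python
# Hypothetical placeholder set; a real analysis would load the full
# preferred-codon table for D. virilis instead of these three codons.
TOY_PREFERRED = {"CTG", "GCC", "AAG"}

STRONG = {"G", "C"}  # bases forming GC basepairs
WEAK = {"A", "T"}    # bases forming AT basepairs


def classify_synonymous(ancestral, derived, preferred=TOY_PREFERRED):
    """Label a synonymous change as 'P > P', 'U > U', 'P > U' or 'U > P'.

    Assumes both codons encode the same amino acid and that the
    ancestral codon has already been inferred (e.g. by parsimony).
    """
    a = "P" if ancestral.upper() in preferred else "U"
    d = "P" if derived.upper() in preferred else "U"
    return f"{a} > {d}"


def classify_intron_change(ancestral, derived):
    """Label a single-base intron change by the basepair classes involved."""
    a, d = ancestral.upper(), derived.upper()
    if a not in STRONG | WEAK or d not in STRONG | WEAK:
        raise ValueError("bases must be A, C, G or T")
    if a == d:
        raise ValueError("ancestral and derived bases are identical")
    if a in STRONG and d in WEAK:
        return "GC > AT"
    if a in WEAK and d in STRONG:
        return "AT > GC"
    if a in WEAK:
        return "AT > TA"  # A <-> T, GC content unchanged
    return "GC > CG"      # G <-> C, GC content unchanged


if __name__ == "__main__":
    print(classify_synonymous("CTA", "CTG"))
    print(classify_intron_change("G", "A"))
```

Counting the labels returned over all polymorphic sites would give totals of the kind compared in the results, and the two classes that leave GC content unchanged serve as the putatively neutral reference.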
We used the maximum-likelihood (ML) method of Zeng & Charlesworth, as modified by Haddrill et al., to fit the observed frequencies of variants to models of selection and demography. This allowed us to estimate the strength of selection/BGC on U > P synonymous polymorphisms or GC > AT basepairs, and the extent of mutational bias in favour of GC > AT versus AT > GC changes, while allowing for the possibility of a recent population size change in D. americana. Details are given in the electronic supplementary material.
Our major findings are presented below; other results are described in the electronic supplementary material. The mean values of various summary statistics are shown in table 1. The mean diversity and divergence values are broadly consistent with those reported previously, even after excluding the four coding sequences in common with Maside & Charlesworth. There are no significant differences in mean Tajima's D values between the different classes of sites, or in variation and divergence values between intronic and synonymous sites. The consistently negative Tajima's D values suggest a recent population expansion, as confirmed by the analysis below.
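To make the link between Tajima's D and an excess of rare variants concrete, the statistic can be computed from the sample size, the number of segregating sites and the mean number of pairwise differences, using Tajima's standard formulae. The sketch below is illustrative only: the function name is ours, and the input values are toy numbers rather than data from this study.

```python
import math


def tajimas_d(n, segregating_sites, pi):
    """Tajima's D for n sequences.

    segregating_sites: number of segregating sites (S) in the region.
    pi: mean number of pairwise differences over the same region.
    """
    if n < 4:
        # For n < 4 the variance term is zero, so D is undefined.
        raise ValueError("need at least four sequences")
    if segregating_sites < 1:
        raise ValueError("need at least one segregating site")
    s = segregating_sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two diversity estimators, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    n, s = 14, 10
    # Toy case 1: every variant is a singleton (each site adds 2/n to pi).
    print(tajimas_d(n, s, s * 2.0 / n))
    # Toy case 2: every variant is at frequency 7/14.
    print(tajimas_d(n, s, s * 2.0 * 7 * 7 / (n * (n - 1))))
```

In the first toy case D is negative, because rare variants contribute little to pairwise diversity relative to the number of segregating sites; in the second it is positive. An expanding population generates an excess of rare variants and hence negative values.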
We first examined selection on variants affecting codon usage, using data on 19 X-linked coding sequences. There are four classes of mutations: P > P and U > U (expected to be selectively nearly neutral), P > U (potentially deleterious), and U > P mutations (potentially advantageous) [2,22]. Selection favouring P versus U variants is usually expected to yield an excess of P > U over U > P variants [2–4]. Consistent with this, we found nearly three times as many P > U variants as U > P variants (162 versus 56). In addition, P > U variants are disproportionately present at low frequencies compared with U > P variants (figure 1); the mean frequency of U > P mutations over the segregating sites in the sample was significantly higher than that of both P > U changes (Wilcoxon's W = 916.5, p = 0.022) and the pooled P > P and U > U changes (W = 856, p = 0.030).
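For readers unfamiliar with the rank-sum statistic, the sketch below computes W in its Mann-Whitney form: the number of pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. That this convention matches the W reported above is an assumption on our part, and the frequencies used are toy values, not data from this study.

```python
def rank_sum_w(x, y):
    """Mann-Whitney form of the rank-sum statistic for sample x versus y.

    Counts the (x_i, y_j) pairs with x_i > y_j, adding one half for ties.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


if __name__ == "__main__":
    # Toy variant frequencies for two hypothetical classes of mutation.
    class_a = [0.36, 0.21, 0.50]
    class_b = [0.07, 0.07, 0.14, 0.21]
    print(rank_sum_w(class_a, class_b))
```

Under this convention, W ranges from 0 to the product of the two sample sizes, and half-integer values arise only when the two samples share tied values. Obtaining a p-value additionally requires the null distribution of W, which is not sketched here.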
We also explored the possible effect of BGC on intronic base composition, which is expected to favour GC over AT variants. The total numbers of GC > AT and AT > GC variants over the set of 32 introns are similar (248 versus 242), whereas the mean frequency of AT > GC variants is higher than that of GC > AT variants (0.28 versus 0.19) (W = 197.5, p = 0.002).
We also analysed these datasets by the method of Zeng & Charlesworth [4,6]. The ML estimates of mutational bias under all models examined indicate higher rates of mutations towards P > U and GC > AT variants compared with the reverse mutations, as found in previous Drosophila studies. The contrasts between the model with no expansion, but with all other parameters fitted (L0), and the other models (L1) indicate a recent 4.2-fold increase in population size (table 2), with an ML estimate of the time since the event of τ = 0.11, where τ is the number of generations since the expansion divided by twice the current effective population size.
To test for selection on codon usage, we compared the full L1 model with a reduced version in which γcod = 0, where γcod is the estimate of the strength of selection/BGC at a synonymous site, scaled by four times the effective population size before the expansion. The full model has strong statistical support (χ²₁ = 29.9, p < 0.0001), with γcod = 1.6, implying selection in favour of preferred codons, consistent with the patterns of P > U versus U > P variants described above. To test for selection/BGC on intronic variants, we compared the full L1 model with a reduced version in which γint = 0 (χ²₁ = 8.27, p = 0.004). Selection or BGC in favour of GC intronic basepairs is thus implied, with γint = 0.36. We tested whether γcod is significantly larger than γint by comparing the full L1 model with a model that has a single γ for both categories: the full L1 model fits significantly better than that with γcod = γint (χ²₁ = 14.6, p < 0.0001). We similarly found that the γint estimates are significantly different for introns with high and low GC content (χ²₁ = 18.9, p < 0.0001).
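Each comparison above is a likelihood-ratio test between nested models that differ by one parameter, so twice the difference in log-likelihood is referred to a chi-squared distribution with one degree of freedom. The sketch below shows how such a p-value can be obtained with the Python standard library alone; the function names are ours, and the log-likelihoods of the fitted models themselves are not reproduced here.

```python
import math


def lrt_statistic(lnl_full, lnl_reduced):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (lnl_full - lnl_reduced)


def lrt_p_value_1df(statistic):
    """Upper-tail probability of a chi-squared variate with 1 d.f.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)), which holds because
    a chi-squared variate with one degree of freedom is the square of a
    standard normal variate.
    """
    if statistic < 0:
        raise ValueError("the test statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Test statistics quoted in the text for the codon and intron tests.
    for stat in (29.9, 8.27):
        print(stat, lrt_p_value_1df(stat))
```

For tests with more than one degree of freedom, a general chi-squared survival function from a statistics library would be needed instead of this one-parameter shortcut.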
Our analysis provides evidence for a fairly large, recent increase in population size in D. americana, within a time-span of approximately 0.11 × 2Ne generations. This is consistent with the results for another widespread North American species, Drosophila pseudoobscura. Given the mean silent site diversity values of about 2 per cent (table 1), using the standard formula for equilibrium neutral diversity (4Neμ) together with the D. melanogaster mutation rate estimate of 3.5 × 10⁻⁹, we estimate that the current Ne of D. americana is about 1.4 million, implying that the expansion took place about 308 000 generations ago. Assuming five generations per year for this slowly breeding species, this corresponds to 61 600 years, although there is considerable uncertainty about the exact value.
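The arithmetic behind these figures can be retraced as follows. The sketch uses only the values quoted in the text; the variable and function names are ours, and the rounding of Ne to 1.4 million before the later steps follows the text.

```python
def ne_from_diversity(diversity, mutation_rate):
    """Solve the equilibrium relation diversity = 4 * Ne * mu for Ne."""
    return diversity / (4.0 * mutation_rate)


def generations_since_expansion(tau, ne):
    """Convert tau (time in units of 2 * Ne generations) to generations."""
    return tau * 2.0 * ne


diversity = 0.02           # mean silent-site diversity, about 2 per cent
mutation_rate = 3.5e-9     # D. melanogaster estimate quoted in the text
tau = 0.11                 # ML estimate of the scaled time since expansion
generations_per_year = 5   # assumption stated in the text

ne_estimate = ne_from_diversity(diversity, mutation_rate)  # about 1.43 million
ne_rounded = 1.4e6                                         # value used in the text
generations = generations_since_expansion(tau, ne_rounded)
years = generations / generations_per_year

if __name__ == "__main__":
    print(ne_estimate, generations, years)
```

Carrying the unrounded Ne through instead would shift the estimated time by about 2 per cent, which is small relative to the uncertainty in the mutation rate and generation time.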
The results in table 2 show that both synonymous sites and intron sequences in D. americana are influenced by selection and/or BGC, even after the recent population expansion was taken into account. The γ estimate of about 1.6 for selection favouring preferred over unpreferred codons is in line with values for other Drosophila species [6,19], but is lower than the value of 2.6 found previously in D. americana, suggesting that ignoring the population expansion caused the strength of selection to be overestimated, as expected theoretically. Consistent with other evidence from Drosophila for selection or BGC favouring GC over AT basepairs in non-coding sequences [10,19], we found evidence for such a force acting on introns. As in Haddrill & Charlesworth, selection/BGC appears to be significantly stronger in GC-rich than in GC-poor introns, consistent with the idea that the intensity of BGC shapes the GC content of genomes.
As preferred codons are mostly GC-ending, selection for codon usage largely works in the same direction as BGC. The difference in γ between the synonymous sites and introns almost certainly reflects the action of selection on codon usage bias at synonymous sites, possibly in addition to the effects of BGC, whereas the apparent selection on intron sites may result from BGC alone. This difference could also be owing to a higher rate of recombination in exons than in introns, resulting in a higher rate of BGC in exons, although we did not find any evidence for this (see electronic supplementary material).
This work formed part of the GENACT Project, funded by a Marie Curie Host Fellowship for Early Stage Training awarded to S.M.P., as part of the Framework 6 Programme of the European Commission. K.Z. was supported by a Biomedical Personal Research Fellowship, awarded by the Royal Society of Edinburgh and the Caledonian Research Foundation. A.J.B. was supported by a research grant from the Biotechnology and Biological Sciences Research Council. We thank Penelope Haddrill and three anonymous reviewers for helpful comments on the manuscript.
- Received June 14, 2011.
- Accepted July 25, 2011.
- This journal is © 2011 The Royal Society