## Abstract

First principles of population genetics are used to obtain formulae relating the non-synonymous to synonymous substitution rate ratio to the selection coefficients acting at codon sites in protein-coding genes. Two theoretical cases are discussed and two examples from real data (a chloroplast gene and a virus polymerase) are given. The formulae give much insight into the dynamics of non-synonymous substitutions and may inform the development of methods to detect adaptive evolution.

## 1. Introduction

Halpern & Bruno [1] devised a model to study the divergence of protein-coding genes based on the Fisher–Wright model of mutation, selection and random genetic drift [2,3]. In the model, each particular codon site in the gene is assigned its own set of amino acid fitnesses, and then the Fisher–Wright model is used to work out the evolutionary rate of the site. The model has seen a resurgence in recent years, and variations of it have been used, for example, to study performance of phylogenetic inference methods [4,5], to study codon usage [6] and to estimate the distribution of selection coefficients in protein-coding genes [7,8]. Perhaps surprisingly, the model has not been used to study the dynamics of the non-synonymous to synonymous rate ratio (also known as *ω* = *d*_{N}/*d*_{S}) of protein-coding genes and its significance in the study of adaptive molecular evolution.

The purpose of this note is to propose a way to define and calculate an equivalent of the classical concept of the non-synonymous to synonymous rate ratio, in the context of the mutation–selection model of Halpern & Bruno [1]. It is hoped that by using first principles of population genetics, we can obtain an expression of *ω* as a function of the selection coefficients acting at codon sites in the protein-coding gene. This should provide much insight into the evolutionary dynamics of codon sites and it should be of advantage in the building of statistical models to detect adaptive evolution in protein-coding genes.

## 2. The site-wise mutation–selection model

Consider the evolution of a codon site *k* in a protein-coding gene in a population with *N* haploid genomes. Assume the site is currently fixed for codon *I* (i.e. all *N* alleles carry *I* at site *k*). In the mutation–selection framework [1,8], the substitution rate (the rate at which novel mutant codons *J* appear and eventually become fixed in the population) is2.1Here *μ _{IJ}* is the neutral mutation rate (per generation) from

*I*to

*J*, and

*S*=

_{IJ,k}*F*–

_{J,k}*F*is the selection coefficient in favour of codon

_{I,k}*J*and

*F*= 2

_{J,k}*Nf*is the scaled Malthusian fitness of

_{J,k}*J*. Natural selection affects the relative substitution rate. When the mutation is advantageous (

*S*> 0), the substitution rate is higher than the neutral rate (

_{IJ,k}*q*>

_{IJ,k}*μ*), but if the mutation is deleterious (

_{IJ}*S*< 0), then the substitution rate is reduced (

_{IJ,k}*q*<

_{IJ,k}*μ*). Here, we assume that synonymous substitutions are neutral (

_{IJ}*S*= 0), and thus, evolution at site

_{IJ,k}*k*is determined by 20 amino acid fitnesses. The

*μ*can be constructed from standard DNA substitution models (for example, if

_{IJ}*I*= TTT and

*J*= TTC, then under the HKY substitution model, see [8] for details).

Equation (2.1) describes codon substitution in populations as a continuous-time Markov process. This is sensible if the per generation mutation rate is small compared with the population size (), so that there is little polymorphism in the population, and at most, two alleles segregate at a site at a time. The proportion of time, *π _{I,k}*, that site

*k*spends fixed for

*I*(i.e. the stationary frequency of

*I*) is[8], where is the frequency for a neutrally evolving sequence (i.e. a pseudo-gene). Thus, the substitution rate at

*k*, averaged over time, iswhere the sum is over all codon pairs

*I*≠

*J*. This rate can be partitioned into its non-synonymous and synonymous component rates,

*ρ*=

_{k}*ρ*+

_{N,k}*ρ*, whereand where the indicator function I

_{S,k}_{N}= 1 if the substitution is non-synonymous and = 0 if otherwise. Note that the synonymous rate

*ρ*varies among sites (for example, if a site is conserved for methionine, then the synonymous rate is zero). For a neutrally evolving sequence, the rates are given by

_{S,k}Note that equation (2.1) gives the instantaneous substitution rate, that is, the rate conditioned on site *k* being fixed for *I* at the present time. On the other hand, *ρ _{k}* is the rate at equilibrium, averaged over all codons and weighted by their stationary frequencies.

## 3. The relative non-synonymous substitution rate

The absolute non-synonymous to synonymous substitution rate ratio at site *k* is *ρ _{N,k}*/

*ρ*. However, because synonymous rates vary over sites, we need to normalize the ratio by the site's synonymous rate ratio, , and then normalize by (to correct for the different proportions of synonymous and non-synonymous substitutions at neutrality). This leads to the following definition:3.1

_{S,k}Alternatively, we can define *ω _{k}* as the relative non-synonymous rate

*ω*=

_{k}*cρ*where constant

_{N,k}*c*is set so that the ratio is one for neutrally evolving sequences, that is, under the constraint . The obvious solution is leading to the same definition as above. Note that

*c*has the desirable property of being constant over sites. The reader should not be surprised that the synonymous rate drops out from equation (3.1). When doing statistical inference, synonymous substitutions have information about the neutral mutation rates, and thus inform the value of . Similarly, the relative synonymous rate at site

*k*is3.2Figure 1

*a*shows an example for the

*rbcL*gene of flowering plants. Fitness values were estimated under the Halpern–Bruno model by Tamuri

*et al*. [9], and we use their values to calculate

*ω*and

_{k}*γ*here. The average rates across sites are and . Note that for many sites, synonymous rates are faster than for a neutrally evolving sequence (i.e.

_{k}*γ*> 1). This is owing to the quirky nature of the genetic code coupled with the mutational biases ().

_{k}## 4. The non-synonymous rate during adaptive evolution

When the fitnesses of amino acids are constant through time, sites will spend most of the time fixed for the optimal amino acid. Occasionally, suboptimal amino acids may become fixed, and then substituted after a short period of evolutionary time. This means that the non-synonymous rate at sites is reduced compared with the rate for neutrally evolving sequences (i.e. *ω _{k}* < 1). However, when fitnesses at sites vary over time (for example, after an environment shift or under intense frequency-dependent selection [10]), the non-synonymous rate may be accelerated compared with the rate for neutrally evolving sequences (

*ω*> 1). We now study the case where fitnesses change as an adaptation to a novel environment.

_{k}Consider a site *k* where the fitness of *I* is in environment *A*. The stationary frequencies and instantaneous substitution rates are . Now, imagine that the environment shifts (for example, a population of mammals living in a suddenly colder climate, or a virus colonizing a new host, where the intracellular environment in the new host is different from the reservoir host). The fitness of *I* in the new environment *B* is now . The probability that the site is currently fixed for *I* at the moment of the environment shift is , but the substitution rate is now that of the new environment . Thus, the expected absolute and relative non-synonymous rates *at* the environment shift are4.1

If the shift in fitness values is large, then the rate will be much accelerated (). This occurs because the site is likely to find itself fixed for a suboptimal amino acid in the new environment, and novel mutations to optimal amino acids will become fixed quickly. However, if the fitness shift is moderate, the rate may still be lower than the neutral rate ().

Figure 1*b* shows an example for the *pb2* gene of the influenza virus. Fitness values were estimated under the Halpern–Bruno model by Tamuri *et al*. [9]. A subset of 25 adaptive sites (where fitnesses are different for viruses evolving in human versus avian hosts [8,11]) were identified by Tamuri *et al*. [11], and their fitnesses estimated by Tamuri *et al*. [8]. We use the estimates to calculate *ω _{k}*,

*γ*and here. The classical lineage of human influenza probably originated from a host shift from an avian to a mammal reservoir in the early-twentieth century [12]. We calculate at the putative host shift. The average rate at adaptive sites is (across all sites and ). Note that for 16 sites for which fitnesses are different between hosts, we find that . This indicates that the criterion

_{k}*ω*> 1 to detect adaptive evolution is conservative in this case.

_{k}The probability that the site is fixed for *I*, time *t* after the environment shift is4.2where are the transition probabilities obtained using standard Markov theory, i.e. by calculating . Thus, the absolute and relative non-synonymous rates, time *t* after the shift, areThe transition probabilities in equation (4.2) are exponential decay functions of time, and so is also an exponential decay. Initially, the value of will be high, and as the time goes to infinity, will approach the stationary value given by equation (3.1). In other words, soon after an environment shift, a burst of adaptive substitutions will occur at sites where fitnesses have changed, and substitutions will accumulate until the protein-coding gene reaches a state of adaptive equilibrium. For example, figure 2*a* shows the decay of for the 25 adaptive sites in the *pb2* gene after a host shift.

## 5. Conclusion

Previous authors have shown that the relationship between the non-synonymous rate and the selection coefficient is approximately *ω* = *S*/(1 − exp(−*S*)) [13,14], but the approximation relies either on the infinite-sites model or assumes that all mutant amino acids have the same fitness. Equations (3.1) and (4.1) provide more realistic approximations but are hard to visualize. Consider a site fixed for *I*. The probability that the next mutation will be *J* is for *I* ≠ *J*. Over time, the proportion of *I* to *J* mutations at the site will be *π _{I,k}P_{IJ}*. Thus, the average selection coefficient on mutations at site

*k*is5.1Figure 2

*b*shows

*ω*as a function of for simulated sites when fitnesses are constant or when they shift with the environment. Note that the approximation

_{k}*ω*=

*S*/(1 − exp(−

*S*)) provides a reasonable lower bound on

*ω*. In general, increases with , but the relationship is not as simple as in the previous approximations [13,14].

_{k}In the site-wise mutation–selection model, one calculates the selection coefficients first, and hence one may know whether a site has been under positive selection without calculating *ω _{k}* [8]. However, the model is over-parametrized, computationally expensive, and fitnesses can be well estimated only in large datasets [9]. Instead, the model should be of advantage in evolutionary reasoning and in model building. For example, the behaviour of

*ω*under more complex models (such as frequency-dependent selection [10], adaptation to gradual environment changes [10] or selection on codon usage [6]) can also be studied under the site-wise mutation–selection framework. This will be a worthwhile effort as it will shed light on our ability to detect adaptive evolution in molecular sequences.

_{k}## Data accessibility

The data accompanying with this study are available at Dryad doi:10.5061/dryad.3r3q4.

## Funding statement

M.d.R. is supported by BBSRC (UK) grant no. BB/J009709/1 awarded to Ziheng Yang.

## Conflicts of interest

I have no competing interests.

## Acknowledgements

I thank Ziheng Yang, Richard Goldstein and Asif Tamuri for valuable comments.

- Received December 8, 2014.
- Accepted March 16, 2015.

- © 2015 The Author(s) Published by the Royal Society. All rights reserved.