TREEVOLVE v. 1.3

a program to simulate the evolution of DNA sequences under different population dynamic scenarios

N. C. GRASSLY AND A. RAMBAUT

Wellcome Centre for Infectious Disease, Department of Zoology, Oxford University, South Parks Road. Oxford. U.K.

e-mail: nicholas.grassly@zoo.ox.ac.uk
WWW: http://evolve.zoo.ox.ac.uk/
Tel: +44 1865 271263

Contents

Introduction
Methods
Implementation
Running treevolve
Interpreting output
Registration
Acknowledgements
References


INTRODUCTION

treevolve will simulate the evolution of DNA sequences under a coalescent model (Kingman 1982a,b,c), which allows exponential population growth, population subdivision according to an island model (Wright 1931), migration and recombination. In addition different periods of population dynamics can be enforced at different times. For example, a period of exponential growth can be followed by a period of stasis where the population is subdivided into demes. Multiple sets of such simulated sequence data can then be compared to sequence data sampled from a population of interest using suitable statistics, and various evolutionary hypotheses concerning the evolution of this population tested.

Methods

The coalescent model implemented extends the work of (Hudson 1983; Takahata 1991) and is documented in Grassly and Rambaut (in prep.). Gene genealogies are generated under the chosen population dynamic model(s) (with or without population subdivision, migration and exponential growth), with the specified recombination rate r. When r=0 this genealogy is a simple bifurcating tree, whereas with recombination a network is produced (Figure 1).

Figure 1: A gene genealogy subject without (A) and with (B) recombination.

The coalescent process operates backwards in time and is terminated when the number of lineages in the sample equals one. A recombination event resulting in the production of two parents can occur anywhere along the sequence, and in the model implemented here according to a uniform distribution. However some recombination events may produce a parent without any genetic material ancestral to our sample. Such events are therefore ignored, and discounted for by the algorithm.

Coalescent events result in a reduction of the number of lineages by one, and if the population is subdivided, can only occur when the sampled lineages are in the same deme. All demes are assumed to be of equal size. If the migration rate is low, then waiting times between coalescent events can be long. If the exponential growth rate is high, and the migration rate low, as we go backwards in time sampled lineages may become 'stranded' in demes of very small size. In such cases the coalescent approximation breaks down, and hence it was determined below an arbitrary deme size all lineages would instantly coalesce.

Molecular sequences are evolved down the genealogy under the chosen substitution model. The number of substitutions down any branch is Poisson distributed with a mean proportional to the branch length. The models implemented for nucleotide substitutions are F84 (Felsenstein and Churchill 1996), HKY85 (Hasegawa, et al. 1985) and the general reversible model, REV (Lanave, et al. 1984; Rodriguez, et al. 1990). The former two models are conceptually identical, but are implemented within different mathematical frameworks, and require a transition: transversion (ts: tv) ratio and base frequencies as parameters. The latter model allows any time reversible substitution model to be implemented and uses six relative rate parameters, and base frequencies. Other models such as K2P(Kimura 1980), F81(Felsenstein 1981) and JC69(Jukes and Cantor 1969) are special cases of HKY and can be obtained by setting the nucleotide frequencies equal (for K2P) or the transition transversion ratio to 1.0 (for F81) or both (for JC69).

Rate heterogeneity of the evolutionary process along the sequences is also allowed, and modelled by a discrete gamma model (Yang 1996). This requires an additional parameter , which governs the shape of the gamma distribution. The smaller the value of , the greater the rate heterogeneity.

Implementation

treevolve is written in ANSI C and should compile on most UNIX systems and workstations. The source and header files are: const.c, const.h, const.sub.c, const.sub.h, coseq.c, coseq.h, exp.c, exp.h, exp.sub.c, exp.sub.h, mathfuncs.c, mathfuncs.h, models.c, models.h, nodestack.c, nodestack.h, treevolve.c, treevolve.h. They will also compile using Metrowerks Codewarrior on the Apple Macintosh and a project file and compiled 'fat' executable are included in the Macintosh archive.

Compilation on UNIX

A simple Makefile is included in each package. You should edit this and change the name of the compiler (by default this is 'cc') and add any flags for optimisation on your machine. Once this is done, type:make

and the program will compile giving an executable program: 'treevolve'

Compilation on Macintosh

The Apple Macintosh version of treevolve is compiled with Metrowerks Codewarrior. Only two extra lines of code are required to allow this program to compile on the Mac. These are included in the source code but will only be added to the programs if compiled under the Codewarrior compiler. The effect of these lines is to bring up a dialog box when the program is run that allows the user to redirect the input and output by clicking on the radio buttons. This is the equivalent of the UNIX '< input_file' and '> output_file' redirection.

Figure 2: The pop-up dialogue box you will see on running the Macintosh version of treevolve.

Warning: Memory requirements increase as the number of sequences and also the recombination rate increases. More memory can be allocated to treevolve either on compilation or by using the 'Get Info' dialogue box accessible within Finder.

Running treevolve

On UNIX machines treevolve can be run by typing:

treevolve <parameter_file >output

otherwise double click the icon in the Macintosh window.

The parameter file should contain all the parameter settings that are to be set to something other than the default values (some of these parameters are similar to the command line parameters entered in earlier versions of this program). Each parameter is denoted by one or two letters, and is usually followed directly by the value to which this parameter is to be set. This is best explained with reference to an example parameter file:

BEGIN TVBLOCK
[sequence length] l1000
[sample size] s30
[mutation rate] u0.0000001
[number of datasets to generate] n1
[substitution model] vF84 t1.0 [f0.3,0.2,0.3,0.2] [r0.1,0.1,0.5,0.2,0.1,1.0]
[rate heterogeneity] [c0.2,0.2,1.0] [a0.1 g4]
[output coalescent times] oCoal.Times
[haploid]
[generation time/var. offspring number] b1.0

*PERIOD 1
[length of period] t100000.0
[population size] n1000000 e0.0
[subdivision] d1 m0.000001
[recombination] r0.0
*END

*PERIOD 2
[population size] np e0.001
[subdivision] d1 [m0.000001]
[recombination] r0.0
*END

 

Global parameters

The parameter file must begin with the phrase 'BEGIN TVBLOCK'. All comments in square brackets are ignored. The first four parameters are self explanatory (for their default values see Table 1). The substitution models are abbreviated by F84, HKY and REV respectively and follow m. If F84 or HKY are selected then the ts: tv ratio and base frequencies can be entered. If the REV model is selected, then the relative rate parameters should be entered after r, separated by commas. The relative rate parameters should be in the order: A-C, A-G, A-T, C-G, C-T, G-T.

If a model of rate heterogeneity along the sequences is to be implemented, then there is a choice of codon-position-specific rates, or discrete-gamma distributed rate heterogeneity, but not both. The former can be implemented by entering three relative rate parameters preceded by c. The latter by giving the number of categories for the discrete gamma approximation (after g), and the shape parameter after a.

If a separate file listing the time to the most recent common ancestor (MRCA) for each simulated dataset is required, the filename can be specified after an o. If a haploid model is required this can be selected by simply typing haploid or h in the parameter file.

The last global parameter, b, is a compound parameter specifying the rate of coalescence. It is the generation time of sampled individuals divided by the variance in their offspring number. The smaller the parameter, the more rapid the rate of coalescence. This makes intuitive sense, since if the generation time is short there are more chances for coalescence per unit time. Furthermore, if the variance in offspring number is large, there is a larger chance of coalescence since there is a bigger probabilty that two sampled individuals have the same parent in the previous generation. Typically Wright-Fisher reproduction is assumed (Fisher 1930; Wright 1931), where generations are non-overlapping and produced by sampling with replacement from the previous generation. In this case the variance in offspring number equals one. In addition, time is often measured in units of generations and other rate parameters are given per generation, and in this case the compound parameter b can be ignored (set to one).

Local parameters

Different periods of population dynamics can be specified in *PERIOD blocks. These should be numbered in the order they are to be encountered going backwards in time (the *PERIOD marker should be followed by the period number). The end of each block is denoted by *END. Within these blocks a variety of parameters can be specified, or left to their default values (Table 1). The length of time for which each period runs can be specified by t, unless it is desired that the period run until all lineages have coalesced.

The effective population size of each deme should follow an n. If the number of demes (d) is set to one then this is the overall size of a panmictic population. If there is more than one deme then a migration rate can be set following an m. If it is required that n is set to the previous period population size, then it can be followed by a p. If exponential growth of that period is required. the exponential growth rate can be entered following an e. The recombination rate can also be set for each period (r).

Table 1 Default parameter values
Parameter
Default value
sequence length, l
500
sample size, s
20
mutation rate, u
1.0 x 10 -7
number of replicates, n
1
substitution model, m
F84
ts: tv ratio, t
2.0
base frequencies, f
equal (0.25)
rate heterogeneity
none (if gamma selected, default a = 1.0)
ploidy
diploid
generation time/ var. in offspring no., b
1.0
run time for population dynamic period, t
infinity
effective population size, n
1.0 x 10 6
exponential growth rate, e
0.0
number of demes, d
1
migration rate, m
0.0
recombination rate, r
0.0

 

Interpreting output

On running treevolve prints the various parameter settings to stderr (the screen unless redirected). At the conclusion of the coalescent calculations for each sample, the number of recombination, migration and coalescent events are output. Sequence evolution ensues and the sequences are output to stdout (the redirected output) in sequential PHYLIP format (Felsenstein 1993). This format is suitable for reading into most sequence analysis packages.

Registration

Please e-mail me to register the fact that you are using treevolve. The programs are free, but registration enables me to e-mail you concerning any updates or bug reports. If you should discover a bug please let me know!

Acknowledgements

Many thanks to Ziheng Yang, who allowed us to use some invaluable code from PAML.

References

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376.
Felsenstein, J. (1993). PHYLIP, version 3.5c. University of Washington, Seattle.
Felsenstein, J., and G. A. Churchill. (1996). A hidden markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104.
Fisher, R. A. (1930). The genetical theory of natural selection. Clarendon Press, Oxford.
Hasegawa, M., H. Kishino, and T. Yano. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
Hudson, R. R. (1983). Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183-201.
Jukes, T. H., and C. R. Cantor. (1969). Evolution of protein molecules, p. 21-123. In H. N. Munro (ed.), Mammalian Protein Metabolism. Academic Press, New York.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Kingman, J. F. C. (1982a). The coalescent. Stoch. Process. Appl. 13:235-248.
Kingman, J. F. C. (1982c). Exchangeability and the evolution of large populations, p. 97-112. In G. Koch, and F. Spizzichino (ed.), Exchangeabilty in probability and statistics. North-Holland, Amsterdam.
Kingman, J. F. C. (1982b). On the genealogy of large populations. J. Appl. Prob. 19A:27-43.
Lanave, C., G. Preparata, C. Saccone, and G. Serio. (1984). A new method for calculating evolutionary substitution rates. J Mol. Evol. 20:86-93.
Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. (1990). The general stochastic model of nucleotide substitution. J.Theor. Biol. 142:485-501.
Takahata, N. (1991). Genealogy of neutral genes and spreading of selected mutations in a geographically structured population. Genetics. 129:585-595.
Wright, S. (1931). Evolution in Mendelian populations. Genetics. 16:97-159.
Yang, Z. H. (1996). Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.