Chapter 3. Simulation experiments
Simulation models are widely used in the genetic management of ex situ populations for which pedigree data are available [see for example MacCluer et al., 1986; Princée, 1988, 1989b; Lacy, 1994]. These models do not consider the genome structure of the species, and the assumptions made regarding genetic composition generally do not reflect the patterns observed in natural populations. Risks involved in using these models have been discussed in Risks in genetic management in Chapter 1. This chapter presents a study on the effects of assumptions made regarding genetic structures and genetic composition on the variances in expected levels of genetic variation. The ChromoFlow simulation model (see Chapter 2) is used in this study. Models of genome structure vary from a single locus to multiple chromosomes with linked loci. Models of genetic composition vary from 'two alleles - equal frequency' to multiple allelic variants per locus and rare alleles.
Materials and methods
Populations
Two hypothetical populations that were generated with the GSPED model are used in these simulation experiments. These populations are referred to as POP-MAI and POP-FS. Both populations are founded on 32 unrelated individuals (equal sex-ratio). A pair mating system is assumed and each pair produces two offspring (male and female). Parents are assumed to die after they have produced offspring (i.e. no generation overlap is assumed). Breeding pairs in the first population (POP-MAI) are formed for 10 successive generations following Maximal Avoidance of Inbreeding schemes (MAI, see Chapter 5). Pair formation in the second population (POP-FS) is based on an extreme inbreeding scheme. This population is maintained by line breeding, i.e. full-sib matings (see Figure 1), for 10 successive generations.
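The line-breeding scheme described above can be illustrated with a minimal single-locus gene-drop sketch. This is not the GSPED or ChromoFlow implementation; the function names, the coding of alleles as 1 and 2, and the default frequency p1 = 0.5 are assumptions made purely for illustration.

```python
import random


def drop_full_sib_lines(p1=0.5, n_pairs=16, generations=10, rng=None):
    """Drop one two-allele locus through independent full-sib lines.

    Hypothetical sketch: 2 * n_pairs unrelated founders form n_pairs pairs,
    each pair leaves one son and one daughter, and these sibs form the next
    pair. Parents are discarded (no generation overlap).
    """
    rng = rng or random.Random()

    def founder_genotype():
        # Founder alleles are sampled from an infinitely large source population.
        return tuple(1 if rng.random() < p1 else 2 for _ in range(2))

    pairs = [(founder_genotype(), founder_genotype()) for _ in range(n_pairs)]
    for _ in range(generations):
        next_pairs = []
        for sire, dam in pairs:
            # Each offspring receives one random allele from each parent.
            son = (rng.choice(sire), rng.choice(dam))
            daughter = (rng.choice(sire), rng.choice(dam))
            next_pairs.append((son, daughter))
        pairs = next_pairs
    return [genotype for pair in pairs for genotype in pair]


def summarize(genotypes):
    """Return (gene diversity, observed heterozygosity) for one generation group."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    freqs = [alleles.count(a) / len(alleles) for a in set(alleles)]
    gene_diversity = 1.0 - sum(f * f for f in freqs)
    observed_het = sum(1 for g in genotypes if g[0] != g[1]) / len(genotypes)
    return gene_diversity, observed_het


f10 = drop_full_sib_lines(rng=random.Random(1))
he, ho = summarize(f10)
```

Repeating the call with different seeds gives one value pair per run; the spread of those values over many runs is the kind of sampling variance studied in this chapter.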
Genome models
The genome models that are used in the simulations are described in table 3. These models vary in complexity from 'single locus' to 'multiple loci - multiple chromosomes'. The number of loci and the linkage pattern of loci for each model are presented in this table. Models 1, 2 and 5-9 assume various numbers of independent autosomal loci. Models 10 to 14 assume a genome of one chromosome on which 32 loci are equally distributed with an inter-locus distance varying from 1 to 16 cM. Genome models 15 to 19 assume 4 chromosomes with 8 loci each. The inter-locus distance varies in these models from 1 to 16 cM. Genome models are referred to as GNM #x in the various experiments.
Genetic composition models
Models of genetic composition, i.e. the number of alleles per locus and their frequencies as assumed for an infinitely large population, are presented in table 4. ChromoFlow offers the option of incorporating different assumptions regarding alleles for each individual locus; this option has not been applied in the experiments described in this chapter. The number of alleles and their frequencies are the same for all loci that have been defined in a genome. The models of genetic composition that are used in the various simulation experiments will be referred to as FRQ #x.
Experiments
Three series of simulation experiments were undertaken. The experiments in these series are listed in tables 5, 6 and 7. References to the population models (MAI or FS), genome models (GNM) and models of genetic composition (FRQ) are used in these tables whenever appropriate. See tables 3 and 4 for details of GNM and FRQ models, respectively. Experiments that assume single-locus models use 10,000 runs (iterations), whereas multilocus models use 2,000 runs. Gene diversity (He) and the proportion of polymorphic loci (P) in the source population are presented for each experiment. The number of alleles (n) is presented for experiments using multi-allele models.
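The gene diversity of a source population follows directly from the allele frequencies of a composition model. The sketch below assumes the standard definition of gene diversity as expected heterozygosity, He = 1 - sum of squared allele frequencies; the function name is hypothetical, and the frequencies used are the two-allele cases mentioned in this chapter.

```python
def gene_diversity(freqs):
    """Expected heterozygosity for one locus, given its allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


equal = gene_diversity([0.5, 0.5])      # two alleles, equal frequency
skewed = gene_diversity([0.75, 0.25])   # p1 = 0.75
rare = gene_diversity([0.96, 0.04])     # rare-allele model
```

Under this definition the rare-allele model starts from a much lower gene diversity than the equal-frequency model, which is why results are expressed as proportions of the initial value.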
The number of alleles and allele frequencies are varied in the first series of experiments (Exp # 0 - 19). The genome models used in this series are a 'one-locus' model (GNM #1) and a '36 independent autosomal loci' model (GNM #2). Experiments have been conducted on both POP-MAI and POP-FS populations. Further specifications of experiments in the first series are presented in table 5.
The numbers of independent autosomal loci are varied from 2 to 32 in the second series of experiments (Exp #26 - 36). Two alleles with equal frequencies are assumed at each locus in all these experiments. The effect of the number of loci on genetic loss is studied in both POP-MAI and POP-FS populations. The experiments in the second series are listed in table 6.
The third series of experiments (Exp #37 - 55) involves variation in the number of chromosomes and the linkage patterns of 32 loci on these chromosomes in population POP-MAI. Two alleles with equal frequencies are assumed at each locus in these experiments. The loci are equally distributed over the chromosomes. Experiments 37 to 46 involve one chromosome with 32 loci. Four chromosomes with 8 loci each are assumed in experiments 47 to 55. The distance between loci is varied from 1 to 16 cM in these two sub-groups of experiments. Furthermore, each genome model in this series is used in experiments with and without recombination. Table 7 presents a list of experiments in the third series.
Statistical tests
Student's t-test and an approximate t-test are used to test the hypothesis that mean values of genetic measures do not differ significantly between two experiments (α = 0.05) [see Sokal and Rohlf, 1981]. Critical values of the t-distribution for 1999 and 9999 degrees of freedom were computed using a computer algorithm described by Cooke et al. [1985]. Critical values for these degrees of freedom approximate t.05[∞] = 1.960.
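A comparison of two experiment means from their summary statistics might be sketched as follows. This is an illustration only: it assumes a Welch-style statistic with separately estimated variances (which may differ in detail from the approximate t-test of Sokal and Rohlf), it uses the normal limit of the t-distribution because the degrees of freedom are very large, and all function names and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_t_statistic(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Test statistic for two means with separately estimated variances."""
    standard_error = sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / standard_error


def large_sample_critical_value(alpha=0.05):
    """Two-tailed critical value in the limit of infinite degrees of freedom."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def means_differ(mean_a, var_a, n_a, mean_b, var_b, n_b, alpha=0.05):
    """True if the two means differ significantly at level alpha."""
    t = approximate_t_statistic(mean_a, var_a, n_a, mean_b, var_b, n_b)
    return abs(t) > large_sample_critical_value(alpha)


# Hypothetical summary statistics for two experiments of 10,000 runs each.
significant = means_differ(0.90, 0.0145, 10000, 0.80, 0.0145, 10000)
```

With 2,000 or 10,000 runs per experiment the critical value is practically indistinguishable from the limiting value of 1.960 quoted above.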
A two-tailed F-test (α = 0.05) is used to test the hypothesis that population variances of genetic measures do not differ significantly between two experiments [see Sokal and Rohlf, 1981]. Critical values of the F-distribution for sets of degrees of freedom that are based on numbers of simulation runs (2,000 and 10,000) have been computed with an algorithm described by Cooke et al. [1985]. Critical values for a one-tailed probability of 0.025 range from 1.092 (1999 degrees of freedom) to 1.040 (9999 degrees of freedom).
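The variance comparison can be sketched as a simple variance ratio checked against a critical value. The function names are hypothetical, and the critical value is passed in rather than computed; the example uses the value of 1.092 quoted above for experiments of 2,000 runs and the two variances in observed heterozygosity (0.0034 and 0.0019) that are cited later in this chapter.

```python
def variance_ratio(var_a, var_b):
    """F statistic for a two-tailed test: larger variance over smaller."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    return max(var_a, var_b) / min(var_a, var_b)


def variances_differ(var_a, var_b, critical_value):
    """True if the variance ratio exceeds the supplied critical value."""
    return variance_ratio(var_a, var_b) > critical_value


# Two experiments of 2,000 runs each (1999 degrees of freedom per variance).
ratio = variance_ratio(0.0034, 0.0019)
significant = variances_differ(0.0034, 0.0019, critical_value=1.092)
```

Because the critical values are so close to 1 for these numbers of runs, even small relative differences between variances are flagged as significant.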
The terms significant and not significant that are used in the following sections to describe differences between mean values or between variances refer to results of the statistical tests as described above.
Results
Series 1: allele frequencies
The results of experiments in series 1 for genetic variation in the F10 generation are presented in table 9. They are grouped per population and per number of loci (i.e. 1 or 36 independent autosomal loci). These results show the impact of the number of alleles per locus and/or allele frequencies on the arithmetic mean and variance of the various measures of genetic variation. Results of experiments 2, 7, 12 and 17, which involve a rare allele (p1 = 0.96, p2 = 0.04), show a relatively high variance around the mean genetic variation.
Both mean values and variances of these experiments with rare alleles differ significantly from the other experiments in this series. Note that mean gene diversity (He) under line breeding conditions (FS) is larger than mean gene diversity under a MAI breeding scheme. However, observed heterozygosity (Ho) is lower under line breeding than under a MAI breeding scheme (see Table 9).
Figure 2 shows the frequency distributions of classes (of proportions) of initial gene diversity over 10,000 simulation runs that have been constructed for the F10 generation of population POP-MAI for experiments 0 to 2. This figure shows that the dispersion patterns for gene diversity become narrower around the mean (class 18) if initial allele frequencies are equal. The dispersion patterns for models with initial allele frequencies of p1 = 0.5 and p1 = 0.75 (Exp #0 and #1, respectively) are skewed to the left. The distribution pattern of a locus model involving a rare allele (Exp #2) differs from the two previously described models. First, the distribution pattern is scattered from class 0 to 40 (see Figure 2). Second, 51.9 percent of the simulation runs yield values for proportions of initial gene diversity that range from 0 to 0.05 (class 0) while the mean gene diversity for the F10 generation in this simulation experiment is 0.868 (class 17). Furthermore, in 17.9 percent of simulation runs gene diversity exceeds twice the initial gene diversity (class 40).
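The class numbers quoted above are consistent with classes of width 0.05 on the proportion of initial variation, with everything at or above twice the initial value collected in class 40. The sketch below encodes that reading; the exact class boundaries are defined in Chapter 2, so the width, the top class and the function name used here are assumptions inferred from the examples in the text.

```python
import math

CLASS_WIDTH = 0.05   # assumed width of one frequency class
TOP_CLASS = 40       # assumed open-ended class for values >= 2.0


def frequency_class(proportion):
    """Map a proportion of initial genetic variation to its frequency class."""
    if proportion < 0:
        raise ValueError("proportion must be non-negative")
    # The small epsilon guards against floating-point error at class boundaries.
    index = int(math.floor(proportion / CLASS_WIDTH + 1e-9))
    return min(index, TOP_CLASS)


mean_class = frequency_class(0.868)   # mean of Exp #2
lost_class = frequency_class(0.03)    # nearly all gene diversity lost
high_class = frequency_class(2.3)     # more than twice the initial value
```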
The simulation experiments in this series that involve 36 autosomal independent loci (Exp # 10-14 and 15-19) demonstrate that the variances in the mean values of various genetic measures are significantly lower for the larger number of loci. These differences can be observed in both populations POP-MAI and POP-FS. Further effects of numbers of loci on genetic variation are studied in the second series of experiments.
Series 2: number of loci
Results of experiments in which numbers of independent autosomal loci are varied (experiments in series 2) are presented in table 8. A 'two allele - equal frequency' model (FRQ #1) has been used in these experiments. The results in table 8 refer to various measures of genetic variation that have been estimated for the F10 generation of populations POP-MAI and POP-FS, respectively. Variances of these measures decrease with increasing numbers of loci, as already indicated by the results of experiments in series 1 (see Table 9).
Maximal Avoidance of Inbreeding
Results for experiments involving population POP-MAI (Exp #0, 26-30) show that the mean values for gene diversity, observed heterozygosity and proportion of polymorphic loci do not differ significantly between genomes that are composed of different numbers of independent loci. However, major differences can be observed in their variances. Variances decrease significantly with increasing number of loci for all three measures. Frequency distributions of proportions of initial observed heterozygosity and initial gene diversity are presented for the F10 generation in figures 3 and 4, respectively. It can be seen that the dispersion pattern of gene diversity is skewed to the left. The modes of distributions for fewer than 8 loci are larger than their arithmetic means (class 18). Higher numbers of loci result in a mode that equals the arithmetic mean.
Frequency distributions of observed heterozygosity follow a normal dispersion pattern. The mode and arithmetic mean of frequency distributions of models that involve more than one locus both belong to distribution class 19 (see Figure 3). Note that the frequency in this distribution class is zero for a single-locus model. Thus, the observed heterozygosity for this model is either lower or higher than the arithmetic mean. Dispersion patterns become narrower around the mean for increasing numbers of loci.
Line breeding
Experiments 5 and 31-36 involve simulations with an inbred population (POP-FS) using the same models for genomes and genetic compositions as described for the previous experiments. Results of these experiments are also presented in table 8. Note that the mean gene diversity under continuous inbreeding is significantly higher than in experiments under MAI conditions. The mean observed heterozygosity, however, is far lower in inbred populations than in populations under a MAI breeding scheme. Similar trends in variances for gene diversity and observed heterozygosity, as described for POP-MAI, can be observed in the series of experiments with POP-FS. However, variances are significantly smaller under line-breeding conditions than under MAI conditions (e.g. compare experiments 0 and 5). Note that polymorphism (P) under line breeding does not change within 10 generations while polymorphism under a MAI breeding scheme declines slightly (see Table 8). Figures 5 and 6 present frequency distributions for fractions of initial gene diversity and observed heterozygosity as observed in the F10 generation of POP-FS. The dispersion patterns of both gene diversity and observed heterozygosity are narrower (as can be expected from the variances) than under MAI breeding.
Series 3: chromosomes and locus distance
Experiments 38 and 48 involve simulations with 32 loci on 1 chromosome and 32 loci which are equally divided over 4 chromosomes, respectively. No intra-chromosomal recombination is assumed in these experiments. Results, including those of experiment 30 (32 independent loci = 32 chromosomes) are presented for the F10 generation of POP-MAI in table 10. Mean values do not differ significantly between these three experiments. The variance in gene diversity for 32 loci on 1 chromosome (Exp # 38) differs slightly, but significantly, from variances in genome models that involve 4 and 32 chromosomes, respectively. Variances in observed heterozygosity do differ significantly between different genome models. Figure 7 presents the frequency distributions of observed heterozygosity over 2,000 ChromoFlow runs for experiments 30, 38 and 40. Dispersion patterns become narrower around the mean observed heterozygosity with increasing number of chromosomes.
Results of experiments involving recombination are presented for the F10 generation in table 11. The variance in gene diversity in experiment 37 is slightly, but significantly, higher than variances in the other experiments. Variances in observed heterozygosity decrease significantly for increasing inter-locus distances in both 1 and 4 chromosomes models. Frequency distributions of fractions of initial observed heterozygosity that have been maintained in the F10 generation of population POP-MAI are presented, for different inter-locus distances on one chromosome, in figure 8. Results of simulation experiment #30 that involves 32 independent autosomal loci and experiment #38 in which no recombination is assumed (see Table 10) are also presented in this figure. The dispersion pattern of the frequency distribution under no recombination conditions is broader than dispersion patterns under crossing-over.
Dispersion patterns also become narrower around the mean with increasing inter-locus distance.
Discussion
The results of ChromoFlow simulation experiments show that allele frequencies, number of loci, linkage and recombination affect the sampling variances of the various measures of genetic variation. Sampling variances in gene diversity, observed heterozygosity and the number of alleles that are retained in the F10 generation (the last generation included in the simulation experiments) increase if allele frequencies deviate from a uniform frequency distribution (see Exp #0, 1 and 2, Table 9). This occurs under both maximal avoidance of inbreeding (POP-MAI) and line-breeding (POP-FS). Models with rare alleles (p < 0.05) in particular result in high variances. Mean values for these measures can also be significantly lower for loci with rare alleles than for loci with more uniformly distributed allele frequencies (see Table 9). Rare alleles are easily lost during bottlenecks and through genetic drift in small populations [Denniston, 1978; Allendorf, 1986; Fuerst and Maruyama, 1986]. The expected number of alleles (n') that are retained after a bottleneck can be computed by:
$$n' = n - \sum_{j=1}^{n} (1 - p_j)^{2N} \qquad (16)$$
where n is the original number of alleles, pj is the frequency of allele j, and N is the population size after the bottleneck. Figure 9 shows the number of alleles that are expected, using equation 16, to be retained at a single locus in 32 founders for different frequencies of two alleles. The number of alleles that is expected to be retained per locus is 1.93 for rare alleles (model FRQ # 3, p1 = 0.96, p2 = 0.04). This means that 7 rare alleles per 100 loci can be expected to be lost due to the bottleneck or founder effects. After a bottleneck, further loss of rare alleles can be expected in subsequent generations due to genetic drift [Allendorf, 1986; see Chapter 1]. Once alleles are lost they do not contribute to gene diversity or observed heterozygosity. Due to the stochastic nature of genetic drift, rare alleles also have a chance of increasing in frequency. Therefore, gene diversity and observed heterozygosity can even be larger in a generation group than in the source population or previous generation groups (see Figure 2).
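The figure of 1.93 retained alleles can be checked with a few lines of code. The sketch assumes the usual form of the retention formula, in which allele j is lost with probability (1 - pj) raised to the power 2N because 2N gene copies are sampled from the source population; the function name is hypothetical.

```python
def expected_alleles_retained(freqs, n_individuals):
    """Expected number of alleles at one locus that survive a bottleneck.

    freqs: allele frequencies in the source population.
    n_individuals: number of diploid individuals after the bottleneck (N).
    """
    gene_copies = 2 * n_individuals
    # Each allele is lost if none of the sampled gene copies carries it.
    expected_lost = sum((1.0 - p) ** gene_copies for p in freqs)
    return len(freqs) - expected_lost


rare_model = expected_alleles_retained([0.96, 0.04], 32)
equal_model = expected_alleles_retained([0.5, 0.5], 32)
```

For 32 founders the rare-allele model gives a value close to 1.93, whereas the equal-frequency model retains essentially both alleles.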
Numbers of independent autosomal loci affect sampling variances of mean gene diversity and observed heterozygosity. Sampling variances decrease with increasing numbers of loci (see Table 8). Nei and Roychoudhury [1974] and Nei [1987] studied the effect of loci numbers and allele frequencies at those loci on sampling variances. These authors distinguish two components in the sampling variance: the inter-locus variance, which depends on the genetic structure of a population; and the intra-locus variance, which depends on allele frequencies at (sampled) loci in the population. Inter-locus variance decreases when larger numbers of loci are sampled and intra-locus variance decreases when larger numbers of individuals are sampled. The use of inter-locus variance and intra-locus variance may, however, not be that meaningful in this study. Sampling variances of measures of genetic variation in generations following a founder effect are not only due to sampling of founders but also to sampling of alleles from the parental generation groups. Since ChromoFlow includes all individuals of each generation group in the computation of measures of genetic variation, additive sampling variances in these groups are due to the random nature of genetic drift. Variances due to genetic drift in ideal populations can effectively be compared with intra-locus variances, i.e. sampling alleles (individuals) from populations. This comparison is not valid for populations in which the drift process is influenced by non-random mating, as in the populations in this study.
Sampling variances of mean gene diversity and mean observed heterozygosity for different numbers of independent autosomal loci are presented on a semi-logarithmic scale for the F10 generation of POP-MAI (Exp # 0, 26-30) in figure 10.
The relationship between variance (s2) and number of loci (r) in this series of simulation experiments approximates the following equation:
$$s^2 = \frac{s^2[1]}{r} \qquad (17)$$
where s2[1] is the sampling variance as observed for a genome that is composed of one locus. This equation can be used to determine levels of reliability in simulation experiments. For example, the expected sampling variance for gene diversity in experiment 11 (36 loci, p1 = 0.75, p2 = 0.25) can be estimated from the sampling variance that is observed in experiment 1. The estimated and observed sampling variances are both 0.0030.
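The scaling of sampling variance with the number of loci can be sketched as follows. The sketch assumes that the variance for r independent loci with identical genetic composition is the single-locus variance divided by r, which agrees with the worked values given in this chapter; the function name is hypothetical, and the single-locus variance of 0.0145 is the value reported for Exp #0.

```python
def expected_sampling_variance(single_locus_variance, n_loci):
    """Sampling variance expected for n_loci independent, identical loci."""
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    return single_locus_variance / n_loci


# Single-locus sampling variance for gene diversity as reported for Exp #0.
S2_ONE_LOCUS = 0.0145
for r in (1, 2, 4, 8, 16, 32):
    print(r, round(expected_sampling_variance(S2_ONE_LOCUS, r), 5))
```

The printed values only illustrate the assumed relationship; they are not simulation results.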
This relationship between sampling variance and number of loci may only be valid when all loci have the same initial genetic composition. Further studies are required to determine whether equation 17 can be used to estimate expected sampling variances for multiple loci based on the sampling variance in a single locus model that represents the average genetic composition of the genome.
Linkage has a minor effect on the sampling variances of mean gene diversity, as can be observed from the results of simulation experiments in series 3 (see Table 10). A slight (but significant) difference in sampling variances can be observed between the single-chromosome model (Exp #38) and models that involve 4 and 32 chromosomes (Exp #48 and #30, respectively). Sampling variances of mean observed heterozygosity, however, do differ significantly between these different models (see Table 10). The sampling variance for a genome model with one chromosome (Exp #38) is more than twice the value that is observed in a model assuming 32 independent loci (Exp #30). Equation 17 can be used to express the effect of linkage of loci on sampling variance in terms of numbers of independent autosomal loci (with the same genetic composition). Distribution of 32 loci over one chromosome (Exp #38) and distribution over four chromosomes (Exp #48) result in sampling variances of observed heterozygosity that would be expected for 14 and 24 independent autosomal loci, respectively.
Recombination is expected to reduce effects of linked loci on sampling variances. These effects are influenced by both numbers of loci, as discussed previously, and the inter-locus distance. Table 11 shows the effects of inter-locus distance (and recombination) on sampling variances in gene diversity and observed heterozygosity for genomes that are composed of one and four chromosomes. As would be expected from the previous series of experiments (see Table 10), inter-locus distance has hardly any effect on the variance in gene diversity. Sampling variances in observed heterozygosity, however, are affected by inter-locus distances. Figure 11 presents sampling variances of mean observed heterozygosity for different inter-locus distances in both a single chromosome and a four chromosome model. The values are presented on a semi-logarithmic scale. This figure shows that sampling variances decrease with increased inter-locus distances. An exponential relation between inter-locus distance and sampling variance can be observed, particularly in the "four chromosome model".
Recombination and inter-locus distance can have a relatively large impact on the sampling variance in observed heterozygosity. For example, recombination in a single chromosome model with an inter-locus distance of 2 cM reduces the sampling variance in observed heterozygosity from 0.0034 to 0.0019 (see Exp #39 and #38 in Tables 11 and 10, respectively). Inter-locus distances that are larger than 8 cM override the effects of loci distribution over multiple chromosomes on sampling variances. Sampling variances do not differ significantly between a single chromosome model and a four chromosome model (see Exp #43 and #53 in Table 11, respectively). The "behaviour" of linked loci that are uniformly distributed over four chromosomes can, for inter-locus distances of 16 cM and more, be compared with 32 independent autosomal loci (see Exp #55 and #30 in Table 11). Effects of linkage can be expressed in terms of numbers of independent autosomal loci that result in the same sampling variance. This measure is named effective loci number or re in this study, and can be approximated for (complex) genome models by deriving the following equation from equation 17:
$$r_e = \frac{s^2[1]}{s^2}$$
where s2[1] and s2 are the sampling variances for gene diversity as observed in a genome that is composed of one locus and as observed in the simulation experiment, respectively. For example, re of a full linkage model involving 32 loci (Exp #38; s2 = 0.0008) approximates 18 loci based on a sampling variance of 0.0145 in a one-locus model (Exp #0). The validity of this equation may be restricted to models that assume no differences in initial genetic composition of loci (see discussion on Equation 17).
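The worked example for the effective loci number can be reproduced with a short sketch. It assumes that re is the ratio of the single-locus sampling variance to the sampling variance observed in the experiment, which is the form that yields the value of about 18 loci quoted above; the function name is hypothetical.

```python
def effective_loci_number(single_locus_variance, observed_variance):
    """Number of independent loci that would give the observed variance."""
    if observed_variance <= 0:
        raise ValueError("observed_variance must be positive")
    return single_locus_variance / observed_variance


# Exp #0 (one locus) versus Exp #38 (32 fully linked loci), gene diversity.
r_e = effective_loci_number(0.0145, 0.0008)
```

Because the reported variances are rounded to four decimals, the result should be read as an approximation rather than an exact count of loci.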
Results of this study show that sampling variances of measures of genetic variation are affected by genome models and genetic composition models that are assumed in simulation experiments. Although sampling variances may differ significantly between various models, these values do not necessarily indicate whether to accept associated risks. Sampling variances are indicative of the probability that values for measures of genetic variation deviate from mean values. However, statistical tests to determine confidence limits that are based on variances require distribution patterns to follow a normal distribution [see Sokal and Rohlf, 1981]. Since the distribution pattern of gene diversity is skewed to the left (see Figure 4), sampling variances cannot be used to determine confidence limits for this measure. Alternatively, probabilities that values of genetic variation are lower than mean values can be computed from frequency distributions of individual simulation runs as provided by ChromoFlow. Table 12 presents such a risk assessment for gene diversity and observed heterozygosity in the F10 generation of population POP-MAI under different genome models. A 'one locus - two allele model' with equal allele frequencies is assumed in these experiments. This table also combines mean values and sampling variances as presented in previous tables.
Mean values for gene diversity and observed heterozygosity in these experiments range in the lower limits of classes 18 and 19 (see Table 12; see Chapter 2 for details on frequency classes). The "lower classes" have been defined as values that belong to classes 16 and 17 or lower for gene diversity and observed heterozygosity, respectively. The upper limits of these classes correspond to values that range between 94 and 95 percent of the mean for both genetic measures. Table 12 presents the experiments in declining order of probabilities that gene diversity belongs to "lower classes". Clear trends for gene diversity, similar to those discussed previously for sampling variances, can be observed within series of experiments and between series of experiments.
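The probability that a run falls in the "lower classes" can be estimated as the fraction of runs whose frequency class does not exceed a chosen class. The sketch below is an illustration under stated assumptions: classes are taken to be 0.05 wide with an open-ended class 40 (the definition in Chapter 2 may differ), the function names are hypothetical, and the run values are made-up numbers rather than ChromoFlow output.

```python
import math


def frequency_class(proportion, width=0.05, top_class=40):
    """Assumed class scheme: equal-width classes with an open-ended top class."""
    return min(int(math.floor(proportion / width + 1e-9)), top_class)


def probability_in_lower_classes(run_values, highest_lower_class):
    """Fraction of simulation runs that fall in or below the given class."""
    if not run_values:
        raise ValueError("run_values must not be empty")
    hits = sum(1 for v in run_values
               if frequency_class(v) <= highest_lower_class)
    return hits / len(run_values)


# Made-up proportions of initial gene diversity from ten runs.
runs = [0.81, 0.84, 0.91, 0.93, 0.96, 0.97, 0.88, 0.92, 0.94, 0.98]
# Class 16 is the highest "lower class" for gene diversity in the text above.
risk = probability_in_lower_classes(runs, highest_lower_class=16)
```

Because this estimate is a simple proportion of runs, its precision depends directly on the number of runs, which matters most for the small probabilities of interest here.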
Evaluation of these results, however, depends on the level of probability that is considered acceptable. For example, a genome model that assumes 8 independent autosomal loci (Exp #28) involves a probability of 0.05 that gene diversity will belong to "lower classes" (see Table 12). Increasing the number of independent autosomal loci to 16 or more results in probabilities of 0.015 or lower (Exp #29 and #30). Experiments involving 32 autosomal loci show that recombination results in probabilities that range from 0.001 to 0.005 (see Table 12). Full linkage of 32 loci (Exp #38), however, increases the probability to 0.015. Probabilities that observed heterozygosity belongs to "lower classes" are considerably higher than those for gene diversity (see Table 12). Genome models that result in low probabilities for gene diversity still result in probabilities for observed heterozygosity varying from 0.023 to 0.110 (Exp #55 and #38, respectively). Trends of declining probabilities between series of experiments are less obvious than those observed for gene diversity. For example, full linkage of 32 loci (Exp #38) results in a higher probability (0.110) than a genome model involving 16 independent autosomal loci (Exp #29; 0.086). These effects may be due to sampling errors, implying that the number of simulation runs needs to be increased to study observed heterozygosity.
Assumptions made for genome models and genetic composition can affect sampling variances in simulation models. This, consequently, increases the probability that genetic variation in individual simulation runs is lower than the mean value over the total number of runs. Genetic variation in real populations can be compared, within the context of simulation experiments, to the result of a single run. Therefore, it is important to determine whether effects of assumptions, as mentioned above, result in the risk that levels of genetic variation in real populations are lower than mean values in simulations. Two different categories can be distinguished within these risks: (1) risks that mean values, as determined by simulation models, deviate significantly from expected values; and (2) risks that the actual genetic variation that has been retained in populations is lower than the mean value as determined in simulation experiments.
Evaluation of the first risk category, obviously, needs to assume that algorithms and their implementation in computer models are correct. Risks that mean values in simulation models deviate from expected values are determined by sampling errors due to the number of simulation runs. For example, ChromoFlow experiments with rare alleles (i.e. frequencies of rare alleles < 0.05) result in significantly lower mean values than obtained for more uniformly distributed allele frequencies (see Table 9). The number of simulation runs required to reduce sampling errors needs to be determined empirically, i.e. by increasing the number of runs until sampling variances no longer decrease significantly.
The second risk category is determined by the actual genome structure of the species involved, genetic composition of the population and models that are used in simulation experiments. ChromoFlow experiments show that sampling variances and, therefore, risks that the genetic variation that has been retained is lower than expected, decrease with increasing number of loci. Numbers of loci that are included in ChromoFlow experiments represent a small fraction of the loci in natural genomes. This means that sampling variances in real populations can be considerably smaller than in simulation experiments. Note that the sampling variances of the genetic variation that has been retained in generation groups are due to genetic drift and inbreeding. Although linkage increases sampling variances (and associated risks), the effective loci number of a natural genome is expected to be much larger than the number of loci assumed in this study. Therefore, effects of linkage may be ignored in the simulation models that are used to estimate the genetic variation that is retained in generation groups.
Mean values for genetic measures do not differ significantly between 'single locus' and 'multiple loci' models (see Table 12). This implies that even assuming a 'single locus' model in simulation experiments would be sufficient to determine the actual genetic variation whenever sampling variance in the actual genome is expected to be low. However, ChromoFlow experiments show that sampling variances, in models that assume non-uniform distributions of allele frequencies, can be significantly larger than in models that assume a uniform distribution (see Table 9). This means that sampling variances in real populations may not be ignored whenever allele frequencies are not uniformly distributed. Consequently, genetic variation in such populations may be lower than expected from simulation models. Effects of non-uniform distribution of allele frequencies are not only relevant within the context of genetic composition of the source population. Allele frequencies will change in generation groups due to genetic drift. Figure 12 shows sampling variances for gene diversity in generation groups of population POP-MAI for a 'single locus' model that assumes two equally distributed alleles in the source population. It is, therefore, recommended to use 'multiple loci' models regardless of the initial genetic composition. Models that reflect the genetic composition of loci in real populations are preferred. Such models can be based on results of biochemical techniques like electrophoresis [see Genetic variation in populations]. Since such data may not be available for most species (or populations), simulation experiments with 'multiple loci' models that are based on general theoretical models of the distribution of allele frequencies in populations could be carried out. Alternatively, series of experiments involving a 'single locus' model, in which allele frequencies are varied, may enable assessment of minimal and maximal risks that actual genetic variation is lower than expected.
Results of the ChromoFlow experiments can also provide guidelines for studies that involve assessment of genetic variation on the basis of real genotypes (e.g. through electrophoresis) that are considered selectively neutral. These studies may be required in the management of endangered species populations for which pedigree data are incomplete. The number of loci involved in studies such as protein electrophoresis is generally limited [see Genetic variation in populations]. This implies that results of such studies are subject to sampling errors. Inter-locus and intra-locus variances determine sampling errors in assessments of genetic variation in founder populations [Nei and Roychoudhury, 1974; Nei, 1987]. Furthermore, non-random sampling of specimens from the original population affects the sampling errors in results of genetic studies. Information on original genetic variation is relevant in the context of management that aims at maintenance of at least 90 percent of the original genetic variation in ex situ populations [Soulé et al., 1986].