Kang Huang1, Rui Mi1, Derek W Dunn1, Tongcheng Wang1, Baoguo Li2,3. 1. Shaanxi Key Laboratory for Animal Conservation, College of Life Sciences, Northwest University, Xi'an 710069, China. 2. Shaanxi Key Laboratory for Animal Conservation, College of Life Sciences, Northwest University, Xi'an 710069, China baoguoli@nwu.edu.cn. 3. Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China.
Abstract
Parentage analysis is an important method that is used widely in zoological and ecological studies. Current mathematical models of parentage analyses usually assume that a population has a uniform genetic structure and that mating is panmictic. In a natural population, the geographic or social structure of a population, and/or nonrandom mating, usually leads to a genetic structure and results in genotypic frequencies deviating from those expected under the Hardy-Weinberg equilibrium (HWE). In addition, in the presence of null alleles, an observed genotype represents one of several possible true genotypes. The true father of a given offspring may thus be erroneously excluded in parentage analyses, or may have a low or negative LOD score. Here, we present a new mathematical model to estimate parentage that includes simultaneously the effects of inbreeding, null alleles, and negative amplification. The influences of these three factors on previous model are evaluated by Monte-Carlo simulations and empirical data, and the performance of our new model is compared under controlled conditions. We found that, for both simulated and empirical data, our new model outperformed other methods in many situations. We make available our methods in a new, free software package entitled parentage This can be downloaded via http://github.com/huangkang1987/parentage.
Parentage analysis is an important method that is used widely in zoological and ecological studies. Current mathematical models of parentage analyses usually assume that a population has a uniform genetic structure and that mating is panmictic. In a natural population, the geographic or social structure of a population, and/or nonrandom mating, usually leads to a genetic structure and results in genotypic frequencies deviating from those expected under the Hardy-Weinberg equilibrium (HWE). In addition, in the presence of null alleles, an observed genotype represents one of several possible true genotypes. The true father of a given offspring may thus be erroneously excluded in parentage analyses, or may have a low or negative LOD score. Here, we present a new mathematical model to estimate parentage that includes simultaneously the effects of inbreeding, null alleles, and negative amplification. The influences of these three factors on previous model are evaluated by Monte-Carlo simulations and empirical data, and the performance of our new model is compared under controlled conditions. We found that, for both simulated and empirical data, our new model outperformed other methods in many situations. We make available our methods in a new, free software package entitled parentage This can be downloaded via http://github.com/huangkang1987/parentage.
THE use of genetic markers to investigate the relationships between individuals is common in studies of animal populations (Goodnight and Queller 1999), and various methods have provided much insight into animal reproductive biology and population structure that would be difficult or impossible to obtain from observation alone (Kalinowski ). The most common of these techniques, parentage analyses, enables researchers to obtain data on mating systems (Monteiro ), social organization (Garber ), reproductive success (Gerzabek ), multi-generational survival (Cremona ), sexual selection (Johannesson ), and kin selection (Dias ).Current parentage analysis methods assume that population genotypic frequencies accord with those of the Hardy-Weinberg equilibrium (HWE) (Marshall ; Kalinowski ). Such an assumption implies the absence of both close inbreeding (due to mating between relatives, such as siblings) and pervasive inbreeding (due to genetic drift in a finite population or population subdivision) (Wang 2011). Therefore, current methods only allow the frequency of a genotype, or the transitional probability from a parental genotype to an offspring’s genotype, to be calculated based on the HWE and basic Mendelian inheritance, but do not allow for the inclusion of both inbreeding factors.Both artificial and natural populations are finite and are usually genetically structured. Mating is also usually confined to a subset of individuals within a population (Wang 2011). Thus, both types of inbreeding (close and pervasive) may exist, and more extreme forms of close inbreeding, such as back-crossing, may also be present. Depending on the mating system and other ecological factors, a false father may be a potential mate of the true mother. For close inbreeding, the false father may be related to the true mother. For pervasive inbreeding, the false father may come from the same population as the true mother. Hence, a false parent may share identical-by-descent (IBD) alleles with the offspring, and may thus be mistakenly identified as the true parent.In addition, microsatellites are the most frequently used genetic marker for parentage analyses, but null alleles are pervasive in microsatellite markers (Kalinowski ; Ravinet ). Such alleles cause two types of genotyping problems: (i) a homozygote fails to be amplified, where is a null allele; (ii) a heterozygote is mistyped as a homozygote where is a visible allele (Wagner ). These incorrect genotypes can be problematic for parentage analyses, because such genotyping errors can mistakenly reject a true parent due to an observed lack of the shared alleles with the offspring (Blouin 2003). Moreover, negative amplification reduces the accuracy of parentage analysis because of the loss of genotypic data. When the genotype of an individual fails to be amplified, all genotypes at this locus in a duo or a trio will be discarded from the analysis.In this paper, we consider the effects of inbreeding, null alleles, and negative amplification in a parentage analysis. We will first extend the model of Kalinowski to an alternative model, so as to accommodate the effects of these three factors. Second, we use a simulated dataset to evaluate the influences of these three factors on the model of Kalinowski , and the performance is also compared with those of our alternative model and another model presented by Wang (2016). Finally, we use a real microsatellite genotyping dataset to test and compare four applications using the models of Kalinowski , Wang (2016), and our new model in natural situations. Our model can be applied to any codominant markers that may be affected by inbreeding and/or null alleles. We make available a free software package entitled parentage v1.0, which can be downloaded via http://github.com/huangkang1987/parentage.
Theory and Modeling
Genotypic frequencies
Under the HWE, alleles appear randomly within a genotype according to their frequencies. The frequency of a genotype G can be expressed as a piecewise function:where and denote the and alleles, respectively, which are different identical-by-state (IBS) alleles, and and are their frequencies.If the inbreeding in a population is more frequent than random, the homozygosity of a population is increased. We use the inbreeding coefficient f (also known as Wright’s ) to measure the degree of inbreeding, which is defined as the correlation between the frequencies of two alleles within an individual. According to Equation 1, the frequency of G in the presence of inbreeding is given byIn the presence of null alleles, for an observed genotype (denoted by ) the actual genotype may be a heterozygote or a homozygote where is a null allele. If an observed genotype has no any detected alleles, it is termed negative, denoted by Let be the frequency of the null allele According to Equation 2, the frequency of an observed genotype in the presence of both inbreeding and null alleles isFurthermore, in the presence of null alleles and negative amplification, a negative observed genotype may arise from either a null allele homozygote or a negative amplification (Kalinowski ). Let β be the negative amplification rate. Then, under the three factors: inbreeding, null alleles, and negative amplification, Equation 3 should be modified as follows:
Procedures of parentage analysis
There are three typical categories of parentage analysis: (i) identifying the father when the mother is unknown; (ii) identifying the father when the mother is known; and (iii) identifying the father and mother jointly. The procedures of a parentage analysis are roughly as follows:For each of the first two categories, two hypotheses are established: the first hypothesis is that the alleged father is the true father, denoted by the alternative hypothesis is that the alleged father is not the true father, denoted by For the third category, “father” needs to be altered to “parents” in hypotheses andGiven a hypothesis H, the likelihood is defined as the probability of some observed data given H, written as Returning to and in the previous paragraph, we refer to the logarithm of the ratio of to as the LOD score (abbreviated to LOD); symbolically in other words, Moreover, a positive LOD score means that the first hypothesis is more likely to be true than the second hypothesis Similarly, a negative LOD score means that is more likely to be true thanMarshall provided a statistic Δ for resolving paternity. Let LOD1 and LOD2 be the LOD scores of the most-likely and the next most-likely alleged fathers, respectively, and let n be the number of all alleged fathers. Then, Δ is defined as follows:A separate statistic Δ has to be calculated for each individual offspring.Monte-Carlo simulations are subsequently used to assess the confidence level of each value of Δ. The symbol Δ0.99 represents the threshold of Δ to reach the correct assignment rate of 99%, in the sense that, if it implies up to a confidence level of 99%. In other words, the proportion 99% of assignments is correct if
The Ka-model
Kalinowski developed a model of parentage analyses, called the Ka-model for short, which accommodates the effect of genotyping errors. This model consists of two likelihood formulas (see Equation 5 below), together with the rules and methods for a general parentage analysis.As stated in the previous section, the procedures of using Ka-model to conduct a parentage analysis are as follows: (i) calculating and (ii) calculating LOD and Δ, (iii) finding the thresholds of Δ, and (iv) using the values obtained in the previous three steps to determine the significance of the parentage analysis.We will here use the first category in a parentage analysis (i.e., identifying the father when the mother is unknown) as an example to show how to calculate the likelihoods and with the consideration of genotyping errors. The two likelihoods in Ka-model are expressed as the following formulas:where e is the genotyping error rate, and are respectively the observed genotypes of the offspring and the alleged father, and Pr are their frequencies, and is the transitional probability from to .Denote , and for the genotypes of the offspring, the alleged father and the true father, respectively. For the term both genotypes of the offspring and the alleged father are assumed to be correctly genotyped, then and Therefore, the expression can be rewritten as when holds.Under the assumption of the HWE, for the genotype one allele is randomly inherited from the parent, and the other is randomly sampled from the population according to the allele frequencies. Then, the transitional probability can be expressed aswhere
and are non-IBS alleles,
and are their frequencies. According to the above analyses, the final formula can be used to calculate the value of in Equation 5.The genotyping error can be considered as the replacement of the true genotype with a random genotype at a probability of e. The conditional probability of an observed genotype given a genotype G is given byThus, the genotyping error does not change the observed genotypic frequencies; in other words, Because any null alleles, negative amplification and inbreeding are not considered in the Ka-model, can be directly calculated by Equation 1, and so the values of and in Equation 5 can be obtained.
Alternative forms of likelihoods
The likelihoods and in Equation 5 can be obtained by taking the weighted sum of products of the corresponding frequencies of and conditional on their genotypes, with their genotypic frequencies as the weights. Then, Equation 5 can be rewritten to the following alternative forms:where
and are, respectively, taken from all possible genotypes of the offspring, the true father, the true mother, and the false father; is the joint distribution of and and is the transitional probability from and to Additionally, the three conditional probabilities in Equation 7 can be calculated by Equation 6.In the absence of inbreeding, if matings are random, then the genotypes and will be independent to each other. ThusAdditionally, if the genotypes of both parents are known, then the distribution of the genotype of offspring can be derived by Mendelian segregation in the sense that each parent randomly contributes one allele to the genotype of an offspring. Thus,where is a possible offspring’s genotype, in which is the allele copy in and is the allele copy in and is a Kronecker operator, such that if and if
Schemes of our model
Our model will be established by giving several likelihood formulas based on Equation 7. We use the first category in a parentage analysis as an example to describe the scheme of establishing our model. For the second and third categories, the schemes are presented in Appendices A and B.In order to simultaneously accommodate the effects of both inbreeding and null alleles together with negative amplification, Equation 7 needs to be modified by replacing the probabilities with those under these effects, including the genotypic frequencies (e.g., ), the transitional probability and the conditional probabilities (e.g., ). It is noteworthy that the alternative hypothesis should be modified if inbreeding is involved. For close inbreeding, a false father may be a relative of the true mother; for pervasive inbreeding, a false father may be sampled from a same population as the true mother. The genotype of a false father may therefore be either dependent or independent of the true mother in both of these inbreeding scenarios. For identifying an alleged father, the hypothesis will thus imply two possibilities: (i) he is unrelated to the true parents and to the offspring, denoted by (ii) he is a relative of the true mother, denoted byBecause we lack a priori information (e.g., information about the pedigree, mating system, population origin, or allele frequency of each population), we cannot determine which of the alternatives and is most likely. We thus define the likelihood of as the geometrical mean of and Then, using Equation 7, the likelihoods of and can be calculated by the following formulas:where and are the joint distributions of genotypes of mates, and is the transitional probability from to and is conditional on an inbreeding coefficient of f. These joint distributions and the transitional probability will be derived in the next section. Moreover, we can calculate by Equation 8, and and by Equation 2. Additionally, the values of
and can all be calculated by the following formula:where and are two Kronecker operators, in which is the observed genotype of G accounting for the effect of null alleles (without accounting for the effects of genotyping error and negative amplification), whose expression is
Joint distributions of genotypes
In the presence of close inbreeding, both parents will be related to each other. Their genotypes will thus be correlated. For example, always shares an IBD allele with in backcrossing. It is noteworthy that there are several forms of mating, e.g., self-fertilization, and matings between parents and offspring (backcrossing), full-siblings, half-siblings, or other relatives, such that the joint distributions of genotypes of a parent pair among these forms of mating will differ, even if their inbreeding coefficients are equal.Jacquard (1972) defined nine configurations of IBD alleles between two individuals (denoted by see Figure 1), and used a vector to measure the degree of relationship between two individuals in the presence of inbreeding. Where consists of nine elements, whose element
represents one probability that the four alleles at a single locus in two diploid individuals share the configuration of IBD alleles (Milligan 2003). Table 1 summarizes the values of elements in for various mating forms. For example, if the mating form is selfing, then consists of the elements in the SE row (the top row) in Table 1, denoted by
i.e.,
Figure 1
Configurations of IBD alleles between two diploids. For each configuration, we denote the upper two dots for the two alleles of one individual, and the lower two dots for those of the other individual. Moreover, two dots connected by a line indicate that two alleles are IBD.
Table 1
The values of δ between mates in different mating forms
Relation
D1
D2
D3
D4
D5
D6
D7
D8
D9
s
SE
f
λ3
2f1+f
PO
f2
λ1
λ1
λ2
4f(1+f)2
FS
14f+12f2
14f2
12λ1
14λ1
12λ1
14λ1
14λ3
12λ2
14λ2
8f2+3f+f2
HS
12f2
12f2
12λ1
12λ1
12λ1
12λ1
12λ2
12λ2
8f(1+f)2
FC
14f2
34f2
14λ1
34λ1
14λ1
34λ1
14λ2
34λ2
16f(1+f)2
HFC
18f2
78f2
18λ1
78λ1
18λ1
78λ1
18λ2
78λ2
32f(1+f)2
SC
116f2
1516f2
116λ1
1516λ1
116λ1
1516λ1
116λ2
1516λ2
64f(1+f)2
NR
f2
λ1
λ1
λ2
Where λ1 = f(1−f), λ2 = (1−f)2, λ3 = 1−f and s are the proportion of offspring produced by the corresponding inbreeding form under the equilibrium state (assuming that only the outcrossing can happen except for this inbreeding form). SE: self; PO: parent-offspring; FS: full-sibs; HS: half-sibs; FC: first-cousins; HFC: half first-cousins; SC: second-cousins; NR: nonrelatives.
Configurations of IBD alleles between two diploids. For each configuration, we denote the upper two dots for the two alleles of one individual, and the lower two dots for those of the other individual. Moreover, two dots connected by a line indicate that two alleles are IBD.Where λ1 = f(1−f), λ2 = (1−f)2, λ3 = 1−f and s are the proportion of offspring produced by the corresponding inbreeding form under the equilibrium state (assuming that only the outcrossing can happen except for this inbreeding form). SE: self; PO: parent-offspring; FS: full-sibs; HS: half-sibs; FC: first-cousins; HFC: half first-cousins; SC: second-cousins; NR: nonrelatives.We will use the symbols to denote the vector consisting of the elements in the rows from the top to bottom in Table 1, respectively. Similarly, the symbols are used to denote the proportions of offspring in an inbred population that are the results the corresponding mating forms. Now, the degree of relationship between mates can be measured by the weighted average Let Then, the inbreeding coefficient in the next generation can be expressed asIf only one inbreeding form occurs at the proportion s, then Equation 11 can be simplified. For example, if there is only the occurrence of selfing or backcrossing with or then and Equation 11 becomes or Under such condition, Equation 12 can be written asUnder the equilibrium state, such that the proportion s can be solved. The solutions of s for each inbreeding form are listed in the right-most column in Table 1.Table 1 reveals that the elements in the five rows determined by PO, HS, FC, HFC, and SC are proportional with the ratio This shows that there are many similarities among the five inbreeding forms: parents-offspring, half-siblings, first-cousins, half first-cousins, and second-cousins. We therefore chose backcrossing as the representative form (which ensures ), and add the value to . Hence, the last equation becomesUnfortunately, there are still three unknown inbreeding proportions
and in Equation 13, whose solutions are not unique (f is regarded as a constant). In order to obtain a unique solution, some constraints have to be added to this equation according to relevant a priori knowledge of the focal population.If is known, the expression of the joint distribution of and iswhere, for every n, the value of is listed in Table 2.
Table 2
Joint distribution of genotypes under different IBD configurations
IBD configuration
Genotypic template
Pr(GF,GM|Dn)
D1
AiAi,AiAi
pi
D2
AiAi,AjAj
pipj
D3
AiAi,AiAj
pipj
D4
AiAi,AjAk
(2−Kj,k)pipjpk
D5
AiAj,AiAi
pipj
D6
AiAj,AkAk
(2−Ki,j)pipjpk
D7
AiAj,AiAj
(2−Ki,j)pipj
D8
AiAj,AiAk
pipjpk
D9
AiAj,AkAl
(2−Ki,j)(2−Kk,l)pipjpkpl
For each genotypic template, if the alleles are with the same subscript, then they are IBD alleles, otherwise they are IBS or non-IBS alleles. If A and A are IBS alleles, then K = 1, otherwise K = 0.
For each genotypic template, if the alleles are with the same subscript, then they are IBD alleles, otherwise they are IBS or non-IBS alleles. If A and A are IBS alleles, then K = 1, otherwise K = 0.Assuming that the false parents are relatives of the true parents of opposite sexes, and let various joint distributions of genotypes of parent pairs be the same as thenHere, is the genotype of the false mother.Thus far, the joint distributions of genotypes in Equation 9 have been derived. Next, we derive the transitional probability in Equation 9. By the generalized product rule of probabilities, can be rewritten asTherefore, the conditional probability is made available, and will be used to calculate the transitional probability:
Allele frequency estimator
In this section, we develop a novel estimator to estimate the allele frequencies in the presence of inbreeding and negative amplification, which is a modification of Summers and Amos (1997) estimator.Suppose that there are altogether k visible alleles. Denote for the number of observed genotypes consisting of the visible allele
and for the number of visible observed genotypes. Because and can be obtained directly from the observed genotypic data, their ratio is a constant. Also, according to Equation 3, and assuming that the rate of negative amplification is independent of the genotype, this ratio can be expressed aswhere f and are known, in which the inbreeding coefficient f is an a priori value, these are obtained by the average estimate of the inbreeding coefficients from Nei (1977) estimator from all polymorphic loci. Then, Equation 15 is a quadratic equation, with as the unknown, whose solution iswhere The latter solution should be excluded because it is outside of the rangeA half-interval search algorithm is used to estimate the allele frequencies, the procedure of which is described as follows.Set the initial minimum and maximum values of at and respectively.Substitute with in the former solution to Equation 15 (where and in this solution will be regarded as and ), and then find the value ofTest the value of If this is greater than, or equal to, zero, then update with otherwise update withRepeat steps (ii) and (iii) until the difference is less than a threshold, e.g., 10−12.The final values of
in the above procedures are the estimates of allele frequencies.We now consider the estimation of the negative amplification rate β. Denote for the true sample size (i.e., the number of observed genotypes excluding those with negative amplification). By Equation 3, the ratio of to is thenThus, the estimate can be obtained by substituting for in the final expression. The estimate of β can therefore be calculated by where is the total number of individuals.
Data availability
Genotyping data used to test the model’s efficiency may be found at doi: 10.5061/dryad.689v4.The software parentage V1.0, user manual and example dataset are available on GitHub (http://github.com/huangkang1987/parentage). Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7221965.
Results
Evaluation
In this study, we use Monte-Carlo simulations to generate the observed genotypic data and to perform parentage analyses for four typical applications. The influences of the following three factors on the Ka-model are evaluated: inbreeding and null alleles either each singly or in unison. The performance of both our model and an additional model, named the Wa-model (Wang 2016), under the same conditions are compared with that of the Ka-model. We also use the empirical data published by Nietlisbach to test and compare the accuracy of all three models under natural conditions.
Simulated data
In order to evaluate the influences of the three factors under scrutiny (inbreeding, null alleles, and negative amplification), we first set some levels for the inbreeding coefficient f or for the null allele frequency For null alleles, we set and 0.05, 0.15, or 0.3, where the four values of represent the minimum, low, medium, and high levels, respectively. For inbreeding, we set and 0.05, 0.15, or 0.3. For inbreeding and null alleles jointly, we set 0.05, 0.15, or 0.3.For the first two categories in a parentage analysis (i.e., identifying the father when the mother is either unknown or known), each is designated its own application [named Application (i) or (ii)], with 100 alleged fathers randomly generated for each offspring. For the third category (i.e., identifying the father and mother jointly), we also designate two applications [named Applications (iii) and (iv)]. For Application (iii), the sexes of the alleged parents are known, and 100 alleged fathers and 100 alleged mothers are randomly generated for each offspring; for Application (iv), the sexes of the alleged parents are unknown, and 100 alleged parents with the predefined sex ratio of are generated for each offspring.For each application, 1000 offspring and their true and alleged parents are simulated. The observed genotypes of all individuals are generated at 4–16 unlinked loci. Based on these observed genotypes, parentage analyses are performed by either the Ka-model or by our model with three different thresholds (0, and ) of Δ, or by the Wa-model with three different thresholds (0, 0.80, and 0.99) of posterior probability. The performance of each of these three models are presented in two graphical formats. For the Ka-model, the graphs of the correct assignment rate as a function of the number of loci under four applications and under different levels of and/or f are shown in Figure 2. For these three models, each correct assignment rate is shown by a part of the overlapped bar charts (see Figure 3 in detail, and Supplemental Material, Figures S1 and S2 shows the results with the thresholds and ). Here, a correct assignment means that the true parents have been assigned correctly and there is either a Δ value or a posterior probability above the corresponding threshold.
Figure 2
The influence of inbreeding and null alleles on the Ka-model in each of four applications. Each column denotes one application. The top (or middle) row shows the effect of null alleles (or inbreeding). Each factor has four levels (i.e., or 0.05, 0.15, or 0.3) and the corresponding results are shown by solid, dashed, dash-dotted, and dotted lines, respectively. Each line is a graph of the correct assignment rate as a function of the number L of loci. The bottom row shows the influence of the effect of both null alleles and inbreeding acting simultaneously. Each factor also has four levels ( and f are set as equal, i.e., 0.05, 0.15, or 0.3), with the line styles of different levels the same as for the previous rows.
Figure 3
The correct assignment rates of our model, the Ka-model and the Wa-model as a function of number L of loci under 10 different levels of null allele and inbreeding. Each column denotes an application. Each row shows all correct assignment rates with the same level (the value representing this level is listed in the subfigure located in the rightmost column). Every correct assignment rate is shown by a part of the overlapped bar charts. The results of the Ka-model are shown by the red bars, and those of both the Wa-model and our model by the green and blue bars, respectively. The bars with light, medium, and bright colors denote the correct assignment rates with the thresholds 0, and for our model and the Ka-model, or with the thresholds 0, 0.8, and 0.99 for the Wa-model, respectively.
The influence of inbreeding and null alleles on the Ka-model in each of four applications. Each column denotes one application. The top (or middle) row shows the effect of null alleles (or inbreeding). Each factor has four levels (i.e., or 0.05, 0.15, or 0.3) and the corresponding results are shown by solid, dashed, dash-dotted, and dotted lines, respectively. Each line is a graph of the correct assignment rate as a function of the number L of loci. The bottom row shows the influence of the effect of both null alleles and inbreeding acting simultaneously. Each factor also has four levels ( and f are set as equal, i.e., 0.05, 0.15, or 0.3), with the line styles of different levels the same as for the previous rows.The correct assignment rates of our model, the Ka-model and the Wa-model as a function of number L of loci under 10 different levels of null allele and inbreeding. Each column denotes an application. Each row shows all correct assignment rates with the same level (the value representing this level is listed in the subfigure located in the rightmost column). Every correct assignment rate is shown by a part of the overlapped bar charts. The results of the Ka-model are shown by the red bars, and those of both the Wa-model and our model by the green and blue bars, respectively. The bars with light, medium, and bright colors denote the correct assignment rates with the thresholds 0, and for our model and the Ka-model, or with the thresholds 0, 0.8, and 0.99 for the Wa-model, respectively.The procedures used to generate the observed genotypes are as follows. First, L unlinked loci are created, and the allele frequencies at all loci are equal. In order to accelerate the simulation, we reduce the number of alleles at a locus and modify their frequencies used in Kalinowski . We then use the loci with six visible alleles to perform our simulation, and the vector of visible allele frequencies is set as In the presence of null alleles, each frequency of visible alleles is multiplied by to unify the allele frequencies.In order to simulate inbreeding, the true parents should be regarded as relatives, so their genotypes are not independent. Hence the genotypes of true parents are generated via Equation 12. Thereafter, the genotypes of offspring are generated by Equation 8. The false parents may be related to the true parents of the opposite sex, and their genotypes are generated by Equation 14. The three proportions , and are assumed to be equal, i.e., their ratio is When our model is applied to perform parentage analysis, we will also use this ratio as the relative ratio among the corresponding three mating forms. In other words, we will use this ratio as a constraint for Equation 13, then there is a unique solution of Equation 13 as follows:Finally, the generated genotypes are converted into observed genotypes. Each generated genotype is randomly replaced with a false genotype according to Equation 2 at a probability e to simulate the genotyping error. Next, to account for the presence of null alleles, the genotype obtained after the previous step is converted to an observed genotype according to Equation 10. Furthermore, this observed genotype is randomly set as at a probability β to simulate the effect of negative amplification. The negative amplification rate β and the genotyping error rate e are set as 0.05 and 0.01, respectively. All alleged parents are sampled in our simulation.The generated observed genotypes are used to perform parentage analysis. Unfortunately, because the false parents are assumed to be relatives of the true parents, the alleles carried by the true parents will appear at a higher frequency than their true frequencies, which will bias the allele frequency estimation. In order to avoid this bias, 100 nonrelatives are generated according to Equation 2, and their observed genotypes are converted by using the same method described above, and are used to estimate the allele frequencies.For both the Ka-model and the Wa-model, the allele frequencies are estimated by counting the numbers of alleles without considering the effects of both null alleles and negative amplification. For our model, these frequencies are estimated by our new allele frequency estimator, and the three proportions
and are estimated from the inbreeding coefficient f according to Equation 16. Because we do not develop an estimator to estimate the inbreeding coefficient f under null alleles and negative amplification, f is estimated by the Nei (1977) estimator. Additionally, the true values of both genotyping error rate e and sampling rate of true parents are used in all models.For the Wa-model, we write the individual observed genotypes and the allele frequency estimates together with other necessary parameters into a file, named *.dat, according to the input file format of colony V2.0.6.4. After calling colony2p.exe by a command-line mode, we read the results from the output files. colony uses a different algorithm to perform parentage analysis: by evaluating the likelihood of pedigrees, it searches the optimal full- and half-sibs families (Wang 2016). This algorithm neither performs a simulation to obtain the thresholds of Δ, nor calculates the LOD scores. Instead, it uses the posterior probability as an indicator of confidence. Therefore, three thresholds (0, 0.8, and 0.99) of the posterior probability are used to denote three levels of confidence, where a threshold of posterior probability equal to 0 means that the alleged parent(s) with the highest posterior probability is chosen. The mating system for both sexes is assumed to be polygamous, and allele frequencies are not updated during iteration. The rates of two genotyping errors (allelic dropout, and all other errors involved in genotyping) are both assumed to be equal to the true value of 0.01.In addition, to evaluate the performance of the Nei (1977) estimator relative to our allele frequency estimator, an extra 100,000 simulations are performed. In each simulation, the observed genotypes at 10 loci of 100 nonrelatives are generated by Equation 2. These observed genotypes are used to estimate the inbreeding coefficient, the negative amplification rate, and the null allele frequency. We use bias and SD to evaluate the accuracy of each inbreeding coefficient, negative amplification rate and null allele frequency.The estimation of allele frequency uses the estimate of the inbreeding coefficient, which may introduce some errors. To account for possible effects of the inbreeding coefficient estimator on the accuracy of the estimated parameters, and to explore the potential of our allele frequency estimator, we also use the true value of f to perform simulation. The corresponding results are used for comparison.
Simulated results
The influences of both inbreeding and null alleles on Ka-model in the four applications with a Δ > 0 are shown in Figure 2. Because the distribution of genotypes of the alleged parents are deviated from the HWE, the threshold of or cannot ensure a confidence level of 80 or 99%. Therefore, we do not show the results involving these thresholds in Figure 2. Although these results are shown in Figure 3 for reference, these are not analyzed nor discussed. It is clear that both inbreeding and null alleles significantly affect the accuracy of our parentage analysis.In the presence of null alleles, the curves inside each subfigure are nearly equally spaced. Application (i) is relatively less affected, the values of correct assignment rate (denoted by c) are decreased at most 0.2 when while those values are decreased at most 0.3 for Application (ii), and at most 0.4 for Applications (iii) and (iv).In the presence of inbreeding, Application (ii) is barely affected, although the values of the correct assignment rate c are slightly increased (at most 0.03) when For the remaining applications, the values of c are slightly decreased when but they are greatly reduced when f increases from 0.15 to 0.3.When null alleles and inbreeding are both present, the influences of both factors are cumulated, and the performance are greatly affected. The curves in all applications become increasingly flat as f and also increase.The results of the Ka-model, the Wa-model and our model for the four applications are presented in Figure 3. In order to compare the performance of all models, we consolidate the results of each of the three models under the same conditions, and we use bar charts to show the various correct assignment rates.In the absence of both inbreeding and null alleles, our model and the Ka-model perform similarly, although some small changes in the estimated allele frequencies result in small differences between their correct assignment rates. The Wa-model performs well in all applications when However, when L is small, the Wa-model performs a little worse than our model and the Ka-model in Applications (i) and (ii), with the highest differences between the correct assignment rates being 0.03 and 0.06, respectively. The Wa-model performs even worse in Applications (iii) and (iv), with the highest differences increasing to 0.12 and 0.14, respectively.In the presence of null alleles, our model outperforms the Ka-model, especially when and for Applications (iii) and (iv), which resulted in maximum differences between correct assignment rates of up to 0.15. In Applications (i) and (ii), both Wa- and Ka-models perform similarly. However, the Wa-model performs worse than the Ka-model in Applications (iii) and (iv), with differences being ∼0.1–0.2.In the presence of inbreeding, our model and the Ka-model perform similarly if Our model performs a little worse than the Ka-model in Application (ii), but more or less better in Applications (i) and (iv) if or in Application (iii) if The Wa- and Ka-models perform similarly in Applications (i) and (ii). However, the Wa-model performs worse than Ka-model in Applications (iii) and (iv), for example, if the maximum differences between the correct assignment rates being ∼0.35 and 0.5, respectively.In the presence of null alleles and inbreeding, our model outperforms the other two models in most cases, with the Ka- and Wa-models performing similarly in Applications (i) and (ii). However, both Ka- and Wa-models are strongly negatively affected in Application (iii) and (iv). For instance, if the correct assignment rates for both models are ∼75 and 35% of those for our model, respectively.The evaluation results of our allele frequency estimator are shown in Table S1. The presence of null alleles introduces an overestimation of the inbreeding coefficient using Nei (1977) estimator. The SD of
and are slightly increased as both f and also increase, respectively. The bias of becomes extremely large as increases, and reaches 0.25 when The bias of is also affected by and reaches 0.1 at If the true value of the inbreeding coefficient is used in simulation, these biases are considerably reduced (at most 0.02, Table S1).
Empirical data
We used microsatellite genotyping data for a population of song sparrows (Melospiza melodia) on Mandarte Island, Canada (Nietlisbach ), to test the efficiency of our model. These data are available at doi: 10.5061/dryad.689v4. The song sparrow is a medium-sized passerine bird, native to North America. The past breeding density of this population has fluctuated 18-fold due to two major population crashes (Keller ). In 1988–1989, only four adult females and seven adult males were present (Keller ). This resulted in inbreeding in this population due to a bottle-neck effect, and across the 2364 birds whose all four grandparents were genetically verified during 1993–2013, the mean inbreeding coefficient was 0.087 (Nietlisbach ).The dataset of Nietlisbach contains the genotypes of 3301 individuals at 209 microsatellite loci. There were 3186 individuals whose father or mother was genotyped, and these were used as offspring in the parentage analysis. The fathers and mothers of these individuals were recorded from long-term pedigree data, but the sexes of the offspring are not given. The average inbreeding coefficient estimated by the Nei (1977) estimator based on 199 autosomal loci is 0.074, which is used as an a priori inbreeding coefficient in our model.Because the microsatellites used by Nietlisbach were less polymorphic than those used in the simulation, and because their genotyping ratios were also lower, we used more loci to perform the parentage analysis. We scanned the 199 autosomal microsatellites and chose two subsets of loci. Subset one consisted of the loci with the highest estimated null allele frequencies. Subset two consisted of haphazardly selected loci, that were chosen by first ranking all loci alphabetically, and then selecting the top 40 name-ranked loci. The indices of the genetic diversity of the selected loci are shown in Tables S2 and S3. The average numbers of visible alleles are 8.175 and 8.725, respectively, the average genotyping ratios are 68.28 and 72.98%, respectively, the average estimated inbreeding coefficients are 0.333 and 0.097, respectively, and the average estimated null allele frequencies are 0.200 and 0.053, respectively.Following the definition of the above applications for the simulated data, four similar applications are considered as follows: (I)–(II) identifying one parent when the other is unknown (6284 cases, including 3162 for identifying father and 3122 for identifying mother), or when the other parent is known (6196 cases); (III)–(IV) identifying jointly the father and mother in which the sexes of the candidate parents are both known (3098 cases), or unknown (also 3098 cases). Here, the hypotheses and in Applications (I) and (II) need to be modified as follows: the alleged parent is the true parent or is not the true parentIn Applications (I)–(IV), all males or females are included as either the alleged fathers or the alleged mothers, whereas an offspring itself is excluded from the pool of alleged parents in each case. The average numbers of candidate parents for each case are 290 in Applications (I) and (II), and 581 in Application (IV). For Application (III), the average numbers of candidate fathers and mothers in each case are 297 and 284, respectively.We use 5–40 loci to perform our parentage analysis and use the correct assignment rate to measure the efficiencies of each of the three models. For Applications (III) and (IV), the identification is considered as correct when both parents are correctly identified.For the Ka- and Wa-models, the allele frequencies are estimated by counting the numbers of alleles without considering the effects of both null alleles and negative amplification. For our model, an a priori inbreeding coefficient is set as 0.074, and the allele frequencies (including the null allele frequency) and the negative amplification rate are both estimated simultaneously.Three thresholds (0, and ) of Δ are obtained by using a Monte-Carlo simulation (Marshall ). In each application, 100,000 offspring are generated, and the number of alleged parents for each offspring is taken from the average number of alleged parents. For the Ka-model, the false parents are nonrelatives of the true parents. For our model, inbreeding is assumed to be present due to backcrossing (because this mating form represents many inbreeding forms with reduced relatedness between mates, it is probably the common form of inbreeding), the false parents are related to the true parents of the opposite sex, the genotyping rate is equal to the average genotyping rate among the loci currently being used, the sampling rate is equal to one, and the genotyping error rate is assumed to be 0.01. For the Wa-model, the configuration of colony is identical to that of the simulated data.
Empirical results
In the four applications, the results of the parentage analysis of all three models for subset 1 are shown in Figure 4 (Figures S3 and S4 show the results with the thresholds and ), and those for subset 2 are shown in Figure S5 (Figures S6 and S7 show the results with the thresholds and ). Because some individuals are typed at only a few loci (e.g., 783 individuals are typed at four loci), the true parents cannot easily be identified. Therefore, each curve for the correct assignment rates in the range from 0.5 to 0.7 reaches near to an asymptote.
Figure 4
Results of the parentage analysis using the dataset of Nietlisbach , in which the loci chosen are with the highest estimated null allele frequency. Each row denotes an application. The definitions of bars together with their colors are as for Figure 3.
Results of the parentage analysis using the dataset of Nietlisbach , in which the loci chosen are with the highest estimated null allele frequency. Each row denotes an application. The definitions of bars together with their colors are as for Figure 3.Figure 4 also shows that our model outperforms both the Ka- and Wa-models in all four applications, especially in Applications (III) and (IV). In Applications (I) and (II), for ∼65–80% of loci, our model achieves similar levels of accuracy as the Ka-model. Similarly, according to the simulation results, Applications (III) and (IV) are more sensitive to the presences of both inbreeding and null alleles, and, for ∼55–70% of loci, our model achieves similar levels of accuracy as the Ka-model.For subset 2, the average estimated null allele frequency (0.053) and the average estimated inbreeding coefficient (0.097) are both low (Table S3). Therefore, the performance of our model for subset 2 is not so good as that for subset 1, which is consistent with our simulated data. However, our model still a little outperforms both the Ka- and the Wa-models. For example, the correct assignment rates in Applications (I) and (IV) for our model are at most 3.2 and 7.4% higher than for the Ka-model (Figure S5), respectively.
Discussion
Impacts of inbreeding and null alleles
Both inbreeding and null alleles can cause serious problems in parentage analyses (Wagner ; Wang 2011). In the presence of null alleles, the genotypes of parents and offspring may be mismatched (Brookfield 1996). In addition, a null allele homozygote may also be treated as a negative observed genotype, and hence it is omitted from any likelihood calculations (Marshall ; Kalinowski ). In the presence of inbreeding, genotypic frequencies become deviated from the HWE and the genotypes of both parents are not independent. Moreover, the false parents may also be potential mates and relatives of the true parents of the opposite sex, who may also share IBD alleles with the offspring.With our computer simulations, we found that even a small inbreeding coefficient (e.g., 0.05) or a small null allele frequency can result in a large reduction in the correct assignment rate (up to 0.15; Figure 2). In the presence of inbreeding and/or null alleles, the information given by the genotyping data are reduced, and so more loci should be used in order to reach the same level of accuracy. For example, 180% additional loci are required to reach the correct assignment rate of 50% for the Ka-model if the inbreeding coefficient and null allele frequency are both 0.3 (Figure 2).
Corrections for inbreeding and null alleles
In the process of establishing our model, we made several modifications to the Ka-model. These included the actual genotypic and observed genotypic frequencies, joint distribution of parental genotypes, conditional probability of parental genotypes, alternative hypotheses, alternative forms of likelihood calculation, and allele frequency estimations.The performance of our model using both computer generated and empirical data were also evaluated under the same conditions as the Ka-model. The results showed almost ubiquitous improvement except for some situations with only few alleles. Although our model is still affected negatively by either the presence of inbreeding and/or null alleles, it is still able to recover much information. Importantly, our new method requires at least 55% of all loci to attain an equal degree of accuracy as the Ka-model (Figure 3).Compared with the effects of inbreeding, our model performs better in the presence of null alleles. In the Ka-model, negative amplification is not considered so any negative observed genotypes are ignored in the calculation of the likelihoods. For example, in Application (i), if the observed genotype of either the offspring or the alleged father at a locus is negative, and if such an observed genotype is ignored, then the likelihood at this locus is omitted. This omission is equivalent to the likelihoods of and at this locus being set to one, which will result in the overestimation of both likelihoods and a bias of the LOD score.A negative observed genotype is similar to a visible allele homozygote, representing one of several possible genotypes. In our model, negative amplification is considered and each negative observed genotype is treated as a normal observed genotype. In our alternative forms of likelihoods, each possible genotype is weighted according to its probability (either conditional, prior or joint), such that the likelihoods considering any negative amplification and any negative observed genotypes can be calculated.
Alternative hypothesis
Among the four applications used to test the efficiency of the Ka-model, the first two (identifying the father when the mother is either known or unknown) are both affected to a relatively lesser degree (Figure 2). The latter two (identifying jointly the two parents when the sexes of the candidate parents are either known or unknown) are more sensitive to the effects of inbreeding and/or null alleles (Figure 2).The scheme of the Ka-model also contributes to the relatively poor performance for Applications (iii) and (iv) [or (III) and (IV)]. Here, the hypothesis that the alleged parents are the true parents, is evaluated relative to the alternative hypothesis that the alleged parents are unrelated to the offspring. However, in this scheme, the scenario that one alleged parent is a true parent while the other is not is not considered.We give two additional events to this scheme, and use the geometrical mean of the corresponding likelihoods as the likelihood of From the validation of both simulations (Figure 3) and the empirical data (Figure 4), the performances after the scheme has made the appropriate correction are significantly improved.In this paper, we develop a novel estimator to estimate the allele frequencies in the presence of both inbreeding and negative amplification, which is a modification of Summers and Amos (1997) estimator. This estimator estimates the allele frequency and negative amplification rate separately. The allele frequencies are first estimated, with only the visible observed genotypes being used. This approach can eliminate the impact of negative amplification, because the ratio has nothing to do with β under the assumption that the negative amplification is independent of the genotypes. The negative amplification rate is subsequently estimated from the estimates of these allele frequencies.Our allele frequency estimator assumes that the inbreeding coefficient has an a priori value, and we thus use the Nei (1977) estimator to estimate f; however, this estimator does not consider either negative amplification or null alleles, the errors are accumulated during allele frequency estimation. Therefore, the biases of and are high when f and are also high (0.102 and respectively, Table S1). However, our model still works well and outperforms both the Ka- and Wa-models in many cases. If the true value of the inbreeding coefficient is used in simulation, these biases will be largely reduced (at most 0.02, Table S1) which suggests that our allele frequency estimator has considerable potential to improve estimations of both allele frequencies and parentage analysis.
Pervasive inbreeding
Although we only consider close inbreeding, pervasive inbreeding will also have a similar influence on the parentage analysis in two ways: (i) the genotypic frequencies deviate from the HWE, which will bias the likelihood estimate; (ii) the candidate parents may be sampled from the same population as the true parents, who may share the same IBD alleles as the true parents and the offspring, which will result in an overestimation of the LOD score of the false parents and interfere with our analysis.These problems are solved by a two-step process: (i) estimating the allele frequency for each population, and then (ii) using the local allele frequencies to calculate the likelihoods. Unfortunately, due to dispersal among populations, for an individual, the natal population may not be the same as the sampled population. Although the natal population can be calculated by Bayesian clustering (Pritchard ) or by population assignment (Peakall and Smouse 2012), the results may be unreliable in cases in which the population structure is weak (e.g., Wright’s ). Moreover, the estimation of local allele frequencies will also be inaccurate due to the limited sample size. Hence, this two-step process can result in an accumulation of errors.The level of pervasive inbreeding can be measured by Wright’s (Wang 2011). An alternative approach is to consolidate both and into a single parameter. Using the formula the effects of pervasive inbreeding can be incorporated into our model. The genotypic frequency in the total population and by incorporating both types of inbreeding can be expressed asWright’s is a measure of the correlation of gene frequencies among all individuals in the total population. Comparing Equation 17 with Equation 2, both have the same form, but the applied range of Equation 17 is wider.Similar to the method for applying Equation 2, the joint distribution of parental genotypes or the conditional probability of the alleged parental genotypes in the total population, can now be derived. This alternative method only makes a slight change to our model, but it can be applied without involving both an estimation of the local allele frequencies and an identification of the natal population of each individual. This will thus prevent the accumulation of errors. However, when the population structure is strong (i.e., is large), Equation 17 cannot accurately predict the genotypic frequency. Meanwhile, if the sample size is large, the natal population of each individual can be accurately obtained and the initial approach will perform better than the alternative approach.