Thore Egeland, Antonio Salas.
Abstract
A variety of forensic, population, and disease studies are based on haploid DNA (e.g. mitochondrial DNA or Y-chromosome data). For any set of genetic markers, databases of conventional size will normally contain only a fraction of all haplotypes. For several applications, reliable estimates of haplotype frequencies, the total number of haplotypes and the coverage of the database (the probability that the next random haplotype is contained in the database) will be useful. We propose different approaches to the problem based on classical methods as well as new applications of Principal Component Analysis (PCA). We also discuss previous proposals based on saturation curves. Several conclusions can be inferred from simulated and real data. First, classical estimates of the fraction of unseen haplotypes can be seriously biased. Second, there is no obvious way to decide on the required sample size based on traditional approaches. Methods based on hypothesis testing or the length of confidence intervals may appear artificial since no single test or parameter stands out as particularly relevant. Rather, the coverage may be more relevant since it indicates the percentage of different haplotypes that are contained in a database; if the coverage is low, there is a considerable chance that the next haplotype to be observed does not appear in the database, and this indicates that the database needs to be expanded. Finally, freeware and example data sets accompany the methods discussed in this paper: http://folk.uio.no/thoree/nhap/.
Year: 2008 PMID: 19098988 PMCID: PMC2602601 DOI: 10.1371/journal.pone.0003988
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Performance of different approaches to the estimation of the number of different haplotypes based on simulated data.
| θ | N | Sample | N* 1 | N* 2 | N* 3 | D | C* |
| 10 | 66 | 50 | 23 | 26 | 28 | 19 | 0.84 |
| 10 | 66 | 100 | 33 | 45 | 56 | 28 | 0.84 |
| 10 | 66 | 400 | 50 | 57 | 62 | 45 | 0.88 |
| 10 | 66 | 1600 | 67 | 74 | 79 | 61 | 0.84 |
| 50 | 222 | 50 | 64 | 91 | 124 | 33 | 0.52 |
| 50 | 222 | 100 | 72 | 90 | 103 | 52 | 0.72 |
| 50 | 222 | 400 | 136 | 162 | 178 | 115 | 0.83 |
| 50 | 222 | 1600 | 191 | 215 | 227 | 175 | 0.89 |
| 100 | 412 | 50 | 215 | 506 | 1173 | 43 | 0.20 |
| 100 | 412 | 100 | 148 | 172 | 194 | 71 | 0.48 |
| 100 | 412 | 400 | 234 | 293 | 339 | 179 | 0.76 |
| 100 | 412 | 1600 | 349 | 410 | 444 | 311 | 0.87 |
A total of n = 5000 profiles were sampled from the coalescent for varying θ (= 2Eμ, where μ is the mutation rate per gene per generation and E is the effective population size). The column ‘N’ gives the number of different haplotypes in this sample, the quantity to be estimated. The column ‘Sample’ shows the sample sizes used. Next follow the estimators N* 1, N* 2 and N* 3. D gives the number of different observed haplotypes and is followed by the coverage estimate C*.
Table 2. Accuracy of estimators of the number of different haplotypes, measured by the root of the mean squared error and assessed by 100 simulations.
| Sample | N* 1 | N* 2 | N* 3 |
| 50 | 39.5 | 34.5 | 34.2 |
| 100 | 34 | 28.3 | 26.1 |
| 400 | 21.8 | 17.3 | 16.2 |
| 1600 | 8.7 | 6.4 | 10.3 |
| 50 | 159.65 | 143.76 | 131.87 |
| 100 | 148.38 | 123.47 | 105.3 |
| 400 | 100.58 | 71.44 | 54.09 |
| 1600 | 38.33 | 17.29 | 18.84 |
| 50 | 269.71 | 250.71 | 265.69 |
| 100 | 252.11 | 215.64 | 186.23 |
| 400 | 188.13 | 128.37 | 86.92 |
| 1600 | 73.58 | 23.75 | 33.31 |
Each simulation was carried out as in Table 1.
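The accuracy measure used in this table, the root of the mean squared error, can be sketched as follows. This is a minimal illustration only: the function name `rmse` and the four estimates are hypothetical and are not values from the paper, which used 100 simulations per setting.

```python
import math


def rmse(estimates, true_value):
    """Root of the mean squared error of repeated estimates of one target."""
    squared_errors = [(est - true_value) ** 2 for est in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical estimates from four simulation runs (NOT data from the paper).
hypothetical_estimates = [60, 70, 64, 74]
# Assumed target: the true number of haplotypes N in one simulated setting.
true_number = 66

print(round(rmse(hypothetical_estimates, true_number), 3))
```

A smaller value means that the estimator is, on average, closer to the true number of haplotypes; the measure penalises bias and variance together.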
Table 3. Haplotype estimates from several population datasets.
| pop | n | D | n.single | C | N* 1 | N* 2 | N* 3 |
| Andalucia | 50 | 39 | 33 | 0.34 | 115 | 254 | 541 |
| Basques | 171 | 68 | 46 | 0.59 | 114 | 195 | 313 |
| Catalonia | 118 | 79 | 67 | 0.32 | 248 | 398 | 620 |
| Galicia | 135 | 76 | 58 | 0.43 | 177 | 217 | 256 |
| Germany | 1314 | 462 | 309 | 0.59 | 772 | 1333 | 2142 |
| Icelandic | 396 | 111 | 59 | 0.77 | 142 | 210 | 283 |
| Mozambique | 306 | 111 | 72 | 0.63 | 174 | 295 | 462 |
| PortCent | 160 | 93 | 74 | 0.42 | 219 | 378 | 621 |
| PortNorth | 184 | 106 | 79 | 0.45 | 234 | 288 | 342 |
| PortSouth | 196 | 113 | 88 | 0.43 | 260 | 392 | 564 |
| Spain | 474 | 203 | 147 | 0.55 | 365 | 695 | 1241 |
| Portugal | 540 | 242 | 162 | 0.58 | 411 | 632 | 903 |
| Iberia | 1014 | 383 | 261 | 0.58 | 650 | 1018 | 1488 |
The first sample (Andalucia) consists of 50 mtDNA HVS-I profiles, of which 39 are different. There are 33 singletons, so the fraction of unseen haplotypes is estimated as 33/50 = 0.66 and the coverage as 1 − 0.66 = 0.34. N* 1, N* 2 and N* 3 are different estimates of the number of haplotypes, as explained in the text. The last three rows pool data from the preceding rows.
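The Andalucia calculation above (singletons divided by sample size gives the estimated unseen fraction, and one minus that gives the coverage) can be sketched in a few lines. The function name `coverage_estimate` and the haplotype labels are hypothetical; the sample is made up so that it has the same summary counts as the Andalucia row (n = 50, D = 39, 33 singletons) and is not the real data.

```python
from collections import Counter


def coverage_estimate(haplotypes):
    """Return (coverage, number of distinct haplotypes, number of singletons).

    Coverage is estimated as 1 - singletons / n, following the worked
    example in the table note.
    """
    counts = Counter(haplotypes)
    n = len(haplotypes)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1 - singletons / n, len(counts), singletons


# Made-up sample: 33 haplotypes seen once, five seen three times, one seen twice.
sample = [f"single_{i}" for i in range(33)]
sample += [f"rep_{i}" for i in range(5) for _ in range(3)]
sample += ["rep_5"] * 2

coverage, distinct, singletons = coverage_estimate(sample)
print(len(sample), distinct, singletons, round(coverage, 2))
```

Only the counts of how often each haplotype occurs matter for this estimate, so any labelling of the profiles (sequence strings, identifiers) works as input.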
Table 4. Summary of the results of the simulation part of Example 5.
| | n | naïve bound | fraction1 | median1 | fraction2 | median2 |
| Scenario 1 | 50 | 0.0196 | 0.87 | 0.01698 | 0.52 | 0.01858 |
| Scenario 2 | 50 | 0.0196 | 1.00 | 0.01187 | 0.52 | 0.01861 |
| Scenario 3 | 50 | 0.0196 | 1.00 | 0.00906 | 0.54 | 0.01861 |
| Scenario 1 | 100 | 0.0099 | 1.00 | 0.00718 | 0.52 | 0.00939 |
| Scenario 2 | 100 | 0.0099 | 1.00 | 0.00524 | 0.52 | 0.00940 |
| Scenario 3 | 100 | 0.0099 | 1.00 | 0.00418 | 0.52 | 0.00940 |
| Scenario 1 | 400 | 0.0025 | 1.00 | 0.00155 | 0.52 | 0.00237 |
| Scenario 2 | 400 | 0.0025 | 1.00 | 0.00119 | 0.52 | 0.00237 |
| Scenario 3 | 400 | 0.0025 | 1.00 | 0.00099 | 0.52 | 0.00237 |
The naïve bound, the estimate 1/(n+1), provides a classical alternative to the estimates given in columns ‘median1’ (based on unseen haplotypes generated from a different population) and ‘median2’ (based on singletons). Further details are provided in the text.
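The ‘naïve bound’ column follows directly from the formula 1/(n+1) stated in the note. A short check, with the hypothetical function name `naive_bound`:

```python
def naive_bound(n):
    """Classical frequency estimate 1/(n+1) for a haplotype unseen in n profiles."""
    return 1 / (n + 1)


# Sample sizes used in the table.
for n in (50, 100, 400):
    print(n, round(naive_bound(n), 4))
```

The bound depends only on the database size n, which is why it is identical across the three scenarios for a given n.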
Figure 1. Estimates of frequencies of unseen Iberian haplotypes.
The values were calculated following the PCA approach. The test set for the upper panel consists of those haplotypes of the Mozambique database that are unseen in the Iberian database. The singletons in the Iberian database are used as the test set in the lower panel.
Figure 2. The saturation curve for the Portuguese database (see Table 3) based on the Michaelis-Menten function.
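A saturation curve of this kind can be sketched as follows. The source only names the Michaelis-Menten function, so everything else here is an assumption: the parameterisation D(n) = a·n/(b + n), the parameter names `a` and `b`, the function names, and in particular the fitting method. The sketch uses the double-reciprocal linearisation 1/D = 1/a + (b/a)(1/n) with ordinary least squares, which is not necessarily how the authors fitted their curve. The data points are generated from hypothetical parameters, not taken from the Portuguese database.

```python
def michaelis_menten(n, a, b):
    """Expected number of distinct haplotypes at sample size n (assumed form)."""
    return a * n / (b + n)


def fit_michaelis_menten(sizes, distinct):
    """Fit a and b by least squares on the double-reciprocal form.

    1/D = 1/a + (b/a) * (1/n), so the intercept is 1/a and the slope is b/a.
    """
    xs = [1 / n for n in sizes]
    ys = [1 / d for d in distinct]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    a = 1 / intercept
    b = slope * a
    return a, b


# Hypothetical parameters and noise-free points (NOT data from the paper).
true_a, true_b = 400.0, 350.0
sizes = [50, 100, 200, 400]
distinct = [michaelis_menten(n, true_a, true_b) for n in sizes]

a_hat, b_hat = fit_michaelis_menten(sizes, distinct)
print(round(a_hat, 3), round(b_hat, 3))
```

Under this parameterisation the curve levels off at a as n grows, so a is the saturation level of the fitted curve. With noisy real data the linearisation weights small samples heavily, and a direct nonlinear least-squares fit would usually be the safer choice.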