André Kahles, Fahad Sarqume, Peter Savolainen, Lars Arvestad.
Abstract
Genetic markers, defined as variable regions of DNA, can be used to distinguish individuals or populations. As long as markers are independent, it is easy to combine the information they provide. For nonrecombinant sequences like mtDNA, the markers are linked, so choosing the right set of markers for forensic applications is difficult and requires careful consideration. In particular, one wants to maximize the utility of the markers. Until now, this has mainly been done by hand. We propose an algorithm that finds the most informative subset of a set of markers. The algorithm uses a depth-first search combined with a branch-and-bound approach. Since the worst-case complexity is exponential, we also propose some data-reduction techniques and a heuristic. We implemented the algorithm and applied it to two forensic casework datasets using mitochondrial DNA, which resulted in marker sets with significantly improved haplotypic diversity compared to previous suggestions. Additionally, we evaluated the quality of the estimation with an artificial dataset of mtDNA. The heuristic is shown to provide extensive speedup at little cost in accuracy.
Year: 2013 PMID: 24244403 PMCID: PMC3820696 DOI: 10.1371/journal.pone.0079012
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Data representation.
Each polymorphic column is transformed into an integer vector. a) Multiple sequence alignment, with polymorphic columns (markers) in boldface. b) Haplocode representation of the markers. c) Haplocode representation of a combination of the markers.
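The transformation described in this caption can be illustrated with a minimal sketch. The function names `haplocode` and `combine`, the first-appearance numbering of states, and the toy alignment are assumptions made for illustration; they are not taken from the Excap implementation.

```python
from typing import Dict, Hashable, List, Sequence


def haplocode(column: Sequence[Hashable]) -> List[int]:
    """Map each distinct state in a column to an integer, numbered by first appearance."""
    codes: Dict[Hashable, int] = {}
    return [codes.setdefault(state, len(codes)) for state in column]


def combine(*markers: Sequence[int]) -> List[int]:
    """Haplocode of a marker combination: two individuals share a code
    only if they agree on every marker in the combination."""
    return haplocode(list(zip(*markers)))


# Hypothetical toy alignment: one sequence per individual.
alignment = ["ACGT", "ACGA", "TCGA", "TCGT"]
columns = list(zip(*alignment))

# Polymorphic columns are those with more than one distinct state.
polymorphic = [i for i, col in enumerate(columns) if len(set(col)) > 1]
codes = {i: haplocode(columns[i]) for i in polymorphic}
combined = combine(*codes.values())

print(polymorphic, codes, combined)
```

Representing a combination as the code of the per-individual tuples keeps single markers and marker sets in the same integer-vector form, so the same diversity computation can be applied to both.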
Figure 2. The Excap algorithm.
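The abstract describes the algorithm as a depth-first search combined with branch-and-bound. The following is a minimal sketch of that general technique, not a reproduction of Excap. It assumes that informativeness is measured by the standard haplotype diversity estimator h = n/(n-1) * (1 - sum of squared haplotype frequencies), and it uses the fact that adding a marker can only split haplotype classes, so the diversity of the current set plus all remaining candidates is an upper bound for every extension. The names `diversity` and `best_subset` and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def diversity(haplotypes: Sequence) -> float:
    """Haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))


def best_subset(markers: Sequence[Sequence[int]], k: int) -> Tuple[float, List[int]]:
    """Depth-first search over marker subsets of size at most k, with pruning."""
    n = len(markers[0])
    best_score, best_set = -1.0, []

    def haplotypes(indices: List[int]) -> List[tuple]:
        return [tuple(markers[j][i] for j in indices) for i in range(n)]

    def search(start: int, chosen: List[int]) -> None:
        nonlocal best_score, best_set
        score = diversity(haplotypes(chosen)) if chosen else 0.0
        if score > best_score:
            best_score, best_set = score, list(chosen)
        if len(chosen) == k:
            return
        # Bound: no extension of `chosen` can beat the diversity obtained
        # by adding every remaining candidate marker at once.
        remaining = list(range(start, len(markers)))
        if diversity(haplotypes(chosen + remaining)) <= best_score:
            return  # prune this branch
        for j in remaining:
            search(j + 1, chosen + [j])

    search(0, [])
    return best_score, best_set


# Hypothetical haplocoded markers for four individuals.
markers = [[0, 0, 1, 1], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
score, subset = best_subset(markers, 2)
print(score, subset)
```

The bound is cheap to state but loose; the worst case remains exponential in the number of markers, which is consistent with the abstract's motivation for data reduction and a heuristic.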
Figure 3. Influence of sample size.
The estimated heterozygosities and the corresponding standard deviations for sample sizes 10, 50, 100 and 1000.
Correlation between sample size and quality of the estimated values.
| Region | True | Estimate (n = 10) | Estimate (n = 50) | Estimate (n = 100) | Estimate (n = 1000) |
| 1 | 0.18 | 0.19 (0.18) | 0.18 (0.07) | 0.18 (0.05) | 0.18 (0.02) |
| 2 | 0.20 | 0.19 (0.18) | 0.19 (0.09) | 0.20 (0.05) | 0.20 (0.02) |
| 3 | 0.22 | 0.20 (0.16) | 0.23 (0.08) | 0.22 (0.06) | 0.22 (0.02) |
| 4 | 0.25 | 0.26 (0.17) | 0.24 (0.08) | 0.23 (0.05) | 0.25 (0.02) |
| 5 | 0.28 | 0.28 (0.21) | 0.27 (0.07) | 0.29 (0.07) | 0.28 (0.02) |
| 6 | 0.33 | 0.31 (0.20) | 0.33 (0.09) | 0.33 (0.06) | 0.33 (0.02) |
| 7 | 0.39 | 0.39 (0.20) | 0.41 (0.09) | 0.41 (0.06) | 0.39 (0.02) |
| 8 | 0.49 | 0.47 (0.21) | 0.51 (0.09) | 0.48 (0.07) | 0.48 (0.02) |
| 9 | 0.64 | 0.63 (0.21) | 0.62 (0.07) | 0.63 (0.06) | 0.63 (0.01) |
| 10 | 0.86 | 0.88 (0.10) | 0.86 (0.06) | 0.86 (0.03) | 0.86 (0.01) |
The second column contains the true values for the population. The remaining columns contain the values estimated from samples of 10, 50, 100 and 1000 individuals using the Excap algorithm. The values in parentheses are the standard deviations of the estimated values.
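The effect of sample size on the spread of the estimates can be reproduced in miniature. This sketch assumes the standard haplotype diversity estimator and repeated sampling without replacement from a hypothetical population of three haplotypes; the sampling scheme, the population, and the function names are illustrative assumptions, not the procedure or data behind the table.

```python
import random
import statistics
from collections import Counter
from typing import Sequence, Tuple


def diversity(haplotypes: Sequence) -> float:
    """Haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))


def estimate(population: Sequence, sample_size: int,
             repeats: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of the diversity over repeated random samples."""
    rng = random.Random(seed)
    values = [diversity(rng.sample(list(population), sample_size))
              for _ in range(repeats)]
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical population: three haplotypes with frequencies 0.7, 0.2 and 0.1.
population = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
true_value = diversity(population)

results = {size: estimate(population, size) for size in (10, 50, 100)}
for size, (mean, sd) in results.items():
    print(f"n={size}: mean={mean:.2f} sd={sd:.2f} (true={true_value:.2f})")
```

As in the table, the mean of the estimates should stay close to the population value while the standard deviation shrinks as the sample grows.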
Comparison of multiplexed SNP panels I.
| Panel | Group | Group size | Achieved | Panel size | Diff | Improvement (%) |
| A | H:1 | 32 | 0.90 (0.89) | 11 (11) | 0.01 | |
| B | H:2, H:3, H:6 | 48 | 0.91 (0.90) | 11 (11) | 0.01 | |
| C | V:1, H:5 | 38 | 0.84 (0.81) | 10 | 0.03 | |
| D | J:1, J:2, K:2, K:3 | 38 | 0.89 (0.78) | 10 (10) | 0.11 | |
| E | J:4, T:2, T:3, H:4 | 35 | 0.87 (0.82) | 7 (7) | 0.05 | |
| F | V:1, H:1, H:2, H:3 | 93 | 0.91 (0.45) | 10 (10) | 0.46 | |
| G | J:1, J:3, T:1 | 50 | 0.86 (0.74) | 11 (11) | 0.12 | |
| H | K:1 | 15 | 0.87 (0.87) | 6 | 0.00 | |
The values in parentheses are the results from Coble et al.
This is the best possible result for the given input data: all individuals could be singled out.
The polymorphic positions combined by the algorithm were limited to the 59 SNPs that were part of the multiplexed panels presented by Coble et al. The size of a haplogroup refers to the number of sequences in it.
Improved panels for HV subtypes.
| A | B | C | D | E | F | G | H |
| H:1 | H:2, H:3, H:6 | V:1, H:5 | J:1, J:2, K:2, K:3 | J:4, T:2, T:3, H:4 | V:1, H:1, H:2, H:3 | J:1, J:3, T:1 | K:1 |
|
| 477 (0.12) | 72 (0.50) | 482 (0.24) |
|
|
| 64 (0.14) |
| 477 (0.18) | 3010 (0.29) | 513 (0.11) |
| 4808 (0.11) |
|
| 4688 (0.14) |
| 3010 (0.47) | 3915 (0.12) | 4580 (0.47) | 5198 (0.15) | 5147 (0.11) |
| 3826 (0.08) | 11377 (0.44) |
| 4793 (0.32) |
| 5250 (0.11) | 6260 (0.15) | 9380 (0.11) |
| 3834 (0.08) | 13293 (0.26) |
| 5004 (0.12) | 5004 (0.29) |
|
| 9899 (0.40) |
|
| 14305 (0.44) |
| 7202 (0.12) | 6776 (0.39) | 11719 (0.11) | 9548 (0.15) | 15067 (0.21) |
| 6293 (0.15) | 16519 (0.14) |
| 10211 (0.18) | 8592 (0.16) | 12438 (0.15) | 11485 (0.35) | 16519 (0.26) | 10211 (0.08) | 7891 (0.08) | |
|
| 10394 (0.23) | 14770 (0.15) |
| 10394 (0.14) | 11533 (0.08) | ||
| 12858 (0.12) | 10754 (0.12) | 15833 (0.32) | 15355 (0.24) |
| 12795 (0.04) | ||
| 14470 (0.12) |
| 16519 (0.41) |
|
| 15043 (0.08) | ||
| 16519 (0.23) | 16519 (0.37) | 16519 (0.42) |
The modified panels A through H from Table 2, as suggested by Excap. For each haplogroup, the suggested SNPs are listed with their corresponding values (when analyzed independently). Underlined SNPs are those introduced by Excap; the lower part of the table contains SNPs that were suggested by Coble et al. but excluded by Excap. The data show that some SNPs have a high individual value but are not informative in combination with other SNPs. For example, in panel H, SNP 12795 has the fourth-highest value but does not contribute at all to a better combined value.
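Why a marker with a high individual value can add nothing in combination is easy to see on a toy example. The sketch below assumes the standard haplotype diversity estimator; the three markers are hypothetical and unrelated to the SNPs in the table.

```python
from collections import Counter
from typing import Sequence


def diversity(haplotypes: Sequence) -> float:
    """Haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))


# Hypothetical haplocoded markers for six individuals.
marker_a = [0, 0, 1, 1, 2, 2]
marker_b = [0, 0, 1, 1, 2, 2]  # splits the individuals exactly as marker_a does
marker_c = [0, 1, 0, 1, 0, 1]  # less diverse alone, but splits differently

h_a = diversity(marker_a)
h_b = diversity(marker_b)
h_c = diversity(marker_c)
h_ab = diversity(list(zip(marker_a, marker_b)))  # redundant pair
h_ac = diversity(list(zip(marker_a, marker_c)))  # complementary pair

print(h_a, h_b, h_c, h_ab, h_ac)
```

Here `marker_b` scores higher than `marker_c` on its own, yet pairing it with `marker_a` leaves the diversity unchanged, while the pair with `marker_c` separates all six individuals. For linked markers, ranking by individual value is therefore not a substitute for evaluating combinations.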
Comparison of multiplexed SNP panels II.
| Panel | Achieved | Size | Diff | Improvement (%) |
| A | 0.93 (0.89) | 11 | 0.04 | |
| B | 0.92 (0.90) | 11 | 0.02 | |
| C | 0.90 (0.81) | 11 | 0.09 | |
| D | 0.92 (0.78) | 10 | 0.15 | |
| E | 0.87 (0.82) | 7 | 0.05 | |
| F | 0.91 (0.45) | 10 | 0.46 | |
| G | 0.87 (0.74) | 11 | 0.13 | |
| H | 0.90 (0.87) | 7 | 0.03 | |
The values in parentheses are the results from Coble et al.
For these calculations, HV1, HV2, the poly AC region and all positions related to diseases were excluded.
Figure 4. Estimated heterozygosities.
The more markers a combination contains, the less effect the width has on the overall information of the combination.
Figure 5. Marker set size and .
The more markers a combination contains and the more variable the combinations become, the more stable the estimates remain, even under a stronger heuristic.