Matteo Chiara, Graziano Pesole, David S Horner.
Abstract
Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra-high-throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. In this study, we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads.
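
The insert-size signal mentioned in the abstract can be illustrated with a minimal sketch. This is not the SVM2 implementation: it only shows the general idea that read pairs spanning a deletion map farther apart on the reference than the library insert size predicts, while pairs spanning an insertion map closer together. The function names, the z-score statistic and the threshold of 3 are assumptions made for illustration, and the supervised learning step described in the abstract is not reproduced here.

```python
import math
from statistics import mean


def insert_size_z(observed, lib_mean, lib_sd):
    """Z-score of the mean mapped insert size at a locus.

    observed: mapped insert sizes (bp) of read pairs spanning the locus.
    lib_mean, lib_sd: expected library insert size and its standard deviation.
    All names here are hypothetical, not taken from the paper.
    """
    if not observed:
        raise ValueError("no read pairs span this locus")
    standard_error = lib_sd / math.sqrt(len(observed))
    return (mean(observed) - lib_mean) / standard_error


def call_from_z(z, threshold=3.0):
    """Naive call from the z-score; the threshold is an arbitrary assumption."""
    if z > threshold:
        return "deletion"   # pairs map farther apart than expected
    if z < -threshold:
        return "insertion"  # pairs map closer together than expected
    return "no call"


# Ten pairs mapping 30 bp wider than a 200 +/- 20 bp library.
z = insert_size_z([230] * 10, lib_mean=200, lib_sd=20)
print(call_from_z(z))
```

Because the standard error shrinks with the number of spanning pairs, a statistic of this kind has little power for events much shorter than the library standard deviation unless coverage is high.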
Year: 2012 PMID: 22735696 PMCID: PMC3467043 DOI: 10.1093/nar/gks606
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Simulation
| Size (bp) | Recall (%) (a) | Recall as any (%) (b) | FP rate (%) |
|---|---|---|---|
| Deletions | | | |
| 1–5 | 69 | 83 | 8.5 |
| 6–10 | 82 | 89 | 6.3 |
| 11–20 | 91 | 92 | 2 |
| 21–40 | 94 | 94 | 0 |
| 41–60 | 97 | 97 | 0 |
| 61–100 | 95 | 95 | 0 |
| 101–200 | 97 | 97 | 0 |
| >200 | 97 | 97 | 0 |
| Insertions | | | |
| 1–5 | 70 | 86 | 9 |
| 6–10 | 82 | 88 | 6 |
| 11–20 | 94 | 94 | 2 |
| 21–40 | 92 | 92 | 0.5 |
| 41–60 | 93 | 93 | 0 |
| 61–100 | 91 | 91 | 0 |
| 101–200 | 89 | 89 | 0 |
| >200 | 86 | 86 | 0 |
a. Correctly classified as insertion or deletion.
b. Correctly identified locus; includes IndIndel and hypervariable predictions.
Simulations of heterozygous events
| Size (bp) (a) | Recall rate (%) (b) | Correctly classified (%) (c) | Recall rate if homozygous (%) (d) |
|---|---|---|---|
| Deletions | | | |
| 1 | 10 | 0 | 83 |
| 3 | 10 | 0 | 87 |
| 5 | 13 | 15 | 94 |
| 10 | 40 | 20 | 98 |
| 15 | 53 | 29 | 99 |
| 20 | 63 | 45 | 99 |
| 30 | 85 | 87.5 | 99 |
| 40 | 87.5 | 93.5 | 99 |
| Insertions | | | |
| 1 | 10 | 0 | 80 |
| 3 | 10 | 0 | 86 |
| 5 | 17 | 3 | 94 |
| 10 | 28 | 14 | 99 |
| 15 | 48 | 32 | 99 |
| 20 | 57 | 47 | 99 |
| 30 | 81 | 89 | 99 |
| 40 | 88 | 96 | 99 |
a. Size of the event.
b. Recall rate for the heterozygous case.
c. Proportion of recalled indels classified as heterozygous.
d. Recall rates for equivalent (same locus) homozygous indels.
Figure 1. Sensitivity and specificity of different methods with the Kidd et al. data set. (A) Number of indels from the Kidd et al. data set (binned by size of event in bp) recalled by each method. (B) Proportion of predicted indels (binned by predicted sizes) that are validated by an indel in the Kidd et al. validation set. Size bins: size ≤ 1, size ≤ 2, size ≤ 3, size ≤ 4, 5 ≤ size ≤ 10, 10 < size ≤ 20, 20 < size ≤ 30 and size > 30.
Figure 2. Venn diagram showing intersection between validated (by Kidd et al.) predictions by each method.
Classification accuracy of short indels predicted by SVM2
| SVM2 class (a) | Total (b) | Kidd insertions (c) | Kidd deletions (c) | Validation rate (d) | Misclassification rate (%) (e) |
|---|---|---|---|---|---|
| Insertions | 50 688 | 11 068 | 2288 | 13 356 (26.3%) | 17.1 |
| Deletions | 46 102 | 3991 | 8111 | 12 102 (26.2%) | 32 |
| IndIndels | 9118 | 1308 | 1049 | 2357 (25.8%) | |
| Hypervariable | 8503 | 1268 | 982 | 2250 (26.4%) | |
a. Class predicted by SVM2.
b. Number of predictions by SVM2 by category.
c. Class of the validating event in the Kidd et al. data set.
d. Validation rate for each category of SVM2 predictions (applies to small events only).
e. Misclassification rate for validated insertions and deletions.
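
The two rate columns of the table above appear to follow from its count columns, and the arithmetic can be checked with a short sketch. The function and argument names are hypothetical, and the definitions used (validated predictions over total predictions, and validated events of the opposite class over all validated events) are inferred from the footnotes rather than stated by the authors.

```python
def validation_rate(validated_ins, validated_del, total_predictions):
    """Percentage of SVM2 predictions matched by any Kidd et al. event."""
    return 100.0 * (validated_ins + validated_del) / total_predictions


def misclassification_rate(validated_same_class, validated_other_class):
    """Percentage of validated predictions whose validating event has the other class."""
    validated = validated_same_class + validated_other_class
    return 100.0 * validated_other_class / validated


# Insertions row: 50 688 predictions, validated by 11 068 insertions and 2288 deletions.
print(round(validation_rate(11068, 2288, 50688), 1))   # 26.3
print(round(misclassification_rate(11068, 2288), 1))   # 17.1
```

Applied to the Deletions row (8111 validating deletions, 3991 validating insertions), the same formula gives about 33.0, slightly above the tabulated 32. The difference may reflect truncation or a slightly different definition in the source.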
Figure 3. Sensitivity by size and genomic context. Fraction of events in the Kidd et al. data set, in different genomic contexts (tDNA = DNA transposon, LTR = long terminal repeats, NR = non-repetitive), recalled at different size ranges [size ≤ 5 (A), 5 < size ≤ 10 (B), 10 < size ≤ 20 (C), size > 20 (D)] by different methods.