Zhiyi Li, Xiaowei Wu, Bin He, Liqing Zhang.
Abstract
BACKGROUND: With the advance of next generation sequencing (NGS) technologies, a large number of insertion and deletion (indel) variants have been identified in human populations. Despite much research into variant calling, it has been found that a non-negligible proportion of the identified indel variants might be false positives due to sequencing errors, artifacts caused by ambiguous alignments, and annotation errors.
Year: 2014 PMID: 25407965 PMCID: PMC4245841 DOI: 10.1186/s12859-014-0359-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Examples of indel redundancy in dbSNP. (A) Two indels, both of insertion type, result in the same variant sequences with respect to the reference sequence. (B) Two indels, both of deletion type, result in the same variant sequences with respect to the reference sequence.
Figure 2. Histograms of adjacent-SNP distances and adjacent-indel distances (before redundancy filtration) on human chromosome 22. Histograms are plotted in probability densities, with blue color representing SNPs and red representing indels.
Algorithm for clustering indels into candidate redundant indel groups
| Step | Statement |
|---|---|
| Input | An indel List: |
| Output | Candidate redundant indel groups List |
| 1 | Candidate-Group-Generation (indel list: |
| 2 | Set List ( |
| 3 | Set |
| 4 | Set current indel |
| 5 | |
| 6 | |
| 7 | Add the next indel into the current candidate group |
| 8 | Set current indel |
| 9 | |
| 10 | Append |
| 11 | |
| 12 | |
| 13 | |
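
The clustering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that indels are sorted by position and that an indel joins the current candidate group when its distance to the previous indel is within a threshold. The `Indel` record, its field names, `candidate_groups`, and `max_gap` are hypothetical names introduced here.

```python
from typing import List, NamedTuple


class Indel(NamedTuple):
    """Hypothetical indel record; the field names are illustrative only."""
    pos: int     # 0-based offset on the reference
    kind: str    # "ins" or "del"
    allele: str  # inserted or deleted bases


def candidate_groups(indels: List[Indel], max_gap: int) -> List[List[Indel]]:
    """Group position-sorted indels whose adjacent distance is <= max_gap.

    Only groups with at least two members are returned, because a lone
    indel has no partner it could be redundant with.
    """
    groups: List[List[Indel]] = []
    current: List[Indel] = []
    for indel in sorted(indels, key=lambda i: i.pos):
        # A gap larger than the threshold closes the current group.
        if current and indel.pos - current[-1].pos > max_gap:
            if len(current) > 1:
                groups.append(current)
            current = []
        current.append(indel)
    # Flush the last open group.
    if len(current) > 1:
        groups.append(current)
    return groups
```

The sketch does not separate insertions from deletions. Because the pairwise check takes two indels of the same type, a caller would presumably filter by `kind` before or after grouping.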
|
Figure 3. A demonstration of how we check whether two indel variants are the same.
Algorithm for indel pair redundancy check by applying sliding window on reference genome
| Step | Statement |
|---|---|
| Input | Two candidate redundant indels: same type, allele information, reference genome sequence |
| Output | A pair indels |
| 1 | Set Redundancy = |
| 2 | Phase 1: template substring formation |
| 3 | Form template substring for insertion type |
| 4 | |
| 5 | |
| 6 | Phase 2: variant substring formation for insertion type |
| 7 | |
| 8 | Insert |
| 9 | Append |
| 10 | |
| 11 | Redundancy found: Redundancy = |
| 12 | |
| 13 | No Redundancy |
| 14 | Phase 3: variant substring formation for deletion type |
| 15 | |
| 16 | Cut |
| 17 | Cut |
| 18 | |
| 19 | Redundancy found: Redundancy = |
| 20 | |
| 21 | No Redundancy |
| 22 | |
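
The idea behind the pair check, that two indels of the same type are redundant when they produce the same variant sequence from the reference, can be sketched as below. This is an illustrative simplification rather than the published procedure: `ref` stands in for a reference substring spanning both indels, positions are assumed to be 0-based offsets into that substring, and `apply_indel` and `is_redundant` are hypothetical names.

```python
def apply_indel(ref: str, pos: int, kind: str, allele: str) -> str:
    """Return the variant sequence obtained by applying one indel to ref.

    For "ins" the allele is inserted before ref[pos]; for "del" the allele
    is cut out starting at ref[pos].
    """
    if kind == "ins":
        return ref[:pos] + allele + ref[pos:]
    if kind == "del":
        if ref[pos:pos + len(allele)] != allele:
            raise ValueError("deleted allele does not match the reference")
        return ref[:pos] + ref[pos + len(allele):]
    raise ValueError(f"unknown indel type: {kind}")


def is_redundant(ref: str, kind: str,
                 pos1: int, allele1: str,
                 pos2: int, allele2: str) -> bool:
    """True if two same-type indels yield identical variant sequences."""
    # Alleles of different length change the sequence length differently,
    # so the resulting variants cannot be equal.
    if len(allele1) != len(allele2):
        return False
    return (apply_indel(ref, pos1, kind, allele1)
            == apply_indel(ref, pos2, kind, allele2))
```

For example, with `ref = "GATCACACT"`, inserting `CA` at offset 3 and inserting `AC` at offset 4 both give `GATCACACACT`, so `is_redundant(ref, "ins", 3, "CA", 4, "AC")` is true. Likewise, deleting `CA` at offset 3 and deleting `AC` at offset 4 both give `GATCACT`.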
|
Figure 4. Fitting Pareto distribution to indel sizes for chromosome 22. Left panel: indel size histogram with fitted Pareto density function shown in red line; Right panel: QQ-plot (sample quantiles of indel sizes vs. quantiles of the fitted Pareto distribution).
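
The record does not state how the Pareto shape parameter was estimated. As one possibility, the sketch below uses the standard maximum-likelihood estimator for a Pareto (type I) distribution with a known scale, alpha = n / sum(ln(x_i / x_min)). The function name `pareto_shape_mle` and the default `x_min = 1` (the smallest possible indel size) are assumptions made for illustration. Indel sizes are integers, so a continuous Pareto fit is only an approximation.

```python
import math
from typing import Sequence


def pareto_shape_mle(sizes: Sequence[float], x_min: float = 1.0) -> float:
    """Maximum-likelihood shape of a Pareto (type I) law with known scale x_min."""
    if x_min <= 0 or any(s < x_min for s in sizes):
        raise ValueError("all sizes must be >= x_min > 0")
    # Sum of log-ratios; zero when the sample is empty or every size == x_min.
    log_sum = sum(math.log(s / x_min) for s in sizes)
    if log_sum == 0:
        raise ValueError("shape is undefined for an empty or constant sample")
    return len(sizes) / log_sum
```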
Figure 5. Fitting gamma distribution to adjacent-SNP distances and adjacent-indel distances for chromosome 22. (A) Fitting gamma distribution to adjacent-SNP distances. Left panel: fitted gamma density function shown in red, observed distribution in black; Right panel: adjacent-SNP distance QQ-plot (sample quantiles vs. quantiles of the fitted gamma distribution). (B) Fitting gamma distribution to adjacent-indel distances (after redundancy filtration). Left panel: adjacent-indel distance histogram with fitted gamma density function shown in red line; Right panel: adjacent-indel distance QQ-plot (sample quantiles vs. quantiles of the fitted gamma distribution).
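
A shape and a rate for a gamma distribution can be obtained from adjacent-variant distances in several ways, and the record does not say which the authors used. The sketch below uses the method of moments (shape = mean² / variance, rate = mean / variance) as a simple stand-in. The helper names `adjacent_distances` and `gamma_moments` are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def adjacent_distances(positions: Sequence[int]) -> List[int]:
    """Distances between consecutive variant positions after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def gamma_moments(distances: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments (shape, rate) estimates for a gamma distribution."""
    mean = fmean(distances)
    var = pvariance(distances)
    if var == 0:
        raise ValueError("variance is zero; gamma parameters are undefined")
    return mean * mean / var, mean / var
```

A maximum-likelihood fit would generally give somewhat different values, so these estimates are not expected to reproduce the published parameters.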
Figure 6. The percentage of redundant indels as a function of distance threshold for human chromosome 22. Orange column represents insertion-type indels; gray column represents deletion-type indels; yellow column represents total indels (insertion type + deletion type).
Figure 7. The percentage of redundant indels across human chromosomes 1–22. Blue line represents insertion-type indels; orange line represents deletion-type indels; gray line represents total indels (insertion type + deletion type).
Various statistics
| Chromosome | Number of indels | Total redundancy rate (%) (a) | Redundant insertions (%) (b) | Pareto shape, indel sizes (c) | Gamma shape, SNP distances (d) | Gamma rate, SNP distances (e) | Gamma shape, indel distances (f) | Gamma rate, indel distances (g) |
|---|---|---|---|---|---|---|---|---|
| 1 | 485270 | 9.63 | 47.99 | 1.44 | 0.83 | 0.016 | 0.39 | 7.55E-04 |
| 2 | 503609 | 9.57 | 48.18 | 1.43 | 0.86 | 0.016 | 0.4 | 7.57E-04 |
| 3 | 431820 | 9.38 | 47.57 | 1.47 | 0.89 | 0.017 | 0.4 | 8.03E-04 |
| 4 | 417942 | 9.98 | 48.62 | 1.47 | 0.89 | 0.017 | 0.39 | 7.82E-04 |
| 5 | 374774 | 9.49 | 47.06 | 1.46 | 0.88 | 0.017 | 0.4 | 7.53E-04 |
| 6 | 388741 | 10.16 | 48.14 | 1.43 | 0.84 | 0.017 | 0.39 | 7.99E-04 |
| 7 | 349376 | 9.68 | 48.38 | 1.43 | 0.86 | 0.017 | 0.4 | 7.96E-04 |
| 8 | 307212 | 9.55 | 48.56 | 1.47 | 0.88 | 0.018 | 0.4 | 7.72E-04 |
| 9 | 250802 | 9.29 | 49.25 | 1.45 | 0.82 | 0.016 | 0.4 | 7.37E-04 |
| 10 | 292264 | 9.58 | 48.23 | 1.42 | 0.85 | 0.017 | 0.39 | 7.82E-04 |
| 11 | 280673 | 9.6 | 47.66 | 1.46 | 0.87 | 0.017 | 0.4 | 7.72E-04 |
| 12 | 298606 | 9.65 | 48.73 | 1.45 | 0.87 | 0.017 | 0.4 | 8.16E-04 |
| 13 | 223181 | 10.31 | 46.21 | 1.39 | 0.9 | 0.017 | 0.38 | 7.95E-04 |
| 14 | 195779 | 9.8 | 49.23 | 1.46 | 0.88 | 0.017 | 0.4 | 7.94E-04 |
| 15 | 182417 | 9.77 | 48.87 | 1.44 | 0.85 | 0.016 | 0.4 | 7.86E-04 |
| 16 | 180020 | 9.41 | 50.29 | 1.39 | 0.81 | 0.018 | 0.4 | 8.09E-04 |
| 17 | 185888 | 10.03 | 51.8 | 1.4 | 0.83 | 0.016 | 0.4 | 8.40E-04 |
| 18 | 169830 | 10.02 | 48.61 | 1.41 | 0.89 | 0.017 | 0.4 | 8.07E-04 |
| 19 | 148904 | 9.93 | 53.31 | 1.39 | 0.81 | 0.018 | 0.41 | 9.69E-04 |
| 20 | 139927 | 10.54 | 51.3 | 1.4 | 0.88 | 0.018 | 0.39 | 8.16E-04 |
| 21 | 96577 | 9.3 | 49.52 | 1.45 | 0.85 | 0.018 | 0.4 | 9.83E-04 |
| 22 | 90621 | 10.31 | 50.46 | 1.33 | 0.83 | 0.018 | 0.39 | 8.77E-04 |
| Mean | 5994233 (h) | 9.77 | 49 | 1.43 | 0.86 | 0.017 | 0.4 | 8.09E-04 |
| STD | | 0.35 | 1.62 | 0.035 | 2.73E-02 | 7.61E-04 | 5.54E-03 | 6.21E-05 |
(a) The total redundancy rate on individual chromosomes. (b) The percentage of redundant insertions. (c) The shape parameter of the Pareto distributions fitted to the indel sizes (after redundancy filtration). (d, e) The shape and rate parameter estimates for the Gamma distributions fitted to the adjacent-SNP distances. (f, g) The shape and rate parameter estimates for the Gamma distributions fitted to the adjacent-indel distances. (h) Total number of indels studied.
Figure 8. Histogram of redundant indel sizes for human chromosome 22.