Mohammad Shabbir Hasan, Xiaowei Wu, Layne T Watson, Liqing Zhang.
Abstract
Storing biologically equivalent indels as distinct entries in databases causes data redundancy and misleads downstream analysis. A unified system for identifying and representing equivalent indels is therefore desirable, both to remove this redundancy and to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility that creates a universal positioning system for indels, so that equivalent indels are uniquely determined by their coordinates in the new system; the same coordinates can also be used to compare different indel calling results. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Compared with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants, UPS-indel identifies 456,352 more redundant indels in dbSNP, 2,118 more in COSMIC coding, and 553 more in the COSMIC noncoding indel dataset, in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels and is thus exhaustive.
Year: 2017 PMID: 29074871 PMCID: PMC5658412 DOI: 10.1038/s41598-017-14400-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
An example of equivalent indels.
| Equivalent insertions | | Equivalent deletions | |
|---|---|---|---|
| Reference | GTCTA | Reference | ACTGTTGTG |
| Case 1 | G[TC/+]TCTA | Case 1 | AC[TGT/−]TGTG |
| Case 2 | GT[CT/+]CTA | Case 2 | ACT[GTT/−]GTG |
| Case 3 | GTC[TC/+]TA | Case 3 | ACTG[TTG/−]TG |
| Case 4 | GTCT[CT/+]A | Case 4 | ACTGT[TGT/−]G |
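The equivalence shown in the table can be checked mechanically: applying each case to its reference gives the same sequence. The following is a minimal Python sketch; the helper names `apply_insertion` and `apply_deletion` are hypothetical, and the 0-based offsets are read off the bracket positions in the table.

```python
def apply_insertion(reference: str, offset: int, inserted: str) -> str:
    """Insert `inserted` in front of the 0-based `offset` of `reference`."""
    return reference[:offset] + inserted + reference[offset:]


def apply_deletion(reference: str, offset: int, length: int) -> str:
    """Delete `length` bases starting at the 0-based `offset`."""
    return reference[:offset] + reference[offset + length:]


# Cases 1-4 of the insertion column: (offset, inserted bases).
insertions = [(1, "TC"), (2, "CT"), (3, "TC"), (4, "CT")]
insertion_results = {apply_insertion("GTCTA", off, seq) for off, seq in insertions}

# Cases 1-4 of the deletion column: (offset, number of deleted bases).
deletions = [(2, 3), (3, 3), (4, 3), (5, 3)]
deletion_results = {apply_deletion("ACTGTTGTG", off, n) for off, n in deletions}

print(insertion_results)  # {'GTCTCTA'}
print(deletion_results)   # {'ACTGTG'}
```

Each set collapses to a single sequence, which is what makes the four entries in each column equivalent.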
Figure 1. Illustration of two cases of Theorem 1. (A) |d| > |S|, (B) |d| < |S|.
UPS-indel algorithm.
| UPS-indel(list_of_indels_in_VCF_file, reference_sequence) |
| { |
| For each indel in the list |
| 1. Extract REF allele and ALT allele from VCF file |
| 2. pattern ← diff(REF, ALT) |
| 3. indel ← pattern |
| 4. eq_indel ← getCyclicPermutationFromLeft(indel) |
| 5. pos ← position of indel according to the VCF file |
| 6. position ← pos |
| 7. upperBound ← position |
| 8. str ← reference_sequence (position + 1) |
| 9. while((indel + str) == (str + eq_indel)) |
| 10. indel ← eq_indel |
| 11. upperBound++ |
| 12. str ← reference_sequence (upperBound + 1) |
| 13. eq_indel ← getCyclicPermutationFromLeft(indel) |
| 14. End while |
| 15. indel ← pattern |
| 16. eq_indel ← getCyclicPermutationFromRight(indel) |
| 17. position ← pos |
| 18. lowerBound ← position |
| 19. str ← reference_sequence (position - 1) |
| 20. while((str + indel) == (eq_indel + str)) |
| 21. indel ← eq_indel |
| 22. lowerBound-- |
| 23. str ← reference_sequence (lowerBound - 1) |
| 24. eq_indel ← getCyclicPermutationFromRight(indel) |
| 25. End while |
| 26. if (pattern is an insertion) |
| 27. UPS-coordinate ← + pattern[lowerBound, upperBound] |
| 28. else //pattern is a deletion |
| 29. UPS-coordinate ← –pattern[lowerBound, upperBound] |
| 30. End for |
| } |
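The two while loops above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `ups_coordinate`, the 1-based `start` convention (the first deleted base for a deletion, the reference base immediately to the right of the insertion point for an insertion), the bounds check at the ends of the reference, and reporting the pattern as it reads at the lower bound are all assumptions inferred from the UVCF examples rather than stated in the pseudocode.

```python
def rotate_left(s: str) -> str:
    """Cyclic permutation from the left: move the first base to the end."""
    return s[1:] + s[:1]


def rotate_right(s: str) -> str:
    """Cyclic permutation from the right: move the last base to the front."""
    return s[-1:] + s[:-1]


def ups_coordinate(reference: str, start: int, pattern: str, is_insertion: bool):
    """Return (sign, pattern_at_lower_bound, lower_bound, upper_bound)."""

    def base(i: int) -> str:
        # 1-based lookup; empty string outside the reference (assumption).
        return reference[i - 1] if 1 <= i <= len(reference) else ""

    # Base that follows the indel: the base at the insertion point for an
    # insertion, the base just after the deleted segment for a deletion.
    span = 0 if is_insertion else len(pattern)

    # Shift right while the shifted indel yields the same sequence.
    indel, upper = pattern, start
    nxt = base(upper + span)
    while nxt and indel + nxt == nxt + rotate_left(indel):
        indel = rotate_left(indel)
        upper += 1
        nxt = base(upper + span)

    # Shift left from the original position in the same way.
    indel, lower = pattern, start
    prev = base(lower - 1)
    while prev and prev + indel == rotate_right(indel) + prev:
        indel = rotate_right(indel)
        lower -= 1
        prev = base(lower - 1)

    return ("+" if is_insertion else "-", indel, lower, upper)


# Toy reference (not real genomic data): a run of eight G at positions 4-11.
toy = "ACT" + "G" * 8 + "TCA"
print(ups_coordinate(toy, 4, "G", True))   # ('+', 'G', 4, 12)
print(ups_coordinate(toy, 5, "G", True))   # ('+', 'G', 4, 12)
print(ups_coordinate(toy, 7, "G", False))  # ('-', 'G', 4, 11)
```

In this sketch, two representations of the same insertion (starting at position 4 or 5) receive the same coordinate, which is the property the positioning system relies on.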
Figure 2. Different utilities of UPS-indel. (A) UVCF format, (B) redundant indel list, and (C) comparing two UVCF files.
UVCF file for redundant indels.
| #CHRM | POS | ID | REF | ALT | QUAL | FILTER | UPS-COORDINATE |
|---|---|---|---|---|---|---|---|
| 1 | 10009638 | rs34748242 | T | TTG | . | . | + TG[10009639 – 10009648] |
| 1 | 10009639 | rs59148039 | T | TGT | . | . | + TG[10009639 – 10009648] |
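Once every record carries a UPS-coordinate, listing redundant indels reduces to grouping records by that coordinate. A minimal sketch, assuming the grouping key is the pair (chromosome, UPS-coordinate) and that the coordinate string has been normalized by removing whitespace and using an ASCII hyphen; `redundant_groups` is a hypothetical helper name, not part of UPS-indel.

```python
from collections import defaultdict

# (chrom, pos, id, ref, alt, ups_coordinate), copied from the two UVCF rows
# above with the coordinate string normalized as described in the lead-in.
uvcf_rows = [
    ("1", 10009638, "rs34748242", "T", "TTG", "+TG[10009639-10009648]"),
    ("1", 10009639, "rs59148039", "T", "TGT", "+TG[10009639-10009648]"),
]


def redundant_groups(rows):
    """Map (chrom, ups_coordinate) to the IDs sharing it, keeping groups of 2+."""
    groups = defaultdict(list)
    for chrom, _pos, record_id, _ref, _alt, ups in rows:
        groups[(chrom, ups)].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


groups = redundant_groups(uvcf_rows)
print(groups)
# {('1', '+TG[10009639-10009648]'): ['rs34748242', 'rs59148039']}
```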
Figure 3. Main user interface of UPS-indel.
An example explaining why considering only normalized position does not suffice for identifying redundant indels for vt normalize and BCFtools.
| #CHRM | POS | ID | REF | ALT |
|---|---|---|---|---|
| Before normalization | | | | |
| 1 | 39549110 | rs371246544 | AT | ACATAC |
| 1 | 39549111 | rs71724031 | T | TAC |
| After normalization | | | | |
| 1 | 39549111 | rs371246544 | T | CATAC |
| 1 | 39549111 | rs71724031 | T | TAC |
Figure 4. Comparison among the tools based on redundant indel ratio for the dbSNP dataset.
Figure 5. Venn diagram comparing the number of redundant indels detected by UPS-indel and other tools. (Venn diagrams are generated using the R package VennDiagram [25].)
Example of multiallelic insertion type indels missed by other tools but detected as redundant by UPS-indel.
| (A) VCF Entry | | | | | |
|---|---|---|---|---|---|
| #CHRM | POS | ID | REF | ALT | |
| 1 | 724188 | rs60022176 | A | AATGGA, AATGGAATGGAATGGA, AATGGAATGGG | |
| (B) UVCF entries, one per ALT allele | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 724188 | rs60022176 | A | AATGGA | + AATGG[724138 − 724189] |
| 1 | 724188 | rs60022176 | A | AATGGAATGGAATGGA | + AATGGAATGGAATGG[724138 − 724189] |
| 1 | 724188 | rs60022176 | A | AATGGAATGGG | + ATGGAATGGG[724189 − 724189] |
| (C) Entries with identical UPS-coordinates | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 724137 | rs374587598 | T | TAATGG | + AATGG[724138 − 724189] |
| 1 | 724188 | rs60022176 | A | AATGGA | + AATGG[724138 − 724189] |
Example of multiallelic deletion type indels missed by other tools but detected as redundant by UPS-indel.
| VCF Entry | | | | | |
|---|---|---|---|---|---|
| #CHRM | POS | ID | REF | ALT | |
| 1 | 7552657 | rs376707888 | GTG | G, GTGCA | |
| UVCF entries, one per ALT allele | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 7552657 | rs376707888 | GTG | G | −GT[7552657 − 7552658] |
| 1 | 7552657 | rs376707888 | GTG | GTGCA | + CA[7552658 − 7552658] |
| Entries with identical UPS-coordinates | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 7552656 | rs139294420 | CGT | C | −GT[7552657 − 7552658] |
| 1 | 7552657 | rs376707888 | GTG | G | −GT[7552657 − 7552658] |
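Both multiallelic examples treat each ALT allele as a separate indel before assigning coordinates. The sketch below shows one way such a decomposition could look. The source does not define `diff(REF, ALT)`, so the prefix-based version here is an assumption, and `decompose` is a hypothetical helper. Positions are deliberately left out because the source does not spell out how they are derived for records with a multi-base REF.

```python
def diff(ref: str, alt: str):
    """Return ('+', inserted bases) or ('-', deleted bases) for a simple indel.

    Assumes one allele is a proper prefix of the other; anything else
    (for example an MNP or a clumped indel) is rejected.
    """
    if len(alt) > len(ref) and alt.startswith(ref):
        return "+", alt[len(ref):]
    if len(ref) > len(alt) and ref.startswith(alt):
        return "-", ref[len(alt):]
    raise ValueError(f"not a simple indel: {ref} -> {alt}")


def decompose(ref: str, alt_field: str):
    """Split a comma-separated ALT field and diff each allele against REF."""
    return [diff(ref, alt.strip()) for alt in alt_field.split(",")]


print(decompose("GTG", "G, GTGCA"))
# [('-', 'TG'), ('+', 'CA')]
print(decompose("A", "AATGGA, AATGGAATGGAATGGA, AATGGAATGGG"))
# [('+', 'ATGGA'), ('+', 'ATGGAATGGAATGGA'), ('+', 'ATGGAATGGG')]
```

The deletion pattern `TG` is a cyclic permutation of the `GT` reported in the UPS-coordinate, and `ATGGA` is a cyclic permutation of `AATGG`; the coordinates in the tables show the pattern as it reads at the lower bound.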
Example of a multiallelic indel that is normalized by vt normalize and BCFtools but not by GATK LeftAlignAndTrimVariants.
| VCF Entry for dbSNP | | | | |
|---|---|---|---|---|
| #CHRM | POS | ID | REF | ALT |
| 1 | 823905 | rs397728418 | AA | A, AAA |
| After GATK LeftAlignAndTrimVariants (unchanged) | | | | |
| #CHRM | POS | ID | REF | ALT |
| 1 | 823905 | rs397728418 | AA | A, AAA |
| After vt normalize and BCFtools | | | | |
| #CHRM | POS | ID | REF | ALT |
| 1 | 823903 | rs397728418 | GA | G, GAA |
Example of a complex variant that is missed by vt normalize but detected as redundant by UPS-indel.
| VCF Entry (A) | | | | | |
|---|---|---|---|---|---|
| #CHRM | POS | ID | REF | ALT | |
| 1 | 2273131 | rs369694942 | GAAA | G | |
| 1 | 2273140 | rs373243812 | AAAAA | AG | |
| UVCF Entry (B) | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 2273131 | rs369694942 | GAAA | G | −AAA[2273132 - 2273147] |
| 1 | 2273140 | rs373243812 | AAAAA | AG | −AAA[2273132 - 2273147] |
An example of the two types of complex variant.
| #CHRM | POS | ID | REF | ALT | Type |
|---|---|---|---|---|---|
| 1 | 565003 | rs386627398 | ATAT | AAAC | MNP |
| 1 | 884423 | rs386627415 | AGCA | AACAACAGCAAAG | Clumped Indel |
Figure 6. Comparison of redundant indel ratio for (A) COSMIC coding and (B) COSMIC noncoding indels.
Figure 7. Venn diagram comparing the number of redundant indels detected by UPS-indel and other tools in (A) COSMIC coding and (B) COSMIC noncoding indel datasets.
Example of COSMIC indels that are missed by other tools but detected as redundant by UPS-indel.
| VCF Entry for COSMIC | | | | | |
|---|---|---|---|---|---|
| #CHRM | POS | ID | REF | ALT | |
| 1 | 150917623 | COSM5068028 | TG | TGG | |
| 1 | 150917623 | COSM3732389 | T | TG | |
| 1 | 150917623 | COSM3685916 | TG | T | |
| 1 | 150917624 | COSM5348791 | G | GG | |
| After normalization by the other tools | | | | | |
| #CHRM | POS | ID | REF | ALT | |
| 1 | 150917623 | COSM5068028 | TG | TGG | |
| 1 | 150917623 | COSM3732389 | T | TG | |
| 1 | 150917623 | COSM3685916 | TG | T | |
| 1 | 150917623 | COSM5348791 | T | TG | |
| UVCF entries from UPS-indel | | | | | |
| #CHRM | POS | ID | REF | ALT | UPS-COORDINATE |
| 1 | 150917623 | COSM5068028 | TG | TGG | + G[150917624 - 150917632] |
| 1 | 150917623 | COSM3732389 | T | TG | + G[150917624 - 150917632] |
| 1 | 150917623 | COSM3685916 | TG | T | −G[150917624 - 150917631] |
| 1 | 150917624 | COSM5348791 | G | GG | + G[150917624 - 150917632] |
Comparison between VarMatch, RTG Tools, READDI, and UPS-indel based on the number of true positives found between the baseline and query call sets from chromosome 11 of an individual.
| Variant Caller Name | VarMatch (EVQ mode) | RTG Tools | READDI | UPS-indel |
|---|---|---|---|---|
| Dindel | 8,933 | 8,933 | 8,796 | 8,973 |
| GATK Haplotype Caller | 11,113 | 11,113 | 10,954 | 11,129 |
| GATK Unified Genotyper | 7,734 | 7,734 | 7,563 | 7,734 |
| Pindel | 6,507 | 9,893 | 9,524 | 9,836 |
A combination of variants identified by RTG Tools but missed by UPS-indel.
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Reference | T | G | C | C | G | C | T | A |
| Deletion D1: delete CC at positions 3 and 4 | TGGCTA | | | | | | | |
| Deletion D2: delete C at position 6 | TGCCGTA | | | | | | | |
| Deletion D1D2 (combination of D1 and D2): delete CC at positions 3 and 4, and delete C at position 6 | TGGTA | | | | | | | |
| Deletion D3: GCCGC → GG starting at position 2 | TGGTA | | | | | | | |
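The rows of this table can be verified directly: applying D1 and D2 together gives the same sequence as applying D3 alone, even though no single variant in one set matches a single variant in the other. A minimal sketch with hypothetical helpers `delete_positions` and `substitute`, using the 1-based positions from the table:

```python
reference = "TGCCGCTA"


def delete_positions(seq: str, positions) -> str:
    """Remove the bases at the given 1-based positions."""
    drop = set(positions)
    return "".join(b for i, b in enumerate(seq, start=1) if i not in drop)


def substitute(seq: str, start: int, old: str, new: str) -> str:
    """Replace `old`, which must begin at 1-based `start`, with `new`."""
    i = start - 1
    if seq[i:i + len(old)] != old:
        raise ValueError(f"{old} not found at position {start}")
    return seq[:i] + new + seq[i + len(old):]


d1 = delete_positions(reference, [3, 4])       # TGGCTA
d2 = delete_positions(reference, [6])          # TGCCGTA
d1d2 = delete_positions(reference, [3, 4, 6])  # TGGTA
d3 = substitute(reference, 2, "GCCGC", "GG")   # TGGTA

print(d1d2 == d3)  # True
```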