Kai Ye, Jiayin Wang, Reyka Jayasinghe, Eric-Wubbo Lameijer, Joshua F McMichael, Jie Ning, Michael D McLellan, Mingchao Xie, Song Cao, Venkata Yellapantula, Kuan-lin Huang, Adam Scott, Steven Foltz, Beifang Niu, Kimberly J Johnson, Matthijs Moed, P Eline Slagboom, Feng Chen, Michael C Wendl, Li Ding.
Abstract
Complex insertions and deletions (indels) are formed by simultaneously deleting and inserting DNA fragments of different sizes at a common genomic location. Here we present a systematic analysis of somatic complex indels in the coding sequences of samples from over 8,000 cancer cases using Pindel-C. We discovered 285 complex indels in cancer-associated genes (such as PIK3R1, TP53, ARID1A, GATA3 and KMT2D) in approximately 3.5% of cases analyzed; nearly all instances of complex indels were overlooked (81.1%) or misannotated (17.6%) in previous reports of 2,199 samples. In-frame complex indels are enriched in PIK3R1 and EGFR, whereas frameshifts are prevalent in VHL, GATA3, TP53, ARID1A, PTEN and ATRX. Furthermore, complex indels display strong tissue specificity (such as VHL in kidney cancer samples and GATA3 in breast cancer samples). Finally, structural analyses support findings of previously missed, but potentially druggable, mutations in the EGFR, MET and KIT oncogenes. This study indicates the critical importance of improving complex indel discovery and interpretation in medical research.
Year: 2015 PMID: 26657142 PMCID: PMC5003782 DOI: 10.1038/nm.4002
Source DB: PubMed Journal: Nat Med ISSN: 1078-8956 Impact factor: 53.440
Figure 1. The somatic complex indel detection and filtering workflow and algorithm testing
(a) Soft-clipped and unmapped reads are extracted from BAM files and then split-aligned with pattern growth. The alignment result is examined to determine whether certain reads support complex variants. Various filtering, annotation, and statistical analysis steps follow to maintain the quality of the complex indel call list. Inset shows three basic configurations as pseudo de Bruijn graphs (where circular or square loops represent sequences removed to obtain alignment): a simple deletion (top), a complex indel with template sequence from the 5′ sense strand (middle), and a complex indel with template sequence that is the reverse complement of the deleted fragment (bottom). Ref is the reference allele and alt is the alternative allele. (b) Results of simulation testing on chromosome 1 of the Venter genome for Pindel-C versus GATK and VarScan. Of 1,128 simulated complex indels, Pindel-C found 541 (48% sensitivity), whereas neither GATK nor VarScan was able to identify any. Pindel-C also mistakenly called 88 additional events as simple indels, implying a false-discovery rate of 14%.
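The sensitivity and false-discovery rate quoted in panel (b) follow from the reported counts. A minimal Python sketch of that arithmetic, assuming the 88 miscalled events are the only false calls set against the 541 recovered indels (variable names are illustrative and not taken from the paper):

```python
# Counts reported in the Figure 1b caption.
simulated_complex_indels = 1128   # simulated events on chromosome 1
found_by_pindel_c = 541           # simulated events recovered as complex indels
miscalled_as_simple = 88          # additional events called as simple indels

# Sensitivity: fraction of simulated events that were recovered.
sensitivity = found_by_pindel_c / simulated_complex_indels

# False-discovery rate, assuming (our reading of the caption) that the
# miscalled events are the false calls among all calls made.
fdr = miscalled_as_simple / (found_by_pindel_c + miscalled_as_simple)

print(f"sensitivity = {sensitivity:.1%}")
print(f"FDR         = {fdr:.1%}")
```

Under this reading, both values round to the percentages given in the caption.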
Figure 2. The exome-wide landscape and characteristics of somatic complex indels across 19 cancer types
(a) Box plot of the number of somatic complex indels in 19 cancer types. The 6 samples with more than 15 indels are not shown; they are listed here as cancer type, sample name, number of somatic complex indels: KIRC, TCGA-AK-3430, 16; KIRC, TCGA-AK-3451, 18; COAD, TCGA-AY-5543, 18; COAD, TCGA-CM-5860, 20; COAD, TCGA-G4-6299, 22; LIHC, TCGA-G3-A25T, 69. (b) Genes most frequently affected by somatic complex indels. The x-axis is the number of somatic complex indels in a given gene and the y-axis is the number of distinct genes. (c) Complex indels dissected into a deletion and an insertion at the same breakpoint, with the size of each plotted per variant. Density plots of deletion and insertion sizes are shown alongside.
Figure 3. Schematics of simple and complex indel configurations
The first two diagrams depict a simple deletion and a simple insertion. A total of 12 distinct scenarios were observed. In Ref F, part of the deleted sequence is inserted right at the breakpoint, whereas in Ref R the reverse complement of the deleted fragment is inserted. Terms in the figure are defined as follows: Ref 5 and Ref 3 mean that the inserted sequence originates from part of the deleted 5′ flanking and 3′ flanking sequence, respectively. F and R indicate whether the inserted sequence is a direct copy or a copy of the reverse complement. Among the 12 scenarios, 6 are single-source and the rest are combinations of various single sources. The coloring scheme for unchanged (static) and mutated (transformed) bases is illustrated.
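To make the Ref F and Ref R configurations concrete, the sketch below applies a complex indel (a deletion plus an insertion at the same breakpoint) to a toy reference string. The sequence, coordinates, and function names are hypothetical and chosen only for illustration; this is not how Pindel-C represents or detects variants.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def apply_complex_indel(ref: str, start: int, del_len: int, inserted: str) -> str:
    """Delete del_len bases at 0-based start and insert a new fragment there."""
    return ref[:start] + inserted + ref[start + del_len:]


ref = "ACGTTTGACCAGT"                   # toy reference (hypothetical)
start, del_len = 4, 6
deleted = ref[start:start + del_len]    # "TTGACC"
template = deleted[2:5]                 # part of the deleted fragment: "GAC"

# Ref F: a direct copy of part of the deleted sequence is inserted.
alt_ref_f = apply_complex_indel(ref, start, del_len, template)

# Ref R: the reverse complement of that fragment is inserted instead.
alt_ref_r = apply_complex_indel(ref, start, del_len, reverse_complement(template))

print(alt_ref_f)  # ACGTGACAGT
print(alt_ref_r)  # ACGTGTCAGT
```

In both cases the alternative allele is shorter than the reference by three bases, so an aligner that models only a simple deletion would leave the inserted bases unexplained.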
Figure 4. Abundance of somatic complex indels in key cancer genes per cancer type and the contribution of somatic complex indels to the total numbers of indels for 10 cancer genes
(a) Plot of the number of samples carrying somatic complex indels in 37 cancer genes across 20 cancer types. Dot size indicates the number of samples. (b) Histogram of simple and complex indel counts in 12 key cancer genes with the largest percent gain.
Statistical test of whether the variant allele fraction (VAF) of complex indels is higher than the VAF of simple variants.
| Rank | Cancer | Gene | Case VAF average | Control VAF average | Case VAFs | Control VAFs | P-value | FDR |
|---|---|---|---|---|---|---|---|---|
| 1 | LUAD | | 53.4% | 14.9% | 3 | 12 | 0.00659 | 0.03956 |
| 2 | BRCA | | 53.6% | 30.6% | 3 | 14 | 0.09118 | 0.27353 |
| 3 | KIRC | | 43.9% | 33.0% | 6 | 12 | 0.09820 | 0.27353 |
| 4 | KIRC | | 43.8% | 31.8% | 3 | 10 | 0.16434 | 0.27353 |
| 5 | UCEC | | 42.3% | 39.4% | 11 | 50 | 0.26250 | 0.31500 |
| 6 | UCEC | | 40.6% | 39.1% | 4 | 21 | 0.39976 | 0.39976 |
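
The caption does not say how the FDR column was computed. The reported values are numerically consistent with scaling each p-value by n / rank (n = 6 tests) and then taking a running maximum down the ranked list; this is an inference from the numbers in the table, not a method stated in the source. It also differs from the standard Benjamini-Hochberg step-up adjustment, which enforces monotonicity with a running minimum taken from the largest p-value downward. A minimal Python sketch of the inferred calculation (the function name is hypothetical):

```python
# P-values as reported in the table, already sorted by rank.
p_values = [0.00659, 0.09118, 0.09820, 0.16434, 0.26250, 0.39976]


def rank_scaled_running_max(p_sorted):
    """Scale each p-value by n / rank, then enforce a non-decreasing sequence.

    Assumption: this reproduces the table's FDR column; the source does not
    confirm that this is the procedure the authors used.
    """
    n = len(p_sorted)
    adjusted, running = [], 0.0
    for rank, p in enumerate(p_sorted, start=1):
        running = max(running, min(1.0, p * n / rank))
        adjusted.append(running)
    return adjusted


adjusted = rank_scaled_running_max(p_values)
reported = [0.03956, 0.27353, 0.27353, 0.27353, 0.31500, 0.39976]

for rank, (a, r) in enumerate(zip(adjusted, reported), start=1):
    print(f"rank {rank}: computed {a:.5f}, reported {r:.5f}")
```

Small differences in the last decimal place are expected because the table's p-values are themselves rounded.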