Lei Wei, Lu T Liu, Jacob R Conroy, Qiang Hu, Jeffrey M Conroy, Carl D Morrison, Candace S Johnson, Jianmin Wang, Song Liu.
Abstract
BACKGROUND: Next-generation sequencing (NGS) technologies have rapidly advanced our understanding of human variation in cancer. To accurately translate raw sequencing data into practical knowledge, annotation tools, algorithms and pipelines must be developed that keep pace with the rapidly evolving technology. Currently, a challenge exists in accurately annotating multi-nucleotide variants (MNVs). When these tandem substitutions affect multiple nucleotides within a single codon of a gene, the translated amino acid depends on all nucleotides in that codon. Most existing variant callers report an MNV as individual single-nucleotide variants (SNVs), often resulting in multiple triplet codon sequences and incorrect amino acid predictions. To correct potentially misannotated MNVs among reported SNVs, a primary challenge is haplotype phasing, that is, determining whether neighboring SNVs are located on the same chromosome.
Year: 2015 PMID: 26231518 PMCID: PMC4521406 DOI: 10.1186/s12864-015-1779-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Amino acid predictions for two scenarios of neighboring SNVs. (A1) Two consecutive SNVs in codon 285 of gene TP53. The presence of both SNVs on the same read suggests that they originated from the same chromosome. (A2) Incorrect annotation based on prediction of individual SNVs: the first and second SNVs were predicted to introduce E285V and E285Q, respectively. (A3) The correct amino acid change based on the MNV is E285L. (B1) Two SNVs located in codon 252 of gene OR6Y1 but on different reads, suggesting that they originated from separate chromosomes. (B2) The two SNVs in B1 were correctly predicted to introduce V252V and V252I based on individual SNVs. The sequencing reads are displayed in the IGV viewer [14]
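The gap between panels A2 and A3 can be reproduced in a few lines of Python. This is a minimal sketch, not MAC's code. It assumes the reference codon for TP53 E285 is GAG on the coding strand (the caption does not give the codon sequence) and that the two SNVs change the second and first codon positions, respectively, which is consistent with the reported E285V, E285Q and E285L. The helper names `translate`, `apply_subs` and `annotate` are illustrative.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(codon):
    """Translate one coding-strand DNA codon into a one-letter amino acid."""
    return CODON_TABLE[codon.upper()]


def apply_subs(codon, subs):
    """Apply substitutions to a codon; subs maps 0-based codon offset -> new base."""
    bases = list(codon)
    for offset, base in subs.items():
        bases[offset] = base
    return "".join(bases)


def annotate(ref_codon, residue_number, subs):
    """Return a protein-level change such as 'E285L'."""
    ref_aa = translate(ref_codon)
    alt_aa = translate(apply_subs(ref_codon, subs))
    return f"{ref_aa}{residue_number}{alt_aa}"


REF_CODON = "GAG"   # assumed reference codon for E285
snv_1 = {1: "T"}    # assumed: A>T at the second codon position
snv_2 = {0: "C"}    # assumed: G>C at the first codon position

# Annotating each SNV alone ignores the other change in the same codon.
per_snv = [annotate(REF_CODON, 285, snv_1), annotate(REF_CODON, 285, snv_2)]
# Annotating both together is only valid when the SNVs are on the same haplotype.
as_mnv = annotate(REF_CODON, 285, {**snv_1, **snv_2})

print(per_snv, as_mnv)  # ['E285V', 'E285Q'] E285L
```

In scenario B the two SNVs lie on different reads, so the per-SNV annotations are the ones to keep and the combined call would be wrong.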
Fig. 2 Depiction of the MAC workflow (left panel) and a MAC test run (right panel). Left: (A1) a list of SNVs identified by any variant caller; (A2) reads are extracted from the BAM file for all SNVs to identify Blocks of Mutations (BMs); (A3) Blocks of Mutations within Codon (BMCs) are identified within each subgraph using an annotation tool. Right: a MAC test run using 3024 input SNVs from a breast cancer data set identified 56 BMs and 4 BMCs containing 8 SNVs. After re-annotation, 7 of 8 SNVs were classified as MNVs with amino acid changes different from the original SNV-based annotation
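The left panel can be read as a small graph problem, sketched below under several assumptions and not as MAC's actual implementation. Reads are plain dictionaries mapping a genomic position to the observed base, standing in for what would be extracted from the BAM file. Two SNVs are linked when at least one read carries both alternate alleles. A BM is taken to be a connected component with two or more SNVs. The `codon_of` lookup is a hypothetical placeholder for the annotation tool, and all positions and gene names are toy values.

```python
from collections import defaultdict
from itertools import combinations


def blocks_of_mutations(snvs, reads):
    """Group SNVs that co-occur on reads.

    snvs:  list of (position, alt_base) tuples
    reads: list of {position: observed_base} dictionaries
    Returns a list of blocks, each a sorted list of linked SNVs.
    """
    snv_set = set(snvs)
    adjacency = {snv: set() for snv in snvs}
    for read in reads:
        carried = [(pos, base) for pos, base in read.items() if (pos, base) in snv_set]
        for a, b in combinations(carried, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)

    blocks, seen = [], set()
    for start in snvs:
        if start in seen or not adjacency[start]:
            continue  # already grouped, or never seen together with another SNV
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        blocks.append(sorted(component))
    return blocks


def blocks_within_codon(block, codon_of):
    """Split one block into groups of two or more SNVs that share a codon."""
    by_codon = defaultdict(list)
    for snv in block:
        codon = codon_of(snv[0])
        if codon is not None:  # None means the position is not in a coding region
            by_codon[codon].append(snv)
    return [group for group in by_codon.values() if len(group) > 1]


# Toy input: 100/101 share reads; 250/251 are only seen on different reads.
snvs = [(100, "A"), (101, "G"), (250, "T"), (251, "C")]
reads = [
    {100: "A", 101: "G"},
    {100: "A", 101: "G"},
    {250: "T", 251: "G"},
    {250: "C", 251: "C"},
]
codon_of = {100: ("GENE1", 34), 101: ("GENE1", 34)}.get  # hypothetical annotation

bms = blocks_of_mutations(snvs, reads)
bmcs = [group for bm in bms for group in blocks_within_codon(bm, codon_of)]
print(bms)
print(bmcs)
```

Only the SNVs in a BMC would then be re-annotated together; every other SNV keeps its individual annotation.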
Results of SNV- and MNV-based amino acid predictions in the MAC test run
| # | Mutation | SNV annotation^a | MNV annotation | Gene (mRNA) |
|---|---|---|---|---|
| 1 | chr6.12121325.C > G | P433A (missense) | P433G (missense) | HIVEP1 (NM_002114) |
| 2 | chr6.12121326.C > G | P433R (missense) | P433G (missense) | HIVEP1 (NM_002114) |
| 3 | chr17.7577084.T > A | E285V (missense) | E285L (missense) | TP53 (NM_000546) |
| 4 | chr17.7577085.C > G | E285Q (missense) | E285L (missense) | TP53 (NM_000546) |
| 5 | chr6.44376224.C > G | A316G (missense) | A316G (missense) | CDC5L (NM_001253) |
| 6 | chr6.44376225.G > C | A316A (silent) | A316G (missense) | CDC5L (NM_001253) |
| 7 | chr18.72775594.T > A | L1973I (missense) | L1973K (missense) | ZNF407 (NM_017757) |
| 8 | chr18.72775595.T > A | L1973a (nonsense) | L1973K (missense) | ZNF407 (NM_017757) |
^a The underscores indicate differences between SNV- and MNV-based annotations
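The statement in the Fig. 2 caption that 7 of 8 SNVs receive a different amino acid change after re-annotation can be checked directly against this table. The snippet below transcribes the SNV and MNV annotations as printed (each MNV annotation applies to both SNVs of its pair) and counts the mismatches. The variable names are illustrative.

```python
# (mutation, SNV-based annotation, MNV-based annotation), as printed in the table.
rows = [
    ("chr6.12121325.C>G", "P433A", "P433G"),
    ("chr6.12121326.C>G", "P433R", "P433G"),
    ("chr17.7577084.T>A", "E285V", "E285L"),
    ("chr17.7577085.C>G", "E285Q", "E285L"),
    ("chr6.44376224.C>G", "A316G", "A316G"),
    ("chr6.44376225.G>C", "A316A", "A316G"),
    ("chr18.72775594.T>A", "L1973I", "L1973K"),
    ("chr18.72775595.T>A", "L1973a", "L1973K"),  # nonsense call, label as printed
]

changed = [mutation for mutation, snv_aa, mnv_aa in rows if snv_aa != mnv_aa]
unchanged = [mutation for mutation, snv_aa, mnv_aa in rows if snv_aa == mnv_aa]

print(f"{len(changed)} of {len(rows)} SNVs change annotation")  # 7 of 8
print("unchanged:", unchanged)  # ['chr6.44376224.C>G']
```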
Performance evaluation of MAC on 10 TCGA tumor samples (LUSC)
| Sample barcode | Num. of input SNVs | Num. of BMs | Size of the largest BM | No annotation: memory (KB) | No annotation: run time (min:sec) | ANNOVAR: num. of BMCs | ANNOVAR: memory (KB) | ANNOVAR: run time (min:sec) | VEP: num. of BMCs | VEP: memory (KB) | VEP: run time (min:sec) | SnpEff: num. of BMCs | SnpEff: memory (KB) | SnpEff: run time (min:sec) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-18-3409-01A-01D-0983-08 | 3910 | 303 | 4 | 1123088 | 02:18.1 | 108 | 5185632 | 01:47.5 | 108 | 6701680 | 10:08.5 | 112 | 13766144 | 05:00.4 |
| TCGA-22-5473-01A-01D-1632-08 | 944 | 21 | 3 | 652336 | 01:05.9 | 6 | 5180416 | 01:10.1 | 6 | 5017200 | 02:35.5 | 6 | 13654800 | 04:10.1 |
| TCGA-33-4566-01A-01D-1441-08 | 1451 | 41 | 3 | 844304 | 01:23.5 | 18 | 5180864 | 01:29.0 | 20 | 5020896 | 03:20.2 | 19 | 13706992 | 04:27.2 |
| TCGA-34-5231-01A-21D-1817-08 | 743 | 15 | 4 | 459488 | 00:53.8 | 7 | 5180416 | 00:58.6 | 7 | 5015952 | 02:12.6 | 7 | 13635536 | 03:55.1 |
| TCGA-37-5819-01A-01D-1632-08 | 839 | 31 | 2 | 380144 | 00:41.0 | 10 | 5184032 | 00:44.1 | 11 | 5013136 | 02:29.6 | 12 | 13725376 | 03:55.3 |
| TCGA-39-5031-01A-01D-1441-08 | 754 | 22 | 3 | 549296 | 00:52.5 | 1 | 5180384 | 00:59.2 | 1 | 5018256 | 02:21.9 | 1 | 13644784 | 03:58.5 |
| TCGA-46-3769-01A-01D-0983-08 | 1037 | 29 | 3 | 446448 | 00:41.8 | 13 | 5184032 | 01:04.3 | 13 | 5025968 | 02:26.0 | 13 | 13639456 | 03:49.8 |
| TCGA-60-2698-01A-01D-1522-08 | 1396 | 34 | 5 | 1198992 | 02:07.6 | 4 | 5180400 | 02:07.2 | 4 | 5021424 | 03:33.8 | 4 | 13715488 | 04:54.1 |
| TCGA-66-2785-01A-01D-1522-08 | 1338 | 44 | 3 | 1101216 | 01:56.5 | 4 | 5180848 | 01:54.0 | 4 | 5025424 | 03:33.9 | 4 | 13731600 | 04:53.6 |
| TCGA-85-6561-01A-11D-1817-08 | 984 | 14 | 2 | 602112 | 00:44.0 | 3 | 5180400 | 01:31.6 | 3 | 5016176 | 02:26.3 | 3 | 13637520 | 04:09.9 |
Num. of input SNVs: the number of somatic SNVs from TCGA data matrix
Num. of BMs: the number of identified Blocks of Mutations
Size of the largest BM: the maximum number of SNVs in a Block of Mutations
Num. of BMCs: the number of Blocks of Mutations within Codon (BMCs)
Memory: the peak memory used during the run
Run time: the elapsed wall time for the run
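To compare the annotators numerically, the Memory and Run time columns first need converting to plain numbers. The following is a minimal sketch, assuming run times are written as minutes:seconds with a decimal fraction and memory is reported in kilobytes (so 1 GiB = 1024 × 1024 KB). The values are those of the first sample row (TCGA-18-3409-01A-01D-0983-08), and the function names are illustrative.

```python
def runtime_seconds(text):
    """Convert a 'min:sec' string such as '10:08.5' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def kb_to_gib(kilobytes):
    """Convert kilobytes to gibibytes (assumes the table reports 1024-byte KB)."""
    return kilobytes / (1024 * 1024)


# Peak memory (KB) and run time for the first sample, per annotation mode.
first_row = {
    "No annotation": (1123088, "02:18.1"),
    "ANNOVAR": (5185632, "01:47.5"),
    "VEP": (6701680, "10:08.5"),
    "SnpEff": (13766144, "05:00.4"),
}

for tool, (memory_kb, run_time) in first_row.items():
    print(f"{tool}: {kb_to_gib(memory_kb):.2f} GiB, {runtime_seconds(run_time):.1f} s")
```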