Xiaojia Tang1, Saurabh Baheti1, Khader Shameer1, Kevin J Thompson1, Quin Wills2, Nifang Niu2, Ilona N Holcomb3, Stephane C Boutet3, Ramesh Ramakrishnan3, Jennifer M Kachergus4, Jean-Pierre A Kocher1, Richard M Weinshilboum2, Liewei Wang2, E Aubrey Thompson5, Krishna R Kalari6.
Abstract
Rapid development of next generation sequencing technology has enabled the identification of genomic alterations from short sequencing reads. A number of software pipelines are available for calling single nucleotide variants from genomic DNA, but no comprehensive pipelines exist to identify, annotate and prioritize expressed SNVs (eSNVs) from non-directional paired-end RNA-Seq data. We have developed eSNV-Detect, a novel computational system that uses data from multiple aligners to call variants from RNA-Seq, even at low read depths, and to rank them. Multi-platform comparisons with the eSNV-Detect variant candidates were performed. The method was first applied to RNA-Seq from a lymphoblastoid cell line, achieving 99.7% precision and 91.0% sensitivity in the expressed SNPs for the matching HumanOmni2.5 BeadChip data. Comparison of RNA-Seq eSNV candidates from 25 ER+ breast tumors from The Cancer Genome Atlas (TCGA) project with whole exome coding data showed 90.6-96.8% precision and 91.6-95.7% sensitivity. Contrasting single-cell mRNA-Seq variants with matching traditional multicellular RNA-Seq data for the MDA-MB-231 breast cancer cell line delineated variant heterogeneity among the single cells. Further, Sanger sequencing validation was performed for an ER+ breast tumor with paired normal adjacent tissue, validating 29 out of 31 candidate eSNVs. The source code and user manuals of the eSNV-Detect pipeline for Sun Grid Engine and virtual machine are available at http://bioinformaticstools.mayo.edu/research/esnv-detect/.
Year: 2014 PMID: 25352556 PMCID: PMC4267611 DOI: 10.1093/nar/gku1005
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flow chart of the eSNV-Detect pipeline. (a) BAM files are pre-processed with Picard and GATK to remove duplicate reads and reads with multiple hits. (b) SNVs called from two aligners are merged, annotated and filtered by features such as total read depth, alternative allele depth and frequency, as well as annotations.
The eSNV criteria used in the eSNV-Detect pipeline
| Criteria | Threshold |
|---|---|
| Alternative allele supporting read depth | d_alt>3 |
| Alternative allele frequency | if (total read depth >100) alt/ref >0.05 else alt/ref>0.1 |
| Strand bias ratio | if (total read depth >100) alt/ref >0.05 else alt/ref>0.1 |
| ReadPosRankSum | -8<RPKS<8 |
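The thresholds in the table above can be read as a simple pass/fail filter on each candidate. The following is a minimal sketch under stated assumptions, not the pipeline's actual code: the `Candidate` class and `passes_criteria` function are hypothetical names, total read depth is assumed to be the sum of reference and alternative reads, and `alt/ref` is taken literally as alternative depth divided by reference depth. The strand bias criterion is left out because its threshold is not clearly recoverable from the table.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one candidate eSNV (not from the pipeline)."""
    ref_depth: int   # reads supporting the reference allele
    alt_depth: int   # reads supporting the alternative allele
    rpks: float      # read-position rank-sum statistic (RPKS in the table)


def passes_criteria(c: Candidate) -> bool:
    """Apply the depth, allele-ratio and RPKS thresholds from the table."""
    # Assumption: total depth counts only reference and alternative reads.
    total_depth = c.ref_depth + c.alt_depth

    # Alternative allele supporting read depth: d_alt > 3
    if c.alt_depth <= 3:
        return False

    # Alternative allele frequency: stricter cutoff at lower total depth.
    cutoff = 0.05 if total_depth > 100 else 0.1
    # With no reference reads the ratio is treated as infinite (assumption).
    ratio = c.alt_depth / c.ref_depth if c.ref_depth > 0 else float("inf")
    if ratio <= cutoff:
        return False

    # Read-position rank-sum: -8 < RPKS < 8
    return -8 < c.rpks < 8


if __name__ == "__main__":
    print(passes_criteria(Candidate(ref_depth=200, alt_depth=12, rpks=0.0)))
    print(passes_criteria(Candidate(ref_depth=50, alt_depth=4, rpks=0.0)))
```

Under these assumptions, the first call passes (ratio 0.06 against a cutoff of 0.05 at depth 212) and the second fails (ratio 0.08 against a cutoff of 0.1 at depth 54).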
Figure 2. Validation of the eSNVs in NA07347 mRNA-Seq data against the Omni 2.5 chip data. (a) 15 753 out of 15 796 eSNVs were validated by the Omni data; 1554 Omni SNPs were expressed but not called by eSNV-Detect. (b) The 16 441 validated eSNVs are distributed across the whole genome, mainly in exonic (36.9%), UTR (38.4%) and intronic (14.3%) regions.
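The counts in panel (a) are consistent with the precision and sensitivity quoted in the abstract. The short check below assumes that precision is the validated fraction of called eSNVs and that sensitivity is the validated fraction of all expressed Omni SNPs (validated plus missed). These definitions are inferred from the numbers, and the variable names are illustrative.

```python
# Counts taken from the Figure 2(a) caption.
validated = 15_753   # eSNVs confirmed by the Omni 2.5 chip
called = 15_796      # eSNVs compared against the chip
missed = 1_554       # expressed Omni SNPs not called by eSNV-Detect

# Assumed definitions (inferred, not stated in the caption).
precision = validated / called
sensitivity = validated / (validated + missed)

print(f"precision   = {precision:.1%}")    # 99.7%
print(f"sensitivity = {sensitivity:.1%}")  # 91.0%
```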
Figure 3. Sanger sequencing validation of the called eSNVs. In addition to eSNVs with higher allele frequencies, an eSNV in the PDCL3 gene called with a low minor allele frequency was also validated by Sanger sequencing.
The precision and recall in the 25 TCGA ER+ samples when validated with the protected mutation list from WES data
| TCGA sample ID | Validated eSNVs | Total eSNVs | WES No Coverage | Expressed WES SNV | Precision | Recall | F |
|---|---|---|---|---|---|---|---|
|  | 9464 | 10 183 | 231 | 10 292 | 0.951 | 0.920 | 0.935 |
|  | 8353 | 9436 | 534 | 8838 | 0.938 | 0.945 | 0.942 |
|  | 9608 | 10 319 | 251 | 10 405 | 0.954 | 0.923 | 0.939 |
|  | 9871 | 10 650 | 181 | 10 817 | 0.943 | 0.913 | 0.928 |
|  | 9316 | 10 733 | 1058 | 9868 | 0.963 | 0.944 | 0.953 |
|  | 9288 | 11 418 | 1161 | 9808 | 0.906 | 0.947 | 0.926 |
|  | 10 315 | 11 711 | 934 | 10 894 | 0.957 | 0.947 | 0.952 |
|  | 9628 | 10 556 | 272 | 10 365 | 0.936 | 0.929 | 0.933 |
|  | 8169 | 9339 | 891 | 8706 | 0.967 | 0.938 | 0.953 |
|  | 10 177 | 10 770 | 188 | 11 029 | 0.962 | 0.923 | 0.942 |
|  | 9265 | 10 525 | 776 | 9840 | 0.950 | 0.942 | 0.946 |
|  | 9271 | 10 496 | 763 | 10 064 | 0.953 | 0.921 | 0.937 |
|  | 8613 | 10 141 | 848 | 9318 | 0.927 | 0.924 | 0.926 |
|  | 10 606 | 12 528 | 1246 | 11 116 | 0.940 | 0.954 | 0.947 |
|  | 10 432 | 12 113 | 1135 | 11 008 | 0.950 | 0.948 | 0.949 |
|  | 9284 | 10 664 | 974 | 9824 | 0.958 | 0.945 | 0.952 |
|  | 8751 | 9729 | 692 | 9215 | 0.968 | 0.950 | 0.959 |
|  | 9532 | 11 604 | 1573 | 9959 | 0.950 | 0.957 | 0.954 |
|  | 9286 | 10 333 | 685 | 9790 | 0.962 | 0.949 | 0.955 |
|  | 9157 | 10 394 | 815 | 9680 | 0.956 | 0.946 | 0.951 |
|  | 9786 | 10 559 | 208 | 10 510 | 0.945 | 0.931 | 0.938 |
|  | 6720 | 7261 | 287 | 7300 | 0.964 | 0.921 | 0.942 |
|  | 9119 | 10 465 | 691 | 9719 | 0.933 | 0.938 | 0.936 |
|  | 11 042 | 12 351 | 842 | 11 639 | 0.959 | 0.949 | 0.954 |
|  | 9927 | 10 599 | 200 | 10 834 | 0.955 | 0.916 | 0.935 |
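The table does not spell out how its last three columns are derived. The sketch below reproduces the first row under assumed definitions: precision is validated eSNVs divided by total eSNVs minus those without WES coverage, recall is validated eSNVs divided by expressed WES SNVs, and F is the harmonic mean of the two. The function name `row_metrics` is illustrative, and the formulas are inferred from the numbers rather than quoted from the paper.

```python
def row_metrics(validated: int, total: int, no_coverage: int, expressed_wes: int):
    """Return (precision, recall, F) rounded to three decimals.

    Assumed definitions, inferred from the table values:
      precision = validated / (total - no_coverage)
      recall    = validated / expressed_wes
      F         = harmonic mean of precision and recall
    """
    precision = validated / (total - no_coverage)
    recall = validated / expressed_wes
    f_score = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f_score, 3)


# First row of the table: 9464 validated, 10 183 total,
# 231 without WES coverage, 10 292 expressed WES SNVs.
print(row_metrics(9464, 10183, 231, 10292))  # (0.951, 0.92, 0.935)
```

Excluding eSNVs that fall outside WES coverage from the precision denominator means a call is not counted as a false positive when the exome data could not have confirmed it.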
Gene level eSNVs summary for most frequently mutated genes listed in the TCGA paper (7)
| Gene | # of samples with mutations | # of samples with mutations in protein domain | # of samples with deleterious mutations (AVSIFT) | # of samples with deleterious mutations in domain |
|---|---|---|---|---|
| PIK3CA | 10 | 3 | 2 | 1 |
| MAP3K1 | 25 | 1 | 3 | 1 |
| GATA3 | 1 | 1 | 1 | 1 |
| TP53 | 21 | 4 | 3 | 3 |
| CDH1 | 4 | 4 | 4 | 4 |
| MAP2K4 | 1 | 1 | 1 | 1 |
| MLL3 | 5 | 2 | 2 | 0 |
| PIK3R1 | 3 | 3 | 1 | 1 |
| AKT1 | 1 | 1 | 1 | 1 |
| RUNX1 | 1 | 1 | 1 | 1 |
| CBFB | 1 | 1 | 1 | 1 |
| TBX3 | 1 | 0 | 1 | 1 |
| NCOR1 | 4 | 0 | 3 | 0 |
| CTCF | 1 | 1 | 1 | 1 |
| FOXA1 | 9 | 9 | 8 | 8 |
| SF3B1 | 1 | 1 | 1 | 1 |
| CDKN1B | 9 | 0 | 0 | 0 |
| RB1 | 1 | 1 | 1 | 1 |
| AFF2 | 1 | 1 | 1 | 1 |
| NF1 | 1 | 0 | 0 | 0 |
| PTPN22 | 19 | 0 | 0 | 0 |
| PTPRD | 1 | 1 | 0 | 0 |
| ATM | 23 | 1 | 5 | 1 |
| BRCA1 | 11 | 11 | 10 | 4 |
| BRCA2 | 15 | 0 | 2 | 0 |
| BRIP1 | 16 | 0 | 16 | 0 |
| CHEK2 | 1 | 1 | 1 | 1 |
| NBN | 13 | 0 | 0 | 0 |
| PTEN | 1 | 1 | 1 | 1 |
| RAD51C | 1 | 1 | 1 | 1 |
Figure 4. Application of eSNV-Detect to single-cell sequencing and the matching multicellular GA-II RNA-Seq data. Comparing the single-cell data with the multicellular data shows the cellular heterogeneity of the single cells in variant calling.