Baoting Nong1, Mengbiao Guo1, Weiwen Wang2, Zhou Songyang1, Yuanyan Xiong1.
Abstract
Various abnormalities of transcriptional regulation revealed by RNA sequencing (RNA-seq) have been reported in cancers. However, strategies to integrate multi-modal information from RNA-seq, which would help uncover more disease mechanisms, are still limited. Here, we present PipeOne, a cross-platform one-stop analysis workflow for large-scale transcriptome data. It was developed based on Nextflow, a reproducible workflow management system. PipeOne is composed of three modules: data processing and feature matrix construction, disease feature prioritization, and disease subtyping. It first integrates eight different tools to extract different types of information from RNA-seq data, and then uses the random forest algorithm to study and stratify patients according to evidence from this multi-modal information. Its application to five cancers (colon, liver, kidney, stomach, and thyroid; total samples n = 2024) identified various dysregulated key features (such as PVT1 expression and ABI3BP alternative splicing) and pathways (especially liver and kidney dysfunction) shared by multiple cancers. Furthermore, we demonstrated clinically relevant patient subtypes in four of the five cancers, with most subtypes characterized by distinct driver somatic mutations, such as TP53, TTN, BRAF, HRAS, MET, KMT2D, and KMT2C mutations. Importantly, dysregulated biological processes, such as ribosome biogenesis, RNA binding, and mitochondrial functions, frequently contributed to these subtyping results. PipeOne is efficient and accurate in studying different cancer types and reveals both the cancer-specific and the cross-cancer contributing factors of each cancer. It could be easily applied to other diseases and is available on GitHub.
Keywords: RNA-seq workflow; TCGA; alternative splicing; cancer subtyping; feature prioritization; mitochondria; ribosome; somatic mutation
Year: 2021 PMID: 34946814 PMCID: PMC8701385 DOI: 10.3390/genes12121865
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Overall design of PipeOne. (A) Three modules of PipeOne: data processing and various feature identification (one), feature prioritizing (two), and disease subtyping (three). (B) Details of module one. Raw sequencing reads were quality controlled by FASTP and then went through eight tools to extract information from RNA-seq data, including expression levels of mRNA, lncRNA, circRNA, and retrotransposons, alternative splicing events, alternative polyadenylation, RNA editing, gene fusions, and SNPs. This information was used to construct the feature matrices for machine learning in modules two and three (only the top 1000 most variable features were used for each type of information). (C) Details of module two. First, feature importance was calculated by using random forest on all features from module one. Then the top K features (20, 20, 50, 100, 200, all) ranked by feature importance were used to test and validate the importance of those top features. (D) Details of module three. First, a robust NMF integration algorithm was applied to obtain latent features and associated weights for all samples. Then K-Means clustering, evaluated by silhouette width, was used to cluster samples based on the latent feature matrix. Differential survival analysis by log-rank test was used to assess the clinical relevance of those stable clusters as potential subtypes. Finally, similar to module two, random forest was used to select features contributing to the subtyping results.
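The variance filter mentioned in panel (B), which keeps only the most variable features of each information type, can be illustrated with a minimal sketch. This is not PipeOne's own code: the function name `top_variable_features`, the list-of-rows matrix layout, and the toy feature names are assumptions made for illustration, and population variance is assumed as the variability measure.

```python
from statistics import pvariance


def top_variable_features(matrix, feature_names, n_top=1000):
    """Keep the n_top features with the highest variance across samples.

    matrix: list of rows, one row per sample and one column per feature.
    Returns (selected_names, reduced_matrix); columns keep their original order.
    """
    n_features = len(feature_names)
    # Variance of each feature (column) across all samples.
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_features)]
    # Rank column indices by variance, highest first (stable for ties).
    ranked = sorted(range(n_features), key=lambda j: -variances[j])
    keep = sorted(ranked[:n_top])
    names = [feature_names[j] for j in keep]
    reduced = [[row[j] for j in keep] for row in matrix]
    return names, reduced


# Toy matrix (hypothetical values): 4 samples x 3 features.
toy = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 10.0],
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 10.0],
]
names, reduced = top_variable_features(
    toy, ["f_const", "f_low", "f_high"], n_top=2
)
```

In this toy case the constant feature is dropped and the two varying features are retained. In a real run, one such filtered matrix per information type would be passed on to the random forest step.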
Comparing PipeOne with two other pipelines, RNACocktail and VIPER.
| | PipeOne | RNACocktail | VIPER |
|---|---|---|---|
| Quality control | √ | x | √ |
| Alignment | √ | √ | √ |
| Transcriptome reconstruction | √ | √ | x |
| Gene quantification | √ | √ | √ |
| Novel lncRNA prediction | √ | x | x |
| CircRNA prediction | √ | x | x |
| Fusion prediction | √ | √ | √ |
| Variant calling | √ | √ | √ |
| RNA editing prediction | √ | √ | x |
| Retrotranscriptome | √ | x | x |
| Alternative splicing | √ | x | x |
| Viral DNA detection | x | x | √ |
| Long-read | x | √ | x |
| | | | |
| Result visualization | x | x | √ |
| Differential expression analysis | x | √ | √ |
| Pathway analysis | x | x | √ |
| Batch correction | x | x | √ |
| Immunological analysis | x | x | √ |
| Virus analysis | x | x | √ |
| Feature prioritization | √ | x | x |
| Subtyping/clustering | √ | x | √ |
| Multi-modal integration | √ | x | x |
| | | | |
| Management systems | Nextflow | Python | Snakemake |
| Resume | √ | x | √ |
| Parallel | √ | x | √ |
| Docker | √ | √ | x |
| Conda | √ | √ | √ |
| Singularity | √ | x | √ |
Figure 2. Cancer-associated features and pathways identified by PipeOne in five cancer types. (A) Cancer-associated features grouped by feature types across five cancers. (B) Cancer-associated features shared by at least two cancer types. (C–E) Cancer-associated multi-exon skipping AS event of ABI3BP identified in three cancers, COAD (C), KIRP (D), and THCA (E). (F) Top three enriched pathways (only two for STAD) of cancer-associated features in each cancer. (G) Shared enriched disease annotations for cancer-associated genes in each cancer. Two annotations shared by four cancer types were marked by red arrows.
Figure 3. Disease subtypes identified by PipeOne in each cancer. (A–D) Disease subtypes and survival difference in KIRP (A), LIHC (B), STAD (C), and THCA (D). (E–G) Different frequencies of somatic mutations between cancer subtypes in THCA (E), KIRP (F), and LIHC (G). (H) Enriched gene ontology terms of subtype-associated features shared by at least three cancer types. (I) Subtype-associated features (many mitochondria genes) shared by at least four cancer types.
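The survival differences between subtypes are assessed with a log-rank test. As a rough illustration of what that test computes for two candidate subtypes, the sketch below implements the standard two-group log-rank statistic in plain Python. It is not taken from PipeOne: the function name `logrank_test`, the argument layout, and the toy follow-up times are assumptions, and a real analysis would normally rely on an established survival library.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event was observed, 0 if censored.
    Returns (chi_square, p_value) for a chi-square distribution with 1 degree of freedom.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        n = n_a + n_b
        # Events at time t: in group A and overall.
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Toy data (hypothetical): subtype A patients all die earlier than subtype B patients.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [5, 6, 7, 8], [1, 1, 1, 1])
```

With these toy values the two groups separate clearly, so the test returns a small p-value. Two groups with identical survival data give a statistic of zero.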