Qingwang Chen, Yaqing Liu, Yuechen Gao, Ruolan Zhang, Wanwan Hou, Zehui Cao, Yi-Zhou Jiang, Yuanting Zheng, Leming Shi, Ding Ma, Jingcheng Yang, Zhi-Ming Shao, Ying Yu.
Abstract
Molecular subtyping of triple-negative breast cancer (TNBC) is essential for understanding the mechanisms and discovering actionable targets of this highly heterogeneous type of breast cancer. We previously performed a large single-center multiomics study comprising genomic, transcriptomic, and clinical data from 465 patients with primary TNBC. To facilitate reuse of this unique dataset, we provide here a detailed description of the dataset, with special attention to data quality. The multiomics data were generally of high quality, but a small portion of the sequencing data had quality issues that should be noted in subsequent data reuse. Furthermore, we reconducted the data analyses with updated pipelines and an updated version of the human reference genome (hg38 instead of hg19). The updated profiles were in good concordance with those previously published in terms of gene quantification, variant calling, and copy number alteration. Additionally, we developed a user-friendly web-based database for convenient access and interactive exploration of the dataset. Our work will facilitate reuse of the dataset, maximize the value of the data, and further accelerate cancer research.
Year: 2022 PMID: 36153392 PMCID: PMC9509351 DOI: 10.1038/s41597-022-01681-z
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Schematic overview of the multiomics TNBC dataset. (a) Data generation and data processing. Detailed workflows of the data processing pipelines, including (b) the RNAseq pipeline, (c) the WES pipeline, and (d) the CNA pipeline. The tools/algorithms, QC reports, and files are marked with grey, blue, and white backgrounds, respectively. FP: False Positive.
Composition of the multiomics dataset of the TNBC cohort.
| Omics | Patients with matched samples (tumor tissues with normal tissues or white blood cells) (n) | Patients with tumor samples only (n) | Total number of patients (n) |
|---|---|---|---|
| Transcriptome (RNAseq) | 88 | 272 | 360 |
| Genome (WES) | 279 | | 279 |
| Genome (CNA) | 23 | 378 | 401 |
Fig. 2 Quality control metrics of RNA-seq data (N = 448). (a) Line plot presenting an overview of quality scores across all bases at each position in the FASTQ files of all samples. (b) Box plot representing mapping ratio (%) across all samples of multiple sources. (c) Bar plot representing the percentage of reads that map to gene sequence categories across all samples. (d) Frequency distribution plot representing mapped reads. Box plots representing (e) median insert size of all samples per batch and (f) GC content across all samples per batch (orange represents the normal group, and purple represents the tumor group). (g) Heatmap and hierarchical clustering representing the expression level of seven sex-specific genes across all samples, including five male-specific genes (RPS4Y1, DDX3Y, EIF1AY, KDM5D, TXLNGY) and two female-specific genes (XIST and TSIX). (h) Principal Component Analysis (PCA) of all tumor and normal samples.
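The sex check in panel (g) rests on a simple idea: samples from female patients should show expression of XIST/TSIX and little or no expression of the Y-linked genes. The study itself uses a heatmap with hierarchical clustering; the snippet below is only a minimal threshold-based sketch of the same idea. The function name `infer_sex_signal`, the input layout (a dict of gene to log-scale expression), and the threshold of 1.0 are all illustrative assumptions, not values from the paper.

```python
# Gene lists as given in the Fig. 2g caption.
MALE_GENES = ["RPS4Y1", "DDX3Y", "EIF1AY", "KDM5D", "TXLNGY"]
FEMALE_GENES = ["XIST", "TSIX"]


def infer_sex_signal(expr, threshold=1.0):
    """Classify one sample from its sex-specific gene expression.

    expr: dict mapping gene symbol -> expression value (assumed to be on a
          log scale such as log2(FPKM + 1); this is an assumption).
    threshold: hypothetical cutoff on the mean expression of a gene group.
    """
    # Missing genes are treated as not expressed.
    male = sum(expr.get(g, 0.0) for g in MALE_GENES) / len(MALE_GENES)
    female = sum(expr.get(g, 0.0) for g in FEMALE_GENES) / len(FEMALE_GENES)

    if male > threshold and male > female:
        return "male-like"
    if female > threshold and female >= male:
        return "female-like"
    return "ambiguous"


# Made-up example values, for illustration only.
sample_a = {"XIST": 8.0, "TSIX": 3.0}
sample_b = {"RPS4Y1": 6.0, "DDX3Y": 6.0, "EIF1AY": 6.0, "KDM5D": 6.0,
            "TXLNGY": 6.0, "XIST": 0.2}
print(infer_sex_signal(sample_a))
print(infer_sex_signal(sample_b))
```

A sample whose signal disagrees with the recorded sex of the patient would be a candidate for a sample mix-up and worth inspecting before reuse.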
Fig. 3 Quality assessment results of genomics data. (a) Line plot from FastQC showing that the sequences of all samples have universally high quality in WES. (b) Box plots representing mapping ratio (%) across all samples to the human genome (left) and to multiple sources (right). (c) Frequency distribution plots representing the median coverage of all samples. (d) Violin plot representing duplication across all samples. Box plots representing (e) median insert size and (f) GC content of all samples per batch (yellow represents samples from white blood cells, and purple represents samples from tumor tissues). Frequency distribution plots representing quality assessment results of OncoScan CNV data, including (g) MADP score and (h) ndSNPQC score.
Fig. 4 Comparisons of results from the same dataset processed with different pipelines (old vs. new). (a) Scatter plot showing the consistency of the log2FC of differentially expressed genes between tumor and normal samples across pipelines, based on RNAseq read counts. (b) Expression-based (FPKM) unsupervised clustering shows good concordance in TNBC molecular subtyping. BLIS: basal-like and immune-suppressed subtype, IM: immunomodulatory subtype, MES: mesenchymal-like subtype, and LAR: luminal androgen receptor subtype. (c) PCA of 360 TNBC patients including 448 samples (360 tumor samples and 88 normal samples) using the top 2000 genes by SD that were used in the old analysis. Tumor purity (%) of 40 tumor samples from MES patients (N = 53) is labeled on the graph. Box plots representing (d) tumor purity of 315 tumor samples across the four molecular subtypes (ns: not significant) and (e) the Jaccard index of detected mutations from five mutation datasets generated through different processes. (f) Scatter plot showing the high consistency in allele frequency of all mutated genes between pipelines (the x-axis and y-axis are log10-scaled, and known cancer-related genes are labeled; many genes overlap in the panel because their mutation frequencies are consistently low). (g) Venn diagrams showing good consistency at the peak level and (h) box plot representing the Jaccard index of detected genes across different CNA types.
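The Jaccard index reported in panels (e) and (h) measures the overlap between two sets of calls as the size of their intersection divided by the size of their union. The following is a minimal sketch of that calculation, not the authors' implementation. The function name `jaccard_index`, the variant keys, and the example calls are hypothetical, and the sketch assumes both call sets are already expressed in a comparable form (for example gene symbols, or variant coordinates lifted to the same reference genome).

```python
def jaccard_index(calls_a, calls_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty call sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical variant keys of the form (chromosome, position, ref, alt).
old_calls = {
    ("chr17", 7674220, "C", "T"),
    ("chr3", 179234297, "A", "G"),
    ("chr1", 1000, "G", "A"),
}
new_calls = {
    ("chr17", 7674220, "C", "T"),
    ("chr3", 179234297, "A", "G"),
    ("chr2", 2000, "T", "C"),
}

# Two shared calls out of four distinct calls in total.
print(jaccard_index(old_calls, new_calls))  # 0.5
```

A value of 1 means the two pipelines report exactly the same calls, and a value of 0 means they share none; computing it per sample yields the distributions that a box plot can summarize.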
| Measurement(s) | RNA expression profiling • whole-exome sequencing (WES) • somatic mutations • copy number alterations (CNAs) |
| Technology Type(s) | RNA sequencing • DNA sequencing • OncoScan CNV assay |
| Factor Type(s) | Intervention or procedure |
| Sample Characteristic - Organism | Homo sapiens |
| Sample Characteristic - Location | China |