| Literature DB >> 35562965 |
Ville-Petteri Mäkinen1,2,3,4, Jacqueline Rehn5,6, James Breen6,7,8, David Yeung5,6,9,10, Deborah L White5,6,9,11,12.
Abstract
RNA sequencing provides a snapshot of the functional consequences of genomic lesions that drive acute lymphoblastic leukemia (ALL). The aims of this study were to elucidate diagnostic associations (via machine learning) between mRNA-seq profiles, independently verify ALL lesions and develop easy-to-interpret transcriptome-wide biomarkers for ALL subtyping in the clinical setting. A training dataset of 1279 ALL patients from six North American cohorts was used for developing machine learning models. Results were validated in 767 patients from Australia with a quality control dataset across 31 tissues from 1160 non-ALL donors. A novel batch correction method was introduced and applied to adjust for cohort differences. Out of 18,503 genes with usable expression, 11,830 (64%) were confounded by cohort effects and excluded. Six ALL subtypes (ETV6::RUNX1, KMT2A, DUX4, PAX5 P80R, TCF3::PBX1, ZNF384) that covered 32% of patients were robustly detected by mRNA-seq (positive predictive value ≥ 87%). Five other frequent subtypes (CRLF2, hypodiploid, hyperdiploid, PAX5 alterations and Ph-positive) were distinguishable in 40% of patients at lower accuracy (52% ≤ positive predictive value ≤ 73%). Based on these findings, we introduce the Allspice R package to predict ALL subtypes and driver genes from unadjusted mRNA-seq read counts as encountered in real-world settings. Two examples of Allspice applied to previously unseen ALL patient samples with atypical lesions are included.Entities:
Keywords: RNA-seq; acute lymphoblastic leukemia; confounder adjustment; machine learning
Mesh:
Substances:
Year: 2022 PMID: 35562965 PMCID: PMC9099612 DOI: 10.3390/ijms23094574
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Patient characteristics and frequencies (%) of ALL subtypes according to known genetic alterations. The participants were grouped primarily by the recruiting institute and secondarily by age (> 99% of participants in the pediatric cohorts were below 20 years of age). Mean and standard deviation are shown for age. p-values for cohort differences were calculated by the χ2-test.
| Pediatric Cohorts | Adult and General Cohorts | ||||||
|---|---|---|---|---|---|---|---|
| COG | St Jude | Australia | Multiple * | Australia | |||
| Male | 177 | 221 | 83 | 4.4 × 10−6 | 251 | 304 | 0.021 |
| Female | 120 | 188 | 73 | 0.016 | 233 | 205 | 0.013 |
| Unknown | 27 | 52 | 89 | 1.8 × 10−21 | 10 | 13 | 0.77 |
| Age (years) | 9.2 ± 5.7 | 6.4 ± 4.4 | 7.4 ± 4.0 | 7.6 × 10−11 | 45.1 ± 14.8 | 35.7 ± 22.6 | 3.6 × 10−13 |
| BCL2/MYC | 0.0 | 0.4 | 0.4 | 0.50 | 2.8 | 0.8 | 0.024 |
| CDX2 hi-exp | 0.3 | 0.0 | 0.0 | 0.34 | 0.6 | 0.6 | 1.0 |
| CRLF2 | 7.4 | 3.0 | 18.4 | 8.4 × 10−12 | 13.2 | 8.4 | 0.020 |
| DUX4 | 7.4 | 7.4 | 2.0 | 0.0096 | 4.9 | 6.9 | 0.21 |
| ETV6::RUNX1 | 9.9 | 26.7 | 17.6 | 2.3 × 10−8 | 1.0 | 2.3 | 0.18 |
| HLF | 0.0 | 0.4 | 0.4 | 0.50 | 0.6 | 0.6 | 1.0 |
| Hyperdiploid | 21.0 | 21.3 | 9.8 | 0.00034 | 4.0 | 5.6 | 0.27 |
| Hypodiploid | 1.2 | 0.7 | 0.8 | 0.68 | 11.5 | 3.6 | 3.1 × 10−6 |
| iAMP21 | 6.5 | 0.4 | 0.8 | 0.72 × 10−8 | 0.2 | 0.4 | 1.0 |
| IKZF1 N159Y | 0.3 | 0.0 | 1.2 | 0.043 | 0.4 | 0.6 | 1.0 |
| KMT2A | 2.2 | 10.8 | 2.9 | 2.0 × 10−7 | 13.8 | 4.4 | 3.2 × 10−7 |
| MEF2D | 2.2 | 1.1 | 0.0 | 0.058 | 2.0 | 1.1 | 0.39 |
| NUTM1 | 0.3 | 1.1 | 0.4 | 0.36 | 0.0 | 0.2 | 1.0 |
| PAX5 Alt | 4.0 | 5.9 | 3.7 | 0.32 | 8.3 | 1.9 | 6.4 × 10−6 |
| PAX5 P80R | 0.9 | 1.1 | 0.8 | 0.94 | 3.4 | 2.9 | 0.74 |
| BCR::ABL1 | 5.6 | 2.4 | 4.1 | 0.070 | 12.3 | 21.5 | 1.6 × 10−4 |
| TCF3::PBX1 | 2.2 | 6.7 | 1.2 | 0.00023 | 3.0 | 2.3 | 0.59 |
| ZNF384 | 1.9 | 2.0 | 2.0 | 0.99 | 2.8 | 4.4 | 0.24 |
| Undefined | 26.9 | 8.7 | 33.5 | 1.7 × 10−16 | 15.0 | 31.6 | 7.0 × 10−9 |
* ECOG-ACRIN, Toronto, MDACC and CALGB.
Figure 1Study design. (A,B) RNA-seq pre-processing was applied separately for North American and Australian datasets. (C) Non-biological differences due to technical artifacts and cohort effects were adjusted according to genetically matched subsets (details in Methods). (D) Correlation coefficients were calculated between unadjusted and adjusted expression levels to exclude genes that were heavily influenced by confounders. A total of 6673 stable genes with R ≥ 0.9 were included for subtype modeling. (E) Internal training and testing sets were randomly chosen from the North American participants as initial material to evaluate machine learning models and to determine hyperparameters. The randomization was conducted separately for each subtype to ensure matching subtype frequencies, which explains the small difference in set sizes. (F) Final machine learning models were trained with the full North American dataset and validated in the Australian dataset.
Figure 2Impact of confounder adjustments. Pearson correlation coefficients were calculated between 18,503 log-transformed genes, and a technical or clinical variable before and after gene expression levels were adjusted as described in Methods. A wide histogram indicates substantial covariation across the transcriptome, while a narrow histogram indicates successful removal of covariance. (A,D) The North American cohorts included 204 (16%) RNA samples that were sequenced with an unstranded library. (B,E) Strandedness and other technical differences manifested as substantial covariation between the transcriptome and the continent of origin. (C,F) Covariation between gene expression and ALL subtypes, such as ETV6::RUNX1, was preserved.
Figure 3Transcriptional landscape of the most frequent ALL subtypes. Clustering structure in the North American data was modeled by the Uniform Manifold Approximation and Projection (UMAP) algorithm and using the same genes that were prioritized and pre-processed for the centroid classifier. (A) Standardized but unadjusted gene expression values were used for re-projecting the North American samples onto the UMAP layout. We use the unadjusted expression profiles here since, in practical settings where patients arrive one-by-one, adjustments for batch effects that would be available in research settings cannot be made. (B) Unadjusted Australian data were projected onto the same UMAP layout as an independent external validation set.
Figure 4Comparison of three machine learning algorithms. Each model was fit to the batch-corrected North American data (training set n = 1078) and then evaluated in unadjusted Australian data (external validation set n = 520). Samples with undefined genetic subtypes were excluded from the analyses. (A–C) The forest plots show 95% confidence intervals that reflect the statistical uncertainty due to finite category sizes. The percentages written in the plots indicate the prevalence of the genetic subtype in the Australian dataset. (D) Description of subtype supergroups.
Figure 5Example of a report card from the Allspice classifier for an adult male patient from North America with Philadelphia (BCR::ABL1) genetic B-cell ALL subtype. (A) Sample identifier and predicted subtype based on RNA data in top-left corner. (B) The report shows the frequency of the corresponding genetic subtype in the training data, given the observed gene expression profile. In this case, there was a 79% chance that sequence and cytogenetics analyses would confirm the presence of the Philadelphia chromosome. (C) Visualization of how similar the sample is to each ALL subtype profile. The display can be interpreted as a panel of RNA “biomarkers” that are specific to each subtype. In this case, the high value for Ph indicates that the gene expression profile is compatible with a typical patient with the Philadelphia chromosome. (D) Allspice also indicates how similar the sample is to the RNA profiles associated with the presence of genetic alterations in one or more genes in parallel. In this case, the gene expression profile matches the typical profile of patients with independently verified BCR::ABL1 fusion (i.e., both BCR and ABL1 are altered). (E) Samples with high leukemia burden will typically produce a strong B-cell ALL signal in the tissue panel. (F) Minimal example of how to generate the report in the R programming environment.
Figure 6Classification performance of the Allspice centroid classifier. (A) The bars show positive predictive values for all samples including those with undefined genetic subtypes. They represent conservative estimates on how likely it is that the subtype predicted by RNA-seq expression levels can be confirmed as a specific sequence alteration or is also indicated by cytogenetics. (B) Proportions of genetic subtypes in the dataset.
Positive predictive values for correct classification into genetically defined B-cell ALL subtypes. The values are presented as the percentages of samples for which the best matching RNA centroid was the same as the genetic subtype if defined. Quality control was set at ≥ 50% proximity and ≥ 50% exclusivity (details in Methods).
| All Samples | Defined Subtypes | Defined Subtypes and QC Pass | |
|---|---|---|---|
| Number of samples | 2046 | 1598 | 1292 |
| DUX4 (%) | 96.8 | 100.0 | 100.0 |
| ETV6::RUNX1 (%) | 86.6 | 97.7 | 98.6 |
| KMT2A (%) | 91.1 | 96.6 | 100.0 |
| PAX5 P80R (%) | 95.2 | 97.6 | 100.0 |
| TCF3::PBX1 (%) | 98.5 | 98.5 | 100.0 |
| ZNF384 (%) | 89.1 | 96.6 | 100.0 |
| CRLF2 (%) | 62.0 | 91.2 | 97.6 |
| Hyperdiploid (%) | 69.8 | 92.1 | 99.0 |
| Hypodiploid (%) | 64.2 | 79.0 | 89.2 |
| PAX5 Alt (%) | 52.1 | 74.8 | 92.7 |
| BCR::ABL1 (%) | 67.2 | 84.8 | 96.6 |
| BCL2/MYC (%) | 17.5 | 54.1 | 66.7 |
| CDX2 hi-exp (%) | 38.9 | 77.8 | 100.0 |
| HLF (%) | 64.3 | 75.0 | 90.0 |
| iAMP21 (%) | 27.9 | 66.7 | 100.0 |
| IKZF1 N159Y (%) | 64.3 | 90.0 | 100.0 |
| MEF2D (%) | 72.2 | 78.8 | 100.0 |
| NUTM1 (%) | 40.0 | 66.7 | 100.0 |