Literature DB >> 33937015

Comprehensive and Computable Molecular Diagnostic Panel (C2Dx) From Small Volume Specimens for Precision Oncology: Molecular Subtyping of Non-Small Cell Lung Cancer From Fine Needle Aspirates.

Jing Su^1,2,3, Lynn S Huang³, Ryan Barnard³, Graham Parks⁴, James Cappellari⁴, Christina Bellinger⁵, Travis Dotson⁵, Lou Craddock¹, Bharat Prakash⁵, Jonathan Hovda⁶, Hollins Clark⁷, William Jeffrey Petty⁸, Boris Pasche¹, Michael D Chan⁶, Lance D Miller¹, Jimmy Ruiz^8,9.

Abstract

The Comprehensive, Computable NanoString Diagnostic gene panel (C2Dx) is a promising solution to address the need for a molecular pathological research and diagnostic tool for precision oncology utilizing small volume tumor specimens. We translate subtyping-related gene expression patterns of Non-Small Cell Lung Cancer (NSCLC) derived from public transcriptomic data which establish a highly robust and accurate subtyping system. The C2Dx demonstrates supreme performance on the NanoString platform using microgram-level FNA samples and has excellent portability to frozen tissues and RNA-Seq transcriptomic data. This workflow shows great potential for research and the clinical practice of cancer molecular diagnosis.

Entities: Chemical Disease Gene Mutation Species

Keywords: cancer subtyping; elastic net regularization; fine needle aspiration; logistic regression; molecular signature; non-small cell lung cancer

Year: 2021 PMID： 33937015 PMCID： PMC8085404 DOI： 10.3389/fonc.2021.584896

Source DB: PubMed Journal: Front Oncol ISSN： 2234-943X Impact factor: 6.244

Background

In this era of precision oncology, there has been a rapid adoption of targeted therapy for detected driver mutations in non-small-cell lung cancer (NSCLC). As a result, there is a growing tension between the increasing tissue demands required for comprehensive molecular diagnosis and the traditional, immunohistochemical and morphology-based lung cancer subtyping approaches. Currently, the treatment of NSCLC is dependent on accurate histological and molecular subtyping as well as other clinical and pathological features (1, 2). Clinical guidelines such as the NCCN (the National Comprehensive Cancer Network) guidelines (Version 5.2018) (3) and the 2015 World Health Organization classification (4) recommend that histologic and molecular features be used in determining treatment options. Pathologists use hematoxylin-eosin (H&E)-staining on samples to identify NSCLC subtypes like adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC). Immunohistochemical stains are used to diagnose subtype, but in the event further stains are unable to do so, a diagnosis of “NSCLC, not otherwise specified (NOS)” is made. Positive staining for thyroid transcription factor 1 (TTF1, coded by gene NKX2-1) and napsin A (novel aspartic proteinase of the pepsin family) help to classify LUAD. LUSC stains for markers tumor protein p63, and cytokeratins (CK) 5, 6, and 7. After lung cancer subtyping is complete, non LUSC are evaluated for targetable driver genomic alterations such as epidermal growth factor receptor (EGFR) mutations, anaplastic lymphoma kinase (ALK) rearrangements, c-Ros oncogene 1 (ROS1) rearrangements, and BRAF V600E mutation. Checkpoint markers such as PD-1/PD-L1 (programmed cell death-1 and programmed death-ligand 1) (1) are further examined in all subtypes of NSCLC. This process determines the treatment options for patients. In advanced and metastatic NSCLC cases, small volume biopsies are generally obtained through the least invasive diagnostic option. The morphology-based subtyping can utilize a significant amount of limited tissue and can lead to insufficient biopsy material for molecular testing. Despite the fast-growing quantity demands for diagnostic tissues, minimally invasive small volume biopsies are commonly utilized for diagnosis. The fine-needle aspirate (FNA) is now the most routinely utilized biopsy approach for NSCLC (5–7). While sufficient for confirming the existence of cancers through transthoracic needle biopsy (TTNB), the amount of cellular material from FNAs can be sparse. This is often cited (8) as the limiting factor to the correct classification of NSCLC histology, a problem further compounded by the poor cellular differentiation characteristic of advanced-stage disease. Over the last decade, we have been introduced to precision oncology initiatives (9) and novel immunotherapy treatment options that require additional biopsy tissue for molecular testing and next generation sequencing (10, 11). New and emerging therapies on or off clinical study require additional tissues for examination and continues to contribute to the growing demand on limited tissues obtained at diagnosis. In real-world clinical settings, small tumor sample volume limits the implementation of many next-generation sequencing (NGS) technologies (12). For example, approximately 14% of clinical FNA samples qualified for the 324-gene FoundationOne CDx targeted genomics sequencing (13–15) after the samples were used for diagnosis and molecular profiling. NGS devices that are specially designed for small sample volumes, such as the 26-gene Oncomine Dx Target Test, can only test a small number of genes. Liquid biopsy technologies, though available in NSCLC patients (16, 17), are only utilized if tissue samples are exhausted and additional specimens cannot be obtained. Therefore, greater effort and novel strategies are required to minimize consumption of diagnostic tissues in order to maximize diagnostic output. We have been pioneering the development of a novel device, the Comprehensive, Computable NanoString Diagnostic gene panel (C2Dx), as a promising solution to address the burning clinical need in molecular pathological diagnosis with small volume tumor specimens. The NanoString nCounter® (18) is a digital multiplexed molecular diagnostic platform supporting the molecular profiling of up to 800 genes in a single panel. It uses extremely small amounts of tissues and has been used successfully in precision oncology applications (19, 20), including the detection of fusion genes (21) in lung cancer cytology and molecular subtyping of diffuse large B-cell lymphoma (22). We established clinical and laboratory protocols for reliable lung cancer RNA acquisition from a single transbronchial or transthoracic needle pass using fine needle aspiration (23). Key differentially expressed genes in LUAD and LUSC cases demonstrated high consistency across the NanoString platform using FNA samples and the RNA-Seq platform using the bulk tumor samples. We have also demonstrated the successful translation of knowledge learned from bulk-based transcriptomics data to a NanoString gene panel in exploring novel prognostic molecular subtypes in appendiceal mucinous neoplasms (24). A critical step from what we have achieved toward a clinical C2Dx device is to transfer knowledge discovered from existing omics data to a highly robust computable gene panel. In this work, we used the NSCLC subtyping task as a working example to demonstrate the feasibility of such quantitative translational modeling from bulk transcriptomic data to FNA-based NanoString gene panels. In this work, we developed a highly robust, high performing, low tissue demanding, molecular subtyping system to compliment histology determination ( ). Using a 67-gene NanoString transcriptomic gene panel learned from four existing NSCLC omics datasets and generated from nanogram-level RNA samples, we implemented an intensive resampling strategy for robust elastic net regularization of logistic regression. We established a NanoString-based 19-gene panel (15 signature genes and 4 housekeeping genes) and demonstrated a competitive NSCLC subtyping classification with sound accuracy when compared to interpretations from experienced pathologists. We were able to utilize a single FNA pass to determine adenocarcinoma from squamous histology. This strategy potentially allows for minimal diagnostic tissue utilization for subtyping which can lead to a decrease rate of tissue depletion and need for additional invasive biopsies. The 19-gene subtyping model demonstrates high accuracy across different sample types (FNA specimens and frozen tumor tissues) and transcriptomic platforms (NanoString gene panels and RNA-seq). This work proved the concept of transferring knowledge from omics data generated at bench-side to a robust computable device implementable at bedside through robust translational modeling and FNA-based NanoString C2Dx system.

Figure 1

Overall workflow of the C2Dx development. The development of the C2Dx NSCLC subtyping device was composed of the following four steps: 1. Knowledge translation. Four microarray-based transcriptomics datasets of NSCLC cases were extracted from Gene Expression Omnibus (GEO) and used to identify 67 subtyping-related genes, including 23 LUAD genes, 40 LUSC genes, and 4 housekeeping genes. A corresponding nanoString gene panel was established. 2. Data generation. Targeted transcriptomics data were generated using the nanoString 67-gene panel for 83 FNA biopsy samples collected from WFBH NSCLC patients, including 47 LUAD, 25 LUSC, and 11 NOS cases. 3. Model development. Intensive resampling strategy was used in elastic-net regularized logistic regression to develop the final 19-gene molecular subtyping model. 4. Model validation. The developed subtyping model was validated using the transcriptomics data of the 19 genes collected from frozen tissue from WFBH’s tumor bank using nanoString platform and from the TCGA’s RNA-Seq data generated from bulk tissues.

Methods

Overview of the Pipeline

The aim of this study is to establish a robust, comprehensive, computable molecular subtyping device using FNA-based nanoString gene panel and knowledge of subtype-related gene expression from public data to distinguish LUAD and LUSC in NSCLC cases. As demonstrated in , as previously reported (23), knowledge about LUAD/LUSC subtyping related genes was learned from a meta-cohort (n = 490) microarray transcriptomics data extracted from public source [Gene Expression Omnibus: GSE10445 (25), GSE4573 (26), and GSE3141 (27); the Director’s Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma (28)] and translated to a nanoString 67-gene panel for clinical use. This panel was used to generate targeted transcriptomics data on tissues collected by FNA-based biopsy from 83 WFBH’s NSCLC patients. A robust 19-gene molecular subtyping model were trained using the generated data, intensive resampling, and elastic-net regularized logistic regression. The molecular phenotyping model was validated using nanoString-generated data of 44 frozen samples collected from WFBH’s tumor tissue bank and RNA-Seq data of 1,016 samples from TCGA’s LUAD and LUSC cohorts.

Classifier Development

Four publicly available lung cancer cohorts were used for 67-gene panel construction, including the DUKE (n = 111, GEO: GSE3141), PARIS (n = 74, GEO: GSE10445), UM (n = 13, GEO: GSE4573), and the DC (n = 442, NCI caArray: Jacob-00182).

Patient Selection and Procedures

The study was approved by the Institutional Review Board of the Wake Forest Baptist Medical Center from 2013 and 2015. Patients completed informed consent forms prior to enrollment. Eligibility requirements were for radiographic evidence for lung cancer or have a previously diagnosed NSCLC with potential recurrence requiring re-biopsy. Patients enrolled on study agreed to provide one additional small volume aspirate pass for RNA collection after completion of standard of care diagnostic tissue collection for presumed NSCLC. Patients underwent FNA of primary or secondary metastatic lesions. Exclusion criteria included patients whose FNA biopsy was unable to provide subtype classification by pathology, consistent with small cell carcinoma, non-malignant etiologies, or returned non-diagnostic. Small volume FNA diagnostic modalities were obtained by: endobronchial ultrasound guided transbronchial needle aspiration (EBUS-TBNA), conventional trans-bronchial needle aspiration (cTBNA) and trans-thoracic needle biopsy (TTNB). Bronchoscopy cases were performed by pulmonogists using rapid on-site cytology evaluation (ROSE) to help decipher results. TTNB were performed by interventional radiologists with training and experience with CT-guided thoracic procedures.

Model Development

For developing the molecular subtyping model, we established a WFBH-FNA cohort by collected 83 FNA samples (1.645 ± 2.298 μg RNA) from tumor or lymph node tissues of the Wake Forest Baptist Hospital (WFBH) non-small cell lung cancer patients, including 47 adenocarcinoma (LUAD), 25 squamous cell carcinoma (LUSC), and 11 non-small cell lung cancer but not otherwise specified (NOS) cases. Two cohorts were used for model validation: 1) WFBH-TB. We collected 44 frozen tissue samples (LUAD: 21; LUSC: 23) from WFBH Tissue Bank for internal validation; 2) TCGA. The TCGA RNA-Seq dataset (sample size: 1,145; version: Data Release 8.0; data source: Genomic Data Commons (GDC); data extraction date: September 7, 2017) of 594 LUAD and 551 LUSC samples was used for external validation. Histology classification of the 127 WFBH samples was determined through diagnostic pathological review of Haemotoxylin and Eosin (H&E) stained microscopic slides of tumor tissues by two mutually blinded pathologists and discordant cases were labeled as NOS.

NanoString-Based Transcriptomic Data Generation

A 67-gene panel has been constructed and the nanoString-based transcriptomic data has been generated for the study cohort as previously reported (23). Among these genes, 23 were LUAD-associated, 40 were LUSC associated, and 4 were housekeeping genes for normalization purpose. The full gene list can be found in .

Individual Normalization of Gene Expression

For NanoString-based FNA data, the raw Reporter Code Count data was used. For TCGA samples, the expressions of the 67 genes in the form of FPKM-UQ (fragments per kilobase per million reads, upper quartile normalized) (29) were used. For each sample, original expression of all genes were log2 transformed, normalized against the mean of the 4 housekeeping genes, and z-transformed. Cross-sample normalization was avoided to simplify clinical implementation. Details are provided in Supplement: Data Processing.

Logistic Classification Model

Assuming that the subtypes LUAD and LUSC followed a binomial distribution, the odd of the probability of a given sample i was from an LUAD patient (p) vs. from an LUSC patient (1 - p) was modeled using a logistic regression model given by: where ≡ [g, g, …, g] was the expression levels of the m genes of sample i, and β and ≡ [β, β, …, β] were the regression coefficients.

Feature Selection and Model Training

Elastic net regularization (30) and cross-validation with intensive resampling was used for feature selection. We first determined optimal model complexity, that is, the optimal number of genes in the model. Let y indicate the true subtype of sample i, y = 1 for LUAD and y = 0 for LUSC, the objective function for the elastic net penalized logistic regression model was: where n was the sample size, λ > 0 was the tuning factor, 0 ≤ α ≤ 1 was the elastic net mixing factor that controlled the mixing percentage between the ridge regularization (α = 0) and the LASSO regularization (α = 1), ||•||1 and ||•||2 represented the L1 and L2 norm, respectively. To reach a robust solution, 3-fold cross validation was used for 1,000 resampling iterations with α screened from 0 to 1 at step size of 0.1. Evaluating the model performance by accuracy, the 15-signature-gene model was selected with estimated accuracy of 0.902 ± 0.053. All FNA samples was then used to train the final 15-signature-gene classification model using logistic regression with elastic net regularization. provided more details on elastic net feature selection. R (version: 3.3.1) packages glmnet (version 2.0-10) (31) and caret (version 6.0-77) (32) was used for feature selection and model training.

Model Validation

The classification model, with authentic coefficients, was directly implemented on the internal (44 NanoString samples from frozen tumor tissues) and external (1,016 RNA-Seq data from TCGA datasets) to classify TCGA sample subtypes for evaluating model performance.

Profiling NOS Cases

The 11 FNA samples diagnosed as NOS were analyzed with the 19-gene molecular subtyping model and compared with the LUAD and LUSC cases. The logistic probability patterns of NOS, LUAD, and LUSC cases were analyzed using enrichment analysis and Kolmogoro-Smirnov test.

Statistics Methods

The convergence of model stability during elastic net regularization with respect to resampling rounds was measured by the estimated values as well as the corresponding 95% confidence intervals of the model complexity, accuracy, and elastic net mixing factor α. The receiver operating characteristic (ROC) curves were generated for model performance on the FNA, internal tissue bank, and external TCGA cohorts and the corresponding concordance statistic (c-statistic, a.k.a. the area under the ROC curve, AUC) (33–35) were calculated for model performance comparison.

Results and Discussion

Patient Characteristics

Patients’ demographics and clinical characteristics were obtained for the Exploring, Training, and Validation cohorts summarized in and . Most characteristics were consistent across all cohorts. Such consistency between the research cohorts (Exploring and TCGA) and the clinical cohorts (WFBH-FNA and WFBH-TB) were important for knowledge translation from scientific studies to clinical applications.

Table 1

Patients’ characteristics.

Characteristics	Exploring	Training	Validation
Characteristics	Exploring	WFBH-FNA	WFBH-TB	TCGA
Overall Cohort Size	490	83	42	1,016
Age, mean (sd)	65.1 (10.2)	65.9 (9.5)	65.7 (7.7)	66.7 (9.4)
Gender, n (%)
Female	218 (44.5%)	50 (60.2%)	18 (40.9%)	406 (40.0%)
Male	272 (55.5%)	33 (39.8)	26 (59.1%)	610 (60.0%)
Race, n (%)	—
Caucasian		72 (86.8%)	38 (86.4%)	738 (72.6%)
African American		9 (10.8%)	5 (11.4%)	82 (8.1%)
Others		2 (2.4%)	1 (2.2%)	196 (19.3%)
Adenocarcinoma	384 (78.4%)	47 (56.6%)	20 (47.7%)	515 (53.7%)
Age, mean (sd)	64.1 (10.3)	65.2 (9.5)	65.3 (9.0)	65.7 (10.0)
Gender, n (%)
Female	180 (46.9%)	22 (46.8%)	10 (47.6%)	276 (53.6%)
Male	204 (53.1%)	25 (53.2%)	11 (52.4%)	239 (46.4%)
Race, n (%)	—
Caucasian		40 (85.1%)	19 (90.5%)	389 (75.5%)
African American		6 (12.8%)	2 (9.5%)	52 (10.1%)
Others		1 (2.1%)	0	74 (14.4%)
Squamous Cell Carcinoma	106 (21.6%)	25 (30.1%)	22 (52.3%)	501 (46.3%)
Age, mean (sd)	68.6 (9.1)	66.2 (8.6)	66.0 (6.5)	67.7 (8.6)
Gender, n (%)
Female	38 (35.8%)	6 (24%)	8 (34.8%)	130 (25.9%)
Male	68 (64.2%)	19 (76%)	15 (65.2%)	371 (74.1%)
Race, n (%)	—
Caucasian		21 (84%)	19 (82.6%)	349 (69.7%%)
African American		3 (12%)	3 (13.0%)	30 (6.0%)
Others		1 (4%)	1 (4.4%)	122 (24.4%)
NOS	—	11 (13.3%)	—	—

The demographical and clinical characteristics of the 3 cohorts in this study were summarized. WFBH-FNA: the training cohort with tumor tissues collected from FNA biopsy of WFBH patients. WFBH-TB and TCGA: the validation cohorts with frozen tumor tissues collected from surgeries of WFBH patients and TCGA patients, respectively.

Patients’ characteristics. The demographical and clinical characteristics of the 3 cohorts in this study were summarized. WFBH-FNA: the training cohort with tumor tissues collected from FNA biopsy of WFBH patients. WFBH-TB and TCGA: the validation cohorts with frozen tumor tissues collected from surgeries of WFBH patients and TCGA patients, respectively. The original NanoString 67-gene panel data as well as the individually normalized gene profile of the Exploring cohort and FNA cohort were visualized in and , respectively. The panel was composed of 63 candidate diagnostic genes and, for internal normalization purpose, 4 housekeeping genes. The candidate diagnostic genes demonstrated strong differential expression pattern in the LUAD vs. LUSC samples. Specifically, the 23 LUAD-specific genes were highly expressed in LUAD samples, and vice versa for the 40 LUSC-specific genes. To reach a robust model, a large-scale resampling for cross-validation was performed over the full range of elastic net mixing factor α. Through 1,695,000 training-and-testing rounds, the best performance of an accuracy of 0.902 ± 0.053 was achieved at α = 0.5 with a model complexity of 15 predictive genes and and ). The boxplots of key model features such as the optimal α, λ, accuracy, and the model complexity during the first 100 resampling were shown in . The convergence of model training during resampling was shown in .

Figure 2

The expression pattern of the 67 diagnostic genes in the Exploration Cohort. The Exploration Cohort was composed of 490 samples collected from LUAD (n = 384) and LUSC (n = 106) tumors. The 67-gene diagnostic panel was composed of 27 LUAD-specific and 40 LUSC-specific genes. The gene expression pattern was derived from the normalized microarray-based transcriptomics data and scaled for visualization.

The Molecular Subtyping Model

The final 15 signature genes as well as corresponding coefficients were listed in and . The performance of the model on FNA cohort was demonstrated in and listed in .

Table 2

Genes and coefficients of the molecular subtyping model.

Category	Gene	coefficient
LUSC-related	TP63	-0.58872
LUSC-related	KRT14	-0.39198
LUSC-related	ANXA8L2	-0.27091
LUSC-related	KRT5	-0.25936
LUSC-related	SERPINB13	-0.10086
LUSC-related	SNAI2	-0.048
LUSC-related	KRT6A	-0.0221
LUSC-related	PKP1	-0.00488
	(Intercept)	0.27297
LUAD-related	SPINK1	0.00302
LUAD-related	CD55	0.00562
LUAD-related	NKX2-1	0.02709
LUAD-related	MUC1	0.13468
LUAD-related	GPR116	0.13792
LUAD-related	PNMA2	0.15816
LUAD-related	TMC5	0.31305

Signature genes used in the molecular subtyping model were listed according to the absolute values of their logistic regression coefficients. The categories of the genes were color coded with respect to the associated subtypes (vermilion for LUAD and bluish green for LUSC). Genes are sorted according to significance (absolute values of the corresponding coefficients, with positive values favoring LUAD and negative values favoring LUSC subtypes).

Figure 3

Model performance. The predicted probabilities of subtype LUAD for (A) each FNA samples and (B) each TCGA samples was visualized in the bar plot. Samples classified as LUAD (p ≥ 0.5) or LUSC (p < 0.5). Signature genes used in the molecular subtyping model were listed according to the absolute values of their logistic regression coefficients. The categories of the genes were color coded with respect to the associated subtypes (vermilion for LUAD and bluish green for LUSC). (C) The receiver operating characteristic curves (ROCs) of the model performance on the WFBH FNA, the WFBH tissue bank, and the TCGA cohorts. The corresponding c-statistics (AUC, area under the ROC curve) were 0.986, 0.911, and 0.982, respectively. When the probability threshold is set at θ = 0.5, model accuracies are 0.931, 0.881, and 0.945 for the WFBH FNA, the WFBH tissue bank, and the TCGA cohorts, respectively (marked as open circles). The optimal accuracies are reached when optimal θ levels are used, which are: 0.958 at θ = 0.60 for the WFBH FNA cohort, 0.905 at θ = 0.59 for the WFBH tissue bank cohort, and 0.946 at θ = 0.68 for the TCGA cohort (solid gray circles).

Genes and coefficients of the molecular subtyping model. Signature genes used in the molecular subtyping model were listed according to the absolute values of their logistic regression coefficients. The categories of the genes were color coded with respect to the associated subtypes (vermilion for LUAD and bluish green for LUSC). Genes are sorted according to significance (absolute values of the corresponding coefficients, with positive values favoring LUAD and negative values favoring LUSC subtypes). Model performance. The predicted probabilities of subtype LUAD for (A) each FNA samples and (B) each TCGA samples was visualized in the bar plot. Samples classified as LUAD (p ≥ 0.5) or LUSC (p < 0.5). Signature genes used in the molecular subtyping model were listed according to the absolute values of their logistic regression coefficients. The categories of the genes were color coded with respect to the associated subtypes (vermilion for LUAD and bluish green for LUSC). (C) The receiver operating characteristic curves (ROCs) of the model performance on the WFBH FNA, the WFBH tissue bank, and the TCGA cohorts. The corresponding c-statistics (AUC, area under the ROC curve) were 0.986, 0.911, and 0.982, respectively. When the probability threshold is set at θ = 0.5, model accuracies are 0.931, 0.881, and 0.945 for the WFBH FNA, the WFBH tissue bank, and the TCGA cohorts, respectively (marked as open circles). The optimal accuracies are reached when optimal θ levels are used, which are: 0.958 at θ = 0.60 for the WFBH FNA cohort, 0.905 at θ = 0.59 for the WFBH tissue bank cohort, and 0.946 at θ = 0.68 for the TCGA cohort (solid gray circles). The intensive resampling was crucial since the sample size was relatively small (n = 72) comparing with the candidate gene set (m = 63). As shown in , significant variations were observed for the FNA dataset, comparing with the much more stable results for The Cancer Genome Atlas (TCGA) dataset (n = 1,016) ( ). The convergence analysis ( ) showed that, to confidently estimate the model complexity and the elastic net mixing factor, at least 1,000 rounds of resampling was required. Resampling approach also provided a realistic estimation of the model performance ( ). The model performance on internal tissue bank dataset (accuracy of 89.3%) and on external TCGA dataset (accuracy of 94.5%) were consistent with the predicted performance (0.903 ± 0.053). The model was validated internally with the 67-gene panel transcriptomics generated on the NanoString platform with frozen tumor LUAD and LUSC tissues, and externally with the RNA-Seq transcriptomics from TCGA LUAD and LUSC dataset. The performance on the WFBH tissue bank cohort and on the TCGA cohort were shown in , , and , respectively. The c-statistics of the model on the WFBH-FNA, WFBH-TB, and the TCGA cohorts were 0.986, 0.911, and 0.982, respective, with the corresponding Receiver Operating Characteristic (ROC) curves shown in . Control of overfitting is a major challenge in molecular subtyping, since comparing with the large size of candidate biomarkers made possible by modern high-throughput or multiplex technologies. The clinical cohorts are usually small. We have significantly reduced the candidate biomarkers from microarray data to 63 genes. Such a focused gene panel still raise overfitting concern for our clinical cohort. With elastic net regularization and intensive resampling, we reduced the model complexity (gene size = 15) and achieved an events-per-variable (EPV) level of about 5. Such EPV level was around the minimum statistically acceptable (36) ratio for model development. Therefore, the validation independent dataset was necessary. The TCGA cohort validation confirmed that the established molecular subtyping model was not overfitted.

Choosing Thresholds for Confident Subtyping in Clinical Implementation

In clinical practice, it is important to identify which cases can be confidently subtyped without pathologists’ review. Subtype-specific thresholds, θ and θ, can be used to reach required confidence. Samples falling between these thresholds need to be examined by pathologists. As shown in , 94.7% of clinical samples can be confidently subtyped with 95% specificity when using θ = 0.84 and θ = 0.74. About 87.5% clinical samples can be subtyped at a specificity of 97% if thresholds θ = 0.89 and θ = 0.66 are used.

Comparison With Current Subtyping Approaches

The classification model demonstrated high accuracy when compared to standard NSCLC clinical subtyping. The TCGA LUAD and LUSC cohort study model outperformed the histopathological image-based artificial intelligence approach using deep convolutional neural network (37), with an AUC of 98.2% vs. 97%. On clinical samples, a subtyping specificity of 97% is at the high-end of current pathological diagnosis utilizing immunohistochemical markers (38). The performance analysis showed that only about 12.5% clinical samples are below this specificity and thus need to be further reviewed by pathologists. Implementing our subtyping model will significantly relieve the burden of pathologists.

NOS Case Profiling

The NOS cases in the WFBH-TB cohort were examined using the developed molecular subtyping model and the subtypes of these cases predicted ( ). Further analysis demonstrated unique subtyping probability distribution of NOS cases (the gray cumulative density function curve and the gray probability density function peak in the left and right panel of ) comparing with typical LUAD and LUSC cases (Kolmogoro-Smirnov test p-values are 2×10–4 for both NOS vs. LUAD and NOS vs. LUSC). In contrast, there was no similar NOS-like probability density function peak in TCGA cohort (right panel in ), in which the NOS cases were excluded. The profiling results suggested that the NOS cases at WFBH cohort might be molecularly distinct from typical LUAD and LUSC subtypes.

Figure 4

Profiling LUAD and LUSC subtypes and NOS cases. The cumulative probability distribution curves (left) and the estimated probability density distribution curves (right) for the prediction results of the LUAD, LUSC, and NOS cases in WFBH (A) and TCGA (B) cohorts, respectively. Each circle in the left panels represents a sample.

Clinical Relevance

Our work proved the concept that a Comprehensive, Computable NanoString Diagnostic gene panel (C2Dx) is a promising tool to conserve diagnostic tissue specimens that require molecular pathological diagnosis when limited small volume tumor specimens are available. The C2Dx platform was developed from underlying classification models for molecular diagnosis, the NanoString gene panel to reliably generate targeted genomics data, and the clinical and experimental protocols for obtaining FNA specimens and generating NanoString genomics data. C2Dx platform demonstrates high potential in addressing the challenges in molecular diagnosis on FNA samples. The current advance of clinical practice raises a challenge to traditional pathology approaches: the gap between the dramatically reduced sample volume when less invasive approaches such as FNA are used and the fast-increasing needs of molecular diagnosis required by modern therapies such as immune therapies, targeted therapies, and, of course, chemotherapies. NanoString data generated from the clinical FNA samples can address such clinical need. Our approach provides a reliable clinical solution to this emerging clinical need. We chose to use the LUAD vs. LUSC subtyping to prove the concept of C2Dx. First, the task is representatively complex for typical molecular diagnosis tasks, and requires a large number of genes used in a multivariate model. It is both experimentally and mathematically more complex than the standard PD-1/PD-L1 test which only requires a single or few genes. It is representative for other molecular diagnostic needs by using metagenes and gene signatures. Second, this task is suitable for examining the performance of the C2Dx platform. It has been well-established clinically, pathologically, and molecularly. Finally, subtyping LUAD vs. LUSC cases is still an essential pathological task in clinical practice and impacts treatment decisions. Beyond the use of the C2Dx platform for fresh FNA samples, the computable subtyping model can be applied for other clinical scenarios. Scenario 1: centralized C2Dx laboratory services, as many clinical sites rely on such external resources for genomics profiling tasks. Scenario 2: retrospective calibration and molecular diagnosis for frozen samples stored in tissue banks at low cost and low demands of sample volume. Our work demonstrated that reliable molecular diagnoses can be achieved for frozen samples and thus both scenarios can be archived. Finally, our subtyping model allows linking the emerging clinical FNA samples to previous patient cohorts for integrative analysis. For example, samples collected at WFBH are comparable to TCGA samples using the subtyping probability scores calculated from the logistic subtyping model on the nanoString 19-gene panel data and the TCGA’s RNA-Seq transcriptomic data. Such mapping provides comprehensive, multi-omics molecular contexts to individual clinical cases, and enables translations of knowledge from public biomedical big data to support precision oncology for individual patients.

Limitations and Future Directions

The main goal of this work is to use a well-established molecular subtyping task (distinguishing LUSC from LUAD cases among NSCLC patients) to prove the concept of a crucial step toward a comprehensive, computable gene panel for molecular diagnosis – robust molecular modeling for bench- -to-bedside knowledge translation. To be implemented in real-world clinical settings, the model needs to be expanded to distinguish small-cell lung cancer cases. Another major future direction will be the incorporation of biomarkers representing the pathological and prognostic knowledge from different domains including genomic abnormalities such as EGFR and BRAF mutations, molecular subtypes such as LUSC and LUAD, immunotherapy markers such as PD-1/PD-L1 expression, and other transcriptomic and genomic signatures for precision oncology.

Conclusions

Our work using FNA-based nanoString gene panel for robust molecular subtyping of NSCLC patients dramatically can reduce the demand of diagnostic samples for subtyping. It can avoid the complications and risks of re-biopsy when specimens are limited. It allows for more tissues to support molecular diagnoses for emerging targeted and immune therapies, and release pathologists from subtyping burden. It may also be used in the research setting, when utilizing small volume “left over” sampling. This translational modeling work is a crucial step toward comprehensive, computable molecular diagnosis devices and potentially revolutionizes current NSCLC diagnosis procedures for precision oncology

Data Availability Statement

The original contributions presented in the study are included in the article/ . Further inquiries can be directed to the corresponding author. The 19-gene molecular subtyping model was also available as an R package at http://github.com/Su-informatics-lab/FNASubtype, with sample R code demonstrated in the supplements.

Ethics Statement

The study was approved by the Institutional Review Board of the Wake Forest Baptist Medical Center. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

JS, LM, and JR prepared the manuscript. LM and JR designed the experiments. JS and LH analyzed the data. RB developed the github distribution of the tool. GP, JC, CB, TD, LC, IP, JH, HC, and JP contributed to sample collection and data generation. JS, LH, BPr, MC, LM, and JR participated the development of the manuscript. All authors contributed to the article and approved the submitted version.

Funding

The work is partially supported by the Cancer Center Support Grant from the National Cancer Institute to the Comprehensive Cancer Center of Wake Forest Baptist Medical Center (P30 CA012197). The work is also partially supported by the Indiana University Precision Health Initiative (funded to JS). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

32 in total

1. Introduction to The 2015 World Health Organization Classification of Tumors of the Lung, Pleura, Thymus, and Heart.

Authors: William D Travis; Elisabeth Brambilla; Allen P Burke; Alexander Marx; Andrew G Nicholson
Journal: J Thorac Oncol Date: 2015-09 Impact factor: 15.609

Review 2. Endoscopic ultrasound-guided fine-needle aspiration for non-small cell lung cancer staging: A systematic review and metaanalysis.

Authors: Carlos G Micames; Douglas C McCrory; Darren A Pavey; Paul S Jowell; Frank G Gress
Journal: Chest Date: 2007-02 Impact factor: 9.410

Review 3. NanoString, a novel digital color-coded barcode technology: current and future applications in molecular diagnostics.

Authors: Hin-Fung Tsang; Vivian Weiwen Xue; Su-Pin Koh; Ya-Ming Chiu; Lawrence Po-Wah Ng; Sze-Chuen Cesar Wong
Journal: Expert Rev Mol Diagn Date: 2016-12-12 Impact factor: 5.225

Review 4. Fine needle aspiration of non-small cell lung cancer: current state and future perspective.

Authors: A Fassina; R Cappellesso; F Simonato; C Lanza; A Marzari; M Fassan
Journal: Cytopathology Date: 2012-08 Impact factor: 2.073

Review 5. Liquid Biopsies in Cancer Diagnosis, Monitoring, and Prognosis.

Authors: Gabriele De Rubis; Sabna Rajeev Krishnan; Mary Bebawy
Journal: Trends Pharmacol Sci Date: 2019-02-05 Impact factor: 14.819

Review 6. The Present and Future of Liquid Biopsies in Non-Small Cell Lung Cancer: Combining Four Biosources for Diagnosis, Prognosis, Prediction, and Disease Monitoring.

Authors: Jillian Wilhelmina Paulina Bracht; Clara Mayo-de-Las-Casas; Jordi Berenguer; Niki Karachaliou; Rafael Rosell
Journal: Curr Oncol Rep Date: 2018-07-20 Impact factor: 5.075

7. Feasibility of lung cancer RNA acquisition from a single transbronchial or transthoracic needle pass (FASTT trial).

Authors: Travis Dotson; Christina Bellinger; Jing Su; Kris Hansen; Graham E Parks; James O Cappellari; Lou Craddock; Hollins Clark; Clifford Howard; W Jeffrey Petty; Bharat Prakash; Kounosuke Watabe; Michael Chan; Jonathan Hovda; Lance D Miller; Jimmy Ruiz
Journal: Lung Cancer Date: 2018-11-20 Impact factor: 5.705

8. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study.

Authors: Kerby Shedden; Jeremy M G Taylor; Steven A Enkemann; Ming-Sound Tsao; Timothy J Yeatman; William L Gerald; Steven Eschrich; Igor Jurisica; Thomas J Giordano; David E Misek; Andrew C Chang; Chang Qi Zhu; Daniel Strumpf; Samir Hanash; Frances A Shepherd; Keyue Ding; Lesley Seymour; Katsuhiko Naoki; Nathan Pennell; Barbara Weir; Roel Verhaak; Christine Ladd-Acosta; Todd Golub; Michael Gruidl; Anupama Sharma; Janos Szoke; Maureen Zakowski; Valerie Rusch; Mark Kris; Agnes Viale; Noriko Motoi; William Travis; Barbara Conley; Venkatraman E Seshan; Matthew Meyerson; Rork Kuick; Kevin K Dobbin; Tracy Lively; James W Jacobson; David G Beer
Journal: Nat Med Date: 2008-07-20 Impact factor: 53.440

9. Reproducible, Quantitative, and Flexible Molecular Subtyping of Clinical DLBCL Samples Using the NanoString nCounter System.

Authors: Margaret H Veldman-Jones; Zhongwu Lai; Mark Wappett; Chris G Harbron; J Carl Barrett; Elizabeth A Harrington; Kenneth S Thress
Journal: Clin Cancer Res Date: 2014-10-09 Impact factor: 12.531

10. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning.

Authors: Nicolas Coudray; Paolo Santiago Ocampo; Theodore Sakellaropoulos; Navneet Narula; Matija Snuderl; David Fenyö; Andre L Moreira; Narges Razavian; Aristotelis Tsirigos
Journal: Nat Med Date: 2018-09-17 Impact factor: 53.440