A radiogenomic dataset of non-small cell lung cancer.

Shaimaa Bakr1, Olivier Gevaert2, Sebastian Echegaray3, Kelsey Ayers4, Mu Zhou2, Majid Shafiq5, Hong Zheng2, Jalen Anthony Benson4, Weiruo Zhang3, Ann N C Leung3, Michael Kadoch6, Chuong D Hoang7, Joseph Shrager8,9, Andrew Quon3, Daniel L Rubin3, Sylvia K Plevritis3, Sandy Napel3.   

Abstract

Medical image biomarkers of cancer promise improvements in patient care through advances in precision medicine. Compared to genomic biomarkers, image biomarkers provide the advantages of being non-invasive, and characterizing a heterogeneous tumor in its entirety, as opposed to limited tissue available via biopsy. We developed a unique radiogenomic dataset from a Non-Small Cell Lung Cancer (NSCLC) cohort of 211 subjects. The dataset comprises Computed Tomography (CT), Positron Emission Tomography (PET)/CT images, semantic annotations of the tumors as observed on the medical images using a controlled vocabulary, and segmentation maps of tumors in the CT scans. Imaging data are also paired with results of gene mutation analyses, gene expression microarrays and RNA sequencing data from samples of surgically excised tumor tissue, and clinical data, including survival outcomes. This dataset was created to facilitate the discovery of the underlying relationship between tumor molecular and medical image features, as well as the development and evaluation of prognostic medical image biomarkers.

Year:  2018        PMID: 30325352      PMCID: PMC6190740          DOI: 10.1038/sdata.2018.202

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background and Summary

Advances in high-throughput molecular technologies hold great promise for the development of genomic biomarkers that enable precision medicine tailored to specific patients. These molecular biomarkers deliver powerful diagnostic information, as well as high prognostic significance. Similarly, medical imaging technologies provide tools for measuring the structural, functional and physiologic properties of tissue. Identifying image-based properties of tumors through medical images is a standard part of diagnosis, clinical staging, and treatment planning. Because image interpretation can be subjective, for medical imaging to have a role in personalized medicine, the development of robust, standardized image features that can be used to predict molecular properties, prognosis and/or treatment response, is required. These standardized features can be in the form of semantic annotations acquired from human observers, or radiomic features, i.e. quantitative image features computed from the image pixels. Quantitative image features include tumor size and shape, image intensity distributions, and image texture. While the adoption of molecular technologies can be limited by cost and the invasiveness of the procedure, medical imaging is, more commonly, part of the standard of care[1]. Moreover, in comparison to molecular profiling, radiomic characterization provides a more comprehensive representation of the tumor. Since molecular profiling is restricted to the region of the biopsy, it results in an incomplete representation of the heterogeneous tissue of the tumor. On the other hand, molecular technologies allow profiling of genes expressed in the tissue sample. This complementary relationship suggests that combining the use of molecular and imaging biomarkers has the potential to improve patient care and to provide insight into how molecular mechanisms give rise to imaging phenotypes. 
The prognostic power of medical image features and their link to molecular properties has only been recently investigated for certain cancer types[2-20]. An important challenge in such radiogenomic studies is the scarcity of large data sets containing medical images, extracted image features, gene expression profiles, and clinical data with survival outcomes. Specifically, for NSCLC, which is the leading cause of cancer death[21], there is a dearth of available datasets that contain medical images, molecular features, and associated clinical data. In NSCLC, CT and PET/CT are the investigation tools of choice for diagnosis, staging and monitoring of response to treatment. From these scans, one can compute a large number of quantitative image features for associations with tumor molecular features and clinical outcomes. Molecular profiles of tumors can be obtained through needle biopsies or samples of surgically-excised tumors. Clinical data and outcomes can be obtained from standard medical follow-up. While large molecular datasets with clinical data are readily available[22-25], there are fewer public medical imaging datasets combined with clinical and molecular data. For example, while five independent NSCLC datasets containing collectively 788 subjects were used in a radiogenomics study[7], only 89 subjects had imaging, molecular and clinical data. Moreover, that dataset included CT scans but did not contain PET/CT data. It is important to continue to create large integrated databases available for discovery and validation of biomarkers, and so we created this dataset to allow researchers to investigate the relationships between image features, tumor molecular phenotype, and survival outcomes. Between 2008 and 2012, we collected clinical and imaging data for 211 subjects referred for surgical treatment and obtained tissue samples from the excised tumors, where available. 
Tissue samples were analyzed to produce molecular phenotypes using gene expression microarrays, RNA sequencing, or both, in addition to standard-of-care NSCLC mutational testing. We also collected clinical data such as age, gender, weight, ethnicity, smoking status, TNM stage, and histopathological grade. In addition, we included 3D tumor segmentations of the CT studies that were used for computation of 3D quantitative image features. Not all data are available for all subjects due to limitations in resources; out of the 211 subjects, 116 have all data types except for microarray (the data type with the smallest number of subjects), and 130 have clinical, imaging (CT and PET/CT), and molecular (RNA-Seq) data, as detailed in Tables 1 and 2.
Table 1

Summary of the major collected data types and the corresponding number of subjects with available data.

Data Type | Number of subjects
Clinical Data | 211
CT | 211
CT Tumor Segmentations | 144
CT Semantic Annotations | 190
PET/CT | 201
RNA-Seq | 130
Gene expression Microarrays | 26
Table 2

Subject IDs versus data type. Yes/no indicates if a data type is available for that particular subject.

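Subject ID | Clinical | CT | PET/CT | RNA-Seq | Semantic Annotations | Segmentations
AMC 001 | Yes | Yes | Yes | No | No | No
AMC 002 | Yes | Yes | No | No | No | No
AMC 003 | Yes | Yes | Yes | No | Yes | No
AMC 004 | Yes | Yes | Yes | No | No | No
AMC 005 | Yes | Yes | No | No | Yes | No
AMC 006 | Yes | Yes | Yes | No | Yes | No
AMC 007 | Yes | Yes | No | No | Yes | No
AMC 008 | Yes | Yes | No | No | Yes | No
AMC 009 | Yes | Yes | Yes | No | Yes | No
AMC 010 | Yes | Yes | Yes | No | Yes | No
AMC 011 | Yes | Yes | Yes | No | No | No
AMC 012 | Yes | Yes | Yes | No | No | No
AMC 013 | Yes | Yes | Yes | No | Yes | No
AMC 014 | Yes | Yes | Yes | No | Yes | No
AMC 015 | Yes | Yes | Yes | No | Yes | No
AMC 016 | Yes | Yes | Yes | No | Yes | No
AMC 017 | Yes | Yes | Yes | No | No | No
AMC 018 | Yes | Yes | Yes | No | Yes | No
AMC 019 | Yes | Yes | No | No | Yes | No
AMC 020 | Yes | Yes | Yes | No | Yes | No
AMC 021 | Yes | Yes | Yes | No | Yes | No
AMC 022 | Yes | Yes | Yes | No | Yes | No
AMC 023 | Yes | Yes | Yes | No | Yes | No
AMC 024 | Yes | Yes | Yes | No | Yes | No
AMC 025 | Yes | Yes | Yes | No | No | No
AMC 026 | Yes | Yes | Yes | No | Yes | No
AMC 027 | Yes | Yes | Yes | No | Yes | No
AMC 028 | Yes | Yes | No | No | Yes | No
AMC 029 | Yes | Yes | No | No | Yes | No
AMC 030 | Yes | Yes | Yes | No | Yes | No
AMC 031 | Yes | Yes | No | No | Yes | No
AMC 032 | Yes | Yes | Yes | No | Yes | No
AMC 033 | Yes | Yes | Yes | No | Yes | No
AMC 034 | Yes | Yes | No | No | Yes | No
AMC 035 | Yes | Yes | Yes | No | Yes | No
AMC 036 | Yes | Yes | Yes | No | Yes | No
AMC 037 | Yes | Yes | Yes | No | Yes | No
AMC 038 | Yes | Yes | Yes | No | Yes | No
AMC 039 | Yes | Yes | Yes | No | Yes | No
AMC 040 | Yes | Yes | Yes | No | Yes | No
AMC 041 | Yes | Yes | Yes | No | Yes | No
AMC 042 | Yes | Yes | Yes | No | Yes | No
AMC 043 | Yes | Yes | No | No | Yes | No
AMC 044 | Yes | Yes | Yes | No | Yes | No
AMC 045 | Yes | Yes | Yes | No | Yes | No
AMC 046 | Yes | Yes | Yes | No | Yes | No
AMC 047 | Yes | Yes | Yes | No | Yes | No
AMC 048 | Yes | Yes | Yes | No | Yes | No
AMC 049 | Yes | Yes | Yes | No | Yes | No
R01 001 | Yes | Yes | Yes | No | Yes | Yes
R01 002 | Yes | Yes | Yes | No | Yes | Yes
R01 003 | Yes | Yes | Yes | Yes | Yes | Yes
R01 004 | Yes | Yes | Yes | Yes | Yes | Yes
R01 005 | Yes | Yes | Yes | Yes | Yes | Yes
R01 006 | Yes | Yes | Yes | Yes | Yes | Yes
R01 007 | Yes | Yes | Yes | Yes | Yes | Yes
R01 008 | Yes | Yes | Yes | No | Yes | Yes
R01 009 | Yes | Yes | Yes | No | No | No
R01 010 | Yes | Yes | Yes | No | Yes | Yes
R01 011 | Yes | Yes | Yes | No | Yes | Yes
R01 012 | Yes | Yes | Yes | Yes | Yes | Yes
R01 013 | Yes | Yes | Yes | Yes | Yes | Yes
R01 014 | Yes | Yes | Yes | Yes | Yes | Yes
R01 015 | Yes | Yes | Yes | Yes | Yes | Yes
R01 016 | Yes | Yes | Yes | Yes | Yes | Yes
R01 017 | Yes | Yes | Yes | Yes | Yes | Yes
R01 018 | Yes | Yes | Yes | Yes | Yes | Yes
R01 019 | Yes | Yes | Yes | No | Yes | Yes
R01 020 | Yes | Yes | Yes | No | Yes | Yes
R01 021 | Yes | Yes | Yes | Yes | Yes | Yes
R01 022 | Yes | Yes | Yes | Yes | Yes | Yes
R01 023 | Yes | Yes | Yes | Yes | Yes | Yes
R01 024 | Yes | Yes | Yes | Yes | Yes | Yes
R01 025 | Yes | Yes | Yes | No | Yes | Yes
R01 026 | Yes | Yes | Yes | Yes | Yes | Yes
R01 027 | Yes | Yes | Yes | Yes | Yes | Yes
R01 028 | Yes | Yes | Yes | Yes | Yes | Yes
R01 029 | Yes | Yes | Yes | Yes | Yes | Yes
R01 030 | Yes | Yes | Yes | No | Yes | Yes
R01 031 | Yes | Yes | Yes | Yes | Yes | Yes
R01 032 | Yes | Yes | Yes | Yes | Yes | Yes
R01 033 | Yes | Yes | Yes | Yes | Yes | Yes
R01 034 | Yes | Yes | Yes | Yes | Yes | Yes
R01 035 | Yes | Yes | Yes | Yes | Yes | Yes
R01 036 | Yes | Yes | Yes | No | Yes | Yes
R01 037 | Yes | Yes | Yes | Yes | Yes | Yes
R01 038 | Yes | Yes | Yes | Yes | Yes | Yes
R01 039 | Yes | Yes | Yes | Yes | Yes | Yes
R01 040 | Yes | Yes | Yes | Yes | Yes | Yes
R01 041 | Yes | Yes | Yes | Yes | Yes | Yes
R01 042 | Yes | Yes | Yes | Yes | Yes | Yes
R01 043 | Yes | Yes | Yes | Yes | Yes | Yes
R01 044 | Yes | Yes | Yes | No | Yes | Yes
R01 045 | Yes | Yes | Yes | No | Yes | Yes
R01 046 | Yes | Yes | Yes | Yes | Yes | Yes
R01 047 | Yes | Yes | Yes | No | Yes | Yes
R01 048 | Yes | Yes | Yes | Yes | Yes | Yes
R01 049 | Yes | Yes | Yes | Yes | Yes | Yes
R01 050 | Yes | Yes | Yes | No | Yes | Yes
R01 051 | Yes | Yes | Yes | Yes | Yes | Yes
R01 052 | Yes | Yes | Yes | Yes | Yes | Yes
R01 053 | Yes | Yes | Yes | No | Yes | Yes
R01 054 | Yes | Yes | Yes | Yes | Yes | Yes
R01 055 | Yes | Yes | Yes | Yes | Yes | Yes
R01 056 | Yes | Yes | Yes | Yes | Yes | Yes
R01 057 | Yes | Yes | Yes | Yes | Yes | Yes
R01 058 | Yes | Yes | Yes | No | Yes | Yes
R01 059 | Yes | Yes | Yes | Yes | Yes | Yes
R01 060 | Yes | Yes | Yes | Yes | Yes | Yes
R01 061 | Yes | Yes | Yes | Yes | Yes | Yes
R01 062 | Yes | Yes | Yes | Yes | Yes | Yes
R01 063 | Yes | Yes | Yes | Yes | Yes | Yes
R01 064 | Yes | Yes | Yes | Yes | Yes | Yes
R01 065 | Yes | Yes | Yes | Yes | Yes | Yes
R01 066 | Yes | Yes | Yes | Yes | Yes | Yes
R01 067 | Yes | Yes | Yes | Yes | Yes | Yes
R01 068 | Yes | Yes | Yes | Yes | Yes | Yes
R01 069 | Yes | Yes | Yes | Yes | Yes | Yes
R01 070 | Yes | Yes | Yes | No | Yes | Yes
R01 071 | Yes | Yes | Yes | Yes | Yes | Yes
R01 072 | Yes | Yes | Yes | Yes | Yes | Yes
R01 073 | Yes | Yes | Yes | Yes | Yes | Yes
R01 074 | Yes | Yes | Yes | No | Yes | Yes
R01 075 | Yes | Yes | Yes | No | Yes | Yes
R01 076 | Yes | Yes | Yes | Yes | Yes | Yes
R01 077 | Yes | Yes | Yes | Yes | Yes | Yes
R01 078 | Yes | Yes | Yes | Yes | Yes | Yes
R01 079 | Yes | Yes | Yes | Yes | Yes | Yes
R01 080 | Yes | Yes | Yes | Yes | Yes | Yes
R01 081 | Yes | Yes | Yes | Yes | Yes | Yes
R01 082 | Yes | Yes | Yes | No | Yes | Yes
R01 083 | Yes | Yes | Yes | Yes | Yes | Yes
R01 084 | Yes | Yes | Yes | Yes | Yes | Yes
R01 085 | Yes | Yes | Yes | No | Yes | Yes
R01 086 | Yes | Yes | Yes | No | Yes | Yes
R01 087 | Yes | Yes | Yes | No | Yes | Yes
R01 088 | Yes | Yes | Yes | No | Yes | Yes
R01 089 | Yes | Yes | Yes | Yes | Yes | Yes
R01 090 | Yes | Yes | Yes | No | Yes | Yes
R01 091 | Yes | Yes | Yes | Yes | Yes | Yes
R01 092 | Yes | Yes | Yes | No | Yes | Yes
R01 093 | Yes | Yes | Yes | Yes | Yes | Yes
R01 094 | Yes | Yes | Yes | Yes | Yes | Yes
R01 095 | Yes | Yes | Yes | No | Yes | Yes
R01 096 | Yes | Yes | Yes | Yes | Yes | Yes
R01 097 | Yes | Yes | Yes | Yes | Yes | Yes
R01 098 | Yes | Yes | Yes | Yes | Yes | Yes
R01 099 | Yes | Yes | Yes | Yes | Yes | Yes
R01 100 | Yes | Yes | Yes | Yes | Yes | Yes
R01 101 | Yes | Yes | Yes | Yes | Yes | Yes
R01 102 | Yes | Yes | Yes | Yes | Yes | Yes
R01 103 | Yes | Yes | Yes | Yes | Yes | Yes
R01 104 | Yes | Yes | Yes | Yes | Yes | Yes
R01 105 | Yes | Yes | Yes | Yes | Yes | Yes
R01 106 | Yes | Yes | Yes | Yes | Yes | Yes
R01 107 | Yes | Yes | Yes | Yes | Yes | Yes
R01 108 | Yes | Yes | Yes | Yes | Yes | Yes
R01 109 | Yes | Yes | Yes | Yes | Yes | Yes
R01 110 | Yes | Yes | Yes | Yes | Yes | Yes
R01 111 | Yes | Yes | Yes | Yes | Yes | Yes
R01 112 | Yes | Yes | Yes | Yes | Yes | Yes
R01 113 | Yes | Yes | Yes | Yes | Yes | Yes
R01 114 | Yes | Yes | Yes | Yes | Yes | Yes
R01 115 | Yes | Yes | Yes | Yes | Yes | Yes
R01 116 | Yes | Yes | Yes | Yes | Yes | Yes
R01 117 | Yes | Yes | Yes | Yes | Yes | Yes
R01 118 | Yes | Yes | Yes | Yes | Yes | Yes
R01 119 | Yes | Yes | Yes | Yes | Yes | Yes
R01 120 | Yes | Yes | Yes | Yes | Yes | Yes
R01 121 | Yes | Yes | Yes | Yes | Yes | Yes
R01 122 | Yes | Yes | Yes | Yes | Yes | Yes
R01 123 | Yes | Yes | Yes | Yes | Yes | Yes
R01 124 | Yes | Yes | Yes | Yes | Yes | Yes
R01 125 | Yes | Yes | Yes | Yes | Yes | Yes
R01 126 | Yes | Yes | Yes | Yes | Yes | Yes
R01 127 | Yes | Yes | Yes | Yes | Yes | Yes
R01 128 | Yes | Yes | Yes | Yes | Yes | Yes
R01 129 | Yes | Yes | Yes | Yes | Yes | Yes
R01 130 | Yes | Yes | Yes | Yes | Yes | Yes
R01 131 | Yes | Yes | Yes | Yes | Yes | Yes
R01 132 | Yes | Yes | Yes | Yes | Yes | Yes
R01 133 | Yes | Yes | Yes | Yes | No | Yes
R01 134 | Yes | Yes | Yes | Yes | Yes | Yes
R01 135 | Yes | Yes | Yes | Yes | Yes | Yes
R01 136 | Yes | Yes | Yes | Yes | Yes | Yes
R01 137 | Yes | Yes | Yes | Yes | Yes | Yes
R01 138 | Yes | Yes | Yes | Yes | Yes | Yes
R01 139 | Yes | Yes | Yes | Yes | Yes | Yes
R01 140 | Yes | Yes | Yes | Yes | Yes | Yes
R01 141 | Yes | Yes | Yes | Yes | Yes | Yes
R01 142 | Yes | Yes | Yes | Yes | Yes | Yes
R01 143 | Yes | Yes | Yes | No | Yes | No
R01 144 | Yes | Yes | Yes | Yes | Yes | Yes
R01 145 | Yes | Yes | Yes | Yes | Yes | Yes
R01 146 | Yes | Yes | Yes | Yes | Yes | Yes
R01 147 | Yes | Yes | Yes | No | Yes | No
R01 148 | Yes | Yes | Yes | No | No | No
R01 149 | Yes | Yes | Yes | No | No | No
R01 150 | Yes | Yes | Yes | No | No | No
R01 151 | Yes | Yes | Yes | No | Yes | No
R01 152 | Yes | Yes | Yes | No | Yes | No
R01 153 | Yes | Yes | Yes | No | No | No
R01 154 | Yes | Yes | Yes | No | Yes | No
R01 156 | Yes | Yes | Yes | No | No | No
R01 157 | Yes | Yes | Yes | No | No | No
R01 158 | Yes | Yes | Yes | No | No | No
R01 159 | Yes | Yes | Yes | No | No | No
R01 160 | Yes | Yes | Yes | No | No | No
R01 161 | Yes | Yes | Yes | No | No | No
R01 162 | Yes | Yes | Yes | No | No | No
R01 163 | Yes | Yes | Yes | No | No | No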
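Questions such as "which subjects have every data type?" reduce to filtering the availability flags in Table 2. The following is a minimal sketch, assuming the flags have been loaded into a dictionary keyed by subject ID; the five rows are copied from Table 2, while the function name `subjects_with` and the in-memory layout are illustrative assumptions, not part of the published dataset.

```python
# Column order used in Table 2.
COLUMNS = ("Clinical", "CT", "PET/CT", "RNA-Seq",
           "Semantic Annotations", "Segmentations")

# A few rows copied from Table 2 (subject ID -> availability flags).
SAMPLE_ROWS = {
    "AMC 001": ("Yes", "Yes", "Yes", "No", "No", "No"),
    "AMC 002": ("Yes", "Yes", "No", "No", "No", "No"),
    "R01 001": ("Yes", "Yes", "Yes", "No", "Yes", "Yes"),
    "R01 003": ("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
    "R01 133": ("Yes", "Yes", "Yes", "Yes", "No", "Yes"),
}


def subjects_with(rows, required):
    """Return sorted subject IDs for which every required data type is 'Yes'."""
    idx = [COLUMNS.index(name) for name in required]
    return sorted(
        sid for sid, flags in rows.items()
        if all(flags[i] == "Yes" for i in idx)
    )


# Subjects with every data type listed in Table 2.
complete = subjects_with(SAMPLE_ROWS, COLUMNS)
# Subjects with both imaging modalities and RNA-Seq.
imaging_and_rnaseq = subjects_with(SAMPLE_ROWS, ("CT", "PET/CT", "RNA-Seq"))
print(complete, imaging_and_rnaseq)
```

Applied to the full table instead of this five-row sample, the same filter would give the size of any subset of interest.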

Methods

Subject Demographics and Clinical Data

With approval of our respective Institutional Review Boards (IRB), we recruited a total of 211 subjects for the following two cohorts: (1) The R01 cohort consisted of 162 NSCLC subjects (38 females, 124 males, age at scan: mean 68, range 42–86) from Stanford University School of Medicine (69) and Palo Alto Veterans Affairs Healthcare System (93). Subjects were recruited between April 7th, 2008 and September 15th, 2012. Subjects signed written consent forms according to the guidelines of institutions’ IRBs. The subjects were selected from a pool of early stage NSCLC patients, referred for surgical treatment with preoperative CT and PET/CT performed prior to surgical procedures. Samples of excised tissues were later used to obtain mutation data and gene expression data using gene expression microarrays, or RNA sequencing, or both. Identifiers for this set of 162 subjects are in the format R01-XXXXXX. (2) The AMC cohort, consisting of 49 additional subjects (33 females, 16 males, age at scan: mean 67, range 24–80), was retrospectively collected from Stanford University School of Medicine based on the same criteria in addition to the availability of the following clinical mutational test results: Epidermal Growth Factor Receptor (EGFR), Kirsten Rat Sarcoma viral oncogene homolog (KRAS), and Anaplastic Lymphoma Kinase (ALK). Identifiers for this set of 49 subjects are in the format AMC-XXXXXX. For both cohorts, clinical data included, where available, smoking history (211), survival (211), recurrence status (210), histology (211), histopathological grading (162) and Pathological TNM staging (161). There were 172 adenocarcinomas and 35 squamous cell carcinomas and 4 not otherwise specified with grades ranging from poorly to well-differentiated. Clinical date features (e.g. recurrence date and scan dates) are shifted for anonymization purposes and are chronologically ordered relative to each other. 
Table 3 summarizes clinical data of the cohorts, and Table 4 lists all clinical features.
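The text states that date features are shifted for anonymization while keeping their relative chronological order, but it does not say how the shift was applied. As a hedged illustration only, the sketch below assumes one constant offset per subject and shows that such a shift leaves intervals between events unchanged. The dates, the offset, and the function name `shift_dates` are hypothetical.

```python
from datetime import date, timedelta


def shift_dates(dates, offset_days):
    """Shift every date of one subject by the same number of days."""
    delta = timedelta(days=offset_days)
    return {name: d + delta for name, d in dates.items()}


# Hypothetical dates for one subject (not taken from the dataset).
original = {
    "ct": date(2010, 3, 1),
    "pet": date(2010, 3, 9),
    "recurrence": date(2011, 6, 15),
}
shifted = shift_dates(original, offset_days=-4321)  # hypothetical offset

# A constant shift preserves the interval between any two events.
interval_before = (original["recurrence"] - original["ct"]).days
interval_after = (shifted["recurrence"] - shifted["ct"]).days
print(interval_before == interval_after)
```

Under this assumption, quantities such as time to recurrence can be computed from the shifted dates, whereas absolute calendar dates carry no meaning.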
Table 3

Summary of demographic (sex and ethnicity) and clinical cohort characteristics (histology, pathological TNM stage and histopathological grade).

Feature | Number of Subjects
Sex
Female | 76
Male | 135
Ethnicity
African-American | 6
Asian | 24
Caucasian | 123
Hispanic/Latino | 6
Native Hawaiian/Pacific Islander | 3
Not Recorded | 49
Histology
Adenocarcinoma | 172
Squamous cell carcinoma | 35
Not otherwise specified | 4
Pathological T stage
T0 | 0
Tis | 6
T1a | 40
T1b | 31
T1nos | 0
T2a | 47
T2b | 10
T2nos | 0
T3 | 21
T4 | 7
TX | 0
Not Collected | 49
Pathological N stage
N0 | 129
N1 | 15
N2 | 18
N3 | 0
NX | 0
Not Collected | 49
Pathological M stage
M0 | 157
M1a | 1
M1b | 4
Not Collected | 49
Histopathological Grade
G1 Well differentiated | 32
G2 Moderately differentiated | 76
G3 Poorly differentiated | 33
Other, Type I: Well to moderately differentiated | 9
Other, Type II: Moderately to poorly differentiated | 12
Not Collected | 49
Table 4

List of clinical features collected from subject medical records for our cohort of 211 subjects and corresponding number of patients with filled information for each feature.

Clinical Features | Number of Patients
Subject affiliation | 211
Age at Histological Diagnosis | 211
Weight (lbs) | 152
Gender | 211
Ethnicity | 162
Smoking status | 211
Pack Years | 203
Quit Smoking Year | 194
Ground Glass | 146
Tumor Location | 211
Histology | 211
Pathological T stage | 162
Pathological N stage | 162
Pathological M stage | 162
Histopathological Grade | 162
Lymphovascular invasion | 154
Pleural invasion (elastic, visceral, parietal) | 154
EGFR mutation status | 206
KRAS mutation status | 205
ALK translocation status | 196
Adjuvant Treatment | 210
Chemotherapy | 210
Radiation | 210
Recurrence | 210
Recurrence Location | 210
Date of Recurrence | 210
Date of Last Known Alive | 211
Survival Status | 211
Date of Death | 211
CT Date | 211
Days between CT and surgery | 211
PET Date | 162

Imaging Data

Subjects received preoperative CT and PET/CT scans at Stanford University Medical Center and Palo Alto Veterans Affairs Healthcare System prior to surgical treatment as part of their care. Different scanners were used depending on the institution and physician choice and scanning protocols also varied.

De-Identification of Imaging Data

All imaging data were de-identified prior to analysis at Stanford. For subjects from Stanford, we de-identified the imaging data using the Medical Imaging Resource Center (MIRC) Clinical Trial Processor (CTP) (RSNA, Oakbrook, IL). The MIRC CTP is a software tool designed to anonymize DICOM objects by removing protected health information. Medical image data from VA subjects were de-identified using PACSGEAR (Perceptive Software, Pleasanton, CA). Prior to making the data available on The Cancer Imaging Archive (TCIA)[26], we performed a second round of de-identification using CTP, further assuring complete removal of any identifying information. TCIA complies with HIPAA de-identification standards using the Safe Harbor Method as defined in section 164.514(b)(2) of the HIPAA Privacy Rule. This is done by incorporating the “Basic Application Confidentiality Profile”, which is amended by inclusion of the following profile options: Clean Pixel Data Option, Clean Descriptors Option, Retain Longitudinal with Modified Dates Option, Retain Patient Characteristics Option, Retain Device Identity Option, and Retain Safe Private Option. The de-identification rules applied to each object are recorded by TCIA in the DICOM sequence Method Code Sequence [0012,0063] by entering the Code Value, Coding Scheme Designator, and Code Meaning for each profile and option that were applied to the DICOM object during de-identification[27].

CT Data

CT images in DICOM format[28] are available from 211 subjects. Since this is a retrospectively collected dataset, different subjects were scanned using different scanners, scanning protocols and scanning parameters: slice thickness of 0.625–3 mm (median: 1.5 mm) and an X-ray tube current of 124–699 mA (mean 220 mA) at 80–140 kVp (mean 120 kVp). Detailed scanning parameters, including scanner make and model are specified in the DICOM headers. Scans were acquired with subjects in supine position with arms at sides, from the apex of the lung to the adrenal gland within a single breath-hold. Table 5 summarizes the ranges of CT parameters used for our cohort.
Table 5

Summary of key CT scanning parameters in our cohort.

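Parameter | Value | No. of Subjects
Peak kilovoltage (kVp) | 100–120 | See DICOM image headers for individual scans
X-ray Tube Current (mA) | 28–749 | See DICOM image headers for individual scans
Slice Thickness (mm) | 0.625 | 12
Slice Thickness (mm) | 1 | 64
Slice Thickness (mm) | 1.5 | 114
Slice Thickness (mm) | 2 | 2
Slice Thickness (mm) | 2.5 | 15
Slice Thickness (mm) | 3 | 4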
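The per-thickness subject counts in Table 5 are enough to recompute the median slice thickness quoted in the CT Data text. The sketch below expands the counts into one value per subject and takes the median with the Python standard library; the variable names are illustrative.

```python
from statistics import median

# Slice thickness (mm) -> number of subjects, as listed in Table 5.
SLICE_THICKNESS_COUNTS = {0.625: 12, 1.0: 64, 1.5: 114, 2.0: 2, 2.5: 15, 3.0: 4}

# Expand the counts into one thickness value per subject.
per_subject = [
    thickness
    for thickness, n_subjects in SLICE_THICKNESS_COUNTS.items()
    for _ in range(n_subjects)
]

total_subjects = len(per_subject)       # 211
median_thickness = median(per_subject)  # 1.5
print(total_subjects, median_thickness)
```

The counts sum to 211 subjects and the median is 1.5 mm, in agreement with the text.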

PET/CT Data

Fasting 18F-fluorodeoxyglucose (18F-FDG) PET/CT data are available for 201 subjects. A GE Discovery D690 PET/CT was used for PET/CT scanning at Stanford University Medical Center, while the Palo Alto VA employed a GE Discovery PET/CT scanner. (The exact models of the PET/CT scanners are specified in the DICOM image headers.) FDG dose and uptake time were 138.90–572.25 MBq (mean 309.26 MBq) and 23.08–128.90 min (mean 66.58 min), respectively. PET images were generated at both sites using a similar protocol. Specifically, CT-based attenuation correction was utilized with iterative Ordered Subset Expectation Maximization (OSEM) reconstruction. Image acquisition included routine coverage of base-of-skull to mid-thigh with additional spot views where necessary. Each bed position was a 1–5-minute acquisition, dependent on subject weight. Table 6 summarizes ranges of scan parameters used to obtain PET/CT images. This PET/CT data set was used to identify tumor PET-FDG uptake features associated with gene expression signatures and survival[29].
Table 6

Summary of key PET/CT parameters in our cohort.

Parameter | Value
FDG Dose (MBq) | 138.90–572.25
FDG uptake time (min) | 23.08–128.90

CT and PET/CT acquisition protocols

It has been recognized that the results of quantitative analyses (including e.g., radiomics) of images will vary as a function of image acquisition and reconstruction protocol[30-38]. However, we note that the imaging datasets reported here were acquired over several years and from several institutions, and not as part of a prospective trial. For these reasons there was no attempt to harmonize the acquisition and reconstruction protocols.

Semantic Annotations

Semantic annotations are available for axial CT series of 190 subjects. The template of semantic terms was developed in consensus by two academic thoracic radiologists (A.N.C.L. and D.A.) with expertise and interest in lung cancer imaging. The template was developed for nodules as they are the most common manifestation of lung cancer. As a result, we provide no semantic annotations for cancers of other manifestations, e.g., central obstructive tumors or “pneumonic tumors”. The template contains 28 nodule analysis features and parenchymal features comprising conventional and newly developed features used for diagnosis and staging using the CT images. Nodule features describe anatomic location, geometry, internal features and other associated findings of the nodules. Parenchymal features characterize lung emphysema, bronchi and lumen. The selected terms are in common usage in radiology clinical practice and are derived from descriptions in the radiology literature; definitions of some of these, such as “nodule”, are found in the Fleischner Society: Glossary of Terms for Thoracic Imaging[39]. Table 7 (available online only) describes the semantic features included in the template. The ePAD template that we developed forces complete annotation for each nodule, resulting in all applicable features being collected. The presence of some features is conditional upon other features being present. For example, the primary emphysema pattern feature is not collected when emphysema is not present in the lung. ePAD creates annotations in the Annotation and Image Mark-up (AIM) file format using a controlled vocabulary. The AIM information model is designed to be semantically interoperable. Information such as annotator identity, annotation date, and a reference to the annotated image complements information on anatomic entities and imaging observation characteristics of the referenced image.
AIM files supplement DICOM and other image formats, which do not contain information on the meaning of the pixels in the image[40,41]. One radiologist (A.N.L.) with more than 20 years of experience ascribed the semantic annotations for all subjects’ CT scans using ePAD, an open-source and freely available web-based quantitative imaging informatics platform[41]. While we acknowledge that semantic annotations are subjective and subject to intra- and inter-reader variability, these were used in several studies, e.g., to predict EGFR and KRAS mutation status[42], and to create a radiogenomic map linking semantic features to gene expression profiles generated by RNA sequencing[13].
Table 7

List of semantic features collected from axial CT for a subset of 190 subjects. Nodule analysis provides information on anatomy location, geometry, internal features and other associated findings of the nodules. The “internal” features refer to findings inside (interior of) the tumor. Associated findings are observed within the imaging study but outside the tumor volume. Parenchymal analysis characterizes lung emphysema, bronchi and lumen. The ePAD template used to collect these features requires all features to be collected. However, some features are conditional upon another feature being present. For example, the “primary emphysema pattern” feature is collected only if the “emphysema” feature is present.

Nodule Analysis

Category | Feature | Value
 | Anatomic Location | Right Upper Lobe; Right Middle Lobe; Right Lower Lobe; Left Upper Lobe; Lingula; Left Lower Lobe; Right bronchial tree; Left bronchial tree; Right side; Left side; Unable to determine
 | Axial Location | Central; Peripheral (edge < 2 cm from visceral pleura)
 | Longest diameter (mm) | Integer
 | Longest perpendicular diameter (mm) | Integer
 | Nodule attenuation | Solid; Pure Ground Glass (GG) (Non-solid); Semi-consolidation (attenuation between solid and ground-glass in non-solid nodules); Part-solid, solid ≤ 5 mm diameter; Part-solid, solid > 5 mm diameter
 | Nodule Reticulation | Absent; Present (lines inside GG nodule)
Internal Features | Internal Air alveolograms/bronchograms | Absent; Present
 | Necrosis | Absent; Present
 | Cavitation | Absent; Present
 | Nodule Margins – primary pattern | Smooth (sharply delineated margins – can outline confidently without oscillations or serrations); Irregular (minor oscillations or serrations of margin); Lobulated (focal convexity or protrusions of lesion into lung); Spiculation (several linear radiations of finite length extending into adjacent lung); Poorly defined (lack of clear delineation of margins – cannot outline confidently)
 | Nodule Shape | Round (roughly spherical); Oval (ratio of x/y diameters > 1.5); Complex (neither 1 nor 2); Polygonal (straight or concave borders)
 | Nodule Calcification | No calcification; Central calcification; Peripheral
Associated Findings | Attachment to Pleura | Absent; Present
 | Attachment to Vessel | Absent; Present
 | Attachment to Bronchus | Absent; Present
 | Pleural Retraction | Absent; Present
 | Entering Airway | Absent; Present
 | Thickened adjacent bronchovascular bundle | Absent; Present
 | Vascular convergence | Absent; Present
 | Septal thickening | Absent; Present
 | Nodule Periphery | Emphysema; Fibrosis (diffuse); Normal; Scarring (focal)
 | Satellite nodules in Primary Lesion Lobe (≥ 4 mm, noncalcified) | Absent; Solid; Non-solid; Semi-consolidation; Part-solid
 | Nodules in NON-lesion lobe SAME Lung (≥ 4 mm, noncalcified) | Absent; Solid; Non-solid; Semi-consolidation; Part-solid
 | Nodules in CONTRALATERAL Lung (≥ 4 mm, noncalcified) | Absent; Solid; Non-solid; Semi-consolidation; Part-solid
 | Centrilobular nodules – diffuse (RB type nodules) | Absent; Present

Lung Parenchyma Analysis

Category | Feature | Value
 | Emphysema | Absent; Present
 | Primary emphysema Pattern | Centrilobular; Pan-acinar; Paraseptal; Paracicatricial; NA
 | Primary Distribution | Upper predominant; Middle Predominant; Lower Predominant; Diffuse, no predominance; Patchy, no predominance; NA or Unable to determine
 | Primary Emphysema Laterality | Right; Left; Both
 | Secondary Emphysema Pattern | Centrilobular; Pan-acinar; Paraseptal; Paracicatricial; NA
 | Secondary Emphysema Distribution | Upper predominant; Middle Predominant; Lower Predominant; Diffuse, no predominance; Patchy, no predominance; NA or Unable to determine
 | Secondary emphysema laterality | Right; Left; Both
 | Overall Emphysema Severity | None; Low (1-25%); Moderate (26-50%); Moderately High (51-75%); High (> 75%)
Lung Features | Airway Abnormalities | Absent; Present
 | Bronchial wall thickening | Absent; Present
 | Airway ectasia (mild luminal enlargement) | Absent; Present
 | Bronchiectasis (moderate enlargement) | Absent; Present
 | Luminal narrowing | Absent; Present
 | Bronchiolar prominence | Absent; Present
 | Tree-in-Bud (airway secretions) | Absent; Present
 | Mosaic oligemia | Absent; Present
 | Fibrosis | Absent; Present
 | Anatomic Fibrosis Distribution | Apical; Upper predominant; Middle Predominant; Lower Predominant; Diffuse, no predominance; Patchy, no predominance; Unable to determine
 | Axial Fibrosis Distribution | Subpleural; Bronchovascular; Both 1 & 2; Random
 | Fibrosis Type | Usual Interstitial Pneumonia (UIP); Nonspecific Interstitial Pneumonia (NSIP); Hypersensitivity Pneumonitis (HP); Sarcoidosis; Smoking-related; Post-infectious (include Oesophago-Gastro-Duodenoscopy OGD); Other (specify); Indeterminate
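The conditional rule described for the template (the primary emphysema pattern is recorded only when emphysema is present) can be expressed as a small consistency check. This is a sketch under stated assumptions: the annotation is represented as a plain dictionary keyed by the feature names in Table 7, which is a hypothetical in-memory form and not the AIM file format, and `check_emphysema_fields` is an illustrative name rather than an ePAD function.

```python
def check_emphysema_fields(annotation):
    """Return a list of problems with the emphysema fields of one annotation."""
    problems = []
    emphysema = annotation.get("Emphysema")
    pattern = annotation.get("Primary emphysema Pattern")
    if emphysema not in ("Absent", "Present"):
        problems.append("Emphysema must be 'Absent' or 'Present'")
    elif emphysema == "Present" and pattern is None:
        problems.append("Primary emphysema Pattern is expected when emphysema is present")
    elif emphysema == "Absent" and pattern not in (None, "NA"):
        problems.append("Primary emphysema Pattern should not be recorded when emphysema is absent")
    return problems


# Hypothetical annotations illustrating the rule.
consistent = {"Emphysema": "Present", "Primary emphysema Pattern": "Centrilobular"}
incomplete = {"Emphysema": "Present"}
print(check_emphysema_fields(consistent), check_emphysema_fields(incomplete))
```

The same pattern would extend to other dependent features, such as the fibrosis distribution fields, if they follow an analogous rule.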

Segmentations

Initial segmentations for 144 subjects were obtained from an axial CT image series using an unpublished automatic segmentation algorithm. All of these segmentations were viewed by a thoracic radiologist (M.K.) with more than 5 years of experience and edited as necessary using ePAD. Final segmentations were reviewed by an additional thoracic radiologist (A.N.L.); disagreements in tumor boundaries were discussed and edited as appropriate, with final approval by A.N.L. All segmentations are stored as DICOM Segmentation Objects[28].

Molecular Data

Tumor Preparation

All tumor samples were collected from treatment-naïve subjects during the surgical procedure. Following excision, the surgeon cut a 3–5-mm-thick slice along the longest axis of the excised tissue, which was frozen within 30 minutes of excision. It was later retrieved for RNA extraction. Molecular data are available from EGFR, KRAS, and ALK mutational testing, gene expression microarrays, and RNA sequencing. Tumors from 17 subjects were analyzed using both gene expression microarrays and RNA sequencing.

Mutational testing

EGFR, KRAS and ALK mutation status are available from clinical records in 206, 205, and 196 subjects, respectively. Single nucleotide mutation detection was performed using SNaPshot technology based on dideoxy single-base extension of oligonucleotide primers after multiplex polymerase chain reaction (PCR). Exons 18, 19, 20 and 21 were tested for EGFR mutations. Exon 2 Positions 12 and 13 were tested for missense KRAS mutations with amino acid substitution. Mutation results were a combination of mutation at any location of the tested exons. For ALK, EML4-ALK translocation detection test was performed using fluorescence in situ hybridization (FISH).
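The statement that mutation results combine mutations at any location of the tested exons amounts to a logical OR over the tested sites. A minimal sketch follows, assuming per-site results are available as booleans; the site labels, the status strings "Mutant" and "Wildtype", and the function name are hypothetical and may differ from the labels used in the clinical data file.

```python
from typing import Mapping


def combined_status(per_site: Mapping[str, bool]) -> str:
    """Report 'Mutant' if any tested site carries a mutation, else 'Wildtype'."""
    return "Mutant" if any(per_site.values()) else "Wildtype"


# Hypothetical per-site results for one subject.
egfr_sites = {"exon18": False, "exon19": True, "exon20": False, "exon21": False}
kras_sites = {"exon2_position12": False, "exon2_position13": False}

print(combined_status(egfr_sites), combined_status(kras_sites))
```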

Gene Expression Microarray Data

Gene expression microarray data were collected for a subset of 26 subjects who underwent surgical treatment between April 7, 2008 and May 21, 2010. RNA was processed at the Stanford Functional Genomics Facility using Illumina Whole Genome Bead Chips (Human HT-12; Illumina, San Diego, CA). These data were preprocessed as follows: First, we filtered the microarray probes on the basis of a significant detection call in at least 60% of the samples. Next, we log transformed the microarray data and used quantile normalization to normalize between arrays. These data, along with the corresponding CT images, were used to describe associations between image features, gene expression, and survival[10,29].
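The three preprocessing steps described above (probe filtering by detection call, log transformation, quantile normalization) can be sketched in plain Python. This is not the authors' pipeline: the detection threshold of p < 0.05, the base-2 logarithm, the tie handling, and the toy values are all assumptions made for illustration.

```python
import math


def filter_probes(expr, pvals, alpha=0.05, min_fraction=0.6):
    """Keep probes detected (p < alpha, an assumed cutoff) in >= min_fraction of samples."""
    kept = {}
    for probe, values in expr.items():
        detected = sum(p < alpha for p in pvals[probe])
        if detected / len(values) >= min_fraction:
            kept[probe] = values
    return kept


def log2_transform(expr):
    """Apply a base-2 logarithm to every intensity."""
    return {probe: [math.log2(v) for v in values] for probe, values in expr.items()}


def quantile_normalize(expr):
    """Give every sample the same distribution: the mean of the sorted samples."""
    if not expr:
        return {}
    probes = list(expr)
    n_samples = len(expr[probes[0]])
    columns = [[expr[p][j] for p in probes] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(col[r] for col in sorted_cols) / n_samples
                 for r in range(len(probes))]
    normalized = {p: [0.0] * n_samples for p in probes}
    for j, col in enumerate(columns):
        # Ties are broken by probe order (a simplification).
        order = sorted(range(len(probes)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[probes[i]][j] = reference[rank]
    return normalized


# Toy data: 4 probes x 3 samples (intensities and detection p-values).
expr = {"p1": [8, 16, 32], "p2": [4, 4, 8], "p3": [2, 64, 16], "p4": [128, 256, 512]}
pvals = {"p1": [0.01, 0.01, 0.2], "p2": [0.01, 0.5, 0.5],
         "p3": [0.001, 0.02, 0.03], "p4": [0.3, 0.01, 0.01]}

normalized = quantile_normalize(log2_transform(filter_probes(expr, pvals)))
print(normalized)
```

In the toy data, probe `p2` is detected in only one of three samples and is dropped; after normalization every sample contains the same set of values, assigned according to each probe's rank within that sample.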

RNA Sequencing Data

Based on availability and quality of available tissue, RNA sequencing was performed on samples from 130 subjects (17 of which intersect with the gene expression microarray dataset described in the previous section). We excluded RNASeq for tissue samples with RNA integrity number (RIN) below 2.5. Total RNA was extracted from the tissue samples and converted into a library for paired-end sequencing on Illumina HiSeq according to the protocol for the Illumina TruSeq Sample preparation kit (Centrillion Biosciences, Palo Alto, CA). Briefly, total RNA quality and quantity were measured by BioAnalyzer (Agilent). For library preparation, the TruSeq Total Stranded RNA with Ribo-Zero Reduction (Illumina) was used following the manufacturer’s instructions. This method includes a Ribo-Zero rRNA depletion step, followed by fragmentation and cDNA synthesis using SuperScript II (Life Technologies). The cDNA was A-tailed, ligated and amplified using the materials in the TruSeq Total Stranded RNA with Ribo-Zero Reduction kit. Quality was confirmed using the BioAnalyzer, and finally the concentration was evaluated by KAPA qPCR (KAPA Biosystems). Prior to sequencing, samples were diluted to 4 nmol and pooled. Pooled libraries were clustered via the cBOT and sequenced on the HiSeq 2500 (Illumina) following the manufacturer’s instructions. The set of 130 tissue samples was sequenced in three batches of sizes 16, 66, and 48. Data processing was performed by Centrillion Biosciences as follows: reads were aligned to the human genome (hg19) using the alignment algorithm STAR[43] version 2.3 with 91 bases of splice junction overhangs. Next, Cufflinks version 2.0.2[44] was used to determine the expression calls in each sample using Fragments Per Kilobase of transcript per Million mapped reads (FPKM).
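For readers unfamiliar with the FPKM unit, the sketch below computes it from its textbook definition: fragment count scaled by transcript length in kilobases and by library size in millions. It only illustrates the unit; Cufflinks estimates transcript abundances with its own statistical model, which this function does not reproduce. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    # Equivalent to fragments / (length / 1e3) / (total / 1e6).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000-base transcript
# in a library of 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)  # 12.5
```

Because the value is scaled by both transcript length and library size, it is comparable across genes within a sample, while comparisons across the three sequencing batches may still call for batch-aware analysis.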

Data Records

Subject Identifiers

Each subject has a unique identifier that is identical in all four public data records in this dataset. Subject IDs are 6-digit numbers in the form of R01-XXXXXX or AMC-XXXXXX.

Data Record 1

Clinical, image, and semantic data for all subjects are stored in The Cancer Imaging Archive (TCIA) (Data Citation 1). One comma-delimited file contains clinical data for all subjects, keyed by the unique subject identifiers. Semantic features for each subject are stored in Annotation and Image Markup (AIM) files[45]. CT and PET/CT images are in DICOM format. Where available, segmentations are provided as DICOM Segmentation Objects.

Data Record 2

Image data of 26 subjects had been previously deposited in the TCIA repository (Data Citation 2). These images were given new subject names in the form R01-XXXXXX as part of the new dataset described in this work.

Data Record 3

Gene expression microarray data, available for 26 subjects, were deposited in the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO)[46] (Data Citation 3). The subject identifiers are identical to the subject names in Data Record 2. Processed gene clusters were deposited in tab-delimited files whose three columns correspond to the microarray ID, the log2-transformed quantile-normalized value, and the probe detection p-value, respectively. This data record also contains raw expression data, as well as matrix data obtained prior to normalization.
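A minimal sketch of reading one of these three-column tab-delimited files is shown below. It assumes only the column order described above; whether a header row is present is not stated, so non-numeric rows are skipped. The function name, probe names and values are hypothetical.

```python
import csv
from typing import Dict, Iterable, Tuple


def read_processed_microarray(lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map microarray ID -> (log2 normalized value, detection p-value).

    Assumes three tab-separated columns in the order given in the text.
    Rows whose second or third field is not numeric (e.g. a header) are skipped.
    """
    table: Dict[str, Tuple[float, float]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue
        try:
            table[row[0]] = (float(row[1]), float(row[2]))
        except ValueError:
            continue  # header or malformed row
    return table


# Hypothetical file content, for illustration only.
sample_lines = [
    "ID_REF\tVALUE\tDETECTION_PVAL",
    "probe_A\t7.25\t0.001",
    "probe_B\t3.5\t0.4",
]
parsed = read_processed_microarray(sample_lines)
# Keep probes whose detection p-value is below an illustrative 0.05 cutoff.
detected = {probe: value for probe, (value, pval) in parsed.items() if pval < 0.05}
print(detected)
```

In practice the same function can be given an open file handle, since `csv.reader` accepts any iterable of lines.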

Data Record 4

Raw and processed sequencing data obtained from RNASeq for 130 subjects are available at NCBI GEO (Data Citation 4). The subject IDs are identical to subject names in Data Record 1.
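Because the subject IDs are shared between Data Record 1 and Data Record 4, clinical rows can be matched to sequenced samples with a simple membership test. The sketch below assumes a clinical file with a header row; the column name `subject_id`, the other column, and all values are hypothetical placeholders rather than the actual file layout.

```python
import csv
from typing import Dict, Iterable, List, Set


def subjects_with_rnaseq(clinical_lines: Iterable[str],
                         rnaseq_ids: Set[str],
                         id_column: str = "subject_id") -> List[Dict[str, str]]:
    """Return clinical rows whose subject ID also appears in the RNASeq set.

    id_column is a placeholder; replace it with the identifier column
    actually used in the downloaded clinical file.
    """
    reader = csv.DictReader(clinical_lines)
    return [row for row in reader if row[id_column].strip() in rnaseq_ids]


# Hypothetical clinical rows and RNASeq sample IDs, for illustration only.
clinical = [
    "subject_id,age_at_diagnosis",
    "R01-000001,60",
    "R01-000002,55",
    "AMC-000003,71",
]
sequenced = {"R01-000001", "AMC-000003"}
matched = subjects_with_rnaseq(clinical, sequenced)
print([row["subject_id"] for row in matched])
```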

Technical Validation

All CT and PET/CT data were collected as part of patient care and therefore all quality assurance was performed by the institution that collected the data.

Usage Notes

All data are freely available to browse, download, and use for commercial, scientific and educational purposes as outlined in the Creative Commons Attribution 3.0 Unported License. Users should properly cite this source for any work based on this dataset.

Additional information

How to cite this article: Bakr, S. et al. A radiogenomic dataset of non-small cell lung cancer. Sci. Data. 5:180202 doi: 10.1038/sdata.2018.202 (2018). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.