Ivan R Vogelius, Jens Petersen, Søren M Bentzen.
Abstract
Radiation oncology, a major treatment modality in the care of patients with malignant disease, is a technology- and computer-intensive medical specialty. As such, it should lend itself ideally to data science methods, where computer science, statistics, and clinical knowledge are combined to advance state-of-the-art care. Nevertheless, data science methods in radiation oncology research are still in their infancy and successful applications leading to improved patient care remain scarce. Here, we discuss data interoperability issues within and across organizational boundaries that hamper the introduction of big data and data science techniques in radiation oncology. At the semantic level, creating common underlying models and codification of the data, including the use of data elements with standardized definitions, an ontology, remains a work in progress. Methodological issues in data science and in the use of large population-based health data registries are identified. We show that data science methods and big data cannot replace randomized clinical trials in comparative effectiveness research by reviewing a series of instances where the outcomes of big data analyses and randomized trials are at odds. We also discuss the modern wave of machine learning and artificial intelligence as represented by deep learning and convolutional neural networks. Finally, we identify promising research avenues and remain optimistic that the data sources in radiation oncology can be linked to yield important insights in the near future. We argue that data science will be a valuable complement to, but not a replacement of, the traditional hypothesis-driven translational research chain and the randomized clinical trials that form the backbone of evidence-based medicine.
Keywords: artificial intelligence; data science; radiotherapy
Year: 2020 PMID: 32255249 PMCID: PMC7332210 DOI: 10.1002/1878-0261.12685
Source DB: PubMed Journal: Mol Oncol ISSN: 1574-7891 Impact factor: 6.603
Fig. 1 The translational research chain versus the data science approach.
Fig. 2 Key sources of data and points of radiation data loss. From lower left and counterclockwise, radiation dose plan banks are currently available in all modern institutions as record‐and‐verify systems that contain image data and 3D radiation dose exposure, but no follow‐up data. Hospital‐based EHR systems contain more detail on other treatments, but do not contain detailed, granular radiotherapy data; such data are often reduced to prescription dose/fractionation. Radiotherapy data are further lost when these data are moved to large claims‐based registries or tumor registries, where data on various aspects of long‐term outcome are available. Red boxes: examples of main shortcomings of data source. Green boxes: examples of main strengths.
Fig. 3 Schematic illustration of loss of granularity of radiotherapy data when moving from single institutional series to the largest of available datasets. This graph provides a schematic illustration of the level of radiotherapy data granularity versus sample sizes across selected published studies and available databases in the United States as examples. QUANTEC (Marks et al., 2010a) is an example of a federated learning model that aims to bridge institutional series (Defraene et al., 2019). Also shown is a population‐based series of breast cancer patients without detailed dosimetry, but with information available about whether internal mammary nodes were included in the radiotherapy target (Thorsen et al., 2016). The graph also shows the number of patients in randomized trials of external beam radiotherapy (EBRT) for prostate cancer (Vogelius and Bentzen, 2018; Widmark et al., 2019), the number of patients in Radiation Therapy Oncology Group (RTOG) trial databases (personal communication), and the number of breast cancer patients in the National Cancer Database (NCDB) and Medicare. Where long‐term outcomes are available in the large series (to the right), radiotherapy information is often reduced to one bit of information (radiotherapy given or not) in these studies (McGale et al., 2016). Abbreviations: QUANTEC, Quantitative Analyses of Normal tissue Effects in the Clinic; RT, radiotherapy; RCT, randomized controlled trial.
Fig. 4 A comparison of effect‐size estimates from randomized controlled trials and registry‐based analyses. The schematic shows published effect‐size estimates from randomized controlled trials (x‐axis) and registry‐based analyses (y‐axis). Concordant effect sizes are indicated by the black identity line. We see examples of registry‐based studies over‐ and underestimating effects, as well as being relatively in agreement. Data are collected from Ang et al. (2014), McGale et al. (2016), Pignon et al. (2009), and Zandberg et al. (2018), and reanalyzed/replotted for this analysis by SMB and IRV. CRT, chemoradiotherapy; CTX‐RT, cetuximab + RT; DSS, disease‐specific survival; HNSCC, head‐and‐neck squamous cell cancer; OS, overall survival.
Glossary of data science terms.
| Application programming interface (API) | Communication protocol that allows external communication with software or server. In this field, APIs allow researchers to write code (scripts) to query radiotherapy databases to extract features from (large numbers of) individual patients' scan or dosimetry data |
| Artificial neural networks (ANNs) | An ANN is a network of artificial neurons, connected such that output from a given neuron forms the input to one or more neurons in the next ‘layer’. Passing input data through many successive such layers allows for complex transformations, that is, complex mathematical functions that link a set of inputs to a specific output |
| Artificial neuron | The artificial neuron is the basic building block of an ANN. It is a mathematical function that takes multiple real‐valued inputs, each of which is multiplied by a weight. These weighted inputs are then summed and put into a so‐called activation function that outputs a real value. The activation function is typically a nonlinear function, for example, a sigmoid function |
| Deep learning (DL) | A type of learning that uses multiple ANN layers to progressively extract higher level features from the raw input |
| Federated learning (a.k.a., distributed learning) | This approach entails training a model simultaneously on several datasets that reside on different servers while communicating model data (such as goodness‐of‐fit data and regression coefficients) rather than exchanging the data itself |
| Generalization error | The amount of error a prediction algorithm makes on a set of previously unseen data, quantified by suitable performance metrics |
| High‐dimensionality datasets | This is a general term used to describe datasets that contain large numbers of features per patient, including genomic data and image features |
| Machine learning (ML) | The study of how computers learn from data to solve problems. ML is also used to refer to algorithms or systems that learn from data how to solve a task, as opposed to being explicitly programmed how to do so |
| Multiple‐layered network/deep neural networks | These are neural networks that consist of many layers of neurons between the input and output, such that the output of one layer becomes input for the next |
| Ontology | Representation, formal naming, and definition of the data in a field of research; examples include tumor characteristics (e.g., UICC staging), organ delineation/naming, dose descriptors, and disease/procedural codes |
| Record‐and‐verify databases | Databases that were originally invented to document treatments and reduce risk of errors, and that have evolved into complete information systems that contain image data, planned dose matrices, and detailed delivery data. They usually have some sort of application programming interface |
| Semisupervised learning | Machine learning from input data, where only a subset of input data is paired with output data, that is, an approach that mixes supervised and unsupervised learning |
| Single‐layer model | This term describes conventional regression models that could be seen to provide a ‘single layer’: In these models, a single mathematical descriptor (e.g., logistic function or Cox model) connects input data to outcome prediction. It is used for illustration here, but it is not an often‐used term |
| Supervised learning | The task of learning a function that maps an input to an output, based on example input–output pairs. Regression models are examples of this approach |
| Tall datasets | These are ‘Big data’ datasets where the number of cases (individuals, patients) is much larger than the number of features per case. Examples are population‐based cancer registries or claims databases |
| Unsupervised learning | This approach finds patterns in datasets without preexisting labels, that is, based solely on the structure of the input data, which is also known as self‐organization. Hierarchical clustering is an example of such a method |
| Wide datasets | ‘Big data’ datasets, in which the number of features (data items) per case is much larger than the number of cases. Examples include data from genomics or proteomics or from medical imaging |
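The glossary entries for 'Artificial neuron', 'ANN', and 'Multiple‐layered network' above describe a weighted sum passed through a nonlinear activation, with each layer's outputs feeding the next layer. A minimal sketch in Python, assuming illustrative weights and a sigmoid activation (none of these values come from the article):

```python
import math

def neuron(inputs, weights, bias):
    """Artificial neuron: each input is multiplied by a weight,
    the weighted inputs are summed, and the sum is passed through
    a nonlinear activation function (here, a sigmoid)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation, output in (0, 1)

def layer(inputs, weight_matrix, biases):
    """One ANN layer: every neuron in the layer receives the same
    inputs; the layer's outputs form the inputs to the next layer."""
    return [neuron(inputs, w, b) for w, b in zip(weight_matrix, biases)]

# Toy two-layer network: a hidden layer of two neurons feeding
# a single output neuron (all weights/biases chosen arbitrarily).
hidden = layer([0.5, -1.2], [[1.0, 0.3], [-0.7, 2.0]], [0.1, -0.2])
output = neuron(hidden, [1.5, -1.0], 0.0)
```

Stacking many such layers is what the 'deep' in deep learning refers to; a conventional regression model corresponds to the 'single‐layer model' of the glossary, with one such mapping from inputs to outcome.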
Fig. 5 A DL algorithm to analyze radiation exposure in a routine clinical setting. Output from a DL algorithm in the form of a 3D U‐Net architecture (Ronneberger et al., 2015). The algorithm was trained on a dataset of manual annotations of lung substructures (vessels and airways). Subsequently, the DL algorithm performed annotations on a previously unseen routine planning CT scan from a record‐and‐verify system, yielding the airway and vessel annotations in green shades depicted on the left. Right: The annotated CT scan is overlaid with a 3D dose color wash to show the potential of automating access to detailed radiotherapy exposure data, which may subsequently be connected to outcome data from the larger registries. Note that some of the smaller vessels are exaggerated in size due to partial volume effects.