Nam Nhut Phan1,2,3, Amrita Chattopadhyay3, Tzu-Pin Lu3,4, Mong-Hsun Tsai3,5,6. 1. Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei. 2. Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei. 3. Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei. 4. Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei. 5. Institute of Biotechnology, National Taiwan University, Taipei. 6. Center of Biotechnology, National Taiwan University, Taipei.
Buzzwords indicate popular trends in research fields. These terms might last for decades or perish in just a few years (1). Over the last ten years, we have witnessed the rise of one big buzzword: deep learning (DL) (2-7). In brief, DL is a sub-domain of artificial intelligence (AI) and a type of representation learning, which automatically finds features in data and transforms them into more abstract representations through matrix operations (3). There are various types of DL algorithms, such as convolutional neural networks (8), recurrent neural networks (9), long short-term memory networks (10), convolutional deep belief networks (11), generative adversarial networks (12), and deep residual networks (13), to name just a few. Depending on the specific task or problem, one could use these networks individually or combine them into a pipeline. The biggest advantage of DL algorithms is that they can be trained without pre-defined features or variables. This is especially convenient for complicated data types, such as biomedical images or sequencing data, for which manual feature selection is time-consuming, computationally expensive, and dependent on a high level of human expertise (3). On the other hand, high-end hardware such as graphics processing units, central processing units, and random-access memory is needed to process such data and train models within a reasonable amount of time.
A growing body of research applies neural networks to problems in the biomedical field, across diverse research topics that commonly leverage big data. These data include biomedical images and multi-omics datasets, either from the public domain or collected in-house from different populations (6,14-16). Biomedical images can be in 2-dimensional (2D) format, such as pathological images, or in 3-dimensional (3D) format, such as mammography images, computed tomography scans, and magnetic resonance imaging (17-22).
A single scanned image can be split into hundreds to several thousands of smaller images, which helps to meet the data demands of neural network training. The data formats for multi-omics data are even more complicated and are highly dependent on the manufacturing platforms. Omics data, such as genomics (sequencing data) (23), transcriptomics (sequencing and expression data) (24,25), proteomics (mass spectrometry data) (26), and metabolomics (metabolite compounds) (27), can be used for DL models as long as the numbers of samples and features are suitable for training and acceptable accuracy can be achieved. From only a single run, these high-throughput platforms can generate thousands to millions of data points from each sample. Integrating these data types could provide unprecedentedly comprehensive data for studying complicated diseases such as cancer (28) or human brain diseases (29). Therefore, this is a golden era for data-driven research, owing not only to the huge number of publicly available datasets, but also to the rapid development of modern algorithms and to the platforms and cloud computing services of giant technology corporations such as Google (TensorFlow and Colab cloud computing) (30-32), Amazon (Amazon Web Services) (33), and Facebook (PyTorch) (34). With such favorable conditions and the open-source environments available to the DL community, it is inevitable that biomedical researchers are starting to enter the DL race.
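The splitting of a single scanned image into many smaller images, mentioned above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from any of the cited studies: the function name `tile_boxes`, the 256-pixel tile size, and the 2048 x 3072 image dimensions are all hypothetical. The sketch only computes tile coordinates; reading the pixels for each box would be left to whichever imaging library is in use.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right) in pixels; bottom and right are exclusive
Box = Tuple[int, int, int, int]


def tile_boxes(height: int, width: int, tile_size: int,
               stride: Optional[int] = None) -> List[Box]:
    """Return the coordinates of square tiles covering an image.

    Hypothetical helper for illustration. Tiles that would extend past
    the image border are dropped rather than padded.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    if stride is None:
        stride = tile_size  # default: non-overlapping tiles
    if stride <= 0:
        raise ValueError("stride must be positive")

    boxes: List[Box] = []
    for top in range(0, height - tile_size + 1, stride):
        for left in range(0, width - tile_size + 1, stride):
            boxes.append((top, left, top + tile_size, left + tile_size))
    return boxes


# Assumed example: a 2048 x 3072 pixel image cut into 256 x 256 tiles
boxes = tile_boxes(height=2048, width=3072, tile_size=256)
print(len(boxes))  # 96 (8 rows x 12 columns)
```

Passing a stride smaller than the tile size produces overlapping tiles, which increases the number of training images obtained from the same scan at the cost of correlated samples.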
Moreover, several databases house a huge number of biomedical images, such as the National Cancer Institute’s GDC Data Portal (https://portal.gdc.cancer.gov), the National Institutes of Health Database (https://nihcc.app.box.com/v/ChestXray-NIHCC), the Cancer Imaging Archive (https://www.cancerimagingarchive.net), NLM’s MedPix database (https://medpix.nlm.nih.gov/home), the Open Access Series of Imaging Studies (OASIS) (http://www.oasis-brains.org), the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (http://adni.loni.usc.edu), and Stanford’s AI in Medicine database (AIMI) (https://aimi.stanford.edu/research/public-datasets); all of these could be of immense advantage to the DL community. These databases are maintained and continuously updated with additional samples and data types, and they play a central role in DL studies because of their well-structured and diverse disease sources. For instance, the GDC Data Portal can provide whole exome sequencing data, targeted sequencing data, RNA-sequencing data, genotype data, tissue and diagnostic slides, whole genome data, and ATAC-seq data. Not all of these data are open access, but researchers can apply for access to the controlled portions. However, model training on such large datasets requires data labeling and annotation, which are time-consuming and sometimes expensive, so there are still barriers to using all of the available data.
Many DL publications describe well-annotated datasets; however, gaining access to these resources is usually difficult. In-house datasets that have been pre-annotated by experts therefore remain in demand across the healthcare research community. As public domain data are usually specific to ethnic groups or local populations, in-house datasets from other ethnicities could serve as external validation resources to guard against model bias toward particular datasets.
That would ultimately make the pre-trained model more useful across populations.
Clinical application is the ultimate goal of biomedical research. Therefore, the questions or hypotheses that researchers aim to address with DL, leveraging all ready-to-use data and resources, should be of the utmost clinical importance. This is what leads to properly designed models that represent complex real-life data and can potentially provide data-driven information for clinical research. All of this requires close collaboration between laboratory researchers and medical doctors to understand the current needs in each specific disease and to successfully translate findings from the laboratory bench to the clinic.