John X Qiu, Hong-Jun Yoon, Kshitij Srivastava, Thomas P Watson, J Blair Christian, Arvind Ramanathan, Xiao C Wu, Paul A Fearn, Georgia D Tourassi.
Abstract
BACKGROUND: Deep Learning (DL) has advanced the state of the art in bioinformatics applications, driving a trend toward increasingly sophisticated and computationally demanding models trained on ever larger data sets. This growth in computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes with data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability improvements from data-parallel training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility. To evaluate scalability, we used different numbers of worker nodes and performed a set of experiments comparing the effects of different training batch sizes and optimizer functions.
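To make the data-parallel scheme named in the abstract more concrete, the following is a minimal single-process sketch of the Downpour SGD idea: a parameter server holds the global parameters, and each worker computes mini-batch gradients on its own data shard using a possibly stale local copy of those parameters. It is not the authors' implementation. The toy linear model, the names `ParameterServer`, `minibatch_gradient`, `downpour_sgd`, and the `n_fetch` parameter are all assumptions made for illustration, and asynchrony is only imitated by visiting workers in random order.

```python
import random


class ParameterServer:
    """Holds the global parameters and applies gradients as they arrive."""

    def __init__(self, n_params, lr):
        self.params = [0.0] * n_params
        self.lr = lr

    def pull(self):
        # Workers receive a copy, so their view can become stale.
        return list(self.params)

    def push(self, grad):
        # Plain SGD update applied immediately, without waiting for other workers.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


def minibatch_gradient(params, batch):
    """Gradient of the mean squared error for the toy model y = w * x + b."""
    w, b = params
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    n = len(batch)
    return [gw / n, gb / n]


def downpour_sgd(shards, steps, batch_size=8, n_fetch=2, lr=0.05, seed=0):
    """Simulate Downpour-style training with one worker per data shard."""
    rng = random.Random(seed)
    server = ParameterServer(2, lr)
    local = [server.pull() for _ in shards]
    for step in range(steps):
        # Random visiting order stands in for asynchronous gradient arrival.
        order = list(range(len(shards)))
        rng.shuffle(order)
        for k in order:
            if step % n_fetch == 0:
                # Refresh the local copy only every n_fetch steps.
                local[k] = server.pull()
            batch = rng.sample(shards[k], batch_size)
            # The gradient is computed on the (possibly stale) local copy.
            server.push(minibatch_gradient(local[k], batch))
    return server.params


# Toy data from y = 2x + 1, split across four hypothetical workers.
data_rng = random.Random(1)
data = [(x, 2.0 * x + 1.0) for x in (data_rng.uniform(-1, 1) for _ in range(200))]
shards = [data[i::4] for i in range(4)]
w, b = downpour_sgd(shards, steps=300)
```

In this sketch each worker pushes a gradient at every step and refreshes its copy every `n_fetch` steps. A real deployment would run the workers as separate processes on separate nodes and exchange parameters and gradients over the network.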
Year: 2018 PMID: 30577743 PMCID: PMC6302459 DOI: 10.1186/s12859-018-2511-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Primary cancer site codes and corresponding number of pathology reports associated with the code
| Code | # cases | Code | # cases | Code | # cases | Code | # cases |
|---|---|---|---|---|---|---|---|
| C00 | 16 | C17 | 156 | C41 | 88 | C64 | 458 |
| C01 | 53 | C18 | 1951 | C42 | 1800 | C65 | 20 |
| C02 | 108 | C19 | 118 | C44 | 1151 | C66 | 32 |
| C04 | 39 | C20 | 646 | C48 | 84 | C67 | 947 |
| C05 | 31 | C21 | 75 | C49 | 261 | C68 | 30 |
| C06 | 48 | C22 | 268 | C50 | 4415 | C69 | 18 |
| C07 | 66 | C23 | 27 | C51 | 97 | C70 | 17 |
| C08 | 16 | C24 | 26 | C52 | 39 | C71 | 296 |
| C09 | 71 | C25 | 151 | C53 | 314 | C72 | 36 |
| C10 | 27 | C26 | 32 | C54 | 882 | C73 | 305 |
| C11 | 43 | C30 | 47 | C55 | 174 | C74 | 10 |
| C12 | 11 | C31 | 15 | C56 | 448 | C75 | 22 |
| C13 | 14 | C32 | 240 | C57 | 83 | C76 | 196 |
| C14 | 24 | C34 | 1570 | C60 | 18 | C77 | 741 |
| C15 | 199 | C38 | 106 | C61 | 2313 | C80 | 962 |
| C16 | 427 | C40 | 14 | C62 | 60 | C90 | 24 |
Fig. 1 Cross-Entropy Validation Loss vs. Epoch by Mini-batch size
Fig. 2 Cross-Entropy Validation Loss vs. Elapsed Train Time by Mini-batch size
Fig. 3 Cross-Entropy Validation Loss vs. Elapsed Train Time by Optimizer Function
Fig. 4 10-Epoch Train Time vs. Number of Worker Nodes for Various Batch Sizes
Classification performance of Convolutional Neural Networks and Random Forest classifiers in Micro-F1 and Macro-F1 scores
| Classifier | Micro-F1 | Macro-F1 |
|---|---|---|
| CNN | 0.8425 | 0.5117 |
| Random Forest | 0.7632 | 0.3567 |
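The gap between the Micro-F1 and Macro-F1 columns follows from how the two averages treat rare classes. The sketch below is a plain-Python illustration on made-up labels, not the evaluation code used in the study. The function name `f1_scores` and the toy label lists are assumptions. Micro-F1 pools the counts over all classes, so frequent sites dominate; Macro-F1 averages the per-class F1 scores with equal weight, so each rare site that is never predicted correctly contributes a zero.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example: a frequent class predicted well, a rare class always missed.
y_true = ["C50"] * 8 + ["C08"] * 2
y_pred = ["C50"] * 10
micro, macro = f1_scores(y_true, y_pred)
```

On this toy input the rare class pulls the macro average well below the micro average, the same pattern as in the table above.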
Classification performance by primary site, with the number of test cases (support) for each site
| Site | Precision | Recall | F1-score | Support | Site | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|---|---|---|---|
| C00 | 1.00 | 0.19 | 0.32 | 16 | C41 | 0.51 | 0.24 | 0.33 | 88 |
| C01 | 0.62 | 0.57 | 0.59 | 53 | C42 | 0.92 | 0.95 | 0.94 | 1800 |
| C02 | 0.71 | 0.86 | 0.78 | 108 | C44 | 0.85 | 0.91 | 0.88 | 1150 |
| C04 | 0.69 | 0.64 | 0.67 | 39 | C48 | 0.36 | 0.10 | 0.15 | 84 |
| C05 | 0.62 | 0.42 | 0.50 | 31 | C49 | 0.40 | 0.44 | 0.42 | 261 |
| C06 | 0.38 | 0.35 | 0.37 | 48 | C50 | 0.94 | 0.97 | 0.95 | 4414 |
| C07 | 0.83 | 0.86 | 0.84 | 66 | C51 | 0.89 | 0.74 | 0.81 | 97 |
| C08 | 0.00 | 0.00 | 0.00 | 16 | C52 | 0.43 | 0.15 | 0.23 | 39 |
| C09 | 0.81 | 0.90 | 0.85 | 71 | C53 | 0.78 | 0.77 | 0.77 | 314 |
| C10 | 0.36 | 0.15 | 0.21 | 27 | C54 | 0.78 | 0.91 | 0.84 | 882 |
| C11 | 0.72 | 0.49 | 0.58 | 43 | C55 | 0.47 | 0.13 | 0.21 | 174 |
| C12 | 0.00 | 0.00 | 0.00 | 11 | C56 | 0.72 | 0.84 | 0.77 | 448 |
| C13 | 0.00 | 0.00 | 0.00 | 14 | C57 | 0.59 | 0.16 | 0.25 | 83 |
| C14 | 0.47 | 0.29 | 0.36 | 24 | C60 | 1.00 | 0.22 | 0.36 | 18 |
| C15 | 0.82 | 0.81 | 0.82 | 199 | C61 | 0.98 | 0.99 | 0.98 | 2313 |
| C16 | 0.82 | 0.81 | 0.81 | 427 | C62 | 0.98 | 0.92 | 0.95 | 60 |
| C17 | 0.71 | 0.54 | 0.62 | 156 | C64 | 0.89 | 0.93 | 0.91 | 458 |
| C18 | 0.85 | 0.90 | 0.87 | 1951 | C65 | 0.33 | 0.10 | 0.15 | 20 |
| C19 | 0.56 | 0.31 | 0.40 | 118 | C66 | 0.68 | 0.47 | 0.56 | 32 |
| C20 | 0.81 | 0.84 | 0.82 | 646 | C67 | 0.93 | 0.96 | 0.94 | 947 |
| C21 | 0.72 | 0.64 | 0.68 | 75 | C68 | 0.33 | 0.03 | 0.06 | 30 |
| C22 | 0.74 | 0.82 | 0.78 | 268 | C69 | 0.00 | 0.00 | 0.00 | 18 |
| C23 | 1.00 | 0.19 | 0.31 | 27 | C70 | 1.00 | 0.06 | 0.11 | 17 |
| C24 | 0.67 | 0.08 | 0.14 | 26 | C71 | 0.79 | 0.87 | 0.83 | 296 |
| C25 | 0.79 | 0.79 | 0.79 | 151 | C72 | 0.65 | 0.42 | 0.51 | 36 |
| C26 | 0.00 | 0.00 | 0.00 | 32 | C73 | 0.94 | 0.97 | 0.95 | 305 |
| C30 | 0.56 | 0.49 | 0.52 | 47 | C74 | 0.00 | 0.00 | 0.00 | 10 |
| C31 | 0.67 | 0.13 | 0.22 | 15 | C75 | 1.00 | 0.18 | 0.31 | 22 |
| C32 | 0.79 | 0.88 | 0.83 | 240 | C76 | 0.41 | 0.27 | 0.33 | 196 |
| C34 | 0.86 | 0.91 | 0.88 | 1569 | C77 | 0.65 | 0.65 | 0.65 | 741 |
| C38 | 0.62 | 0.49 | 0.55 | 106 | C80 | 0.51 | 0.48 | 0.50 | 962 |
| C40 | 0.00 | 0.00 | 0.00 | 14 | C90 | 0.00 | 0.00 | 0.00 | 24 |
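Each F1-score in the table is the harmonic mean of the precision and recall in the same row, which is why a site with perfect precision but low recall still scores poorly. A small sketch follows; the helper name `f1_from_pr` is an assumption. Because the table reports values rounded to two decimals, recomputing F1 from the rounded precision and recall can differ from the printed F1 by about 0.01.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table above: C00, C25, and C08.
f1_c00 = round(f1_from_pr(1.00, 0.19), 2)
f1_c25 = round(f1_from_pr(0.79, 0.79), 2)
f1_c08 = f1_from_pr(0.00, 0.00)
```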
Fig. 5 Support-normalized confusion matrix between the actual and predicted values from the CNN classifier for 64 primary cancer sites
Fig. 6 Master Node Validation Loss/Worker Node Training Loss vs Epoch with batch size 64