Yuting Xing, Chengkun Wu, Xi Yang, Wei Wang, En Zhu, Jianping Yin.
Abstract
A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods to unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, using three types of named entity recognition (NER) tasks as a demonstration. Results show that, in most cases, parallel processing greatly improves processing efficiency, and that the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to biomedical text mining tasks other than NER.
Keywords: Tianhe-2; big data; biomedical text mining; load balancing; parallel computing
Year: 2018 PMID: 29702574 PMCID: PMC6099625 DOI: 10.3390/molecules23051028
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
The system configuration of Tianhe-2.
| Items | Content |
|---|---|
| Manufacturer | NUDT |
| Cores | 3,120,000 |
| Memory | 1,024,000 GB |
| CPU | Intel Xeon E5-2692v2 12C 2.2 GHz |
| Interconnect | TH Express-2 |
| Linpack Performance (Rmax) | 33,862.7 TFlop/s |
| Theoretical Peak (Rpeak) | 54,902.4 TFlop/s |
| HPCG [TFlop/s] | 580.109 |
| Operating System | Kylin Linux |
| MPI | MPICH2 with a customized GLEX channel |
Figure 1. The time cost of processing different numbers of input articles in serial.
Time distribution of processing different numbers of input articles in serial.
| Number of Papers | Distribution (s) | GNormPlus (s) | tmVar (s) | DNorm (s) | Total (s) |
|---|---|---|---|---|---|
| 10 | 2.11 | 1742.16 | 416.71 | 58.29 | 2219.27 |
| 20 | 4.2 | 2764.13 | 444.86 | 65.14 | 3278.33 |
| 40 | 7.76 | 4865.14 | 488.08 | 83.19 | 5444.17 |
| 80 | 17.17 | 7864.37 | 744.86 | 177.45 | 8803.85 |
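The arithmetic behind this table can be checked with a short sketch: the four stage times in each row add up to the reported total, and GNormPlus accounts for between about 78% and 90% of every run. The constant and function names below are illustrative and not part of paraBTM; the numbers are copied from the table.

```python
# Stage timings in seconds, copied from the serial-timing table, keyed by number of papers.
# STAGES, SERIAL_TIMES and stage_shares are hypothetical names used only for this sketch.
STAGES = ("Distribution", "GNormPlus", "tmVar", "DNorm")
SERIAL_TIMES = {
    10: (2.11, 1742.16, 416.71, 58.29),
    20: (4.2, 2764.13, 444.86, 65.14),
    40: (7.76, 4865.14, 488.08, 83.19),
    80: (17.17, 7864.37, 744.86, 177.45),
}


def stage_shares(times):
    """Return each stage's fraction of the summed time for one row."""
    total = sum(times)
    return {name: t / total for name, t in zip(STAGES, times)}


for n_papers, times in SERIAL_TIMES.items():
    shares = stage_shares(times)
    formatted = {name: f"{share:.1%}" for name, share in shares.items()}
    print(n_papers, round(sum(times), 2), formatted)
```

Because one stage dominates the serial cost, that stage is the natural place to look when judging how much parallel processing can help.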
Figure 2. The time cost of processing different input sizes in serial.
Time cost distribution of processing different input sizes in serial.
| Size of Papers (M) | Distribution (s) | GNormPlus (s) | tmVar (s) | DNorm (s) | Total (s) |
|---|---|---|---|---|---|
| 1 | 1.93 | 1511.35 | 239.74 | 59.51 | 1812.53 |
| 2 | 4.80 | 2696.64 | 319.31 | 62.94 | 3083.69 |
| 4 | 16.24 | 4944.35 | 633.57 | 86.24 | 5680.40 |
| 8 | 21.77 | 8604.43 | 985.63 | 130.33 | 9742.16 |
Figure 3. Effects of different load balancing strategies.
Profiling of different load balancing strategies.
(a) Maximum times on different numbers of parallel processes with different strategies.
| Number of Processes | Strategy 1 | Strategy 2 | Strategy 3 |
|---|---|---|---|
| 2 | 8676.60 | 8651.42 | 8466.89 |
| 4 | 4466.81 | 5211.96 | 4596.09 |
| 8 | 2792.29 | 2706.41 | 2386.45 |
| 16 | 1737.94 | 1457.01 | 1475.07 |
| 32 | 932.85 | 942.10 | 930.25 |
| 64 | 609.06 | 579.8 | 553.42 |
(b) Average times on different numbers of parallel processes with different strategies.
| Number of Processes | Strategy 1 | Strategy 2 | Strategy 3 |
|---|---|---|---|
| 2 | 8513.10 | 8374.08 | 8392.46 |
| 4 | 4001.38 | 4535.42 | 4393.07 |
| 8 | 2287.19 | 2354.14 | 2242.42 |
| 16 | 1265.46 | 1174.12 | 1270.33 |
| 32 | 624.73 | 666.91 | 640.63 |
| 64 | 379.08 | 376.06 | 366.84 |
(c) Load balancing efficiencies on different numbers of parallel processes with different strategies.
| Number of Processes | Strategy 1 | Strategy 2 | Strategy 3 |
|---|---|---|---|
| 2 | 0.98 | 0.97 | 0.99 |
| 4 | 0.90 | 0.87 | 0.96 |
| 8 | 0.82 | 0.87 | 0.94 |
| 16 | 0.73 | 0.81 | 0.86 |
| 32 | 0.67 | 0.71 | 0.69 |
| 64 | 0.62 | 0.65 | 0.66 |
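The values in part (c) are consistent with defining load balancing efficiency as the average per-process time divided by the maximum per-process time, so a perfectly balanced run scores 1.0. The sketch below recomputes part (c) from parts (a) and (b) under that assumption. The strategy names do not survive in the table, so the three strategy columns are addressed by position only, and all identifiers are illustrative.

```python
# Rows follow the process counts 2, 4, 8, 16, 32, 64; the three columns are the
# three strategies in table order (their names are not recoverable from the table).
PROCESSES = (2, 4, 8, 16, 32, 64)
MAX_TIMES = [
    (8676.60, 8651.42, 8466.89),
    (4466.81, 5211.96, 4596.09),
    (2792.29, 2706.41, 2386.45),
    (1737.94, 1457.01, 1475.07),
    (932.85, 942.10, 930.25),
    (609.06, 579.8, 553.42),
]
AVG_TIMES = [
    (8513.10, 8374.08, 8392.46),
    (4001.38, 4535.42, 4393.07),
    (2287.19, 2354.14, 2242.42),
    (1265.46, 1174.12, 1270.33),
    (624.73, 666.91, 640.63),
    (379.08, 376.06, 366.84),
]
REPORTED = [
    (0.98, 0.97, 0.99),
    (0.90, 0.87, 0.96),
    (0.82, 0.87, 0.94),
    (0.73, 0.81, 0.86),
    (0.67, 0.71, 0.69),
    (0.62, 0.65, 0.66),
]


def balance_efficiency(avg_time, max_time):
    """Assumed definition: average per-process time over the slowest process's time."""
    return avg_time / max_time


computed = [
    [round(balance_efficiency(a, m), 2) for a, m in zip(avg_row, max_row)]
    for avg_row, max_row in zip(AVG_TIMES, MAX_TIMES)
]

for n_procs, row in zip(PROCESSES, computed):
    print(n_procs, row)
```

Under this reading, efficiency falls as the process count grows because the slowest process increasingly determines the wall-clock time.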
Figure 4. Load balancing efficiencies.
The processing time of 61,078 papers running on 128 processes.
| Number of Processes | Distribution (s) | GNormPlus (s) | tmVar2.0 (s) | DNorm (s) | Total (s) |
|---|---|---|---|---|---|
| 1 | 18,934.18 | 5,874,482.04 | 654,145.38 | 82,455.3 | 6,630,016.9 |
| 128 | 3643 | 23,733 | 16,214 | 233 | 43,823 |
| Speed-up (x) | 5.20 | 247.52 | 40.34 | 353.89 | 151.29 |
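The speed-up row follows directly from the two timing rows: each entry is the single-process time divided by the 128-process time for the same stage. A minimal sketch that reproduces it is given below; the dictionary and function names are illustrative, and the numbers are copied from the table.

```python
# Times in seconds for 61,078 papers, copied from the table above.
SERIAL = {
    "Distribution": 18934.18,
    "GNormPlus": 5874482.04,
    "tmVar2.0": 654145.38,
    "DNorm": 82455.3,
    "Total": 6630016.9,
}
PARALLEL_128 = {
    "Distribution": 3643,
    "GNormPlus": 23733,
    "tmVar2.0": 16214,
    "DNorm": 233,
    "Total": 43823,
}


def speed_up(serial_seconds, parallel_seconds):
    """Speed-up is the serial time divided by the parallel time."""
    return serial_seconds / parallel_seconds


speedups = {
    stage: round(speed_up(SERIAL[stage], PARALLEL_128[stage]), 2)
    for stage in SERIAL
}

for stage, value in speedups.items():
    print(f"{stage}: {value}x")
```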
Figure 5. The implementation and deployment of paraBTM in a large-scale parallel environment.
Figure 6. Random load balancing strategy.
Figure 7. A sorted file list.
Figure 8. Round-robin algorithm.
Figure 9. A demonstration of the Short-Board load balancing algorithm.
Figure 10. The pseudo code of Short-Board.
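The captions above name a sorted file list and a round-robin algorithm. One common way to combine the two is to sort input files by size and deal them out to processes in turn. The sketch below illustrates that general technique only. It is not taken from paraBTM, it does not attempt to reproduce Short-Board (whose pseudo code is not included here), and it assumes file size is a usable proxy for processing cost. All names and the toy file sizes are hypothetical.

```python
from typing import List, Tuple

FileEntry = Tuple[str, int]  # (file name, size); size stands in for processing cost


def round_robin_assign(files: List[FileEntry], n_procs: int) -> List[List[FileEntry]]:
    """Sort files by size (largest first) and deal them out to processes in turn."""
    ordered = sorted(files, key=lambda entry: entry[1], reverse=True)
    buckets: List[List[FileEntry]] = [[] for _ in range(n_procs)]
    for index, entry in enumerate(ordered):
        buckets[index % n_procs].append(entry)
    return buckets


def balance_efficiency(buckets: List[List[FileEntry]]) -> float:
    """Average load divided by maximum load; 1.0 means perfectly balanced."""
    loads = [sum(size for _, size in bucket) for bucket in buckets]
    return (sum(loads) / len(loads)) / max(loads)


# Toy input: eight hypothetical files with sizes 1 to 8.
toy_files = [(f"paper_{size}.txt", size) for size in range(1, 9)]
assignment = round_robin_assign(toy_files, n_procs=2)

for rank, bucket in enumerate(assignment):
    print(rank, [size for _, size in bucket])
print(balance_efficiency(assignment))
```

With the toy input, process 0 receives sizes 8, 6, 4, 2 (load 20) and process 1 receives 7, 5, 3, 1 (load 16), giving an efficiency of 18 / 20 = 0.9. Plain round-robin on a sorted list always hands the larger file of each round to the lower-ranked process, which is the kind of imbalance a smarter strategy would aim to reduce.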