Xiaoyan Wang, Yichuan Li, Tingting He, Xingpeng Jiang, Xiaohua Hu.
Abstract
BACKGROUND: Microbes play a crucial role in the functional mechanisms of ecosystems. Identifying the interactions among microbes is an important step towards understanding the structure and function of microbial communities, as well as the impact of microbes on human health and disease. Despite its importance, there is currently no gold-standard dataset of microbial interactions. Traditional approaches such as growth and co-culture analysis must be performed in the laboratory, which is time-consuming and costly. Computational methods can alleviate this problem by providing predicted candidate interactions for experimental verification. Mining microbial interactions from large volumes of medical text is one such computational method. Identifying named entities of bacteria and related entities in text is the basis for microbial relation extraction. In previous work, a bacteria named entity recognition system based on a dictionary and conditional random fields was proposed. However, it is inefficient when dealing with large-scale text.
Keywords: Microbial interactions; Named entity recognition; Spark; Text mining
Year: 2018 PMID: 30463540 PMCID: PMC6249713 DOI: 10.1186/s12918-018-0625-3
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Fig. 1 The bacteria named entity recognition system flow chart
Performance of models trained on training sets of different sizes
| Training set (number of sentences) | CRF++ on single node: Precision | CRF++ on single node: Recall | CRF++ on single node: F-Measure | Spark version: Precision | Spark version: Recall | Spark version: F-Measure |
|---|---|---|---|---|---|---|
| 1000 | 84.679% | 73.429% | 78.654% | 86.715% | 80.566% | 83.527% |
| 2000 | 85.442% | 76.391% | 80.664% | 88.031% | 80.880% | 84.304% |
| 3000 | 86.287% | 78.232% | 82.062% | 88.623% | 81.463% | 84.892% |
| 4000 | 85.707% | 78.591% | 81.995% | 88.389% | 82.002% | 85.076% |
| 5000 | 86.447% | 78.725% | 82.405% | 88.699% | 81.373% | 84.878% |
| 6000 | 87.831% | 80.341% | 83.919% | 89.492% | 82.944% | 86.094% |
| 7000 | 88.456% | 80.476% | 84.277% | 89.981% | 83.438% | 86.586% |
| 8000 | 87.745% | 80.341% | 83.880% | 90.398% | 83.662% | 86.900% |
| 9000 | 88.345% | 80.969% | 84.496% | 90.847% | 84.201% | 87.398% |
| 10,000 | 88.873% | 81.373% | 84.958% | 90.944% | 83.842% | 87.249% |
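
The F-Measure columns can be reproduced from the precision and recall columns. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall); the source does not state the formula, but the reported values agree with it to within rounding. The function name `f_measure` and the `rows` list are illustrative, with figures copied from the first and last rows of the table.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (system, training sentences, precision %, recall %, reported F-measure %)
# Values copied from the table; the selection of rows is arbitrary.
rows = [
    ("CRF++ single node", 1000, 84.679, 73.429, 78.654),
    ("Spark version", 1000, 86.715, 80.566, 83.527),
    ("CRF++ single node", 10000, 88.873, 81.373, 84.958),
    ("Spark version", 10000, 90.944, 83.842, 87.249),
]

for system, n_sentences, p, r, reported in rows:
    computed = f_measure(p, r)
    print(f"{system:18s} {n_sentences:6d}  computed={computed:.3f}  reported={reported:.3f}")
```

Working in percentages is safe here because the harmonic mean scales linearly with its inputs.
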
The average prediction time of CRF++ on single node vs Spark version
| Data set (number of abstracts) | CRF++ on single node (s) | Spark version, 12 cores (s) | Spark version, 24 cores (s) | Spark version, 36 cores (s) | Spark version, 48 cores (s) |
|---|---|---|---|---|---|
| 2000 | 362.411 | 118.479 | 83.758 | 75.223 | 72.375 |
| 10,000 | 1716.569 | 533.486 | 325.471 | 286.723 | 268.614 |
| 20,000 | 3081.027 | 964.063 | 612.743 | 525.29 | 517.477 |
| 30,000 | 5207.298 | 1406.216 | 883.148 | 793.282 | 734.974 |
| 40,000 | 6141.149 | 1858.607 | 1168.061 | 1020.059 | 966.032 |
| 50,000 | 7956.735 | 2154.872 | 1465.193 | 1243.926 | 1191.362 |
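
To read the timing table as speedups, each single-node time can be divided by the corresponding Spark time. The sketch below does this for the smallest and largest data sets; the names `single_node_s`, `spark_s`, and `speedup` are illustrative, and the figures are copied from the table. The source does not give the core count of the single-node baseline, so no per-core efficiency is computed.

```python
# Average prediction times in seconds, copied from the table.
single_node_s = {2000: 362.411, 50000: 7956.735}
spark_s = {
    2000: {12: 118.479, 24: 83.758, 36: 75.223, 48: 72.375},
    50000: {12: 2154.872, 24: 1465.193, 36: 1243.926, 48: 1191.362},
}


def speedup(baseline_s: float, parallel_s: float) -> float:
    """Ratio of baseline time to parallel time (values > 1 mean faster)."""
    return baseline_s / parallel_s


for n_abstracts, baseline in single_node_s.items():
    for cores, t in spark_s[n_abstracts].items():
        print(f"{n_abstracts:6d} abstracts, {cores:2d} cores: "
              f"{speedup(baseline, t):.2f}x faster than single node")
```

With 48 cores the ratio is about 5.0 for 2,000 abstracts and about 6.7 for 50,000 abstracts, so the advantage of the Spark version grows with data set size.
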
Fig. 2 The prediction time and dataset scale curves of CRF++ on single node vs Spark version (48-core processor)
Fig. 3 The prediction time and the number of processor cores curves on 6 data sets