Hsin-Yao Wang, Wen-Chi Li, Kai-Yao Huang, Chia-Ru Chung, Jorng-Tzong Horng, Jen-Fu Hsu, Jang-Jih Lu, Tzong-Yi Lee.
Abstract
BACKGROUND: Group B streptococcus (GBS) is an important pathogen responsible for invasive infections, including sepsis and meningitis. GBS serotyping is essential for investigating possible infection outbreaks and can help identify sources of infection. Although GBS serotypes can be determined by either immuno-serotyping or geno-serotyping, both traditional methods are time-consuming and labor-intensive. In recent years, matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) has been reported as an effective tool for determining GBS serotypes more rapidly and accurately. This work therefore aims to identify GBS serotypes by combining machine learning techniques with MALDI-TOF MS.
Keywords: GBS; Group B streptococcus; MALDI-TOF-MS; Machine learning; Serotypes
Year: 2019 PMID: 31870283 PMCID: PMC6929280 DOI: 10.1186/s12859-019-3282-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of extracted features (peaks) for each of the 10 bin sizes in each model
| Bin size (Da) | Type Ia | Type Ib | Type III | Type V | Type VI |
|---|---|---|---|---|---|
| 1 | 1621 | 1621 | 1622 | 1621 | 1620 |
| 2 | 1013 | 1013 | 1012 | 1013 | 1012 |
| 3 | 779 | 780 | 779 | 779 | 779 |
| 4 | 644 | 644 | 642 | 643 | 644 |
| 5 | 576 | 576 | 575 | 576 | 576 |
| 6 | 493 | 493 | 493 | 493 | 493 |
| 7 | 468 | 469 | 468 | 469 | 469 |
| 8 | 428 | 430 | 429 | 429 | 427 |
| 9 | 398 | 398 | 398 | 397 | 397 |
| 10 | 380 | 381 | 380 | 380 | 380 |
Fig. 1 Flowchart of GBS serotype prediction in this work. The study can be divided into three parts: data collection, data analysis, and prediction analysis
Fig. 2 Example of the binning method for feature extraction. The upper part shows three stacked mass spectra as examples, covering m/z 2000 to m/z 2030. In the lower part, the spectra are divided into bins using ten different sizes (error regions), from 1 Da to 10 Da. A blue square indicates that at least one peak from any spectrum falls within the range of the bin
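The binning step described in Fig. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the half-open bin convention follows the Fig. 3 example (m/z 2010 to 2014.999 for a 5 Da bin), while the `origin` parameter, the function names, the binary presence/absence encoding, and the three peak lists are hypothetical and are not taken from the study.

```python
from math import floor

def bin_index(mz, bin_size, origin=2000.0):
    """Index k of the half-open bin [origin + k*size, origin + (k+1)*size) holding mz.

    `origin` is an assumed alignment point, not a value given in the paper.
    """
    return floor((mz - origin) / bin_size)

def occupied_bins(spectra, bin_size, origin=2000.0):
    """Bins hit by at least one peak from any spectrum (the 'blue squares')."""
    return sorted({bin_index(mz, bin_size, origin)
                   for peaks in spectra for mz in peaks})

def presence_matrix(spectra, bin_size, origin=2000.0):
    """One row per spectrum, one 0/1 column per occupied bin (assumed encoding)."""
    bins = occupied_bins(spectra, bin_size, origin)
    rows = []
    for peaks in spectra:
        hit = {bin_index(mz, bin_size, origin) for mz in peaks}
        rows.append([1 if b in hit else 0 for b in bins])
    return bins, rows

# Three made-up peak lists (m/z values) in the 2000-2030 window.
spectra = [
    [2003.2, 2011.4, 2027.9],
    [2003.9, 2012.8],
    [2011.1, 2019.5, 2028.3],
]

if __name__ == "__main__":
    for size in (1, 5, 10):
        print(size, "Da ->", len(occupied_bins(spectra, size)), "features")
```

Larger bins merge neighbouring peaks into one feature, which is consistent with the falling feature counts in the table above as the bin size grows from 1 Da to 10 Da.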
Performance (five-fold cross validation) of the predictive models for each serotype when using OneR for feature selection. SVM: Support Vector Machine; Sn: Sensitivity; Sp: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient
| Serotype | Feature selection | Classifiers | Bin size | Number of features | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|
| Ia | OneR | Random Forest | 9 | 42 | 95.1% | 89.1% | 90.9% | 0.804 |
| Ia | OneR | SVM | 9 | 38 | 93.5% | 91.9% | 92.4% | 0.828 |
| Ib | OneR | Random Forest | 7 | 48 | 83.7% | 81.0% | 81.8% | 0.611 |
| Ib | OneR | SVM | 7 | 46 | 76.4% | 74.3% | 74.9% | 0.473 |
| III | OneR | Random Forest | 9 | 28 | 90.1% | 87.4% | 88.3% | 0.753 |
| III | OneR | SVM | 9 | 16 | 86.5% | 85.5% | 85.8% | 0.700 |
| V | OneR | Random Forest | 1 | 43 | 86.2% | 85.9% | 86.0% | 0.690 |
| V | OneR | SVM | 1 | 43 | 81.3% | 81.0% | 81.1% | 0.590 |
| VI | OneR | Random Forest | 3 | 47 | 93.2% | 91.6% | 92.2% | 0.837 |
| VI | OneR | SVM | 9 | 13 | 91.2% | 89.6% | 90.2% | 0.796 |
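The four measures reported in these tables (Sn, Sp, Acc, MCC) follow from a binary confusion matrix. The sketch below uses their standard definitions. The function name and the counts in the example are hypothetical and do not reproduce any row of the tables.

```python
from math import sqrt

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient.

    tp/fn are counts for the target serotype (positive class),
    tn/fp are counts for all other serotypes (negative class).
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in binary_metrics(tp=90, fn=10, tn=80, fp=20).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays near zero when a model mostly predicts the majority class. That makes it a useful companion to Acc for the imbalanced one-versus-rest models used here.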
Performance (five-fold cross validation) of the predictive models for each serotype when using PCC for feature selection. SVM: Support Vector Machine; Sn: Sensitivity; Sp: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient
| Serotype | Feature selection | Classifiers | Bin size | Number of features | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|
| Ia | PCC | Random Forest | 9 | 35 | 100.0% | 96.1% | 97.3% | 0.939 |
| Ia | PCC | SVM | 9 | 28 | 100.0% | 99.6% | 99.8% | 0.994 |
| Ib | PCC | Random Forest | 3 | 38 | 100.0% | 93.3% | 95.3% | 0.899 |
| Ib | PCC | SVM | 3 | 30 | 100.0% | 99.3% | 99.5% | 0.988 |
| III | PCC | Random Forest | 6 | 7 | 91.0% | 90.2% | 90.5% | 0.795 |
| III | PCC | SVM | 6 | 7 | 91.0% | 89.7% | 90.2% | 0.789 |
| V | PCC | Random Forest | 9 | 46 | 100.0% | 92.3% | 94.6% | 0.885 |
| V | PCC | SVM | 9 | 43 | 100.0% | 99.6% | 99.8% | 0.994 |
| VI | PCC | Random Forest | 6 | 31 | 97.3% | 91.6% | 93.7% | 0.872 |
| VI | PCC | SVM | 7 | 18 | 94.6% | 93.2% | 93.7% | 0.868 |
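Correlation-based feature ranking of the kind summarised above can be sketched as follows, taking PCC to mean the Pearson correlation coefficient between each binned peak feature and the binary class label. Ranking by absolute value, the tie-breaking rule, the function names, and the toy data are all assumptions made for illustration. The study's exact selection procedure is not described in this record.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_features_by_pcc(rows, labels, top_k):
    """Indices of the top_k features with the largest |PCC| against the labels."""
    n_features = len(rows[0])
    scored = [(abs(pearson([r[j] for r in rows], labels)), j)
              for j in range(n_features)]
    # Highest score first; ties broken by the lower feature index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scored[:top_k]]

# Toy presence/absence matrix: 6 spectra x 3 binned peaks (made-up values).
rows = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = target serotype, 0 = any other serotype

if __name__ == "__main__":
    print(rank_features_by_pcc(rows, labels, top_k=2))
```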
Characteristics of the training data sets of each serotype after oversampling
| Models of serotypes | Type of data | Original data | Post oversampling | Total |
|---|---|---|---|---|
| Ia | Ia | 41 | 123 | 407 |
| Ia | Non-Ia | 284 | 284 | |
| Ib | Ib | 41 | 123 | 407 |
| Ib | Non-Ib | 284 | 284 | |
| III | III | 111 | 111 | 325 |
| III | Non-III | 214 | 214 | |
| V | V | 41 | 123 | 407 |
| V | Non-V | 284 | 284 | |
| VI | VI | 74 | 148 | 399 |
| VI | Non-VI | 251 | 251 | |
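The post-oversampling counts in this table are exact integer multiples of the original positive counts (41 to 123 is threefold, 74 to 148 is twofold, and type III is unchanged). One simple scheme that produces such counts is plain replication of the positive spectra, sketched below. Whether the study replicated samples or used another oversampling method is not stated here, so the replication approach and the function name are assumptions.

```python
def oversample_by_replication(samples, labels, positive_label, factor):
    """Repeat every positive sample `factor` times; negatives are kept once."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        copies = factor if label == positive_label else 1
        out_samples.extend([sample] * copies)
        out_labels.extend([label] * copies)
    return out_samples, out_labels

# Class sizes taken from the Ia model row of the table: 41 Ia vs 284 non-Ia.
# The sample identifiers are placeholders, not real spectra.
samples = [f"spectrum_{i}" for i in range(41 + 284)]
labels = ["Ia"] * 41 + ["non-Ia"] * 284

balanced_samples, balanced_labels = oversample_by_replication(
    samples, labels, positive_label="Ia", factor=3
)

if __name__ == "__main__":
    print(balanced_labels.count("Ia"), balanced_labels.count("non-Ia"),
          len(balanced_labels))
```

If replication is used together with cross-validation, copies of the same spectrum can land in both the training and validation folds unless oversampling is applied inside each training fold only.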
Performance of the predictive models for each serotype by using independent testing data. PCC was used as the feature selection method. SVM: Support Vector Machine; Sn: Sensitivity; Sp: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient
| Serotype | Bin size (Da) | Number of features | Classifiers | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|
| Ia | 9 | 28 | Random Forest | 66.0% | 61.4% | 61.9% | 0.168 |
| Ia | 9 | 28 | SVM | 19.1% | 94.9% | 87.1% | 0.172 |
| Ib | 3 | 30 | Random Forest | 55.6% | 54.8% | 54.9% | 0.071 |
| Ib | 3 | 30 | SVM | 54.0% | 38.9% | 41.0% | −0.050 |
| III | 6 | 7 | Random Forest | 73.0% | 74.1% | 73.9% | 0.405 |
| III | 6 | 7 | SVM | 68.0% | 71.3% | 70.6% | 0.336 |
| V | 9 | 43 | Random Forest | 63.6% | 40.6% | 43.4% | 0.028 |
| V | 9 | 43 | SVM | 5.5% | 94.1% | 83.4% | −0.007 |
| VI | 7 | 18 | Random Forest | 70.4% | 70.3% | 70.4% | 0.381 |
| VI | 7 | 18 | SVM | 67.6% | 64.4% | 65.4% | 0.297 |
Data statistics of the training data set and independent testing data set for each serotype of GBS (serotypes Ia, Ib, II, III, IV, V, VI, VII, VIII, and unknown serotypes)
| Serotypes | Training Data Set: number of mass spectra (%)a | Independent Testing Data Set: number of mass spectra (%)a |
|---|---|---|
| Ia | 41 (12.6%) | 47 (10.2%) |
| Ib | 41 (12.6%) | 63 (13.6%) |
| II | 16 (4.9%) | 44 (9.5%) |
| III | 111 (34.2%) | 100 (21.6%) |
| IV | – | 5 (1.1%) |
| V | 41 (12.6%) | 55 (11.9%) |
| VI | 74 (22.8%) | 143 (31.0%) |
| VII | 1 (0.3%) | 2 (0.4%) |
| VIII | – | 1 (0.2%) |
| unknown | – | 2 (0.4%) |
| Total | 325 (100%) | 462 (100%) |
a: Percentage of each serotype within the respective data set
Fig. 3 Example of how the main peak is found. Following Fig. 2, a bin of size 5 Da (from m/z 2010 to m/z 2014.999) is taken as an example
Fig. 4 Example of a peak pair. A peak pair was considered statistically significant when the p-value was less than or equal to 0.05
Fig. 5 Example of the OneR feature selection method. The left table shows six example spectra (three Type III and three non-Type III). The right table counts the occurrences of Peak 1 and Peak 2 in each class. The numbers in red are the counts of misclassifications
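The counting idea behind OneR in Fig. 5 can be sketched as follows: for each feature, every observed value predicts its majority class, and the samples outside that majority are the errors. Features with fewer errors rank higher. The function names and the six toy samples are hypothetical. They mirror the three-versus-three setup of the figure but are not its actual values.

```python
from collections import Counter, defaultdict

def one_r_errors(values, labels):
    """Errors made by the one-rule 'each feature value predicts its majority class'."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    # For each feature value, every sample outside the majority class is an error.
    return sum(sum(counts.values()) - max(counts.values())
               for counts in by_value.values())

def rank_features_by_one_r(rows, labels):
    """(error_count, feature_index) pairs, fewest errors first."""
    n_features = len(rows[0])
    return sorted((one_r_errors([r[j] for r in rows], labels), j)
                  for j in range(n_features))

# Six made-up spectra: columns are presence (1) / absence (0) of Peak 1 and Peak 2.
rows = [
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
    [1, 0],
]
labels = ["III", "III", "III", "non-III", "non-III", "non-III"]

if __name__ == "__main__":
    for errors, feature in rank_features_by_one_r(rows, labels):
        print(f"Peak {feature + 1}: {errors} error(s)")
```

In this toy data, Peak 1 misclassifies one spectrum and Peak 2 misclassifies two, so Peak 1 would be preferred.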
Fig. 6 Data distribution of the training data set by unsupervised hierarchical cluster analysis (UHCA) with bin size 1 Da
Fig. 7 Data distribution of the training data set by UHCA with features selected from the five models
Fig. 8 Data distribution of discriminative peak pairs for each serotype. The peaks were selected and ranked by PCC. (a) Type Ia, (b) Type Ib, (c) Type III, (d) Type V, and (e) Type VI. The distribution of the training data for each peak pair in each model is shown. The term ‘none’ in the legend indicates that both peaks are absent. The terms ‘peak1’ and ‘peak2’ indicate that only the peak listed above ‘VS’ or only the peak listed below it is present, respectively. The term ‘both’ indicates that the two peaks appear together. Overall, the distribution of these combinations differed significantly between the positive and negative data sets for each pair
Fig. 9 Prediction page of the GBS website