Danyi Xiong, Ze Zhang, Tao Wang, Xinlei Wang.
Abstract
As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised because the instance labels are ambiguous. Since its emergence, MIL has been applied to various problems, including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application: cancer detection using T-cell receptor (TCR) sequences. This application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. We consider two feasible data-generating mechanisms and, for the purpose of performance evaluation, simulate data under each mechanism, varying potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that, provided an appropriate MIL method is used, satisfactory performance with an area under the receiver operating characteristic curve (AUROC) above 80% can be achieved for five of the ten cancers. Based on our numerical results, we make suggestions about selecting a proper method and avoiding methods with poor performance. We further point out directions for future research and identify a pressing need for new MIL methodologies that offer improved performance (for some cancer types) and more explainable outcomes.
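The bag/instance setup and the AUROC criterion described in the abstract can be illustrated with a minimal Python sketch. It is not any of the 16 reviewed methods and not the authors' simulation design: the one-dimensional Gaussian features, the shift of 3.0 for "witness" instances, the max-score bag rule (the standard MIL assumption that one positive instance makes a bag positive), and all function names are assumptions made for illustration.

```python
import random

def simulate_bag(positive, bag_size, witness_rate, rng):
    """Return one bag as a list of 1-D instance features (illustrative only)."""
    # Assumption: a positive bag holds a fixed fraction of witness instances,
    # with at least one; a negative bag holds none.
    n_witness = max(1, round(witness_rate * bag_size)) if positive else 0
    witnesses = [rng.gauss(3.0, 1.0) for _ in range(n_witness)]
    background = [rng.gauss(0.0, 1.0) for _ in range(bag_size - n_witness)]
    return witnesses + background

def bag_score(instances):
    """Score a bag by its highest-scoring instance."""
    return max(instances)

def auroc(scores, labels):
    """Probability that a random positive bag outscores a random negative bag.

    Ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def run_demo(seed=0, n_bags=50, bag_size=10, witness_rate=0.2):
    """Simulate n_bags positive and n_bags negative bags and return bag-level AUROC."""
    rng = random.Random(seed)
    labels = [1] * n_bags + [0] * n_bags
    bags = [simulate_bag(y == 1, bag_size, witness_rate, rng) for y in labels]
    scores = [bag_score(bag) for bag in bags]
    return auroc(scores, labels)
```

Calling `run_demo()` returns a value between 0 and 1; multiplying by 100 gives the percentage scale used in the figures. Lowering `witness_rate` or the witness shift makes the bags harder to separate.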
Keywords: Binary classification; Primary instance; T-cell receptor; Weakly supervised learning; Witness rate
Year: 2021 PMID: 34141144 PMCID: PMC8192570 DOI: 10.1016/j.csbj.2021.05.038
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 The pipeline of data processing in our MIL application of cancer detection using TCR sequences. Each encoded TCR sequence is represented by a d-dimensional numeric feature vector learned by the auto-encoder.
Fig. 3 Mean AUROC (%) of bag classification using different MIL methods, evaluated on simulation scenarios, each with 100 replicates generated under model II. IS/BS/ES methods are distinguished by green, blue, and magenta lines.
The selected MIL methods for cancer detection using TCR sequences. Those that can perform instance classification are highlighted in italics.
| C | NSK-SVM | EMD-SVM | miGraph | MInD |
| BoW | CCE | MI-Net | | |
Fig. 2 Mean AUROC (%) of bag classification using different MIL methods, evaluated on simulation scenarios, each with 100 replicates generated under model I. IS/BS/ES methods are distinguished by green, blue, and magenta lines.
Average computation time (with standard error) in seconds for each MIL method based on 20 datasets under the basic setting of each model. Clock time is counted from loading data to producing classification results.
| IS methods | MILBoost | SI- | SI-SVM | EMDD | mi-SVM | MI-SVM | mi-Net |
|---|---|---|---|---|---|---|---|
| Model I (WR) | 9 (0.05) | 13 (0.06) | 17 (0.05) | 24 (0.22) | 22 (1.14) | 33 (2.75) | 13 (0.07) |
| Model II (PPI) | 10 (0.12) | 9 (0.03) | 18 (4.21) | 18 (1.00) | 14 (0.24) | 13 (0.27) | 13 (0.09) |
| BS methods | MInD | C | EMD-SVM | miGraph | NSK-SVM | | |
| Model I (WR) | 9 (0.03) | 21 (0.05) | 3 (0.50) | 77 (0.38) | 80 (0.08) | | |
| Model II (PPI) | 13 (0.17) | 9 (0.04) | 3 (0.61) | 20 (0.31) | 21 (0.20) | | |
| ES methods | BoW | CCE | MILES | MI-Net | | | |
| Model I (WR) | 60 (21.84) | 14 (0.36) | 42 (0.21) | 14 (0.07) | | | |
| Model II (PPI) | 20 (3.78) | 14 (0.14) | 20 (0.22) | 13 (0.04) | | | |
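Timings reported as a mean with standard error over a set of datasets could be collected along the following lines. This is a hedged sketch, not the authors' benchmarking code: `fn` is a hypothetical callable that should cover everything from loading a dataset to producing classification results, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of runs, which the source does not state.

```python
import math
import statistics
import time

def time_runs(fn, datasets):
    """Return the wall-clock seconds taken by fn on each dataset."""
    elapsed = []
    for data in datasets:
        start = time.perf_counter()
        fn(data)  # hypothetical: load data, fit the method, classify
        elapsed.append(time.perf_counter() - start)
    return elapsed

def mean_and_se(values):
    """Return (mean, standard error), with SE = sample SD / sqrt(n)."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se
```

For example, `mean_and_se(time_runs(run_method, datasets))` with 20 datasets would yield one "mean (SE)" entry of the kind shown in the table, where `run_method` stands in for any one MIL method.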
TCGA data: descriptive statistics including the sample size and bag size (i.e., the number of instances per bag) for selected cancer types.
| Cancer type | Sample size (10% positive bags) | Sample size (50% positive bags) | Bag size, mean (SD) (10% positive bags) | Bag size, total (10% positive bags) | Bag size, mean (SD) (50% positive bags) | Bag size, total (50% positive bags) |
|---|---|---|---|---|---|---|
| DLBC | 225 | 90 | 6.4 (13.8) | 1446 | 11.8 (21.8) | 1063 |
| THYM | 225 | 216 | 6.3 (14.2) | 1421 | 15.2 (24.5) | 3277 |
| ESCA | 225 | 332 | 8.8 (24.5) | 1979 | 10.2 (20.6) | 3380 |
| BRCA | 225 | 404 | 5.3 (12.9) | 1200 | 5.6 (12.6) | 2255 |
| KIRC | 225 | 404 | 6.0 (17.0) | 1347 | 6.2 (10.3) | 2518 |
| LUAD | 225 | 404 | 4.5 (14.5) | 1018 | 4.9 (12.9) | 1974 |
| LUSC | 225 | 404 | 4.9 (10.7) | 1093 | 4.0 (6.3) | 1622 |
| OV | 225 | 404 | 5.6 (11.1) | 1263 | 9.1 (17.3) | 3670 |
| SKCM | 225 | 404 | 5.8 (12.0) | 1306 | 6.2 (13.2) | 2515 |
| STAD | 225 | 404 | 9.9 (23.9) | 2221 | 18.3 (31.9) | 7401 |
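As a sanity check on how the columns are read, the total number of instances should roughly equal the sample size times the mean bag size. The sketch below assumes the columns are, in order, sample size at 10% and 50% positive bags, then mean (SD) and total bag size at 10%, then mean (SD) and total bag size at 50%. Because the means are reported to one decimal place, a gap of up to 0.05 instances per bag is allowed. The variable and function names are illustrative.

```python
# (cancer type, number of bags, mean bag size, total instances), copied from the table.
ROWS_10 = [
    ("DLBC", 225, 6.4, 1446), ("THYM", 225, 6.3, 1421), ("ESCA", 225, 8.8, 1979),
    ("BRCA", 225, 5.3, 1200), ("KIRC", 225, 6.0, 1347), ("LUAD", 225, 4.5, 1018),
    ("LUSC", 225, 4.9, 1093), ("OV", 225, 5.6, 1263), ("SKCM", 225, 5.8, 1306),
    ("STAD", 225, 9.9, 2221),
]
ROWS_50 = [
    ("DLBC", 90, 11.8, 1063), ("THYM", 216, 15.2, 3277), ("ESCA", 332, 10.2, 3380),
    ("BRCA", 404, 5.6, 2255), ("KIRC", 404, 6.2, 2518), ("LUAD", 404, 4.9, 1974),
    ("LUSC", 404, 4.0, 1622), ("OV", 404, 9.1, 3670), ("SKCM", 404, 6.2, 2515),
    ("STAD", 404, 18.3, 7401),
]

def consistent(rows, decimals=1):
    """Map each cancer type to True if total matches bags * mean up to rounding."""
    half_unit = 0.5 * 10 ** (-decimals)  # largest rounding error in a reported mean
    return {
        name: abs(n_bags * mean - total) <= half_unit * n_bags + 1e-9
        for name, n_bags, mean, total in rows
    }
```

Under this reading every row passes the check in both settings, for example 225 × 6.4 = 1440 against a reported total of 1446 for DLBC at 10% positive bags.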
Fig. 4 TCGA data: panels (a) and (b) show boxplots of mean AUROC (%) by cancer type for different MIL methods using data with 10% and 50% positive bags, respectively; panels (c) and (d) show boxplots of mean AUROC (%) by MIL method for different cancer types using data with 10% and 50% positive bags, respectively. The categories of MIL methods are distinguished by color (green: IS methods; blue: BS methods; magenta: ES methods).
The best and worst MIL methods for bag classification based on our numerical evaluation using simulation and real data examples. The categories of MIL methods are distinguished by color and typeface (green and italic: IS methods; blue and bold: BS methods; magenta: ES methods).