Artur Yakimovich, Anaël Beaugnon, Yi Huang, Elif Ozkirimli.
Abstract
Recent advances in biomedical machine learning demonstrate great potential for data-driven techniques in health care and biomedical research. However, this potential has thus far been hampered by both the scarcity of annotated data in the biomedical domain and the diversity of the domain's subfields. While unsupervised learning is capable of finding unknown patterns in the data by design, supervised learning requires human annotation to achieve the desired performance through training. With the latter performing vastly better than the former, the need for annotated datasets is high, but they are costly and laborious to obtain. This review explores a family of approaches existing between the supervised and the unsupervised problem setting. The goal of these algorithms is to make more efficient use of the available labeled data. The advantages and limitations of each approach are addressed and perspectives are provided.
Keywords: active learning; data annotation; data labeling; data value; machine learning; self-supervised learning; semi-supervised learning; zero-shot learning
Year: 2021 PMID: 34950904 PMCID: PMC8672145 DOI: 10.1016/j.patter.2021.100383
Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899
Figure 1. A schematic depiction of supervised and unsupervised machine learning approaches
(A) An example of a supervised approach. Here, data points (circles) in x1 and x2 dimensions are labeled in magenta and green categories, allowing the model (dashed line) to be fitted.
(B) An example of an unsupervised approach. Here an algorithm attempts to detect patterns (clusters; dashed line) in unlabeled data (light blue circles).
For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. The dashed line represents an abstraction for a model or a decision boundary.
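The contrast between panels (A) and (B) can be sketched in a few lines of Python. This is a minimal illustration, not a method from the review: it assumes one-dimensional data instead of the 2D space of the figure, uses a nearest-centroid fit as a stand-in for the supervised model and a two-cluster k-means as a stand-in for the unsupervised one, and all function names and data values are hypothetical.

```python
def fit_supervised(points, labels):
    """Supervised (panel A): use the given labels to compute one centroid per class."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def fit_unsupervised(points, n_iter=10):
    """Unsupervised (panel B): 1D two-cluster k-means, no labels involved."""
    centers = [min(points), max(points)]  # simple deterministic initialization
    for _ in range(n_iter):
        clusters = [[], []]
        for x in points:
            # Assign each point to the nearer of the two current centers.
            idx = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[idx].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers


# Hypothetical toy data: two well-separated groups.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
labels = ["magenta"] * 3 + ["green"] * 3

class_centroids = fit_supervised(points, labels)   # needs labels
cluster_centers = fit_unsupervised(points)         # does not need labels
print(class_centroids, cluster_centers)
```

On well-separated toy data like this, both routes recover roughly the same two group centers; the difference is that only the supervised fit knows which group is "magenta" and which is "green".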
Examples of zero-shot learning by formatting text data to fit models
| Task | Input | Output | Reference |
|---|---|---|---|
| Multi-class classification | | | |
| Question answering | | Yes/No | Sun et al. |
| Natural language inference | | Entailment/Contradiction | Yin et al. |
| Cloze | | Choose from | Schick and Schütze |
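The three reformulations in the table can be illustrated with a small sketch that only builds the reformatted model inputs; it does not call any model. The prompt templates, the function names, and the `[MASK]` placeholder are assumptions made for illustration and are not taken from the cited works.

```python
def as_question_answering(text, labels):
    """One yes/no question per candidate label; the model answers Yes or No."""
    return [f"{text} Is this text about {label}?" for label in labels]


def as_natural_language_inference(text, labels):
    """One (premise, hypothesis) pair per label; the model predicts
    entailment or contradiction for each pair."""
    return [(text, f"This text is about {label}.") for label in labels]


def as_cloze(text, labels):
    """A single prompt with a masked slot; the model chooses the filler
    from the candidate labels."""
    return f"{text} This text is about [MASK].", list(labels)


# Hypothetical example input and label set.
text = "The patient reports chest pain."
labels = ["cardiology", "dermatology"]

qa_inputs = as_question_answering(text, labels)
nli_inputs = as_natural_language_inference(text, labels)
cloze_prompt, cloze_choices = as_cloze(text, labels)
print(qa_inputs[0])
print(nli_inputs[0])
print(cloze_prompt, cloze_choices)
```

In each case, the classification task is recast as a task a pre-trained model already handles, so no task-specific labeled examples are needed.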
References to biomedical applications
| Approach | Biomedical applications |
|---|---|
| Supervised | Protein-compound affinity prediction; Omics; Biomedical imaging; EHR |
| Semi-supervised | Omics |
| Active learning | Drug discovery; Omics; Biomedical imaging; Clinical text data |
| Data augmentation | Omics; Biomedical imaging; Clinical text data |
| Transfer learning | Omics; Biomedical imaging; Biomedical text data |
| Self-supervised | Omics; Biomedical imaging |
| Few/one/zero-shot learning | Omics and affinity prediction; Drug discovery; Literature indexing; Biomedical imaging; Patient clustering |
| Weakly supervised | Drug discovery; Computer-aided diagnosis; Biomedical imaging; EHR |
Relevant approaches according to the amount of labeled and unlabeled data available
| Amount of available data | Some unlabeled data | No unlabeled data |
|---|---|---|
| Enough labeled data to train a supervised model | Supervised learning | Supervised learning |
| | Data augmentation | Data augmentation |
| | Semi-supervised learning | |
| | Active learning | |
| Some labeled data, but not enough to train a supervised model | Data augmentation | Data augmentation |
| | Semi-supervised learning | Transfer learning |
| | Active learning | |
| | Transfer learning | |
| | Self-supervised learning | |
| | Few/one/zero-shot learning | |
| No labeled data | Active learning | Zero-shot learning |
Large labeled datasets are required for pre-training.
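As a reading aid, the table above can be encoded as a simple lookup. This is a sketch of one reading of the table, with continuation rows attributed to the regime above them; the regime keys `"enough"`, `"some"`, and `"none"` and the function name are hypothetical shorthand, not terminology from the review.

```python
# Keys: (labeled-data regime, whether some unlabeled data is available).
RELEVANT_APPROACHES = {
    ("enough", True): [
        "Supervised learning",
        "Data augmentation",
        "Semi-supervised learning",
        "Active learning",
    ],
    ("enough", False): ["Supervised learning", "Data augmentation"],
    ("some", True): [
        "Data augmentation",
        "Semi-supervised learning",
        "Active learning",
        "Transfer learning",
        "Self-supervised learning",
        "Few/one/zero-shot learning",
    ],
    ("some", False): ["Data augmentation", "Transfer learning"],
    ("none", True): ["Active learning"],
    ("none", False): ["Zero-shot learning"],
}


def relevant_approaches(labeled, has_unlabeled):
    """Return the approaches listed in the table for one data situation.

    labeled: "enough", "some", or "none" (hypothetical shorthand for the rows).
    has_unlabeled: True for the "Some unlabeled data" column, False otherwise.
    """
    return RELEVANT_APPROACHES[(labeled, has_unlabeled)]


print(relevant_approaches("some", False))
```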
Figure 2. Example of data point importance for model selection
Magenta and green circles are data points that correspond to respective class labels.
(A) An example of a full dataset with 8 data points belonging to magenta class and 14 belonging to the green class; dashed line represents a selected model.
(B) A subset of six data points selected from the full dataset is less valuable for accurate model definition, as multiple models (e.g., two dotted lines of incorrect models and one dashed line for correct model) can be fitted to this subset, but not the full dataset.
(C) A subset of six data points selected from the full dataset is more valuable for accurate model definition, as only a single model (dashed line) can be fitted to it.
For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. The dashed line represents an abstraction for a model or a decision boundary.
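The idea behind panels (B) and (C) can be made concrete with a toy calculation. The sketch below assumes a one-dimensional threshold classifier rather than the 2D boundary of the figure, and measures how much freedom a labeled subset leaves for placing the threshold; the data values and function names are hypothetical.

```python
def consistent_threshold_interval(points):
    """Return the interval (lo, hi) of thresholds t that classify every point
    correctly, where label 0 means x < t and label 1 means x > t."""
    lo = max(x for x, y in points if y == 0)
    hi = min(x for x, y in points if y == 1)
    return lo, hi


def interval_width(points):
    """Width of the interval of models still compatible with the labeled points."""
    lo, hi = consistent_threshold_interval(points)
    return hi - lo


# Hypothetical full dataset: class 0 on the left, class 1 on the right.
full = [(1, 0), (2, 0), (3, 0), (4, 0), (4.8, 0),
        (5.2, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

# Subset like panel (B): six points far from the boundary.
far_subset = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]

# Subset like panel (C): six points close to the boundary.
near_subset = [(3, 0), (4, 0), (4.8, 0), (5.2, 1), (6, 1), (7, 1)]

print(interval_width(full), interval_width(far_subset), interval_width(near_subset))
```

The subset far from the boundary leaves a wide range of thresholds that all fit it, while the subset near the boundary constrains the threshold exactly as tightly as the full dataset does. Both subsets have the same size; only their informativeness differs.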
Figure 3. Strategies between supervised and unsupervised approaches
Magenta and green colors correspond to respective class labels, blue circles represent unlabeled data, dashed lines represent the learned decision boundary.
(A) In a semi-supervised learning approach, clustering and manual annotation of a few points are performed.
(B) In an active learning approach, an active request to the user to obtain annotation is performed; here it is depicted in a gray bubble with a “select” call to action.
(C) In the data augmentation approach, light circles with dashed borders represent data points obtained from the original through augmentation (e.g., linear transformation).
(D) A transfer learning approach uses a model pre-trained on one dataset (regular circles) and fine-tuned on another dataset (light circles with dashed borders). Here, the transfer of the trained model parameters is symbolized by the dashed model line “transferred” from left to right, as indicated by the blue arrow.
(E) Self-supervised learning approach. Here, a jigsaw in-painting task, which is not related to labels (so-called pretext task) is depicted as an example. The jigsaw pretext task is formulated automatically and allows learning of representations from the data.
(F) Weakly supervised learning approach. Here, magenta and green inequalities represent coarse heuristic rules used for data annotation.
For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. The dashed line represents an abstraction for a model or a decision boundary.
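Three of the panels lend themselves to a compact sketch: the annotation request in (B), the augmentation in (C), and the heuristic rules in (F). The code below is a toy illustration under stated assumptions: the decision boundary is taken to be a vertical line in x1, the active learning query picks the point closest to that boundary (a common uncertainty heuristic, not necessarily the one depicted), augmentation is a small rotation about the origin as one example of a linear transformation, and the rule thresholds are hypothetical.

```python
import math


def select_for_annotation(unlabeled, boundary_x1):
    """Active learning (panel B): request a label for the unlabeled point
    closest to the current decision boundary x1 = boundary_x1."""
    return min(unlabeled, key=lambda p: abs(p[0] - boundary_x1))


def augment_by_rotation(labeled, angle_deg=5.0):
    """Data augmentation (panel C): add a slightly rotated copy of every
    labeled point; each copy keeps the label of its original."""
    a = math.radians(angle_deg)
    rotated = [
        ((x1 * math.cos(a) - x2 * math.sin(a),
          x1 * math.sin(a) + x2 * math.cos(a)), y)
        for (x1, x2), y in labeled
    ]
    return labeled + rotated


def weak_label(point, low=2.0, high=4.0):
    """Weak supervision (panel F): coarse inequality rules assign a label,
    or abstain (None) when no rule applies."""
    x1, _ = point
    if x1 < low:
        return "magenta"
    if x1 > high:
        return "green"
    return None


# Hypothetical toy data.
unlabeled = [(1.0, 0.5), (2.9, 1.0), (5.0, 2.0)]
labeled = [((1.0, 1.0), "magenta"), ((5.0, 2.0), "green")]

query = select_for_annotation(unlabeled, boundary_x1=3.0)
augmented = augment_by_rotation(labeled)
weak_labels = [weak_label(p) for p in unlabeled]
print(query, len(augmented), weak_labels)
```

The weak labeling rule deliberately abstains near the boundary, which reflects the coarse nature of heuristic annotation: such rules are cheap to apply but cover only part of the data.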