Labels in a haystack: Approaches beyond supervised learning in biomedical applications.

Artur Yakimovich¹, Anaël Beaugnon², Yi Huang³, Elif Ozkirimli⁴.

Abstract

Recent advances in biomedical machine learning demonstrate great potential for data-driven techniques in health care and biomedical research. However, this potential has thus far been hampered by both the scarcity of annotated data in the biomedical domain and the diversity of the domain's subfields. While unsupervised learning is capable of finding unknown patterns in the data by design, supervised learning requires human annotation to achieve the desired performance through training. With the latter performing vastly better than the former, the need for annotated datasets is high, but they are costly and laborious to obtain. This review explores a family of approaches existing between the supervised and the unsupervised problem setting. The goal of these algorithms is to make more efficient use of the available labeled data. The advantages and limitations of each approach are addressed and perspectives are provided.

Entities: Chemical

Keywords: active learning; data annotation; data labeling; data value; machine learning; self-supervised learning; semi-supervised learning; zero-shot learning

Year: 2021 PMID: 34950904 PMCID: PMC8672145 DOI: 10.1016/j.patter.2021.100383

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

Learning a task is easier when you have examples. In the absence of examples, humans can leverage other strategies to learn new material, such as generalization based on examples from similar tasks or trial and error. Most of these strategies aim to make efficient use of the available examples, if there are any. Such strategies work best when the data (or examples) come from a representative and unbiased sampling of the underlying data landscape. In life sciences and health care, finding labeled data that sample the whole distribution is a major challenge. For example, in microscopy, data scarcity, cross-equipment compatibility, resolution limitations, and image quality are challenges in building consensus-labeled datasets. Another example is related to the drug discovery domain; protein-compound interaction datasets are limited to highly studied proteins or compounds, leading to an oversampling of privileged protein-compound pairs. In health care, access to patient data and finding consensus-labeled electronic health records (EHR) remain a huge challenge. Furthermore, the population is often not sampled with a fair or representative distribution, leading to bias in the available datasets. Another area of focus in life sciences and health care is finding information in scientific publications, where labeled data are long-tailed due to the flexible queries and differences in research topic popularity. Accelerated by COVID-19 research, the field is experiencing a huge increase in the number of publications, making information retrieval a major bottleneck in keeping up with literature. The lack of labeled data is a critical challenge that must be overcome to train supervised learning models in the biomedical field. Some solutions to deal with this problem have been reviewed in biomedical imaging and clinical text data analysis.
In this paper, we summarize several machine learning strategies that can be used with no or limited labeled data, with a special focus on life sciences and health-care-related domains. These strategies are blurring the boundary between supervised and unsupervised learning. Machine learning (ML) approaches are traditionally separated into supervised and unsupervised paradigms. Additionally, many researchers single out reinforcement learning as a third paradigm, which is not discussed in this review. In the supervised paradigm, the machine learning algorithm learns how to perform a task from data manually annotated by a person. It is worth mentioning that, for the sake of this review, we define supervision as manual, although this definition is strict. It is intended to concretely define assisted approaches to labeling in the respective sections below. In contrast, the unsupervised paradigm aims to identify the patterns within the data algorithmically, i.e., without human help (Figure 1). Supervised learning delivers more desirable results for automation, aiming to mimic human behavior, but it falls short in scalability, due to the laborious and often expensive process of data annotation. Unsupervised learning can also be used to automate some tasks (e.g., anomaly detection), but it usually performs much more poorly than supervised learning because it is not guided by manually annotated data. Unsupervised learning is therefore more suited to data exploration tasks such as clustering and association mining.

Figure 1

A schematic depiction of supervised and unsupervised machine learning approaches

(A) An example of a supervised approach. Here, data points (circles) in x1 and x2 dimensions are labeled in magenta and green categories, allowing the model (dashed line) to be fitted.

(B) An example of an unsupervised approach. Here an algorithm attempts to detect patterns (clusters; dashed line) in unlabeled data (light blue circles).

For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. Dashed line represents an abstraction for a model or a decision boundary.

The scarcity of annotated data hinders the use of supervised learning in many practical use cases. This is especially pronounced for the application of higher expressive capacity models used in representation learning and deep learning (DL). Such models are much more data-hungry (often by a factor of 100) compared with feature-generation-based ML.16, 17, 18 This means that even more annotated examples are required to learn relevant representations from the data than in traditional ML. In this paper, we review some techniques that have been proposed by the ML community to address the annotated data scarcity problem. First, we discuss the value each data point brings to the resulting model. Next, we discuss Semi-supervised Learning and Active Learning, which fall between the supervised and the unsupervised paradigms by leveraging both labeled and unlabeled data. Data Augmentation and Self-supervised Learning generate reliable annotated data in an automatic fashion that can later be used by classical supervised learning models. Transfer Learning and Zero/One/Few-Shot Learning techniques are able to leverage models pre-trained for a similar context (Table 1). Finally, Weakly Supervised Learning techniques learn predictive models from inexpensive weak labels that may be wrong.
Table 2 summarizes the references to biomedical applications for each technique and Table 3 depicts which approaches are relevant according to the amount of labeled and unlabeled data available.

Table 1

Examples of zero-shot learning by formatting text data to fit models

Task	Input	Output	Reference
Multi-class classification	X	Y
Question answering	X. Is this about Y ?	Yes/No	Sun et al.¹⁹
Natural language inference	X. This is about Y.	Entailment/Contradiction	Yin et al.²⁰
Cloze	X. This is about _.	Choose from Y	Schick and Schütze,²¹ Tam et al.²²

Table 2

References to biomedical applications

Approach	Biomedical applications
Supervised	• Protein-compound affinity prediction²³ • Omics²⁴,²⁵ • Biomedical imaging¹,²,²⁶,²⁷ • EHR⁴,⁵
Semi-supervised	• Omics²⁸,²⁹
Active learning	• Drug discovery³⁰,³¹ • Omics³²,³³ • Biomedical imaging¹¹,³⁴ • Clinical text data³⁵
Data augmentation	• Omics³⁶,³⁷ • Biomedical imaging³⁸,³⁹,⁴⁰,⁴¹,⁴² • Clinical text data⁴³
Transfer learning	• Omics⁴⁴,⁴⁵ • Biomedical imaging⁴⁶,⁴⁷,⁴⁸,⁴⁹,²⁶,⁵⁰ • Biomedical text data⁵¹,⁵²,⁵³,⁵⁴,⁵⁵
Self-supervised	• Omics³⁷,⁵⁶ • Biomedical imaging⁵⁷,⁴²
Few/one/zero-shot learning	• Omics and affinity prediction⁵⁸ • Drug discovery⁵⁹ • Literature indexing⁶⁰ • Biomedical imaging⁶¹,⁶² • Patient clustering⁶³
Weakly supervised	• Drug discovery⁶⁴ • Computer-aided diagnosis⁶⁵ • Biomedical imaging⁶⁶,⁶⁷,⁶⁸ • EHR⁶⁹

Table 3

Relevant approaches according to the amount of labeled and unlabeled data available

Amount of available labeled data	Some unlabeled data	No unlabeled data
Enough labeled data to train a supervised model	• Supervised learning • Data augmentation • Semi-supervised learning • Active learning	• Supervised learning • Data augmentation
Some labeled data, but not enough to train a supervised model	• Data augmentation • Semi-supervised learning • Active learning • Transfer learningᵃ • Self-supervised learningᵃ • Few/one/zero-shot learningᵃ	• Data augmentation • Transfer learningᵃ
No labeled data	• Active learning • Zero-shot learningᵃ	• Zero-shot learningᵃ

ᵃ Large labeled datasets are required for pre-training.

Data value

ML model performance often improves as more data are collected. Increasing the size of the dataset serves two purposes simultaneously. First, it provides more information about the problem, making the solution likely to be more general. Second, it improves the performance of complex models. Increasing the dataset size often means an increase in the number of unannotated data points. However, in a supervised learning setting, simply collecting observations is often insufficient. To be used as the training data for supervised learning, these observations must often be manually annotated. For instance, ImageNet, an image dataset of around 1.3 million individual images, required manually annotating each of these images to belong to one or more of 1,000 classes in a routine and laborious process. However, not every data point is created equal. Some data points may be more useful to obtain a representative dataset. For example, for classification problems in cases when labeled data is scarce, it is preferable to have data points closer to the decision boundary (Figure 2). Several strategies for data point valuation have thus far been proposed, ranging from linear classification to decision trees and game theory (Shapley values). Purposeful data collection may be achieved only when the goal is clearly defined. Cost-effective ways are therefore designed to weight the data points differently during training by the effective number (expected volume) of samples or by focusing on hard examples.
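
To make the idea of weighting by the effective number of samples more concrete, the following is a minimal sketch. It assumes the commonly used form E_n = (1 - beta^n) / (1 - beta) for a class with n samples; the function name, the value of beta, and the normalization are illustrative assumptions rather than details given in this review.

```python
def class_balanced_weights(class_counts, beta=0.999):
    """Per-class loss weights inversely proportional to the effective number of samples.

    Assumes E_n = (1 - beta**n) / (1 - beta); beta in [0, 1) is a hypothetical
    hyperparameter chosen for illustration.
    """
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    raw = [1.0 / e for e in effective]
    # Normalize so that the weights sum to the number of classes.
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]


# Class counts borrowed from Figure 2A (8 magenta and 14 green points).
weights = class_balanced_weights([8, 14])
```

With beta set to 0, every class receives the same weight; as beta approaches 1, the weights approach inverse class frequency, so the rarer class contributes more to the training loss.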

Figure 2

Example of data point importance for model selection

Magenta and green circles are data points that correspond to respective class labels.

(A) An example of a full dataset with 8 data points belonging to magenta class and 14 belonging to the green class; dashed line represents a selected model.

(B) A subset of six data points selected from the full dataset is less valuable for accurate model definition, as multiple models (e.g., two dotted lines of incorrect models and one dashed line for correct model) can be fitted to this subset, but not the full dataset.

(C) A subset of six data points selected from the full dataset are more valuable for accurate model definition (single dashed line).

For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. Dashed line represents an abstraction for a model or a decision boundary.

In recent years, a number of approaches that make more efficient use of labeled data have gained traction. Overall, these approaches attempt to improve how efficiently models with high expressive capacity use labeled data. Such models often count millions of trainable parameters and hence require very large labeled datasets to avoid overfitting. Approaches tackling this are therefore indispensable for fields where labeled data are scarce. In this review, the most popular examples are discussed.

Semi-supervised learning

Semi-supervised learning is halfway between supervised and unsupervised learning. It trains predictive models from both labeled and unlabeled data (Figure 3A) in order to obtain better models than if they were trained with plain supervised learning on only the labeled data available.

Figure 3

Strategies between supervised and unsupervised approaches

Magenta and green colors correspond to respective class labels, blue circles represent unlabeled data, dashed lines represent the learned decision boundary.

(A) In a semi-supervised learning approach, clustering and manual annotation of a few points are performed.

(B) In an active learning approach, an active request to the user to obtain annotation is performed; here it is depicted in a gray bubble with a “select” call to action.

(C) In the data augmentation approach, light circles with dashed borders represent data points obtained from the original through augmentation (e.g., linear transformation).

(D) A transfer learning approach uses a model pre-trained on one dataset (regular circles) and fine-tuned on another dataset (light circles with dashed borders). Here the trained model parameters transfer is symbolized by the dashed model line “transferred” from left to right as indicated by the blue arrow.

(E) Self-supervised learning approach. Here, a jigsaw in-painting task, which is not related to labels (so-called pretext task) is depicted as an example. The jigsaw pretext task is formulated automatically and allows learning of representations from the data.

(F) Weakly supervised learning approach. Here, magenta and green inequalities represent coarse heuristic rules used for data annotation.

For simplicity of graphical illustration, all the data points are depicted in a 2D space of x1 and x2. However, similar concepts apply to n-dimensional space. Dashed line represents an abstraction for a model or a decision boundary.

There is an important prerequisite to obtain better models than supervised learning: the distribution of examples, which unlabeled data will help elucidate, must be relevant for the prediction problem at hand. If this condition is not met, using unlabeled data can degrade the prediction accuracy by misguiding the inference.
In practice, semi-supervised learning strategies rely on assumptions about the data. For instance, the generative semi-supervised method that consists of applying the expectation maximization (EM) algorithm to both labeled and unlabeled data relies on the cluster assumption (if points are in the same cluster, they are likely to be of the same class). Another example is the transductive support vector machine (SVM), which implements the low-density separation assumption (the decision boundary should lie in a low-density region): it maximizes the margin for both labeled and unlabeled data. The reader may refer to Chapelle et al. for more details about semi-supervised learning assumptions and strategies. Semi-supervised learning will be most useful whenever there is far more unlabeled data than labeled data. This is likely to occur if obtaining data points is cheap, but obtaining the labels costs a lot of time, effort, or money. For instance, semi-supervised learning is particularly suited to the classification of protein sequences. Protein sequences are nowadays acquired at industrial speed, but determining the functions of a single protein may require years of scientific work.
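
The generative EM strategy described above can be sketched on a toy problem. The example below is only an illustration under strong simplifying assumptions that are not taken from the review: two classes, one-dimensional data, Gaussian class densities with unit variance, and equal class priors. The function name and the toy data are hypothetical.

```python
import math


def semi_supervised_em(labeled, unlabeled, n_iter=20):
    """EM for two 1D Gaussian classes (unit variance, equal priors: assumptions).

    labeled: list of (x, y) pairs with y in {0, 1}; unlabeled: list of x values.
    Returns the two estimated class means.
    """
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    # Initialize the class means from the labeled points only.
    means = [sum(xs0) / len(xs0), sum(xs1) / len(xs1)]
    for _ in range(n_iter):
        # E-step: soft class-1 membership of every unlabeled point.
        resp = []
        for x in unlabeled:
            log_odds = ((x - means[0]) ** 2 - (x - means[1]) ** 2) / 2.0
            resp.append(1.0 / (1.0 + math.exp(-log_odds)))
        # M-step: labeled points keep weight 1 for their known class.
        w0 = len(xs0) + sum(1.0 - r for r in resp)
        w1 = len(xs1) + sum(resp)
        means[0] = (sum(xs0) + sum((1.0 - r) * x for r, x in zip(resp, unlabeled))) / w0
        means[1] = (sum(xs1) + sum(r * x for r, x in zip(resp, unlabeled))) / w1
    return means


# Two labeled points that sit between two clusters of unlabeled points.
labeled = [(1.5, 0), (2.5, 1)]
unlabeled = [-0.2, 0.1, 0.0, 0.3, 3.8, 4.1, 4.0, 4.2]
means = semi_supervised_em(labeled, unlabeled)
```

In this toy setting the labeled points alone place the class means at 1.5 and 2.5, whereas the unlabeled clusters pull the estimates toward the two dense regions, which is the cluster assumption at work. If the unlabeled distribution were unrelated to the classes, the same mechanism would mislead the model.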

Active learning

Active learning is an iterative human-in-the-loop process that starts by identifying the unlabeled data points that are considered the most useful for training a predictive model. At each iteration, unlabeled data points are queried by the active learning strategy and annotated by human experts, and the model is updated with the newly annotated data points (Figure 3B). Similar to semi-supervised learning, active learning aims to efficiently leverage unlabeled data during the learning process to improve performance, while reducing the human expert workload. Both techniques can be combined to get the best out of the unlabeled data. The traditional active learning strategy, uncertainty sampling, selects the data points closest to the decision boundary for annotation. The theory is that the model is uncertain about the prediction of these data points, so annotating them will significantly increase the accuracy. Other active learning strategies are more explorative and exploit the structure in data to annotate points from all of the data space. Active learning strategies must find the right trade-off between exploitation and exploration, i.e., selecting data points both from sampled and unsampled areas of the data space. Selecting the right query strategy and model for a given dataset is a real challenge. Indeed, active learning strategies perform differently on different datasets and there is no guarantee that they will outperform random sampling. Meta-learning is being investigated to address this issue: it leverages reinforcement learning to learn the best active learning strategy. Active learning is a human-in-the-loop process where the user experience should not be overlooked. Taking humans into account and ensuring a good user experience is as important as choosing an optimal query strategy to effectively train a performant predictive model with reduced human workload.
However, most studies do not evaluate their query strategy in a real-world setting: they leverage oracles that answer the queries automatically from fully annotated datasets and they leave the user experiments for future work. They assess the human workload with the number of manual annotations, but the time spent in the overall annotation process is a much more realistic metric. Indeed, data points do not all have the same cost to be annotated. For instance, the data points queried by uncertainty sampling, close to the decision boundary, are often tricky cases that are more costly to annotate. In a real-world setting, even domain experts may not be able to annotate ambiguous data points. Moreover, studies often assume that a single data point is queried and annotated at each active learning iteration, to always select the optimal query. This optimal setting is, however, not workable in real-world annotation systems, as the annotators would spend more time waiting for annotation queries (while the predictive model is updated and the next annotation query is computed) than actually annotating data. Finally, most biomedical annotation tasks require specific domain knowledge, and crowdsourcing cannot be leveraged to annotate data at low cost. As annotating data is a cumbersome task, it is critical to show the biomedical experts that their annotations improve the accuracy of the predictive model, so that they continue annotating. It is, however, not straightforward to provide such feedback: most studies leverage a fully annotated validation dataset to assess the performance of the predictive model across iterations, but such a dataset is usually not available when deploying an annotation system. There is currently little research that focuses on user experience in active learning processes, and very few studies assess their method with user experiments. Such research would significantly foster the use of this technique in real-world annotation systems.
Active learning strategies are particularly useful to annotate datasets for unbalanced prediction problems where random sampling is not effective. For example, active learning has been widely applied to automate drug discovery, where the active learning strategy identifies which experiments to perform next. Active learning is also relevant when crowdsourcing cannot be leveraged because expert knowledge is required to annotate or the data are too sensitive to be shared.
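
As a minimal sketch of the uncertainty sampling strategy discussed in this section, the function below ranks unlabeled points by how close their predicted positive-class probability is to 0.5 and returns a batch of queries, reflecting the point that querying one item per iteration is rarely practical. The function name, the batch size, and the probability values are hypothetical; in a real system the probabilities would come from the current predictive model.

```python
def uncertainty_sampling(probabilities, batch_size=3):
    """Return the indices of the unlabeled points closest to the decision boundary.

    probabilities: predicted positive-class probability for each unlabeled point
    (binary classification is assumed for simplicity).
    """
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical model outputs for six unlabeled data points.
pool_probs = [0.97, 0.52, 0.10, 0.45, 0.80, 0.61]
query = uncertainty_sampling(pool_probs, batch_size=2)
# The selected points would then be sent to a human expert for annotation,
# added to the labeled set, and the model retrained before the next round.
```

Here the two points with probabilities 0.52 and 0.45 are selected. As noted above, such points are often the most ambiguous and therefore the most costly for experts to annotate.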

Data augmentation

An early attempt to address the mismatch between the availability of labeled data and the amount required in DL was data augmentation (Figure 3C). Traditional techniques apply label-preserving transformations to the already-annotated data points to increase the amount of training data. For instance, image data can simply be flipped or rotated, and text data can be slightly modified through deletion of random words and synonym substitution. Generative adversarial networks (GANs) can be used to generate a much broader set of augmentations. GANs learn the data distribution from some training data and then generate new samples that are as realistic as possible from this distribution. They use two competing neural networks: one that generates new samples from noise and one that discriminates samples as real or synthetic. GANs have been used in medical imaging to generate additional realistic training data. For example, Calimeri et al. generated magnetic resonance imaging (MRI) slices of the human brain with GANs and human physicians were not able to distinguish the artificially generated examples from the real ones. Augmentation strategies are often carefully selected and fine-tuned for each individual case, taking domain-specific understanding of medical image data into account to avoid noise and artifacts. For instance, Mok and Chung have proposed a new data augmentation technique that generates images with the desired invariance and robustness properties for brain tumor segmentation. They have fine-tuned their GAN to generate images with realistic tumor boundaries. In another study, Gupta and colleagues used an image-to-image translation GAN to enrich images with a broader palette of readouts. In contrast to conventional augmentation, however, such an approach did not lead to an increase in the number of individual data points. In contrast to semi-supervised and active learning techniques, data augmentation strategies are useful even when a large amount of unlabeled data is not available.
Data augmentation can be used to address class imbalance by generating more examples of underrepresented classes. For example, Ollagnier and Williams have compared several text data augmentation techniques to remedy class imbalance in clinical case classification. In the case of image data, Jin et al. have used a GAN to increase the training data of their DL model for pathological lung segmentation of CT scans. They have particularly focused on generating examples where nodules lie on the lung border because these specific cases were previously segmented poorly and underrepresented in the original training dataset.
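
The traditional label-preserving transformations mentioned in this section (flipping and rotating images) can be sketched as follows. Images are represented as nested lists to keep the example free of dependencies, and the function names and the label string are hypothetical. Whether a given transformation really preserves the label depends on the task, so the choice below is an assumption made for illustration.

```python
def horizontal_flip(image):
    """Mirror each row of a 2D image given as a list of lists."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, label):
    """Return the original image plus transformed copies, all sharing one label."""
    variants = [image, horizontal_flip(image), rotate_90(image)]
    return [(variant, label) for variant in variants]


# A tiny hypothetical 2x2 "image" with an illustrative label.
patch = [[1, 2],
         [3, 4]]
augmented = augment(patch, "infected")
```

One annotated example yields three training examples here. For underrepresented classes, applying such transformations more often is one simple way to reduce class imbalance.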

Transfer learning

Transfer learning is an ML technique where a model that is trained on one task is then repurposed on a second related task (Figure 3D). It is not specific to DL, but it has been widely leveraged in this field as the representation layers can be easily shared by different prediction tasks. In practice, the weights of the representation layers are set according to a model trained on a previous task. Then, these weights can be either used for initialization with subsequent update during the training process (fine-tuning) or completely frozen, and only the task-specific layers are trained. Transfer learning is a relevant solution to deal with the lack of annotated data if you can identify a related task with abundant labeled data or if a pre-trained model is available. It has significantly fostered the use of representation learning in biomedical applications. Manual feature engineering was often preferred over representation learning in the biomedical field because of the lack of training data. Thanks to transfer learning, generic DL models and generic annotated datasets can be leveraged to train biomedical prediction models with few domain-specific annotated examples. For instance, the natural language model BioBERT has been created from the generic model BERT through transfer learning to obtain better performance on biomedical text mining tasks. It has been initialized with the weights of BERT (trained on general domain corpora) and trained on biomedical domain corpora. BioBERT has also been further fine-tuned through transfer learning for specific NLP tasks such as biomedical Named Entity Recognition. A similar approach was applied to scientific text, clinical notes, and PubMed articles, and yielded SciBERT, ClinicalBERT, and PubMedBERT. Transfer learning is also widely used in computer vision to create biomedical specific models. For instance, Cheng et al. 
have leveraged transfer learning from a model pre-trained on ImageNet to identify specific patterns from abdominal radiographs with limited training data. Another group also used ImageNet pre-trained ResNet-50 for a classification task in micrographs of virus-infected cells. While other examples include transfer learning from models trained on, for example, an MNIST dataset, overall transfer learning from ImageNet-pre-trained models remains the most-favored approach. Proteomics and genomics benefit from transfer learning as well. For example, Deznabi et al. used ProtVec to predict kinase-phosphosite association.
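
The frozen-representation variant of transfer learning described in this section can be illustrated with a deliberately small, dependency-free sketch. The representation weights below merely stand in for a pre-trained model and are never updated; only a logistic task-specific head is trained on a handful of labeled examples. All names, weights, and data are hypothetical, and a real application would typically rely on a DL framework and an actual pre-trained network.

```python
import math


def pretrained_features(x, frozen_weights):
    """Representation layer whose weights are fixed (stand-in for a pre-trained model)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in frozen_weights]


def predict_proba(x, frozen_weights, head, bias):
    """Probability of the positive class from the task-specific logistic head."""
    h = pretrained_features(x, frozen_weights)
    logit = sum(a * b for a, b in zip(head, h)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def train_head(data, frozen_weights, lr=0.5, epochs=200):
    """Train only the head with stochastic gradient descent; the representation stays frozen."""
    head = [0.0] * len(frozen_weights)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            h = pretrained_features(x, frozen_weights)
            error = predict_proba(x, frozen_weights, head, bias) - y
            head = [a - lr * error * b for a, b in zip(head, h)]
            bias -= lr * error
    return head, bias


# Hypothetical frozen weights and a very small labeled dataset.
frozen = [[1.0, -1.0], [0.5, 0.5]]
small_dataset = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.9, 0.1], 1), ([0.2, 0.8], 0)]
head, bias = train_head(small_dataset, frozen)
```

Fine-tuning would differ only in that the representation weights are also updated during training, starting from their pre-trained values.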

Self-supervised learning

Self-supervised learning (Figure 3E) learns relevant representations (or embeddings) without any manual annotation cost.107, 108, 109, 110, 111, 112 It relies on a two-step process: (1) a pretext task (or self-supervised task) is used to learn meaningful representations from annotations that are inherent to the data, and (2) the learned representations are leveraged to address the downstream task, the task at hand, with many fewer labeled data. Pretext tasks themselves do not usually provide any useful application. Their learning objective should be set properly to receive supervision from the data themselves. This way, a large amount of annotated data can be generated automatically to learn high-quality representations. Several pretext tasks have been proposed for images; Doersch et al. have proposed a context prediction task where the objective is to predict the relative location of patches extracted from a given image. A similar pretext task consists of solving jigsaw puzzles where the tiles have been generated automatically by splitting images. Pretext tasks can also rely on generative models. For instance, in-painting is a generative pretext task where an arbitrary fraction of an image is removed, and the objective is to reconstruct it from the remaining context. In-painting has recently been demonstrated to be capable of predicting unseen fluorescence signals on a single-cell level in confocal microscopy. Colorization is another example of a generative pretext task: a gray-scale filter is applied to the input images and the objective is to recover the colors. Pretext tasks for language data are posed more natively, e.g., with next-word prediction, or masked language modeling and next-sentence prediction used by BERT. This approach may not immediately make intuitive sense. After all, what practical use would such an ML model present for solving, e.g., a classification problem?
However, in the sense of representation learning, performing such a task achieves the main part of the goal: it allows the system to learn meaningful representations from a large dataset. Once done, these representations can be repurposed through transfer learning (weights transfer) and fine-tuning to address an actual task at hand with fewer labeled data. Furthermore, these two steps can be merged into one through shared weights.119, 120, 121 Representations learned through self-supervised learning can be leveraged for both supervised and unsupervised downstream tasks. The learned representations can be used directly for unsupervised learning tasks such as clustering or similarity computation for search and retrieval tasks. The model trained for the pretext task can also be repurposed through transfer learning to address supervised tasks with many fewer labeled data. Self-supervised learning bridges the gap between supervised and unsupervised learning by focusing on learning high-quality representations. The pretext task must be chosen carefully to get representations as meaningful as possible for the downstream task. Multi-task learning can also be leveraged to train relevant representations from several pretext tasks. Self-supervised learning is proving incredibly powerful across various domains and types of data. This set of techniques refocuses the field on learning high-quality representations bridging domains of generative and discriminative modeling.
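
To illustrate how a pretext task obtains its supervision from the data themselves, the sketch below builds training pairs for a simplified masked-token task from raw, unannotated text. It masks one token at a time, which is a simplification chosen for clarity rather than the masking scheme of any particular model; the function name, the mask symbol, and the example sentence are hypothetical.

```python
def masked_token_examples(tokens, mask_token="[MASK]"):
    """Create (masked_input, target) pairs from a token list, one pair per position.

    No manual annotation is needed: the target is simply the hidden token.
    """
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, target))
    return examples


sentence = "the kinase phosphorylates its substrate".split()
examples = masked_token_examples(sentence)
```

A five-token sentence yields five automatically labeled examples, so a large unlabeled corpus can supply a correspondingly large pretext training set. A model trained to fill in the hidden tokens can then be repurposed for the downstream task.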

Few/one/zero-shot learning

When there are only a limited number of valuable labeled data points, ML solutions have to adapt and generalize to tackle few-shot or one-shot learning, especially to avoid overfitting. In some extreme situations, there are no labeled data at all for training (for example, when there are too many labels to annotate), and the model has to deal with unseen labels, leading to a challenge called zero-shot learning. The solutions to the few/one/zero-shot learning challenges essentially imitate how humans recognize and learn by analogy when there is limited evidence. These solutions often fall into three strategies: relationship similarity, task reduction, and prior knowledge. First, in most cases, with proper embedding or vector representation, the few/one/zero-shot learning models can locate the data or labels in the high-dimensional "semantic space", measure the similarity of their relationship to those in the training set, and provide proper predictions accordingly. The embedding approach has proved its effectiveness in multiple biomedical domains, such as affinity prediction, drug discovery, literature indexing, image recognition for cancer detection, and patient clustering. To generate suitable embeddings or vector representations in each biomedical domain, self-supervised learning and transfer learning methods can be leveraged. Second, when there are well-established models in the same domain, it is more straightforward to format data to fit the models than to tune the models to fit the data. For example, with auxiliary sentences, the input and output of a text classification task can be made compatible with a question-and-answer model, a natural-language-inference model, or a cloze model (Table 1).
Last, but not least, prior structured knowledge can provide additional domain knowledge to the models to support “reasoning.” For example, the International Classification of Diseases (ICD) hierarchy improves the classification model performance of infrequent labels in electronic medical records (EMR).
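The embedding-based strategy described above can be sketched as a nearest-label search in the semantic space. This is a minimal sketch under stated assumptions: the three-dimensional vectors, the label names, and the function `zero_shot_predict` are hypothetical, and in practice the embeddings would come from a self-supervised or transfer-learned model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_predict(x_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the input."""
    return max(
        label_embeddings,
        key=lambda name: cosine(x_embedding, label_embeddings[name]),
    )


# Hypothetical label embeddings; "label_C" has no labeled training examples,
# but it can still be predicted because it has a position in the same space.
label_embeddings = {
    "label_A": [0.9, 0.1, 0.0],
    "label_B": [0.1, 0.9, 0.1],
    "label_C": [0.0, 0.2, 0.9],
}

query = [0.1, 0.3, 0.8]  # hypothetical embedding of a new data point
prediction = zero_shot_predict(query, label_embeddings)
```

The design choice here is that no classifier is trained on the unseen label: prediction relies entirely on the geometry of the shared embedding space, so the quality of the embeddings determines the quality of the result.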

Weakly supervised learning

Many challenging prediction tasks require complex models (i.e., with large numbers of parameters) that need large training datasets. However, it is often difficult to acquire strong supervision, such as large, fully annotated datasets with perfectly accurate labels, due to the high cost of the data-labeling process. In some cases, only weak supervision can be collected, and some training algorithms, called weakly supervised, can handle this imperfect supervision to build predictive models. The goal of weak supervision is to train predictive models on large, imperfectly labeled training datasets to get better performance than fully supervised models trained on small datasets annotated with perfectly accurate labels. There are many settings of weak supervision. For instance, distant supervision leverages knowledge bases to derive some weak supervision. In other cases, only partial or coarse-grained labels may be available. For example, only a partial ordering may be provided when learning user preferences over items, or the coarse-grained label “flower” may be provided for a picture of an arum lily instead of spending a considerable amount of time finding the exact taxonomy. Multi-instance learning is another setting of weak supervision, where the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag is labeled negative if all the instances in it are negative, and positive if at least one instance in it is positive. Multi-instance learning has been applied to drug-activity prediction, computer-aided diagnosis, and pathology. Weak supervision can also rely on low-accuracy labels coming from rule-based heuristics (Figure 3F) or crowdsourced annotations. In this setting, each instance can be associated with several labels, and the key technical challenge is how to unify and de-noise them, given that they are each noisy, may disagree with one another, may be correlated, and may have arbitrary (unknown) accuracies that may depend on the subset of the dataset being labeled. This type of weak supervision is gaining rapid adoption by the biomedical research community to assist in labeling electronic health records or MRI scans.
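Two of the settings above can be sketched in a few lines: the bag-labeling rule of multiple-instance binary classification, and the unification of several noisy rule-based labels. The sketch is illustrative only. The labeling functions and example notes are hypothetical, and a plain majority vote is used as the simplest stand-in for aggregation; it does not model the unknown accuracies or correlations of the label sources mentioned in the text.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote


def bag_label(instance_labels):
    """Multiple-instance rule: a bag is positive (1) if any instance is positive."""
    return int(any(instance_labels))


# Hypothetical rule-based heuristics over free-text clinical notes.
def lf_mentions_fever(note):
    return 1 if "fever" in note.lower() else ABSTAIN


def lf_mentions_cough(note):
    return 1 if "cough" in note.lower() else ABSTAIN


def lf_denies_symptoms(note):
    return 0 if "denies" in note.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_fever, lf_mentions_cough, lf_denies_symptoms]


def majority_vote(note, labeling_functions=LABELING_FUNCTIONS):
    """Unify noisy votes by simple majority; abstain on ties or no votes."""
    votes = [lf(note) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting sources with equal support
    return counts[0][0]
```

For example, `majority_vote("Patient reports fever and cough")` yields the weak label 1, whereas a note on which the heuristics disagree equally is left unlabeled rather than being assigned a guess.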

Conclusions: Biomedical applications

The recent advances in data-intensive computing have been accelerated by improvements in performance due to techniques such as representation learning that address data curation and feature engineering problems. Yet, as performance improves, models tend to get larger and the requirement for annotated datasets grows. The set of techniques described in this review aims to make more efficient use of limited annotated data. These techniques focus on generating synthetic data from existing annotations, automating data annotation processes, or lowering the amount of annotated data needed to train machine learning systems. Even though they were proposed independently, these methods show increasing overlap. For example, weak supervision and self-supervision in DL ultimately serve the same purpose, exploit similar implicit properties of the learning systems, and often require fine-tuning for the best performance. Weak supervision also has some conceptual overlap with data augmentation: both techniques generate labeled data automatically with a rule-based approach to provide some supervision at low cost. Another example is the similarity between semi-supervision, weak supervision, and active learning: they all leverage unlabeled data to enrich or create an annotated dataset, with different levels of involvement from domain experts. While semi-supervised learning pseudo-labels unlabeled data automatically, weakly supervised learning requires domain experts to provide pseudo-labeling rules, and active learning requires even more involvement from domain experts, who manually annotate some unlabeled data points. Finally, transfer learning is closely related to self-supervised learning and few/one/zero-shot learning. Self-supervised learning and transfer learning are often applied jointly: self-supervised learning allows relevant representations to be learned at low cost, and these are then fine-tuned through transfer learning to answer the problem at hand with fewer annotated data points. Few/one-shot learning often leverages a model previously trained for another task, as in transfer learning, while zero-shot learning often leverages a model pre-trained with self-supervised learning without fine-tuning. Several emerging and established techniques have been omitted from this review due to their broader focus on the way systems learn. They include meta-learning and universal representations. The former relates to approaches aimed at optimizing representation learning algorithms using training meta-data, i.e., learning to learn. Recent work in this field shows that further improvement of learning algorithms is still possible. The latter seeks to identify the best algorithmic way to obtain reusable (i.e., “universal”) representations valid across multiple domains, building upon self-supervised learning approaches. Perhaps it is those, or the methods reviewed in this work, that will one day allow us to stop searching for the labels in a haystack of data.


