Rui-Si Hu, Abd El-Latif Hesham, Quan Zou.
Abstract
In recent years, the development and application of machine learning (ML) in the field of infectious diseases has attracted considerable attention, serving not only as a catalyst for academic studies but also as a key means of detecting pathogenic microorganisms, implementing public health surveillance, exploring host-pathogen interactions, discovering drug and vaccine candidates, and so forth. These applications also include the management of infectious diseases caused by protozoal pathogens, such as Plasmodium, Trypanosoma, Toxoplasma, Cryptosporidium, and Giardia, a class of fatal or life-threatening causative agents capable of infecting humans and a wide range of animals. With the reduction of computational cost, the availability of effective ML algorithms, the popularization of ML tools, and the accumulation of high-throughput data, it has become possible to integrate ML applications into a growing body of scientific research on protozoal infection. Here, we present a brief overview of important ML concepts as background knowledge, with a focus on basic workflows, popular algorithms (e.g., support vector machine, random forest, and neural networks), feature extraction and selection, and model evaluation metrics. We then review current ML applications and major advances concerning protozoal pathogens and protozoal infectious diseases, drawing on the relevant biological expertise, and provide forward-looking insights into perspectives and opportunities for future advances in ML techniques in this field.
Keywords: artificial intelligence; drug and vaccine discovery; host-parasite interaction; image detection; machine learning; protozoal parasite; public health
Year: 2022 PMID: 35573796 PMCID: PMC9097758 DOI: 10.3389/fcimb.2022.882995
Source DB: PubMed Journal: Front Cell Infect Microbiol ISSN: 2235-2988 Impact factor: 6.073
An overview of main human protozoan parasites and the available genome links.
| Causative agent | Taxonomic group | Caused disease | Genome link† |
|---|---|---|---|
| | Amoebozoa | Acanthamoeba keratitis | |
| | Amoebozoa | Amoebiasis | |
| | Apicomplexa | Babesiosis | |
| | Apicomplexa | Cyclosporiasis | |
| | Apicomplexa | Cryptosporidiosis | |
| | Apicomplexa | Malaria | |
| | Apicomplexa | Toxoplasmosis | |
| | Kinetoplastida | Leishmaniasis | |
| | Kinetoplastida | African sleeping sickness | |
| | Kinetoplastida | Chagas disease | |
| | Metamonada | Giardiasis | |
| | Metamonada | Trichomoniasis | |
†Refer to the links provided by Aurrecoechea et al. (2017). Each download link leads to the corresponding species in the database.
Figure 1. Example schematic workflow for constructing a machine learning predictor (supervised learning approach). The overall flow contains four steps: data preparation, model training, evaluation, and prediction. Step 1: data are preprocessed to ensure suitability for machine learning and are split into training and test cohorts. The preprocessed data are converted into labeled numerical vectors by a feature extraction method, and the optimal features are chosen by feature selection methods (e.g., the MRMD software). Step 2: a machine learning algorithm (e.g., SVM, RF, or neural network) is chosen based on the data to be used and the desired task, and models are trained on the training dataset. Step 3: model performance is evaluated using the test dataset through cross-validation and by means of metrics such as ROC or accuracy. During this step, hyperparameter optimization (or hyperparameter tuning) is usually performed to find the combination of hyperparameters that maximizes model performance, for example, the maximum depth allowed for a decision tree and the number of trees in a random forest. Step 4: an optimal model is chosen as the final model and packaged appropriately for users (e.g., an online webserver for prediction or scripts for local use).
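
The four steps in Figure 1 can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under several assumptions that are not from the article: the data are synthetic (`make_toy_data` is a hypothetical stand-in for preprocessed, labeled feature vectors), a k-nearest-neighbour classifier replaces the SVM, RF, or neural network named in the caption, feature selection is omitted, and the hyperparameter `k` is tuned by cross-validation on the training cohort before a single final check on the held-out test cohort.

```python
import random
from statistics import mean

def make_toy_data(n_per_class=40, seed=0):
    """Hypothetical stand-in for preprocessed, labeled feature vectors."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            # Two informative features plus one pure-noise feature.
            x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0), rng.gauss(0.0, 1.0)]
            data.append((x, label))
    rng.shuffle(data)
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training samples (k is odd, so no ties)."""
    nearest = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    return mean(1.0 if knn_predict(train, x, k) == y else 0.0 for x, y in test)

def cross_val_accuracy(train, k, folds=5):
    """Mean validation accuracy over interleaved folds of the training cohort."""
    scores = []
    for i in range(folds):
        val = train[i::folds]
        fit = [s for j, s in enumerate(train) if j % folds != i]
        scores.append(accuracy(fit, val, k))
    return mean(scores)

# Step 1: data preparation and train/test split.
data = make_toy_data()
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

# Steps 2-3: fit candidate models and tune the hyperparameter k by cross-validation.
candidate_k = [1, 3, 5, 7]
cv_scores = {k: cross_val_accuracy(train_set, k) for k in candidate_k}
best_k = max(cv_scores, key=cv_scores.get)
test_accuracy = accuracy(train_set, test_set, best_k)

# Step 4: package the chosen model behind a simple prediction function.
def final_model(x):
    return knn_predict(train_set, x, best_k)
```

Keeping the test cohort out of the tuning loop is a design choice of this sketch; it prevents the selected hyperparameter from being fitted to the data that is later used to report performance.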
Representative artificial intelligence applications for protozoal pathogen detection in publications.
| Author | Image acquisition method | Dataset (total) | Species | Classifier | Result |
|---|---|---|---|---|---|
| | Microscope | 15 images | | SVM | 97.7% accuracy, 97.4% sensitivity, and 97.7% specificity |
| | Microscope | 74 images | | SVM, KNN and NB | 96.75% sensitivity and 94.59% specificity |
| | Microscope | 120 images | | Bayesian | 98.3% sensitivity and 84.37% specificity |
| | Microscope | 450 images | | SVM | 94% sensitivity and 99.7% specificity |
| | Microscope | 12,936 images | | AdaBoost and SVM | 100% sensitivity and 93.25% specificity |
| | Microscope | Quantitative phase images of unstained cells | | LDC, k-NNC and LR | Highest accuracy of 99.7%, 99.5% and 99.1% for LDC, k-NNC, and LR, respectively |
| | Microscope | 27,578 images | | CNN | 97.37% accuracy, 96.99% sensitivity, 97.75% specificity, and 97.36% F1-score |
| | Microscope | 27,558 images | | CNN | 98.6% accuracy, 98.1% sensitivity, 99.2% specificity, 98.7% F1-score, and 97.2% MCC |
| | Microscope | 27,558 images | | CNN | 99.6% accuracy, 100% precision, 99.92% recall, and 99.96% F1-score |
| | Imaging flow cytometry | 80,146 images | | CNN | > 99.6% accuracy, 97.37% sensitivity and 99.95% specificity |
| | Microscope | 13,135 images (T400 dataset) and 14,992 images (T1000 dataset) | | Transfer learning | T400: 93.1% accuracy, 93.9% F1-score, 96% recall, and 91.9% precision; T1000: 94.0% accuracy, 93.9% F1-score, 92.9% recall, and 94.9% precision |
| | Microscope | 24,358 images | | Deep cycle transfer learning | 95.7% accuracy, 95.7% F1-score, 95.7% recall, and 95.8% precision |
| | Microscope | 79,672 images | | GCN | 98.3% accuracy, 98.5% precision, 98.3% recall, and 98.3% F1-score |
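
The metrics reported in the Result column (accuracy, sensitivity or recall, specificity, precision, F1-score, and MCC) all derive from the four cells of a binary confusion matrix. The following sketch uses the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from any of the studies listed above.

```python
import math

def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    sensitivity = _safe_div(tp, tp + fn)   # also called recall
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    mcc = _safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts: 100 infected cells and 100 uninfected cells.
m = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed per class, they remain informative on imbalanced image sets, where accuracy alone can look high even if most infected cells are missed.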
Available tools for microscopic image recognition and detection of protozoal pathogens.
| Model | Description | Species | Availability | Refs |
|---|---|---|---|---|
| CLoDSA | An image augmentation library for object classification, localization, detection, semantic segmentation, and instance segmentation. | | | |
| R-CNN | Automated identification of malaria parasite cells using a region-based convolutional neural network model for both brightfield and fluorescence images. | | | |
| DTGCN | A GCN-based tool for recognizing multi-stage malaria parasites in blood smear images. | | | |
| DCTL | Detection of three apicomplexan parasites by microscopic image analysis using a deep cycle transfer learning method. | | | |
| FCGAN | A microscopic image recognition method that combines a fuzzy cycle generative adversarial network with transfer learning. | | | |
| MCellNet | A deep neural network processing pipeline that uses imaging flow cytometry as the detection system to achieve rapid, accurate, and high-throughput detection and classification of waterborne parasites. | | | |
Figure 2. Parasite recognition and detection in blood smear images using a CNN approach. A blood smear image of an intracellular parasite infection is a typical microscope image. A CNN takes such an image as input, assigns importance (learnable weights and biases) to various aspects of it, and can thereby differentiate a parasite-infected red blood cell from a normal red blood cell or discern different lifecycle stages of a parasite. The feedforward layers of a CNN comprise the input layer, convolutional layer, pooling layer, flatten layer, and fully connected layer. The input layer holds the image data as a three-dimensional matrix, and the image is reshaped into a single column. The convolutional layer slides small windows over the image to extract local information. Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The flatten layer converts the multi-dimensional input into one-dimensional data. The fully connected layer identifies and classifies the object in the image, yielding probabilistic detection results for red blood cells.
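
To make the layer sequence in Figure 2 concrete, here is a minimal forward pass written in plain Python. It is a sketch under simplifying assumptions, not the architecture of any cited tool: the input is a single-channel 6x6 toy image rather than a three-dimensional matrix, there is one convolution kernel, and the kernel, weights, and bias are hand-picked hypothetical values rather than learned parameters.

```python
import math

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation): slide the kernel window over the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Down-sample by taking the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

def flatten(feature_map):
    """Turn the 2D feature map into a one-dimensional vector."""
    return [v for row in feature_map for v in row]

def dense_sigmoid(vector, weights, bias):
    """Fully connected layer with one output unit, squashed to a probability."""
    z = sum(w * v for w, v in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def forward(image, kernel, weights, bias):
    pooled = max_pool(relu(conv2d(image, kernel)))
    return pooled, dense_sigmoid(flatten(pooled), weights, bias)

# Toy "infected cell": a bright 2x2 blob in the middle of a dark 6x6 image.
infected = [[1 if 2 <= r <= 3 and 2 <= c <= 3 else 0 for c in range(6)] for r in range(6)]
blank = [[0] * 6 for _ in range(6)]

kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # hypothetical blob detector
weights, bias = [0.5, 0.5, 0.5, 0.5], -4.0   # hypothetical, not learned

pooled_infected, prob_infected = forward(infected, kernel, weights, bias)
pooled_blank, prob_blank = forward(blank, kernel, weights, bias)
```

In a trained CNN the kernel values and dense-layer weights are learned from labeled images, and many kernels are stacked across several convolution and pooling stages; the data flow between layers is the same as in this sketch.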
Figure 3. For public health surveillance of protozoal diseases, a variety of data types can be used to construct machine learning models that map independent variables to dependent variables to obtain disease predictions.
Figure 4. In host-parasite interaction prediction, four common features of a computational workflow are worth considering because they can further improve prediction accuracy: protein 3D structural information, domain-domain interaction and/or domain-motif interaction, and interologs based on the known PPIs in an organism.
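
Of the features named in Figure 4, the interolog approach is the simplest to illustrate: if two proteins interact in a template organism, their orthologs in the host and the parasite are predicted to interact as well. The sketch below assumes this standard definition; the function `predict_interologs` and all protein identifiers are hypothetical placeholders, and a real workflow would obtain the ortholog maps from sequence-similarity or orthology resources.

```python
from itertools import product

def predict_interologs(template_ppis, host_orthologs, parasite_orthologs):
    """Transfer known template PPIs to candidate host-parasite protein pairs.

    template_ppis: iterable of (protein_a, protein_b) pairs known to interact.
    host_orthologs / parasite_orthologs: dicts mapping a template protein to
    the list of its orthologs in the host or parasite proteome.
    """
    predicted = set()
    for a, b in template_ppis:
        # PPIs are undirected, so try both assignments of host and parasite roles.
        for host_side, parasite_side in ((a, b), (b, a)):
            hosts = host_orthologs.get(host_side, ())
            parasites = parasite_orthologs.get(parasite_side, ())
            for h, p in product(hosts, parasites):
                predicted.add((h, p))
    return predicted

# Hypothetical identifiers, for illustration only.
template_ppis = {("T1", "T2"), ("T3", "T4")}
host_orthologs = {"T1": ["H1"], "T3": ["H3a", "H3b"]}
parasite_orthologs = {"T2": ["P2"], "T4": ["P4"]}

candidate_pairs = predict_interologs(template_ppis, host_orthologs, parasite_orthologs)
print(sorted(candidate_pairs))
```

Candidate pairs produced this way are usually filtered further, for example with the structural or domain-level evidence also listed in the caption, before being treated as predictions.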
Figure 5. A basic workflow for a virtual screening platform in drug discovery. In this process, machine learning is a powerful tool for predicting compound-target pairs from compound datasets and proteins of known function.