Dibyendu Ghosh1, Srija Chakraborty2, Hariprasad Kodamana2,3, Supriya Chakraborty4.
Abstract
BACKGROUND: The adoption of high-throughput technologies in biology has generated massive amounts of data in recent years. Transforming these huge volumes of data into knowledge is now the primary challenge in computational biology. Traditional methods of data analysis have failed to carry out the task. Hence, researchers are turning to machine learning-based approaches for the analysis of high-dimensional big data. In machine learning, once a model is trained on a training dataset, it can be applied to an independent testing dataset. More recently, deep learning algorithms have further promoted the application of machine learning in several fields of biology, including plant virology.
MAIN BODY: Plant viruses have emerged as one of the principal global threats to food security due to their devastating impact on crops and vegetables. The emergence of new viral strains and species helps viruses evade current preventive methods. According to a survey conducted in 2014, plant viruses are anticipated to cause a global yield loss of more than thirty billion USD per year. To design effective, durable and broad-spectrum management protocols, it is very important to understand the mechanistic details of viral pathogenesis. The application of machine learning enables precise diagnosis of plant viral diseases at an early stage. Furthermore, the development of several machine learning-guided bioinformatics platforms has primed plant virologists to understand the host-virus interplay better. In addition, machine learning has tremendous potential in deciphering the pattern of plant virus evolution and emergence as well as in developing viable control options.
Keywords: Control options; Deep learning; Evolution and emergence; Host-virus interactions; Machine learning; Pathogenesis; Plant virus
Year: 2022 PMID: 35264189 PMCID: PMC8905280 DOI: 10.1186/s12985-022-01767-5
Source DB: PubMed Journal: Virol J ISSN: 1743-422X Impact factor: 4.099
Fig. 1 Standard flowchart for creating a machine learning model to study biological data. The figure shows the steps followed to create a machine learning model that can successfully study different types of biological data. The data are initially split into training and testing sets. Each object of the training set is associated with a feature vector, which is passed to the chosen machine learning algorithm. After the parameters of the model are tuned, a machine learning model for prediction is obtained. This model is then checked by passing the objects of the testing set through it. The resulting output accuracy determines the usefulness of the model
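The workflow in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of any study cited here: the toy two-feature data, the labels `healthy` and `infected`, the nearest-centroid classifier, and all function names are hypothetical, and the parameter-tuning step is omitted.

```python
import random

def split_dataset(samples, test_fraction=0.25, seed=0):
    """Shuffle (feature_vector, label) pairs and split them into train/test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_centroids(training_set):
    """Fit a nearest-centroid model: one mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, testing_set):
    """Fraction of held-out objects that the model labels correctly."""
    correct = sum(predict(model, features) == label for features, label in testing_set)
    return correct / len(testing_set)

# Hypothetical toy data: two well-separated classes with two features each.
data_rng = random.Random(1)
healthy = [([data_rng.gauss(0.2, 0.05), data_rng.gauss(0.8, 0.05)], "healthy")
           for _ in range(40)]
infected = [([data_rng.gauss(0.7, 0.05), data_rng.gauss(0.3, 0.05)], "infected")
            for _ in range(40)]

train_set, test_set = split_dataset(healthy + infected)
model = train_centroids(train_set)
test_accuracy = accuracy(model, test_set)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The key point is that `accuracy` is computed only on `test_set`, which the model never saw during training, so the number reflects how well the model generalizes rather than how well it memorized.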
Fig. 2 A schematic representation of a standard artificial neural network (ANN). The network is divided into three major components: the input layer, multiple hidden layers and the output layer. In this figure, it is assumed that the input layer has 3 independent variables, each of which is passed through a set of weights and activation functions in the hidden layers and finally the output layer to yield the model output. The activation functions are nonlinear mathematical functions such as Tanh, Sigmoid, ReLU, etc., which introduce nonlinearity into the model. Depending on the network structure, there may be ‘n’ neurons (also called hidden layer units) in each hidden layer and there may be multiple hidden layers. Any ANN with more than one hidden layer is technically a deep ANN. Once an input is fed into the network, the hidden layers are evaluated one after another until the output layer is reached and activated, producing the final result. The weights in each layer are trained by means of the backpropagation algorithm
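The forward pass described in Fig. 2 can be sketched as follows. This is an illustrative example only: the layer sizes (3 inputs, two hidden layers of 4 units, 1 output), the random untrained weights, and the function names are assumptions, and training by backpropagation is not shown.

```python
import math
import random

def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for a layer of n_out neurons."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Feed the input through each layer in turn and return the final activations."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

rng = random.Random(0)
layers = [
    (*init_layer(3, 4, rng), math.tanh),  # hidden layer 1
    (*init_layer(4, 4, rng), math.tanh),  # hidden layer 2 (more than one hidden layer: a deep ANN)
    (*init_layer(4, 1, rng), sigmoid),    # output layer
]

output = forward([0.5, -1.2, 3.0], layers)
print(output)
```

Because the weights here are random, the output is meaningless as a prediction; in practice backpropagation would adjust `weights` and `biases` to reduce the error on a training set.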
Fig. 3 Application of ML in understanding plant virus pathogenesis. ML enables early diagnosis of plant viral diseases at field level by analyzing hyperspectral images. Metagenomic studies of diseased plant samples help identify related and unrelated viral genomes. ML can assist in the classification of these viral sequences, which primes our understanding of virus evolution. Furthermore, ML-assisted bioinformatics tools have been developed to identify viral suppressors of RNA silencing (VSRs). ML can also guide us to predict the sub-cellular localization and even the structure of viral proteins. Accurate structure prediction for virus-encoded proteins may help to identify inhibitors of these effector proteins. To understand the host response, several groups have performed transcriptome, proteome and metabolome analyses of virus-infected plants. ML can prime the accurate and fast analysis of these high-throughput data to identify gene regulatory networks (GRN) and novel host factors involved in host-virus interplay. Characterization of these host factors in terms of sub-cellular localization and structure prediction will boost understanding of plant virus pathogenesis. ML may also assist plant virologists in genomic selection to identify elite virus-resistant cultivars. This figure was created using BioRender (https://biorender.com/)
The application of ML-assisted diagnosis of plant viral diseases
| Plant | Viruses/viral diseases | Algorithms/methodologies used | Accuracy | References |
|---|---|---|---|---|
| Cassava | (i) Cassava mosaic disease; (ii) Cassava brown streak disease | Convolutional neural networks (CNN) | (i) 96%; (ii) 98% | [ |
| Cucumber | (i) Melon yellow spot virus; (ii) Zucchini yellow mosaic virus | CNN | 94.9% | [ |
| Mungbean | Yellow mosaic disease | CNN | 91.234% for VirLeafNet-1; 96.429% for VirLeafNet-2; 97.403% for VirLeafNet-3 | [ |
| Potato | Potato virus Y | Support vector machine (SVM) classifier | 89.8% | [ |
| Sweet pepper | Tomato spotted wilt virus (TSWV) | Outlier removal auxiliary classifier generative adversarial nets (OR-AC-GAN) | 96.25% (before the onset of visible symptoms) | [ |
| Tobacco | Tobacco mosaic virus (TMV) | Successive projections algorithm (SPA) with extreme learning machine (ELM) classifier | 98.33% | [ |
| Tobacco | TMV | SVM | 93.5% on the training set; 92.7% on the independent set | [ |
| Tobacco | TSWV | Boosted regression tree (BRT) model with wavelength selection by SPA | 85.2% | [ |
| Tobacco | Tomato leaf curl New Delhi virus and Tomato leaf curl Gujarat virus | CNN (Visual Geometry Group 16, VGG16) | 97.21% | [ |
| Tomato | Groundnut bud necrosis virus (GBNV) | SVM | 97.8% | [ |
Brief description of ML-based bioinformatics platforms used in studying plant virus interactions
| Name | Application | Input and output | Salient features | References |
|---|---|---|---|---|
| V-PIPE | Assess genetic diversity of viral populations and ensure identification of true viral variants from high-throughput data | | A hidden Markov model-based read aligner, ngshmmalign, is developed | [ |
| NBSPred | Identify potential NBS-LRR and NBS-LRR-like proteins | | Gene prediction tool Augustus2.7 is used to convert genomic sequences to protein sequences; TransDecoder is used to convert transcript sequences to protein sequences; (i) frequency of amino acids, dipeptides, tripeptides and multiplets, (ii) charge and (iii) hydrophobicity are considered for the calculation of sequence compositional properties | [ |
| LOCALIZER | Predict the sub-cellular localization of plant proteins and of effector proteins encoded by plant-infecting fungi and oomycetes | | Trained by a support vector machine model; Maximum range: 2000 sequences; (ii) Identification of transit peptides (for chloroplast and mitochondria) and nuclear localization signal (NLS) | [ |
| MU-LOC | Predict the mitochondrial localization of plant proteins | | The predictor has been trained using support vector machine and deep neural network | [ |
| pVsupPred | Predict RNA silencing suppressor activity of viral proteins (VSR) | | Random forest model-guided tool; Prediction on the basis of presence of (i) GW/WG motif and (ii) dsRNA binding domain in the viral protein | [ |
| AlphaFold | Predict the structure of a protein | | Neural network-based model; Median accuracy: (i) 6.6 Å for AlphaFold, (ii) 1.5 Å for AlphaFold2 | [ |
| VirFinder | Identify sequences of viruses from metagenomic data | | | [ |
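Two of the sequence-derived features named in the table, compositional frequencies (amino acids, dipeptides, tripeptides) and the presence of GW/WG motifs, are simple to compute. The sketch below is a hedged illustration of those feature types only: it does not reproduce the actual feature definitions or code of NBSPred or pVsupPred, and the function names and the toy sequence are hypothetical.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer in a protein sequence.

    k=1 gives amino acid frequencies, k=2 dipeptides, k=3 tripeptides.
    """
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

def count_gw_wg_motifs(sequence):
    """Count GW and WG dipeptide occurrences (overlapping matches allowed)."""
    return sum(sequence[i:i + 2] in ("GW", "WG") for i in range(len(sequence) - 1))

# Hypothetical toy sequence, not a real viral protein.
toy_sequence = "MGWSAAGWGKL"

amino_acid_freq = kmer_frequencies(toy_sequence, 1)
dipeptide_freq = kmer_frequencies(toy_sequence, 2)
motif_count = count_gw_wg_motifs(toy_sequence)

print(amino_acid_freq["G"])   # share of glycine residues
print(dipeptide_freq["GW"])   # share of GW dipeptides
print(motif_count)            # number of GW/WG motifs
```

Feature vectors built this way have a fixed meaning per position regardless of sequence length, which is what lets classifiers such as random forests or support vector machines compare proteins of different sizes.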