Jeonghoon Kim, Kyuyoung Lee, Ruwini Rupasinghe, Shahbaz Rezaei, Beatriz Martínez-López, Xin Liu.
Abstract
Porcine reproductive and respiratory syndrome (PRRS) is an infectious disease of pigs caused by PRRS virus (PRRSV). A modified live-attenuated vaccine has been widely used to control the spread of PRRSV, and classifying field strains is key to successful control and prevention. Restriction fragment length polymorphism targeting the open reading frame 5 (ORF5) gene is widely used to classify PRRSV strains but has shown unstable accuracy. Phylogenetic analysis is a powerful tool for PRRSV classification with consistent accuracy, but it demands large computational power as the number of sequences increases. Our study aimed to apply four machine learning (ML) algorithms (random forest, k-nearest neighbor, support vector machine, and multilayer perceptron) to classify field PRRSV strains into four clades using amino acid scores based on the ORF5 gene sequence. We used amino acid sequences of the ORF5 gene from 1931 field PRRSV strains collected in the US from 2012 to 2020. Phylogenetic analysis was used to label each field PRRSV strain as one of four clades: Lineage 5 or one of three clades in Lineage 1. We measured the accuracy and time consumption of classification for the four ML approaches using gene sequences of different sizes. We found that all four ML algorithms classified a large number of field strains in a very short time (<2.5 s) with very high accuracy (area under the receiver operating characteristic curve >0.99). Furthermore, the random forest approach detected a total of 4 key amino acid positions for the classification of field PRRSV strains into four clades. Our findings provide insight for developing a rapid and accurate classification model using genetic information, which would also enable the handling of large genome datasets in real time or semi-real time for data-driven decision-making and more timely surveillance.
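The accuracy figure quoted in the abstract is an area under the ROC curve (AUC). The abstract does not say how the AUC was aggregated over the four clades, so the following is only a minimal sketch that assumes an unweighted one-vs-rest average. The function names and the toy scores are hypothetical and are not taken from the study.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_clades, clade_scores):
    """Unweighted mean of per-clade AUCs (ASSUMED aggregation).

    clade_scores maps each clade name to a list of per-strain scores,
    where a higher score means "more likely to belong to this clade".
    """
    aucs = []
    for clade, scores in clade_scores.items():
        labels = [1 if c == clade else 0 for c in true_clades]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Toy example with made-up classifier scores for three strains.
toy_truth = ["L5", "L1A", "L1A"]
toy_scores = {"L5": [0.9, 0.2, 0.4], "L1A": [0.1, 0.8, 0.6]}
print(one_vs_rest_auc(toy_truth, toy_scores))
```

The pairwise form is quadratic in the number of strains, which is fine for illustration; a rank-based formula would be the usual choice for thousands of sequences.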
Keywords: artificial intelligence; classification; k-nearest neighbor; multilayer perceptron; phylogenetic tree; random forest; support vector machine; swine health
Year: 2021 PMID: 34368274 PMCID: PMC8345883 DOI: 10.3389/fvets.2021.683134
Source DB: PubMed Journal: Front Vet Sci ISSN: 2297-1769
Figure 1. Flow chart of data preprocessing from one amino acid sequence into five numeric scores.
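The preprocessing in Figure 1 turns each amino acid into five numeric scores, so a 200-residue sequence becomes 1,000 features. The page does not list the score table, so the sketch below uses a placeholder table filled with seeded random numbers purely to show the shape of the encoding. `PLACEHOLDER_SCORES`, `encode_sequence`, and the choice to map unknown residues such as X to zeros are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
N_SCORES = 5                          # five numeric scores per residue

# PLACEHOLDER table: seeded random numbers, NOT real physicochemical scores.
_rng = random.Random(0)
PLACEHOLDER_SCORES = {
    aa: [round(_rng.uniform(-1.0, 1.0), 3) for _ in range(N_SCORES)]
    for aa in AMINO_ACIDS
}

# Assumption: residues outside the table (for example X) become zeros.
UNKNOWN = [0.0] * N_SCORES


def encode_sequence(seq, table=PLACEHOLDER_SCORES):
    """Flatten a sequence into one feature list, five scores per residue."""
    features = []
    for residue in seq.upper():
        features.extend(table.get(residue, UNKNOWN))
    return features


# A 200-residue toy sequence yields 200 * 5 = 1000 features.
print(len(encode_sequence("A" * 200)))
```

Swapping in a real score table only requires replacing the dictionary; the flattening logic stays the same.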
Figure 2. The phylogeny of 1931 field porcine reproductive and respiratory syndrome virus (PRRSV) strains estimated by the maximum likelihood approach based on the nucleotide sequences of the open reading frame 5 (ORF5) gene.
Figure 3. Principal component analysis (PCA) visualizations of 1,000 amino acid scores of 1931 field PRRSV strains. [Red: L5 clade, purple: L1A clade, yellow: L1B clade, and green: L1C clade].
Figure 4. Distributions of importance scores for random forest classification of 1931 field PRRSV strains between four clades based on 200 amino acid positions. (A) Importance scores in descending order. (B) Importance scores by position along the amino acid sequence.
Top 4 key amino acid positions in the open reading frame 5 gene with the highest random forest importance scores for the classification of the 1931 field PRRSV strains [importance score > 0.06]. Columns: amino acid position, importance score, and amino acid (%) in each of the four clades.
| 26 | 0.145 | A (98.4) | V (98.3) | V (94.2) | A (96.5) |
| 170 | 0.071 | E (99.5) | E (98.1) | N (100) | G (100) |
| 137 | 0.070 | A (99.8) | S (100) | S (100) | S (100) |
| 191 | 0.063 | R (99.5) | K (99.8) | K (100) | R (64.8) |
X denotes a sequencing error.
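Each clade cell in the table above pairs an amino acid with a percentage. Assuming these are the most frequent residue at that position and its share of strains within the clade, a summary of that kind could be computed roughly as follows. The function name and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_residue_by_clade(sequences, clades, position):
    """Most frequent residue and its percentage per clade.

    position is 1-based, matching the way positions are numbered in the table.
    """
    counts = defaultdict(Counter)
    for seq, clade in zip(sequences, clades):
        counts[clade][seq[position - 1]] += 1

    summary = {}
    for clade, counter in counts.items():
        residue, n = counter.most_common(1)[0]
        share = round(100.0 * n / sum(counter.values()), 1)
        summary[clade] = (residue, share)
    return summary


# Toy data: four two-residue "sequences" with made-up clade labels.
toy_seqs = ["AV", "AV", "GV", "SK"]
toy_clades = ["L5", "L5", "L5", "L1A"]
print(dominant_residue_by_clade(toy_seqs, toy_clades, position=1))
```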
Figure 5. Accuracy and time consumption of the four machine learning algorithms in five experiments for PRRSV classification. The first experiment includes all 200 amino acid positions, and the other four experiments sequentially involve the top 4 amino acid positions [26th, 170th, 137th, and 191st] ranked by importance score. Top: area under the curve (AUC) values. Bottom: time consumption (seconds). Orange lines are mean values over 100 runs.
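The reduced experiments restrict the input to a few amino acid positions and report mean run time over 100 runs. A rough sketch of both steps is given below. It assumes the feature vector stores the five scores of each position contiguously (position-major order), which the page does not confirm, and every function name here is hypothetical.

```python
import time

N_SCORES = 5                          # five numeric scores per position (Figure 1)
TOP_POSITIONS = [26, 170, 137, 191]   # 1-based positions named in the caption


def columns_for_positions(positions, n_scores=N_SCORES):
    """Column indices of the scores belonging to the given 1-based positions.

    ASSUMES position-major layout: position p occupies columns
    (p - 1) * n_scores through p * n_scores - 1.
    """
    cols = []
    for pos in positions:
        start = (pos - 1) * n_scores
        cols.extend(range(start, start + n_scores))
    return cols


def subset_features(feature_vector, positions, n_scores=N_SCORES):
    """Keep only the scores for the selected positions."""
    return [feature_vector[c] for c in columns_for_positions(positions, n_scores)]


def mean_runtime(fn, runs=100):
    """Mean wall-clock seconds per call of fn() over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# A stand-in 1000-feature vector reduced to 4 positions * 5 scores = 20 features.
full = list(range(1000))
reduced = subset_features(full, TOP_POSITIONS)
print(len(reduced))
print(mean_runtime(lambda: subset_features(full, TOP_POSITIONS)))
```

In practice `fn` would wrap the fit-and-predict step of a classifier; here it only times the subsetting so the sketch stays free of external dependencies.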
The class-wise averaged approximate precision/recall/F1-score values for the four machine learning (ML) algorithms across the five experiments.
| Algorithm | Experiment 1 | Experiment 2 | Experiment 3 | Experiment 4 | Experiment 5 |
| --- | --- | --- | --- | --- | --- |
| RF | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.75/0.85/0.79 |
| SVM | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.75/0.85/0.79 |
| KNN | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.98/0.98/0.98 | 0.73/0.83/0.77 |
| CNN | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.99/0.99 | 0.99/0.98/0.98 | 0.73/0.83/0.77 |
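The values above are described as class-wise averaged precision, recall, and F1-score. One common reading is an unweighted (macro) average over the clades, sketched below. Whether the study weighted classes equally is an assumption, and `macro_prf` and the toy labels are hypothetical.

```python
def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1.

    A class with no predicted (or no true) members contributes 0.0
    for the undefined metric, which is one convention among several.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels: one L5 strain is misclassified as L1A.
truth = ["L5", "L5", "L1A", "L1A"]
preds = ["L5", "L1A", "L1A", "L1A"]
print(macro_prf(truth, preds))
```

Because F1 is averaged per class, the macro F1 need not equal the harmonic mean of the macro precision and macro recall, which is consistent with triples such as 0.75/0.85/0.79 in the table.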