Mohamed Amine Remita, Ahmed Halioui, Abou Abdallah Malick Diouara, Bruno Daigle, Golrokh Kiani, Abdoulaye Baniré Diallo.
Abstract
BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for a specific, well-studied family of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.
Keywords: Prediction; Sequence classification; Virus classification
Year: 2017 PMID: 28399797 PMCID: PMC5387389 DOI: 10.1186/s12859-017-1602-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of the CASTOR kernel architecture. The kernel is composed of two main units (classifier construction and prediction). White rectangles represent input and output data; grey and curved rectangles represent processes. TS and VS are the training set and the validation set, respectively
Fig. 2 Class cohesion of three virus datasets. The four columns illustrate the separability and compactness of three datasets of complete virus genomes, based on 172 restriction enzyme cuts. The first column shows heatmaps of CUT clustered along the x-axis. The samples on the y-axis are grouped by studied class, followed by intra-class clustering. The second column shows MDS of the CUT distances between samples. The third and fourth columns show, respectively, the Cohesion and Silhouette indexes of the classes. a Classes in HPV are Alpha species, Beta and Gamma genera. b Classes in HBV are genotypes A-H. c Classes in HIV-1 are M pure subtypes and CRFs
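The caption says only that the CUT features are based on 172 restriction enzyme cuts. As a rough illustration of what such a feature vector could look like, the sketch below counts occurrences of a few recognition sites in a sequence. The three sites are the standard recognition sequences of EcoRI, BamHI and HindIII; the function names, the counting rule (overlapping exact matches) and the choice of enzymes are assumptions for illustration, not CASTOR's implementation.

```python
# Hypothetical subset of recognition sites; the 172 enzymes used by CASTOR
# are not listed in this record.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def count_sites(sequence, site):
    """Count exact (possibly overlapping) occurrences of `site` in `sequence`."""
    sequence = sequence.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def cut_profile(sequence, sites=SITES):
    """Return one count per enzyme, in the insertion order of `sites`."""
    return [count_sites(sequence, site) for site in sites.values()]


if __name__ == "__main__":
    # Toy sequence: two EcoRI sites and one HindIII site.
    print(cut_profile("GAATTCAAGCTTGAATTC"))
```

Profiles of this kind, one per genome, are the sort of numeric vectors on which distances, MDS projections and silhouette scores can then be computed.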
CASTOR best accuracies on the classification of five datasets
| Group of virus | Organism | Classification | # of classes | # of instances | TPR | FPR | F-measure | Classifier ID |
|---|---|---|---|---|---|---|---|---|
| I (dsDNA) | HPV | Genera | 3 | 125 | 0.992 | 0.005 | 0.992 | PMSHPV01 |
| I (dsDNA) | HPV | Alpha species | 8 | 118 | 0.992 | 0.002 | 0.992 | PMSHPV02 |
| VII (dsDNA-RT) | HBV | Genotypes | 8 | 230 | 0.996 | 0.001 | 0.996 | PMSHBV01 |
| VI (ssRNA-RT) | HIV-1 | Groups | 4 | 76 | 1.000 | 0.000 | 1.000 | PMSHIV01 |
| VI (ssRNA-RT) | HIV-1 | M Subtypes | 18 | 597 | 0.983 | 0.001 | 0.983 | PMSHIV02 |
This table contains the best results of the experimental study performed on the different datasets. The evaluation measures were obtained with 10-fold cross-validation. The column Classifier ID gives the corresponding models available on the CASTOR platform
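For readers unfamiliar with the protocol, 10-fold cross-validation splits the instances into ten folds and uses each fold once as the test set while training on the other nine. The sketch below is a minimal pure-Python version; stratifying the folds by class is an assumption made here (the record does not say how folds were drawn), and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign instance indices to k folds, spreading each class across folds.

    Stratification is an assumption of this sketch, not a documented detail.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            # Deal indices round-robin so every fold gets a share of each class.
            folds[position % k].append(idx)
            position += 1
    return folds


def train_test_splits(labels, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    toy_labels = ["A"] * 20 + ["B"] * 10
    for train, test in train_test_splits(toy_labels):
        print(len(train), len(test))
```

Each instance appears in exactly one test fold, so the reported measures can be aggregated over all ten held-out predictions.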
Fig. 3 Learning algorithm evaluation on five datasets. This figure illustrates the F-measure distribution (boxplot) of seven learning algorithms on the prediction of a HPV genera, b HPV Alpha species, c HBV genotypes, d HIV-1 M subtypes with complete genomes and e HIV-1 M subtypes with pol fragments. HPV and HBV datasets are complete genomes. The number below each boxplot corresponds to the statistically discriminative rank of the algorithm. The ranking is performed with a paired Student's t test. μ and σ are the mean and the standard deviation of the overall F-measures, respectively. p is the p-value for the statistical significance of the differences in weighted F-measure means among the algorithms, computed with the Wilcoxon/Kruskal-Wallis test
Evaluation of HIV-1 classification with CASTOR
| Sequence type | Classification | # of classes | # of instances | [min - max] instances/class | TPR | FPR | F-measure | Classifier ID |
|---|---|---|---|---|---|---|---|---|
| Complete genomes | Groups (M, N, O and P) | 4 | 76 | [4 – 32] | 1.000 | 0.000 | 1.000 | PMVHIVGC01 |
| Complete genomes | Pure subtypes | 6 | 189 | [30 – 36] | 0.995 | 0.001 | 0.995 | PMVHIVGC02 |
| Complete genomes | CRFs | 12 | 234 | [10 – 30] | 1.000 | 0.000 | 1.000 | PMVHIVGC03 |
| Complete genomes | Pure subtypes and CRFs | 18 | 423 | [10 – 36] | 0.981 | 0.001 | 0.981 | PMVHIVGC04 |
| Complete genomes | Pure subtypes vs CRFs | 2 | 200 | [100 – 100] | 0.795 | 0.205 | 0.795 | PMVHIVGC05 |
| pol fragments | Groups (M, N, O and P) | 4 | 94 | [4 – 45] | 1.000 | 0.000 | 1.000 | PMVHIVPL01 |
| pol fragments | Pure subtypes | 6 | 1800 | [300 – 300] | 0.983 | 0.003 | 0.983 | PMVHIVPL02 |
| pol fragments | CRFs | 16 | 480 | [30 – 30] | 0.971 | 0.002 | 0.971 | PMVHIVPL03 |
| pol fragments | CRFs | 6 | 1200 | [200 – 200] | 0.993 | 0.001 | 0.993 | PMVHIVPL04 |
| pol fragments | Pure subtypes and CRFs | 23 | 690 | [30 – 30] | 0.920 | 0.004 | 0.919 | PMVHIVPL05 |
| pol fragments | Pure subtypes and CRFs | 12 | 2400 | [200 – 200] | 0.962 | 0.003 | 0.962 | PMVHIVPL06 |
| pol fragments | Pure subtypes vs CRFs | 2 | 200 | [100 – 100] | 0.885 | 0.115 | 0.885 | PMVHIVPL07 |
This table contains the TPR, FPR and F-measure of 12 HIV-1 classifications obtained with 10-fold cross-validation. For each classification, the numbers of corresponding classes and instances are given. The range [min-max] indicates the interval of instance frequencies per class used during the training of each model. The column Classifier ID gives the corresponding models available on the CASTOR platform
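The three measures named in this note follow from one-vs-rest counts of true and false positives per class. The sketch below is a minimal, assumption-laden version: the function names are hypothetical, and averaging the per-class values weighted by class size is an assumption of this sketch rather than something the note states.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: {"TPR", "FPR", "F"}} using one-vs-rest counts."""
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        tn = n - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0        # recall / sensitivity
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        metrics[c] = {"TPR": tpr, "FPR": fpr, "F": f}
    return metrics


def weighted_mean(metrics, y_true, key):
    """Average a per-class measure, weighting each class by its true size."""
    support = Counter(y_true)
    return sum(metrics[c][key] * support[c] for c in metrics) / len(y_true)


if __name__ == "__main__":
    truth = ["A", "A", "A", "B", "B"]
    predicted = ["A", "A", "B", "B", "B"]
    m = per_class_metrics(truth, predicted)
    for key in ("TPR", "FPR", "F"):
        print(key, round(weighted_mean(m, truth, key), 3))
```

With class-size weighting, the averaged TPR equals overall accuracy, which would be one explanation for TPR and F-measure being nearly identical in most rows.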
Fig. 4 Performance of CASTOR compared with the COMET and REGA predictors on HIV-1 datasets. Panels a and b show the percentage of correct classifications for HIV-1 complete genomes and HIV-1 pol fragments, respectively. The number of instances and the associated classes for each sampling are presented above the panels. Complete sampling corresponds to 10% of Los Alamos HIV data selected randomly. In specific subtypes sampling, the predictors are assessed against their trained classes. In common subtypes sampling, the predictors are assessed against the intersection of the classes of the three trained predictors
Performances of HIV-1 predictors on complete genome classification
| Type | Class | # of instances | COMET TPR | COMET FPR | COMET F-measure | REGA TPR | REGA FPR | REGA F-measure | CASTOR TPR | CASTOR FPR | CASTOR F-measure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRFs | HIV1_01_AE | 100 | 0.960 | 0.000 | 0.980 | 0.970 | 0.000 | 0.985 | 1.000 | 0.000 | 1.000 |
| CRFs | HIV1_02_AG | 10 | 0.900 | 0.000 | 0.947 | 0.700 | 0.000 | 0.824 | 0.900 | 0.007 | 0.818 |
| CRFs | Mean | | 0.930 | 0.000 | 0.964 | 0.835 | 0.000 | 0.905 | 0.950 | 0.004 | 0.909 |
| Pure subtypes | HIV1_A | 100 | 0.660 | 0.000 | 0.795 | 0.990 | 0.000 | 0.995 | 0.940 | 0.000 | 0.969 |
| Pure subtypes | HIV1_B | 100 | 0.910 | 0.000 | 0.953 | 1.000 | 0.000 | 1.000 | 0.960 | 0.003 | 0.975 |
| Pure subtypes | HIV1_C | 100 | 0.970 | 0.000 | 0.985 | 1.000 | 0.000 | 1.000 | 0.970 | 0.003 | 0.980 |
| Pure subtypes | Mean | | 0.847 | 0.000 | 0.911 | 0.997 | 0.000 | 0.998 | 0.957 | 0.002 | 0.975 |
This table contains the TPR, FPR and F-measure of COMET, REGA and CASTOR on the prediction of HIV-1 M pure subtypes and CRFs for complete genomes. The classes shown belong to the common subtypes sampling. The CASTOR model used in this evaluation is PMSHIV02
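A quick arithmetic check helps in reading the complete-genome comparison. When a class has no false positives, precision is 1 and the F-measure reduces to 2·TPR / (1 + TPR). The sketch below applies that identity to the COMET values reported for complete genomes; it assumes the reported FPR of 0.000 means exactly zero false positives, which rounding makes uncertain, and the function name is hypothetical.

```python
def f_measure_without_false_positives(tpr):
    """F-measure when precision is 1, i.e. no false positives for the class."""
    return 2 * tpr / (1 + tpr)


# (TPR, reported F-measure) for COMET on complete genomes, copied from the table.
COMET_COMPLETE_GENOMES = {
    "HIV1_01_AE": (0.960, 0.980),
    "HIV1_02_AG": (0.900, 0.947),
    "HIV1_A": (0.660, 0.795),
    "HIV1_B": (0.910, 0.953),
    "HIV1_C": (0.970, 0.985),
}

if __name__ == "__main__":
    for name, (tpr, reported) in COMET_COMPLETE_GENOMES.items():
        derived = f_measure_without_false_positives(tpr)
        print(f"{name}: derived {derived:.3f}, reported {reported:.3f}")
```

The same identity explains why a class such as HIV1_A can show an FPR of 0.000 and still have a modest F-measure: the penalty comes entirely from missed instances, not from wrong assignments to that class.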
HIV-1 predictor performances on pol fragment classification
| Type | Class | # of instances | COMET TPR | COMET FPR | COMET F-measure | REGA TPR | REGA FPR | REGA F-measure | CASTOR TPR | CASTOR FPR | CASTOR F-measure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRFs | HIV1_01_AE | 1000 | 0.989 | 0.000 | 0.993 | 0.007 | 0.000 | 0.014 | 0.956 | 0.001 | 0.975 |
| CRFs | HIV1_02_AG | 1000 | 0.952 | 0.002 | 0.967 | 0.000 | 0.000 | 0.000 | 0.853 | 0.005 | 0.897 |
| CRFs | HIV1_06_cpx | 698 | 0.924 | 0.000 | 0.958 | 0.938 | 0.000 | 0.965 | 0.927 | 0.003 | 0.943 |
| CRFs | HIV1_07_BC | 1000 | 0.977 | 0.000 | 0.988 | 0.988 | 0.000 | 0.993 | 0.982 | 0.002 | 0.980 |
| CRFs | HIV1_08_BC | 399 | 0.965 | 0.000 | 0.981 | 0.990 | 0.000 | 0.994 | 0.972 | 0.001 | 0.970 |
| CRFs | HIV1_11_cpx | 58 | 0.828 | 0.000 | 0.906 | 0.690 | 0.000 | 0.816 | 0.897 | 0.006 | 0.588 |
| CRFs | HIV1_12_BF | 222 | 0.860 | 0.000 | 0.925 | 0.374 | 0.000 | 0.544 | 0.932 | 0.008 | 0.807 |
| CRFs | Mean | | 0.928 | 0.000 | 0.960 | 0.570 | 0.000 | 0.618 | 0.931 | 0.004 | 0.880 |
| Pure subtypes | HIV1_A | 1000 | 0.966 | 0.001 | 0.980 | 0.968 | 0.106 | 0.654 | 0.891 | 0.006 | 0.917 |
| Pure subtypes | HIV1_B | 1000 | 0.995 | 0.001 | 0.993 | 0.945 | 0.000 | 0.970 | 0.817 | 0.007 | 0.866 |
| Pure subtypes | HIV1_C | 1000 | 0.990 | 0.001 | 0.991 | 0.997 | 0.000 | 0.997 | 0.912 | 0.003 | 0.942 |
| Pure subtypes | HIV1_D | 1000 | 0.938 | 0.000 | 0.968 | 0.911 | 0.000 | 0.953 | 0.892 | 0.010 | 0.899 |
| Pure subtypes | HIV1_F | 1000 | 0.927 | 0.000 | 0.962 | 0.970 | 0.000 | 0.985 | 0.914 | 0.003 | 0.940 |
| Pure subtypes | HIV1_G | 1000 | 0.915 | 0.001 | 0.952 | 0.929 | 0.007 | 0.931 | 0.778 | 0.003 | 0.860 |
| Pure subtypes | Mean | | 0.955 | 0.001 | 0.974 | 0.953 | 0.019 | 0.915 | 0.867 | 0.005 | 0.904 |
This table contains the TPR, FPR and F-measure of COMET, REGA and CASTOR on the prediction of HIV-1 M pure subtypes and CRFs for pol fragments. The classes shown belong to the common subtypes sampling. The CASTOR model used in this evaluation is PMSHIV03