Giulia Fiscon, Emanuel Weitschek, Eleonora Cella, Alessandra Lo Presti, Marta Giovanetti, Muhammed Babakir-Mina, Marco Ciotti, Massimo Ciccozzi, Alessandra Pierangeli, Paola Bertolazzi, Giovanni Felici.
Abstract
BACKGROUND: Continuous improvements in next generation sequencing technologies have led to ever-increasing collections of genomic sequences, which biologists have not been able to characterize easily and whose analysis requires huge computational effort. The classification of species has emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignment-, phylogenetic tree-, statistical- and character-based methods.
Keywords: Classification of genomic sequences; Extraction of multiple classification models; Genetic algorithms; Supervised learning
Year: 2016 PMID: 27980679 PMCID: PMC5139023 DOI: 10.1186/s13040-016-0116-2
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Overview of the parameters of the genetic algorithm
| Parameter | Description |
|---|---|
| Maxiter | maximum number of iterations |
| Max | maximum length of the subsequences |
| Base | starting value of |
| Dimstore | maximum cardinality of |
| Initpop | cardinality of the initial population |
Fig. 1 Extended flowchart of MISSEL. Graphical representation of the whole procedure implemented by the genetic algorithm: (i) initialization of a random population of individuals; (ii) evolution of the current population according to a set of genetic operators; (iii) evaluation of the fitness of each individual in the population according to the fitness function F, and updating of the fitness value; (iv) checking of the termination conditions. The evolution step is further composed of probability computation, selection of individuals, parthenogenesis, mutation, trimming and dominance checking; the individual is then inserted into the current population, which is also cleaned according to dominance rules
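The four steps in the caption can be illustrated with a small, generic genetic-algorithm loop. This is only a sketch under stated assumptions, not MISSEL itself: the representation of an individual as a set of sequence positions, the `toy_fitness` function, the `dominates` rule, and every parameter name and default value are hypothetical choices made for illustration, since the source does not define F or the dominance rules.

```python
import random

def toy_fitness(individual, target):
    """Hypothetical fitness: fraction of target positions covered, in [0, 1]."""
    return len(individual & target) / len(target)

def dominates(a, b, fitness):
    """Assumed rule: a dominates b if a is a proper subset of b and no less fit."""
    return a < b and fitness(a) >= fitness(b)

def run_ga(n_positions, target, max_iter=2000, max_len=5, init_pop=20, seed=0):
    rng = random.Random(seed)

    def fitness(ind):
        return toy_fitness(ind, target)

    # (i) Initialization: random individuals, each a small set of positions.
    population = {
        frozenset(rng.sample(range(n_positions), rng.randint(1, max_len)))
        for _ in range(init_pop)
    }
    # Clean the initial population according to the assumed dominance rule.
    population = {
        p for p in population
        if not any(dominates(q, p, fitness) for q in population)
    }

    for _ in range(max_iter):
        # (ii) Evolution. Probability computation: weights proportional to fitness.
        pool = sorted(population, key=sorted)  # fixed order for reproducibility
        weights = [fitness(ind) + 1e-9 for ind in pool]
        # Selection of one individual.
        parent = rng.choices(pool, weights=weights, k=1)[0]
        # Parthenogenesis: the child starts as a copy of a single parent.
        child = set(parent)
        # Mutation: add one random position.
        child.add(rng.randrange(n_positions))
        # Trimming: drop random positions until the length bound holds.
        while len(child) > max_len:
            child.remove(rng.choice(sorted(child)))
        child = frozenset(child)

        # Dominance checking, insertion, and cleaning of the population.
        if any(dominates(other, child, fitness) for other in population):
            continue
        population = {p for p in population if not dominates(child, p, fitness)}
        population.add(child)

        # (iii) Fitness evaluation and (iv) termination check.
        if any(fitness(ind) == 1.0 for ind in population):
            break
    return population, fitness

if __name__ == "__main__":
    final_pop, fit = run_ga(n_positions=30, target=frozenset({3, 11, 27}))
    best = max(final_pop, key=fit)
    print(sorted(best), fit(best))
```

The loop keeps the population free of dominated individuals at every iteration, which mirrors the cleaning step described in the caption; a real implementation would also bound the population size and use a classification-based fitness.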
Data set description of Influenza viruses
| Class/genomic region | NA | HA | MP |
|---|---|---|---|
| H1N1 | 5999 | 6110 | 11994 |
| H3N2 | 4716 | 4715 | 9427 |
| Number of sequences | 10715 | 10825 | 21421 |
| Number of nucleotides | 1410 | 1701 | 756 |
The number of sequences and their corresponding nucleotides is shown for each virus subtype (H1N1 and H3N2), which is considered as a different class, and for each genomic region
Data set description of Polyoma viruses
| Class/genomic region | VP1 | VP2 | VP3 | ST | LT |
|---|---|---|---|---|---|
| BKPyV | 26 | 25 | 25 | 13 | 26 |
| HPyV6 | 7 | 7 | 7 | 7 | 7 |
| HPyV7 | 7 | 7 | 7 | 7 | 7 |
| HPyV9 | 2 | 2 | 2 | 2 | 2 |
| HPyV10 | 1 | 1 | 1 | 1 | 1 |
| HPyV12 | 2 | 2 | 2 | 2 | 2 |
| JCPyV | 23 | 20 | 21 | 15 | 21 |
| KIPyV | 10 | 8 | 8 | 14 | 8 |
| MCPyV | 3 | 2 | 2 | 28 | 13 |
| MW | 19 | 19 | 19 | 15 | 19 |
| MX | 1 | 1 | 1 | 1 | 1 |
| STLPyV | 6 | 6 | 6 | 6 | 6 |
| WUPyV | 14 | 23 | 14 | 16 | 14 |
| Number of sequences | 121 | 123 | 115 | 127 | 127 |
| Number of nucleotides | 1065 | 726 | 588 | 519 | 828 |
The number of sequences and their corresponding nucleotides is shown for each virus subtype, which is considered as a different class, and for each genomic region
Data set description of Rhino viruses
| Class/genomic region | VP4/2 |
|---|---|
| A | 752 |
| B | 209 |
| C | 355 |
| Number of sequences | 1316 |
| Number of nucleotides | 369 |
The number of sequences and their corresponding nucleotides is shown for each virus subtype (A, B, C), which is considered as a different class, and for the VP4/2 genomic region
Setting of parameters used for the execution of MISSEL
| Data set | Maxiter | Max | Base | Dimstore |
|---|---|---|---|---|
| Influenza viruses | 5·10⁴ | 10–20 | 5 | 10⁶ |
| Polyoma viruses | 5·10⁴ | 20 | 5 | 10⁶ |
| Rhino viruses | 5·10⁴ | 20 | 5 | 10⁶ |
Number of equivalent and non-dominated solutions for Influenza viruses H1N1 and H3N2 with β≤10 for HA and NA genomic regions and β≤20 for MP genomic region
| Genomic region | Number of solutions |
|---|---|
| HA | 655 |
| MP | 23 |
| NA | 486 |
| Total number of solutions | 1164 |
Fig. 4 Distribution of non-dominated solutions for genomic regions: (a) HA, (b) MP, (c) NA of Influenza viruses; (d) LT, (e) ST, (f) VP1, (g) VP2, (h) VP3 of Polyoma viruses; and (i) VP4/2 of Rhino viruses
Classification accuracy on training and test set for the 3 genomic regions of Influenza viruses (mean ± standard deviation computed on all solutions)
| Genomic region | β | Train [%] | Test [%] |
|---|---|---|---|
| HA | 2 | 95.44 ± 17.14 | 95.46 ± 17.05 |
| | 3 | 99.99 ± 0.06 | 99.98 ± 0.08 |
| | 4 | 99.99 ± 0.11 | 99.96 ± 0.15 |
| | 5 | 100 ± 0.06 | 99.98 ± 0.09 |
| | 6 | 100 | 99.99 ± 0.04 |
| | 7 | 100 | 99.99 ± 0.04 |
| | 8 | 100 | 99.98 ± 0.05 |
| | 9 | 100 | 99.98 ± 0.06 |
| | 10 | 100 | 99.98 ± 0.07 |
| NA | 2 | 100 ± 0.01 | 99.98 ± 0.02 |
| | 3 | 99.98 ± 0.11 | 99.96 ± 0.13 |
| | 4 | 100 | 99.99 ± 0.02 |
| | 5 | 100 | 99.99 ± 0.02 |
| | 6 | 100 | 99.99 ± 0.02 |
| | 7 | 100 | 99.99 ± 0.03 |
| | 8 | 100 | 99.99 ± 0.03 |
| | 9 | 100 | 99.99 ± 0.02 |
| | 10 | 100 | 100 ± 0.01 |
| MP | 2 | 56.72 ± 29.09 | 56.41 ± 28.95 |
| | 3 | 84.59 ± 5.75 | 83.94 ± 5.48 |
| | 5 | 38.46 ± 2.44 | 38.39 ± 2.54 |
| | 8 | 71.15 ± 40.04 | 71.24 ± 40.10 |
| | 9 | 81.01 | 80.96 |
| | 12 | 81.51 | 80.72 |
| | 19 | 91.66 | 91.14 |
Accuracy rates of extracted equivalent and non-dominated solutions with β≤10
Fig. 2 Bar plots of the classification performances on the test sets for the three analyzed types of viruses. The reported values are averaged over the solutions with the same value of β; the error bars refer to the corresponding standard deviations computed on all solutions. (a) HA, (b) MP, (c) NA data sets of Influenza viruses; (d) LT, (e) ST, (f) VP1, (g) VP2, (h) VP3 data sets of Polyoma viruses; and (i) VP4/2 data set of Rhino viruses
Fig. 3 Bar plots of the classification performances on the training sets for the three analyzed types of viruses. The reported values are averaged over the solutions with the same value of β; the error bars refer to the corresponding standard deviations computed on all solutions. (a) HA, (b) MP, (c) NA data sets of Influenza viruses; (d) LT, (e) ST, (f) VP1, (g) VP2, (h) VP3 data sets of Polyoma viruses; and (i) VP4/2 data set of Rhino viruses
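The per-β aggregation described in these captions (mean accuracy over all solutions sharing the same β, with the standard deviation as the error bar) can be sketched as follows. The function name `summarize_by_beta` and the input values are illustrative and not taken from the paper; the use of the population standard deviation is also an assumption, since the source does not say whether the population or the sample formula was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize_by_beta(solutions):
    """Group (beta, accuracy_percent) pairs by beta and return mean and std.

    Uses the population standard deviation (an assumption), so a group
    containing a single solution gets a spread of 0.0.
    """
    groups = defaultdict(list)
    for beta, accuracy in solutions:
        groups[beta].append(accuracy)
    return {b: (mean(v), pstdev(v)) for b, v in sorted(groups.items())}

if __name__ == "__main__":
    # Made-up accuracies for illustration only.
    toy = [(2, 90.0), (2, 100.0), (3, 99.0)]
    for beta, (avg, std) in summarize_by_beta(toy).items():
        print(f"beta={beta}: {avg:.2f} ± {std:.2f}")
```

With the sample formula (`statistics.stdev`) a group of one solution has no defined spread, so the choice between the two matters for β values that yield a single solution.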
Number of equivalent and non-dominated solutions for Polyoma viruses with β≤20
| Genomic region | Number of solutions |
|---|---|
| LT | 53 |
| ST | 17 |
| VP1 | 84 |
| VP2 | 22 |
| VP3 | 18 |
| Total number of solutions | 194 |
Classification accuracy on training and test set for the 5 genomic regions of polyomaviruses (mean ± standard deviation computed on all solutions)
| Genomic region | β | Train [%] | Test [%] |
|---|---|---|---|
| LT | 2 | 100 | 93.94 |
| | 4 | 100 | 92.93 ± 1.75 |
| | 6 | 100 | 93.18 ± 1.56 |
| | 8 | 100 | 93.94 |
| | 9 | 100 | 93.94 |
| | 11 | 100 | 93.94 |
| | 13 | 100 | 93.94 |
| | 15 | 100 | 93.94 ± 1.50·10⁻¹⁴ |
| | 16 | 100 | 93.94 |
| | 18 | 100 | 93.94 |
| | 20 | 100 | 93.94 ± 2.92·10⁻¹⁴ |
| ST | 3 | 100 | 93.75 |
| | 5 | 100 | 93.75 |
| | 8 | 100 | 93.75 |
| | 11 | 100 | 93.75 |
| | 12 | 100 | 93.75 |
| | 17 | 100 | 93.75 |
| | 20 | 100 | 93.75 |
| VP1 | 2 | 100 | 90.32 |
| | 4 | 100 | 93.55 |
| | 6 | 100 | 93.55 ± 2.97·10⁻¹⁴ |
| | 8 | 100 | 93.15 ± 1.45 |
| | 9 | 100 | 93.55 |
| | 11 | 100 | 92.38 ± 2.61 |
| | 12 | 100 | 93.55 |
| | 14 | 100 | 93.55 |
| | 17 | 100 | 93.55 |
| | 19 | 100 | 92.74 ± 2.20 |
| | 20 | 100 | 93.55 |
| VP2 | 2 | 100 | 93.55 |
| | 4 | 100 | 93.55 |
| | 7 | 100 | 92.96 ± 1.31 |
| | 8 | 100 | 93.55 |
| | 10 | 100 | 93.55 |
| | 12 | 100 | 93.55 |
| | 13 | 100 | 93.55 |
| | 19 | 100 | 93.55 |
| | 20 | 100 | 93.55 |
| VP3 | 2 | 100 | 93.33 |
| | 4 | 100 | 91.11 ± 1.92 |
| | 6 | 100 | 93.33 |
| | 8 | 100 | 93.33 |
| | 13 | 100 | 93.33 |
| | 15 | 100 | 93.33 |
| | 16 | 100 | 93.33 |
| | 17 | 100 | 93.33 |
| | 19 | 100 | 93.33 |
Accuracy rates of extracted equivalent and non-dominated solutions with β≤20
Number of equivalent and non-dominated solutions for Rhino viruses with β≤20
| Genomic region | Number of solutions |
|---|---|
| VP4/2 (ABC-Rhino) | 11 |
Classification accuracy on training and test set for Rhino viruses genomic region (mean ± standard deviation computed on all solutions). Accuracy rates of extracted equivalent and non-dominated solutions with β≤20
| Genomic region | β | Train [%] | Test [%] |
|---|---|---|---|
| VP4/2 | 2 | 100 | 99.81 ± 0.27 |
| | 4 | 100 | 100 |
| | 8 | 100 | 100 |
| | 13 | 100 | 100 |
| | 16 | 100 | 100 |
| | 18 | 100 | 100 |
Fig. 5 Distribution of the number of extracted solutions for each β value of genomic regions: (a) HA, (b) MP, (c) NA of Influenza viruses; (d) LT, (e) ST, (f) VP1, (g) VP2, (h) VP3 of Polyoma viruses; and (i) VP4/2 of Rhino viruses. The x axes report the length of the solution (β); the y axes report the number of extracted solutions for each β
Fig. 6 Distribution of β values for the whole sequences of genomic regions: (a) HA, (b) MP, (c) NA of Influenza viruses; (d) LT, (e) ST, (f) VP1, (g) VP2, (h) VP3 of Polyoma viruses; and (i) VP4/2 of Rhino viruses. The x axes report the solutions; the y axes report the corresponding β values in decreasing order