Jose Cleydson F Silva1,2, Thales F M Carvalho1, Marcos F Basso2, Michihito Deguchi2, Welison A Pereira2, Roberto R Sobrinho2, Pedro M P Vidigal3, Otávio J B Brustolini2, Fabyano F Silva4, Maximiller Dal-Bianco2, Renildes L F Fontes5, Anésia A Santos2,6, Francisco Murilo Zerbini2,7, Fabio R Cerqueira1,8, Elizabeth P B Fontes9,10.
Abstract
BACKGROUND: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics.
Keywords: Data Warehouse; Data mining; Geminivirus; Knowledge discovery; Machine learning; Random Forest
Year: 2017 PMID: 28476106 PMCID: PMC5420152 DOI: 10.1186/s12859-017-1646-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Overview of the geminivirus.org framework. Initially, the geminivirus data were recovered from GenBank in the GenBank file format (1). The data were extracted, transformed, and standardized using algorithms based on rules and machine learning (ML) approaches (2). Next, the abstracts of the scientific publications were recovered from PubMed (https://www.ncbi.nlm.nih.gov/pubmed) (3) and the geographic coordinates of the isolates were retrieved from Google Maps (4). Data were merged and loaded into the relational database (5) in different dimensions such as the collection date, host range, geographic region, genomic data, associated publications, and organism data. The data were used to define the training set for building ML models to classify genera using Random Forest (RF), Multilayer Perceptron (MLP), and Sequential Minimal Optimization (SMO) learning algorithms (6). Information and analytical tools, such as basic local alignment tools (BLAST), sequence demarcation tools, and phylogenetic reconstruction, were embedded in the system (7) and an ORF Search tool for classification of ORFs based on ML procedures (8) was implemented. All analysis results are visible and freely available (9).
Example of information extracted from the GenBank file and stored in geminivirus.org
| Tag | Value |
|---|---|
| LOCUS | KJ939916 |
| DEFINITION | Soybean chlorotic spot virus isolate BR:Flt14:11 segment DNA-A, complete sequence. |
| ORGANISM | Soybean chlorotic spot virus |
| PUBMED | 25028472 |
| AUTHORS | Sobrinho,R.R., Xavier,C.A.D., Pereira,H.M.B., Lima,G.S.A., Assuncao,I.P., Mizubuti,E.S.G., Duffy,S. and Zerbini,F.M. |
| JOURNAL Submitted | Departamento de Fitopatologia, BIOAGRO, Universidade Federal de Vicosa, Av. Peter Henry Rolfs s/n, Vicosa, Minas Gerais 36570–900, Brazil |
| Assembly Method | CodonCode Aligner v. 4.1.1 DEMO |
| Sequencing Technology | Sanger dideoxy sequencing |
| host | Macroptilium lathyroides |
| taxon | 1221206 |
| country | Brazil |
| segment | DNA-A |
| lat_lon | |
| collection_date | 18-Mar-2012 |
| collected_by | |
| CDS | 199..954 |
| gene | |
| note | coat protein |
| product | CP |
| protein_id | AIN36521.1 |
| translation | MVKRDAPWRHMAGTSKVSRSSNFSPRGGGGPKNNRTSEWVNRPM … |
| ORIGIN | ACCGGATGGCCGCGCGATTTTTTATGGGCCTTATCTTTTGGCTCGTTCTTTTGGACCGAGTGTATTTGAATTAAAGTAAAGTTATTCCCTGTCCAA................ |
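The extraction step behind this table can be illustrated with a minimal sketch. It assumes a simplified, hand-built excerpt in GenBank flat-file layout that holds only values from the table above; `RECORD` and `parse_record` are hypothetical names, and this is not the parser used by geminivirus.org. For real records, an established parser such as Biopython's `Bio.SeqIO.parse(handle, "genbank")` would likely be more robust.

```python
import re

# Abbreviated, hand-built excerpt in GenBank flat-file layout. Values come from
# the table above; details the table does not give (sequence length on the
# LOCUS line, the source feature location) are deliberately left out.
RECORD = """\
LOCUS       KJ939916
DEFINITION  Soybean chlorotic spot virus isolate BR:Flt14:11 segment DNA-A,
            complete sequence.
  ORGANISM  Soybean chlorotic spot virus
   PUBMED   25028472
FEATURES             Location/Qualifiers
                     /host="Macroptilium lathyroides"
                     /db_xref="taxon:1221206"
                     /country="Brazil"
                     /segment="DNA-A"
                     /collection_date="18-Mar-2012"
     CDS             199..954
                     /note="coat protein"
                     /product="CP"
                     /protein_id="AIN36521.1"
"""

HEADER = re.compile(r"^\s{0,3}([A-Z]+)\s+(.*)$")      # e.g. LOCUS, ORGANISM, PUBMED
QUALIFIER = re.compile(r'^\s+/(\w+)="?([^"]*)"?$')    # e.g. /host="..."
CDS = re.compile(r"^\s{5}CDS\s+(\S+)$")               # CDS feature with its location


def parse_record(text):
    """Return a flat dict of header fields and feature qualifiers (hypothetical helper)."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        if m := CDS.match(line):
            fields["CDS"] = m.group(1)
            last_key = None
        elif m := QUALIFIER.match(line):
            key, value = m.groups()
            # The taxon ID is stored as a db_xref qualifier: /db_xref="taxon:NNN".
            if key == "db_xref" and value.startswith("taxon:"):
                key, value = "taxon", value.split(":", 1)[1]
            fields[key] = value
            last_key = None
        elif m := HEADER.match(line):
            last_key = m.group(1)
            fields[last_key] = m.group(2).strip()
        elif last_key and line.startswith(" " * 12):
            # Continuation line of a multi-line header field such as DEFINITION.
            fields[last_key] += " " + line.strip()
    return fields


record = parse_record(RECORD)
print(record["LOCUS"], record["host"], record["CDS"])
```

The sketch keeps only the last value seen for each qualifier name, which is enough for a single-CDS excerpt; a record with several CDS features would need one dictionary per feature.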
Terms used to name CDS in NCBI
| Genera | CDS term NCBI | Varsani standard |
|---|---|---|
| Betasatellite | “beta” or “c1” | betaC1 |
| Alphasatellite | “alpha” or “rep” | alphaRep |
| Begomovirus | “bv1” or “nsp” or “nuclear shuttle” | NSP |
| Begomovirus | “bc1” or “bc2” or “mp” | MP |
| All genera | “c1” or “ac1” or “rep” or “al1” | Rep |
| All genera | “c2” or “ac2” or “trap” or “al2” or “transcription activator protein” | TrAP |
| All genera | “c3” or “ac3” or “ren” or “al3” | REn |
| All genera | “c4” or “ac4” or “al4” | sd/p.sd |
| All genera | “c5” or “ac5” | AC5 |
| All genera | “v1” or “av1” or “cp” or “ar1” or “capsid protein” or “coat protein” | CP |
| All genera | “v2” or “av2” or “pre-coat” or “precoat” or “ar2” | pre-coat |
| All genera | “v3” or “av3” | Reg |
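The table above can be read as a set of rules for standardizing CDS names. The following is a minimal sketch of such a lookup, not the rule engine used by geminivirus.org. It rests on two assumptions that the table does not state: terms are compared by exact match after lowercasing, and genus-specific rows take precedence over the "All genera" rows (which resolves overlaps such as “c1” and “rep”). The names `GENUS_RULES`, `GENERAL_RULES`, and `standardize_cds` are hypothetical.

```python
# Genus-specific rows of the table; checked first (assumed precedence).
GENUS_RULES = {
    "Betasatellite": {"betaC1": {"beta", "c1"}},
    "Alphasatellite": {"alphaRep": {"alpha", "rep"}},
    "Begomovirus": {
        "NSP": {"bv1", "nsp", "nuclear shuttle"},
        "MP": {"bc1", "bc2", "mp"},
    },
}

# Rows listed under "All genera"; used as the fallback.
GENERAL_RULES = {
    "Rep": {"c1", "ac1", "rep", "al1"},
    "TrAP": {"c2", "ac2", "trap", "al2", "transcription activator protein"},
    "REn": {"c3", "ac3", "ren", "al3"},
    "sd/p.sd": {"c4", "ac4", "al4"},
    "AC5": {"c5", "ac5"},
    "CP": {"v1", "av1", "cp", "ar1", "capsid protein", "coat protein"},
    "pre-coat": {"v2", "av2", "pre-coat", "precoat", "ar2"},
    "Reg": {"v3", "av3"},
}


def standardize_cds(term, genus):
    """Map an NCBI CDS term to its standard name, or None if no rule applies."""
    key = term.strip().lower()  # assumption: case-insensitive exact match
    for rules in (GENUS_RULES.get(genus, {}), GENERAL_RULES):
        for standard, aliases in rules.items():
            if key in aliases:
                return standard
    return None


print(standardize_cds("AC1", "Begomovirus"))
print(standardize_cds("C1", "Betasatellite"))
print(standardize_cds("coat protein", "Mastrevirus"))
```

Returning `None` for unmatched terms leaves room for a second stage, such as the ML-based ORF classification mentioned in the figure caption, to handle names the rules do not cover.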