Chaos game representation dataset of SARS-CoV-2 genome.

Raquel de M Barbosa, Marcelo A C Fernandes.

Abstract

As of April 16, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 142,000 deaths and more than 2,000,000 confirmed cases. In bioinformatics, one of the crucial tasks is the analysis of the virus nucleotide sequences using approaches such as data stream, digital signal processing, and machine learning techniques and algorithms. However, to make this approach feasible, the nucleotide sequence strings must first be transformed into a numerical representation. Thus, this dataset provides a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. It contains the CGR of 100 instances of the SARS-CoV-2 virus, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).
© 2020 The Author(s).

Keywords:  CGR; COVID-19; SARS-CoV-2

Year:  2020        PMID: 32341946      PMCID: PMC7182522          DOI: 10.1016/j.dib.2020.105618

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

These data are useful because they provide a numeric representation of the virus behind the COVID-2019 epidemic (SARS-CoV-2). In this form, the data can be processed with data stream, digital signal processing, and machine learning algorithms. Researchers in bioinformatics, computer science, and computer engineering can benefit from these data because the numeric representation lets them apply techniques such as machine learning and digital signal processing to genomic information. Experiments that apply clustering and classification techniques to SARS-CoV-2 genomic information can use this dataset. These data offer an easy way to evaluate the SARS-CoV-2 virus genome.

Data Description

This work presents a new dataset of a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. The dataset contains two kinds of data: the raw data and the processed data. The raw data comprise 100 instances of the SARS-CoV-2 virus genome collected from the National Center for Biotechnology Information (NCBI) [1], 11540 instances of other viruses from the Virus-Host DB [2], [3], and three other instances of Riboviria also collected from the NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21), which have high similarity with SARS-CoV-2 [4], [5].

The dataset provides all data in two groups of file formats. In the first group, all data are stored in Matlab file format (.mat). In the second group, part of the data is stored in Microsoft Excel files (.xlsx) and the rest in text files (.txt). The two groups contain the same information.

The data are organized into three main directories: “SARS-CoV-2 data”, “Virus-Host DB data” and “Other viruses data.” Each main directory contains two sub-directories: “Matlab” and “Excel and txt.”

Each “Matlab” sub-directory contains three files called “RawDataTable.mat”, “RawData.mat” and “CGRData.mat”. The “RawDataTable.mat” and “RawData.mat” files store the raw data information from the viruses database. They hold the same information, but in “RawDataTable.mat” the attributes are stored in Matlab table format (available from version R2013b onward), whereas in “RawData.mat” the attributes are stored as Matlab cell arrays. Each “CGRData.mat” file stores the CGR values of all viruses listed in the corresponding “RawDataTable.mat” and “RawData.mat” files. For the main directory “Virus-Host DB data”, the CGR values are stored in 10 files, where each i-th file is called “RawData_i.mat.”

Each “Excel and txt” sub-directory contains a file called “RawData.xlsx” and a further sub-directory called “CGRData”. Each “RawData.xlsx” file holds the raw data information from the viruses database, and each “CGRData” sub-directory holds the CGR of the viruses listed in the corresponding “RawData.xlsx” file. The CGR points associated with each virus are stored in a text file called “LocusName_COD.txt”, where COD is the code (locus name) associated with the virus in Genbank [6].
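The directory layout described above can be turned into a small path helper. The following is a minimal sketch, not part of the dataset: the function name `cgr_text_path`, the root folder name, and the locus code `XY000000` are hypothetical placeholders, and it assumes the text files sit directly inside each “CGRData” sub-directory.

```python
from pathlib import Path

# The three main directories named in the data description.
MAIN_DIRECTORIES = ("SARS-CoV-2 data", "Virus-Host DB data", "Other viruses data")


def cgr_text_path(root, main_directory, locus_name):
    """Build the expected path of the CGR text file for one virus.

    Assumes the layout <root>/<main directory>/Excel and txt/CGRData/
    LocusName_<COD>.txt, with the text files directly inside "CGRData".
    """
    if main_directory not in MAIN_DIRECTORIES:
        raise ValueError(f"unknown main directory: {main_directory!r}")
    file_name = f"LocusName_{locus_name}.txt"
    return Path(root) / main_directory / "Excel and txt" / "CGRData" / file_name


# Usage with a hypothetical root folder and a placeholder locus code.
path = cgr_text_path("dataset", "SARS-CoV-2 data", "XY000000")
print(path.name)  # LocusName_XY000000.txt
```

Keeping the directory names in one tuple makes it easy to reject a misspelled main directory before any file access is attempted.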

Experimental Design, Materials, and Methods

The Chaos Game Representation (CGR), proposed by H. Joel Jeffrey in [7], transforms a nucleotide sequence (DNA or RNA) into two-dimensional real values. The CGR preserves the statistical properties of the nucleotide sequence and allows the investigation of local and global patterns in sequences [8], [9]. The CGR takes as input the nucleotide sequence, $\mathbf{s}$, expressed as

$$\mathbf{s} = [s_1, \ldots, s_n, \ldots, s_N],$$

where $N$ is the length of the sequence and $s_n$ is the $n$-th nucleotide of the sequence. Each $n$-th nucleotide, $s_n$, is mapped to a two-dimensional symbol $(s_x(n), s_y(n))$ [mapping equations not available]. After the mapping, each $n$-th symbol $(s_x(n), s_y(n))$ is transformed into the CGR values $(p_x(n), p_y(n))$ [update equations and initial condition not available] [7], [8]. The dataset was generated with [parameter values not available]. Figures 1(a), 1(b), 1(c) and 1(d) show an example of CGR points $(p_x(n), p_y(n))$ from the dataset presented in this work.
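To illustrate the general idea, the following is a minimal sketch of the classic chaos game on the unit square: each nucleotide is assigned a corner, and each new point lies halfway between the previous point and the corner of the current nucleotide. The corner assignment, the starting point (0.5, 0.5), the halving factor, and the function name `cgr_points` are assumptions made for illustration; the symbol mapping and parameters used to generate this dataset may differ.

```python
# Corner assignment of the classic CGR on the unit square.
# This is an assumption for illustration; the dataset's own mapping may differ.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
    "U": (1.0, 0.0),  # RNA sequences: U takes the corner of T
}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points (p_x(n), p_y(n)) for a nucleotide string."""
    px, py = start
    points = []
    for nucleotide in sequence.upper():
        corner = CORNERS.get(nucleotide)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        # Move halfway from the previous point towards the nucleotide's corner.
        px = 0.5 * (px + corner[0])
        py = 0.5 * (py + corner[1])
        points.append((px, py))
    return points


print(cgr_points("ACGT"))
# [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.8125), (0.78125, 0.40625)]
```

Because each point depends on the whole prefix of the sequence, the resulting point cloud reflects both local and global patterns of the sequence.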
Fig. 1

Example of the CGR values for the SARS-CoV-2 virus stored in this dataset.

Specification Table

Subject: Biochemistry, Genetics and Molecular Biology (General)
Specific subject area: Bioinformatics
Type of data: Table, Number
How data were acquired: NCBI - Genbank - SARS-CoV2 (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/); Virus-Host-DB (https://www.genome.jp/virushostdb/); Matlab Software; Excel Software
Data format: Raw and analyzed data are in Matlab file (.mat), Microsoft Excel file (.xlsx), and text file (.txt).
Parameters for data collection: The entire dataset was generated using MATLAB 2019b on a Windows operating system with an Intel Core i5 6500T 2.5 GHz quad-core processor and 16 GB of RAM.
Description of data collection: The raw data were downloaded from NCBI - Genbank and Virus-Host-DB. The CGR values were generated using Matlab.
Data source location: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte.
Data accessibility: https://data.mendeley.com/datasets/nvk5bf3m2f/1
  5 in total

1.  Encoding and Decoding DNA Sequences by Integer Chaos Game Representation.

Authors:  Changchuan Yin
Journal:  J Comput Biol       Date:  2018-12-05       Impact factor: 1.479

2.  Chaos game representation of gene structure.

Authors:  H J Jeffrey
Journal:  Nucleic Acids Res       Date:  1990-04-25       Impact factor: 16.971

3.  Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.

Authors:  Tung Hoang; Changchuan Yin; Stephen S-T Yau
Journal:  Genomics       Date:  2016-08-15       Impact factor: 5.736

4.  Linking Virus Genomes with Host Taxonomy.

Authors:  Tomoko Mihara; Yosuke Nishimura; Yugo Shimizu; Hiroki Nishiyama; Genki Yoshikawa; Hideya Uehara; Pascal Hingamp; Susumu Goto; Hiroyuki Ogata
Journal:  Viruses       Date:  2016-03-01       Impact factor: 5.048

5.  Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.

Authors:  Gurjit S Randhawa; Maximillian P M Soltysiak; Hadi El Roz; Camila P E de Souza; Kathleen A Hill; Lila Kari
Journal:  PLoS One       Date:  2020-04-24       Impact factor: 3.240
