
GENSTYLE: exploration and analysis of DNA sequences with genomic signature.

Bernard Fertil, Matthieu Massin, Sylvain Lespinats, Caroline Devic, Philippe Dumee, Alain Giron.

Abstract

GENSTYLE (http://Genstyle.imed.jussieu.fr) is a workspace designed for the characterization and classification of nucleotide sequences. Based on the genomic signature paradigm, GENSTYLE focuses on oligonucleotide frequencies in DNA sequences. Users can select sequences of interest in the GENSTYLE companion database, where the whole set of GenBank sequences is grouped by species, or upload their own sequences to work with. Tools for the exploration and analysis of signatures allow (i) identification of the origin of DNA segments (detection of rare species or species for which technical problems prevent fast characterization, such as slow-growing micro-organisms), (ii) analysis of the homogeneity of a genome and isolation of areas with novel functionality (horizontal transfers, for example) and (iii) molecular phylogeny and taxonomy.

Year:  2005        PMID: 15980524      PMCID: PMC1160249          DOI: 10.1093/nar/gki489

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


GENOMIC SIGNATURE AND DNA STYLE

A great number of DNA sequences are now available from web-based databases. For example, DNA samples from >140 000 named organisms can be found in GenBank. The characteristics of these sequences have been extensively studied, and the extracted information is often interpreted in terms of evolution or systematic molecular biology. Many studies are devoted to the so-called metagenomic analysis of DNA sequences. One approach deals with the frequencies of short oligonucleotides. Karlin and Burge initially focused on dinucleotide relative abundance (1). It quickly became apparent that the set of oligonucleotide frequencies is species specific (2–6), and this set was subsequently considered to be a genomic signature. Studies based on the genomic signature are becoming increasingly popular (7,8). It has been observed that the genomic signature results from a species-specific ‘writing STYLE’ (4,9,10): on the one hand, the genomic signatures of species differ from one another and, on the other hand, the majority of genome segments within a species have comparable signatures. As a consequence, each species can be assigned a DNA style that can be derived from most of its available DNA fragments. The methodology that we have developed thus makes it possible to study and compare a great number of sequences and species, since the calculation of a signature on a laptop computer requires <1 s per million nucleotides. The genomic signature is visualized as a parametric image using the ‘chaos game representation’ algorithm (3,5,8,11–14). Our experience with genomic signatures shows that the comparison of four-letter word signatures offers a good trade-off between accuracy of classification, usual size of DNA fragments and computer load (9,15). In our hands, signatures are compared by means of the Euclidean metric in a space with 256 dimensions (there are 4^4 = 256 different 4-letter words).
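As an illustration of the signature and its comparison, here is a minimal Python sketch, not GENSTYLE's own code. It assumes overlapping words counted on a single strand, words containing ambiguity codes skipped, and frequencies normalized to sum to 1. The function names (`signature`, `euclidean`, `cgr_grid`) and the corner convention of the chaos-game layout are choices made for this example only.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def signature(seq, k=4):
    """Frequencies of all 4**k words of length k, in lexicographic order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    # Slide a window of width k along the sequence, one nucleotide at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: words with N or other codes are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(sig_a, sig_b):
    """Euclidean distance between two signatures of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def cgr_grid(sig, k=4):
    """Lay a signature out on a 2**k x 2**k grid, chaos-game style.

    Assumed corner convention (x, y): A=(0,0), C=(0,1), G=(1,1), T=(1,0).
    """
    corners = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    for word, freq in zip(product(ALPHABET, repeat=k), sig):
        x = y = 0
        # The last letter of a word selects the largest quadrant, so it
        # contributes the most significant bit of each coordinate.
        for bit, letter in enumerate(word):
            cx, cy = corners[letter]
            x |= cx << bit
            y |= cy << bit
        grid[y][x] = freq
    return grid
```

With k = 4, `signature` returns a 256-component vector and `cgr_grid` a 16 x 16 array that can be rendered as an image. Other normalizations or a two-strand count would be equally plausible choices.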
Of course, other methods for the comparison of signatures are available. They often provide slightly different results [see Refs (4,8,16,17) for some other measures of dissimilarity]. It must be pointed out that comparisons of DNA style do not require homologous sequences and almost any DNA segment is eligible (4,9). In fact, the species-specific DNA style concept motivates and justifies most of the studies dealing with the genomic signature, including, for example, assignment of genomic fragments (4,18), taxonomic/phylogenetic analyses (15,17,19) and detection of horizontal transfers (HTs) (20,21). Detection of HTs is a major application of the DNA style concept. Some of the abnormal patterns in a genome may be considered to result from HTs. Numerous methods relying on a gene's nucleotide or oligonucleotide composition for the detection of HTs are available (22–32). Among them, hidden Markov models (HMMs) and wavelet transforms are two efficient approaches in use for detecting and characterizing original motifs and patterns. Their performances have been subjected to extensive comparisons (20,21,31,32). Many other applications are emerging, such as the characterization of unknown sequences, the quality control of sequencing and pre-processing for homologous sequence screening. A web service has recently been made available for the comparison of tetranucleotide usage patterns in DNA sequences (33). It comes with pre-computed tetranucleotide usage patterns for 166 prokaryote chromosomes as a source for limited data mining.
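To make the idea of 'abnormal patterns in a genome' concrete, the following Python sketch scans a sequence with a sliding window and flags windows whose tetranucleotide signature lies unusually far from the whole-sequence signature. It is a simplified illustration under stated assumptions, not one of the published HT detection methods cited above: the window size, step and the 'mean + 2 standard deviations' threshold are arbitrary choices, and all function names are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def signature(seq):
    """Tetranucleotide frequencies (256 values), overlapping words, one strand."""
    counts = dict.fromkeys(WORDS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in WORDS]


def euclidean(sig_a, sig_b):
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def local_distances(genome, window=5000, step=2500):
    """Distance of each window's signature to the whole-sequence signature."""
    reference = signature(genome)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        local = signature(genome[start:start + window])
        scores.append((start, euclidean(local, reference)))
    return scores


def atypical_windows(genome, window=5000, step=2500, n_sd=2.0):
    """Start positions of windows farther than mean + n_sd * SD (assumed rule)."""
    scores = local_distances(genome, window, step)
    distances = [d for _, d in scores]
    threshold = mean(distances) + n_sd * pstdev(distances)
    return [start for start, d in scores if d > threshold]


# Toy example: a repetitive "host" with a 2000 nt insert of different style.
genome = "AC" * 10000 + "GT" * 1000 + "AC" * 10000
print(atypical_windows(genome, window=2000, step=2000))
```

On real genomes the threshold would need calibration, and regions such as rRNA operons can look atypical without being horizontally transferred.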

GENSTYLE

GENSTYLE is grounded in the genomic signature paradigm. It offers three sets of tools for the characterization and classification of nucleotide sequences. Parts of GENSTYLE were made accessible to the bioinformatics community through our site starting in 1999, after the publication of the seminal paper describing the concept and its usefulness (3). The current version results from a substantial redesign that developed into the GENSTYLE workspace. Three dedicated toolboxes have been implemented for collecting, selecting and processing sequences. The sequence analysis toolbox is made for
(i) Identification of the origin of short DNA fragments. Any DNA sequence is eligible for searching for its origin. This feature is useful, for example, for the recognition of rare and/or slow-growing organisms (whose sequences are usually hard to characterize).
(ii) Detection of ‘atypical’ areas in a genome, in particular the detection of HTs (and potential donors). The species closest to an atypical DNA segment (from the genomic signature point of view) give clues about the donor in the case of putative HTs (under implementation).
(iii) Building of taxonomic and phylogenetic trees. Distance between signatures remains to be established as a reference for phylogenetic studies, but several recent and interesting results have shown its potentially great value (15,17). In particular, our current work with coronaviruses is very promising in this respect.
There is a large genomic signature database behind GENSTYLE that greatly enhances its power and scope. The full set of GenBank sequences, stored by species, is available for signature studies. The GENSTYLE companion genomic signature database handles ∼170 000 species and unspecified organisms (>2 000 000 DNA sequences). It is updated on a regular basis, using the bimonthly releases issued by GenBank.

GENSTYLE WORKSPACE

GENSTYLE tools are available from within a user workspace. This makes it possible to work online on the whole set (or part of it) of GenBank nucleotide sequences belonging to one or several species. Users' own sequences can also be uploaded to work with. Tools for the exploration and analysis of signatures are straightforward and do not require much prior knowledge. Results are displayed in specific windows with images, tables and charts. Most of the outputs can be downloaded for further processing. The user's workspace can be saved for later use. There are three toolboxes in the GENSTYLE workspace:
Sequence collector toolbox. The sequence collector allows the workspace to be loaded with sequences of interest. Sequences can be selected through the GENSTYLE companion database browser. The user's sequences (FASTA format, optionally grouped into a single text file, zipped or not) can be uploaded through the uploader.
Sequence filters toolbox. Although many sequences can be uploaded to a given workspace, it may be interesting to work on selected subsets. Several tools are available for this task, including selection by DNA type and by sequence size.
Sequence analysis toolbox. The tools of this toolbox operate on the selected sequences of the workspace. They allow (i) visualization of sequence signatures; (ii) detailed examination of signatures (if several sequences are analysed together, a distance matrix (Euclidean metric, Phylip format) is provided to build taxonomic/phylogenetic trees) (8); (iii) searching for species with similar DNA signatures in the GENSTYLE companion database; and (iv) observation of similarities (and differences) between signatures by principal components analysis (PCA) (34).
Online versions of additional tools already in use in our lab are currently under development. They include navigation along genomes by means of local signatures (for HT detection, for example), visualization of similarities between local signatures along several genomes and taxonomic trees.
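The distance matrix in Phylip format mentioned for the sequence analysis toolbox can be sketched as follows. This Python example is an assumption-laden illustration rather than GENSTYLE's actual output routine: it uses tetranucleotide frequencies on a single strand, a square matrix layout with names padded to 10 characters, and hypothetical function names and sequence labels.

```python
from itertools import product
from math import sqrt

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def signature(seq):
    """Tetranucleotide frequencies (256 values), overlapping words, one strand."""
    counts = dict.fromkeys(WORDS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in WORDS]


def euclidean(sig_a, sig_b):
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def phylip_matrix(named_seqs):
    """Square Phylip-style distance matrix for a dict {name: sequence}."""
    names = list(named_seqs)
    sigs = [signature(named_seqs[name]) for name in names]
    lines = [f"{len(names):5d}"]  # first line: number of taxa
    for i, name in enumerate(names):
        row = " ".join(f"{euclidean(sigs[i], s):.6f}" for s in sigs)
        # Names are truncated or padded to 10 characters.
        lines.append(f"{name[:10]:<10} {row}")
    return "\n".join(lines)


# Hypothetical toy sequences, for illustration only.
print(phylip_matrix({"seqA": "AAAAA", "seqT": "TTTTT", "seqAT": "AAAAATTTTT"}))
```

A text file in this layout can then be given to a distance-based tree-building program, for example a neighbor-joining implementation.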

WORKING WITH GENSTYLE

A tutorial is available online. It demonstrates how the origin of a small DNA sequence can be searched for in the GENSTYLE companion database. Briefly, the sequence of interest has to be pasted into the appropriate field of the demonstrator tool (Figure 1A). The sequence signatures for oligonucleotides (words) 1–9 nt long are subsequently calculated, oligonucleotide counts are obtained (Figure 1A) and signatures are displayed (Figure 1B). Specific word counts and frequencies are available in popup windows (Figure 1C). Species with the closest signatures are then determined (Figure 1D). Distances to the sequence of interest are expressed in an arbitrary unit (AU). It can be seen that the sequence of interest belongs to the SARS virus (d = 11) and that the closest species are the PEDV and PTGV coronaviruses. Although this procedure seems to mimic BLAST/FASTA functions, it is quite different in nature. Similarities between sequences can be observed even when they are not homologous. As a consequence, the origin of a sequence can be determined as soon as enough DNA material to characterize the genomic signature of the species of origin is available (typically 2000 nt). Homologous DNA counterparts are not required in the database.
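The search for the closest species can be pictured as a nearest-neighbour lookup among stored signatures. The Python sketch below is a toy illustration, not the GENSTYLE implementation: the 'database' is a small dictionary of made-up sequences with hypothetical species names, only four-letter words are used, and distances are raw Euclidean distances rather than the arbitrary units shown in the tutorial.

```python
from itertools import product
from math import sqrt

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def signature(seq):
    """Tetranucleotide frequencies (256 values), overlapping words, one strand."""
    counts = dict.fromkeys(WORDS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in WORDS]


def euclidean(sig_a, sig_b):
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def closest_species(query, species_signatures, top=3):
    """Return the `top` (distance, species) pairs nearest to the query."""
    query_sig = signature(query)
    ranked = sorted(
        (euclidean(query_sig, sig), name)
        for name, sig in species_signatures.items()
    )
    return ranked[:top]


# Hypothetical reference signatures built from made-up sequences.
database = {
    "species_A": signature("AC" * 500),
    "species_B": signature("GT" * 500),
    "species_C": signature("AG" * 500),
}

# The query shares no aligned region requirement with species_A,
# only a similar word usage.
for distance, name in closest_species("CA" * 100, database):
    print(f"{name}\t{distance:.4f}")
```

Because only word frequencies are compared, the query does not need a homologous counterpart among the reference sequences, which is the point made in the paragraph above.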
Figure 1. An example from GENSTYLE's online tutorial.


1.  Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models.

Authors:  Pierre Nicolas; Laurent Bize; Florence Muri; Mark Hoebeke; François Rodolphe; S Dusko Ehrlich; Bernard Prum; Philippe Bessières
Journal:  Nucleic Acids Res       Date:  2002-03-15       Impact factor: 16.971

2.  Analysis of genomic sequences by Chaos Game Representation.

Authors:  J S Almeida; J A Carriço; A Maretzek; P A Noble; M Fletcher
Journal:  Bioinformatics       Date:  2001-05       Impact factor: 6.937

3.  Evolutionary implications of microbial genome tetranucleotide frequency biases.

Authors:  David T Pride; Richard J Meinersmann; Trudy M Wassenaar; Martin J Blaser
Journal:  Genome Res       Date:  2003-02       Impact factor: 9.043

4.  A genomic schism in birds revealed by phylogenetic analysis of DNA strings.

Authors:  Scott V Edwards; Bernard Fertil; Alain Giron; Patrick J Deschavanne
Journal:  Syst Biol       Date:  2002-08       Impact factor: 15.683

5.  How to interpret an anonymous bacterial genome: machine learning approach to gene identification.

Authors:  W S Hayes; M Borodovsky
Journal:  Genome Res       Date:  1998-11       Impact factor: 9.043

6.  Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.

Authors:  S Schbath; B Prum; E de Turckheim
Journal:  J Comput Biol       Date:  1995       Impact factor: 1.479

7.  Detection and characterization of horizontal transfers in prokaryotes using genomic signature.

Authors:  Christine Dufraigne; Bernard Fertil; Sylvain Lespinats; Alain Giron; Patrick Deschavanne
Journal:  Nucleic Acids Res       Date:  2005-01-13       Impact factor: 16.971

8.  A new computational method for the detection of horizontal gene transfer events.

Authors:  Aristotelis Tsirigos; Isidore Rigoutsos
Journal:  Nucleic Acids Res       Date:  2005-02-16       Impact factor: 16.971

9.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.

Authors:  Hanno Teeling; Jost Waldmann; Thierry Lombardot; Margarete Bauer; Frank Oliver Glöckner
Journal:  BMC Bioinformatics       Date:  2004-10-26       Impact factor: 3.169

10.  Pervasive properties of the genomic signature.

Authors:  Robert W Jernigan; Robert H Baran
Journal:  BMC Genomics       Date:  2002-08-09       Impact factor: 3.969
