Sequence Maneuverer: tool for sequence extraction from genomes.

Tayyaba Yasmin, Inayat Ur Rehman, Adnan Ahmad Ansari, Khurrum Liaqat, Muhammad Irfan Khan.

Abstract

The availability of genomic sequences for many organisms has created new challenges, particularly in genome analysis. Sequence extraction is a vital step, and many tools have been developed for it. These tools are publicly available but are limited in the kind of sequence they can extract, the length of the sequence to be extracted, their organism specificity, and their lack of a user-friendly interface. We have developed a Java-based software package with three modules that can be used independently or sequentially. The tool extracts sequences from large datasets in a few simple steps, and it can extract multiple sequences of any desired length from the genome of any organism. The results were cross-checked against published data. AVAILABILITY: URL 1: http://ww3.comsats.edu.pk/bio/ResearchProjects.aspx URL 2: http://ww3.comsats.edu.pk/bio/SequenceManeuverer.aspx.

Keywords:  Annotation; Bioinformatics Software; Biology and Genetics; Coding tools and Techniques

Year:  2012        PMID: 23275734      PMCID: PMC3532014          DOI: 10.6026/97320630081277

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

The DNA sequences of many organisms are available through different databases, and easy access to many sequenced genomes has accelerated research in bioinformatics [1]. Analysis of the coding [2] and noncoding [3] regions of some genomes has revealed part of the underlying biological message, but this is only the tip of the iceberg. Biological interpretation of noncoding sequences is more challenging because of their abundance and their nonspecific pattern of occurrence in genomes [4]. Genome-wide studies usually require the extraction of large DNA sequences from a given data set. DNA sequences are normally stored in specific formats in different databases, and users extract the relevant information according to their experimental objectives. The software tools available for DNA sequence extraction each have their own pros and cons, and no single tool fulfills all of a user's requirements at once [5-8]. Extracting coding or noncoding sequences from the chromosome files stored in a database is a basic and vital step in bioinformatics research. Sequence Maneuverer has been designed to address this problem: it takes a GenBank file as input and generates FASTA lines. These FASTA lines are then used as input to the sequence extractor, which extracts the corresponding sequences.

Methodology

Sequence Maneuverer consists of three modules: the annotator, the FASTA lines generator, and the sequence extractor. These modules can be used independently or in combination, depending on the user's objectives. The main interface is shown in Figure 1. The software is implemented in the Java programming language and requires a Java Virtual Machine (JVM) to run. The annotator deals with annotation files in GenBank format. The FASTA lines generator creates FASTA lines and writes them to a text file, which can be used as the input file for the sequence extractor. The sequence extractor then extracts the requested sequences.
Figure 1

The interface of Sequence Maneuverer.

FASTA Lines Generator:

The user can specify attributes such as the project name, the authors' names, and the project details. This information is stored in a separate file named Project Details.txt. The “Browse” button selects a GenBank-formatted input file, and clicking the “generate” button produces a FASTA file named “FASTAz.txt”. This text file contains the FASTA lines for the annotation file of whichever chromosome the user specified as input.
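The conversion from GenBank annotation to FASTA lines can be illustrated with a minimal sketch. The exact layout of a FASTA line produced by the tool is not specified here, so the sketch assumes a header of the form `>name|start|end|strand` built from simple `gene` features; the class name `FastaLineSketch`, the header layout, and the sample gene names and coordinates are all hypothetical. Compound locations such as `join(...)` are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the tool's actual implementation or output format.
class FastaLineSketch {
    // Simple gene locations: "gene  100..250" or "gene  complement(300..420)".
    private static final Pattern GENE = Pattern.compile(
        "\\s{5}gene\\s+(complement\\()?<?(\\d+)\\.\\.>?(\\d+)\\)?\\s*");
    // A /gene or /locus_tag qualifier line supplies the name.
    private static final Pattern NAME = Pattern.compile(
        "\\s+/(?:gene|locus_tag)=\"([^\"]+)\".*");

    static List<String> fastaLines(List<String> genBankLines) {
        List<String> out = new ArrayList<>();
        String pending = null; // coordinates of a gene still waiting for its name
        for (String line : genBankLines) {
            // A feature key starts in column 6; it closes any unnamed gene.
            boolean newFeature = line.matches(" {5}\\S.*");
            if (newFeature && pending != null) {
                out.add(">unnamed" + pending);
                pending = null;
            }
            Matcher gene = GENE.matcher(line);
            if (gene.matches()) {
                char strand = gene.group(1) != null ? '-' : '+';
                pending = "|" + gene.group(2) + "|" + gene.group(3) + "|" + strand;
                continue;
            }
            Matcher name = NAME.matcher(line);
            if (pending != null && name.matches()) {
                out.add(">" + name.group(1) + pending);
                pending = null;
            }
        }
        if (pending != null) out.add(">unnamed" + pending);
        return out;
    }

    public static void main(String[] args) {
        // Made-up feature table lines, for illustration only.
        List<String> sample = List.of(
            "     gene            100..250",
            "                     /gene=\"geneA\"",
            "     gene            complement(300..420)",
            "                     /gene=\"geneB\"");
        fastaLines(sample).forEach(System.out::println);
    }
}
```

In this sketch each generated line carries everything an extractor would need (name, coordinates, strand), which is one plausible reason for using FASTA lines as the hand-off between the two modules.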

Sequence Extractor:

The sequence extractor takes the FASTA lines and the chromosome files chosen by the user and extracts the dataset for sequence analysis. Currently, there is a shortage of publicly available stand-alone applications that use FASTA lines to extract sequences upstream or downstream of the transcription start site (TSS) or coding DNA sequences (CDS). To address this gap, a desktop application with a user-friendly interface has been developed. Its efficiency is evident from its fast extraction process, which avoids RAM-intensive file loading operations. Table 1 (see supplementary material) shows the system specification, extraction time, and other details. The workflow of the sequence extractor is shown in Figure 2.
Figure 2

Work flow of Sequence extractor.
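How the sequence extractor avoids RAM-intensive file loading is not described in the text. One common approach, sketched below as an assumption rather than the tool's actual method, is to stream the chromosome file line by line and keep only the bases that fall inside the requested range. The class name `StreamingExtractSketch` and the 1-based inclusive coordinate convention are hypothetical choices for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: streams a FASTA file instead of loading it whole.
class StreamingExtractSketch {
    /** Returns bases start..end (1-based, inclusive) of the first FASTA record. */
    static String extract(Reader fasta, long start, long end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("bad range");
        }
        StringBuilder out = new StringBuilder();
        long pos = 0;               // number of bases consumed so far
        boolean seenHeader = false;
        try (BufferedReader in = new BufferedReader(fasta)) {
            String line;
            while (pos < end && (line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (seenHeader || pos > 0) break; // only the first record
                    seenHeader = true;
                    continue;
                }
                line = line.trim();
                long lineStart = pos + 1;             // coordinate of first base on this line
                long lineEnd = pos + line.length();   // coordinate of last base on this line
                if (lineEnd >= start) {
                    int from = (int) (Math.max(start, lineStart) - lineStart);
                    int to = (int) (Math.min(end, lineEnd) - lineStart) + 1;
                    out.append(line, from, to);       // copy only the overlapping part
                }
                pos = lineEnd;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up two-line record; a real run would pass a file reader instead.
        String fasta = ">chr\nACGTAC\nGTTTGA\n";
        System.out.println(extract(new StringReader(fasta), 5, 8));
    }
}
```

Memory use in this sketch is bounded by one line of the file plus the extracted window, regardless of chromosome size. Reading stops as soon as the end coordinate has been passed.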

Validation

The output generated by this program was tested manually by checking TSS locations and their sequences in the Arabidopsis thaliana genome. The extraction window was about 200 nucleotides upstream and 50 nucleotides downstream of the TSS, so the output was a set of sequences 251 nucleotides long (TSS at +1). Furthermore, comparison with publicly available datasets showed that our results substantially matched published data (http://linux1.softberry.com/data/plantprom/Links/PLPR_predicted_ATceres.seq). In addition to the validation of promoter sequences, the CDS results of this software also matched the results obtained from NCBI.
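The 251-nucleotide length follows from 200 upstream bases, the TSS itself, and 50 downstream bases. The sketch below works through that arithmetic, assuming 1-based inclusive genomic coordinates. The strand handling (mirroring the window and reverse-complementing for minus-strand genes) is an assumption that the text does not state, and the class name `PromoterWindowSketch` and the example TSS position are hypothetical.

```java
// Hypothetical sketch of the window arithmetic around a TSS.
class PromoterWindowSketch {
    /** Returns {start, end} (1-based, inclusive) of the window around a TSS. */
    static long[] window(long tss, boolean minusStrand, int upstream, int downstream) {
        // On the minus strand, "upstream" lies at higher genomic coordinates.
        long start = minusStrand ? tss - downstream : tss - upstream;
        long end = minusStrand ? tss + upstream : tss + downstream;
        return new long[] { Math.max(1, start), end }; // clip at the chromosome start
    }

    /** Reverse complement, for reporting minus-strand windows 5' to 3'. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (Character.toUpperCase(seq.charAt(i))) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default:  sb.append('N'); // ambiguous or unknown base
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long[] w = window(1000, false, 200, 50); // made-up TSS position
        System.out.println(w[0] + ".." + w[1] + " length " + (w[1] - w[0] + 1));
    }
}
```

For a plus-strand TSS at position 1000 the window is 800..1050, which spans 251 positions; for a minus-strand TSS at the same position it would be 950..1200 in this sketch.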

Utility

Sequence Maneuverer efficiently extracts sequences of any desired length from the genome of any organism. Multiple sequences can be handled or manipulated simultaneously, and any raw sequence can be converted into GenBank format using the annotator.
  7 in total

Review 1.  Annotating non-coding regions of the genome.

Authors:  Roger P Alexander; Gang Fang; Joel Rozowsky; Michael Snyder; Mark B Gerstein
Journal:  Nat Rev Genet       Date:  2010-07-13       Impact factor: 53.242

2.  Phylogenetic analysis of 5'-noncoding regions from the ABA-responsive rab16/17 gene family of sorghum, maize and rice provides insight into the composition, organization and function of cis-regulatory modules.

Authors:  Christina D Buchanan; Patricia E Klein; John E Mullet
Journal:  Genetics       Date:  2004-11       Impact factor: 4.562

3.  Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

Authors:  Changchuan Yin; Stephen S-T Yau
Journal:  J Theor Biol       Date:  2007-04-10       Impact factor: 2.691

4.  PromoSer: A large-scale mammalian promoter and transcription start site identification service.

Authors:  Anason S Halees; Dmitriy Leyfer; Zhiping Weng
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

5.  PPDB, the Plant Proteomics Database at Cornell.

Authors:  Qi Sun; Boris Zybailov; Wojciech Majeran; Giulia Friso; Paul Dominic B Olinares; Klaas J van Wijk
Journal:  Nucleic Acids Res       Date:  2008-10-02       Impact factor: 16.971

6.  Compressing DNA sequence databases with coil.

Authors:  W Timothy J White; Michael D Hendy
Journal:  BMC Bioinformatics       Date:  2008-05-20       Impact factor: 3.169

7.  RSAT: regulatory sequence analysis tools.

Authors:  Morgane Thomas-Chollier; Olivier Sand; Jean-Valéry Turatsinze; Rekin's Janky; Matthieu Defrance; Eric Vervisch; Sylvain Brohée; Jacques van Helden
Journal:  Nucleic Acids Res       Date:  2008-05-21       Impact factor: 16.971
