
gff2sequence, a new user friendly tool for the generation of genomic sequences.

Abstract

BACKGROUND: General Feature Format (GFF) files are used to store genome features such as genes, exons, introns and primary transcripts. Although many software packages (e.g. ab initio gene prediction programs) can annotate features using this standard, only a small number of tools have been developed to extract the corresponding sequence information from the original genome. Moreover, the existing tools provide neither quality control nor customizable filtering of the annotated features.
FINDINGS: gff2sequence is a program that extracts nucleotide/protein sequences from a genomic multifasta file by using the information provided by a General Feature Format file. While a graphical user interface makes the software very easy to use, its C++ implementation allows high performance together with low hardware demand. The software also allows the extraction of genic portions such as the untranslated and coding sequences. Moreover, a highly customizable quality control pipeline can be used to deal with anomalous splicing sites, incorrect open reading frames and non-canonical characters within the retrieved sequences.
CONCLUSIONS: gff2sequence is a user-friendly program that allows the generation of highly customizable sequence datasets by processing a General Feature Format file. The presence of a wide range of quality filters also makes this tool suitable for refining ab initio gene predictions.

Entities: Chemical Species

Year: 2013 PMID: 24020993 PMCID: PMC3848729 DOI: 10.1186/1756-0381-6-15

Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522

Background

The advent of next generation sequencing, together with the organization of several genome projects, has made genome sequencing an affordable task for many organisms. Many gene prediction programs allow the identification of genes within a new genome [1], and the General Feature Format (GFF, proposed by Durbin and Haussler, http://www.sanger.ac.uk/software/gff/) is often chosen for storing the resulting data [2]. The GFF format reports genomic features in single-line records that provide information such as feature type and position. Several tools have already been developed to deal with GFF files (refer also to the above URL). Programs such as BEDtools [3], readseq [4] and gff-ex (http://bioinfo.icgeb.res.in/gff/) can process such files, although their usage requires previous experience with a command line interface. Galaxy [5-7] is an easy-to-use alternative featuring a graphical user interface, available both as a stand-alone version and as a web application. Although extremely versatile, these programs have limitations in dealing with annotation data. For example, gene regions are not straightforwardly reconstructed; instead, separate intron and exon sequences are generated. EMBOSS [8], together with its graphical user interface (GUI) version Jemboss [9], can also manage annotation files and convert them into a large number of formats. However, all these programs lack a downstream quality control of the output data (e.g. checking for anomalous characters within the nucleotide sequences, or for canonical splicing sites in introns). Additionally, dealing with annotation data may be a non-trivial task because of the heterogeneity in the way a GFF file can be written. Indeed, the association between the initial and final positions of a feature and its direction (5’ -> 3’ or vice versa), the presence of several splicing variants for the same gene, and the possible occurrence of overlapping features should all be considered. This may require an intense scripting effort that can rarely be made by a user with basic informatics skills. Here we present gff2sequence, an open-source program that allows the extraction of gene features from an annotation file while applying several quality filters and maintaining a user-friendly graphical environment.
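To make the single-line record layout concrete, the following is a minimal Python sketch of reading one GFF record. It is not taken from gff2sequence (which is written in C++); the names `GffRecord` and `parse_gff_line` are hypothetical, and the sketch assumes nine tab-separated columns with 1-based, inclusive coordinates. The coordinate swap is an assumed way of coping with files that write the start and end positions in strand order.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One annotated feature (hypothetical container, for illustration only)."""
    seqid: str
    source: str
    feature: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    strand: str      # "+", "-" or "."
    attributes: str  # raw ninth column, left unparsed here


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one tab-separated record; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqid, source, feature, start, end, _score, strand, _phase, attrs = fields
    start_i, end_i = int(start), int(end)
    if start_i > end_i:
        # Assumption: some files list coordinates in strand order,
        # so normalize to start <= end and rely on the strand column.
        start_i, end_i = end_i, start_i
    return GffRecord(seqid, source, feature, start_i, end_i, strand, attrs)
```

Keeping the strand separate from the normalized coordinates means that any later extraction step only has to decide once whether to reverse-complement the sequence.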

Implementation

C++ was used to implement both the main algorithm of gff2sequence and its graphical user interface, which relies on the Qt library (qt-project.org).

Results and discussion

gff2sequence (Figure 1) takes as input a GFF (or GTF) annotation file and the corresponding multifasta genome file, and generates the sequences of many genic and intergenic features (e.g. untranslated regions, coding sequences, proteins, introns, exons, genes, transcripts and down/upstream sequences). While the software was designed to work with gene annotation data, it can also be used to extract generic features from any multifasta nucleotide sequence (see documentation for details).
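The general idea of combining annotation coordinates with a multifasta file can be sketched as follows. This is an illustrative Python example under stated assumptions, not the gff2sequence implementation: the function names are hypothetical, segments are assumed to be 1-based inclusive (start, end) pairs belonging to the same transcript, and only the characters A, C, G, T and N are complemented.

```python
from typing import Dict, Iterable, List, Tuple

# Complement table for plain nucleotides; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_multifasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier (first word of the header) to its sequence."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_spliced(genome: Dict[str, str], seqid: str,
                    segments: List[Tuple[int, int]], strand: str) -> str:
    """Join segments in genomic order, then reverse-complement minus-strand features."""
    chrom = genome[seqid]
    joined = "".join(chrom[start - 1:end] for start, end in sorted(segments))
    return reverse_complement(joined) if strand == "-" else joined
```

For instance, passing the exon coordinates of one transcript as `segments` would yield its spliced sequence, while a single (start, end) pair would yield an unspliced feature such as a gene or an intron.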

Figure 1

gff2sequence graphical user interface.

Many parameters can be set in order to filter the output sequences. A general quality control can be used to select only sequences that contain no special characters (e.g. N for incomplete assembly, X for masking, etc.) or that exceed a user-defined length. Coding sequences can be tested for the presence of a proper start codon (ATG), for the occurrence of a canonical stop codon (TGA, TAA or TAG), and for the presence of full codons (i.e. the total number of nucleotides is divisible by 3). These checks are automatically selected when the CDS translation is performed (the standard genetic code is used for this task). Finally, introns may be filtered by specifying the splicing signals.

gff2sequence generates three output files when the intergenic sequence information is gathered: (a) a multifasta file reporting the nucleotide sequences, (b) a list of gene pairs that are adjacent (i.e. with no intergenic sequence between them) and (c) a list of gene pairs that are overlapping (i.e. genes that share a portion of DNA).

The software was tested on GFF files available from a number of curators and performed smoothly for the following species (see Supplementary file “inputFileTested.pdf” for a complete list of the URLs used): Arabidopsis thaliana [10], Vitis vinifera [11] (Phytozome [12]), Solanum lycopersicum (Sol Genomics Network at http://solgenomics.net/ [13]), Oryza sativa [14] (Rice genome annotation project at http://rice.plantbiology.msu.edu/), Zea mays [15] (http://www.maizesequence.org/). The main algorithm allows fast computations without being too demanding on hardware. As an example, a full analysis (i.e. with all quality filters selected) of the large Zea mays genome (around 2500 Mbp) was performed in 11 minutes and required between 2.5 and 3 gigabytes of memory on an Intel® Core™ i5 CPU M 430 working at 2.27 GHz. This analysis revealed 2920 anomalous coding sequences and 928 overlapping genes.
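The coding-sequence and intron checks described in this section can be illustrated with a short Python sketch. It is a simplified reading of the filters, not the program's actual code: the function names are hypothetical, the minimum-length parameter and the default GT/AG splicing signals are assumptions chosen for the example, and in-frame stop codons are not examined.

```python
import re
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str, min_length: int = 0) -> List[str]:
    """Return the list of failed checks for one coding sequence (empty if it passes)."""
    cds = cds.upper()
    problems = []
    if re.search(r"[^ACGT]", cds):
        # Covers N (incomplete assembly), X (masking) and any other symbol.
        problems.append("non-canonical characters")
    if len(cds) <= min_length:
        problems.append("does not exceed minimum length")
    if len(cds) % 3 != 0:
        problems.append("length not divisible by 3")
    if not cds.startswith("ATG"):
        problems.append("missing ATG start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("missing canonical stop codon")
    return problems


def intron_has_signals(intron: str, donor: str = "GT", acceptor: str = "AG") -> bool:
    """Check that an intron starts and ends with the requested splicing signals."""
    intron = intron.upper()
    if len(intron) < len(donor) + len(acceptor):
        return False
    return intron.startswith(donor) and intron.endswith(acceptor)
```

Returning a list of failed checks rather than a single pass/fail flag makes it possible to report why a sequence was discarded, which is useful when the filters are used to refine gene predictions.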

Conclusions

We believe gff2sequence may represent a valuable and easy-to-use alternative for the generation of customized sequence datasets from a General Feature Format file. Moreover, the identification of anomalous coding sequences, overlapping genes or non-canonical splicing sites may help in refining automatic gene predictions.

Availability and requirements

Project name: gff2sequence (version 0.1)
Project home page: http://sourceforge.net/projects/gff2sequence/
Operating system: Linux 64-bit
Programming language: C++
Other requirements: Qt library installed
License: GNU GPL
Long term support: Software support will be given for at least one year after release. Any bug will be analyzed and the software corrected. Each bug correction and/or software improvement will be followed by a new version release.

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

SC was involved in the design and realization of the software. AP contributed to the project conception and participated in drafting the manuscript. Both authors read and approved the final manuscript.

14 in total

1. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

2. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

3. The B73 maize genome: complexity, diversity, and dynamics.

Authors: Patrick S Schnable; Doreen Ware; Robert S Fulton; Joshua C Stein; Fusheng Wei; Shiran Pasternak; Chengzhi Liang; Jianwei Zhang; Lucinda Fulton; Tina A Graves; Patrick Minx; Amy Denise Reily; Laura Courtney; Scott S Kruchowski; Chad Tomlinson; Cindy Strong; Kim Delehaunty; Catrina Fronick; Bill Courtney; Susan M Rock; Eddie Belter; Feiyu Du; Kyung Kim; Rachel M Abbott; Marc Cotton; Andy Levy; Pamela Marchetto; Kerri Ochoa; Stephanie M Jackson; Barbara Gillam; Weizu Chen; Le Yan; Jamey Higginbotham; Marco Cardenas; Jason Waligorski; Elizabeth Applebaum; Lindsey Phelps; Jason Falcone; Krishna Kanchi; Thynn Thane; Adam Scimone; Nay Thane; Jessica Henke; Tom Wang; Jessica Ruppert; Neha Shah; Kelsi Rotter; Jennifer Hodges; Elizabeth Ingenthron; Matt Cordes; Sara Kohlberg; Jennifer Sgro; Brandon Delgado; Kelly Mead; Asif Chinwalla; Shawn Leonard; Kevin Crouse; Kristi Collura; Dave Kudrna; Jennifer Currie; Ruifeng He; Angelina Angelova; Shanmugam Rajasekar; Teri Mueller; Rene Lomeli; Gabriel Scara; Ara Ko; Krista Delaney; Marina Wissotski; Georgina Lopez; David Campos; Michele Braidotti; Elizabeth Ashley; Wolfgang Golser; HyeRan Kim; Seunghee Lee; Jinke Lin; Zeljko Dujmic; Woojin Kim; Jayson Talag; Andrea Zuccolo; Chuanzhu Fan; Aswathy Sebastian; Melissa Kramer; Lori Spiegel; Lidia Nascimento; Theresa Zutavern; Beth Miller; Claude Ambroise; Stephanie Muller; Will Spooner; Apurva Narechania; Liya Ren; Sharon Wei; Sunita Kumari; Ben Faga; Michael J Levy; Linda McMahan; Peter Van Buren; Matthew W Vaughn; Kai Ying; Cheng-Ting Yeh; Scott J Emrich; Yi Jia; Ananth Kalyanaraman; An-Ping Hsia; W Brad Barbazuk; Regina S Baucom; Thomas P Brutnell; Nicholas C Carpita; Cristian Chaparro; Jer-Ming Chia; Jean-Marc Deragon; James C Estill; Yan Fu; Jeffrey A Jeddeloh; Yujun Han; Hyeran Lee; Pinghua Li; Damon R Lisch; Sanzhen Liu; Zhijie Liu; Dawn Holligan Nagel; Maureen C McCann; Phillip SanMiguel; Alan M Myers; Dan Nettleton; John Nguyen; Bryan W Penning; Lalit Ponnala; Kevin L 
Schneider; David C Schwartz; Anupma Sharma; Carol Soderlund; Nathan M Springer; Qi Sun; Hao Wang; Michael Waterman; Richard Westerman; Thomas K Wolfgruber; Lixing Yang; Yeisoo Yu; Lifang Zhang; Shiguo Zhou; Qihui Zhu; Jeffrey L Bennetzen; R Kelly Dawe; Jiming Jiang; Ning Jiang; Gernot G Presting; Susan R Wessler; Srinivas Aluru; Robert A Martienssen; Sandra W Clifton; W Richard McCombie; Rod A Wing; Richard K Wilson
Journal: Science Date: 2009-11-20 Impact factor: 47.728

4. Gene prediction based on DNA spectral analysis: a literature review.

Authors: Sajid A Marhon; Stefan C Kremer
Journal: J Comput Biol Date: 2011-03-07 Impact factor: 1.479

5. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

6. Galaxy: a web-based genome analysis tool for experimentalists.

Authors: Daniel Blankenberg; Gregory Von Kuster; Nathaniel Coraor; Guruprasad Ananda; Ross Lazarus; Mary Mangan; Anton Nekrutenko; James Taylor
Journal: Curr Protoc Mol Biol Date: 2010-01

7. The Sol Genomics Network (solgenomics.net): growing tomatoes using Perl.

Authors: Aureliano Bombarely; Naama Menda; Isaak Y Tecle; Robert M Buels; Susan Strickler; Thomas Fischer-York; Anuradha Pujar; Jonathan Leto; Joseph Gosselin; Lukas A Mueller
Journal: Nucleic Acids Res Date: 2010-10-08 Impact factor: 16.971

8. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

9. Phytozome: a comparative platform for green plant genomics.

Authors: David M Goodstein; Shengqiang Shu; Russell Howson; Rochak Neupane; Richard D Hayes; Joni Fazo; Therese Mitros; William Dirks; Uffe Hellsten; Nicholas Putnam; Daniel S Rokhsar
Journal: Nucleic Acids Res Date: 2011-11-22 Impact factor: 16.971

10. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions.

Authors: Jonathan E Allen; William H Majoros; Mihaela Pertea; Steven L Salzberg
Journal: Genome Biol Date: 2006-08-07 Impact factor: 13.583

8 in total

1. Network analysis of pseudogene-gene relationships: from pseudogene evolution to their functional potentials.

Authors: Travis S Johnson; Sihong Li; Jonathan R Kho; Kun Huang; Yan Zhang
Journal: Pac Symp Biocomput Date: 2018

2. Living apart together: crosstalk between the core and supernumerary genomes in a fungal plant pathogen.

Authors: Adriaan Vanheule; Kris Audenaert; Sven Warris; Henri van de Geest; Elio Schijlen; Monica Höfte; Sarah De Saeger; Geert Haesaert; Cees Waalwijk; Theo van der Lee
Journal: BMC Genomics Date: 2016-08-23 Impact factor: 3.969

3. Sequence Polymorphisms and Structural Variations among Four Grapevine (Vitis vinifera L.) Cultivars Representing Sardinian Agriculture.

Authors: Luca Mercenaro; Giovanni Nieddu; Andrea Porceddu; Mario Pezzotti; Salvatore Camiolo
Journal: Front Plant Sci Date: 2017-07-20 Impact factor: 5.753

4. The Evolutionary Basis of Translational Accuracy in Plants.

Authors: Salvatore Camiolo; Gaurav Sablok; Andrea Porceddu
Journal: G3 (Bethesda) Date: 2017-07-05 Impact factor: 3.154

5. Genome-Wide Identification of Histone Modification Gene Families in the Model Legume Medicago truncatula and Their Expression Analysis in Nodules.

Authors: Loredana Lopez; Giorgio Perrella; Ornella Calderini; Andrea Porceddu; Francesco Panara
Journal: Plants (Basel) Date: 2022-01-26