Ying Huang1, Shi-Yi Chen2, Feilong Deng2.
Abstract
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. Ab initio gene prediction has two main aspects: the computed values that describe sequence features, and the algorithm used to train the discriminant function; various bioinformatic tools employ different combinations of the two. Herein, we briefly review the well-characterized sequence features of eukaryotic genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
Keywords: Ab initio gene prediction; Compositional properties; Eukaryotes; Functional signals; Sequence features
Year: 2016 PMID: 27536341 PMCID: PMC4975701 DOI: 10.1016/j.csbj.2016.07.002
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Architecture of eukaryotic genomes. A total of 32 representative species are included to compare genome size, GC content, and the respective proportions of intergenic regions (IG), exons and introns. In brief, all five indices were derived from the annotation of each reference genome (in GFF format) downloaded from NCBI (March 2016), using in-house scripts written in Python. Additionally, a screenshot of the NCBI taxonomic tree shows the phylogenetic relationships among species, labeled with their full Latin scientific names.
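The in-house scripts behind Fig. 1 are not shown here, so the following is only a minimal sketch of how such indices might be computed. It assumes a tab-separated GFF file with `gene` and `exon` feature types, approximates intronic sequence as genic span minus exonic span, and treats everything outside gene spans as intergenic. The function names (`gc_content`, `merged_length`, `region_proportions`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def gc_content(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (N and others ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def region_proportions(gff_lines, genome_size):
    """Approximate exon, intron and intergenic fractions of a genome.

    Assumption: introns = gene span minus exons; intergenic = rest of genome.
    """
    spans = defaultdict(list)  # (feature type, seqid) -> list of (start, end)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue
        seqid, ftype = cols[0], cols[2]
        if ftype in ("gene", "exon"):
            spans[(ftype, seqid)].append((int(cols[3]), int(cols[4])))
    genic = sum(merged_length(v) for (t, _), v in spans.items() if t == "gene")
    exonic = sum(merged_length(v) for (t, _), v in spans.items() if t == "exon")
    return {
        "exon": exonic / genome_size,
        "intron": (genic - exonic) / genome_size,
        "intergenic": (genome_size - genic) / genome_size,
    }
```

Merging overlapping intervals before summing avoids double-counting bases shared by alternative transcripts, which would otherwise inflate the exonic fraction.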
Fig. 2. Schematic illustration of the main sequence features.
Fig. 3. Base composition observed at the three positions of coding codons or noncoding triplets. This analysis is based on 50,909 human reference sequences of mRNA and lincRNA. (a) The overall frequencies of nucleotides A, T, C and G across the three positions, computed over the entire sequence. (b) The relative frequencies of the four nucleotides at each position. For the non-coding sequences, the three-periodic nucleotide usage was calculated from an arbitrarily selected start position.
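The position-specific composition in panel (b) of Fig. 3 can be illustrated with a short sketch. This is not the authors' code: the function name `triplet_position_frequencies` and its `start` parameter are assumptions made for illustration. For a coding sequence, `start` would be the first base of the start codon; for a non-coding sequence it is an arbitrary offset, as the caption describes.

```python
from collections import Counter


def triplet_position_frequencies(seq, start=0):
    """Relative A/C/G/T frequencies at each of the three triplet positions.

    Returns a list of three dicts (positions 1, 2, 3). Trailing bases that do
    not complete a triplet are ignored, as are ambiguous bases such as N.
    """
    seq = seq.upper().replace("U", "T")
    # Drop an incomplete triplet at the 3' end, if any.
    usable_end = len(seq) - (len(seq) - start) % 3
    counts = [Counter(), Counter(), Counter()]
    for i in range(start, usable_end):
        base = seq[i]
        if base in "ACGT":
            counts[(i - start) % 3][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs
```

Applied to many coding sequences, the three dictionaries would be expected to differ from one another, reflecting the three-base periodicity of codons, whereas for non-coding sequences they should be roughly similar regardless of the chosen `start`.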
Summary of the selected tools for ab initio gene prediction in eukaryotes.
| Tool | Year | Main sequence features | Algorithm |
|---|---|---|---|
| GeneID | 1992 | Splice sites; Start and stop codons | Rule-based system |
| GeneParser | 1993 | Splice sites; Codon usage; Compositional complexity; Hexamer frequency; Length distribution; Periodic asymmetry | Dynamic programming |
| GENSCAN | 1997 | Coding signals; Length distributions and compositional features of exons, introns and intergenic regions | Generalized hidden Markov model |
| HMMgene | 1997 | Coding, noncoding, and intergenic sequences | Hidden Markov model |
| Fgenesh | 2000 | Splice sites; Start and stop codons | Hidden Markov model |
| AUGUSTUS | 2005 | Sequences around splice sites, start and stop codons, and coding and non-coding regions; Length of exons, introns and intergenic regions | Generalized hidden Markov model |
Note: only actively cited tools are included, without subjective preference.