Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes.

Katelyn McNair, Carol L Ecale Zhou, Brian Souza, Stephanie Malfatti, Robert A Edwards.

Abstract

One of the main steps in prokaryotic gene-finding is determining which open reading frames (ORFs) encode a protein and which occur by chance alone. There are many methods for differentiating the two; the most prevalent relies on homology with a database of known genes. This approach has many pitfalls, most notably that it can only find genes that have been seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, and the latter three allow the training model to be created ab initio from the input genome. Different methods are available for creating the training model. To increase the accuracy of such tools, we present GOODORFS, a method for identifying protein-coding genes within the set of all possible ORFs. Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP) from them, clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
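
The workflow described in the abstract can be sketched roughly as follows. This is an illustrative toy in plain Python, not the GOODORFS implementation. Several details are assumptions, because the abstract does not specify them: the EDP formula (taken here to be s_i = -p_i log p_i / H, with H the Shannon entropy of the amino acid frequencies), the number of clusters `k`, the use of mean squared distance to the centroid as the measure of "variation", and every function name. The small `kmeans` helper is a stand-in for a library KMeans implementation.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_frequencies(protein):
    """Relative frequency of each standard amino acid in a translated ORF."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in sequence")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def entropy_density_profile(freqs):
    """Assumed EDP: each residue's share of the Shannon entropy H."""
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type present: the profile is undefined, use zeros.
        return [0.0] * len(freqs)
    return [t / entropy for t in terms]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(points):
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, iterations=50, seed=0):
    """Minimal Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

def lowest_variation_cluster(points, labels):
    """Label of the cluster with the smallest mean squared distance to its centroid."""
    best_label, best_variation = None, float("inf")
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        center = mean_vector(members)
        variation = sum(squared_distance(p, center) for p in members) / len(members)
        if variation < best_variation:
            best_label, best_variation = label, variation
    return best_label

def select_coding_orfs(proteins, k=2):
    """Indices of the ORFs that fall in the lowest-variation EDP cluster."""
    edps = [entropy_density_profile(amino_acid_frequencies(p)) for p in proteins]
    labels = kmeans(edps, k)
    best = lowest_variation_cluster(edps, labels)
    return [i for i, lab in enumerate(labels) if lab == best]

if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    toy = ["MKKLAVELGADE", "MKRLAEELGVDA", "MSSSSPPPPWWW", "MCCCCHHHHWWY"]
    print(select_coding_orfs(toy, k=2))
```

Note that under this definition of variation a cluster containing a single ORF scores zero and would always be selected, so a practical version would need a minimum cluster size or a different spread measure.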

Keywords:  annotation; clustering; gene; genome; machine learning; phage; prediction

Year:  2021        PMID: 33429904      PMCID: PMC7827183          DOI: 10.3390/microorganisms9010129

Source DB:  PubMed          Journal:  Microorganisms        ISSN: 2076-2607


