Zebulun Arendsee (1,2,3), Jing Li (1,2), Urminder Singh (1,2), Arun Seetharam (2,4), Karin Dorman (1,2,5), Eve Syrkin Wurtele (1,2,3).
1. Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA.
2. Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, USA.
3. Center for Metabolic Biology, Iowa State University, Ames, IA, USA.
4. Genome Informatics Facility, Iowa State University, Ames, IA, USA.
5. Department of Statistics, Iowa State University, Ames, IA, USA.
Abstract
MOTIVATION: The goal of phylostratigraphy is to infer the evolutionary origin of each gene in an organism. This is done by searching for homologs within increasingly broad clades. The deepest (oldest, most inclusive) clade that contains a homolog of the protein(s) encoded by a gene is that gene's phylostratum.

RESULTS: We have created a general R-based framework, phylostratr, to estimate the phylostratum of every gene in a species. The program fully automates analysis: selecting species for balanced representation, retrieving sequences, building databases, inferring phylostrata, and returning diagnostics. Key diagnostics include: detection of genes with inferred homologs in old clades but not in intermediate ones; proteome quality assessments; false-positive diagnostics; and checks for missing organellar genomes. phylostratr allows extensive customization and systematic comparisons of the influence of analysis parameters or genomes on phylostrata inference. A user may: modify the automatically generated clade tree or use their own tree; provide custom sequences in place of those automatically retrieved from UniProt; replace BLAST with an alternative algorithm; or tailor the method and sensitivity of the homology inference classifier. We show the utility of phylostratr through case studies in Arabidopsis thaliana and Saccharomyces cerevisiae.

AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/arendsee/phylostratr.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
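The assignment rule described above (search increasingly broad clades, then take the deepest clade with a homolog) can be illustrated with a minimal Python sketch. This is not phylostratr's API, which is written in R; the function names, the gene identifiers, and the toy hit table are hypothetical. The sketch also assumes that homology calls have already been made (for example from BLAST results) and that each target species has been assigned to the most specific clade it shares with the focal species.

```python
from typing import Dict, List, Set

def assign_phylostrata(lineage: List[str],
                       species_stratum: Dict[str, str],
                       hits: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return the deepest clade containing a homolog for each gene.

    lineage: clades of the focal species, ordered from broadest to the
             focal species itself (hypothetical input layout).
    species_stratum: most specific clade each target species shares
             with the focal species.
    hits: for each focal gene, the target species with an inferred homolog.
    """
    rank = {clade: i for i, clade in enumerate(lineage)}
    strata = {}
    for gene, species in hits.items():
        clades = [species_stratum[s] for s in species]
        # With no hits, the gene is assigned to the focal species itself.
        strata[gene] = min(clades, key=rank.__getitem__, default=lineage[-1])
    return strata

def skipped_strata(lineage: List[str],
                   species_stratum: Dict[str, str],
                   hits: Dict[str, Set[str]],
                   gene: str) -> List[str]:
    """List clades younger than the gene's phylostratum that have no homolog."""
    rank = {clade: i for i, clade in enumerate(lineage)}
    # The focal species always "contains" its own gene.
    seen = {rank[species_stratum[s]] for s in hits[gene]} | {len(lineage) - 1}
    oldest = min(seen)
    return [lineage[i] for i in range(oldest + 1, len(lineage)) if i not in seen]

# Toy data: clade names are real taxa, gene IDs and hits are made up.
lineage = ["cellular organisms", "Eukaryota", "Viridiplantae",
           "Brassicaceae", "Arabidopsis thaliana"]
species_stratum = {
    "Escherichia coli": "cellular organisms",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Oryza sativa": "Viridiplantae",
    "Brassica rapa": "Brassicaceae",
}
hits = {
    "geneA": {"Escherichia coli", "Brassica rapa"},
    "geneB": {"Oryza sativa", "Brassica rapa"},
    "geneC": set(),
}

strata = assign_phylostrata(lineage, species_stratum, hits)
gaps = skipped_strata(lineage, species_stratum, hits, "geneA")
```

In this toy table, `geneA` is placed in the oldest stratum but has no hit in two intermediate clades, which is the kind of pattern the "homologs in old clades but not in intermediate ones" diagnostic is meant to flag.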