
Optimizing high performance computing workflow for protein functional annotation.

Larissa Stanberry, Bhanu Rekepalli, Yuan Liu, Paul Giblock, Roger Higdon, Elizabeth Montague, William Broomall, Natali Kolker, Eugene Kolker.

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low-complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins (COGs). Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
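The abstract does not spell out the classification rule, so the following is only a minimal sketch of what a similarity-based COG assignment could look like, not the authors' algorithm. It assumes alignment hits are available in the standard 12-column BLAST tabular layout (query id, subject id, ..., e-value, bit score) and that a lookup from subject protein to COG exists. The names `assign_cogs`, `subject_to_cog`, and `max_evalue`, and the best-hit rule itself, are hypothetical choices made for illustration. The second function shows how the sensitivity and specificity figures quoted above are conventionally computed from a confusion matrix.

```python
import csv
import io


def assign_cogs(blast_tsv, subject_to_cog, max_evalue=1e-5):
    """Assign each query protein to the COG of its highest-scoring hit.

    Assumption: blast_tsv holds 12-column tabular BLAST output, where
    column 11 is the e-value and column 12 is the bit score.
    """
    best = {}  # query id -> (bit score, COG id)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        cog = subject_to_cog.get(subject)
        # Ignore hits to proteins outside any COG and weak hits.
        if cog is None or evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, cog)
    return {query: cog for query, (_, cog) in best.items()}


def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical hits: q1 has two acceptable hits, q2 only a weak one.
    hits = (
        "q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
        "q1\ts2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t300\n"
        "q2\ts3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
    )
    lookup = {"s1": "COG0001", "s2": "COG0002", "s3": "COG0003"}
    print(assign_cogs(hits, lookup))
    print(sensitivity_specificity(tp=80, fp=20, tn=80, fn=20))
```

In this toy input, `q2` stays unassigned because its only hit fails the e-value cutoff. A production workflow would more likely rely on profile-based PSI-BLAST scores and thresholds tuned per COG, which this sketch does not attempt.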

Keywords:  BLAST; COG; HSPp-BLAST; PS; PSI-BLAST; XSEDE; computational bioinformatics; data-enabled life sciences; petascale; protein annotation; protein sequence universe; science gateways; sequence similarity

Year:  2014        PMID: 25313296      PMCID: PMC4194055          DOI: 10.1002/cpe.3264

Source DB:  PubMed          Journal:  Concurr Comput        ISSN: 1532-0626            Impact factor:   1.536


