Optimizing high performance computing workflow for protein functional annotation.

Larissa Stanberry¹, Bhanu Rekepalli², Yuan Liu², Paul Giblock³, Roger Higdon¹, Elizabeth Montague¹, William Broomall¹, Natali Kolker¹, Eugene Kolker⁴.

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins (COGs). Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

Keywords: BLAST; COG; HSPp-BLAST; PS; PSI-BLAST; XSEDE; computational bioinformatics; data-enabled life sciences; petascale; protein annotation; protein sequence universe; science gateways; sequence similarity

Year: 2014 PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264

Source DB: PubMed Journal: Concurr Comput ISSN: 1532-0626 Impact factor: 1.536
