Larissa Stanberry1, Bhanu Rekepalli2, Yuan Liu2, Paul Giblock3, Roger Higdon1, Elizabeth Montague1, William Broomall1, Natali Kolker1, Eugene Kolker4.
Abstract
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Keywords: BLAST; COG; HSPp-BLAST; PS; PSI-BLAST; XSEDE; computational bioinformatics; data-enabled life sciences; petascale; protein annotation; protein sequence universe; science gateways; sequence similarity
Year: 2014 PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264
Source DB: PubMed Journal: Concurr Comput ISSN: 1532-0626 Impact factor: 1.536