
Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

Natali Kolker, Roger Higdon, William Broomall, Larissa Stanberry, Dean Welch, Wei Lu, Winston Haynes, Roger Barga, Eugene Kolker.

Abstract

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
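
As an illustration of the single-linkage step described above, the following is a minimal in-memory sketch in Python, not the Hadoop/Hive implementation reported in the abstract. It assumes BLAST hits are available as tuples of (query, subject, bit score, query length, subject length). The `lnbs` formula (bit score divided by the longer sequence length), the function names, and the threshold value are all hypothetical, since the abstract does not define them. Single linkage is realized here as connected components over hits that pass the threshold, using a union-find structure.

```python
from collections import defaultdict


def lnbs(bit_score, query_len, subject_len):
    # ASSUMPTION: normalize by the longer sequence length.
    # The abstract does not state the actual LNBS formula.
    return bit_score / max(query_len, subject_len)


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def single_linkage(hits, threshold):
    """Group proteins linked by at least one hit with LNBS >= threshold."""
    uf = UnionFind()
    for query, subject, bits, qlen, slen in hits:
        # Register both proteins so unlinked ones become singleton groups.
        uf.find(query)
        uf.find(subject)
        if query != subject and lnbs(bits, qlen, slen) >= threshold:
            uf.union(query, subject)
    groups = defaultdict(set)
    for protein in list(uf.parent):
        groups[uf.find(protein)].add(protein)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical hits: (query, subject, bit score, query length, subject length)
example_hits = [
    ("A", "B", 200, 100, 100),
    ("B", "C", 180, 100, 120),
    ("C", "D", 20, 120, 90),
    ("D", "E", 150, 90, 80),
]
example_groups = single_linkage(example_hits, threshold=1.0)
```

With these made-up hits, A, B, and C are chained together through pairwise links, while the weak C-D hit keeps D and E in a separate group. At the scale described in the abstract (billions of records), the same idea would need a distributed formulation rather than a single in-memory dictionary.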

Year:  2011        PMID: 21809957     DOI: 10.1089/omi.2011.0101

Source DB:  PubMed          Journal:  OMICS        ISSN: 1536-2310


  4 in total

1.  Opportunities and challenges for the life sciences community.

Authors:  Eugene Kolker; Elizabeth Stewart; Vural Ozdemir
Journal:  OMICS       Date:  2012-03

2.  Optimizing high performance computing workflow for protein functional annotation.

Authors:  Larissa Stanberry; Bhanu Rekepalli; Yuan Liu; Paul Giblock; Roger Higdon; Elizabeth Montague; William Broomall; Natali Kolker; Eugene Kolker
Journal:  Concurr Comput       Date:  2014-09-10       Impact factor: 1.536

3.  Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons.

Authors:  Illyoung Choi; Alise J Ponsero; Matthew Bomhoff; Ken Youens-Clark; John H Hartman; Bonnie L Hurwitz
Journal:  Gigascience       Date:  2019-02-01       Impact factor: 6.524

4.  A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes.

Authors:  Hector Carrillo-Cabada; Jeremy Benson; Asghar M Razavi; Brianna Mulligan; Michel A Cuendet; Harel Weinstein; Michela Taufer; Trilce Estrada
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2021-08-06       Impact factor: 3.702

