Kresimir Sikic, Oliviero Carugo.
Abstract
Non-redundant protein datasets are of utmost importance in bioinformatics. Constructing such datasets means removing protein sequences that exceed a certain similarity threshold. Several programs, such as 'Decrease redundancy', 'cd-hit', 'Pisces', 'BlastClust' and 'SkipRedundant', are available. The issue that we focus on here is to what extent the non-redundant datasets produced by different programs are similar to each other. A systematic comparison of the features and outputs of these programs, using subsets of the UniProt database, was performed and is described here. The results show a high level of overlap between non-redundant datasets obtained with the same program fed with the same initial dataset but different percentage of identity thresholds, and moderate levels of similarity between results obtained with different programs fed with the same initial dataset and the same percentage of identity threshold. We must be aware that some differences may arise, and the use of more than one computer application is advisable.
Year: 2010 PMID: 21364823 PMCID: PMC3055704 DOI: 10.6026/97320630005234
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. Pairwise percentages of identity calculated on the non-redundant set using the Needleman-Wunsch algorithm. Non-redundant sets were obtained using the cd-hit program with max PID = 40%. Similar results were obtained for Decrease redundancy, Pisces and BlastClust.
Figure 2. Venn diagram. Overlap of four non-redundant datasets, each obtained with a different program, based on the same input dataset (D_100_100) and the same PID (40%). 95 sequences are common to the non-redundant outputs of BlastClust, cd-hit, Decrease redundancy, and Pisces.
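
The kind of overlap summarized in the Venn diagram can be illustrated with a minimal Python sketch. It assumes each program's non-redundant output has already been reduced to a set of sequence identifiers; the identifiers and set contents below are invented toy data, not results from the study, and the Jaccard index is used here only as one plausible overlap measure, not necessarily the one the authors applied.

```python
from itertools import combinations


def common_to_all(datasets):
    """Return the identifiers present in every dataset (the Venn core)."""
    if not datasets:
        return set()
    return set.intersection(*datasets.values())


def pairwise_jaccard(datasets):
    """Return {(name_a, name_b): Jaccard index} for every pair of datasets."""
    scores = {}
    for a, b in combinations(sorted(datasets), 2):
        union = datasets[a] | datasets[b]
        shared = datasets[a] & datasets[b]
        # Two empty sets are treated as identical by convention.
        scores[(a, b)] = len(shared) / len(union) if union else 1.0
    return scores


# Hypothetical toy outputs: sequence IDs kept by each program.
toy_outputs = {
    "BlastClust": {"P1", "P2", "P3"},
    "cd-hit": {"P2", "P3", "P4"},
    "Pisces": {"P2", "P3", "P5"},
}

core = common_to_all(toy_outputs)
scores = pairwise_jaccard(toy_outputs)
```

With real data, each set would be filled from the identifiers in a program's output file, and the size of `core` would correspond to the central region of the Venn diagram.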