Nitin Bhardwaj, Mark Gerstein, Hui Lu.
Abstract
BACKGROUND: In supervised learning, the traditional approach to building a classifier uses two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which may be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. This limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set.

METHODS: In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by identifying reliable negative (RN) examples iteratively from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and ~3750 unlabeled ones was used to construct and validate the protocol.
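The iterative RN step described above can be illustrated with a minimal sketch. The abstract does not specify the classifier or the convergence rule, so everything here is an assumption made for illustration: a nearest-centroid rule stands in for the real classifier, convergence means the RN set stops shrinking, and the function names and toy 2-D features are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def iterate_reliable_negatives(positives, unlabeled, max_iter=50):
    """Shrink the unlabeled set to a reliable negative (RN) set.

    Assumption: a nearest-centroid rule replaces the actual classifier.
    Each round drops every RN candidate that lies closer to the positive
    centroid than to the current RN centroid, and stops when nothing is
    dropped (or when dropping would empty the set).
    """
    pos_c = centroid(positives)
    rn = list(unlabeled)  # start by treating all of U as negative
    for _ in range(max_iter):
        neg_c = centroid(rn)
        kept = [u for u in rn if sq_dist(u, neg_c) <= sq_dist(u, pos_c)]
        if len(kept) == len(rn) or not kept:
            break  # converged, or the rule would discard everything
        rn = kept
    return rn


# Toy data (hypothetical): positives cluster near (5, 5), true negatives
# near (0, 0), and two unlabeled points are actually positive-like.
positives = [(5 + 0.1 * i, 5 - 0.1 * i) for i in range(10)]
negatives = [(0.1 * i, 0.1 * i) for i in range(20)]
hidden = [(5.2, 5.1), (5.5, 4.6)]

rn = iterate_reliable_negatives(positives, negatives + hidden)
```

On this toy data the two positive-like unlabeled points are dropped in the first round and the RN set is stable in the second. A final classifier would then be trained on the positive set against `rn`.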
Year: 2010 PMID: 20122235 PMCID: PMC3009533 DOI: 10.1186/1471-2105-11-S1-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example of a peripheral domain (C2 domain of PKCα, PDB ID: 1DSY). The protein targets specific lipids in the membrane in response to a specific signal, which in this case is the binding of two Ca2+ ions (shown as red spheres). The protein (shown in cartoon representation) partially penetrates the membrane. Lipid hydrogens are omitted for clarity.
Figure 2. Two-step strategy used in PU-learning. In the first step, a set of reliable negative examples is identified using a spy technique: some positive examples are added to the unlabeled (U) set as spies. After classification, a threshold is chosen such that all the spies are classified as positive, and the unlabeled examples scoring below the threshold form the reliable negative (RN) set. In the second step, the RN set and the positive (P) set are used to build a classifier.
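The spy step in this caption can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the scoring function (difference of squared distances to two centroids), the `spy_fraction` default, and all function names are assumptions; only the idea of hiding spies in U and setting the threshold so that every spy scores as positive comes from the caption.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score(x, pos_c, neg_c):
    """Higher means more positive-like (assumed stand-in for a classifier)."""
    return sq_dist(x, neg_c) - sq_dist(x, pos_c)


def spy_reliable_negatives(positives, unlabeled, spy_fraction=0.15, seed=0):
    """Return (RN set, threshold) using a spy-style first step.

    Requires at least two positives so that one can serve as a spy
    while at least one remains in the positive set.
    """
    rng = random.Random(seed)
    n_spies = min(len(positives) - 1, max(1, int(len(positives) * spy_fraction)))
    spy_idx = set(rng.sample(range(len(positives)), n_spies))
    spies = [p for i, p in enumerate(positives) if i in spy_idx]
    remaining = [p for i, p in enumerate(positives) if i not in spy_idx]

    # Step 1: model P-minus-spies against U-plus-spies.
    pos_c = centroid(remaining)
    neg_c = centroid(list(unlabeled) + spies)

    # The lowest spy score is the threshold, so every spy counts as positive.
    threshold = min(score(s, pos_c, neg_c) for s in spies)

    # Unlabeled examples scoring below every spy form the RN set.
    rn = [u for u in unlabeled if score(u, pos_c, neg_c) < threshold]
    return rn, threshold


# Toy data (hypothetical): positives near (5, 5), true negatives near (0, 0),
# plus two positive-like points hidden in the unlabeled set.
positives = [(5 + 0.1 * i, 5 - 0.1 * i) for i in range(10)]
negatives = [(0.1 * i, 0.1 * i) for i in range(20)]
hidden = [(5.2, 5.1), (5.5, 4.6)]
unlabeled = negatives + hidden

rn, threshold = spy_reliable_negatives(positives, unlabeled)
```

In the second step, `rn` and the full positive set would be passed to whatever classifier is being trained. Using the minimum spy score as the threshold is the strictest choice; a tolerance for noisy spies would be a further assumption.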
Figure 3. The spy technique.
Figure 4. Modified spy technique.
Figure 5. Reliable negative (.
Figure 6. Reliable negative (.