Sungchul Kim, Lee Sael, Hwanjo Yu.
Abstract
Understanding the functions of proteins is one of the most important challenges in many studies of biological processes. The function of a protein can be predicted by analyzing the functions of structurally similar proteins, so finding structurally similar proteins accurately and efficiently in a large set of proteins is crucial. A protein structure can be represented as a vector by the 3D-Zernike Descriptor (3DZD), which compactly represents the surface shape of the protein tertiary structure. This simplified representation accelerates the search process. However, computing the similarity of two protein structures is still computationally expensive, so it is hard to efficiently process many simultaneous requests for structurally similar protein search. This paper proposes indexing techniques that substantially reduce the search time for finding structurally similar proteins. In particular, we first apply two indexing techniques, iDistance and iKernel, to the 3DZDs. We then extend the techniques to further improve the search speed for protein structures. The extended indexing techniques build and use a reduced index constructed from the first few attributes of the 3DZDs of protein structures. To retrieve the top-k similar structures, the top-10 × k similar structures are first found using the reduced index, and the top-k structures are then selected from among them. We also modify the indexing techniques to support θ-based nearest neighbor search, which returns the data points whose distance to the query point is less than θ. The results show that both iDistance and iKernel significantly improve the search speed. In top-k nearest neighbor search, the search time is reduced by 69.6%, 77%, 77.4% and 87.9% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively. In θ-based nearest neighbor search, the search time is reduced by 80%, 81%, 95.6% and 95.6% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively.
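The two-stage retrieval described in the abstract (find 10 × k candidates on a reduced representation, then pick the top-k among them) can be illustrated with a minimal Python sketch. It is not the authors' implementation: a linear scan stands in for the iDistance/iKernel index, Euclidean distance is assumed as the 3DZD dissimilarity, and the function names and the default `d_reduced=20` are hypothetical. The θ-based variant relies on the fact that a Euclidean distance over a prefix of the attributes never exceeds the distance over all attributes, so filtering on the prefix discards no true answer.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_full(query, data, k):
    """Baseline: linear scan over the full vectors; returns indices, nearest first."""
    return heapq.nsmallest(k, range(len(data)),
                           key=lambda i: euclidean(query, data[i]))


def top_k_reduced(query, data, k, d_reduced=20, factor=10):
    """Filter on the first d_reduced attributes, then refine on full vectors.

    Stage 1 keeps the factor * k nearest candidates under the reduced distance.
    Stage 2 re-ranks only those candidates with the full distance.
    The result is approximate unless the true top-k survive stage 1.
    """
    q_red = query[:d_reduced]
    n_candidates = min(factor * k, len(data))
    candidates = heapq.nsmallest(
        n_candidates, range(len(data)),
        key=lambda i: euclidean(q_red, data[i][:d_reduced]))
    return heapq.nsmallest(k, candidates,
                           key=lambda i: euclidean(query, data[i]))


def theta_search(query, data, theta):
    """Baseline theta-based search: every point closer than theta."""
    return [i for i, v in enumerate(data) if euclidean(query, v) < theta]


def theta_search_reduced(query, data, theta, d_reduced=20):
    """Theta-based search with a reduced-distance filter before refinement."""
    q_red = query[:d_reduced]
    candidates = [i for i, v in enumerate(data)
                  if euclidean(q_red, v[:d_reduced]) < theta]
    return [i for i in candidates if euclidean(query, data[i]) < theta]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic 121-dimensional vectors standing in for 3DZDs.
    vectors = [[random.random() for _ in range(121)] for _ in range(1000)]
    q = vectors[0]
    print(top_k_reduced(q, vectors, k=5))
    print(len(theta_search_reduced(q, vectors, theta=4.2)))
```

In this sketch the speed-up comes only from computing fewer full-length distances; in the paper the reduced vectors are additionally placed in an index so that the first stage avoids scanning every protein.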
Year: 2013 PMID: 23691543 PMCID: PMC3618241 DOI: 10.1186/1472-6947-13-s1-s8
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. Two example proteins. (a) 2kr1-A and (b) 3DVG-A.
Figure 2. 3D Zernike descriptors of two example proteins. The dimension of the 3DZD is 121.
Figure 3. Top-k search using iDistance. An x mark is the query; a circle mark is a data point; a circle with a solid line is a cluster; a circle with a dashed line is the query region.
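The kind of search this caption describes, a query region that grows until the top-k answers are certain, can be sketched in Python as follows. This is a simplified illustration of the general iDistance idea rather than the paper's code: a sorted list with `bisect` stands in for the B+-tree, reference points are supplied by the caller instead of coming from clustering, Euclidean distance is assumed, and the class name `IDistanceSketch` and parameter `delta_r` are hypothetical.

```python
import bisect
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IDistanceSketch:
    """One-dimensional keys: key(p) = i * c + dist(p, O_i), O_i nearest reference."""

    def __init__(self, data, refs):
        self.data = data
        self.refs = refs
        # Assign every point to its nearest reference point.
        assign = []
        for p in data:
            i = min(range(len(refs)), key=lambda j: dist(p, refs[j]))
            assign.append((i, dist(p, refs[i])))
        # c exceeds every point-to-reference distance, so key ranges never overlap.
        self.c = max(d for _, d in assign) + 1.0
        # rmax[i] is the radius of partition i (0.0 if the partition is empty).
        self.rmax = [0.0] * len(refs)
        for i, d in assign:
            self.rmax[i] = max(self.rmax[i], d)
        entries = sorted((i * self.c + d, idx)
                         for idx, (i, d) in enumerate(assign))
        self.keys = [key for key, _ in entries]
        self.ids = [idx for _, idx in entries]

    def knn(self, q, k, delta_r):
        """Grow the query radius r by delta_r until the k-th best distance <= r."""
        seen = {}  # point index -> true distance to q
        dq = [dist(q, o) for o in self.refs]
        r = delta_r
        while True:
            for i, d in enumerate(dq):
                if d - r > self.rmax[i]:
                    continue  # query sphere does not reach partition i
                lo = bisect.bisect_left(
                    self.keys, i * self.c + max(0.0, d - r))
                hi = bisect.bisect_right(
                    self.keys, i * self.c + min(self.rmax[i], d + r))
                for pos in range(lo, hi):
                    idx = self.ids[pos]
                    if idx not in seen:
                        seen[idx] = dist(q, self.data[idx])
            best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
            done = len(best) == k and best[-1][1] <= r
            if done or len(seen) == len(self.data):
                return [idx for idx, _ in best]
            r += delta_r


if __name__ == "__main__":
    random.seed(1)
    points = [[random.random() for _ in range(8)] for _ in range(500)]
    index = IDistanceSketch(points, refs=points[:10])
    print(index.knn(points[42], k=3, delta_r=0.05))
```

The stopping test is what makes the answer exact in this sketch: by the triangle inequality, every point within distance r of the query has a key inside one of the scanned ranges, so once the k-th best distance is at most r no unseen point can be closer. A smaller `delta_r` examines fewer points per iteration but needs more iterations.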
Figure 4. Top-2 query processing in iKernel. q is the query point, C_i is the i-th cluster, each ring contains the same number of data points, and MPD_ij is the minimum possible distance of the j-th ring of the i-th cluster.
Top-2 query processing
| step | output | top | |
|---|---|---|---|
| 0 | |||
| 1 | |||
| 2 | |||
| 3 | |||
| 4 | |||
| 5 |||
Figure 5. The number of data instances assigned to clusters by random and k-means clustering. The solid line is random clustering, and the dashed line is k-means clustering.
Figure 6. The coverage of top-25 in top-250 as the dimension used for the index increases. The number of clusters, M, is 866; <Δr, C> is <0.2, 4>; and the number of data points in each ring, g, is 50.
Figure 7. The change in processing time as the dimension used for the index increases. The number of clusters, M, is 866; <Δr, C> is <0.2, 4>; and the number of data points in each ring, g, is 50.
Figure 8. The distribution of processing time of fKNNs. The experiment is conducted with 100 randomly selected query proteins.
Figure 9. The efficiency of top-k search using iDistance with various Δr. Δr is the amount added to r after each iteration in iDistance.
Figure 10. The efficiency of top-k search using iKernel with . g is the number of data points in each ring when iKernel is used.
The effectiveness of clustering (Processing time (sec.)/Evaluation ratio)
| | iDistance | iKernel |
|---|---|---|
| Random | 0.34124/0.2132 | 0.26173/0.2251 |
| K-means | | |
Figure 11. The processing time/evaluation ratio as k increases.
Figure 12. The processing time/evaluation ratio as .
Figure 13. The analysis of . The number of nearest neighbors and the number of queries without nearest neighbors as θ increases.
Figure 14. The distribution of processing time of fTNNs.
The comparison of the proposed approaches in θ-based nearest neighbor search (Proc. is processing time measured in seconds and Eval. is evaluation ratio)
| LS | iDistance | iKernel | |||
|---|---|---|---|---|---|
| basic | basic | ||||
| Proc. | 1.08 | 0.2128 | 0.2091 | 0.087 | |
| Eval. | 1 | 0.1254 | 0.0537 | 0.075 | |
The comparison of the proposed approaches in top-k nearest neighbor search (Proc. is processing time measured in seconds and Eval. is evaluation ratio)
| LS | iDistance | iKernel | |||
|---|---|---|---|---|---|
| basic | basic | ||||
| Proc. | 1.0567 | 0.3212 | 0.2387 (95.2) | 0.2434 | |
| Eval. | 1 | 0.3 | 0.202 (95.2) | 0.1993 (95.2) | |
Figure 15. The simulation result as the number of queries increases. The x-axis is the number of users in log scale.