Jonathan Kuck, Honglei Zhuang, Xifeng Yan, Hasan Cam, Jiawei Han.
Abstract
Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real, high-dimensional data sets, most outliers are hidden in certain combinations of dimensions and are relative to a user's search space and interest. It is often more effective to give power to users by allowing them to specify outlier queries flexibly and having the system process such mining queries efficiently. In this study, we introduce the concept of query-based outliers in heterogeneous information networks, design a query language that lets users specify such queries flexibly, define a good outlier measure in heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that, following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.
Year: 2015 PMID: 27064397 PMCID: PMC4825692 DOI: 10.5441/002/edbt.2015.29
Source DB: PubMed Journal: Adv Database Technol
Figure 1. Bibliographic network schema and instantiated network.
Figure 2. Path instantiations of the meta-path (APVPA) connecting authors Jim and Mary.
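As a rough illustration of what counting path instantiations of (APVPA) involves, the sketch below uses a small made-up network; the papers and venues assigned to Jim and Mary are hypothetical and are not taken from Figure 2. It assumes A, P, and V stand for author, paper, and venue, so that every instance of (APVPA) joins one author-paper-venue path from each author at a shared venue.

```python
from collections import Counter

# Hypothetical toy data (not the network drawn in Figure 2).
author_papers = {"Jim": ["p1", "p2"], "Mary": ["p3", "p4"]}
paper_venue = {"p1": "VLDB", "p2": "KDD", "p3": "KDD", "p4": "KDD"}


def venue_counts(author):
    """Count path instances along (APV) from the author to each venue."""
    return Counter(paper_venue[p] for p in author_papers[author])


def count_apvpa(a, b):
    """Count path instances along (APVPA) between authors a and b.

    The meta-path is symmetric, so the count is the dot product of the
    two authors' (APV) venue-count vectors.
    """
    ca, cb = venue_counts(a), venue_counts(b)
    return sum(ca[v] * cb[v] for v in ca)


jim_mary = count_apvpa("Jim", "Mary")
```

In this toy network the only shared venue is KDD, so the instances are Jim-p2-KDD-p3-Mary and Jim-p2-KDD-p4-Mary.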
Table 1. Publication records of candidate and reference vertices. The reference set contains 100 authors with identical publication records, given by the reference author.
| | VLDB | KDD | STOC | SIGGRAPH |
|---|---|---|---|---|
| Reference Author | 10 | 10 | 1 | 1 |
| Sarah | 10 | 10 | 1 | 1 |
| Rob | 0 | 1 | 20 | 20 |
| Lucy | 0 | 5 | 10 | 10 |
| Joe | 0 | 0 | 0 | 2 |
| Emma | 0 | 0 | 0 | 30 |
NetOut outlier scores of select candidate vertices given a query whose feature meta-path is 𝒫 = (APV) and reference set is given in Table 1, compared with scores computed using PathSim and the cosine similarity in place of normalized connectivity.
| | Ω (NetOut) | Ω (PathSim) | Ω (CosSim) |
|---|---|---|---|
| Sarah | 100 | 100 | 100 |
| Rob | 6.24 | 9.97 | 12.43 |
| Lucy | 31.11 | 32.79 | 32.83 |
| Joe | 50 | 1.94 | 7.04 |
| Emma | 3.33 | 5.44 | 7.04 |
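The scores in this table can be reproduced from the publication counts in Table 1. The sketch below is not the authors' implementation. It assumes that each author's feature vector along (APV) is the row of per-venue counts, that normalized connectivity from a candidate to a reference is their dot product divided by the candidate's dot product with itself, and that an outlier score is the sum of the chosen similarity over the 100 identical reference authors. These definitions are not spelled out in the text above, but under them the three score columns come out as NetOut, PathSim, and cosine similarity, in that order.

```python
from math import sqrt

# Publication counts per venue (VLDB, KDD, STOC, SIGGRAPH), copied from Table 1.
REFERENCE = (10, 10, 1, 1)
N_REFERENCE = 100  # the reference set holds 100 identical authors
CANDIDATES = {
    "Sarah": (10, 10, 1, 1),
    "Rob": (0, 1, 20, 20),
    "Lucy": (0, 5, 10, 10),
    "Joe": (0, 0, 0, 2),
    "Emma": (0, 0, 0, 30),
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalized_connectivity(a, b):
    # Assumed definition: connectivity of a to b relative to a's self-connectivity.
    return dot(a, b) / dot(a, a)


def pathsim(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))


def cosine(a, b):
    return dot(a, b) / sqrt(dot(a, a) * dot(b, b))


def outlier_score(candidate, similarity):
    # Sum over the reference set; every member equals REFERENCE here.
    return N_REFERENCE * similarity(candidate, REFERENCE)


MEASURES = (normalized_connectivity, pathsim, cosine)
scores = {
    name: tuple(round(outlier_score(vec, m), 2) for m in MEASURES)
    for name, vec in CANDIDATES.items()
}

for name, row in scores.items():
    print(name, *row)
```

Joe and Emma show the difference between the measures: both publish only in SIGGRAPH, so cosine similarity cannot tell them apart, while normalized connectivity scores Joe, who has just two papers, as far less of an outlier than Emma, who has thirty.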
Comparison of different outlierness measures, with candidate and reference sets S_c = S_r = author(“Christos Faloutsos”).paper.author and feature meta-path 𝒫 = (APV). Outliers found by normalized connectivity are interesting, whereas those found by PathSim or CosSim are authors with very few papers, which are not interesting.
| Method | Ω | | Ω | | Ω | |
|---|---|---|---|---|---|---|
| Ranking | Name | Ω-value | Name | Ω-value | Name | Ω-value |
| 1 | Adam Wright | 2.54 | Wenyao Ho | 1.07 | John Chien-Han Tseng | 0.0022 |
| 2 | Philip Koopman | 2.55 | Fernanda Balem | 1.12 | Fernanda Balem | 0.0038 |
| 3 | Nicholas D. Sidiropoulos | 3.29 | Rebecca B. Buchheit | 1.31 | Guoqiang Shan | 0.0046 |
| 4 | Katia P. Sycara | 3.64 | John Chien-Han Tseng | 1.41 | Wenyao Ho | 0.0066 |
| 5 | David S. Doermann | 3.65 | Chi-Dong Chen | 1.47 | Chi-Dong Chen | 0.0077 |
Query templates used to construct query sets for efficiency experiments. 10,000 random authors are selected and substituted where indicated by “·” in each query template.
| Number | Query Templates |
|---|---|
| 𝒬1 | |
| 𝒬2 | |
| 𝒬3 | |
Case study of NetOut results on several queries.
| Ranking | Name | Ω-value |
|---|---|---|
| 1 | Adam Wright | 2.54 |
| 2 | Philip Koopman | 2.55 |
| 3 | Nicholas D. Sidiropoulos | 3.29 |
| 4 | Katia P. Sycara | 3.64 |
| 5 | David S. Doermann | 3.65 |
| 6 | Asim Smailagic | 3.69 |
| 7 | John Chien-Han Tseng | 4.00 |
| 8 | Daniel P. Siewiorek | 4.22 |
| 9 | Jessica K. Hodgins | 4.52 |
| 10 | Dimitris N. Metaxas | 4.57 |

| Ranking | Name | Ω-value |
|---|---|---|
| 1 | Dimitris N. Metaxas | 1.06 |
| 2 | Bin Zhang | 1.06 |
| 3 | Hui Zhang | 1.07 |
| 4 | Lionel M. Ni | 1.07 |
| 5 | Bin Liu | 1.08 |
| 6 | Joel H. Saltz | 1.08 |
| 7 | Yang Wang | 1.08 |
| 8 | Hao Wang | 1.08 |
| 9 | Ee-Peng Lim | 1.12 |
| 10 | Katia P. Sycara | 1.13 |

| Ranking | Name | Ω-value |
|---|---|---|
| 1 | | 1.27 |
| 2 | Wolfgang Glänzel | 4.99 |
| 3 | Paul M. Thompson | 6.46 |
| 4 | Yehuda Lindell | 9.21 |
| 5 | Kwan-Liu Ma | 12.2 |
| 6 | Dhabaleswar K. Panda | 13.23 |
| 7 | Christos Davatzikos | 13.95 |
| 8 | Andrzej Skowron | 14.62 |
| 9 | Anil K. Jain | 15.75 |
| 10 | Fillia Makedon | 15.95 |
Figure 3. Comparison of total execution time for 10,000 randomly generated queries between the baseline implementation and the implementation with pre-materialization.
Figure 4. In-depth analysis of query processing time using selective pre-materialization strategies with the relative frequency threshold set to 0.01. “Not indexed vectors” indicates processing time spent on meta-path materialization from vertices without pre-materialization; “Indexed vectors” indicates time spent looking up pre-materialized meta-paths from materialized vertices; “Outlierness calculation” indicates the time spent computing NetOut.
Figure 5. Comparison of efficiency with different relative frequency thresholds in the selective pre-materialization indexing strategy.