
Efficient streaming text clustering.

Shi Zhong.

Abstract

Clustering data streams is a recent research topic that has emerged from many real-world data mining applications and has attracted considerable attention. However, there is little work on clustering high-dimensional streaming text data. This paper combines an efficient online spherical k-means (OSKM) algorithm with an existing scalable clustering strategy to achieve fast and adaptive clustering of text streams. The OSKM algorithm modifies the spherical k-means (SPKM) algorithm by updating cluster centroids online, based on the well-known Winner-Take-All competitive learning. It has been shown to be as efficient as SPKM, but much superior in clustering quality. The scalable clustering strategy was previously developed to deal with very large databases that cannot fit into limited memory and that are too expensive to scan multiple times. Using the strategy, one keeps only sufficient statistics for history data, which retains (part of) the contribution of the history data while accommodating the limited memory. To make the proposed clustering algorithm adaptive to data streams, we introduce a forgetting factor that applies exponential decay to the importance of history data: the older a set of text documents, the less weight it carries. Our experimental results demonstrate the efficiency of the proposed algorithm and reveal an intuitive and interesting fact about clustering text streams: one needs to forget to be adaptive.
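The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the class name `DecayedSphericalKMeans`, the `decay` parameter, and the exact update rule are assumptions made for illustration. The sketch keeps one weighted sum vector per cluster as the sufficient statistic, multiplies it by a forgetting factor before each new batch, assigns each unit-normalized document to the most similar centroid (winner-take-all), and moves only that centroid.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class DecayedSphericalKMeans:
    """Hypothetical sketch: online spherical k-means with a forgetting factor."""

    def __init__(self, initial_centroids, decay=0.5):
        # decay = 1.0 never forgets; values near 0 forget history quickly.
        self.decay = decay
        self.centroids = [normalize(c) for c in initial_centroids]
        # Sufficient statistics: one weighted sum vector per cluster.
        self.sums = [list(c) for c in self.centroids]

    def partial_fit(self, batch):
        """Process one batch of documents and return their cluster labels."""
        # Exponentially down-weight everything seen in earlier batches.
        self.sums = [[self.decay * x for x in s] for s in self.sums]
        labels = []
        for doc in batch:
            x = normalize(doc)
            # Winner-take-all: only the most similar centroid is updated.
            winner = max(range(len(self.centroids)),
                         key=lambda k: dot(self.centroids[k], x))
            self.sums[winner] = [s + xi for s, xi in zip(self.sums[winner], x)]
            self.centroids[winner] = normalize(self.sums[winner])
            labels.append(winner)
        return labels
```

Calling `partial_fit` once per arriving batch yields labels for that batch. With a small `decay`, a centroid follows a shift in the stream's content much faster than with `decay=1.0`, which is the "forget to be adaptive" effect the abstract describes.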


Year:  2005        PMID: 16085385     DOI: 10.1016/j.neunet.2005.06.008

Source DB:  PubMed          Journal:  Neural Netw        ISSN: 0893-6080


  2 in total

1.  Privacy-preserving discovery of topic-based events from social sensor signals: an experimental study on Twitter.

Authors:  Duc T Nguyen; Jai E Jung
Journal:  ScientificWorldJournal       Date:  2014-04-03

2.  SOTXTSTREAM: Density-based self-organizing clustering of text streams.

Authors:  Avory C Bryant; Krzysztof J Cios
Journal:  PLoS One       Date:  2017-07-07       Impact factor: 3.240

