Automated Phrase Mining from Massive Text Corpora.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han.

Abstract

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to perform unsatisfactorily on text corpora of new domains and genres without extra, expensive adaptation. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. In addition, AutoPhrase can be extended to model single-word quality phrases.

Keywords:  Automatic Phrase Mining; Distant Training; Multiple Languages; Part-of-Speech tag; Phrase Mining

Year:  2018        PMID: 31105412      PMCID: PMC6519941          DOI: 10.1109/TKDE.2018.2812203

Source DB:  PubMed          Journal:  IEEE Trans Knowl Data Eng        ISSN: 1041-4347            Impact factor:   6.977


  4 in total

1.  Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications.

Authors:  Dibakar Sigdel; Vincent Kyi; Aiden Zhang; Shaun P Setty; David A Liem; Yu Shi; Xuan Wang; Jiaming Shen; Wei Wang; JiaWei Han; Peipei Ping
Journal:  J Vis Exp       Date:  2019-02-23       Impact factor: 1.355

2.  Unveiling Evolutionary Path of Nanogenerator Technology: A Novel Method Based on Sentence-BERT.

Authors:  Huailan Liu; Rui Zhang; Yufei Liu; Cunxiang He
Journal:  Nanomaterials (Basel)       Date:  2022-06-11       Impact factor: 5.719

3.  RedMed: Extending drug lexicons for social media applications.

Authors:  Adam Lavertu; Russ B Altman
Journal:  J Biomed Inform       Date:  2019-10-15       Impact factor: 6.317

4.  Leverage knowledge graph and GCN for fine-grained-level clickbait detection.

Authors:  Mengxi Zhou; Wei Xu; Wenping Zhang; Qiqi Jiang
Journal:  World Wide Web       Date:  2022-03-16       Impact factor: 3.000
