
On the unsupervised analysis of domain-specific Chinese texts.

Ke Deng, Peter K Bol, Kate J Li, Jun S Liu.

Abstract

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.
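
To make the segmentation problem concrete, the following is a minimal Python sketch of one ingredient that a dictionary-based approach could rely on: given candidate words with probabilities, find the most probable way to cut an unsegmented string into words by dynamic programming. It is an illustration only, not the TopWORDS implementation. The independence (unigram) assumption, the function name `best_segmentation`, the `max_len` cap, and the toy dictionary with its probabilities are all hypothetical, and the EM step that would estimate word probabilities from raw text is omitted.

```python
import math


def best_segmentation(text, word_probs, max_len=4):
    """Return the most probable split of `text` into dictionary words.

    Assumes words are drawn independently (a unigram model), so the
    probability of a segmentation is the product of its word probabilities.
    Returns None if the dictionary cannot cover the text.
    """
    n = len(text)
    # best[i] = highest log-probability of any segmentation of text[:i]
    best = [-math.inf] * (n + 1)
    # back[i] = start index of the last word in that best segmentation
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(text[start:end])
            if p is None or best[start] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy dictionary with made-up probabilities (hypothetical values).
toy_probs = {
    "北": 0.05, "京": 0.05, "北京": 0.3,
    "大": 0.1, "学": 0.1, "大学": 0.3,
    "北京大学": 0.1,
}
print(best_segmentation("北京大学", toy_probs))
```

Working in log-probabilities avoids numerical underflow on long strings, and capping the candidate word length keeps the search linear in the text length for a fixed `max_len`.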

Keywords:  Chinese history; EM algorithm; blogs; text segmentations; word discovery

Year:  2016        PMID: 27185919      PMCID: PMC4896694          DOI: 10.1073/pnas.1516510113

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


