
Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets.

Dario Borrelli, Gabriela Gongora Svartzman, Carlo Lipizzi.

Abstract

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for chunking idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; in this study, however, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with different numbers of unitary elements (tokens) as input. Both corpora show improvements in accuracy proportional to increases in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology can achieve higher accuracy with incremental usage. A future study will aim at designing an iterative system for the proposed methodology.
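
To make the idea of frequency-based chunking concrete, the following is a minimal Python sketch, not the authors' algorithm, whose details are not given in the abstract. It assumes a simple greedy rule: adjacent tokens are merged into one chunk when their bigram occurs at least `min_count` times in the corpus. The function names, the threshold, and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent token pairs across all tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def chunk(tokens, counts, min_count=2):
    """Greedily split tokens into non-overlapping chunks.

    A token joins the current chunk when the bigram it forms with the
    previous token occurs at least min_count times (an assumed threshold,
    not a value taken from the paper).
    """
    chunks = []
    for tok in tokens:
        if chunks and counts[(chunks[-1][-1], tok)] >= min_count:
            chunks[-1].append(tok)
        else:
            chunks.append([tok])
    return chunks


# Toy corpus (illustrative only): three whitespace-tokenized sentences.
corpus = [
    "stocks fell on wall street today".split(),
    "wall street opened higher".split(),
    "traders on wall street were cautious".split(),
]
counts = bigram_counts(corpus)
result = chunk(corpus[0], counts, min_count=3)
print(result)
```

With this toy corpus, only the bigram ("wall", "street") reaches the threshold of 3, so it is the only multi-token chunk. No labeled text is needed: the grouping comes entirely from raw frequency counts, which is the property the abstract emphasizes.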

Year:  2020        PMID: 32511252      PMCID: PMC7279599          DOI: 10.1371/journal.pone.0234214

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


  2 in total

1.  Correction: Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets.

Authors:  Dario Borrelli; Gabriela Gongora Svartzman; Carlo Lipizzi
Journal:  PLoS One       Date:  2021-01-07       Impact factor: 3.240

2.  Content-based user classifier to uncover information exchange in disaster-motivated networks.

Authors:  Pouria Babvey; Gabriela Gongora-Svartzman; Carlo Lipizzi; Jose E Ramirez-Marquez
Journal:  PLoS One       Date:  2021-11-16       Impact factor: 3.240

