A survey on text classification: Practical perspectives on the Italian language.

Andrea Gasparetto, Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli.

Abstract

Text Classification methods have improved at an unparalleled pace over the last decade thanks to the success of deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We approach this subject from the perspective of the Italian language, and we discuss in detail the scarcity of task-specific datasets as well as the computational cost of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly compiled list for French. To simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.

Year:  2022        PMID: 35793328      PMCID: PMC9258888          DOI: 10.1371/journal.pone.0270904

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752
