Abstract
Arabic diacritics are often omitted in written Arabic. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this problem, but such automation needs resources, such as diacritized texts, to train and evaluate these systems. In this paper, we describe our corpus of Arabic diacritized texts, called Tashkeela. It can be used as a linguistic resource for natural language processing tasks such as automatic diacritization, disambiguation, and feature and data extraction. The corpus is freely available and contains 75 million fully vocalized words, mainly from 97 books in classical and modern Arabic. It was collected from manually vocalized texts using a web crawling process.
Keywords: Arabic language; Corpus; Diacritization; Natural language processing
Year: 2017 PMID: 28224131 PMCID: PMC5310197 DOI: 10.1016/j.dib.2017.01.011
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Corpus parts.
| Source | Words | Percent |
|---|---|---|
| Total words | 75,629,921 | |
| Shamela Library (97 books filtered from 7079 books) | 74,762,008 | 98.85% |
| Other sources | 867,913 | 1.15% |
| • 20 modern books | 398,911 | |
| • Texts crawled from the Internet (learning.aljazeera.net, al-kalema.org, enfal.de, diverse …) | 461,283 | |
| • Manually diacritized | 7701 | |
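The percentages in the corpus parts table can be reproduced from the word counts. The following is a minimal sketch of that arithmetic; the variable names and the label `remainder` are illustrative and do not come from the article.

```python
# Word counts copied from the "Corpus parts" table.
total_words = 75_629_921
shamela_words = 74_762_008   # 97 books from the Shamela Library
remainder_words = 867_913    # all remaining sources (label is ours)

# Share of each part, as a percentage rounded to two decimals.
shamela_share = round(100 * shamela_words / total_words, 2)
remainder_share = round(100 * remainder_words / total_words, 2)

# The two parts account for the whole corpus.
parts_cover_total = shamela_words + remainder_words == total_words

print(shamela_share, remainder_share, parts_cover_total)
```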
Corpus words statistics.
| Statistic | Count |
|---|---|
| Total words counted | 75,629,921 |
| Arabic vocalized words counted | 67,287,202 |
| Punctuation and non-Arabic words counted | 8,342,719 |
| Unique (unrepeated) un-vocalized words | 486,524 |
| Unique (unrepeated) vocalized words | 998,538 |
| Unique (unrepeated) semi-vocalized words | 770,702 |
| Estimated number of vocalizations per word | 2.05 |
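The figures in the statistics table are mutually consistent, which a short calculation can confirm. The sketch below assumes that the estimated number of vocalizations per word is the ratio of unique vocalized forms to unique un-vocalized forms; the article does not state the formula, but this reading reproduces the reported 2.05.

```python
# Counts copied from the "Corpus words statistics" table.
total_words = 75_629_921
arabic_vocalized = 67_287_202
punct_and_non_arabic = 8_342_719
unique_unvocalized = 486_524
unique_vocalized = 998_538

# Vocalized Arabic words plus punctuation/non-Arabic tokens give the total.
counts_add_up = arabic_vocalized + punct_and_non_arabic == total_words

# Assumed formula: distinct vocalized forms per distinct un-vocalized form.
vocalizations_per_word = round(unique_vocalized / unique_unvocalized, 2)

print(counts_add_up, vocalizations_per_word)
```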
| Item | Description |
|---|---|
| Subject area | Computer Science |
| More specific subject area | Computational linguistics, natural language processing, text to speech, corpus, Arabic language |
| Type of data | Text files |
| How data was acquired | The data were collected from freely published texts of ancient books. These books had been retyped and vocalized manually by volunteers, which ensures that the words are vocalized. |
| Data format | Raw |
| Experimental factors | Texts were collected, filtered, and converted to text format. They were cleaned by removing extra spaces and extraneous content, and a specific description was added to each book and data set. |
| Experimental features | Statistical analysis of word frequencies was conducted for vocalized, semi-vocalized, and un-vocalized words. |
| Data source location | N/A |
| Data accessibility | Data presented in this article is freely available at |
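The cleaning and word-frequency steps listed in the table can be illustrated with a small sketch. It is not the authors' tooling: the function names are hypothetical, a word counts as "vocalized" here if it carries at least one mark in the Unicode range U+064B to U+0652 (tanween, short vowels, shadda, sukun), and semi-vocalized words are left out because the record does not define them. Punctuation handling is also omitted.

```python
import re
from collections import Counter, defaultdict

# Assumption: "diacritics" means the Arabic marks U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word: str) -> str:
    """Return the un-vocalized form of a word."""
    return DIACRITICS.sub("", word)


def vocalization_stats(text: str):
    """Count vocalized words and group them by un-vocalized form."""
    # str.split() with no arguments also collapses extra whitespace.
    words = text.split()
    # Frequency of each word that carries at least one diacritic.
    vocalized = Counter(w for w in words if DIACRITICS.search(w))
    # Map each un-vocalized form to its distinct vocalized forms.
    forms = defaultdict(set)
    for w in vocalized:
        forms[strip_diacritics(w)].add(w)
    # Average number of distinct vocalizations per un-vocalized form.
    ratio = len(vocalized) / len(forms) if forms else 0.0
    return vocalized, dict(forms), ratio


if __name__ == "__main__":
    sample = "كَتَبَ  كُتِبَ كُتُبٌ عِلْمٌ كتب"
    freq, groups, ratio = vocalization_stats(sample)
    print(len(freq), len(groups), ratio)
```

In the sample, three vocalized forms share the un-vocalized form كتب and one belongs to علم, so the ratio is 4 / 2 = 2.0; the bare word كتب is skipped because it carries no diacritics.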