Zobia Rehman, Waqas Anwar, Usama Ijaz Bajwa, Wang Xuan, Zhou Chaoying.
Abstract
Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for resource-scarce languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme-matching approach to Urdu text tokenization, along with further algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list with 6,400 entries.
Year: 2013 PMID: 23990871 PMCID: PMC3749178 DOI: 10.1371/journal.pone.0068178
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
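The F1-measure reported in the abstract can be checked against the stated precision and recall. The short sketch below assumes the standard harmonic-mean definition of F1; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract, in percent.
f1 = f1_score(97.28, 93.71)
print(round(f1, 2))
```

Under that definition, the rounded result agrees with the 95.46% quoted in the abstract.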
Non-Joiner Urdu Alphabets.
| ا د ڈ ذ ر ز ڑ ژ و ے |
Joiner Urdu Alphabets.
| ب پ ت ٹ ث ج چ ح خ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ہ ء ھ ی |
Words ending in non-joiners.
| اسدشہرسےباہرجاپہنچا (I) | اسد شہر سے باہر جا پہنچا (II) |
| Asad reached out of the city. |
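A minimal sketch of how the non-joiner list relates to this example: a word that ends in a non-joiner does not connect to the next letter, so omitting the space leaves no visible trace. The helper name `boundary_candidates` is hypothetical; the code only marks positions where a word could end this way and is not the paper's tokenizer.

```python
# Non-joiner letters as listed in the table above.
NON_JOINERS = set("ادڈذرزڑژوے")


def boundary_candidates(text):
    """Indices of characters after which a word could end without a visible join."""
    return [i for i, ch in enumerate(text) if ch in NON_JOINERS]


spaced = "اسد شہر سے باہر جا پہنچا"
unspaced = spaced.replace(" ", "")

# True word-final positions in the unspaced string, derived from the spaced form.
true_ends, pos = [], -1
for word in spaced.split():
    pos += len(word)
    true_ends.append(pos)

candidates = boundary_candidates(unspaced)
# Every real word boundary falls on a non-joiner, but not every non-joiner is a boundary.
print(set(true_ends) <= set(candidates), len(candidates) > len(true_ends))
```

The extra candidates (non-joiners inside words) are the reason a letter-class rule alone cannot tokenize such text and a lexicon is needed.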
ZWNJ between words.
| (old track) پرانیسڑک |
| (Words without space or ZWNJ) |
| (old track) پرانی سڑک |
| (Words separated by space) |
| (old track) پرانی‌سڑک |
| (Words separated by ZWNJ) |
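As a rough illustration of the three cases above, the sketch below treats the zero-width non-joiner (U+200C) as a word separator alongside ordinary whitespace. The function name `split_on_space_or_zwnj` is an assumption for this example; text with neither separator is returned as one unsplit chunk and would still need lexicon-based segmentation.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner


def split_on_space_or_zwnj(text):
    """Split on runs of whitespace and/or ZWNJ, dropping empty pieces."""
    return [piece for piece in re.split(r"[\s\u200c]+", text) if piece]


with_space = "پرانی سڑک"
with_zwnj = "پرانی" + ZWNJ + "سڑک"
with_nothing = "پرانیسڑک"

print(split_on_space_or_zwnj(with_space))
print(split_on_space_or_zwnj(with_zwnj))
print(split_on_space_or_zwnj(with_nothing))
```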
Space exclusion issues.
| Word | Category of the word |
| روٹی کپڑا (basic needs of life) | Compound |
| نظم و ضبط (discipline) | |
| حد نظر (scene limit) | |
| دن بدن (day by day) | Reduplication |
| صبح صبح (early morning) | |
| ٹھيک ٹھاک (absolutely fine) | |
| بيش قيمت (expensive) | Prefixation |
| ان تھک (hard work) | |
| آلہ کار (apparatus) | Suffixation |
| دہشت گردی (terrorism) | |
| جنوبی افريقہ (South Africa) | Proper Noun with more than one word |
| زينب نور (Zainab Noor) | |
| ايش ٹرے (ash tray) | English words |
| نيٹ ورک (network) | |
| ايم قريشی (M. Qureshi) | Abbreviations |
| اين ايل پی (NLP) | |
Output of forward maximum matching.
| سعودی (Saudi) |
Output of forward maximum matching.
| سعودی (Saudi) | عرب (Arab) |
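The two tables above show forward maximum matching consuming the input left to right. A minimal sketch of the general technique follows, assuming a small hypothetical morpheme set; the single-character fallback is a common convention and an assumption here, not something the source states.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon entry."""
    max_len = max(len(entry) for entry in lexicon)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical morpheme list for illustration only.
lexicon = {"سعود", "سعودی", "عر", "عرب"}
print(forward_max_match("سعودیعرب", lexicon))
```

Because the longest entry is tried first, the shorter entries "سعود" and "عر" are never selected for this input.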
Output of dynamic maximum matching.
| سعود (Saud) | 1 | 0 |
| سعودی (Saudi) | 1 | 0 |
Output of dynamic maximum matching.
| سعود (Saud) | ی (i) | 2 | 1 |
| سعودی (Saudi) | | 1 | 0 |
Output of dynamic maximum matching.
| سعود (Saud) | ی (i) | عر (Ar) | 3 | 1 |
| سعودی (Saudi) | | | 1 | 0 |
| سعود (Saud) | ی (i) | عرب (Arab) | 3 | 1 |
Output of dynamic maximum matching.
| سعود (Saud) | ی (i) | عر (Ar) | ب (ab) | 4 | 2 |
| سعودی (Saudi) | عر (Ar) | ب (ab) | | 3 | 1 |
| سعود (Saud) | ی (i) | عرب (Arab) | | 3 | 1 |
| سعودی (Saudi) | عرب (Arab) | | | 2 | 0 |
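The dynamic maximum matching tables keep several candidate segmentations alive at once. The sketch below enumerates every segmentation licensed by a hypothetical morpheme set and ranks them. It assumes the two numeric columns are the number of tokens and the number of single-character tokens; that reading fits every row shown but is not confirmed by this extract.

```python
def segmentations(text, lexicon):
    """All ways to split text into consecutive lexicon entries."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


def score(tokens):
    """(token count, single-character token count); smaller is better (assumed)."""
    return (len(tokens), sum(len(t) == 1 for t in tokens))


# Hypothetical morpheme list for illustration only.
lexicon = {"سعود", "سعودی", "ی", "عر", "عرب", "ب"}
candidates = segmentations("سعودیعرب", lexicon)
for cand in candidates:
    print(cand, score(cand))

best = min(candidates, key=score)
```

With this lexicon the enumeration yields four candidates whose scores match the final table, and the lowest-scoring one is the two-token reading.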
Segmentations produced by dynamic matching.
| اس | نے | کہا | کہ | اسے | جنے | دو | Correct |
| اس | نے | کہا | کہا | سے | جنے | دو | Incorrect |
| He said let him in. (Correct). | |||||||
Compound word generation.
| وہ | بہت | محنت | و | مشقت | سے | کام | کرتا | تھا |
| He had been working very hard. | ||||||||
Compound word generation.
| وہ | بہت | محنتومشقت | سے | کام | کرتا | تھا |
| He had been working very hard. | ||||||
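The before and after tables for compound generation can be mimicked with a simple merge pass. This is only a sketch: it assumes that every standalone linking morpheme "و" between two tokens forms a compound, whereas the paper's actual criteria for deciding this are not given in this extract. The function name `merge_compounds` is hypothetical.

```python
def merge_compounds(tokens, linker="و"):
    """Join 'X linker Y' into one token, mirroring the tables above."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == linker and out and i + 1 < len(tokens):
            # Attach the linker and the following token to the previous token.
            out[-1] = out[-1] + linker + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


before = ["وہ", "بہت", "محنت", "و", "مشقت", "سے", "کام", "کرتا", "تھا"]
after = merge_compounds(before)
print(after)
```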
Example of prefixation.
| سب | نا | اہل | تھے | ايک | مشکل | حل | نہ | کر | سکے |
| They were even unable to solve a single problem. | |||||||||
Example of prefixation.
| سب | نااہل | تھے | ايک | مشکل | حل | نہ | کر | سکے |
| They were even unable to solve a single problem. | ||||||||
Example of suffixation.
| اس | نے | بہت | متاثر | کن | کام | کيا |
| He performed impressively. | ||||||
Example of suffixation.
| اس | نے | بہت | متاثر کن | کام | کيا |
| He performed impressively. | |||||
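The prefixation and suffixation examples both amount to attaching an affix token to its neighbour. A minimal sketch follows, assuming small hypothetical affix lists. It follows the tables in joining a prefix directly to the next token and keeping a space between a stem and its suffix; whether the original system stores the merged token this way is an assumption.

```python
def merge_affixes(tokens, prefixes, suffixes):
    """Attach prefix tokens to the following token and suffix tokens to the previous one."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in prefixes and i + 1 < len(tokens):
            out.append(tok + tokens[i + 1])   # prefix + stem, no space
            i += 2
        elif tok in suffixes and out:
            out[-1] = out[-1] + " " + tok     # stem + space + suffix
            i += 1
        else:
            out.append(tok)
            i += 1
    return out


# Hypothetical affix lists for illustration only.
PREFIXES = {"نا"}
SUFFIXES = {"کن"}

print(merge_affixes(["سب", "نا", "اہل", "تھے"], PREFIXES, SUFFIXES))
print(merge_affixes(["اس", "نے", "بہت", "متاثر", "کن", "کام", "کيا"], PREFIXES, SUFFIXES))
```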
Example of full reduplication.
| اس | نے | دو | دو | ہار | خريدے |
| He bought two necklaces. | |||||
Example of full reduplication.
| اس | نے | دودو | ہار | خريدے |
| He bought two necklaces. | ||||
Example of partial reduplication.
| وہ | گاہے | بگاہے | آيا | کرتا | تھا |
| He used to visit from time to time. | |||||
Example of partial reduplication.
| وہ | گاہےبگاہے | آيا | کرتا | تھا |
| He used to visit from time to time. | ||||
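The reduplication tables can be approximated with a pass over adjacent token pairs. In this sketch, full reduplication means two identical neighbours; partial reduplication is detected with a deliberately crude test (the second token is the first plus one leading character), which covers the example above but is purely an assumption and not the paper's rule.

```python
def is_reduplication(a, b):
    """True for identical neighbours, or when b is a plus one extra leading character."""
    full = a == b
    partial = len(b) == len(a) + 1 and b.endswith(a)  # crude illustrative heuristic
    return full or partial


def merge_reduplication(tokens):
    """Join reduplicated neighbours into a single token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and is_reduplication(tokens[i], tokens[i + 1]):
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(merge_reduplication(["اس", "نے", "دو", "دو", "ہار", "خريدے"]))
print(merge_reduplication(["وہ", "گاہے", "بگاہے", "آيا", "کرتا", "تھا"]))
```

A practical system would need a stricter test or a word list, since unrelated neighbours can satisfy such a simple pattern by chance.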
Example of names and abbreviations.
| اسد | علی | نے | يو | ۔ | ايس | ۔ | اے | ۔ | جانا | ہے |
| Asad Ali has to visit U.S.A. | ||||||||||
Example of names and abbreviations.
| اسدعلی | نے | يو | ۔ | ايس | ۔ | اے | ۔ | جانا | ہے |
| Asad Ali has to visit U.S.A. | |||||||||
Example of names and abbreviations.
| اسدعلی | نے | يو۔ | ايس | ۔ | اے | ۔ | جانا | ہے |
| Asad Ali has to visit U.S.A. | ||||||||
Example of names and abbreviations.
| اسدعلی | نے | يو۔ ايس | ۔ | اے | ۔ | جانا | ہے |
| Asad Ali has to visit U.S.A. | |||||||
Example of names and abbreviations.
| اسدعلی | نے | يو۔ ايس۔ اے۔ | جانا | ہے |
| Asad Ali has to visit U.S.A. | ||||
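The step-by-step tables above gradually fold letter names and full stops into a single abbreviation token. The sketch below does the same in one pass, assuming a hypothetical list of Urdu spellings of English letter names; the merging of the personal name is not covered, since it would need a name list that this extract does not provide.

```python
def merge_abbreviations(tokens, letter_names, stop="۔"):
    """Collapse runs of 'letter-name, full stop' pairs into one abbreviation token."""
    out, i = [], 0
    while i < len(tokens):
        parts = []
        # Collect consecutive (letter name, full stop) pairs.
        while i + 1 < len(tokens) and tokens[i] in letter_names and tokens[i + 1] == stop:
            parts.append(tokens[i] + stop)
            i += 2
        if parts:
            out.append(" ".join(parts))
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical letter-name list for illustration only.
LETTER_NAMES = {"يو", "ايس", "اے"}
tokens = ["اسدعلی", "نے", "يو", "۔", "ايس", "۔", "اے", "۔", "جانا", "ہے"]
merged = merge_abbreviations(tokens, LETTER_NAMES)
print(merged)
```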
Figure 1. Performance comparison.