Alberto Alexander Gayle, Motomu Shimaoka.
Abstract
INTRODUCTION: The predominance of English in scientific research has created hurdles for "non-native speakers" of English. Here we present a novel application of native language identification (NLI) for the assessment of medical-scientific writing. For this purpose, we created a novel classification system whereby scoring would be based solely on text features found to be distinctive among native English speakers (NS) within a given context. We dubbed this the "Genuine Index" (GI).
Year: 2017 PMID: 28212419 PMCID: PMC5315297 DOI: 10.1371/journal.pone.0172338
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Number of abstracts per country.
| Country | Abstracts |
|---|---|
| USA | 2085 |
| Turkey | 476 |
| Italy | 283 |
| Germany | 273 |
| Canada | 272 |
| UK | 262 |
| Japan | 244 |
| The Netherlands | 161 |
| France | 145 |
| Israel | 139 |
| India | 136 |
| China | 103 |
| Sweden | 87 |
| Greece | 86 |
| Brazil | 78 |
| Australia | 71 |
| Austria | 63 |
| Denmark | 55 |
| Iran | 54 |
| Switzerland | 54 |
| Finland | 54 |
| Poland | 53 |
| South Korea | 48 |
| Spain | 46 |
| Taiwan | 45 |
| Egypt | 42 |
| Norway | 42 |
| Belgium | 34 |
| South Africa | 33 |
| Former Yugoslavia | 33 |
| Hungary | 26 |
| Argentina | 26 |
| Hong Kong | 24 |
| Czech Republic—Slovakia | 22 |
| Saudi Arabia | 21 |
| Mexico | 17 |
| Oman | 16 |
| Nigeria | 15 |
| Jordan | 15 |
| Lebanon | 13 |
| Portugal | 10 |
| Thailand | 10 |
| Other | 87 |
| Unidentified | 28 |
Number of abstracts per country, obtained from Pediatric Blood & Cancer and Pediatric Hematology/Oncology, 1986–2015. Countries with fewer than 10 abstracts are aggregated under “Other”. Abstracts with unidentified countries are aggregated under “Unidentified”.
Genuine index model performance.
| | True JPN (n = 243) | True NS (n = 2665) | Precision |
|---|---|---|---|
| Predicted JPN | 64 | 15 | 81.0% |
| Predicted NS | 179 | 2,650 | 93.7% |
| Recall | 26.3% | 99.4% | |
Confusion matrix detailing classification results of NS and Japanese researchers, where each cell represents number of corresponding abstracts. Includes performance metrics derived from confusion matrix, in which larger numbers represent better performance: F-score denotes performance for each respective class, while kappa denotes performance for the overall model. Class specific F-score calculated based on the above: JPN = 39.8%, NS = 96.5%. Overall kappa calculated based on the above: 37.2%.
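The metrics quoted in this caption can be recomputed from the four cell counts. The sketch below assumes that the two count rows are the predicted classes (JPN first, then NS) and the columns are the true classes, a reading that is consistent with the printed precision and recall values. The variable and function names are illustrative only.

```python
# Cell counts from the confusion matrix (assumed layout: rows = predicted, columns = true).
pred_jpn_true_jpn, pred_jpn_true_ns = 64, 15
pred_ns_true_jpn, pred_ns_true_ns = 179, 2650


def precision_recall_f(tp, fp, fn):
    """Return precision, recall and F-score for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Japanese class: a "hit" is a Japanese abstract predicted as Japanese.
p_jpn, r_jpn, f_jpn = precision_recall_f(pred_jpn_true_jpn, pred_jpn_true_ns, pred_ns_true_jpn)
# NS class: a "hit" is an NS abstract predicted as NS.
p_ns, r_ns, f_ns = precision_recall_f(pred_ns_true_ns, pred_ns_true_jpn, pred_jpn_true_ns)

# Cohen's kappa: observed agreement corrected for chance agreement.
total = pred_jpn_true_jpn + pred_jpn_true_ns + pred_ns_true_jpn + pred_ns_true_ns
observed = (pred_jpn_true_jpn + pred_ns_true_ns) / total
expected = (
    (pred_jpn_true_jpn + pred_jpn_true_ns) * (pred_jpn_true_jpn + pred_ns_true_jpn)
    + (pred_ns_true_jpn + pred_ns_true_ns) * (pred_jpn_true_ns + pred_ns_true_ns)
) / total**2
kappa = (observed - expected) / (1 - expected)

print(f"JPN: precision={p_jpn:.1%} recall={r_jpn:.1%} F={f_jpn:.1%}")
print(f"NS:  precision={p_ns:.1%} recall={r_ns:.1%} F={f_ns:.1%}")
print(f"kappa={kappa:.1%}")
```

Under this reading, the low kappa relative to the overall accuracy reflects the class imbalance: most agreement on the NS class would be expected by chance alone.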
Fig 1. 95% confidence intervals for GI score by country (n > 39 abstracts).
X-axis denotes countries according to which GI score is aggregated. Y-axis denotes mean GI score per country. Means and 95% confidence intervals for each country reveal substantial variation, albeit with most averages falling within the 60–80 range.
Identification of homogeneous subsets among countries with respect to GI score.
| Country | N | 1 | 2 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Japan | 244 | 44.3 | |||||||||
| South Korea | 48 | 65.5 | |||||||||
| Iran | 54 | 66.5 | |||||||||
| Poland | 53 | 66.5 | |||||||||
| Turkey | 476 | 67.0 | |||||||||
| Greece | 86 | 68.4 | 68.4 | ||||||||
| China | 103 | 69.4 | 69.4 | 69.4 | |||||||
| Taiwan | 45 | 69.8 | 69.8 | 69.8 | |||||||
| Austria | 63 | 70.0 | 70.0 | 70.0 | |||||||
| Germany | 273 | 70.4 | 70.4 | 70.4 | 70.4 | ||||||
| Spain | 46 | 70.6 | 70.6 | 70.6 | 70.6 | ||||||
| France | 145 | 71.2 | 71.2 | 71.2 | 71.2 | ||||||
| Brazil | 78 | 71.8 | 71.8 | 71.8 | 71.8 | 71.8 | |||||
| Italy | 283 | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 | |||||
| Egypt | 42 | 72.3 | 72.3 | 72.3 | 72.3 | 72.3 | |||||
| Norway | 42 | 72.3 | 72.3 | 72.3 | 72.3 | 72.3 | |||||
| Finland | 54 | 72.5 | 72.5 | 72.5 | 72.5 | 72.5 | |||||
| Switzerland | 54 | 72.8 | 72.8 | 72.8 | 72.8 | 72.8 | |||||
| Sweden | 87 | 73.7 | 73.7 | 73.7 | 73.7 | 73.7 | |||||
| India | 136 | 74.3 | 74.3 | 74.3 | 74.3 | ||||||
| Israel | 139 | 75.6 | 75.6 | 75.6 | 75.6 | ||||||
| Denmark | 55 | 76.3 | 76.3 | 76.3 | 76.3 | ||||||
| Netherlands | 161 | 76.6 | 76.6 | 76.6 | |||||||
| UK | 262 | 80.1 | 80.1 | 80.1 | |||||||
| Australia | 71 | 81.3 | 81.3 | ||||||||
| Canada | 272 | 81.4 | 81.4 | ||||||||
| USA | 2085 | 82.0 | |||||||||
| Sig. | 1.00 | 0.07 | 0.05 | 0.11 | 0.06 | 0.08 | 0.13 | 0.24 | 0.07 | 1.00 |
Table of differences between means generated by Tukey HSD post-hoc test with statistical differences highlighted. Means for groups in homogeneous subsets are displayed. a. Uses Harmonic Mean Sample Size = 81.905. b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.
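The harmonic mean sample size in note (a) can be checked against the N column of the table. The snippet below is a minimal sketch that takes the 27 group sizes as listed and uses the standard-library `statistics.harmonic_mean`; the variable names are illustrative.

```python
from statistics import harmonic_mean

# Group sizes (N column), in table order from Japan to USA.
group_sizes = [
    244, 48, 54, 53, 476, 86, 103, 45, 63, 273, 46, 145, 78, 283,
    42, 42, 54, 54, 87, 136, 139, 55, 161, 262, 71, 272, 2085,
]

# With unequal groups, the harmonic mean stands in for a common group size.
harmonic_n = harmonic_mean(group_sizes)
print(round(harmonic_n, 3))
```

Because the harmonic mean is dominated by the smallest groups, it sits far below the arithmetic mean here despite the very large USA sample.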
Regression analysis: Coleman-Liau index vs GI score.
| Model | Term | B (unstandardized) | Std. Error | t | Sig. |
|---|---|---|---|---|---|
| 1 | (Constant) | 72.622 | 1.070 | 67.857 | 0.000 |
| 1 | CLI | 0.144 | 0.063 | 2.303 | 0.021 |
Table shows the output for a regression analysis in which CLI is modeled as the dependent variable and GI score as the independent variable. Results suggest an association between increasing text complexity (CLI) and a writing style closer to that of the ideal (higher GI score). These results could be interpreted to imply that a 6.9 point increase in GI score is equivalent to a 1 year increase in grade-level; however, further research would be needed to substantiate.
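As a rough arithmetic check on the table, each t statistic should be close to the coefficient divided by its standard error, and the 6.9-point figure appears to correspond to the reciprocal of the CLI coefficient. The sketch below works only from the rounded values printed in the table, so small discrepancies from the reported t values are expected. Reading 6.9 as 1/B is an observation about the numbers, not a confirmed description of how the authors derived it.

```python
# Rounded coefficients and standard errors as printed in the table.
b_constant, se_constant = 72.622, 1.070
b_cli, se_cli = 0.144, 0.063

# t statistic = coefficient / standard error (approximate, because inputs are rounded).
t_constant = b_constant / se_constant
t_cli = b_cli / se_cli

# Reciprocal of the slope: units of one variable per unit change in the other.
reciprocal_slope = 1 / b_cli

print(f"t (constant) ~ {t_constant:.2f}")
print(f"t (CLI)      ~ {t_cli:.2f}")
print(f"1 / B (CLI)  ~ {reciprocal_slope:.1f}")
```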
Fig 2. Analysis of correlation between national IELTS results and GI scores (n = 15).
Pearson correlation demonstrates a statistically significant, albeit minor, correlation between aggregate GI scores and IELTS scores for the countries where data exist. (A) Academic writing: correlation = 0.5485, p = 0.034. (B) General writing: correlation = 0.6009, p = 0.018.
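The reported p-values are consistent with the usual two-sided t-test for a Pearson correlation, where t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The sketch below assumes that this standard test was used (the caption does not say so explicitly) and evaluates the t distribution by numerical integration, using only the standard library. The function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_two_sided_p(r, n, steps=2000):
    """Two-sided p-value for a Pearson correlation r observed on n pairs."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the probability mass on [-t, t]; steps must be even.
    h = 2 * t / steps
    total = t_pdf(-t, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-t + i * h, df)
    central_mass = total * h / 3
    return 1 - central_mass


p_academic = pearson_two_sided_p(0.5485, 15)
p_general = pearson_two_sided_p(0.6009, 15)
print(f"academic writing: p ~ {p_academic:.3f}")
print(f"general writing:  p ~ {p_general:.3f}")
```

With only 15 countries, each test has 13 degrees of freedom, which is why correlations of this size reach significance only narrowly.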
Top terms according to discriminatory power (SVM weights).
| Terms for Japanese (nNS) | Weight | Terms for NS | Weight |
|---|---|---|---|
| showed | 0.060 | child | 0.044 |
| although | 0.056 | transplant | 0.036 |
| detected | 0.056 | found to | 0.033 |
| serum | 0.055 | patients had | 0.029 |
| after | 0.050 | secondary | 0.028 |
| having | 0.050 | anemia | 0.028 |
| analyzed | 0.048 | agents | 0.027 |
| because | 0.046 | protocols | 0.026 |
| without | 0.045 | hematologic | 0.025 |
| old | 0.038 | post | 0.025 |
| that the | 0.037 | negative | 0.024 |
| the patient | 0.036 | review | 0.023 |
| transplantation | 0.035 | demonstrated | 0.023 |
| however | 0.035 | consistent | 0.023 |
| year old | 0.034 | compared to | 0.023 |
| was performed | 0.034 | evaluated | 0.023 |
| remission | 0.034 | reviewed | 0.023 |
| infection | 0.034 | well | 0.022 |
| cell transplantation | 0.033 | diagnosed with | 0.022 |
| stem cell transplantation | 0.032 | receiving | 0.022 |
| mutation | 0.032 | other | 0.022 |
| cases | 0.031 | we present | 0.022 |
| course | 0.031 | literature | 0.021 |
| with acute | 0.030 | commonly | 0.021 |
| should be | 0.029 | effects of | 0.021 |
| age of | 0.029 | all patients | 0.021 |
| after the | 0.029 | previously | 0.021 |
| diagnosed | 0.028 | risk of | 0.021 |
| acute | 0.027 | common | 0.021 |
| followed by | 0.027 | survival | 0.021 |
Ranked list of n-grams showing which terms are most significant for differentiating the writing of NS and Japanese researchers in this study. Due to the characteristics of SVM models, weights should only be interpreted as ordinal rankings; no linear relationship should be inferred.
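The terms in the table range from single words to three-word phrases, which suggests word n-grams of length 1 to 3 as features. The sketch below illustrates that kind of extraction under simple assumptions: lowercase alphabetic tokens only, and a hypothetical example sentence. It is not the authors' preprocessing pipeline, and the helper name `word_ngrams` is illustrative.

```python
import re


def word_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, in order of n and then position."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical sentence built from terms that appear in the table.
example = "Stem cell transplantation was performed after the remission."
features = word_ngrams(example)
print(features)
```

A linear classifier trained on such features assigns one weight per n-gram. As the caption notes, those weights are best read as a ranking rather than as magnitudes.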
Count/percentage of abstracts with GI score falling in 95%-ile.
| Country | Count | Percentage |
|---|---|---|
| USA | 203 | 10% |
| Canada | 21 | 8% |
| UK | 14 | 5% |
| The Netherlands | 11 | 7% |
| Israel | 10 | 7% |
| Australia | 8 | 11% |
| Italy | 5 | 2% |
| France | 4 | 3% |
| India | 4 | 3% |
| Sweden | 2 | 2% |
| Denmark | 2 | 4% |
| Germany | 2 | 1% |
| Greece | 1 | 1% |
| Turkey | 1 | 0% |
| Switzerland | 1 | 2% |
| Austria | 1 | 2% |
| Norway | 1 | 2% |
Percentage of abstracts with “superior” scores within each country. “Superior” defined to be scores falling within the 95%-ile (i.e., greater than 89). Countries with no abstracts within the 95%-ile have not been included in this table.
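The percentages in this table can be reproduced by dividing each count by the country's total number of abstracts, taken from the table of abstracts per country, and rounding to the nearest whole percent. The following sketch does that for every listed country; the dictionary names are illustrative.

```python
# country: (abstracts in the 95%-ile, total abstracts for that country)
counts = {
    "USA": (203, 2085), "Canada": (21, 272), "UK": (14, 262),
    "The Netherlands": (11, 161), "Israel": (10, 139), "Australia": (8, 71),
    "Italy": (5, 283), "France": (4, 145), "India": (4, 136),
    "Sweden": (2, 87), "Denmark": (2, 55), "Germany": (2, 273),
    "Greece": (1, 86), "Turkey": (1, 476), "Switzerland": (1, 54),
    "Austria": (1, 63), "Norway": (1, 42),
}

# Percentages as printed in the table, in whole percent.
reported = {
    "USA": 10, "Canada": 8, "UK": 5, "The Netherlands": 7, "Israel": 7,
    "Australia": 11, "Italy": 2, "France": 3, "India": 3, "Sweden": 2,
    "Denmark": 4, "Germany": 1, "Greece": 1, "Turkey": 0, "Switzerland": 2,
    "Austria": 2, "Norway": 2,
}

recomputed = {country: round(100 * top / total) for country, (top, total) in counts.items()}

for country, pct in recomputed.items():
    flag = "" if pct == reported[country] else "  <- differs from table"
    print(f"{country:16s} {pct:3d}%{flag}")
```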