Danushka Bollegala, Georgios Kontonatsios, Sophia Ananiadou.
Abstract
Bilingual dictionaries of technical terms, such as biomedical terms, are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later manually translated into other languages. Although large monolingual lexicons of biomedical terms exist, only a fraction of those terms have been translated into other languages. Manually compiling large-scale bilingual dictionaries for technical domains is challenging because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting the most similar translation candidates for a biomedical term given in one language (the source) from another language (the target). Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features, consisting of character n-grams extracted from the term under consideration, and (b) extrinsic features, consisting of unigrams and bigrams extracted from the contextual windows surrounding the term. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP), a non-negative lower-dimensional vector projection method. Second, we propose a method for learning a mapping between the source and target feature spaces using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task.
Moreover, our experimental results covering several language pairs, such as English-French, English-Spanish, English-Greek, and English-Japanese, show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks.
Year: 2015 PMID: 26030738 PMCID: PMC4452086 DOI: 10.1371/journal.pone.0126196
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
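The two feature types described in the abstract can be illustrated with a short sketch; the function names, n-gram orders, and window size below are our own illustrative choices, not the paper's:

```python
from collections import Counter

def char_ngrams(term, n_values=(2, 3, 4)):
    """Intrinsic features: character n-grams of a term.
    The n-gram orders here are illustrative, not the paper's setting."""
    feats = Counter()
    for n in n_values:
        for i in range(len(term) - n + 1):
            feats[term[i:i + n]] += 1
    return feats

def context_features(tokens, position, window=3):
    """Extrinsic features: unigrams and bigrams from a contextual window
    around one occurrence of the term (token at `position`)."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    ctx = tokens[lo:position] + tokens[position + 1:hi]
    feats = Counter(ctx)                 # unigrams
    for a, b in zip(ctx, ctx[1:]):       # bigrams over the concatenated
        feats[a + "_" + b] += 1          # context (a simplification)
    return feats
```

In practice each term would be represented by the union of such features aggregated over all of its occurrences in a corpus.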
Algorithm 1 Prototype Vector Projection.
1: Initialise the set of prototype vectors 𝓟 = {}
2: Let 𝓧 be the set of input feature vectors
3: repeat
4: Compute the centroid vector of the vectors in 𝓧
5: Compute the score of each vector in 𝓧
6: Select the highest-scoring vector
7: Add the selected vector to 𝓟
8: Remove the selected vector from 𝓧
9: until the required number of prototype vectors has been selected
10: Perform Gram-Schmidt orthonormalization on 𝓟 to obtain an orthonormal set of unit-length basis vectors
11-14: (projection onto the resulting basis; the mathematical details of these steps were lost in extraction)
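Most of the mathematical detail of Algorithm 1 was lost in extraction, but its overall shape survives: greedily select prototype vectors, Gram-Schmidt orthonormalize them, and project feature vectors onto the resulting basis. A minimal NumPy sketch under those assumptions (the centroid-similarity score used for selection here is an illustrative stand-in for the paper's scoring function, which did not survive extraction):

```python
import numpy as np

def pvp_basis(X, k):
    """Greedily select k prototype vectors from the rows of X, then
    Gram-Schmidt orthonormalize them into a basis.
    The centroid-similarity score is an illustrative stand-in for the
    paper's selection criterion."""
    X = X.astype(float)
    remaining = list(range(X.shape[0]))
    prototypes = []
    for _ in range(k):
        centroid = X[remaining].mean(axis=0)
        scores = [X[i] @ centroid for i in remaining]
        best = remaining[int(np.argmax(scores))]
        prototypes.append(X[best])
        remaining.remove(best)
    # Gram-Schmidt orthonormalization of the selected prototypes
    basis = []
    for v in prototypes:
        for b in basis:
            v = v - (v @ b) * b
        norm = np.linalg.norm(v)
        if norm > 1e-12:          # skip (near-)dependent prototypes
            basis.append(v / norm)
    return np.array(basis)

def pvp_project(X, basis):
    """Project feature vectors onto the prototype basis (k-dimensional output)."""
    return X @ basis.T
```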
Algorithm 2 Learning a Cross-Lingual Mapping.
(The body of this algorithm, an iterative procedure based on partial least squares regression, did not survive extraction; only fragmentary step labels remain: "1: Randomly select …", "6: If …", "10: Stop if …", "11: Let …".)
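Although the body of Algorithm 2 did not survive extraction, the abstract identifies the underlying technique as partial least squares regression. The following is a generic NIPALS-style PLSR sketch, not a reconstruction of the paper's exact procedure: it learns a linear mapping from source-language feature vectors X to target-language feature vectors Y:

```python
import numpy as np

def plsr_fit(X, Y, n_components, n_iter=500, tol=1e-12):
    """Partial least squares regression via the NIPALS algorithm.
    Returns (B, x_mean, y_mean) so that (X_new - x_mean) @ B + y_mean
    approximates the target-language representation."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xr, Yr = X - x_mean, Y - y_mean
    W, P, C = [], [], []
    for _ in range(n_components):
        u = Yr[:, :1].copy()                    # initial target score vector
        for _ in range(n_iter):
            w = Xr.T @ u
            w /= np.linalg.norm(w)              # source weight vector
            t = Xr @ w                          # source score vector
            c = Yr.T @ t / (t.T @ t)            # target loading
            u_new = Yr @ c / (c.T @ c)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        p = Xr.T @ t / (t.T @ t)                # source loading
        Xr = Xr - t @ p.T                       # deflate both blocks
        Yr = Yr - t @ c.T
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    B = W @ np.linalg.inv(P.T @ W) @ C.T        # regression coefficients
    return B, x_mean, y_mean
```

Given the fitted mapping, translation candidates for a source term would be ranked by the similarity between its mapped vector and each target-language term vector.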
Fig 1. Nearest neighbor prediction with artificial data. Pearson’s r correlation coefficients for different dimensionality reduction methods under varying dimensionalities (1000-dimensional feature vectors).
Fig 2. Nearest neighbor prediction with artificial data. Kendall’s τ correlation coefficients for different dimensionality reduction methods under varying dimensionalities (1000-dimensional feature vectors).
Fig 3. Nearest neighbor prediction with artificial data. Pearson’s r correlation coefficients for different dimensionality reduction methods under varying dimensionalities (10000-dimensional feature vectors).
Fig 4. Nearest neighbor prediction with artificial data. Kendall’s τ correlation coefficients for different dimensionality reduction methods under varying dimensionalities (10000-dimensional feature vectors).
Fig 5. Nearest neighbor prediction with English feature vectors. Pearson’s r correlation coefficients for different dimensionality reduction methods under varying dimensionalities (1000-dimensional feature vectors).
Fig 6. Nearest neighbor prediction with English feature vectors. Kendall’s τ correlation coefficients for different dimensionality reduction methods under varying dimensionalities (1000-dimensional feature vectors).
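The figures report Pearson’s r and Kendall’s τ correlation coefficients; exactly which score vectors are paired is defined in the paper's body, not here. A dependency-light sketch of the two coefficients themselves:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: linear correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs,
    by direct O(n^2) pair counting (assumes no ties for simplicity)."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((x[i] - x[j]) * (y[i] - y[j]))
    return 2.0 * s / (n * (n - 1))
```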
Precision@rank values for English as the source language and different target languages using character n-gram features.
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.263 | 0.121 | 0.091 | 0.297 |
| @2 | 0.369 | 0.194 | 0.163 | 0.415 |
| @3 | 0.439 | 0.254 | 0.238 | 0.494 |
| @4 | 0.506 | 0.337 | 0.289 | 0.539 |
| @5 | 0.543 | 0.377 | 0.345 | 0.574 |
| @6 | 0.564 | 0.415 | 0.381 | 0.597 |
| @7 | 0.590 | 0.458 | 0.411 | 0.615 |
| @8 | 0.614 | 0.501 | 0.470 | 0.624 |
| @9 | 0.636 | 0.536 | 0.509 | 0.640 |
| @10 | 0.647 | 0.556 | 0.546 | 0.653 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.145 | 0.086 | 0.112 | 0.130 |
| @2 | 0.243 | 0.150 | 0.166 | 0.200 |
| @3 | 0.304 | 0.215 | 0.228 | 0.261 |
| @4 | 0.358 | 0.270 | 0.263 | 0.324 |
| @5 | 0.401 | 0.319 | 0.322 | 0.385 |
| @6 | 0.443 | 0.361 | 0.370 | 0.420 |
| @7 | 0.478 | 0.401 | 0.412 | 0.457 |
| @8 | 0.508 | 0.435 | 0.449 | 0.492 |
| @9 | 0.549 | 0.490 | 0.476 | 0.521 |
| @10 | 0.576 | 0.532 | 0.517 | 0.543 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.07 | 0.089 | 0.101 | 0.224 |
| @2 | 0.099 | 0.170 | 0.166 | 0.305 |
| @3 | 0.136 | 0.235 | 0.224 | 0.351 |
| @4 | 0.172 | 0.284 | 0.284 | 0.375 |
| @5 | 0.191 | 0.331 | 0.333 | 0.390 |
| @6 | 0.211 | 0.369 | 0.383 | 0.403 |
| @7 | 0.242 | 0.420 | 0.426 | 0.409 |
| @8 | 0.256 | 0.456 | 0.458 | 0.418 |
| @9 | 0.274 | 0.490 | 0.491 | 0.426 |
| @10 | 0.296 | 0.529 | 0.528 | 0.431 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.048 | 0.041 | 0.018 | 0.162 |
| @2 | 0.066 | 0.075 | 0.043 | 0.193 |
| @3 | 0.097 | 0.094 | 0.060 | 0.223 |
| @4 | 0.127 | 0.115 | 0.086 | 0.250 |
| @5 | 0.142 | 0.143 | 0.103 | 0.261 |
| @6 | 0.164 | 0.153 | 0.126 | 0.274 |
| @7 | 0.179 | 0.168 | 0.140 | 0.287 |
| @8 | 0.204 | 0.191 | 0.168 | 0.297 |
| @9 | 0.218 | 0.210 | 0.191 | 0.303 |
| @10 | 0.232 | 0.224 | 0.208 | 0.314 |
Precision@rank values for English as the source language and different target languages using contextual features.
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.083 | 0.038 | 0.063 | 0.090 |
| @2 | 0.148 | 0.095 | 0.111 | 0.173 |
| @3 | 0.202 | 0.141 | 0.177 | 0.231 |
| @4 | 0.248 | 0.189 | 0.223 | 0.275 |
| @5 | 0.294 | 0.231 | 0.261 | 0.325 |
| @6 | 0.319 | 0.266 | 0.300 | 0.369 |
| @7 | 0.347 | 0.295 | 0.332 | 0.402 |
| @8 | 0.371 | 0.323 | 0.361 | 0.437 |
| @9 | 0.394 | 0.352 | 0.404 | 0.459 |
| @10 | 0.420 | 0.379 | 0.437 | 0.478 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.070 | 0.042 | 0.087 | 0.103 |
| @2 | 0.121 | 0.094 | 0.143 | 0.182 |
| @3 | 0.185 | 0.145 | 0.192 | 0.245 |
| @4 | 0.241 | 0.177 | 0.230 | 0.289 |
| @5 | 0.273 | 0.217 | 0.289 | 0.341 |
| @6 | 0.317 | 0.250 | 0.337 | 0.389 |
| @7 | 0.354 | 0.295 | 0.382 | 0.436 |
| @8 | 0.382 | 0.332 | 0.415 | 0.471 |
| @9 | 0.419 | 0.368 | 0.462 | 0.498 |
| @10 | 0.457 | 0.397 | 0.487 | 0.536 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.044 | 0.031 | 0.038 | 0.036 |
| @2 | 0.068 | 0.063 | 0.066 | 0.085 |
| @3 | 0.103 | 0.080 | 0.085 | 0.119 |
| @4 | 0.129 | 0.105 | 0.110 | 0.164 |
| @5 | 0.151 | 0.125 | 0.134 | 0.184 |
| @6 | 0.178 | 0.147 | 0.152 | 0.225 |
| @7 | 0.197 | 0.165 | 0.170 | 0.238 |
| @8 | 0.223 | 0.191 | 0.192 | 0.264 |
| @9 | 0.240 | 0.212 | 0.204 | 0.285 |
| @10 | 0.256 | 0.236 | 0.219 | 0.298 |
| Rank | No reduction | SVD | NMF | PVP |
| @1 | 0.037 | 0.018 | 0.027 | 0.032 |
| @2 | 0.056 | 0.031 | 0.046 | 0.064 |
| @3 | 0.090 | 0.044 | 0.057 | 0.080 |
| @4 | 0.094 | 0.057 | 0.071 | 0.102 |
| @5 | 0.108 | 0.068 | 0.080 | 0.126 |
| @6 | 0.116 | 0.083 | 0.109 | 0.134 |
| @7 | 0.133 | 0.093 | 0.129 | 0.149 |
| @8 | 0.144 | 0.106 | 0.140 | 0.159 |
| @9 | 0.151 | 0.114 | 0.153 | 0.170 |
| @10 | 0.168 | 0.126 | 0.168 | 0.179 |
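Precision@rank (often written Precision@k), as reported in the tables above, is the fraction of source terms whose gold-standard translation appears among the top-k ranked target-language candidates. A minimal sketch (the data structures are illustrative):

```python
def precision_at_k(ranked_candidates, gold, k):
    """Fraction of source terms whose gold translation appears among
    the top-k ranked target-language candidates.

    ranked_candidates: dict mapping each source term to its candidate
                       translations, sorted best-first.
    gold:              dict mapping each source term to its correct translation.
    """
    hits = sum(1 for term, cands in ranked_candidates.items()
               if gold[term] in cands[:k])
    return hits / len(ranked_candidates)
```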