Mikael Nyström, Magnus Merkel, Håkan Petersson, Hans Ahlfeldt.
Abstract
BACKGROUND: Automatic word alignment of parallel texts with the same content in different languages is used, among other things, to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, ICF, NCSP, KSH97-P and parts of MeSH, and on how the terminology systems and the type of resources influence the quality.
Year: 2007 PMID: 18036221 PMCID: PMC2267171 DOI: 10.1186/1472-6947-7-37
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Partition characteristics
| Partition | Content | Word correlation | Character correlation | Word ratio difference | Rubrics | English rubric average number (standard deviation) of words | Swedish rubric average number (standard deviation) of words | English unique words | Swedish unique words | English unique words per rubric | Swedish unique words per rubric |
| All | All terminology systems | 0.78 | 0.79 | | 38,575 | 3.7 (2.9) | 3.3 (3.0) | 17,679 | 25,848 | 0.5 | 0.7 |
| 1 | MeSH, one word in either English or Swedish rubric | 0.56 | 13,514 | 1.5 (0.7) | 1.0 (0.1) | 11,267 | 13,581 | 0.8 | 1.0 | ||
| 2 | MeSH, more than one word in both English and Swedish rubrics | 0.52 | 0.71 | 0.30 | 5,568 | 2.6 (0.8) | 2.3 (0.7) | 5,434 | 6,443 | 1.0 | 1.2 |
| 3 | ICF, whole | 0.69 | 0.79 | 0.53 | 1,496 | 4.7 (2.5) | 4.2 (2.8) | 991 | 1,263 | 0.7 | 0.8 |
| 4 | KSH97-P, whole | 0.70 | 0.67 | 0.49 | 968 | 4.0 (2.5) | 3.5 (2.4) | 1,324 | 1,382 | 1.4 | 1.4 |
| 5 | ICD-10, except chapter 2 level 4 | 0.77 | 0.75 | 0.37 | 10,791 | 5.2 (3.0) | 5.2 (3.4) | 5,144 | 7,219 | 0.5 | 0.7 |
| 6 | NCSP, except chapter N | 0.64 | 0.63 | 0.38 | 4,137 | 5.8 (2.7) | 5.0 (2.5) | 1,758 | 2,347 | 0.4 | 0.6 |
| 7 | ICD-10, chapter 2 level 4 | 0.38 | 0.45 | 0.71 | 713 | 3.6 (2.2) | 6.3 (2.7) | 443 | 535 | 0.6 | 0.8 |
| 8 | NCSP, chapter N | 0.55 | 0.48 | 0.25 | 1,388 | 9.4 (2.6) | 7.7 (2.3) | 249 | 285 | 0.2 | 0.2 |
Content of the partitions.
Kendall's tau-b correlation between the English rubrics and the corresponding Swedish rubrics with respect to number of words and number of characters, and the average absolute difference between the word ratios of the rubrics in the partition and the grand mean, for the different terminology partitions.
Number of parallel rubrics, average number and standard deviation of words per rubric, number of unique words, and average number of unique words per rubric for the different terminology partitions.
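As a minimal sketch of how the word and character correlations in the table could be computed, the following implements Kendall's tau-b with the usual tie correction. The helper names (`kendall_tau_b`, `word_count`) and the rubric pairs are illustrative assumptions and are not taken from the evaluated terminology systems; counting words by whitespace splitting is also an assumption.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denominator


def word_count(rubric):
    """Number of whitespace-separated words (an assumed tokenization)."""
    return len(rubric.split())


# Illustrative rubric pairs only (assumption, not data from the study).
rubric_pairs = [
    ("Heart diseases", "Hjärtsjukdomar"),
    ("Diabetes mellitus", "Diabetes mellitus"),
    ("Fracture of upper end of humerus", "Fraktur på övre delen av humerus"),
    ("Asthma", "Astma"),
]

tau_words = kendall_tau_b(
    [word_count(en) for en, _ in rubric_pairs],
    [word_count(sv) for _, sv in rubric_pairs],
)
tau_chars = kendall_tau_b(
    [len(en) for en, _ in rubric_pairs],
    [len(sv) for _, sv in rubric_pairs],
)
print(f"word tau-b: {tau_words:.2f}, character tau-b: {tau_chars:.2f}")
```

Tau-b is a natural choice here because rubric lengths are small integers and ties are frequent; a coefficient without tie correction would understate the denominator adjustment.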
Automatic word alignment results from batch 1
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 1 | 1 | X | X | 0.54 | 0.56 | 0.55 | ||||||||||||||||||||||
| 1 | 2 | X | X | 0.48 | 0.56 | 0.52 | ||||||||||||||||||||||
| 1 | 3 | X | X | 0.38 | 0.41 | 0.39 | ||||||||||||||||||||||
| 1 | 4 | X | X | 0.43 | 0.48 | 0.45 | ||||||||||||||||||||||
| 1 | 5 | X | X | 0.41 | 0.46 | 0.43 | ||||||||||||||||||||||
| 1 | 6 | X | X | 0.40 | 0.36 | 0.38 | ||||||||||||||||||||||
| 1 | 7 | X | X | 0.46 | 0.52 | 0.49 | ||||||||||||||||||||||
| 1 | 8 | X | X | X | X | X | 0.68 | 0.68 | 0.68 | |||||||||||||||||||
| 1 | 9 | X | X | X | X | X | 0.72 | 0.75 | 0.73 | |||||||||||||||||||
| 1 | 10 | X | X | X | X | X | 0.60 | 0.60 | 0.60 | |||||||||||||||||||
| 1 | 11 | X | X | X | X | X | 0.69 | 0.67 | 0.68 | |||||||||||||||||||
| 1 | 12 | X | X | X | X | X | 0.65 | 0.64 | 0.64 | |||||||||||||||||||
| 1 | 13 | X | X | X | X | X | 0.53 | 0.48 | 0.50 | |||||||||||||||||||
| 1 | 14 | X | X | X | X | X | 0.71 | 0.71 | 0.71 | |||||||||||||||||||
| 1 | 15 | X | X | X | X | X | X | 0.72 | 0.70 | 0.71 | ||||||||||||||||||
| 1 | 16 | X | X | X | X | X | X | 0.80 | 0.80 | 0.80 | ||||||||||||||||||
| 1 | 17 | X | X | X | X | X | X | 0.71 | 0.68 | 0.69 | ||||||||||||||||||
| 1 | 18 | X | X | X | X | X | X | 0.80 | 0.78 | 0.79 | ||||||||||||||||||
| 1 | 19 | X | X | X | X | X | X | 0.83 | 0.78 | 0.80 | ||||||||||||||||||
| 1 | 20 | X | X | X | X | X | X | 0.63 | 0.58 | 0.60 | ||||||||||||||||||
| 1 | 21 | X | X | X | X | X | X | 0.84 | 0.85 | 0.84 | ||||||||||||||||||
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were generated from the same partition as the aligned partition. The configurations CfStatistical, CfStatisticalStatic and CfStatisticalStaticTraining were used.
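The three result columns can be related with a short sketch. It assumes that the F-score is the harmonic mean of recall and precision, which is consistent with the rows above, and that links are scored as exact matches against a gold standard; the section does not state how partially correct links were scored, so `recall_precision` and the example links are illustrative assumptions.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (balanced F-score)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def recall_precision(proposed_links, gold_links):
    """Exact-match scoring of proposed word links against gold-standard links."""
    proposed, gold = set(proposed_links), set(gold_links)
    correct = len(proposed & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(proposed) if proposed else 0.0
    return recall, precision


# Reported (recall, precision) pairs from batch 1, runs 1, 9 and 21.
for recall, precision in [(0.54, 0.56), (0.72, 0.75), (0.84, 0.85)]:
    print(recall, precision, round(f_score(recall, precision), 2))

# Hypothetical word links (assumption, not data from the study).
gold = {("heart", "hjärta"), ("disease", "sjukdom"),
        ("chronic", "kronisk"), ("failure", "svikt")}
proposed = {("heart", "hjärta"), ("disease", "sjukdom"), ("of", "av")}
r, p = recall_precision(proposed, gold)
print(round(r, 2), round(p, 2), round(f_score(r, p), 2))
```

Because the tabulated recall and precision are already rounded to two decimals, recomputing the F-score from them can occasionally differ from the reported value in the last digit.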
Automatic word alignment results from batch 2
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 2 | 1 | X | X | X | X | X | X | X | X | 0.34 | 0.40 | 0.37 | ||||||||||||||||
| 2 | 2 | X | X | X | X | X | X | X | X | 0.34 | 0.40 | 0.37 | ||||||||||||||||
| 2 | 3 | X | X | X | X | X | X | X | X | 0.34 | 0.40 | 0.37 | ||||||||||||||||
| 2 | 4 | X | X | X | X | X | X | X | X | 0.40 | 0.46 | 0.43 | ||||||||||||||||
| 2 | 5 | X | X | X | X | X | X | X | X | 0.35 | 0.41 | 0.38 | ||||||||||||||||
| 2 | 6 | X | X | X | X | X | X | X | X | 0.33 | 0.40 | 0.36 | ||||||||||||||||
| 2 | 7 | X | X | X | X | X | X | X | X | 0.35 | 0.41 | 0.38 | ||||||||||||||||
| 2 | 8 | X | X | X | X | X | X | X | X | X | X | X | 0.60 | 0.62 | 0.61 | |||||||||||||
| 2 | 9 | X | X | X | X | X | X | X | X | X | X | X | 0.60 | 0.62 | 0.61 | |||||||||||||
| 2 | 10 | X | X | X | X | X | X | X | X | X | X | X | 0.61 | 0.62 | 0.61 | |||||||||||||
| 2 | 11 | X | X | X | X | X | X | X | X | X | X | X | 0.65 | 0.65 | 0.65 | |||||||||||||
| 2 | 12 | X | X | X | X | X | X | X | X | X | X | X | 0.61 | 0.63 | 0.62 | |||||||||||||
| 2 | 13 | X | X | X | X | X | X | X | X | X | X | X | 0.60 | 0.62 | 0.61 | |||||||||||||
| 2 | 14 | X | X | X | X | X | X | X | X | X | X | X | 0.61 | 0.63 | 0.62 | |||||||||||||
| 2 | 15 | X | X | X | X | X | X | X | X | X | X | X | X | 0.65 | 0.65 | 0.65 | ||||||||||||
| 2 | 16 | X | X | X | X | X | X | X | X | X | X | X | X | 0.62 | 0.63 | 0.62 | ||||||||||||
| 2 | 17 | X | X | X | X | X | X | X | X | X | X | X | X | 0.64 | 0.64 | 0.64 | ||||||||||||
| 2 | 18 | X | X | X | X | X | X | X | X | X | X | X | X | 0.75 | 0.73 | 0.74 | ||||||||||||
| 2 | 19 | X | X | X | X | X | X | X | X | X | X | X | X | 0.67 | 0.67 | 0.67 | ||||||||||||
| 2 | 20 | X | X | X | X | X | X | X | X | X | X | X | X | 0.64 | 0.65 | 0.64 | ||||||||||||
| 2 | 21 | X | X | X | X | X | X | X | X | X | X | X | X | 0.64 | 0.65 | 0.64 | ||||||||||||
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were generated from a single partition and all partitions were automatically aligned. The configurations CfStatistical, CfStatisticalStatic and CfStatisticalStaticTraining were used.
Automatic word alignment results from batch 3
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 3 | 1 | X | X | X | X | X | X | X | X | 0.34 | 0.40 | 0.37 | ||||||||||||||||
| 3 | 2 | X | X | X | X | X | X | X | X | X | 0.41 | 0.46 | 0.43 | |||||||||||||||
| 3 | 3 | X | X | X | X | X | X | X | X | X | X | 0.46 | 0.50 | 0.48 | ||||||||||||||
| 3 | 4 | X | X | X | X | X | X | X | X | X | X | X | 0.53 | 0.55 | 0.54 | |||||||||||||
| 3 | 5 | X | X | X | X | X | X | X | X | X | X | X | X | 0.57 | 0.58 | 0.57 | ||||||||||||
| 3 | 6 | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.59 | 0.59 | 0.59 | |||||||||||
| 3 | 7 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.61 | 0.61 | 0.61 | ||||||||||
| 3 | 8 | X | X | X | X | X | X | X | X | X | X | X | 0.60 | 0.62 | 0.61 | |||||||||||||
| 3 | 9 | X | X | X | X | X | X | X | X | X | X | X | X | 0.62 | 0.63 | 0.62 | ||||||||||||
| 3 | 10 | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.65 | 0.65 | 0.65 | |||||||||||
| 3 | 11 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.69 | 0.67 | 0.68 | ||||||||||
| 3 | 12 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.71 | 0.68 | 0.69 | |||||||||
| 3 | 13 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.72 | 0.69 | 0.70 | ||||||||
| 3 | 14 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.72 | 0.69 | 0.70 | |||||||
| 3 | 15 | X | X | X | X | X | X | X | X | X | X | X | X | 0.65 | 0.65 | 0.65 | ||||||||||||
| 3 | 16 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.67 | 0.66 | 0.66 | ||||||||||
| 3 | 17 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.69 | 0.67 | 0.68 | ||||||||
| 3 | 18 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.77 | 0.74 | 0.75 | ||||||
| 3 | 19 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.80 | 0.76 | 0.78 | ||||
| 3 | 20 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.80 | 0.77 | 0.78 | ||
| 3 | 21 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 |
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were used cumulatively and all partitions were automatically aligned. The configurations CfStatistical, CfStatisticalStatic and CfStatisticalStaticTraining were used.
Automatic word alignment results from batch 4
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 4 | 1 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.73 | 0.71 | 0.72 | ||||||
| 4 | 2 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.81 | 0.81 | ||||||
| 4 | 3 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.76 | 0.78 | ||||||
| 4 | 4 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.78 | 0.79 | ||||||
| 4 | 5 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.84 | 0.79 | 0.81 | ||||||
| 4 | 6 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.74 | 0.65 | 0.69 | ||||||
| 4 | 7 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.85 | 0.85 | 0.85 | ||||||
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were generated from all partitions and a single partition was automatically aligned. The configuration CfStatisticalStaticTraining was used.
Automatic word alignment results from batch 5
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 5 | 1 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.80 | 0.77 | 0.78 | |||
| 5 | 2 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.80 | 0.76 | 0.78 | ||
| 5 | 3 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 | ||
| 5 | 4 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 | |
| 5 | 5 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.76 | 0.78 | ||
| 5 | 6 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 | |
| 5 | 7 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 | |
| 5 | 8 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 |
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were generated from all partitions and all partitions were automatically aligned. The use of the standard resources, the part-of-speech blocker and the MeSH dictionary was instead varied across the 8 possible combinations.
Automatic word alignment results from batch 6
| Batch | Run | Resources | Alignment | Result | ||||||||||||||||||||||||
| Static | Statistic | Dynamic | ||||||||||||||||||||||||||
| Stan | POS | MeSH | Statistic partition | Training partition | Test partition | Recall | Precision | F-score | ||||||||||||||||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||||||||
| 6 | 1 | X | X | X | X | X | X | X | 0.29 | 0.38 | 0.33 | |||||||||||||||||
| 6 | 2 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.77 | 0.76 | 0.76 | ||||||||||
| 6 | 3 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.58 | 0.61 | 0.59 | ||||||||||
| 6 | 4 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.80 | 0.77 | 0.78 | |||
| 6 | 5 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.79 | 0.77 | 0.78 | |||||||
| 6 | 6 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.72 | 0.69 | 0.70 | |||||||
| 6 | 7 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | 0.81 | 0.77 | 0.79 |
Recall, precision and F-score from the automatic word alignment when resources for the automatic word alignment were generated from all partitions and all partitions were automatically aligned. The use of the static resources, statistical resources and training resources was varied as shown in the table.
Figure 1. The ITools suite. The heart of the ITools suite is the ITrix application, which performs the automatic word alignment. Three other tools are applied before ITrix in the pre-alignment phase: IFDG for tagging, IStat for statistical processing, and ILink for training and for creating the resources used by ITrix. After the word alignment, the candidate term pairs are converted into an SQL database by Termbase Manager, and the candidate term pairs can be revised graphically in the IView application.
Candidate term pairs evaluation results
| Base forms | Inflected forms | |
| Correct | 23,737 | 28,342 |
| Partly correct | 4,081 | 4,401 |
| Incorrect | 1,617 | 1,691 |
| Total | 29,435 | 34,434 |
Numbers of correct, partly correct and incorrect term pairs after the manual evaluation of the candidate term pairs. Results for term pairs grouped together in their base forms and in their inflected forms are shown.
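To make the proportions behind these counts explicit, the following sketch recomputes the totals and the share of each evaluation category from the table. The dictionary layout and the `shares` helper are illustrative assumptions; only the counts come from the table.

```python
# Counts copied from the evaluation table above.
evaluation_counts = {
    "base forms": {"correct": 23737, "partly correct": 4081, "incorrect": 1617},
    "inflected forms": {"correct": 28342, "partly correct": 4401, "incorrect": 1691},
}


def shares(category_counts):
    """Fraction of term pairs falling into each evaluation category."""
    total = sum(category_counts.values())
    return {label: count / total for label, count in category_counts.items()}


for grouping, category_counts in evaluation_counts.items():
    total = sum(category_counts.values())
    formatted = ", ".join(
        f"{label}: {share:.1%}" for label, share in shares(category_counts).items()
    )
    print(f"{grouping} (n={total}): {formatted}")
```

The recomputed totals agree with the Total row, and in both groupings roughly four out of five candidate term pairs were judged correct.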
Figure 2. Included term pairs per qvalue. The cumulative number of correct, partly correct and incorrect term pairs included for a specific qvalue. (All term pairs with a qvalue equal to or greater than the given qvalue are included.) Only qvalues equal to or smaller than 2 are included in the figure.
Figure 3. Recall and precision per qvalue. The recall and precision for a specific qvalue. (All term pairs with a qvalue equal to or greater than the given qvalue are included.) Only qvalues equal to or smaller than 2 are included in the figure.
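The cumulative inclusion rule described in the two figure captions can be sketched as follows. The `TermPair` type, the example qvalues and the choice to count only fully correct pairs as hits are assumptions for illustration; the captions do not specify how partly correct pairs enter recall and precision.

```python
from dataclasses import dataclass

LABELS = ("correct", "partly_correct", "incorrect")


@dataclass(frozen=True)
class TermPair:
    qvalue: float
    label: str  # one of LABELS


def stats_at_threshold(pairs, threshold):
    """Counts, recall and precision for all pairs with qvalue >= threshold."""
    included = [p for p in pairs if p.qvalue >= threshold]
    counts = {label: sum(1 for p in included if p.label == label) for label in LABELS}
    total_correct = sum(1 for p in pairs if p.label == "correct")
    # Assumption: only fully correct pairs count as hits.
    precision = counts["correct"] / len(included) if included else 0.0
    recall = counts["correct"] / total_correct if total_correct else 0.0
    return counts, recall, precision


# Hypothetical candidate term pairs (not data from the study).
candidates = [
    TermPair(2.0, "correct"),
    TermPair(1.5, "correct"),
    TermPair(1.2, "partly_correct"),
    TermPair(1.0, "correct"),
    TermPair(0.6, "incorrect"),
    TermPair(0.4, "correct"),
]

for threshold in (1.0, 0.5, 0.0):
    counts, recall, precision = stats_at_threshold(candidates, threshold)
    print(threshold, counts, round(recall, 2), round(precision, 2))
```

Lowering the threshold can only add term pairs, so the cumulative counts and the recall never decrease, while precision may move in either direction.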
Intra-rater reliability results
| Part | Number of links in original alignment | Number of links in repeated alignment | Number of mutual links | F-score |
| 2 | 618 | 619 | 561 | 0.91 |
| 3 | 280 | 282 | 257 | 0.91 |
| 4 | 136 | 139 | 131 | 0.95 |
| 5 | 2,416 | 2,446 | 2,314 | 0.95 |
| 6 | 881 | 883 | 833 | 0.94 |
| 7 | 94 | 99 | 89 | 0.92 |
| 8 | 492 | 478 | 464 | 0.96 |
Number of links in original alignment, number of links in repeated alignment, number of links mutually included in both alignments and F-score.
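The caption does not give the formula for the intra-rater F-score. A plausible reading, sketched below, treats the original alignment as the reference and the repeated alignment as the proposal, so that the F-score becomes twice the number of mutual links divided by the total number of links in both alignments. This assumption reproduces every reported value to two decimals; the function name `agreement_f_score` is illustrative.

```python
def agreement_f_score(original_links, repeated_links, mutual_links):
    """Harmonic mean of mutual/original and mutual/repeated link fractions."""
    total = original_links + repeated_links
    if total == 0:
        return 0.0
    return 2 * mutual_links / total


# (part, original, repeated, mutual, reported F-score) copied from the table.
rows = [
    (2, 618, 619, 561, 0.91),
    (3, 280, 282, 257, 0.91),
    (4, 136, 139, 131, 0.95),
    (5, 2416, 2446, 2314, 0.95),
    (6, 881, 883, 833, 0.94),
    (7, 94, 99, 89, 0.92),
    (8, 492, 478, 464, 0.96),
]

for part, original, repeated, mutual, reported in rows:
    computed = agreement_f_score(original, repeated, mutual)
    print(f"part {part}: computed {computed:.4f}, reported {reported:.2f}")
```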