Tudor Groza, Jane Hunter, Andreas Zankl.
Abstract
Phenotype descriptions are important for our understanding of genetics, as they enable the computation and analysis of a varied range of issues related to the genetic and developmental bases of correlated characters. The literature contains a wealth of such phenotype descriptions, usually reported as free-text entries, similar to typical clinical summaries. In this paper, we focus on creating and making available an annotated corpus of skeletal phenotype descriptions. In addition, we present and evaluate a hybrid Machine Learning approach for mining phenotype descriptions from free text. Our hybrid approach uses an ensemble of four classifiers and experiments with several aggregation techniques. The best scoring technique achieves an F-1 score of 71.52%, which is close to the state-of-the-art in other domains, where training data exists in abundance. Finally, we discuss the influence of the features chosen for the model on the overall performance of the method.
Year: 2013 PMID: 23409017 PMCID: PMC3568099 DOI: 10.1371/journal.pone.0055656
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Statistics of the phenotype descriptions corpus.
| Total number of figure captions | 1,194 |
| Total number of tokens | 64,052 |
| Average number of tokens per caption | 53 |
| Total number of phenotype descriptions | 5,423 |
| Average number of phenotype descriptions per caption | 4 |
| Average number of tokens per phenotype description | 5 |
| Maximum number of tokens in one phenotype description | 31 |
| Minimum number of tokens in one phenotype description | 1 |
The corpus used for training the classifiers was manually compiled from 395 randomly selected publications from three different academic journals. It consists of 1,194 image captions containing 5,423 phenotype descriptions. The total number of tokens in the corpus is 64,052, with an average of 5 tokens per phenotype description. The longest phenotype description comprises 31 tokens, while the shortest consists of a single token.
Evaluation results for individual classifiers – with and without domain-specific dictionaries.
| Method | Without dictionaries | | | With dictionaries | | |
| | P (%) | R (%) | F-1 (%) | P (%) | R (%) | F-1 (%) |
| Mallet | 76.93 | 65.51 | 70.76 | 74.91 | 63.99 | 69.02 |
| CRF++ | 41.71 | 53.65 | 46.93 | 41.33 | 53.58 | 46.66 |
| YamCha1vs1 | 67.48 | 62.38 | 64.83 | 68.43 | 63.36 | 65.80 |
| YamCha1vsAll | 68.61 | 62.17 | 65.23 | 68.62 | 62.48 | 65.40 |
We can see that MALLET consistently outperforms all the other approaches, by a margin of almost 5% in F-1 without dictionaries and almost 3% with domain-specific dictionaries. The surprising aspect is the decrease in its performance when dictionaries are used, compared with the setting that omits them.
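The F-1 scores throughout these tables are the harmonic mean of precision and recall; for instance, MALLET's 76.93% precision and 65.51% recall without dictionaries yield the 70.76% F-1 reported above. A minimal sketch of the computation:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# MALLET without dictionaries: P = 76.93, R = 65.51
print(round(f1_score(76.93, 65.51), 2))  # → 70.76
```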
Evaluation results for the simple set aggregation technique – with and without domain-specific dictionaries.
| Aggregation | Without dictionaries | | | With dictionaries | | |
| | P (%) | R (%) | F-1 (%) | P (%) | R (%) | F-1 (%) |
| Mallet ∪ … | 64.61 | – | 70.86 | 64.24 | – | 70.66 |
| Mallet ∪ … | 65.42 | 78.33 | 71.30 | 64.58 | 77.90 | 70.62 |
| YamCha1vs1 ∪ … | 66.25 | 64.85 | 65.54 | 67.02 | 64.79 | 65.89 |
| YamCha1vs1 ∩ … | 70.74 | 59.71 | 64.76 | 70.80 | 61.05 | 65.57 |
| Mallet ∩ … | – | 49.35 | 63.51 | – | 48.57 | 62.57 |
The best scoring direct set operations are those that include MALLET in their composition, which is in line with the individual classification results. The italicised results demonstrate the effect of the set operations: union increases the recall by almost 13%, while intersection increases the precision by around 12%.
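These set operations over classifier outputs can be sketched as follows, under the assumption that each classifier emits its predicted phenotype mentions as a set of (start, end) offset spans — the exact span representation is not specified here:

```python
# Hypothetical span-set aggregation: each classifier's output is a set of
# (start, end) offsets. Union keeps every predicted span, trading
# precision for recall; intersection keeps only agreed spans, trading
# recall for precision -- matching the behaviour reported above.

def union(a: set, b: set) -> set:
    return a | b

def intersection(a: set, b: set) -> set:
    return a & b

mallet = {(0, 12), (20, 35), (40, 48)}
yamcha = {(0, 12), (40, 48), (60, 72)}

print(sorted(union(mallet, yamcha)))         # all four spans
print(sorted(intersection(mallet, yamcha)))  # the two agreed spans
```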
Evaluation results for the paired set aggregation technique – with and without domain-specific dictionaries.
| Aggregation | Without dictionaries | | | With dictionaries | | |
| | P (%) | R (%) | F-1 (%) | P (%) | R (%) | F-1 (%) |
| (Mallet …) | 70.06 | 66.75 | 68.36 | 69.62 | 67.53 | 68.56 |
| (Mallet …) | 70.05 | 66.97 | 68.48 | 69.76 | 67.53 | 68.62 |
| (Mallet …) | – | 65.91 | 68.06 | – | 66.72 | 68.33 |
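The paired technique presumably composes two set operations, e.g. filtering the union of two classifiers' outputs through a third classifier's predictions. A hedged sketch of such a composition — the exact pairings behind the truncated row labels are not recoverable here, and `paired` is an illustrative name:

```python
# Hypothetical paired aggregation over span sets: pool two classifiers
# with a union, then keep only the spans a third classifier confirms.
# This balances the recall gain of union with the precision gain of
# intersection.

def paired(a: set, b: set, c: set) -> set:
    return (a | b) & c

mallet = {(0, 12), (20, 35)}
crf = {(20, 35), (40, 48)}
yamcha = {(0, 12), (20, 35), (60, 72)}

print(sorted(paired(mallet, crf, yamcha)))  # spans confirmed by the third set
```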
Evaluation results for the voting aggregation technique – with and without domain-specific dictionaries.
| Veto owner | Without dictionaries | | | With dictionaries | | |
| | P (%) | R (%) | F-1 (%) | P (%) | R (%) | F-1 (%) |
| Mallet | 66.14 | 77.84 | 71.52 | 65.29 | 77.95 | 71.06 |
| CRF++ | 44.10 | 70.73 | 54.33 | 43.78 | 71.53 | 54.31 |
| YamCha1vs1 | 67.42 | 68.79 | 68.10 | 67.52 | 69.21 | 68.35 |
| YamCha1vsAll | 68.38 | 68.70 | 68.54 | 68.06 | 68.82 | 68.44 |
The results of the voting method are in line with the other aggregation methods. The highest score (an F-1 of 71.52% without dictionaries and 71.06% with them) is achieved by using MALLET as the veto owner.
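One plausible reading of the veto mechanism, sketched below under the assumption that the classifiers emit per-token BIO-style labels and the veto owner's label wins whenever the vote is tied — the authors' exact tie-breaking rules may differ:

```python
from collections import Counter

def veto_vote(labels: list, veto_idx: int) -> str:
    """Majority vote over per-token labels from an ensemble; on a tie,
    the veto owner's label wins. (Assumed semantics -- sketch only.)"""
    counts = Counter(labels)
    top = counts.most_common()
    best, best_n = top[0]
    tied = [lab for lab, n in top if n == best_n]
    if len(tied) > 1 and labels[veto_idx] in tied:
        return labels[veto_idx]
    return best

# Four classifiers label one token; index 0 plays the role of MALLET,
# the veto owner. The 2-2 tie is resolved in its favour.
print(veto_vote(["B-PHEN", "O", "B-PHEN", "O"], veto_idx=0))  # → B-PHEN
```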
Figure 1. Evaluation results for MALLET ten-fold cross-validation using single features.
The graph groups the features according to the categories used to describe them in the Materials and Methods section. We can observe that the simple and morphological features perform best, with the Prefix feature achieving an F-1 score of 66.22%. Among the token context features, token bigrams with a window of 3 provide the best configuration (almost 30% F-1). Dictionary-based features, both generic and domain-specific, perform poorly, which is associated with their lack of discriminative power.
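The simple and morphological features named above (Prefix, Suffix, shape, digit/punctuation/vowel flags) can be sketched as token-level feature functions. The specific prefix/suffix lengths and the shape encoding below are assumptions, not the authors' exact settings:

```python
import re

def token_features(token: str) -> dict:
    """Illustrative token-level features for sequence labelling; actual
    lengths and encodings used by the authors may differ."""
    return {
        "prefix": token[:3],                                # Prefix
        "suffix": token[-3:],                               # Suffix
        "shape": re.sub(r"[A-Z]", "X",
                 re.sub(r"[a-z]", "x",
                 re.sub(r"\d", "d", token))),               # M_Shape-style
        "has_digit": any(c.isdigit() for c in token),       # M_Digits
        "has_punct": any(not c.isalnum() for c in token),   # M_Punct
        "n_vowels": sum(c in "aeiouAEIOU" for c in token),  # M_Vowels
    }

print(token_features("Brachydactyly"))
```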
Evaluation results for MALLET ten-fold cross-validation with leave-one-out feature ablation.
| Feature | P (%) | R (%) | F-1 (%) |
| Prefix | 61.59 | 52.59 | 56.74 |
| Root (NLP) | 75.72 | 64.50 | 69.66 |
| M_Punct | 75.06 | 64.17 | 69.19 |
| M_Vowels | 76.65 | 64.93 | 70.30 |
| Root (LEX) | 72.90 | 62.32 | 67.20 |
| Suffix | 76.07 | 65.60 | 70.43 |
| POS (LEX) | 75.01 | 64.48 | 69.34 |
| M_Digits | 75.69 | 64.98 | 69.93 |
| M_Shape | 76.66 | 64.50 | 70.06 |
| M_Bshape | 75.34 | 65.92 | 70.31 |
| POS (NLP) | 75.50 | 65.40 | 70.09 |
| Token_Bi3 | 65.05 | 49.53 | 56.24 |
| D_Generic | 75.72 | 64.57 | 69.70 |
| D_Domain | 76.93 | 65.51 | 70.76 |
This overview shows the individual importance of each feature in the overall classification model. The large majority of features have very little impact on the model, i.e., removing one decreases performance by only 1–2%. The only two features that make a real difference are the Prefix and the token context (Token_Bi3), whose removal reduces the overall performance by almost 15%.
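The leave-one-out protocol behind this table can be sketched generically: retrain with each feature removed and compare against the full model. Here `train_and_eval` is a hypothetical stand-in for the MALLET training/evaluation pipeline, and the toy scorer only illustrates the mechanics:

```python
def ablation(features, train_and_eval):
    """Leave-one-out feature ablation: returns the F-1 drop caused by
    removing each feature (a larger drop means the feature matters more)."""
    full = train_and_eval(set(features))
    drops = {}
    for f in features:
        reduced = train_and_eval(set(features) - {f})
        drops[f] = full - reduced
    return drops

# Toy scorer standing in for the real pipeline: each feature simply
# contributes an additive weight to a baseline score.
weights = {"Prefix": 14.0, "Token_Bi3": 14.5, "Suffix": 0.3}
score = lambda feats: 56.0 + sum(weights[f] for f in feats if f in weights)

print(ablation(list(weights), score))
```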
Comparative overview of the evaluation results – with and without domain-specific dictionaries.
| Technique | Without dictionaries | | | With dictionaries | | |
| | P (%) | R (%) | F-1 (%) | P (%) | R (%) | F-1 (%) |
| Individual classification (Mallet) | 76.93 | 65.51 | 70.76 | 74.91 | 63.99 | 69.02 |
| Simple set operation | 65.42 | 78.33 | 71.30 | 64.24 | – | 70.66 |
| Aggregated set operation | 70.05 | 66.97 | 68.48 | 69.76 | 67.53 | 68.62 |
| Voting | 66.14 | 77.84 | 71.52 | 65.29 | 77.95 | 71.06 |
This comparative overview shows the difference in performance between all aggregation techniques. The best aggregation technique – the voting mechanism – outperforms the best individual classifier by almost 1% in F-1.