Olivo Miotto, Tin Wee Tan, Vladimir Brusic.
Abstract
BACKGROUND: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.
Year: 2008 PMID: 18315860 PMCID: PMC2259408 DOI: 10.1186/1471-2105-9-S1-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Retrieval performance of the NCBI nucleotide and protein databases. Each chart shows five pairs of bars, one pair for each extracted property. The first (darker) bar of each pair shows the performance for the GenBank database, while the second (lighter) bar shows the value for GenPept. The first chart shows the percentage of source documents from which a property value could be extracted, while the second chart shows the percentage of accurate values extracted, measured against the manually annotated dataset.
Table 1. Structural rules employed for the extraction of sequence record properties from GenBank and GenPept. For each structural rule, the priority (lower numbers indicate higher priority) and XPath expression are given. The proteinName property was only extracted from GenPept.
| Property | Priority | XPath expression |
| --- | --- | --- |
| proteinName | 1 | /GBSeq/GBSeq_definition |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='gene']/GBQualifier_value |
| subtype | 1 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value |
| | 3 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value |
| isolate | 1 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value |
| | 3 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value |
| host | 1 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='specific_host']/GBQualifier_value |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value |
| | 3 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value |
| | 4 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value |
| origin | 1 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='country']/GBQualifier_value |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolation_source']/GBQualifier_value |
| | 3 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value |
| | 4 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value |
| | 5 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value |
| | 6 | /GBSeq/GBSeq_references/GBReference/GBReference_title |
| year | 1 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='note']/GBQualifier_value |
| | 2 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolation_source']/GBQualifier_value |
| | 3 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value |
| | 4 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value |
| | 5 | /GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value |
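The priority ordering of the structural rules can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `first_match` helper, the sample XML, and the strain name are hypothetical, and the absolute `/GBSeq/...` paths are rewritten relative to the root element because the standard-library `xml.etree.ElementTree` module does not evaluate absolute paths on an element. The sketch covers only the structural step, so the returned text is the raw field value, from which a semantic rule would still have to pick out the host.

```python
import xml.etree.ElementTree as ET

# Common prefix of the qualifier rules, written relative to the GBSeq root.
QUAL = "./GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier"

def qualifier_path(name):
    """Build the relative path for a qualifier with the given name."""
    return f"{QUAL}[GBQualifier_name='{name}']/GBQualifier_value"

# Rules for the host property, in the priority order listed in the table.
HOST_RULES = [
    (1, qualifier_path("specific_host")),
    (2, qualifier_path("strain")),
    (3, qualifier_path("isolate")),
    (4, qualifier_path("organism")),
]

def first_match(gbseq, rules):
    """Return (priority, raw text) for the highest-priority rule that matches."""
    for priority, path in sorted(rules):
        node = gbseq.find(path)
        if node is not None and node.text and node.text.strip():
            return priority, node.text.strip()
    return None

# Hypothetical record: it has no specific_host qualifier, so rule 2 applies.
SAMPLE = """
<GBSeq>
  <GBSeq_definition>hypothetical record</GBSeq_definition>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>organism</GBQualifier_name>
          <GBQualifier_value>Influenza A virus</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>strain</GBQualifier_name>
          <GBQualifier_value>A/duck/Example/1/2005</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""

root = ET.fromstring(SAMPLE)
result = first_match(root, HOST_RULES)
print(result)  # (2, 'A/duck/Example/1/2005')
```

Returning the priority alongside the value makes it possible to tally which rule produced each final value.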
Figure 2. Performance of structural rules for five metadata properties. Bars show the percentage of records for which a given structural rule produced the final property value. Rules are numbered according to their priority, matching the priorities shown in Table 1.
Figure 3. Associations of sequences to isolates. The left chart shows a count of identified isolates, according to the number of sequences they are associated with. On the right, we show the distribution of sequences according to the number of sequences associated with their isolates.
Figure 4. Isolate annotation and resulting corrections. The left chart shows the percentage of created IsolateRecord objects with a value for each of the five properties. For the host and origin properties, the low yield of isolate annotations would indicate that isolates with a full complement of proteins (10 or 11 sequence records per isolate) are generally better annotated than isolates with a small number of sequences. The chart on the right shows the number of property values that were automatically modified (or added, in the case of sequence records for which structural rules did not yield a value).
Figure 5. Experimental workflow of this study. The workflow has three main stages: the retrieval and merging of the source documents from public databases; the extraction of metadata by multiple structural rules; and the semantic restructuring of the sequence metadata, which identifies isolates and subsequently re-annotates the sequences.
Figure 6. Restructuring sequence metadata. Graph A shows the relationship between SequenceRecord resources, their metadata properties, and a source GenBank document, as encoded by ABK in its RDF output. In this example, records belonging to the same isolate have no relationship to each other. Graph B shows the same knowledge, restructured by the introduction of the IsolateRecord resource, and the transfer of isolate-specific metadata.
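The restructuring idea can be sketched in plain Python dictionaries rather than RDF. This is an illustrative assumption, not the ABK data model: the record fields, the accession and place names, and the choice of which properties count as isolate-level (`ISOLATE_PROPERTIES`) are all hypothetical. The sketch groups sequence records under their isolate and then copies unambiguous isolate-level values back onto a record that lacked them.

```python
from collections import defaultdict

# Assumption: these properties describe the isolate, not a single sequence.
ISOLATE_PROPERTIES = ("subtype", "host", "origin", "year")

def build_isolate_records(sequence_records):
    """Group records by isolate name and collect the isolate-level values."""
    isolates = defaultdict(lambda: defaultdict(set))
    for rec in sequence_records:
        for prop in ISOLATE_PROPERTIES:
            if rec.get(prop):
                isolates[rec["isolate"]][prop].add(rec[prop])
    return isolates

def reannotate(rec, isolates):
    """Copy isolate-level values onto a record when they are unambiguous."""
    updated = dict(rec)
    for prop, values in isolates[rec["isolate"]].items():
        if len(values) == 1:
            updated[prop] = next(iter(values))
    return updated

# Hypothetical flat records: the second one has no origin of its own.
records = [
    {"accession": "SEQ001", "proteinName": "HA",
     "isolate": "A/duck/Example/1/2005",
     "host": "duck", "origin": "Exampleland", "year": "2005"},
    {"accession": "SEQ002", "proteinName": "NA",
     "isolate": "A/duck/Example/1/2005",
     "host": "duck", "year": "2005"},
]

isolates = build_isolate_records(records)
filled = reannotate(records[1], isolates)
print(filled["origin"])  # Exampleland
```

Collecting values into sets keeps conflicting annotations visible on the isolate instead of silently overwriting one with another.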
Figure 7. Identification of conflicting metadata values. Sequences from the same isolate should have identical values for certain metadata properties, such as origin. However, inconsistencies often occur, as shown in 7A. Rule-based metadata restructuring transfers the inconsistent values to the IsolateRecord resource, as shown in 7B. Since origin is declared as a functional property, an OWL reasoner can identify the inconsistency as a breach of the ontology DL constraint.
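The functional-property constraint amounts to "at most one value per isolate". The check below is a plain-Python stand-in for what an OWL reasoner would report, not a reasoner itself; the record layout, the place names, and the `functional_violations` helper are hypothetical.

```python
# Assumption: origin is the only property treated as functional here.
FUNCTIONAL_PROPERTIES = {"origin"}

# Hypothetical isolate record after restructuring: values transferred from
# its sequence records are collected as sets.
isolate_record = {
    "isolate": "A/duck/Example/1/2005",
    "origin": {"Exampleland", "Otherland"},
    "host": {"duck"},
}

def functional_violations(record, functional=FUNCTIONAL_PROPERTIES):
    """Return each functional property that holds more than one value."""
    return {
        prop: sorted(record[prop])
        for prop in functional
        if len(record.get(prop, ())) > 1
    }

conflicts = functional_violations(isolate_record)
print(conflicts)  # {'origin': ['Exampleland', 'Otherland']}
```

A record with a single origin yields an empty dictionary, so only genuinely conflicting isolates are flagged for review.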