Daniel J Quest, Miriam L Land, Thomas S Brettin, Robert W Cottingham.
Abstract
BACKGROUND: Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third-party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way.
Year: 2010 PMID: 20946598 PMCID: PMC3026362 DOI: 10.1186/1471-2105-11-S6-S15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An overview of the Oak Ridge Genome Annotation and Analysis (ORGAA) system.
A comparison of five common data storage technologies currently deployed in annotation systems.
| | Free Text | Tab/Line Delimited | XML | RDF/XML | Relational-DB |
|---|---|---|---|---|---|
| Description Logic (FOL) | NO | NO | NO | YES | NO |
| Ontology Standards | NO | NO | NO | YES | NO |
| Centralized/not scalable | NO | YES* | NO | NO | YES** |
| Human Readable | YES | YES | NO | YES | YES |
| Domain Expert Understandable | YES | YES | NO | YES | NO |
| Data Structure | NONE | Single Table | Tree | Graph | Relational Tables |
| Data Expectations | NONE | NONE | Schema – Constraints | Inference rules | Schema – Constraints |
| Native Format | Text | Text | Text | Text | Binary |
| Query Engine Language | Programmed by hand | Programmed by hand | Libraries available | SPARQL | SQL |
| Naming Standard | NO UNA**** | NO UNA | NO UNA | NO UNA | UNA |
| Sequence Storage Solution | In Text | In Text | XML/Indexed | XML/Indexed | Indexed |
| CWA/OWA*** | OWA | CWA | CWA | OWA | CWA |
| Search Speed (Worst Case) | NP-Hard | O(n) | O(n) | P | O(log n) |
| Update Speed (Worst Case) | NP-Hard | O(n) | O(n) | P | O(log n) |
| Conversion to Semantic | Data loss possible – done by hand | No data loss – done by hand | No data loss – done with robust libraries | - | No data loss – library usage and some added labeling by hand |
| Conversion from Semantic | No data loss | Data loss | Data loss | - | Data loss |
Free text is used in repositories such as scientific journals. Tab/line-delimited files are used in popular formats such as FASTA, GFF, and BLAST, and they also constitute the bulk of program output from most bioinformatics software. Mature tools and sequence repositories such as GenBank support XML output. Many OWL-based ontology repositories exist for semantic data integration; however, RDF/XML data is currently scarce. Relational databases typically do not provide direct access to the data; instead, a programming interface is provided for access to the underlying database.
Free text is the most flexible and also the least machine-readable. Relational databases are the most formal structures, with the fastest indexing and searching capabilities. Relational technology requires the greatest investment in computational expertise, while free text is the most natural. XML and RDF/XML are designed for modification over time and for sharing data.
In the rows on search speed and update speed, O(log n), O(n), P, and NP-Hard are computer science terms that indicate how quickly a computer can solve a particular problem as the input grows. P indicates that a solution can be found in polynomial time; NP-Hard means that the solution space explodes relative to the input size. NP-Hard problems are not expected to be solvable on a computer in reasonable time, whereas O(log n), O(n), and P problems are all solvable efficiently.
In the rows on conversion to and from RDF/XML, Turtle, and other semantically aware data storage technologies, loss of information includes schema, constraints, data, and formatting. For example, converting from a relational schema to tab-delimited files loses information because the schema, triggers, and views cannot be represented in tab-delimited files. The columns are therefore more than just data: they are data plus the descriptions surrounding the data that support logical conclusions and allow computer code to run in reasonable time. In the conversion from free text to semantic standards, assumptions and domain expertise may be lost.
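The asymmetry in the two conversion rows can be illustrated with a minimal sketch. It is not the ORGAA implementation: the namespace `http://example.org/annotation/`, the function names, and the sample record are all hypothetical, and literal escaping is ignored for brevity. It assumes a GFF-style nine-column record and represents triples as plain Python tuples rather than using an RDF library.

```python
# Hypothetical namespace; the source does not name the vocabulary it uses.
BASE = "http://example.org/annotation/"

# The nine standard GFF column names, used here as predicate names.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def gff_line_to_triples(line, row_id):
    """Turn one tab-delimited GFF-style record into (subject, predicate, object) triples."""
    fields = line.rstrip("\n").split("\t")
    subject = f"<{BASE}feature/{row_id}>"
    triples = []
    for name, value in zip(GFF_COLUMNS, fields):
        if value == ".":  # GFF writes "." for "no value", so no triple is emitted
            continue
        triples.append((subject, f"<{BASE}{name}>", f'"{value}"'))
    return triples


def triples_to_row(triples):
    """Flatten triples into one tab-delimited row; predicates with no column are dropped."""
    values = {p[len(BASE) + 1:-1]: o.strip('"') for _, p, o in triples}
    return "\t".join(values.get(name, ".") for name in GFF_COLUMNS)


# Illustrative record (made up for this sketch).
line = "chr1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=gene1"
triples = gff_line_to_triples(line, "gene1")

# Tab-delimited to semantic and back: every column value survives.
round_trip = triples_to_row(triples)

# A statement that links two features has no column to live in,
# so flattening the enriched graph silently discards it.
link = (triples[0][0], f"<{BASE}similarTo>", f"<{BASE}feature/gene2>")
flattened = triples_to_row(triples + [link])
```

Here `round_trip` reproduces the original record, while `flattened` is identical to it even though the graph held one extra statement. That is the sense in which conversion from a semantic format to a single table can lose data.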
*Assuming all information is in one file. If multiple files exist, then an indexing system needs to be developed to organize information.
**Relational databases are assumed to exist as a single installation on a powerful single resource. New database technologies have lessened this restriction in recent years.
***CWA – Closed World Assumption, missing information treated as false. OWA – Open World Assumption, missing information treated as unknown.
****UNA – Unique Name Assumption, each individual has a single unique name.
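The assumptions defined in the notes above can be made concrete with a short sketch. The facts, identifiers, and function names are hypothetical and chosen only for illustration; `None` stands in for "unknown".

```python
# Hypothetical stated facts: (subject, predicate, object).
FACTS = {
    ("gene1", "encodes", "kinase"),
    ("gene2", "encodes", "transporter"),
}

# Hypothetical explicit statement that two names denote the same individual.
SAME_AS = {frozenset({"gene1", "b0001"})}


def ask_closed_world(fact):
    """CWA: a fact that is not stated is treated as false."""
    return fact in FACTS


def ask_open_world(fact):
    """OWA: a fact that is not stated is unknown (None), not false."""
    return True if fact in FACTS else None


def same_individual(a, b, unique_names):
    """Decide whether two names denote the same individual.

    With the UNA, different names always mean different individuals.
    Without it, different names are the same individual only if that is
    stated; otherwise the answer is unknown (None).
    """
    if a == b:
        return True
    if unique_names:
        return False
    return True if frozenset({a, b}) in SAME_AS else None


missing = ("gene3", "encodes", "kinase")
cwa_answer = ask_closed_world(missing)  # False
owa_answer = ask_open_world(missing)    # None
```

The same query about `gene3` is answered "false" under the CWA and "unknown" under the OWA. Likewise, `same_individual("gene1", "gene2", unique_names=False)` returns `None`, because without the UNA two different names may or may not refer to the same gene.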
Figure 2. SAnoS System Architecture. The open arrows represent the flow of data through the system.
Figure 3. A simplified version of the ORNL annotation ontology edited in Protégé.
Figure 4. An example RDF pipeline for algorithm execution.
Figure 5. A mechanism for translation between legacy formats and RDF/XML.