Alexander Garcia, Federico Lopez, Leyla Garcia, Olga Giraldo, Victor Bucheli, Michel Dumontier.
Abstract
A significant portion of the biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology. We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation; the resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language. We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.
Keywords: Linked data; Ontology; RDF; SPARQL; Semantic; Semantic web
Year: 2018 PMID: 29312824 PMCID: PMC5755483 DOI: 10.7717/peerj.4201
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Publication parsing process.
Ontologies used for metadata, structure, content and references.
| Ontology | Purpose | Main elements used in Biotea |
|---|---|---|
| Bibliographic ontology | Metadata | bibo:AcademicArticle, bibo:Document, bibo:doi, bibo:identifier, bibo:issn, bibo:Issue, bibo:issue, bibo:Journal, bibo:numPages, bibo:pageEnd, bibo:pageStart, bibo:pmid, bibo:shortDescription, bibo:volume |
| Bibliographic ontology | References | bibo:AcademicArticle, bibo:Book, bibo:Chapter, bibo:citedBy, bibo:cites, bibo:Document, bibo:Proceedings |
| Biotea | Metadata (list of elements) | biotea:authorList |
| Biotea | Structure (list of elements) | biotea:paragraphList, biotea:sectionList |
| Document ontology | Structure and content | doco:Figure, doco:Section, doco:Paragraph, doco:Table |
| Dublin core terms | Metadata | dcterms:description, dcterms:issued, dcterms:publisher, dcterms:title |
| Dublin core terms | Provenance | dcterms:creator, dcterms:hasFormat, dcterms:isFormatOf, dcterms:references, dcterms:source |
| Friend of a friend ontology | Metadata | foaf:familyName, foaf:givenName, foaf:name, foaf:OnlineAccount, foaf:Organization, foaf:Person, foaf:publications |
| Friend of a friend ontology | References | foaf:familyName, foaf:givenName, foaf:name, foaf:OnlineAccount, foaf:Organization, foaf:Person, foaf:publications |
| OWL | Link to other semantic representations | owl:sameAs |
| Provenance ontology | Provenance | prov:generatedAtTime, prov:wasAttributedTo, prov:wasDerivedFrom |
| RDF | Content (text in paragraphs) | rdf:value |
| RDFS | Link to related web pages | rdfs:seeAlso |
| Semantic science integrated ontology | Provenance | sio:is_data_item_in |
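To make the table above concrete, the following is a minimal sketch of how a few of the listed metadata elements could be serialized as N-Triples. It uses only the Python standard library. The base IRI `http://example.org/pmcdoc/`, the helper names, and the use of `bibo:identifier` to carry the PMCID are assumptions made for illustration; they are not taken from the Biotea implementation. The sample PMCID and title come from the document table further down.

```python
# Namespace IRIs of three vocabularies listed in the table.
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical base IRI for article resources (an assumption,
# not the IRI scheme actually used by Biotea).
BASE = "http://example.org/pmcdoc/"


def literal(text):
    """Escape a Python string as a plain N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def article_triples(pmcid, title):
    """Return N-Triples lines describing one article's core metadata."""
    subject = f"<{BASE}{pmcid}>"
    return [
        f"{subject} <{RDF}type> <{BIBO}AcademicArticle> .",
        f"{subject} <{DCTERMS}title> {literal(title)} .",
        # Assumption: the PMCID is stored with bibo:identifier.
        f"{subject} <{BIBO}identifier> {literal('PMC' + pmcid)} .",
    ]


triples = article_triples(
    "3862691",
    "Impact of Interleukin-18 Polymorphisms -607A/C and -137G/C "
    "on Oral Cancer Occurrence and Clinical Progression",
)
for line in triples:
    print(line)
```

Plain string formatting keeps the sketch dependency-free; a production pipeline would more likely rely on an RDF library to handle escaping and serialization.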
Figure 2. Semantic enrichment.
Ontologies used to support the annotation process.
| Ontology | Purpose | Main elements used in Biotea |
|---|---|---|
| Annotation ontology | Annotation | ao:Annotation, aot:ExactQualifier, ao:body |
| Annotation ontology | Link to biomedical ontologies | ao:hasTopic |
| Annotation ontology | Link to RDFized publication | ao:annotatesResource, ao:context, ao:onResource |
| Biotea | Frequency (occurrences and inverse document frequency) | biotea:idf, biotea:tf |
| Open Annotation Data Model | Annotation | oa:Annotation, oa:hasBody (with an oa:TextualBody) |
| Open Annotation Data Model | Link to biomedical ontologies | oa:hasBody (with a direct link to the ontological concept) |
| Open Annotation Data Model | Link to RDFized publication | oa:hasSource, oa:hasTarget |
| Provenance, authoring and versioning ontology | Provenance | pav:authoredBy, pav:createdBy |
| Provenance ontology | Provenance | prov:generatedAtTime |
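The table above lists `biotea:tf` and `biotea:idf` as frequency measures (occurrences and inverse document frequency). As a rough illustration of what such values capture, the sketch below computes raw term counts per document and the textbook `idf = log(N / df)` over a small corpus. The exact formulas used by Biotea are not stated here, so this variant is an assumption, and the term identifiers and corpus are made up for the example.

```python
import math
from collections import Counter


def term_frequencies(annotated_terms):
    """Count how often each ontology term was annotated in one document."""
    return Counter(annotated_terms)


def inverse_document_frequencies(corpus):
    """Compute idf = log(N / df) for every term in a corpus.

    corpus maps a document id to the list of terms annotated in it.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for terms in corpus.values():
        doc_freq.update(set(terms))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus: document ids and term ids are illustrative only.
corpus = {
    "docA": ["term:x", "term:x", "term:y"],
    "docB": ["term:x"],
    "docC": ["term:x", "term:z"],
}

tf_doc_a = term_frequencies(corpus["docA"])
idf = inverse_document_frequencies(corpus)
print(tf_doc_a)
print(idf)
```

A term that appears in every document receives an idf of zero, which is why widely used concepts contribute little when documents are compared.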
Figure 3. The Biotea model.
Figure 4. Text structure RDF model.
Figure 5. Annotations based on the OADM model.
Figure 6. Human annotation interface.
Figure 7. Resulting dataset: 34 papers related to “American Joint Committee on Cancer”, clustered based on SNOMED CT annotations.
Two PMC papers classified with a “middle similarity” and one paper with a distant similarity.
| Document | PMCID | Title |
|---|---|---|
| doc1 | 3862691 | Impact of Interleukin-18 Polymorphisms -607A/C and -137G/C on Oral Cancer Occurrence and Clinical Progression |
| doc2 | 3862582 | Impacts of CA9 Gene Polymorphisms on Urothelial Cell Carcinoma Susceptibility and Clinicopathologic Characteristics in Taiwan |
| doc3 | 3942390 | The has-miR-526b Binding-Site |
SNOMED CT terms related to cancer.
| SNOMED CT term | ID |
|---|---|
| Carcinoma | snomedct:68453008 |
| Malignant neoplastic disease | snomedct:363346000 |
| Neoplasm | snomedct:108369006 |
| Neoplasm, malignant (primary) | snomedct:86049000 |
SNOMED CT terms describing the patients.
| SNOMED CT term | ID |
|---|---|
| Tobacco user | snomedct:110483000 |
| Tobacco | snomedct:39953003 |
| Taiwanese | snomedct:63736003 |
| AJCC | snomedct:258236004 |
SNOMED CT terms related to the sample.
| SNOMED CT term | ID |
|---|---|
| Whole blood | snomedct:420135007 |
| Ethylenediamine tetra-acetate | snomedct:69519002 |
SNOMED CT terms related to methods.
| SNOMED CT term | ID |
|---|---|
| Probe with target amplification | snomedct:702675006 |
| Polymerase chain reaction | snomedct:258066000 |
| Deoxyribonucleic acid extraction technique | snomedct:702943006 |
Queries against Biotea.
| Queries | Federated Y/N | Ontologies | Endpoints |
|---|---|---|---|
| Get the title and PMC identifier of articles annotated with Chemical homeostasis (including its subclasses), Insulin, or Homeostasis, together with their COLIL citation context and the Insulin-related pathways from Reactome | Y | SNOMED CT, GO, NCIT | Biotea, Reactome, COLIL |
| Retrieve all the articles containing Placebo Control, Crossover Study, Glucose tolerance test, Insulin secretion, glucose metabolic process, and the entries from UniProt related to glucose metabolic process, response to insulin and Diabetes mellitus, non-insulin-dependent (NIDDM) | Y | NCIT, SNOMED CT, GO, UniProt | Biotea, UniProt |
| Get all the annotations from GO and ChEBI in articles containing “American Joint Committee on Cancer” | N | GO, ChEBI, SNOMED CT | Biotea |
| Common SNOMED CT tags for articles pmc:3875424 and pmc:3933681 | N | SNOMED CT | Biotea |
| Get all the annotations for the article pmc:3865095 | N | Multiple vocabularies | Biotea |
| Get all the articles annotated with “Calcitonin” and “Injury of kidney” with their PMC links and the DBpedia “Calcitonin” description, as well as the UniProt entries classified with “Calcitonin binding” | Y | Biotea, SNOMED CT, GO, UniProt, DBpedia | Biotea, UniProt, DBpedia |
| Retrieve all the articles annotated with “Renal cell carcinoma” and the articles that cite them in the Open Citations dataset | Y | Open Citations | Biotea, Open Citations |
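As an illustration of the non-federated query “Common SNOMED CT tags for articles pmc:3875424 and pmc:3933681” in the table above, the sketch below builds a SPARQL query string in Python. The graph shape is an assumption: it supposes that each `oa:Annotation` links to the ontology concept through `oa:hasBody` and to the article through `oa:hasTarget` followed by `oa:hasSource`. The article IRIs are hypothetical placeholders, and the actual Biotea queries may differ.

```python
PREFIXES = "PREFIX oa: <http://www.w3.org/ns/oa#>\n"


def common_annotations_query(article_a, article_b):
    """Build a SPARQL query for concepts annotated in both articles.

    Assumed (not confirmed) graph shape:
    annotation --oa:hasBody--> concept
    annotation --oa:hasTarget--> target --oa:hasSource--> article
    """
    return PREFIXES + f"""\
SELECT DISTINCT ?concept WHERE {{
  ?ann1 a oa:Annotation ;
        oa:hasBody ?concept ;
        oa:hasTarget/oa:hasSource <{article_a}> .
  ?ann2 a oa:Annotation ;
        oa:hasBody ?concept ;
        oa:hasTarget/oa:hasSource <{article_b}> .
  FILTER(isIRI(?concept))
}}
"""


# Hypothetical article IRIs; the real Biotea IRIs are not given here.
query = common_annotations_query(
    "http://example.org/pmcdoc/3875424",
    "http://example.org/pmcdoc/3933681",
)
print(query)
```

Restricting the result to SNOMED CT would additionally require a filter on the namespace of `?concept`, which depends on how the endpoint mints concept IRIs. A federated variant would wrap the remote graph pattern in a SPARQL 1.1 `SERVICE` clause.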
Figure 8. Example of federated SPARQL query.
Figure 9. Calculating the distance between pairs of articles using annotations.
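One common way to compute a distance between two articles from their annotations is the cosine distance between their weighted term vectors. The sketch below shows that technique; it is an assumption that this resembles the calculation in Figure 9, since the metric itself is not stated here. The SNOMED CT identifiers come from the tables above, but their assignment to documents and the weights are made up for illustration.

```python
import math
from itertools import combinations


def cosine_distance(weights_a, weights_b):
    """Return 1 - cosine similarity of two sparse term-weight vectors."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[t] * weights_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an unannotated document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(documents):
    """Compute the distance for every unordered pair of documents."""
    return {
        (a, b): cosine_distance(documents[a], documents[b])
        for a, b in combinations(sorted(documents), 2)
    }


# Illustrative data only: term-to-document assignment and weights are invented.
documents = {
    "doc1": {"snomedct:68453008": 3.0, "snomedct:110483000": 1.0},
    "doc2": {"snomedct:68453008": 2.0, "snomedct:63736003": 1.0},
    "doc3": {"snomedct:420135007": 1.0},
}

distances = pairwise_distances(documents)
for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

Documents that share no annotated terms end up at distance 1, while documents with proportional term weights approach distance 0. The weights could be raw counts or tf-idf products, depending on how strongly common terms should be discounted.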