Achille Zappa, Andrea Splendiani, Paolo Romano.
Abstract
BACKGROUND: With the advent of high-throughput technologies, a great wealth of variation data is being produced. Such information may constitute the basis for correlation analyses between genotypes and phenotypes and, in the future, for personalized medicine. Several databases on gene variation exist, but this kind of information is still scarce in the Semantic Web framework. In this paper, we discuss issues related to the integration of mutation data in the Linked Open Data infrastructure, part of the Semantic Web framework. We present the development of a mapping from the IARC TP53 Mutation database to RDF and the implementation of servers publishing this data.
Year: 2012 PMID: 22536974 PMCID: PMC3303732 DOI: 10.1186/1471-2105-13-S4-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Examples of D2RQ mappings
| Description | Mapping | Generated triple (example) |
|---|---|---|
| Create a triple that defines the value of the title of a given bibliographic reference. | map:somatic_ref_Title a d2rq:PropertyBridge; | <http://bioinformatics.istge.it/logvd/resource/somatic_refs/7> |
| Create a triple that defines the value of the wild type amino acid for a given gene variation. Conditional clauses avoid definitions for special cases (empty field, dash character and "NA" value). | map:variation_hasWildTypeResidue a d2rq:PropertyBridge; | <http://bioinformatics.istge.it/logvd/resource/variations/994> |
| Create a triple that establishes a link to the TP53 human gene description in HGNC as implemented in Bio2RDF. | map:gene_HGNC a d2rq:PropertyBridge; | <http://bioinformatics.istge.it/logvd/resource/genes/TP53> |
Three examples of D2RQ mappings between the IARC TP53 Mutation database and its RDF representation. In the first one, the title of a paper is extracted from the database and associated with an entity representing a given bibliographic reference by means of the bibtex:hasTitle property. In the second example, the one-letter code of the wild type amino acid corresponding to the location of a given variation is associated with the entity representing the same variation through the mio:hasWildTypeResidue property. The third example creates a connection with an external entity, namely the HGNC identifier of the TP53 gene, by specifying its Linked Data URI.
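The conditional behaviour of the second mapping can be illustrated outside D2RQ. The following is a minimal Python sketch, not the authors' mapping: the function name, the `residue` argument and the sample value are assumptions made for illustration, and the `mio:` prefix is left unexpanded because its namespace is not given here.

```python
from typing import Optional

# Special cases that the conditional clauses of the mapping skip.
SKIPPED_VALUES = {"", "-", "NA"}

# Base URI taken from the generated triples shown in the table.
BASE = "http://bioinformatics.istge.it/logvd/resource/"


def wild_type_triple(variation_id: int, residue: Optional[str]) -> Optional[str]:
    """Return one Turtle-style triple, or None for a special-case field.

    Hypothetical helper: it only mimics the effect of the property bridge.
    """
    if residue is None or residue.strip() in SKIPPED_VALUES:
        return None  # no triple is generated for empty, "-" or "NA" fields
    subject = f"<{BASE}variations/{variation_id}>"
    return f'{subject} mio:hasWildTypeResidue "{residue.strip()}" .'


# "R" is an illustrative value, not the residue actually stored for variation 994.
print(wild_type_triple(994, "R"))
print(wild_type_triple(994, "NA"))
```

The second call returns `None`, mirroring the fact that no triple is defined for the special cases.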
Correspondence between classes and original datasets
| Class | Description | Linked to | IARC datasets |
|---|---|---|---|
| database | Database information | somatic_mutation | Implicit |
| gene | Gene information | gene_variation | Implicit |
| gene_variation | Detailed description of the mutation | gene, somatic_mutation | Gene variations |
| somatic_mutation | Summary mutation data, linked to bibliography, sample, and variation data | database, sample, gene_variation, somatic_ref | Somatic mutations |
| sample | Tumor topography, morphology, origin, and classification | somatic_mutation, individual | Somatic mutations |
| individual | Demographic details, life-style data and genetics of the donor | sample | Somatic mutations |
| somatic_ref | Bibliographic references where mutations are described | somatic_mutation | Somatic references |
A summary of published classes and the correspondence with the original datasets.
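The "Linked to" column describes a small graph of classes. As a rough sketch (the dictionary layout and helper names are assumptions, only the class names and links come from the table), the graph can be written down and explored in Python:

```python
from collections import deque

# Class links transcribed from the "Linked to" column of the table.
LINKS = {
    "database": ["somatic_mutation"],
    "gene": ["gene_variation"],
    "gene_variation": ["gene", "somatic_mutation"],
    "somatic_mutation": ["database", "sample", "gene_variation", "somatic_ref"],
    "sample": ["somatic_mutation", "individual"],
    "individual": ["sample"],
    "somatic_ref": ["somatic_mutation"],
}


def is_symmetric(links):
    """True when every link A -> B is matched by a link B -> A."""
    return all(a in links[b] for a, targets in links.items() for b in targets)


def shortest_path(links, start, goal):
    """Breadth-first search for the shortest chain of classes from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(is_symmetric(LINKS))
print(shortest_path(LINKS, "individual", "gene"))
print(max(LINKS, key=lambda c: len(LINKS[c])))
```

The path from `individual` to `gene` passes through `somatic_mutation`, which is also the class with the most links in the table.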
Figure 1. Classes, relationships and external links. A schematic representation of the classes created by the mapping, of their relationships, and of external links. The large boxes represent classes, while the smaller ones represent external datasets. In the latter case, a yellow border denotes RDF datasets linked by URIs, and a red one denotes web sites linked by URLs.
Figure 2. Architecture of the system. The components of the system and their interfaces are shown. The triple store is populated by the RDF dump, which is created by D2RQ, and extended by special SPARQL updates. Access to the triple store is granted through Joseki, which can be queried by SPARQL clients. The Pubby interface allows data navigation by means of both HTML and RDF browsers.
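A SPARQL client typically reaches such an endpoint through the standard SPARQL Protocol, sending the query as a `query` parameter of an HTTP GET request. The sketch below only builds the request and does not send it; the endpoint URL is a placeholder, since the address of the Joseki server is not given here, and JSON results are assumed to be available through content negotiation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: the real Joseki address is not stated in this text.
ENDPOINT = "http://example.org/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for SPARQL JSON results; assumes the server honours this media type.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would execute the query against a live endpoint.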
Figure 3. Browsing the RDF graph: a somatic mutation. Representation of properties and values of a defined somatic mutation (somatic_mutations/10000), including links to related gene variation (variations/1579), sample (samples/9557), and bibliographic reference (somatic_refs/1065), together with some proper attributes (mutation and structural motif). This class is central within the schema, linking the majority of classes.
Figure 4. Browsing the RDF graph: a bibliographic reference. Properties and objects associated with the bibliographic reference denoted by somatic_refs/1065 are shown. All links to somatic references described in the paper are presented. Links to PubMed, referring to its Bio2RDF implementation and to the NCBI web site, are also shown.
SPARQL query example 1: descriptive statistical analysis of dataset contents
| Query |
|---|
| SELECT ?neoplasm ?variation (count (?variation) as ?occurrence) |
| WHERE { |
| ?sample NCIT:Neoplasm_by_Morphology ?neoplasm. |
| ?somatic_mutation logvd:hasSample ?sample. |
| ?variation_id rdfs:label ?variation. |
| ?somatic_mutation logvd:hasVariation ?variation_id. |
| } |
| GROUP BY ?neoplasm ?variation |
| ORDER BY ?neoplasm |

| ?neoplasm | ?variation | ?occurrence |
|---|---|---|
| Acinar cell carcinoma | NM_000546.1:c.186A>C | 1 |
| Acinar cell carcinoma | NM_000546.1:c.408del1 | 1 |
| Acinar cell carcinoma | NM_000546.1:c.454del1 | 1 |
| Acinar cell carcinoma | NM_000546.1:c.590T>G | 1 |
| Acute leukemia, NOS | NM_000546.1:c.524G>A | 2 |
| Acute megakaryoblastic leukemia | NM_000546.1:c.605G>T | 1 |
| Acute megakaryoblastic leukemia | NM_000546.1:c.734G>T | 1 |
| Acute monocytic leukemia | NM_000546.1:c.584T>C | 1 |
| Acute myeloid leukemia with maturation | NM_000546.1:c.743G>A | 1 |
| Acute myeloid leukemia with maturation | NM_000546.1:c.862A>T | 1 |
| ... | ... | ... |
This query selects neoplasm and associated gene variation along with the number of related associations for all somatic mutations in the dataset. The output has been limited to the first 10 results. SPARQL query prefixes are not shown.
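The effect of `GROUP BY` with `count` can be reproduced on a handful of rows. The following Python sketch is only an illustration of the aggregation: the input rows are toy pairs shaped like the output above (one pair per somatic mutation), not an extract of the triple store, and the function name is an assumption.

```python
from collections import Counter

# Toy (neoplasm, variation) pairs, one per somatic mutation.
rows = [
    ("Acute leukemia, NOS", "NM_000546.1:c.524G>A"),
    ("Acinar cell carcinoma", "NM_000546.1:c.186A>C"),
    ("Acute leukemia, NOS", "NM_000546.1:c.524G>A"),
    ("Acinar cell carcinoma", "NM_000546.1:c.408del1"),
]


def group_and_count(pairs):
    """Count identical (neoplasm, variation) pairs and order them by neoplasm."""
    counts = Counter(pairs)  # plays the role of GROUP BY ... count(...)
    return sorted(
        ((neoplasm, variation, n) for (neoplasm, variation), n in counts.items()),
        key=lambda row: row[0],  # plays the role of ORDER BY ?neoplasm
    )


for neoplasm, variation, occurrence in group_and_count(rows):
    print(neoplasm, variation, occurrence, sep=" | ")
```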
SPARQL query example 2: extraction of complementary data from DBpedia
| Query |
|---|
| SELECT ?sample ?patient ?country ?capital |
| WHERE { |
| ?sample logvd:hasIndividual ?patient. |
| ?sample NCIT:Topography "BRAIN". |
| ?patient NCIT:Country ?country |
| SERVICE <http://dbpedia.org/sparql> {?country <http://dbpedia.org/ontology/capital> ?capital} |
| } |

| ?sample | ?patient | ?country | ?capital |
|---|---|---|---|
| samples/112 | individual/112 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/113 | individual/113 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/115 | individual/115 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/116 | individual/116 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/963 | individual/963 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/964 | individual/964 | <http://dbpedia.org/resource/Japan> | <http://dbpedia.org/resource/Tokyo> |
| samples/1026 | individual/1025 | <http://dbpedia.org/resource/Canada> | <http://dbpedia.org/resource/Ottawa> |
| samples/1299 | individual/1292 | <http://dbpedia.org/resource/Spain> | <http://dbpedia.org/resource/Madrid> |
| samples/1300 | individual/1293 | <http://dbpedia.org/resource/Spain> | <http://dbpedia.org/resource/Madrid> |
| samples/1302 | individual/1295 | <http://dbpedia.org/resource/Spain> | <http://dbpedia.org/resource/Madrid> |
| samples/1303 | individual/1296 | <http://dbpedia.org/resource/Spain> | <http://dbpedia.org/resource/Madrid> |
| samples/1739 | individual/1728 | <http://dbpedia.org/resource/Germany> | <http://dbpedia.org/resource/Berlin> |
| ... | ... | ... | ... |
This query selects countries and capitals from DBpedia for individuals whose samples were used for the detection of somatic mutations. The SERVICE keyword supports the execution of the query among endpoints distributed across the Web. The output has been limited to the first 12 results. SPARQL query prefixes are not shown.
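Conceptually, a federated query joins the bindings found locally with those returned by the remote endpoint on their shared variable, here `?country`. The sketch below imitates that join in plain Python with no network access. The rows are copied from the output above, the function name is an assumption, and a real SPARQL engine may choose a different evaluation strategy.

```python
DBPEDIA = "http://dbpedia.org/resource/"

# Bindings produced by the local part of the query (rows taken from the output above).
local_bindings = [
    {"sample": "samples/112", "patient": "individual/112", "country": DBPEDIA + "Japan"},
    {"sample": "samples/1026", "patient": "individual/1025", "country": DBPEDIA + "Canada"},
    {"sample": "samples/1299", "patient": "individual/1292", "country": DBPEDIA + "Spain"},
]

# Bindings that the SERVICE block would obtain from DBpedia.
remote_bindings = [
    {"country": DBPEDIA + "Japan", "capital": DBPEDIA + "Tokyo"},
    {"country": DBPEDIA + "Canada", "capital": DBPEDIA + "Ottawa"},
    {"country": DBPEDIA + "Spain", "capital": DBPEDIA + "Madrid"},
    {"country": DBPEDIA + "Germany", "capital": DBPEDIA + "Berlin"},
]


def join_on(left, right, key):
    """Hash join of two lists of bindings on one shared variable."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


for row in join_on(local_bindings, remote_bindings, "country"):
    print(row["sample"], row["patient"], row["country"], row["capital"])
```

Remote rows without a local match, such as Germany in this toy input, do not appear in the result.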
SPARQL query example 3: retrieving clinical trials of interest
| Query |
|---|
| SELECT DISTINCT ?variation_label ?neoplasm ?clinical_trial |
| WHERE { |
| SERVICE <http://linkedlifedata.com/sparql> {?clinical_trial relontology:hasInclusionCriteria ?umls}. |
| ?sample logvd:Sub_topography "Middle third of esophagus". |
| ?sample NCIT:Neoplasm_by_Morphology ?neoplasm. |
| ?sample logvd:hasUMLS_neoplasm ?umls. |
| ?somatic_mutation logvd:hasSample ?sample. |
| ?variation_id rdfs:label ?variation_label. |
| ?somatic_mutation logvd:hasVariation ?variation_id. |
| } |
| ORDER BY ?variation_label ?neoplasm |

| ?variation_label | ?neoplasm | ?clinical_trial |
|---|---|---|
| NM_000546.1:c.507G>A | Adenocarcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001332> |
| NM_000546.1:c.838A>G | Adenocarcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001332> |
| NM_000546.1:c.507G>A | Adenocarcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001428> |
| NM_000546.1:c.838A>G | Adenocarcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001428> |
| NM_000546.1:c.482C>A | Dysplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00001932> |
| NM_000546.1:c.482C>A | Dysplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00003076> |
| NM_000546.1:c.482C>A | Dysplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00003094> |
| NM_000546.1:c.482C>A | Dysplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00003223> |
| NM_000546.1:c.482C>A | Dysplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00003593> |
| NM_000546.1:c.469G>T | Hyperplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00001378> |
| NM_000546.1:c.469G>T | Hyperplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00001756> |
| NM_000546.1:c.469G>T | Hyperplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00001760> |
| NM_000546.1:c.469G>T | Hyperplasia, NOS | <http://data.linkedct.org/resource/trials/NCT00003641> |
| NM_000546.1:c.422G>A | Squamous cell carcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00000692> |
| NM_000546.1:c.451C>G | Squamous cell carcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00000692> |
| NM_000546.1:c.469G>T | Squamous cell carcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00000692> |
| NM_000546.1:c.474_475ins1 | Squamous cell carcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001450> |
| NM_000546.1:c.488A>G | Squamous cell carcinoma, NOS | <http://data.linkedct.org/resource/trials/NCT00001450> |
This query selects clinical trials of interest, given a defined sub topography (precise location of the tumor) and shows which variations are involved. The SERVICE keyword supports the execution of the query among endpoints distributed across the Web. The output has been limited to 18 results that were selected with the aim of showing different tumors associated with the given sub topography. SPARQL query prefixes are not shown.
Main statistics of triple store contents
| Statistic | Count | Triples |
|---|---|---|
| Triple store size | | |
| Number of entities | 85,785 | |
| Number of triples | 1,002,597 | |
| External Linked Data URIs | | |
| LinkedLifeData | 25,094 | |
| Bio2RDF | 2,244 | |
| DBpedia | 23,015 | |
| Total | 50,353 | |
| Links to external web sites | | |
| Total | 2,436 | |
| Shared properties from re-used ontologies | Properties | Triples |
| rdf: | 1 | 85,893 |
| rdfs: | 3 | 88,249 |
| owl: | 1 | 2,241 |
| diseasome: | 2 | 2 |
| mio: | 2 | 9,399 |
| bibo: | 6 | 11,385 |
| bibtex: | 2 | 4,478 |
| NCIT: | 12 | 146,553 |
| Total | 29 | 348,200 |
Statistics of the RDF triple store showing number of triples, entities, external Linked Data URIs, links to external web sites and shared properties from re-used ontologies.
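The totals in the table can be cross-checked with a few sums. In the sketch below the figures are copied from the table, while reading the two numeric columns of the ontology rows as "number of properties" and "number of triples" is an interpretation based on the caption, not something the table states explicitly.

```python
# External Linked Data URIs per target dataset (figures from the table).
external_uris = {"LinkedLifeData": 25_094, "Bio2RDF": 2_244, "DBpedia": 23_015}

# Per prefix: (number of shared properties, number of triples using them).
# The meaning of the two columns is assumed from the caption.
shared = {
    "rdf:": (1, 85_893),
    "rdfs:": (3, 88_249),
    "owl:": (1, 2_241),
    "diseasome:": (2, 2),
    "mio:": (2, 9_399),
    "bibo:": (6, 11_385),
    "bibtex:": (2, 4_478),
    "NCIT:": (12, 146_553),
}

total_uris = sum(external_uris.values())
total_properties = sum(properties for properties, _ in shared.values())
total_triples = sum(triples for _, triples in shared.values())

# Share of all triples in the store that use a re-used property (under the assumption above).
share = total_triples / 1_002_597

print(total_uris, total_properties, total_triples, f"{share:.1%}")
```

The three sums agree with the "Total" rows reported in the table.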