Muhammad Saleem1, Shanmukha S Padmanabhuni2, Axel-Cyrille Ngonga Ngomo1, Aftab Iqbal2, Jonas S Almeida3, Stefan Decker2, Helena F Deus4.
Abstract
BACKGROUND: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis.
Keywords: Federated queries; RDF; SPARQL; TCGA
Year: 2014 PMID: 25937882 PMCID: PMC4417511 DOI: 10.1186/2041-1480-5-47
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Biological query results. We used TopFed to search for the methylation status of the KRAS gene (chr12:25386768-25403863) across five cancer histologies (hosted by five SPARQL endpoints) and created a box plot comparing the methylation values. The corresponding SPARQL query to retrieve the required methylation values is given in Listing 1.
Figure 2. TCGA text to RDF conversion process. A text file is first refined by the Data Refiner. The refined file is then converted into RDF (N3 notation) by the RDFizer. Finally, the RDF file is uploaded into a SPARQL endpoint.
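The refine-then-RDFize pipeline described for Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' converter: the `http://example.org/tcga/` namespace, the column names, and the sample row are hypothetical placeholders, values are assumed to contain no quote characters, and the upload step is omitted.

```python
import csv
import io

# Hypothetical namespace; the real Linked TCGA vocabulary may differ.
BASE = "http://example.org/tcga/"


def refine(raw_text, keep_columns):
    """Data Refiner step: keep only the selected columns of a tab-separated file."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    return [{col: row[col] for col in keep_columns} for row in reader]


def rdfize(rows, record_prefix):
    """RDFizer step: emit one N3 triple per (row, column) pair."""
    lines = [f"@prefix tcga: <{BASE}> ."]
    for i, row in enumerate(rows, start=1):
        subject = f"tcga:{record_prefix}-{i}"
        for column, value in row.items():
            lines.append(f'{subject} tcga:{column} "{value}" .')
    return "\n".join(lines)


# Hypothetical input: one methylation record with an extra column to discard.
raw = "probe\tbeta_value\tnote\tchromosome\tposition\ncg0001\t0.51\tn/a\t12\t25386800\n"
n3 = rdfize(refine(raw, ["beta_value", "chromosome", "position"]), "sample")
print(n3)
```

Dropping unused columns before serialisation is what keeps the refined files smaller than the raw ones. The RDF output is larger again because every retained value becomes its own triple.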
Figure 3. TCGA data distribution/load balancing and source selection. The proposed data distribution and source selection diagram for hosting the complete Linked TCGA data.
Figure 4. Text to RDF conversion process example. An example showing the refinement and RDFication of the TCGA file.
Statistics for 27 tumours sorted by number of triples
| Tumour type | Raw(GB) | Refined(GB) | RDF(GB) | Triples(Million) |
|---|---|---|---|---|
| Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 0.37 | 0.20 | 0.83 | 35 |
| Uterine Carcinosarcoma (UCS) | 1.2 | 0.64 | 2.6 | 113 |
| Glioblastoma multiforme (GBM) | 2.3 | 0.77 | 2.8 | 132 |
| Esophageal carcinoma (ESCA) | 1.5 | 0.88 | 3.4 | 149 |
| Adrenocortical carcinoma (ACC) | 1.6 | 0.90 | 3.6 | 158 |
| Pancreatic adenocarcinoma (PAAD) | 2.6 | 1.1 | 4.5 | 200 |
| Kidney Chromophobe (KICH) | 3.7 | 1.4 | 5.3 | 242 |
| Sarcoma (SARC) | 3.8 | 1.5 | 5.9 | 267 |
| Cervical (CESC) | 8.75 | 2.44 | 8.86 | 400.19 |
| Ovarian serous cystadenocarcinoma (OV) | 8.2 | 2.4 | 8.7 | 410 |
| Rectal adenocarcinoma (READ) | 8.07 | 2.25 | 9.04 | 413.31 |
| Papillary Kidney (KIRP) | 10.40 | 2.90 | 10.4 | 469.65 |
| Stomach adenocarcinoma (STAD) | 5.5 | 2.9 | 12 | 529 |
| Liver hepatocellular carcinoma (LIHC) | 8.2 | 3.1 | 12 | 550 |
| Bladder cancer (BLCA) | 12.16 | 3.39 | 12.3 | 556.38 |
| Acute Myeloid Leukemia (LAML) | 14.85 | 4.14 | 15.1 | 684.05 |
| Lower Grade Glioma (LGG) | 17.08 | 4.76 | 17.1 | 778.82 |
| Prostate adenocarcinoma (PRAD) | 18.05 | 5.03 | 18.1 | 821.01 |
| Lung squamous carcinoma (LUSC) | 20.63 | 5.75 | 20.5 | 927.08 |
| Cutaneous melanoma (SKCM) | 23.22 | 6.47 | 23.2 | 1050.94 |
| Uterine Corpus Endometrial Carcinoma (UCEC) | 13 | 5.98 | 24.2 | 1070 |
| Colon adenocarcinoma (COAD) | 18 | 6.64 | 26 | 1175 |
| Head and neck squamous cell (HNSC) | 27.6 | 7.69 | 27.5 | 1245.37 |
| Lung adenocarcinoma (LUAD) | 23 | 9.1 | 36 | 1611 |
| Kidney renal clear cell carcinoma (KIRC) | 24 | 9.4 | 37 | 1658 |
| Thyroid carcinoma (THCA) | 26 | 10.1 | 40 | 1796 |
| Breast invasive carcinoma (BRCA) | 45 | 17 | 65 | 2959 |
A total of 20.4 Billion triples.
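The stated total can be checked against the per-tumour figures. The short Python sketch below sums the "Triples (Million)" column of the table above, keyed by the tumour abbreviations; only the variable names are introduced here.

```python
# Triples (in millions) per tumour type, copied from the table above.
triples_million = {
    "DLBC": 35, "UCS": 113, "GBM": 132, "ESCA": 149, "ACC": 158,
    "PAAD": 200, "KICH": 242, "SARC": 267, "CESC": 400.19, "OV": 410,
    "READ": 413.31, "KIRP": 469.65, "STAD": 529, "LIHC": 550,
    "BLCA": 556.38, "LAML": 684.05, "LGG": 778.82, "PRAD": 821.01,
    "LUSC": 927.08, "SKCM": 1050.94, "UCEC": 1070, "COAD": 1175,
    "HNSC": 1245.37, "LUAD": 1611, "KIRC": 1658, "THCA": 1796,
    "BRCA": 2959,
}

# Convert the sum from millions to billions.
total_billion = sum(triples_million.values()) / 1000
print(len(triples_million), round(total_billion, 1))
```

The 27 rows sum to roughly 20,400 million triples, which agrees with the 20.4 billion reported.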
Excerpt of the links for the lookup files of TCGA
| Source | Target | Class | # links | Runtime (ms) |
|---|---|---|---|---|
| DNA27 | HGNC | Genes | 23,181 | 154 |
| DNA27 | Homologene | Genes | 27,654 | 193 |
| DNA27 | OMIM | Genes | 15,171 | 158 |
| DNA450 | Homologene | Genes | 489,643 | 5,710 |
| DNA450 | OMIM | Genes | 212,284 | 429 |
| DNA27 | HGNC | Chromosomes | 108,662 | 96 |
| DNA27 | OMIM | Chromosomes | 16,039,535 | 8,055 |
The source column shows the name of the look-up file that was linked to the target dataset named in the second column. The class column shows the type of resources that were linked. The fourth column shows the number of links that were generated while the runtime column shows the time required by LIMES to carry out the linking process in ms.
Excerpt of the links for the methylation results of a single patient
| Source | Target | Class | # links | Runtime (ms) |
|---|---|---|---|---|
| Methylation | HGNC | Chromosomes | 97,530 | 205 |
| Methylation | OMIM | Chromosomes | 14,407,269 | 6,095 |
| Gene expression | HGNC | Chromosomes | 86,052 | 80 |
| Gene expression | OMIM | Chromosomes | 12,535,829 | 4,679 |
The source column shows the name of the patient file that was linked to the target dataset named in the second column. The class column shows the type of resources that were linked. The fourth column shows the number of links that were generated while the runtime column shows the time required by LIMES to carry out the linking process in ms.
Figure 5. TCGA class diagram of RDFized results. Level 3 data are divided into three layers: layer 1 contains patient data, layer 2 consists of clinical information, and layer 3 contains results for different samples of a patient.
Figure 6. Linked TCGA schema diagram. The schema diagram of the Linked TCGA, useful for formulating SPARQL queries.
Figure 7. TopFed federated query processing model. TCGA tailored federated query processing diagram, showing system components.
Benchmark queries descriptions
| Query | Description |
|---|---|
| Q1 | Get the chromosome, start, stop and mean copy number values of the patient TCGA-18-4721 for genome locations 554268 to 5994290 |
| Q2 | Get the chromosome, start, stop and mean exon-expression values of all the TCGA patients |
| Q3 | Get the chromosome, position and mean methylation values of all the TCGA patients |
| Q4 | Get the chromosome, start and stop values of the TCGA patient TCGA-C4-A0F6 |
| Q5 | Get the chromosome, start, stop values of all the TCGA patients |
| Q6 | Get the chromosome, start, stop and miRNA values of the 20th record of TCGA patient TCGA-AB-2821 |
| Q7 | Get the chromosome, start and stop values of the TCGA patient TCGA-AB-2823 for mean sequence value of 0.0839 |
| Q8 | Get the chromosome, start, stop, mean protein expression and mean exon-expression values of the TCGA patient TCGA-18-3410 |
| Q9 | Get the chromosome, mean gene expression and mean methylation values of the TCGA patient TCGA-C5-A1BF |
| Q10 | Get the chromosome, mean gene expression, mean exon expression and mean methylation values of all the TCGA patients |
The corresponding SPARQL queries can be downloaded from http://goo.gl/UxUEXk.
Benchmark SPARQL endpoints specifications
| SPARQL endpoint | CPU | RAM | Hard disk |
|---|---|---|---|
| virtuoso-blue1 | 2.2 GHz, i3 | 4 GB | 300 GB |
| virtuoso-blue2 | 2.6 GHz, i5 | 4 GB | 150 GB |
| virtuoso-pink1 | 2.53 GHz, i5 | 4 GB | 300 GB |
| virtuoso-pink2 | 2.3 GHz, i5 | 4 GB | 500 GB |
| virtuoso-pink3 | 2.53 GHz, i5 | 4 GB | 300 GB |
| virtuoso-green1 | 2.9 GHz, i7 | 16 GB | 256 GB SSD |
| virtuoso-green2 | 2.9 GHz, i7 | 8 GB | 450 GB |
| virtuoso-green3 | 2.6 GHz, i5 | 8 GB | 400 GB |
| virtuoso-green4 | 2.6 GHz, i5 | 8 GB | 400 GB |
| virtuoso-green5 | 2.9 GHz, i7 | 16 GB | 500 GB |
Benchmark queries distribution
| | Single Colour | Cross-Colour |
|---|---|---|
| Star | 2 | 2 |
| Hybrid (star + path) | 2 | 4 |
Figure 8. Efficient source selection. Comparison of TopFed and FedX source selection in terms of the total number of triple pattern-wise sources selected. The Y-axis shows the total triple pattern-wise sources selected for each benchmark query on the X-axis.
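The metric on the Y-axis, the total number of triple pattern-wise sources selected, is the sum over a query's triple patterns of the number of endpoints chosen for each pattern. The Python sketch below illustrates the count only. The endpoint names follow the benchmark specification table, but the triple patterns and both selections are hypothetical and do not reproduce the actual choices made by FedX or TopFed.

```python
def total_triple_pattern_wise_sources(selection):
    """Sum, over all triple patterns, the number of sources selected for each."""
    return sum(len(sources) for sources in selection.values())


# The ten benchmark endpoints: two blue, three pink, five green.
endpoints = [
    f"virtuoso-{colour}{i}"
    for colour, count in (("blue", 2), ("pink", 3), ("green", 5))
    for i in range(1, count + 1)
]

# Hypothetical triple patterns of a three-pattern query.
patterns = ["?r tcga:chromosome ?c", "?r tcga:start ?s", "?r tcga:stop ?e"]

# Hypothetical selections: every endpoint per pattern versus two endpoints per pattern.
broad = {p: endpoints for p in patterns}
narrow = {p: ["virtuoso-blue1", "virtuoso-blue2"] for p in patterns}

print(total_triple_pattern_wise_sources(broad))   # 3 patterns x 10 endpoints = 30
print(total_triple_pattern_wise_sources(narrow))  # 3 patterns x 2 endpoints = 6
```

A lower total means fewer sub-queries are sent to endpoints that cannot contribute results.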
Comparison of average execution time for each query (based on a sample of 10)
| Query no | FedX (first run) execution time (ms) | FedX (cached) execution time (ms) | FedX (cached) S.E. | TopFed execution time (ms) | TopFed S.E. |
|---|---|---|---|---|---|
| 1 | 913 | 401.2 | 5.22 | 341.5* | 5.60 |
| 2 | 81619 | 81170.7 | 655.93 | 866.5* | 22.08 |
| 3 | 82271 | 81817.8 | 653.22 | 666* | 27.12 |
| 4 | 1199 | 367.6 | 6.88 | 262.7* | 7.35 |
| 5 | 80423 | 78723.5 | 459.43 | 78691.5 | 458.70 |
| 6 | 837 | 416.9 | 8.38 | 246.1* | 3.56 |
| 7 | 921 | 399.6 | 4.41 | 248.1* | 7.20 |
| 8 | 900 | 89 | 2.45 | 72.7* | 1.52 |
| 9 | 950.3 | 76.8 | 2.16 | 63.3* | 1.89 |
| 10 | 912 | 63.6 | 1.99 | 49.6* | 1.02 |
| Average | 25094.53 | 24352.67 | 180.01 | 8150.8 | 53.60 |
*Significant improvement.
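The Average row and the overall speed-up can be recomputed from the per-query values. The Python sketch below uses the FedX (cached) and TopFed execution times from the table above; the variable names are introduced here for illustration.

```python
from statistics import mean

# Execution times in ms for queries 1 to 10, copied from the table above.
fedx_cached = [401.2, 81170.7, 81817.8, 367.6, 78723.5, 416.9, 399.6, 89, 76.8, 63.6]
topfed = [341.5, 866.5, 666, 262.7, 78691.5, 246.1, 248.1, 72.7, 63.3, 49.6]

avg_fedx = mean(fedx_cached)
avg_topfed = mean(topfed)
overall_speedup = avg_fedx / avg_topfed

# Per-query ratio of FedX (cached) time to TopFed time.
per_query_speedup = {
    q: f / t for q, (f, t) in enumerate(zip(fedx_cached, topfed), start=1)
}

print(round(avg_fedx, 2), round(avg_topfed, 2), round(overall_speedup, 2))
print(max(per_query_speedup, key=per_query_speedup.get))
```

The recomputed means match the Average row (24352.67 ms and 8150.8 ms), so on average TopFed takes roughly one third of the cached FedX time. Most of that gain comes from queries 2 and 3, while query 5 is essentially unchanged.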
Comparison of source selection average execution time (based on a sampling of 10)
| Query no | FedX (first run) execution time (ms) | FedX (cached) execution time (ms) | FedX (cached) S.E. | TopFed execution time (ms) | TopFed S.E. |
|---|---|---|---|---|---|
| 1 | 530 | 11.7 | 0.35 | 28.1 | 0.98 |
| 2 | 487 | 11.4 | 0.67 | 5.2 | 0.57 |
| 3 | 470 | 11.9 | 0.78 | 5 | 0.42 |
| 4 | 510 | 12 | 0.52 | 23.6 | 1.57 |
| 5 | 473 | 9.8 | 0.65 | 4.8 | 0.29 |
| 6 | 371 | 9.9 | 0.38 | 21.7 | 0.68 |
| 7 | 521 | 10 | 0.39 | 24.4 | 0.76 |
| 8 | 483 | 9.5 | 0.45 | 29.5 | 0.86 |
| 9 | 496 | 9.8 | 0.39 | 20.1 | 0.99 |
| 10 | 456 | 10.6 | 0.40 | 7.4 | 0.58 |
| Average | 479.7 | 10.66 | 0.50 | 16.98 | 0.77 |
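Unlike total execution time, source selection time does not favour TopFed on every query. The Python sketch below lists the queries for which TopFed's source selection is faster than cached FedX, using the values from the table above; the variable names are introduced here for illustration.

```python
# Source selection times in ms for queries 1 to 10, copied from the table above.
fedx_cached = [11.7, 11.4, 11.9, 12, 9.8, 9.9, 10, 9.5, 9.8, 10.6]
topfed = [28.1, 5.2, 5, 23.6, 4.8, 21.7, 24.4, 29.5, 20.1, 7.4]

# Queries where TopFed selects sources faster than cached FedX.
topfed_faster = [
    q for q, (f, t) in enumerate(zip(fedx_cached, topfed), start=1) if t < f
]

# Average extra time TopFed spends on source selection, in ms.
avg_overhead = sum(topfed) / len(topfed) - sum(fedx_cached) / len(fedx_cached)

print(topfed_faster, round(avg_overhead, 2))
```

TopFed is faster on queries 2, 3, 5 and 10 and slower on the rest, for an average overhead of about 6 ms. That overhead is small next to the execution-time differences in the previous table.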