
Genomic Common Data Model for Seamless Interoperation of Biomedical Data in Clinical Practice: Retrospective Study.

Seo Jeong Shin¹, Seng Chan You², Yu Rang Park³, Jin Roh⁴, Jang-Hee Kim⁴, Seokjin Haam⁵, Christian G Reich⁶, Clair Blacketer⁷, Dae-Soon Son⁸, Seungbin Oh⁹, Rae Woong Park¹,².

Abstract

BACKGROUND: Clinical sequencing data should be shared in order to achieve the sufficient scale and diversity required to provide strong evidence for improving patient care. A distributed research network allows researchers to share this evidence rather than the patient-level data across centers, thereby avoiding privacy issues. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) used in distributed research networks has low coverage of sequencing data and does not reflect the latest trends of precision medicine.
OBJECTIVE: The aim of this study was to develop and evaluate the feasibility of a genomic CDM (G-CDM), as an extension of the OMOP-CDM, for application of genomic data in clinical practice.
METHODS: Existing genomic data models and sequencing reports were reviewed to extend the OMOP-CDM to cover genomic data. The Human Genome Organisation Gene Nomenclature Committee and Human Genome Variation Society nomenclature were adopted to standardize the terminology in the model. Sequencing data of 114 and 1060 patients with lung cancer were obtained from the Ajou University School of Medicine database of Ajou University Hospital and The Cancer Genome Atlas, respectively, which were transformed to a format appropriate for the G-CDM. The data were compared with respect to gene name, variant type, and actionable mutations.
RESULTS: The G-CDM was extended into four tables linked to tables of the OMOP-CDM. Upon comparison with The Cancer Genome Atlas data, a clinically actionable mutation, p.Leu858Arg, in the EGFR gene was 6.64 times more frequent in the Ajou University School of Medicine database, while the p.Gly12Xaa mutation in the KRAS gene was 2.02 times more frequent in The Cancer Genome Atlas dataset. The data-exploring tool GeneProfiler was further developed to conduct descriptive analyses automatically using the G-CDM, which provides the proportions of genes, variant types, and actionable mutations. GeneProfiler also allows for querying the specific gene name and Human Genome Variation Society nomenclature to calculate the proportion of patients with a given mutation.
CONCLUSIONS: We developed the G-CDM for effective integration of genomic data with standardized clinical data, allowing for data sharing across institutes. The feasibility of the G-CDM was validated by assessing the differences in data characteristics between two different genomic databases through the proposed data-exploring tool GeneProfiler. The G-CDM may facilitate analyses of interoperating clinical and genomic datasets across multiple institutions, minimizing privacy issues and enabling researchers to better understand the characteristics of patients and promote personalized medicine in clinical practice.
©Seo Jeong Shin, Seng Chan You, Yu Rang Park, Jin Roh, Jang-Hee Kim, Seokjin Haam, Christian G Reich, Clair Blacketer, Dae-Soon Son, Seungbin Oh, Rae Woong Park. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 26.03.2019.

Keywords: data visualization; databases, genetic; high-throughput nucleotide sequencing; multicenter study; patient privacy

Year: 2019 PMID: 30912749 PMCID: PMC6454347 DOI: 10.2196/13249

Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428

Introduction

Background

Recognition of the importance of clinical next-generation sequencing (NGS) in precision medicine has had a profound impact on improving medical care [1-3]. Patients’ sequencing data are currently generated through relatively large-scale projects aimed at exploring the role of clinical NGS in precision medicine conducted by organizations such as the American Association for Cancer Research Project GENIE [4] and the China Precision Medicine Initiative [5]. However, genomic data are considered to be privacy sensitive and potentially reidentifiable, which raises concerns about transmitting and sharing patient-level data outside of host institutions for collaborative research [6]. In addition, genomic sequencing data of subjects in a predefined cohort cannot reflect the full diversity of the entire population at the point of care, which limits the practical application of the data for research purposes [7]. There has been a recent widespread effort to collect genomic information on patients in clinical practice through routine laboratory tests by the UK Biobank [8] and Geisinger Health System [9]. Since March 2017, the South Korean government has provided conditional insurance for an NGS technology-based cancer gene panel [10], which is expected to lead to rapid accumulation of clinical sequencing data in each hospital. However, the vocabulary and structure of these datasets are not standardized, which makes it difficult to conduct appropriate multicenter or comparative analyses for clinical decision making [11]. This lack of standardization can be overcome by using the common data model (CDM), which applies the same data structure so that identical analysis code can be run by each data holder [12].
For example, the Informatics for Integrating Biology and the Bedside is a clinical data warehouse platform comprising genetic data that adopts the CDM to support the distributed research network [13,14], an infrastructure for novel internet-based strategies that allows researchers to use retrospective multicenter data in a CDM (in contrast to single-center or cloud-based research) without exporting the protected personal health information. Researchers can combine the results of an analysis code run over the network to generate a refined clinical hypothesis [12,15]. To date, the distributed research network has been adopted by global research collaboration groups, including the Observational Health Data Sciences and Informatics (OHDSI) consortium [16]. The Observational Medical Outcomes Partnership (OMOP) CDM, now in version 6.0, was developed by the OHDSI consortium and includes clinical data from over 20 countries, with information of 1.5 billion patients transformed to date.

Prior Work

Due to the nature and extraordinary complexity of sequencing data, it is challenging to effectively describe and interpret the status of sequence alterations [17]. Furthermore, sequencing data were applied in the clinical domain of NGS relatively later than other types of genomic tests; hence, the analytical process has not been standardized [18]. To improve the efficiency of data processing, sequencing data should be managed using standardized structures and semantics. Although several standard models for genomic data have been introduced to date, they have limited applicability. For example, the standard for non-NGS–specific data models, including the minimum information about a microarray experiment [19] for DNA microarray analysis, the tissue microarray object model [20] for tissue microarray analysis, and the proteomics experiment data repository [21] for proteomics, cannot be properly adopted for sequencing data. Although the minimum information about a high-throughput nucleotide sequencing experiment was developed as a data model specific for sequencing data, it requires experimental processing data and detailed analytical protocols to enable researchers to reproduce the analysis [22].

Aim

Given the limitations outlined above, the objective of this study was to create a genomic data CDM (G-CDM) for use in the distributed research network. To address patient privacy issues and support the diversity of genomic data such as ethnicity, the OMOP-CDM used in the OHDSI consortium was chosen for this study for expansion. Furthermore, we validated the feasibility of the model by exploring the difference in genomic data retrieved from public databases and clinical practice.

Methods

Construction of the Genomic Data Common Data Model

The proposed G-CDM was developed by extending the OMOP-CDM to achieve the seamless management of clinical sequencing data through a structured database model. Clinical information such as basic patient background (eg, sex and age), clinical diagnosis, procedures, or specimen type was stored in existing tables of the OMOP-CDM. We further reviewed other genomic data models and clinical sequencing reports to design additional tables for describing and interpreting sequence alterations occurring in target genes. There are various types (>50) of public cancer databases describing variants, including comprehensive cancer projects, resources, and cancer type-specific databases [23]. According to our inclusion and exclusion criteria (Multimedia Appendix 1), we selected datasets from The Cancer Genome Atlas (TCGA), Catalogue of Somatic Mutations in Cancer, and International Cancer Genome Consortium for review and reference, to define the method of sequence alteration description. The data quality of these representative databases has been validated through many studies and papers. The database TCGA provides large-scale datasets of genomic alterations, including insertions/deletions (INDELs) or single nucleotide polymorphisms (SNPs), discovered in over 30 human tumor types to generate comprehensive profiles of cancer genomics [24]. The database Catalogue of Somatic Mutations in Cancer provides somatic mutations across 1,391,372 tumor samples encompassing 5,977,977 coding mutations as of August 2018 [25], while the database International Cancer Genome Consortium provides the datasets of oncogenic mutations of 50 different cancer types to support large-scale studies [26,27]. We excluded the databases built based on non-NGS techniques or cancer type–specific databases from referencing. The ISO20428 document, which is a standard format for reporting sequencing results, was reviewed to design columns for variant annotation (Multimedia Appendix 2). 
To guarantee interoperability of the data, standard terminologies were adopted in the G-CDM [28,29]. The name of a human gene, a key factor in sequencing data, was fixed according to the nomenclature of the Human Genome Organisation Gene Nomenclature Committee, which currently contains and maintains approximately 41,000 unique gene symbols. In addition, the Human Genome Variation Society nomenclature was adopted to standardize the manner of describing sequence alterations in each gene at both the DNA and protein level. Although either one- or three-letter abbreviations are permitted in the Human Genome Variation Society nomenclature, we propose expressing the amino acid by its three-letter code only to permit seamless data analysis for widespread research (Multimedia Appendix 2).
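
The three-letter convention described above can be illustrated with a minimal Python sketch. It assumes that only simple protein-level substitutions (for example, "p.L858R") need converting; the function name `to_three_letter` and the regular expression are hypothetical and are not part of the G-CDM or of any Human Genome Variation Society tooling. Anything the pattern does not recognize is returned unchanged rather than guessed.

```python
import re

# Standard one-letter to three-letter amino acid codes.
# "X" is an unspecified residue (Xaa); "*" is a stop codon (Ter).
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "X": "Xaa", "*": "Ter",
}

# Hypothetical pattern: matches simple substitutions only, e.g. "p.L858R".
_SUBSTITUTION = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def to_three_letter(hgvs_p: str) -> str:
    """Rewrite a one-letter protein substitution in three-letter form."""
    match = _SUBSTITUTION.match(hgvs_p)
    if match is None:
        # Already three-letter, or a form this sketch does not handle.
        return hgvs_p
    ref, position, alt = match.groups()
    if ref not in AA3 or alt not in AA3:
        # Ambiguity codes such as "B" or "Z" are left untouched.
        return hgvs_p
    return f"p.{AA3[ref]}{position}{AA3[alt]}"


converted = [to_three_letter(v) for v in ("p.L858R", "p.T790M", "p.G12X")]
```

Normalizing at load time means that a later query for a given variant only has to match one spelling.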

Data Structure of the Genomic Data Common Data Model

To link clinical data in the OMOP-CDM, the following information on each patient with NGS data was stored in a separate corresponding table: Person, Condition_Occurrence (diagnosis), Procedure_Occurrence, Specimen, and Care_Site (Figure 1). The Person table included personal patient information such as individual identification, sex, age, and race. The Condition_Occurrence table contained information on the patient’s condition or diagnosis, including the disease such as “lung cancer” or condition type such as “primary condition.” The Procedure_Occurrence table included information on how the specimen used for NGS was obtained and the name of the genomic test conducted for a patient. The Specimen table included information on the specimen used for the genomic test, such as “target” (tumor tissue) and “reference” (normal tissue), along with specimen type, including paraffin-embedded slide, the date the specimen was obtained, and the anatomical site of the specimen. The Care_Site table included information on the site at which the genomic test was conducted.

Figure 1

Schematic diagram of the relationship between tables composing the genomic common data model. Tables in red (“Genomic_Test,” “Target_Gene,” “Variant_Occurrence,” and “Variant_Annotation”) are those storing genomic sequencing data and processes, whereas tables in blue (“Person,” “Condition_Occurrence,” “Procedure_Occurrence,” “Specimen,” and “Care_Site”) are those already existing in the Observational Medical Outcomes Partnership-common data model and store clinical data directly linked to the “Variant_Occurrence” and “Genomic_Test” tables. ID: identification; HGVS: Human Genome Variation Society; HGNC: Human Genome Organisation Gene Nomenclature Committee.

In addition to these five tables, we expanded the model to be linked to four other tables containing information related to the sequencing data: (1) the Genomic_Test table included the test name, version, sequencing device, analytical tools, and reference databases, with a care site identification column; (2) the Target_Gene table contained a list of genes targeted by the genomic test following Human Genome Organisation Gene Nomenclature Committee nomenclature for standardized gene symbols; (3) the Variant_Occurrence table included descriptive information about the variants of target genes; and (4) the Variant_Annotation table included information on each variant and the clinical interpretation thereof, such as annotation database name, variant origin such as somatic or germline, pathogenicity of the variant, allele frequency, and medication. Procedure identification for conducting sequencing, specimen identification of both the target and reference specimens, and target gene identification were included as foreign keys to link the information in the Procedure, Specimen, and Target_Gene tables. Data on reference sequence, reference SNP identification, Human Genome Variation Society nomenclature at both the DNA and protein levels, read depth, exon number, and variant type of both structural DNA and functional proteins were stored as variant description parameters. Detailed schemes and descriptions of each column and table used in the genomic extension model are provided in Multimedia Appendices 3 and 4.
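
The relationships among the four genomic tables can be sketched in Python with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the published schema: every column name below is hypothetical (the authoritative definitions are in Multimedia Appendices 3 and 4), SQLite stands in for the SQL Server backend, and the inserted row is demonstration data only.

```python
import sqlite3

# Hypothetical column names; the published schema may differ.
SCHEMA = """
CREATE TABLE genomic_test (
    genomic_test_id   INTEGER PRIMARY KEY,
    care_site_id      INTEGER,            -- links to OMOP Care_Site
    test_name         TEXT,
    test_version      TEXT,
    sequencing_device TEXT
);
CREATE TABLE target_gene (
    target_gene_id  INTEGER PRIMARY KEY,
    genomic_test_id INTEGER REFERENCES genomic_test(genomic_test_id),
    hgnc_symbol     TEXT                  -- standardized gene symbol
);
CREATE TABLE variant_occurrence (
    variant_occurrence_id   INTEGER PRIMARY KEY,
    procedure_occurrence_id INTEGER,      -- links to OMOP Procedure_Occurrence
    target_specimen_id      INTEGER,      -- tumor specimen
    reference_specimen_id   INTEGER,      -- normal specimen, may be NULL
    target_gene_id          INTEGER REFERENCES target_gene(target_gene_id),
    hgvs_c                  TEXT,
    hgvs_p                  TEXT,
    read_depth              INTEGER,
    exon_number             INTEGER,
    variant_type            TEXT
);
CREATE TABLE variant_annotation (
    variant_annotation_id INTEGER PRIMARY KEY,
    variant_occurrence_id INTEGER
        REFERENCES variant_occurrence(variant_occurrence_id),
    annotation_database   TEXT,
    variant_origin        TEXT,           -- somatic or germline
    pathogenicity         TEXT,
    allele_frequency      REAL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one illustrative variant."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO genomic_test VALUES (1, 10, 'demo panel', 'v1', 'demo device')"
    )
    conn.execute("INSERT INTO target_gene VALUES (1, 1, 'EGFR')")
    conn.execute(
        "INSERT INTO variant_occurrence VALUES "
        "(1, 100, 200, NULL, 1, 'c.2573T>G', 'p.Leu858Arg', 500, 21, 'SNP')"
    )
    conn.execute(
        "INSERT INTO variant_annotation VALUES "
        "(1, 1, 'demo annotation db', 'somatic', 'pathogenic', 0.25)"
    )
    return conn


def annotated_variants(conn: sqlite3.Connection) -> list:
    """Join each variant to its gene symbol and its annotation."""
    sql = """
        SELECT g.hgnc_symbol, v.hgvs_p, a.variant_origin
        FROM variant_occurrence AS v
        JOIN target_gene AS g
          ON g.target_gene_id = v.target_gene_id
        JOIN variant_annotation AS a
          ON a.variant_occurrence_id = v.variant_occurrence_id
    """
    return conn.execute(sql).fetchall()
```

Keeping annotations in a separate table, as in the model, lets a variant carry several interpretations without duplicating its descriptive row.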

Data Description

The Ajou University School of Medicine (AUSOM) database consists of electronic medical record data of patients who underwent NGS-based cancer panel screening of the tumor tissue between June 2017 and August 2018 at Ajou University Hospital, including 92 patients with lung adenocarcinoma and 22 patients with lung squamous cell carcinoma. Public sequence alteration data of the lung cancer cohort Pan-Lung Cancer study of TCGA [30] were obtained from the Memorial Sloan-Kettering Cancer Center cBioPortal [31]. The overall processes of NGS conducted at Ajou University Hospital and the TCGA database are detailed in Multimedia Appendix 5. Two representative differences between the sequencing pipelines of the two databases are the number of genes and the composition of variant types targeted in the test. For example, in the cancer panel of AUSOM, 49 cancer-related genes were targeted for sequencing, while the TCGA data were harvested using whole-exome sequencing with 16,896 genes. Thus, for development and testing of the proposed G-CDM, we selected 1060 patients from TCGA with available variant data of the 49 target genes selected in the AUSOM panel (Table 1).

Table 1

Description of data used to build the genomic common data model and to validate the data model.

Variable		AUSOM^a (N=114), n (%)	TCGA^b (N=1060), n (%)
Age (years)
	≤49	7 (6.1)	44 (4.2)
	50-59	26 (22.8)	163 (15.4)
	60-69	41 (36.0)	310 (29.2)
	70-79	35 (30.7)	317 (29.9)
	≥80	5 (4.4)	56 (5.2)
	Unknown	0 (0.0)	170 (16.0)
Gender
	Male	64 (56.1)	628 (59.0)
	Female	50 (43.9)	429 (41.0)
	Unknown	0 (0.0)	3 (0.2)
Pathology
	Lung adenocarcinoma	92 (80.7)	603 (56.9)
	Lung squamous carcinoma	22 (19.3)	457 (43.1)
Cancer stage
	Stage I	78 (68.4)	526 (49.6)
	Stage II	16 (14.0)	286 (27.0)
	Stage III	18 (15.8)	184 (17.4)
	Stage IV	0 (0.0)	36 (3.4)
	Unknown	2 (1.8)	28 (2.6)

^aAUSOM: Ajou University School of Medicine.

^bTCGA: The Cancer Genome Atlas.

The variant types, including SNPs, INDELs, multinucleotide polymorphisms (MNPs), copy number variants (CNVs), and translocations, were explored in the AUSOM database, whereas only SNPs and INDELs were identified in the TCGA database. Information on clinical characteristics such as age, sex, and disease status and genomic alterations such as variant type, DNA and protein level changes, and functional impact were used to compare the AUSOM and TCGA databases.

Study Design

Sequencing data of the TCGA database, which was licensed by Yonsei University for use, and of the AUSOM database were transformed into the G-CDM at Yonsei University and Ajou University, respectively. The transformation was executed with Structured Query Language (SQL) scripts, and Microsoft SQL Server 2017 served as the relational database backend for storing and querying the sequencing data. The G-CDM database was built using the Intel Xeon CPU E5-2596 v4 2.20 GHz, Java v.1.8.0, R v.3.5.1, and DBMS SQL Server 2017 at Ajou University, while the Intel Xeon Gold 6132 CPU 2.60 GHz, Java v.1.8.0, R v.3.4.4, and DBMS SQL Server 2017 were used at Yonsei University. After extracting parameters of interest for a cohort of patients by using the Condition_Occurrence table, the genetic information of the patients was summarized in each of the two institutions. Owing to the restrictions on exporting the original clinical sequencing data in the AUSOM database outside the hospital, the two institutions gathered only the descriptive statistical results and used them to compare the two sequencing databases. The data visualization tool “GeneProfiler” was developed to run on the G-CDM as a demonstration that the standardized structure and vocabulary system can serve as a usable medium for performing distributed research by allowing genomic analysis with identical code. To validate the feasibility of the G-CDM as a storage system and analysis medium, the differences in sequencing data between the AUSOM and TCGA databases were explored. The background profile of variants was described based on several aspects such as gene names, variant types, and disease subtypes. Representative actionable mutations for patients with non-small cell lung cancer (NSCLC) tend to occur in the EGFR, KRAS, PIK3CA, BRAF, and NRAS genes according to National Comprehensive Cancer Network guidelines [32,33].
Therefore, the proportions of actionable mutations in these five genes were compared between the two databases and between the subtypes of lung cancer.
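
The distributed comparison described above can be sketched as follows: each institution runs the same function on its own records and exports only an aggregate summary, and the comparison is then made on the summaries alone. The function names, the record layout `(person_id, gene, hgvs_p)`, and the toy records are assumptions made for illustration; they are not taken from the study's code or data.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, str]  # (person_id, gene symbol, protein-level HGVS)


def local_summary(records: Iterable[Record], n_patients: int,
                  gene: str, hgvs_p: str) -> Dict[str, int]:
    """Run inside one institution; only this aggregate leaves the site."""
    # A set counts each patient once, even with several matching rows.
    carriers = {pid for pid, g, p in records if g == gene and p == hgvs_p}
    return {"carriers": len(carriers), "patients": n_patients}


def proportion(summary: Dict[str, int]) -> float:
    """Share of patients carrying the queried mutation."""
    return summary["carriers"] / summary["patients"]


def fold_difference(summary_a: Dict[str, int],
                    summary_b: Dict[str, int]) -> float:
    """How many times more frequent the mutation is at site A than at site B."""
    prop_b = proportion(summary_b)
    if prop_b == 0:
        raise ValueError("mutation not observed at site B; ratio undefined")
    return proportion(summary_a) / prop_b


# Toy records, invented for this sketch only.
site_a = [(1, "EGFR", "p.Leu858Arg"), (1, "EGFR", "p.Leu858Arg"),
          (2, "KRAS", "p.Gly12Cys"), (3, "EGFR", "p.Leu858Arg")]
site_b = [(7, "EGFR", "p.Leu858Arg"), (8, "KRAS", "p.Gly12Cys")]

summary_a = local_summary(site_a, n_patients=4, gene="EGFR", hgvs_p="p.Leu858Arg")
summary_b = local_summary(site_b, n_patients=10, gene="EGFR", hgvs_p="p.Leu858Arg")
ratio = fold_difference(summary_a, summary_b)
```

Because both sites share one data structure and vocabulary, the same `local_summary` call works unchanged at each of them, which is the property the G-CDM is meant to provide.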

Data Visualization Tool

We developed a new data visualizing tool called “GeneProfiler” using the R Shiny package to facilitate the utility and accessibility of the G-CDM. After converting genomic data into the G-CDM, the data can be visualized by simply connecting the database with the graphic user interface (Figure 2). As users link their database into “GeneProfiler,” this tool automatically provides the descriptive statistics as several plots and tables. “GeneProfiler” includes action buttons to generate plots of overall variant profiles, proportion of certain mutation types, and proportion of genes with actionable mutations. Users can also freely explore the proportion of patients with mutations in specific genes or specific variants and can download the results as a plot or table to conduct distributed research. After downloading result tables of several databases from GeneProfiler, users can generate graphs comparing these databases by uploading the merged tables (Multimedia Appendix 6). The R Shiny code of “GeneProfiler” was uploaded and is open to the public in GitHub [34].

Figure 2

Data visualization tool for clinical sequencing data holders who converted their genomic data into genomic CDM. Users can (a) connect their genomic CDM database; (b) get analysis plots such as an overall profile, (c) mutation type, and (d) pathogeny of variants; and (e) search the proportion of patients with gene name and variant information. CDM: common data model.

Statistical Analysis

Descriptive analysis was performed using frequencies for categorical variables. Genomic characteristics were compared between the two databases using a chi-squared test, and values of P<.05 were considered statistically significant. The R program version 3.5.1 was used for data preprocessing and statistical analysis. A mutation waterfall plot was created using “GenVisR,” an R package available via Bioconductor [35], which also provided the proportions of genes, variant types, and specific variants using the R Shiny tool developed in this study.
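
As a worked illustration of the chi-squared comparison, the sketch below tests a 2x2 table in plain Python. It uses the pathology counts from Table 1 (92 and 22 in AUSOM; 603 and 457 in TCGA) purely as example input; the study does not report this particular test. The function omits the continuity correction that R's `chisq.test` applies to 2x2 tables by default, so its statistic will differ slightly from R's default output.

```python
import math
from typing import Tuple


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: AUSOM, TCGA. Columns: lung adenocarcinoma, lung squamous carcinoma.
statistic, p_value = chi_squared_2x2(92, 22, 603, 457)
significant = p_value < 0.05
```

For these counts the P value falls well below the .05 threshold, consistent with the different pathology mix of the two cohorts shown in Table 1.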

Ethics Statement

This study was approved by the institutional review board at Ajou University Hospital of Korea (IRB approval number: AJIRB-MED-MDB-18-390).

Results

Data Comparison for Model Validation

To confirm the differences between the AUSOM and TCGA databases, the summary results of the sequencing data such as the gene, variant type, and disease subtypes were gathered and compared. We characterized the biological background of total variants in both databases for variant types, with DNA-level structural variants classified as “sequence alteration” and protein functional types classified as “variant feature.” Among the SNPs, insertions, and deletions, the most frequent structural variant type was SNPs, accounting for >80% of total variants in both databases (Multimedia Appendix 7). However, the functional types of the variants, including missense, nonsense, frameshift, inframe, and splice variants, showed different frequencies between the databases (all P<.001), with intron and synonymous variants being most frequent in the AUSOM database (combined frequency of 83%) and missense variants being the most frequent in the TCGA database (73%; Multimedia Appendix 7). A waterfall plot was created for both the AUSOM and TCGA databases, which focused only on protein-altering variants such as missense, nonsense, frameshift, inframe, and splicing variants to obtain a variant profile (Figure 3; Multimedia Appendix 8). The 15 genes as a union of the top 10 genes in each database were selected as targets for overall profiling. In the AUSOM database, the top 10 genes had a variant frequency >75% among patients with lung cancer, whereas only one gene, TP53, had a variant frequency >25% in the TCGA database. In particular, EGFR variants showed very different frequencies in the AUSOM and TCGA databases (89.5% and 11.5%, respectively). All 15 genes had different proportions of variants in the two databases (all P<.001). Although the ranking of genes with high frequencies of variants differed between databases, the most frequent variant type was a missense variant in both databases (Figure 3).

Figure 3

Waterfall plot describing the variant profile of the top 10 genes in (a) Ajou University School of Medicine and (b) The Cancer Genome Atlas databases. Each row represents gene symbols ordered by their frequency of variants with different colors indicating different variant types. Columns represent each patient with only one sample per patient. The bar graph on the left corresponds to the frequency of variants in each gene. Clinical groups such as age, sex, and condition are shown in the bottom box. LUAD: lung adenocarcinoma; LUSC: lung squamous cell carcinoma.

In contrast, comparison of the waterfall plot of all 49 genes targeted in the cancer panel of the AUSOM database to that of the same gene set of the TCGA database showed a higher frequency of frameshift and nonsense type variants than splice type variants in the TCGA data, although the ranking of genes with more variants still differed between the two databases (Multimedia Appendix 8). Exploration of the CNVs in AUSOM showed that RET was the gene with the most frequent CNVs, specifically due to amplification (Multimedia Appendix 8).

Comparison of Actionable Mutations for Model Validation

An actionable mutation is a specific genomic event that potentially affects a patient’s response to a targeted therapy [36]. Of the five representative actionable mutations for NSCLC examined (EGFR, KRAS, PIK3CA, BRAF, and NRAS), EGFR showed the greatest frequency of variants in the AUSOM database (21.9%), while KRAS showed the greatest frequency of variants in the TCGA database (20.2%; Figure 4a). In particular, the point mutation p.Leu858Arg in EGFR was found in 17.5% of the patients, followed by p.Thr790Met (1.8%) in the AUSOM database (Figure 4b). Point mutations in the KRAS gene, such as p.Gly12Xaa and p.Gly13Xaa, were more frequent in the TCGA database (20.2%) than in the AUSOM database (9.7%; Figure 4a,c). In addition, patients with lung adenocarcinoma (Figure 4e-h) tended to have more actionable mutations than those with lung squamous cell carcinoma (Figure 4i-l).

Figure 4

Frequencies of actionable mutations detected in the sequencing process between the AUSOM and TCGA databases. Frequency is shown according to the (a, e, i) level of five selected genes and (b, f, j) actionable mutations in EGFR, (c, g, k) KRAS, and (d, h, l) others such as PIK3CA, BRAF, and NRAS. Frequency is also shown according to patient groups: (a-d) total, (e-h) lung adenocarcinoma, and (i-l) lung squamous cell carcinoma. AUSOM: Ajou University School of Medicine; TCGA: The Cancer Genome Atlas; LUAD: lung adenocarcinoma; LUSC: lung squamous cell carcinoma.

Discussion

Overview

We developed a new data model for clinical sequencing data, which was applied using sequencing data of patients with lung cancer from two different databases, AUSOM and TCGA, which were transformed into an identical format for the G-CDM. To evaluate the feasibility of the G-CDM, the composition of the datasets was compared with regard to the frequency of a gene name and variant types in which a sequence alteration occurred and to the prevalence of actionable mutations. Moreover, we developed novel user-friendly software—GeneProfiler—for visualization of clinical sequencing data.

Interpretation of the Principal Results

The first result obtained by comparison of the databases transformed in a standardized form for the G-CDM was the clear difference in the composition of the sequencing data between TCGA, a controlled research-oriented database, and AUSOM, an actual clinical practice database. This difference suggested a difference in variant frequencies and types between the two databases. Indeed, the total number of variants per patient was much higher for the AUSOM database than for the TCGA database, whereas the frequency of variants differed according to the variant type considered. Comparison of actionable mutations in five genes of NSCLC showed a much higher mutation frequency of EGFR in the AUSOM database (a cohort of Asian patients) than in the TCGA database (a cohort of American patients). This finding is in line with previous knowledge that Asian patients with NSCLC have a higher prevalence of EGFR mutations than Americans [32,37]. In contrast, actionable mutations in the KRAS gene were less prevalent in patients in the AUSOM database than in those in the TCGA database, which is also consistent with previous knowledge that Asian populations have a much lower rate of mutations in KRAS than non-Asian populations with NSCLC [32,37]. The second key result of this study is the conduct of multicenter research through internet-based sharing of analysis codes with CDM-based conversion of databases from different institutions. This is meaningful because the distributed research was conducted with genomic data that had not been previously verified. Such distributed research would be a useful strategy to address the problem of limited data integration due to privacy issues of clinical sequencing data. Moreover, because data from the TCGA database were generated relatively earlier than those in the AUSOM database, the sequencing equipment or bioinformatics method may have caused the observed differences.
These differences between the databases further emphasize the importance of analyzing data obtained from multiple clinical sites together with research-driven public data to obtain a higher level of representative evidence from diverse populations. Both genomic data models and intermediate results should be shared as widely as possible to promote clinical advances by overcoming the current challenges of unstructured and siloed data environments that lead to a lack of interoperability [38]. Our proposed OMOP-CDM extension model was developed by referencing the OHDSI distributed research network, because existing models such as the HL7 reference information model are not suitable for internet-based research and have limited practical use [39,40]. In the process of modeling the structure of the G-CDM, two specimen identifications were allocated in the Variant_Occurrence table, because recent methods of NGS testing in cancer patients tend to be based on a comparison of normal and tumor tissues simultaneously from the same individual. In cases of patients with a congenital disease, there is an option to fill out this field with only single-specimen identification. The contents of annotation to a variant can also differ according to the type or version of the annotation databases used in the annotation process. For this reason, the Variant_Annotation table was separated from the Variant_Occurrence table to allow for subsequent updating of diverse or new interpretations.

Limitations

Genomic data are generated using highly complicated sequencing pipelines and analytical processes; consequently, NGS data have inherent limitations in terms of data quality and reliability. Although we compared the sequencing pipelines and analytical processes used to generate the sequencing data of both the AUSOM and TCGA databases, we were unable to confirm the detailed parameters and options used in each process. Thus, the differences between the two databases found in this study should be interpreted considering the possibility that the data may have been generated by dissimilar methods and criteria. Moreover, the clinical NGS data used in this study were generated in the clinical practice of Ajou University Hospital within the last 2 years. Given the recent time frame, mortality was rare among these patients; thus, we were not able to perform survival analysis by leveraging both genomic data and clinical data. The G-CDM, as a common data structure and vocabulary system, minimizes privacy issues when conducting multicenter studies by integrating statistical results of the same analysis code rather than sharing the clinical sequencing data directly. However, when the G-CDM is used for repeated queries with a malicious purpose, there is concern for compromising the privacy of the individual, even if the queries target only the aggregated statistics. The G-CDM can be complemented by inhibiting reidentification attacks, as proposed in previous studies related to the mitigation of privacy risks, through limiting response to a query targeting a unique individual or through introduction of noise into the original data [41,42].
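
The two mitigations mentioned above can be sketched on the aggregate side as follows. This is a simplified illustration, not the method of the cited studies: the function name `guarded_count`, the minimum cell size of 5, and the choice of Laplace-style noise are all assumptions, and the noise here is added to the released count rather than to the original data.

```python
import random
from typing import Optional


def guarded_count(true_count: int, min_cell: int = 5,
                  noise_scale: float = 0.0,
                  rng: Optional[random.Random] = None) -> Optional[int]:
    """Return a count that is safer to release outside the institution.

    Counts between 1 and min_cell - 1 are suppressed (None) so that a
    query cannot single out a unique individual. If noise_scale > 0,
    Laplace-style noise is added to the counts that are released.
    """
    if 0 < true_count < min_cell:
        return None
    if noise_scale > 0:
        rng = rng or random.Random()
        # The difference of two exponential draws is Laplace distributed.
        noise = (rng.expovariate(1 / noise_scale)
                 - rng.expovariate(1 / noise_scale))
        return max(0, round(true_count + noise))
    return true_count


suppressed = guarded_count(1)    # too few patients to release
released = guarded_count(20)     # large enough to release as is
noisy = guarded_count(20, noise_scale=2.0, rng=random.Random(0))
```

A guard of this kind would sit between the local query and the exported result table, so that repeated narrow queries return either nothing or a perturbed value.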

Conclusions

We propose the distributed research network–based G-CDM as a starting point for a broad community discussion on genomic data–based precision medicine. Using the G-CDM developed in this study, the data validation process identified differences between clinical NGS data derived from clinical practice and those derived from prospective research. We believe that the construction and adoption of this standard data model will increase the usefulness of clinical NGS data.

34 in total

1. Integration of clinical and genetic data in the i2b2 architecture.

Authors: Shawn N Murphy; Michael E Mendis; David A Berkowitz; Isaac Kohane; Henry C Chueh
Journal: AMIA Annu Symp Proc Date: 2006

2. HL7 version 3--an object-oriented methodology for collaborative standards development.

Authors: G W Beeler
Journal: Int J Med Inform Date: 1998-02 Impact factor: 4.046

3. The Path to Routine Genomic Screening in Health Care.

Authors: Michael F Murray
Journal: Ann Intern Med Date: 2018-07-31 Impact factor: 25.391

4. Using multiplexed assays of oncogenic drivers in lung cancers to select targeted drugs.

Authors: Mark G Kris; Bruce E Johnson; Lynne D Berry; David J Kwiatkowski; A John Iafrate; Ignacio I Wistuba; Marileila Varella-Garcia; Wilbur A Franklin; Samuel L Aronson; Pei-Fang Su; Yu Shyr; D Ross Camidge; Lecia V Sequist; Bonnie S Glisson; Fadlo R Khuri; Edward B Garon; William Pao; Charles Rudin; Joan Schiller; Eric B Haura; Mark Socinski; Keisuke Shirai; Heidi Chen; Giuseppe Giaccone; Marc Ladanyi; Kelly Kugler; John D Minna; Paul A Bunn
Journal: JAMA Date: 2014-05-21 Impact factor: 56.272

5. Precision Oncology: The UC San Diego Moores Cancer Center PREDICT Experience.

Authors: Maria Schwaederle; Barbara A Parker; Richard B Schwab; Gregory A Daniels; David E Piccioni; Santosh Kesari; Teresa L Helsten; Lyudmila A Bazhenova; Julio Romero; Paul T Fanta; Scott M Lippman; Razelle Kurzrock
Journal: Mol Cancer Ther Date: 2016-02-12 Impact factor: 6.261

6. International network of cancer genome projects.

Authors: Thomas J Hudson; Warwick Anderson; Axel Artez; Anna D Barker; Cindy Bell; Rosa R Bernabé; M K Bhan; Fabien Calvo; Iiro Eerola; Daniela S Gerhard; Alan Guttmacher; Mark Guyer; Fiona M Hemsley; Jennifer L Jennings; David Kerr; Peter Klatt; Patrik Kolar; Jun Kusada; David P Lane; Frank Laplace; Lu Youyong; Gerd Nettekoven; Brad Ozenberger; Jane Peterson; T S Rao; Jacques Remacle; Alan J Schafer; Tatsuhiro Shibata; Michael R Stratton; Joseph G Vockley; Koichi Watanabe; Huanming Yang; Matthew M F Yuen; Bartha M Knoppers; Martin Bobrow; Anne Cambon-Thomsen; Lynn G Dressler; Stephanie O M Dyke; Yann Joly; Kazuto Kato; Karen L Kennedy; Pilar Nicolás; Michael J Parker; Emmanuelle Rial-Sebbag; Carlos M Romeo-Casabona; Kenna M Shaw; Susan Wallace; Georgia L Wiesner; Nikolajs Zeps; Peter Lichter; Andrew V Biankin; Christian Chabannon; Lynda Chin; Bruno Clément; Enrique de Alava; Françoise Degos; Martin L Ferguson; Peter Geary; D Neil Hayes; Thomas J Hudson; Amber L Johns; Arek Kasprzyk; Hidewaki Nakagawa; Robert Penny; Miguel A Piris; Rajiv Sarin; Aldo Scarpa; Tatsuhiro Shibata; Marc van de Vijver; P Andrew Futreal; Hiroyuki Aburatani; Mónica Bayés; David D L Botwell; Peter J Campbell; Xavier Estivill; Daniela S Gerhard; Sean M Grimmond; Ivo Gut; Martin Hirst; Carlos López-Otín; Partha Majumder; Marco Marra; John D McPherson; Hidewaki Nakagawa; Zemin Ning; Xose S Puente; Yijun Ruan; Tatsuhiro Shibata; Michael R Stratton; Hendrik G Stunnenberg; Harold Swerdlow; Victor E Velculescu; Richard K Wilson; Hong H Xue; Liu Yang; Paul T Spellman; Gary D Bader; Paul C Boutros; Peter J Campbell; Paul Flicek; Gad Getz; Roderic Guigó; Guangwu Guo; David Haussler; Simon Heath; Tim J Hubbard; Tao Jiang; Steven M Jones; Qibin Li; Nuria López-Bigas; Ruibang Luo; Lakshmi Muthuswamy; B F Francis Ouellette; John V Pearson; Xose S Puente; Victor Quesada; Benjamin J Raphael; Chris Sander; Tatsuhiro Shibata; Terence P Speed; Lincoln D Stein; Joshua M Stuart; Jon W Teague; Yasushi Totoki; 
Tatsuhiko Tsunoda; Alfonso Valencia; David A Wheeler; Honglong Wu; Shancen Zhao; Guangyu Zhou; Lincoln D Stein; Roderic Guigó; Tim J Hubbard; Yann Joly; Steven M Jones; Arek Kasprzyk; Mark Lathrop; Nuria López-Bigas; B F Francis Ouellette; Paul T Spellman; Jon W Teague; Gilles Thomas; Alfonso Valencia; Teruhiko Yoshida; Karen L Kennedy; Myles Axton; Stephanie O M Dyke; P Andrew Futreal; Daniela S Gerhard; Chris Gunter; Mark Guyer; Thomas J Hudson; John D McPherson; Linda J Miller; Brad Ozenberger; Kenna M Shaw; Arek Kasprzyk; Lincoln D Stein; Junjun Zhang; Syed A Haider; Jianxin Wang; Christina K Yung; Anthony Cros; Anthony Cross; Yong Liang; Saravanamuttu Gnaneshan; Jonathan Guberman; Jack Hsu; Martin Bobrow; Don R C Chalmers; Karl W Hasel; Yann Joly; Terry S H Kaan; Karen L Kennedy; Bartha M Knoppers; William W Lowrance; Tohru Masui; Pilar Nicolás; Emmanuelle Rial-Sebbag; Laura Lyman Rodriguez; Catherine Vergely; Teruhiko Yoshida; Sean M Grimmond; Andrew V Biankin; David D L Bowtell; Nicole Cloonan; Anna deFazio; James R Eshleman; Dariush Etemadmoghadam; Brooke B Gardiner; Brooke A Gardiner; James G Kench; Aldo Scarpa; Robert L Sutherland; Margaret A Tempero; Nicola J Waddell; Peter J Wilson; John D McPherson; Steve Gallinger; Ming-Sound Tsao; Patricia A Shaw; Gloria M Petersen; Debabrata Mukhopadhyay; Lynda Chin; Ronald A DePinho; Sarah Thayer; Lakshmi Muthuswamy; Kamran Shazand; Timothy Beck; Michelle Sam; Lee Timms; Vanessa Ballin; Youyong Lu; Jiafu Ji; Xiuqing Zhang; Feng Chen; Xueda Hu; Guangyu Zhou; Qi Yang; Geng Tian; Lianhai Zhang; Xiaofang Xing; Xianghong Li; Zhenggang Zhu; Yingyan Yu; Jun Yu; Huanming Yang; Mark Lathrop; Jörg Tost; Paul Brennan; Ivana Holcatova; David Zaridze; Alvis Brazma; Lars Egevard; Egor Prokhortchouk; Rosamonde Elizabeth Banks; Mathias Uhlén; Anne Cambon-Thomsen; Juris Viksna; Fredrik Ponten; Konstantin Skryabin; Michael R Stratton; P Andrew Futreal; Ewan Birney; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; 
Sancha Martin; Jorge S Reis-Filho; Andrea L Richardson; Christos Sotiriou; Hendrik G Stunnenberg; Giles Thoms; Marc van de Vijver; Laura van't Veer; Fabien Calvo; Daniel Birnbaum; Hélène Blanche; Pascal Boucher; Sandrine Boyault; Christian Chabannon; Ivo Gut; Jocelyne D Masson-Jacquemier; Mark Lathrop; Iris Pauporté; Xavier Pivot; Anne Vincent-Salomon; Eric Tabone; Charles Theillet; Gilles Thomas; Jörg Tost; Isabelle Treilleux; Fabien Calvo; Paulette Bioulac-Sage; Bruno Clément; Thomas Decaens; Françoise Degos; Dominique Franco; Ivo Gut; Marta Gut; Simon Heath; Mark Lathrop; Didier Samuel; Gilles Thomas; Jessica Zucman-Rossi; Peter Lichter; Roland Eils; Benedikt Brors; Jan O Korbel; Andrey Korshunov; Pablo Landgraf; Hans Lehrach; Stefan Pfister; Bernhard Radlwimmer; Guido Reifenberger; Michael D Taylor; Christof von Kalle; Partha P Majumder; Rajiv Sarin; T S Rao; M K Bhan; Aldo Scarpa; Paolo Pederzoli; Rita A Lawlor; Massimo Delledonne; Alberto Bardelli; Andrew V Biankin; Sean M Grimmond; Thomas Gress; David Klimstra; Giuseppe Zamboni; Tatsuhiro Shibata; Yusuke Nakamura; Hidewaki Nakagawa; Jun Kusada; Tatsuhiko Tsunoda; Satoru Miyano; Hiroyuki Aburatani; Kazuto Kato; Akihiro Fujimoto; Teruhiko Yoshida; Elias Campo; Carlos López-Otín; Xavier Estivill; Roderic Guigó; Silvia de Sanjosé; Miguel A Piris; Emili Montserrat; Marcos González-Díaz; Xose S Puente; Pedro Jares; Alfonso Valencia; Heinz Himmelbauer; Heinz Himmelbaue; Victor Quesada; Silvia Bea; Michael R Stratton; P Andrew Futreal; Peter J Campbell; Anne Vincent-Salomon; Andrea L Richardson; Jorge S Reis-Filho; Marc van de Vijver; Gilles Thomas; Jocelyne D Masson-Jacquemier; Samuel Aparicio; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; Hendrik G Stunnenberg; Laura van't Veer; Douglas F Easton; Paul T Spellman; Sancha Martin; Anna D Barker; Lynda Chin; Francis S Collins; Carolyn C Compton; Martin L Ferguson; Daniela S Gerhard; Gad Getz; Chris Gunter; Alan Guttmacher; Mark Guyer; D Neil Hayes; 
Eric S Lander; Brad Ozenberger; Robert Penny; Jane Peterson; Chris Sander; Kenna M Shaw; Terence P Speed; Paul T Spellman; Joseph G Vockley; David A Wheeler; Richard K Wilson; Thomas J Hudson; Lynda Chin; Bartha M Knoppers; Eric S Lander; Peter Lichter; Lincoln D Stein; Michael R Stratton; Warwick Anderson; Anna D Barker; Cindy Bell; Martin Bobrow; Wylie Burke; Francis S Collins; Carolyn C Compton; Ronald A DePinho; Douglas F Easton; P Andrew Futreal; Daniela S Gerhard; Anthony R Green; Mark Guyer; Stanley R Hamilton; Tim J Hubbard; Olli P Kallioniemi; Karen L Kennedy; Timothy J Ley; Edison T Liu; Youyong Lu; Partha Majumder; Marco Marra; Brad Ozenberger; Jane Peterson; Alan J Schafer; Paul T Spellman; Hendrik G Stunnenberg; Brandon J Wainwright; Richard K Wilson; Huanming Yang
Journal: Nature Date: 2010-04-15 Impact factor: 49.962

Review 7. Defining actionable mutations for oncology therapeutic development.

Authors: T Hedley Carr; Robert McEwen; Brian Dougherty; Justin H Johnson; Jonathan R Dry; Zhongwu Lai; Zara Ghazoui; Naomi M Laing; Darren R Hodgson; Francisco Cruzalegui; Simon J Hollingsworth; J Carl Barrett
Journal: Nat Rev Cancer Date: 2016-04-26 Impact factor: 60.716

8. PEDRo: a database for storing, searching and disseminating experimental proteomics data.

Authors: Kevin Garwood; Thomas McLaughlin; Chris Garwood; Scott Joens; Norman Morrison; Christopher F Taylor; Kathleen Carroll; Caroline Evans; Anthony D Whetton; Sarah Hart; David Stead; Zhikang Yin; Alistair J P Brown; Andrew Hesketh; Keith Chater; Lena Hansson; Muriel Mewissen; Peter Ghazal; Julie Howard; Kathryn S Lilley; Simon J Gaskell; Andy Brass; Simon J Hubbard; Stephen G Oliver; Norman W Paton
Journal: BMC Genomics Date: 2004-09-17 Impact factor: 3.969

9. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.

Authors: Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
Journal: PLoS Med Date: 2015-03-31 Impact factor: 11.069

10. National Healthcare Service and Its Big Data Analytics.

Authors: Da Jeong Nam; Hyuk Won Kwon; Haeyeon Lee; Eun Kyung Ahn
Journal: Healthc Inform Res Date: 2018-07-31

5 in total

1. Identification of High-Order Single-Nucleotide Polymorphism Barcodes in Breast Cancer Using a Hybrid Taguchi-Genetic Algorithm: Case-Control Study.

Authors: Cheng-Hong Yang; Li-Yeh Chuang; Cheng-San Yang; Huai-Shuo Yang
Journal: JMIR Med Inform Date: 2020-06-17

2. EHR-Independent Predictive Decision Support Architecture Based on OMOP.

Authors: Philipp Unberath; Hans Ulrich Prokosch; Julian Gründner; Marcel Erpenbeck; Christian Maier; Jan Christoph
Journal: Appl Clin Inform Date: 2020-06-03 Impact factor: 2.342

Review 3. OMOP CDM Can Facilitate Data-Driven Studies for Cancer Prediction: A Systematic Review.

Authors: Najia Ahmadi; Yuan Peng; Markus Wolfien; Michéle Zoch; Martin Sedlmayr
Journal: Int J Mol Sci Date: 2022-10-05 Impact factor: 6.208

4. Integrating Genomics and Clinical Data for Statistical Analysis by Using GEnome MINIng (GEMINI) and Fast Healthcare Interoperability Resources (FHIR): System Design and Implementation.

Authors: Julian Gruendner; Nicolas Wolf; Lars Tögel; Florian Haller; Hans-Ulrich Prokosch; Jan Christoph
Journal: J Med Internet Res Date: 2020-10-07 Impact factor: 5.428

5. Development and Validation of the Radiology Common Data Model (R-CDM) for the International Standardization of Medical Imaging Data.

Authors: ChulHyoung Park; Seng Chan You; Hokyun Jeon; Chang Won Jeong; Jin Wook Choi; Rae Woong Park
Journal: Yonsei Med J Date: 2022-01 Impact factor: 2.759
