Hyo Soung Cha, Jip Min Jung, Seob Yoon Shin, Young Mi Jang, Phillip Park, Jae Wook Lee, Seung Hyun Chung, Kui Son Choi.
Abstract
Data warehousing is the most important technology to address recent advances in precision medicine. However, a generic clinical data warehouse does not address unstructured and insufficient data. In precision medicine, it is essential to develop a platform that can collect and utilize data. Data were collected from electronic medical records, genomic sequences, tumor biopsy specimens, and national cancer control initiative databases in the National Cancer Center (NCC), Korea. Data were de-identified and stored in a safe and independent space. Unstructured clinical data were standardized and incorporated into cancer registries and linked to cancer genome sequences and tumor biopsy specimens. Finally, national cancer control initiative data from the public domain were independently organized and linked to cancer registries. We constructed a system for integrating and providing various cancer data called the Korea Cancer Big Data Platform (K-CBP). Although the K-CBP could be used for cancer research, the legal and regulatory aspects of data distribution and usage need to be addressed first. Nonetheless, the system will continue collecting data from cancer-related resources that will hopefully facilitate precision-based research.
Keywords: big data platform; cancer data; clinical cancer registry; de-identification
Year: 2019 PMID: 31261630 PMCID: PMC6651426 DOI: 10.3390/ijerph16132290
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Overview of the integrated Korea Cancer Big Data Platform (K-CBP) by data type. ECOG, Eastern Cooperative Oncology Group performance status; EMR, electronic medical record; NGS, next-generation sequencing.
Figure 2. Clinical cancer registry construction and utilization process. ETL, extract, transform, and load.
Figure 3. Data validation process. Temp, temporary.
Figure 4. Dataset configuration of the Korea Cancer Big Data Platform (K-CBP). NCC, National Cancer Center.
Number and type of items integrated in the Korea Cancer Big Data Platform (K-CBP).
| Category | Subcategory | Number of Subjects | Description of Features |
|---|---|---|---|
| Number of patients | | 515,780 | |
| Medical records | Pathology records | 925,599 | Specimen type, method of examination, clinical and pathological diagnosis |
| | Order sheet | 11,703,931 | Orders related to treatment and discharge |
| Tumor bank | Blood sample | 32,760 | Information on pathologic stages; normal and tumor tissues, sample status, and location information |
| | Tissue sample | 17,813 | |
| Genomics | NGS test | 280 | Mutations detected on panel-based NGS tests |
| Clinical cancer registries at the NCC Hospital | Cancer – core features | 144,944 | Patient information, diagnosis, images, normal and tumor tissues, surgery, pathological and clinical stages, chemotherapy, radiation therapy, recurrence and metastasis, death |
| | Prostate cancer | 5167 | |
| | Lung cancer | 24,504 | |
| | Pancreatobiliary cancer | 9966 | |
| | Kidney cancer | 2886 | |
| | Ovarian cancer | 4240 | |
| | Colorectal cancer | 17,328 | |
| | Liver cancer | 12,932 | |
| | Breast cancer | 18,287 | |
| | Gastric cancer | 15,056 | |
| | Thyroid cancer | 10,404 | |
| Cancer statistical registry data from the National Cancer Control initiatives | Korea Central Cancer Registry | 2,745,050 | Nationwide data on the diagnosis and treatment of cancer and survival of patients |
| | National Cancer Screening Program | 90,197,402 | Data obtained from nationwide screening for stomach, liver, colorectal, breast, and uterine cervix cancers |
| | Financial Aid Program for Cancer Patients | 543,325 | Data relevant to financial aid for low-income cancer patients |
| | Hospice and Palliative Care | 56,433 | Information on performance status (ECOG), admission to and discharge from hospice institutions, and the use of hospice care |
| Clinical cancer registries from external sources | Prostate cancer | 7934 | Complications, surgery |
| | Lung cancer | 3496 | Results of biopsy, gene mutation, surgery |
| | Pancreatic cancer | 538 | Tumor, physical examination findings, surgery |
ECOG, Eastern Cooperative Oncology Group; NGS, next-generation sequencing; NCC, National Cancer Center.
Features of structured, unstructured, and manual input data in the clinical cancer registry by cancer type.
| Cancer Type | Structured | Unstructured | Manual Input | Total |
|---|---|---|---|---|
| Prostate cancer | 165 | 66 | 13 | 244 |
| Lung cancer | 146 | 85 | 4 | 235 |
| Pancreatobiliary cancer | 319 | 34 | 54 | 407 |
| Kidney cancer | 369 | 70 | 41 | 480 |
| Ovarian cancer | 428 | 59 | 32 | 519 |
| Colorectal cancer | 230 | 51 | 84 | 365 |
| Liver cancer | 216 | 84 | 50 | 350 |
| Breast cancer | 228 | 85 | 27 | 340 |
| Gastric cancer | 175 | 141 | 14 | 330 |
| Thyroid cancer | 156 | 244 | 4 | 404 |
Structured data, data that can be represented by a specific number or word and whose format is roughly defined; unstructured data, free-text format data; manual input data, data that cannot be automatically entered through computerization.
Definition of terms.
| Term | Definition |
|---|---|
| Alternative patient key | A primary key that replaces a direct identifier with a random 8-digit number |
| De-identification | Elimination of direct identifiers and quasi-identifiers so that individuals cannot be identified |
| Clinical cancer registry | Outcome data such as diagnosis, treatment, and surgery that are selected among cancer clinical data; dataset refined in a form that can be used meaningfully |
| National cancer control initiative data | Cancer-related data collected under a nationally led project |
| External clinical cancer registry | Cancer-related clinical data from multiple institutions, including diagnosis, treatment, or surgery; dataset selected and refined for outcome data |
| Structured data | Data that can be represented by a specific number or word and whose format is standardized |
| Unstructured data | Free-text format data |
| Manual input data | Data that cannot be automatically imported through computerization |
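The "alternative patient key" defined above can be illustrated with a minimal sketch. This is not the K-CBP implementation, which the record does not describe; it only assumes that each distinct direct identifier receives one unique random 8-digit number. The function names and the field names `resident_no`, `name`, and `alt_patient_key` are hypothetical, and the sketch handles direct identifiers only, not quasi-identifiers.

```python
import secrets


def assign_alternative_keys(direct_ids):
    """Give each distinct direct identifier its own random 8-digit key (hypothetical helper)."""
    key_for = {}
    used = set()
    for direct_id in direct_ids:
        if direct_id in key_for:
            continue  # a repeated patient keeps the key already assigned
        while True:
            # Draw uniformly from 10000000..99999999 so the key always has 8 digits.
            candidate = 10_000_000 + secrets.randbelow(90_000_000)
            if candidate not in used:
                break
        used.add(candidate)
        key_for[direct_id] = candidate
    return key_for


def replace_identifiers(records, key_for, id_field="resident_no", drop_fields=("name",)):
    """Return copies of the records with direct identifiers removed and the key added."""
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k != id_field and k not in drop_fields}
        row["alt_patient_key"] = key_for[record[id_field]]
        cleaned.append(row)
    return cleaned


# Made-up example rows; the identifier and field names are assumptions.
records = [
    {"resident_no": "A-001", "name": "Patient One", "diagnosis": "C34"},
    {"resident_no": "A-002", "name": "Patient Two", "diagnosis": "C50"},
    {"resident_no": "A-001", "name": "Patient One", "diagnosis": "C78"},
]
key_for = assign_alternative_keys(r["resident_no"] for r in records)
deidentified = replace_identifiers(records, key_for)
```

In this sketch the two rows for `A-001` share one key, so records for the same patient can still be linked across sources after the direct identifier is dropped. The mapping in `key_for` can re-identify patients and would need to be stored separately from the de-identified data.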
Figure 5. A screenshot of the Korea Cancer Big Data Platform (K-CBP) clinical cancer registry website showing the summary statistics for lung cancer as of December 2018.
Figure 6. A screenshot of the National Cancer Data Center (NCDC) webpage displaying a list of items and a description for ovarian cancer (https://www.cancerdata.kr/dataSetMetaLst.do).