Jong Cheol Jeong, Isaac Hands, Jill M Kolesar, Mahadev Rao, Bront Davis, York Dobyns, Joseph Hurt-Mueller, Justin Levens, Jenny Gregory, John Williams, Lisa Witt, Eun Mi Kim, Carlee Burton, Amir A Elbiheary, Mingguang Chang, Eric B Durbin.
Abstract
BACKGROUND: Public Data Commons (PDC) have been highlighted in the scientific literature for their capacity to collect and harmonize big data. In contrast, local data commons (LDC), located within an institution or organization, have been underrepresented in the scientific literature, even though they are a critical part of research infrastructure. Being closest to the sources of data, LDCs can collect and maintain the most up-to-date, high-quality data within an organization. As data providers, LDCs face many challenges in both collecting and standardizing data. As consumers of PDC, they also face data harmonization problems stemming from the monolithic harmonization pipeline designs commonly adopted by many PDCs. Unfortunately, existing guidelines and resources for building and maintaining data commons focus exclusively on PDC and provide very little information on LDC.
Keywords: Cancer registry; Clinical data; Data harmonization; Data integration; Data standardization; End-to-end model; Genomic data; Local data commons; Public data commons
Year: 2022 PMID: 36151511 PMCID: PMC9502580 DOI: 10.1186/s12859-022-04922-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Three types of Local Data Commons (LDC) service models: (a) Liaison model; (b) Enterprise model; (c) Network model; (d) End-to-end data sharing model
Fig. 2 Applications and workflow: (a) Online molecular report explorer utilizing LabKey. (b) Online statistics tool for molecular reports utilizing Tableau. (c) MCC cBioPortal management system utilizing the end-to-end model. (d) Data linkage application gathering all information in one view and committing linkage with one click. Identifiable patient information is masked with a black box. (e) MCC Molecular Tumor Board on-demand data visualization tool
Fig. 3 Divide-and-conquer and bottom-up data integration method: (a) Sequencing data are divided by sequencing vendor, assay type, research type, etc., and then integrated into the LDC. (b) LDC data produced by the divide-and-conquer approach are submitted to the PDC (bottom-up). Users can choose between original and harmonized data. Note that an LDC can be both a consumer and a data provider of a PDC
Fig. 4 LDC's core services and workflow: (a) The LDC can facilitate molecular data processing by orchestrating core services and interacting with other facilities. The LDC can help data collection by utilizing REDCap or EMR applications. (b) Workflow for processing molecular data. (c) Role-based access control and user services for data access
Essential data elements of patients and samples
| Category | Data elements |
|---|---|
| Patient demographics | Name, SSN, sex, race, birth date, address, etc |
| Disease diagnosis | Date, site, histology, behavior, grade, stage, etc |
| Disease treatment | Course, date, type (surgery, radiation, chemo, etc.), agents |
| Long term disease outcomes | Date of last contact, vital status, recurrence status, etc |
| Specimen ID | Unique specimen ID |
| Specimen Site | Blood or body tissue that is taken for medical testing |
| Specimen Type | Fresh frozen, FFPE, slide, etc |
| Date of Collection | Specimen collection date |
| Tumor specimen | Total tissue volume, tumor purity by stain, tumor nuclei percentage, etc |
Essential tools for genomic data commons
| Products | Information |
|---|---|
| FastQC | |
| Trimmomatic | |
| IGV | |
| SeqMonk | |
| UCSC | |
| UCSC LiftOver | |
| Picard LiftOver | |
| Chain Files (hg38 to hg19) | |
| Chain Files (hg19 to hg38) | |
| Funcotator | |
| OncoKB | |
| ANNOVAR | |
| VEP | |
| ClinVar | |
| VarScan | |
| fmi-converter | |
| VCF2MAF | |
| BAM2VCF | |
| samtools | |
| bedtools | |
| vcftools | |
| bcftools | |
| gnomAD | |
| Genome (hg38) | |
| Genome (hg19) | |
| Broad Institute Data Bundle | |
| UCSC Table Browser | |
| Sequencing Vendor Specific Data | FASTQ, SAM, BAM, CRAM, BED, XML, PDF, etc |
| cBioPortal | |
| JupyterHub | |
| Genomic Data Commons | |
| GDC Data Access | |
| GDC Pipelines | |
| Cancer Genomic Data Server | |
| cBioPortal R package | |
Basic genomic data dictionary
| Report type | Data elements |
|---|---|
| Variant Type | SNP, insertion, deletion, copy number variant, rearrangement |
| Mutation | Gene name, position, coding sequence effect, protein effect, allele fraction, transcript ID, strand |
| Copy Number Variant (CNV) | Copy Number, gene name, involved exons, position, CNV type (e.g., loss, amplification) |
| Rearrangement | Gene names, positions, rearrangement types (e.g., fusion, truncation, etc.) |
| Microsatellite-instability | Result values, category value (i.e., MSS, MSL, MSH) |
| Tumor Mutational Burden | Unit (e.g., mutations per megabase), result values |
| Expression | Gene name, expression unit (i.e., RPM/CPM, RPKM/FPKM, TPM, TMM, etc.), expression level, gene type (e.g., mRNA, lncRNA, circRNA, etc.), transcript ID |
| Fusion | Positions, Junction read count, fusion sequence, expression unit |
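A data dictionary like the one above can be used to check incoming vendor reports for completeness before they are loaded into an LDC. The following is a minimal sketch, assuming each parsed report is a flat Python dict. The snake_case field names, the `DATA_DICTIONARY` mapping, and the `missing_elements` helper are illustrative assumptions derived from the table, not the authors' actual schema or code.

```python
# Required data elements per report type, transcribed from the table above.
# Field names are hypothetical snake_case renderings of the table entries.
DATA_DICTIONARY = {
    "mutation": {
        "gene_name", "position", "coding_sequence_effect", "protein_effect",
        "allele_fraction", "transcript_id", "strand",
    },
    "copy_number_variant": {
        "copy_number", "gene_name", "involved_exons", "position", "cnv_type",
    },
    "rearrangement": {"gene_names", "positions", "rearrangement_type"},
    "microsatellite_instability": {"result_value", "category"},
    "tumor_mutational_burden": {"unit", "result_value"},
    "expression": {
        "gene_name", "expression_unit", "expression_level", "gene_type",
        "transcript_id",
    },
    "fusion": {
        "positions", "junction_read_count", "fusion_sequence",
        "expression_unit",
    },
}


def missing_elements(report_type, record):
    """Return the sorted list of required elements absent from `record`."""
    if report_type not in DATA_DICTIONARY:
        raise ValueError(f"unknown report type: {report_type!r}")
    required = DATA_DICTIONARY[report_type]
    # Set difference against the record's keys; extra keys are ignored.
    return sorted(required.difference(record))


if __name__ == "__main__":
    # Hypothetical record with one required element left out.
    msi_report = {"result_value": "stable"}
    print(missing_elements("microsatellite_instability", msi_report))
```

Keeping the dictionary as plain data separate from the checking function means a new vendor or assay type only requires a new entry, which fits the divide-and-conquer integration described in Fig. 3.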