David M Miller, Sophia Z Shalhout.
Abstract
OBJECTIVES: Clinico-genomic data (CGD) acquired through routine clinical practice have the potential to improve our understanding of clinical oncology. However, these data often reside in heterogeneous and semistructured formats, resulting in prolonged time-to-analysis.
Keywords: REDCap; Shiny app; clinical informatics; clinico-genomics; data abstraction; electronic health records
Year: 2021 PMID: 34595403 PMCID: PMC8476929 DOI: 10.1093/jamiaopen/ooab082
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Schema of GENETEX. The GENETEX package takes CGD, which is typically stored in semistructured data, as an input via a Shiny application user interface. Once the input data have been captured, the package executes a series of server-side functions that text mine CGD reports for relevant genomic data. These structured data are then imported directly into the REDCap electronic data capture (EDC) system, placing the data in the Genomics Instrument in REDCap. CGD: clinico-genomic data; REDCap: Research Electronic Data Capture.
Figure 2. Browser-based user interface. Depicted is the user interface (UI) of the Shiny app of GENETEX. This UI is produced by running the code in the R script “GENETEX Shiny app.R,” which can be found on GitHub (https://github.com/TheMillerLab/genetex/blob/main/GENETEX%20Shiny%20app.R).
GENETEX functions
| Function | Functionality |
|---|---|
| genetex_to_redcap() | Integrates key verbs to provide NLP tools to abstract data from a variety of genomic reports and import them to REDCap |
| gene.variants() | Integrates various platform-specific NLP functions to text mine gene names and nucleotide variants from genomic reports and transforms them to structured data for import into REDCap |
| cnv() | Integrates various platform-specific NLP functions to text mine gene names and copy number variants data from a variety of genomic reports and transforms them to structured data for import into REDCap |
| mmr() | Text mines mismatch repair status from genomic reports and transforms it to structured data for import into REDCap |
| mutational.signatures() | Text mines mutational signatures data from a variety of genomic reports and transforms it to structured data for import into REDCap |
| tmb() | Text mines tumor mutation burden (TMB) data from a variety of genomic reports and transforms it to structured data for import into REDCap |
| platform() | Applies regular expressions to assign a numerical value for the various platforms used for genomic reports that aligns with the “genomics_platform” field in the REDCap Genomics Instrument |
| genes_regex() | Produces a regular expression of over 900 HGNC gene names |
| genes_boundary_regex() | Produces a regular expression of over 900 HGNC gene names as a unique string with word boundaries |
| genomics.tissue.type() | Applies regular expressions to assign a numerical value for the various platforms used for genomic reports that aligns with the “genomics_platform” field in the REDCap Genomics Instrument |
Notes: Key functions unique to GENETEX are shown with a brief description of their action. Descriptions of other functions can be found in the package’s Help Page.
REDCap: Research Electronic Data Capture; NLP: Natural Language Processing.
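To illustrate why the table distinguishes a plain gene-name regular expression from one with word boundaries, the following is a minimal Python sketch, not the package's R code. The four-gene list and the sample sentence are hypothetical, and the two helper functions only borrow the names of the GENETEX functions; their bodies are assumptions about the general technique.

```python
import re

# Hypothetical short gene list; the GENETEX list covers over 900 HGNC names.
GENES = ["TP53", "MET", "RB1", "ATM"]


def genes_regex(genes):
    """Build an alternation of escaped gene names, longest names first."""
    ordered = sorted(genes, key=len, reverse=True)
    return "|".join(re.escape(g) for g in ordered)


def genes_boundary_regex(genes):
    """Wrap the same alternation in word boundaries so only whole words match."""
    return r"\b(?:" + genes_regex(genes) + r")\b"


# Hypothetical report fragment.
text = "METASTATIC lesion; TP53 and MET alterations detected, ATM wild type."

loose = re.findall(genes_regex(GENES), text)            # also hits "MET" inside "METASTATIC"
strict = re.findall(genes_boundary_regex(GENES), text)  # whole-word matches only

print(loose)
print(strict)
```

Without boundaries, the short symbol "MET" is also found inside the unrelated word "METASTATIC"; the boundary version avoids that false positive.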
Figure 3. (A) Regular expression of gene names. Depicted is a portion of the character vector output of the function “genes_boundary_regex().” This function produces a regular expression that is used by GENETEX to identify gene names in CGD reports. (B) Regular expressions of nucleotide and amino acid sequences, and cfDNA. Shown are the 3 regular expressions used to identify nucleotide sequences (“nuc_regex”), amino acid sequences (“aa_regex”), and cfDNA (“cfdna_regex”) contained within CGD reports. (C) Tokenized genomics report. Depicted is a portion of a CGD report that has been tokenized. Here, each word of the report is partitioned into a single cell of the vector “X.” (D) Example of report filtered with genes_nuc_aa_cfdna_regex. Shown is vector “X” from (C), which has been filtered by a regular expression that selects only cells with elements relevant to gene names, nucleotide and amino acid sequences, and cfDNA. The code used for this step is demonstrated above the output. (E) Example of report grouped by gene name. Related tokens are grouped using the “stringr::str_detect()” function by incorporating the regular expression “genes_boundary_regex.” With this method, HUGO gene names serve as the “keyword” and thus the boundary for each group. As a result, the appropriate nucleotide, amino acid, and cfDNA data are linked with the corresponding gene name. (F) Mapping REDCap variables to data elements. Each tokenized data element in vector “X” must be linked with an appropriate variable name from the Genomics Instrument. In this step, the 4 relevant variable stems, “variant_gene,” “variant_nucleotide,” “variant_protein,” and “variant_gene_perc_cfdna,” are matched with the relevant data in vector “X” by combining the “ifelse()” and “str_detect()” functions with the regular expressions “genes_boundary_regex,” “nuc_regex,” “aa_regex,” and “cfdna_regex.” (G) Complete mapping of REDCap variables to data elements with the NSLS. Each data element in vector “X” must correspond to a unique variable name to be imported into REDCap. Therefore, in this final step, the variable stems “variant_gene,” “variant_nucleotide,” “variant_protein,” and “variant_gene_perc_cfdna” are linked to the number found in the column “group,” which produces a unique variable name. All variables with the suffix “_1” are thereby “linked” together using the NSLS. cfDNA: cell-free DNA; CGD: clinico-genomic data; NSLS: Numeric Suffix Linker System; REDCap: Research Electronic Data Capture.
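The steps in panels (C) through (G) can be sketched end to end as follows. This is a Python illustration of the general approach under stated assumptions, not the GENETEX implementation, which is written in R with stringr. The gene subset, the three simplified patterns, the sample report text, and the helper names `stem_for` and `map_report` are all hypothetical; only the four variable stems and the numeric-suffix idea come from the caption.

```python
import re

# Hypothetical gene subset; the package matches over 900 HGNC names.
GENES = ["TP53", "RB1", "PIK3CA"]
GENE_RE = re.compile(r"^(?:" + "|".join(GENES) + r")$")

# Simplified, assumed stand-ins for nuc_regex, aa_regex, and cfdna_regex.
NUC_RE = re.compile(r"^c\.\S+$")            # e.g. c.743G>A
AA_RE = re.compile(r"^p\.\S+$")             # e.g. p.R248Q
CFDNA_RE = re.compile(r"^\d+(?:\.\d+)?%$")  # e.g. 12.5%


def stem_for(token):
    """Return the REDCap variable stem for a token, or None if irrelevant."""
    if GENE_RE.match(token):
        return "variant_gene"
    if NUC_RE.match(token):
        return "variant_nucleotide"
    if AA_RE.match(token):
        return "variant_protein"
    if CFDNA_RE.match(token):
        return "variant_gene_perc_cfdna"
    return None


def map_report(report):
    """Tokenize, filter, group by gene name, and attach numeric suffixes."""
    tokens = [t for t in re.split(r"[\s,;]+", report) if t]  # (C) tokenize
    record = {}
    group = 0
    for token in tokens:
        stem = stem_for(token)
        if stem is None:                # (D) drop irrelevant tokens
            continue
        if stem == "variant_gene":      # (E) a gene name opens a new group
            group += 1
        if group == 0:                  # ignore data seen before any gene name
            continue
        record[f"{stem}_{group}"] = token  # (F, G) stem plus group suffix
    return record


# Hypothetical report text.
report = "TP53 c.743G>A p.R248Q 12.5% detected; PIK3CA c.3140A>G p.H1047R 3.1%"
record = map_report(report)
for name, value in record.items():
    print(name, value)
```

Every field that ends in the same suffix belongs to the same variant, which is the linking role the caption assigns to the NSLS. A real report would need more robust patterns than the simplified ones assumed here.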