Mohd Amin Azuwar, Nor Azlan Nor Muhammad, Nor Afiqah-Aleng, Nurul-Syakima Ab Mutalib, Najwa Farhah Md Yusof, Ryia Illani Mohd Yunos, Muhiddin Ishak, Sazuita Saidin, Isa Mohamed Rose, Ismail Sagap, Luqman Mazlan, Zairul Azwan Mohd Azman, Musalmah Mazlan, Sharaniza Ab Rahim, Wan Zurinah Wan Ngah, Sheila Nathan, Nurul Azmir Amir Hashim, Zeti-Azura Mohamed-Hussein, Rahman Jamal.
Abstract
Colorectal cancer (CRC) is the second most common cancer in Malaysia, yet its pathobiology remains unknown. Omics technologies, which generate vast amounts of molecular data, can help to characterize CRC pathobiology in detail, but the volume of data they produce creates a new challenge for data organization. Therefore, a knowledge-based repository, TCGA-My, was developed to systematically store and organize CRC omics data from Malaysian patients. TCGA-My stores the genome and metabolome of Malaysian CRC patients. The genome and metabolome datasets were organized using the Python module pandas. The variants and metabolites were first annotated with their biological information using the Gene Ontology (GO) vocabulary. The TCGA-My relational database was then built using HeidiSQL Portable 9.4.0.512, and Laravel was used to design the web interface. Currently, TCGA-My stores 1,517,841 variants, 23,695 genes, and 167,451 metabolites from the samples of 50 CRC patients. Data entries can be accessed via search and browse menus. TCGA-My aims to offer effective and systematic omics data management, allowing it to become the main resource for Malaysian CRC research, particularly in the context of biomarker identification for precision medicine.
Keywords: CRC database; CRC repository; TCGA-My; colorectal cancer; genome; metabolome; systematic repository
Year: 2022 PMID: 35743803 PMCID: PMC9224961 DOI: 10.3390/life12060772
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
Figure 1. The variants and genes data normalization algorithm that was deployed for generating an SQL file for TCGA-My.
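The normalization algorithm itself is not reproduced in this record. As a rough illustration of the general idea only, the sketch below splits a flat annotation listing into two main tables and one pivot table, then renders SQL `INSERT` statements. The input rows, table names, and column names are hypothetical, and the standard library is used in place of pandas to keep the example dependency-free.

```python
# Hypothetical flat annotation output: one (sample, variant, gene) hit per row.
flat_rows = [
    ("C187", "chr1:100A>G", "GENE_A"),
    ("C187", "chr2:200C>T", "GENE_B"),
    ("C194", "chr1:100A>G", "GENE_A"),
    ("C194", "chr3:300G>A", "GENE_C"),
]


def assign_ids(values):
    """Map each distinct value to a 1-based id, in order of first appearance."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids) + 1)
    return ids


variant_ids = assign_ids(variant for _, variant, _ in flat_rows)
gene_ids = assign_ids(gene for _, _, gene in flat_rows)

# Pivot rows store id pairs, so each variant-gene link appears exactly once.
gene_variant = sorted({(variant_ids[v], gene_ids[g]) for _, v, g in flat_rows})


def insert_statements(table, columns, rows):
    """Render rows as SQL INSERT statements (illustration only)."""

    def literal(value):
        # Quote strings and double any embedded single quote; pass numbers through.
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    cols = ", ".join(columns)
    return [
        f"INSERT INTO {table} ({cols}) VALUES ({', '.join(literal(v) for v in row)});"
        for row in rows
    ]


sql_lines = (
    insert_statements("variants", ("id", "name"), [(i, n) for n, i in variant_ids.items()])
    + insert_statements("genes", ("id", "symbol"), [(i, n) for n, i in gene_ids.items()])
    + insert_statements("gene_variant", ("variant_id", "gene_id"), gene_variant)
)
```

Writing `"\n".join(sql_lines)` to a file would give an importable SQL script. Storing id pairs in the pivot table, rather than repeating variant and gene names, is what keeps a variant shared by several samples from being duplicated.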
Figure 2. The relational tables for the TCGA-My schema. The yellow tables are the main tables, and the blue tables are the pivot tables. The relationship between main and pivot tables is indicated with a straight line (―●). The dotted line represents the relationship between main tables (◊---●).
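The main-table and pivot-table pattern described in the caption can be sketched as follows. This is a minimal illustration, not the TCGA-My schema: the table and column names are hypothetical, and an in-memory SQLite database stands in for the project's actual database server.

```python
import sqlite3

# In-memory SQLite stands in for the real database server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Main tables: one row per entity.
    CREATE TABLE variants (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE genes    (id INTEGER PRIMARY KEY, symbol TEXT NOT NULL UNIQUE);
    -- Pivot table: one row per variant-gene link (many-to-many).
    CREATE TABLE gene_variant (
        variant_id INTEGER NOT NULL REFERENCES variants(id),
        gene_id    INTEGER NOT NULL REFERENCES genes(id),
        PRIMARY KEY (variant_id, gene_id)
    );
    """
)

# Hypothetical example rows.
conn.executemany("INSERT INTO variants VALUES (?, ?)", [(1, "chr1:100A>G"), (2, "chr2:200C>T")])
conn.executemany("INSERT INTO genes VALUES (?, ?)", [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany("INSERT INTO gene_variant VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])


def genes_for_variant(name):
    """Return the gene symbols linked to a variant through the pivot table."""
    rows = conn.execute(
        """
        SELECT g.symbol
        FROM genes AS g
        JOIN gene_variant AS gv ON gv.gene_id = g.id
        JOIN variants AS v ON v.id = gv.variant_id
        WHERE v.name = ?
        ORDER BY g.symbol
        """,
        (name,),
    ).fetchall()
    return [symbol for (symbol,) in rows]
```

For example, `genes_for_variant("chr2:200C>T")` walks from the `variants` main table through `gene_variant` to `genes`, which is the kind of lookup a search or browse menu would need.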
Number of entries in the datasets of TCGA-My.
| Dataset | Number of Entries |
|---|---|
| Sample | 50 |
| Variant | 1,517,841 |
| COSMIC | 1113 |
| dbSNP | 291,397 |
| Gene | 23,695 |
| PDB | 4420 |
| RefSeq ncRNA | 2637 |
| UniProt | 17,910 |
| Metabolite | 89,256 |
| Pathway | 344 |
| KEGG | 186 |
| PANTHER | 158 |
| Gene ontology | 17,459 |
Patients details.
| Patient | Gender | Age | Ethnicity | Diagnosis | Anatomical Location | TNM Stage | Dukes Stage |
|---|---|---|---|---|---|---|---|
| C187 | Male | 63 | Malay | Well differentiated adenocarcinoma | Rectosigmoid | pT3 N2 MX | C2 |
| C330 | Male | 71 | Chinese | Well differentiated adenocarcinoma | Sigmoid colon | T3 N0 MX | B2 |
| C404 | Male | 68 | Chinese | Well differentiated adenocarcinoma | Rectum | pT3 pN1a MX | - |
| | | | | | Sessile polyp in ascending colon | pT1 | A |
| C414 | Male | 76 | Malay | Well differentiated adenocarcinoma (WHO Grade 1) | Sigmoid colon | pT3 pN1b pMX | C |
| C449 | Male | 65 | Malay | Moderately differentiated adenocarcinoma | Rectosigmoid colon | pT3 N2 MX | C |
| C476 | Male | 72 | Chinese | Well differentiated adenocarcinoma | Recto-sigmoidectomy | pT4a N1 MX | |
| C194 | Female | 70 | Malay | Well differentiated adenocarcinoma | Sigmoid colon | - | B |
| C273 | Female | 73 | Malay | Moderately differentiated adenocarcinoma | Rectosigmoid colon | pT1 N0 MX | A |
| C373 | Female | 74 | Chinese | Moderately differentiated adenocarcinoma | Anterior resection specimen | T2 N0 MX | B |
| C388 | Female | 65 | Chinese | Moderately differentiated adenocarcinoma | Anterior resection specimen | pT2 pN1 pMx | C |
| C398 | Female | 71 | Chinese | Moderately differentiated adenocarcinoma | Sigmoid colon with bladder | pT4 N1 MX | C |
| C467 | Female | 65 | Malay | Well differentiated adenocarcinoma | Rectum | T4b N1b pMX | C |
| C474 | Female | 79 | Malay | Well differentiated adenocarcinoma | Left hemicolectomy | pT3 N0 MX | B1 |
Note: p indicates the pathological state has been examined for the respective component of the TNM staging system.
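To make the notation concrete, the sketch below splits a stage string from the table, such as `pT3 pN1a MX`, into its T, N, and M parts and records whether each carries the `p` prefix. The function name and the returned structure are hypothetical, and the pattern only covers the categories that appear in this table.

```python
import re

# One TNM component: optional "p" prefix, the letter T, N or M, then a category
# such as "3", "1a", "4b" or "X". Matching is case-insensitive because the
# table mixes "MX" and "Mx".
_COMPONENT = re.compile(r"^(p?)([TNM])(\d[a-c]?|X)$", re.IGNORECASE)


def parse_tnm(stage):
    """Split a stage string such as 'pT3 pN1a MX' into its T, N and M parts."""
    parsed = {}
    for token in stage.split():
        match = _COMPONENT.match(token)
        if match is None:
            raise ValueError(f"unrecognized TNM token: {token!r}")
        prefix, letter, category = match.groups()
        parsed[letter.upper()] = {
            # Normalize "x" to "X" and subcategory letters to lower case.
            "category": "X" if category.upper() == "X" else category.lower(),
            "pathological": prefix != "",
        }
    return parsed
```

A missing stage, shown as `-` in the table, raises `ValueError` here, so a caller would need to handle that case separately.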
TNM and Dukes staging systems.
| Staging System | Component | Category | Explanation |
|---|---|---|---|
| TNM | Primary tumor (T) | T1 | Tumor invades submucosa. |
| TNM | Primary tumor (T) | T2 | Tumor invades muscularis propria. |
| TNM | Primary tumor (T) | T3 | Tumor invades into the subserosa or perirectal tissues via muscularis propria. |
| TNM | Primary tumor (T) | T4 | Tumor has spread to other organs or structures directly and/or the visceral peritoneum. |
| TNM | Primary tumor (T) | T4a | The tumor has expanded into the surface of the visceral peritoneum, where it has penetrated all layers of the colon. |
| TNM | Primary tumor (T) | T4b | The tumor has spread to other organs or structures or has attached itself to them. |
| TNM | Regional lymph node (N) | N0 | Negative regional lymph node metastases. |
| TNM | Regional lymph node (N) | N1 | Metastases in one to three regional lymph nodes. |
| TNM | Regional lymph node (N) | N1a | Tumor cells have been detected in one regional lymph node. |
| TNM | Regional lymph node (N) | N1b | Tumor cells have been detected in two or three regional lymph nodes. |
| TNM | Regional lymph node (N) | N2 | Metastases in four or more regional lymph nodes. |
| TNM | Distant metastases (M) | MX | Distant metastases could not be assessed. |
| Dukes | | A | Tumor limited to the submucosa. |
| Dukes | | B | Tumor grows through the colon wall into muscular layers, no lymph nodes involved. |
| Dukes | | B1 | Into but not through the muscularis propria, nodes not involved. |
| Dukes | | B2 | Through the muscularis propria, nodes not involved. |
| Dukes | | C | Lymph node involved. |
| Dukes | | C2 | Through the muscularis propria with nodes involved. |
Figure 3. The correlation between the number of variants and the age and gender of CRC patients.
List of DNA regions for the variants listed in TCGA-My.
| DNA Region | Number of Variants | Description |
|---|---|---|
| Intergenic | 926,482 | Variant overlaps in intergenic region. |
| Intronic | 409,632 | Variant overlaps in intronic region. |
| Non-coding RNA, intronic | 84,913 | Non-coding transcript variant overlaps with one of the transcripts in the intronic region. |
| Exonic | 8381 | Variant overlaps in exonic region. |
| Upstream | 8855 | Variant overlaps a 1-kb region upstream of the transcription start site. |
| Downstream | 9116 | Variant overlaps a 1-kb region downstream of the transcription termination site. |
| UTR3 | 8603 | Variant overlaps in 3′ untranslated region. |
| Upstream, downstream | 922 | Variant overlaps in both upstream and downstream regions. |
| UTR5 | 1176 | Variant overlaps in 5′ untranslated region. |
| Splicing | 108 | Variant overlaps in splice region. |
| Non-coding RNA, splicing | 34 | Non-coding transcript variant overlaps with one of the transcripts in the splice region. |
| Exonic, splicing | 2 | Variant overlaps in both exonic and splice regions. |
Type of mutations identified for the variants listed in TCGA-My.
| Type of Mutations | Number of Variants | Description |
|---|---|---|
| Nonsynonymous SNV | 3922 | A single nucleotide change that alters an amino acid of a protein. |
| Frameshift insertion | 510 | Insertion of one or more nucleotides that shifts the codon reading frame. |
| Frameshift deletion | 917 | Deletion of one or more nucleotides that shifts the codon reading frame. |
| Stop-gain | 271 | Mutations caused by a nonsynonymous SNV, frameshift insertion, or frameshift deletion that lead to the gain of a stop codon. |
| Stop-loss | 9 | Mutations caused by a nonsynonymous SNV, frameshift insertion, or frameshift deletion that lead to the loss of a stop codon. |
| Non-frameshift deletion | 587 | Deletion of a number of nucleotides divisible by three, which does not shift the reading frame. |
| Synonymous SNV | 2226 | A single nucleotide change that does not alter the encoded amino acid. |
| Non-frameshift insertion | 153 | Insertion of a number of nucleotides divisible by three, which does not shift the reading frame. |
| Unknown | 223 | Unknown mutation. |
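The frameshift and non-frameshift rows above differ only in whether the number of inserted or deleted nucleotides is divisible by three. A minimal sketch of that length rule follows; the function name is hypothetical, and the use of `-` for an empty allele is an assumed convention rather than something specified in this record.

```python
def classify_indel(ref, alt):
    """Label an insertion or deletion by its effect on the reading frame.

    `ref` and `alt` are the reference and alternate alleles; "-" denotes an
    empty allele (an assumed convention).
    """
    ref_len = len(ref.replace("-", ""))
    alt_len = len(alt.replace("-", ""))
    if ref_len == alt_len:
        raise ValueError("alleles have equal length; not an insertion or deletion")
    kind = "insertion" if alt_len > ref_len else "deletion"
    # A length change divisible by three keeps the codon reading frame intact.
    frame = "non-frameshift" if abs(alt_len - ref_len) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"
```

For instance, `classify_indel("-", "A")` yields a frameshift insertion, while `classify_indel("ACGT", "A")` removes three nucleotides and yields a non-frameshift deletion. The rule only applies to variants within coding sequence, which the sketch does not check.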
Figure 4. Circos plot for sample C474. The plot was constructed using Strawberry Perl to visualize the location of variants in the chromosomes.
Number of variant genes in genome sample.
| Patient | Number of Genes | Number of Driver Genes |
|---|---|---|
| C187 | 11,988 | 6 |
| C194 | 11,644 | 7 |
| C273 | 11,837 | 5 |
| C373 | 11,951 | 11 |
| C404 | 13,188 | 6 |
| C414 | 12,446 | 9 |
| C449 | 13,888 | 5 |
| C474 | 23,213 | 12 |
| C330 | 11,989 | 2 |
| C388 | 11,515 | 8 |
| C398 | 11,763 | 2 |
| C467 | 12,489 | 7 |
| C476 | 11,666 | 3 |
Figure 5. Significantly altered metabolites. The CRC metabolome samples reveal nine upregulated (red) and two downregulated (blue) metabolites.