Thomas McGowan1, James E Johnson1, Praveen Kumar2,3, Ray Sajulga2, Subina Mehta2, Pratik D Jagtap2, Timothy J Griffin2.
Abstract
BACKGROUND: Proteogenomics integrates genomics, transcriptomics, and mass spectrometry (MS)-based proteomics data to identify novel protein sequences arising from gene and transcript sequence variants. Proteogenomic data analysis requires integration of disparate 'omic software tools, as well as customized tools to view and interpret results. The flexible Galaxy platform has proven valuable for proteogenomic data analysis. Here, we describe a novel Multi-omics Visualization Platform (MVP) for organizing, visualizing, and exploring proteogenomic results, adding a critically needed tool for data exploration and interpretation.
Keywords: Galaxy; Integrated Genomics Viewer; RNA-Seq; mass spectrometry; proteogenomics; proteomics; transcriptomics; visualization
Year: 2020 PMID: 32236523 PMCID: PMC7102281 DOI: 10.1093/gigascience/giaa025
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 7.658
Figure 1: Inputs to the MVP Galaxy Plug-in.
Example data structure and format for the variant_annotation table
| Example entry | Name text | Reference text | CIGAR text | Annotation text |
|---|---|---|---|---|
| Single amino acid substitution | ENSMUSP00000092502_E59G.Q267R | ENSMUSP00000092502 | 58=1X207=1X26= | E59G, Q267R |
| Nucleotide insertion/deletion resulting in frame shift | P_001316432_24_C>cA | P_001316432 | 7=99X | C>cA |
| In-frame amino acid insertion | P_001316432_24_C>cATG | P_001316432 | 7=1I99= | C>cATG |
| In-frame amino acid deletion | P_001316432_24_CATG>c | P_001316432 | 7=1D97= | CATG>c |
| Structural variant (fusion) | TAF1D_SNORA40 | TAF1D | 24=14X | chr3:18 100 200+, chr6:28 233 455- |
MVP reads only the name text and CIGAR text fields for identified sequence variants. The reference text and annotation text fields identify the reference gene and give a free-form description of the annotated variant, respectively; MVP does not read them directly. For the example variants provided, the CIGAR text string describes the following.

Single amino acid substitution: the text depicts an amino acid (AA) sequence with 58 AAs that match the reference exactly, followed by 1 AA difference (X), followed by 207 AAs that match exactly, followed by 1 AA difference, followed by 26 AAs that match exactly.

Nucleotide insertion/deletion causing a frame shift: the CIGAR text depicts an AA sequence in which the first 7 AAs exactly match the reference, followed by 99 mismatched AAs. The mismatches result from a single-nucleotide insertion into the coding sequence, which changes the reading frame for the remainder of the protein sequence.

In-frame AA insertion/deletion: two examples are shown, in which 3 nucleotides are inserted into or deleted from the coding sequence. For the insertion, the nucleotides ATG are inserted at a position such that the first 7 AAs match the reference exactly, followed by insertion of 1 mismatched AA, followed by 99 AAs that are still coded in-frame and exactly match the reference. For the deletion, the nucleotides ATG are deleted, removing a single AA: the first 7 AAs match exactly, followed by a deletion of a single AA and an exact match of the remaining 97 AAs in the protein sequence.

Structural variant: this example depicts a structural variant (fusion) that results from the joining of 2 distinct annotated genes (TAF1D and SNORA40). The CIGAR string describes the nature of this fusion: the first 24 AAs match the reference TAF1D sequence exactly, and the next 14 AAs mismatch because they come from the fused SNORA40 sequence. The annotation text field provides the chromosomal locations of the fused genes that code for the variant protein.
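The CIGAR reading described above can be sketched in a few lines of Python. This is an illustrative sketch only, not MVP's own parser: the function names `parse_cigar` and `changed_positions` are hypothetical, and the set of operations is assumed to be limited to the four that appear in the example table (`=` match, `X` mismatch, `I` insertion, `D` deletion).

```python
import re


def parse_cigar(cigar):
    """Split CIGAR text such as '58=1X207=1X26=' into (count, op) pairs.

    Assumption: only the operations =, X, I and D occur, as in the
    example table. Spaces inside the string are ignored.
    """
    compact = cigar.replace(" ", "")
    if not re.fullmatch(r"(?:\d+[=XID])+", compact):
        raise ValueError(f"unrecognized CIGAR text: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XID])", compact)]


def changed_positions(cigar):
    """Return 1-based positions in the variant sequence that are
    mismatched (X) or inserted (I) relative to the reference."""
    pos = 0          # number of variant residues consumed so far
    changed = []
    for count, op in parse_cigar(cigar):
        if op == "D":
            continue  # deleted residues are absent from the variant sequence
        if op in "XI":
            changed.extend(range(pos + 1, pos + count + 1))
        pos += count
    return changed


# The substitution example: the two mismatches fall at residues 59 and 267,
# in agreement with the E59G and Q267R annotation.
print(changed_positions("58=1X207=1X26="))
```

For the fusion example, `changed_positions("24=14X")` yields residues 25 through 38, the 14 AAs contributed by the fused partner.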
Example data structure and format for the feature_cds_map table
| Example entry | Name text | Chrom text | Start integer | End integer | Strand text | cds_start integer | cds_end integer |
|---|---|---|---|---|---|---|---|
| Exon | ENSMUSP00000000033 | chr7 | 142 655 653 | 142 655 810 | − | 0 | 157 |
| Exon | ENSMUSP00000000033 | chr7 | 142 654 299 | 142 654 448 | − | 157 | 306 |
| Exon | ENSMUSP00000000033 | chr7 | 142 653 815 | 142 654 052 | − | 306 | 543 |
| Structural variant (fusion)# | TAF1D_SNORA40 | chr3 | 18 100 128 | 18 100 200 | + | 1 | 24 |
| Structural variant (fusion)# | TAF1D_SNORA40 | chr6 | 28 233 413 | 28 233 455 | − | 25 | 39 |
In this example, the reference ENSMUSP00000000033 has 3 exons that are mapped to chromosome 7. The table uses coding sequence (CDS) start and end locations because the codon sequence for an amino acid can be split across exons. For variants such as single amino acid substitutions and InDels, the number of nucleotides between start and end locations may differ from the number of nucleotides in the reference CDS. #An example depiction of a structural variant (fusion) is also provided. Here, 2 entries are included, which provide information on the genomic localization of the 2 coding sequences that were fused together to create the fusion protein, one from the positive strand of Chromosome 3 and the other from the negative strand of Chromosome 6.
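As a rough illustration of how such a table can be queried, the sketch below loads the three exon rows into an in-memory SQLite database and looks up the exon that covers a given CDS offset. It is not MVP code. The helper names are hypothetical, the intervals are assumed to be half-open (`cds_start <= offset < cds_end`), and the first exon is assumed to lie on the negative strand like the other two exons of the same protein.

```python
import sqlite3

# Exon rows from the example table (thousands separators removed).
# Assumption: the first exon is on the negative strand, like the others.
EXON_ROWS = [
    ("ENSMUSP00000000033", "chr7", 142655653, 142655810, "-", 0, 157),
    ("ENSMUSP00000000033", "chr7", 142654299, 142654448, "-", 157, 306),
    ("ENSMUSP00000000033", "chr7", 142653815, 142654052, "-", 306, 543),
]


def build_feature_cds_map(rows):
    """Create an in-memory feature_cds_map table and fill it with rows."""
    conn = sqlite3.connect(":memory:")
    # "start" and "end" are quoted because END is an SQL keyword.
    conn.execute(
        "CREATE TABLE feature_cds_map ("
        'name TEXT, chrom TEXT, "start" INTEGER, "end" INTEGER, '
        "strand TEXT, cds_start INTEGER, cds_end INTEGER)"
    )
    conn.executemany(
        "INSERT INTO feature_cds_map VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def exon_for_cds_offset(conn, name, offset):
    """Return (chrom, start, end, strand) of the exon covering a CDS
    offset, or None. Assumes half-open CDS intervals."""
    return conn.execute(
        'SELECT chrom, "start", "end", strand FROM feature_cds_map '
        "WHERE name = ? AND cds_start <= ? AND ? < cds_end",
        (name, offset, offset),
    ).fetchone()


conn = build_feature_cds_map(EXON_ROWS)
print(exon_for_cds_offset(conn, "ENSMUSP00000000033", 157))
```

Under the half-open assumption, offset 157 belongs to the second exon, which is why a codon that starts near offset 155 would draw its nucleotides from two exons.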
Figure 2: Launching MVP from an mzSQLite data type within an active Galaxy History.
Figure 3: The MVP interface after initial launch from Galaxy.
Figure 4: Viewing of annotated MS/MS data supporting selected PSMs.
Figure 5: Example data shown within the Protein-Peptide Viewer functionality.
Figure 6: Snapshot of visualization of peptide and transcript mapping to the genome in IGV.