Gerben Menschaert, Xiaojing Wang, Andrew R Jones, Fawaz Ghali, David Fenyö, Volodimir Olexiouk, Bing Zhang, Eric W Deutsch, Tobias Ternent, Juan Antonio Vizcaíno.
Abstract
On behalf of the Human Proteome Organization (HUPO) Proteomics Standards Initiative, we introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box." We also highlight that a number of additional software tools that properly support the proteomics information in these formats are now available, providing functionalities such as file generation, file conversion, and data analysis. All the related documentation, including the detailed file format specifications and example files, is accessible at http://www.psidev.info/probam and at http://www.psidev.info/probed.
Year: 2018 PMID: 29386051 PMCID: PMC5793360 DOI: 10.1186/s13059-017-1377-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of the proBAM and proBed proteogenomics standard formats. Both proBAM and proBed can be created from well-established proteomics standard formats containing peptide and protein identification information (mzTab and mzIdentML, blue box), which are derived from their corresponding MS data spectrum files (mzML, brown box). The proBAM and proBed formats (green box) contain similar PSM-related and genomic mapping information, yet proBAM contains more details, including enzymatic (protease) information that is key in proteomics experiments (enzyme type, mis-cleavages, enzymatic termini, etc.) and mapping details (CIGAR, flag, etc.). Additionally, proBAM is able to hold a full MS-based proteomics identification result set, enabling further downstream analysis in addition to genome-centric visualization, which is also the purpose of proBed (purple box).
Fig. 2 Fields of the proBAM and proBed formats. A proBed file holds 12 original BED columns (highlighted by a bold box) and 13 additional proBed columns. The proBAM alignment record contains 11 original BAM columns (highlighted by a bold box) and 21 proBAM-specific columns, using the TAG:TYPE:VALUE format. Each row in the table represents a column in proBAM and proBed. The rows are colored to reflect the categories of information provided in the two formats (see the color legend at the bottom; the header section of the proBAM format is not included here). The rows without any background color in the proBAM table represent original BAM columns that are not used in proBAM but are retained for compatibility. The last row in the proBAM table indicates customized columns that could potentially be used.
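The column layout described in this caption can be illustrated with a minimal Python sketch. It assumes a tab-separated proBed line with exactly 25 columns (12 BED plus 13 proBed-specific) and a proBAM optional field written as TAG:TYPE:VALUE, as in SAM. The function names `split_probed_line` and `parse_optional_field` are hypothetical, the 13 proBed-specific columns are returned unnamed rather than guessed at, and the tag `ZZ` used in the usage note is a placeholder, not a tag defined by the proBAM specification.

```python
from typing import Dict, List, Tuple, Union

# Standard BED12 column names; per the caption, proBed keeps these 12 columns first.
BED12_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]
N_PROBED_EXTRA = 13  # additional proBed columns, as stated in the caption


def split_probed_line(line: str) -> Tuple[Dict[str, str], List[str]]:
    """Split one proBed line into its BED12 part and its 13 extra columns.

    Hypothetical helper: the extra columns are returned as an unnamed list
    because their names are defined in the proBed specification, not here.
    """
    fields = line.rstrip("\n").split("\t")
    expected = len(BED12_COLUMNS) + N_PROBED_EXTRA
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    bed = dict(zip(BED12_COLUMNS, fields[:len(BED12_COLUMNS)]))
    return bed, fields[len(BED12_COLUMNS):]


def parse_optional_field(field: str) -> Tuple[str, Union[int, float, str]]:
    """Parse a SAM-style optional field of the form TAG:TYPE:VALUE.

    Only the integer ('i') and float ('f') types are converted; every other
    type code is returned as the raw string value.
    """
    # Split at most twice so that colons inside VALUE are preserved.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    return tag, value
```

For example, `parse_optional_field("ZZ:i:2")` yields the tag together with an integer value, while a 25-column line passed to `split_probed_line` comes back as a dictionary of BED fields plus a list of 13 strings.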
Fig. 3 Visualization of proBAM and proBed files in genome browsers. (a) IGV visualization: proBAM (green box) and proBed (red box) files coming from the same dataset (accession number PXD001524 in the PRIDE database). proBed files are usually loaded as annotation tracks in IGV, whereas proBAM files are loaded in the mapping section. (b) Ensembl visualization: proBAM (green box) and proBed (red box) files derived from the same dataset (accession number PXD000124) illustrating a novel translational event. The N-terminal proteomics identification result points to an alternative translation initiation site (TIS) for the gene HDGF at a near-cognate start site located in the 5’-UTR of the transcript (blue box).
Existing software implementations of the proBAM and proBed formats (by December 2017)