Oliver M Crook, Aikaterini Geladaki, Daniel J H Nightingale, Owen L Vennard, Kathryn S Lilley, Laurent Gatto, Paul D W Kirk.
Abstract
The cell is compartmentalised into complex micro-environments allowing an array of specialised biological processes to be carried out in synchrony. Determining a protein's sub-cellular localisation to one or more of these compartments can therefore be a first step in determining its function. High-throughput and high-accuracy mass spectrometry-based sub-cellular proteomic methods can now shed light on the localisation of thousands of proteins at once. Machine learning algorithms are then typically employed to make protein-organelle assignments. However, these algorithms are limited by insufficient and incomplete annotation. We propose a semi-supervised Bayesian approach to novelty detection, allowing the discovery of additional, previously unannotated sub-cellular niches. Inference in our model is performed in a Bayesian framework, allowing us to quantify uncertainty in the allocation of proteins to new sub-cellular niches, as well as in the number of newly discovered compartments. We apply our approach across 10 mass spectrometry-based spatial proteomic datasets, representing a diverse range of experimental protocols. Application of our approach to hyperLOPIT datasets validates its utility by recovering enrichment with chromatin-associated proteins without annotation and uncovers sub-nuclear compartmentalisation which was not identified in the original analysis. Moreover, using sub-cellular proteomics data from Saccharomyces cerevisiae, we uncover a novel group of proteins trafficking from the ER to the early Golgi apparatus. Overall, we demonstrate the potential for novelty detection to yield biologically relevant niches that are missed by current approaches.
Year: 2020 PMID: 33166281 PMCID: PMC7707549 DOI: 10.1371/journal.pcbi.1008288
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. An overview of novelty detection in subcellular proteomics.
Examples of computational methods for spatial proteomics datasets for prediction and novelty detection.
| Method | Localisation prediction | Uncertainty in protein localisation | Outlier detection | Novelty detection | Uncertainty in number of novel phenotypes | Uncertainty in allocation to new phenotypes | Integrative |
|---|---|---|---|---|---|---|---|
| Supervised Machine Learning (as reviewed in [ | ✓ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Correlation Profiling [ | ✓ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Transfer Learning [ | ✓ | ✘ | ✘ | ✘ | ✘ | ✘ | ✓ |
|  | ✘ | ✘ | ✓ | ✓ | ✘ | ✘ | ✘ |
|  | ✘ | ✘ | ✓ | ✓ | ✘ | ✘ | ✘ |
| TAGM [ | ✓ | ✓ | ✓ | ✘ | ✘ | ✘ | ✘ |
| Novelty TAGM (This manuscript) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✘ |
Fig 2. (a) PCA plot of the hyperLOPIT U-2 OS cancer cell line data. Points are scaled according to the discovery probability, with larger points indicating greater discovery probability. (b) Heatmaps of the posterior similarity matrix derived from the U-2 OS cell line data, demonstrating the uncertainty in the clustering structure of the data. To reduce the number of visualised proteins, we have plotted only the proteins with probability greater than 0.99 of belonging to a new phenotype and probability less than 0.5 of being an outlier. (c) Tile plot of discovered phenotypes against GO CC terms to demonstrate over-representation, where the colour intensity is the -log10 of the p-value.
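The posterior similarity matrix shown in panel (b) can be illustrated with a minimal Python sketch. It assumes the sampler output is available as a list of allocation vectors, one per MCMC iteration; the function names `posterior_similarity` and `select_for_plot` are hypothetical and are not taken from the authors' software. Entry (i, j) is the fraction of iterations in which proteins i and j share a cluster, and the second function applies the thresholds quoted in the caption.

```python
def posterior_similarity(samples):
    """Estimate the posterior similarity matrix from MCMC allocations.

    samples[t][i] is the cluster label of protein i at iteration t
    (an assumed input layout, not the authors' data format).
    """
    n_iter = len(samples)
    n_prot = len(samples[0])
    counts = [[0] * n_prot for _ in range(n_prot)]
    for alloc in samples:
        for i in range(n_prot):
            for j in range(n_prot):
                if alloc[i] == alloc[j]:
                    counts[i][j] += 1
    # Convert co-clustering counts to proportions of iterations.
    return [[c / n_iter for c in row] for row in counts]


def select_for_plot(discovery_prob, outlier_prob,
                    min_discovery=0.99, max_outlier=0.5):
    """Indices of proteins passing the caption's two thresholds."""
    return [
        i
        for i, (d, o) in enumerate(zip(discovery_prob, outlier_prob))
        if d > min_discovery and o < max_outlier
    ]


# Toy example: 4 iterations, 3 proteins.
samples = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
psm = posterior_similarity(samples)
keep = select_for_plot([0.995, 0.5, 0.999], [0.1, 0.1, 0.7])
```

Only the pattern of shared labels matters here, so the estimate is unaffected by label switching between iterations.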
Fig 3. (a, c) PCA plots of the LOPIT-DC U-2 OS data and the hyperLOPIT yeast data. The points are scaled according to the discovery probability. (b, d) Heatmaps of the posterior similarity matrix derived from the U-2 OS and yeast datasets, demonstrating the uncertainty in the clustering structure of the data. To reduce the number of visualised proteins, we have plotted only the proteins with probability greater than 0.99 of belonging to a new phenotype and probability less than 0.95 of being an outlier (10⁻⁵ for LOPIT-DC). (e, f) Tile plots of phenotypes against GO CC terms, where the colour intensity is the -log10 of the p-value.
Fig 4. (a) PCA plots of the HeLa data. The points are scaled according to their discovery probability. (b) Heatmaps of the HeLa Itzhak data. Only the proteins with discovery probability greater than 0.99 and outlier probability less than 0.95 are shown. The heatmaps demonstrate the uncertainty in the clustering structure present in the data. (c) Tile plot of phenotypes against GO CC terms, where the colour intensity is the -log10 of the p-value.
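The colour scale used in the tile plots can be sketched as follows. The captions do not state which over-representation test was used, so this example assumes a one-sided hypergeometric test and omits any multiple-testing correction; the function names and the toy counts are illustrative only.

```python
from math import comb, log10


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: proteins in the background set
    K: background proteins annotated with the GO CC term
    n: proteins in the discovered phenotype
    k: phenotype proteins annotated with the term
    (Assumed test; the source does not specify it.)
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total


def tile_intensity(p_value):
    """Colour intensity as described in the captions: -log10(p)."""
    return -log10(p_value)


# Toy counts: all 3 proteins in a phenotype carry a term
# held by 3 of 10 background proteins.
p = enrichment_p_value(k=3, n=3, K=3, N=10)
intensity = tile_intensity(p)
```

Smaller p-values map to larger intensities, so strongly over-represented terms appear as the darkest tiles.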
Fig 5. (a) PCA plot showing marker proteins for the HEK-293 dataset. (b) PCA plot with phenotypes identified by phenoDisco. (c) PCA plot with phenotypes identified by Novelty TAGM, with point size scaled to discovery probability. (d, e) Barplots showing the number of proteins allocated to different phenotypes by phenoDisco and Novelty TAGM, respectively. (f) A table showing, for both methods, the number of phenotypes discovered and the number with functional enrichment. (g) A heatmap showing the overlap between phenoDisco and Novelty TAGM allocations.
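An overlap heatmap such as the one in panel (g) is typically built from a contingency table of the two sets of allocations. The sketch below assumes each method's output is a list of phenotype labels in the same protein order; `overlap_table` and the toy labels are hypothetical and do not reproduce the published results.

```python
from collections import Counter


def overlap_table(alloc_a, alloc_b):
    """Cross-tabulate two allocations of the same proteins.

    Returns (row_labels, col_labels, table), where table[r][c] is the
    number of proteins given row label r by method A and column
    label c by method B.
    """
    if len(alloc_a) != len(alloc_b):
        raise ValueError("allocations must cover the same proteins")
    counts = Counter(zip(alloc_a, alloc_b))
    rows = sorted(set(alloc_a))
    cols = sorted(set(alloc_b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Toy labels for four proteins under two methods.
method_a = ["P1", "P1", "P2", "P2"]
method_b = ["N1", "N1", "N1", "N2"]
rows, cols, table = overlap_table(method_a, method_b)
```

Each cell of `table` corresponds to one tile of the heatmap; normalising rows or columns would show the overlap as proportions instead of counts.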
Fig 6. (a) PCA of U-2 OS hyperLOPIT data with points scaled to localisation probability and outliers shrunk. Points are coloured according to their most probable organelle. (b) Immunofluorescence images and sub-cellular localisation annotation taken from the HPA database (https://www.proteinatlas.org/humanproteome/cell) for the proteins with UniProt accessions P61020 (Rab5b), O15498 (Ykt6), Q9NZN3 (EHD3), and Q96L93 (KIF16B). The nucleus is stained in blue, microtubules in red, and the antibody staining targeting the protein in green. (c) A barplot representing the number of proteins allocated before and after re-annotation of the endosomal class. (d) Violin plots of the full probability distribution of proteins to organelles, where each violin plot is for a single protein.