
DisPhaseDB: An integrative database of diseases related variations in liquid-liquid phase separation proteins.

Alvaro M Navarro¹, Fernando Orti¹, Elizabeth Martínez-Pérez¹, Macarena Alonso¹, Franco L Simonetti¹, Javier A Iserte¹, Cristina Marino-Buslje¹.

Abstract

Motivation: Proteins involved in liquid-liquid phase separation (LLPS) and membraneless organelles (MLOs) are recognized as decisive for many biological processes and also as responsible for several diseases. Despite the recent explosion of research in the area, tools for analysis and for data integration across repositories are still lacking. Currently, there is no comprehensive, dedicated database that collects all disease-related variations together with the protein location, its biological role in the MLO, and all the metadata available for each protein and disease. Disease-related protein variants and additional features are dispersed, so users must navigate many databases that differ in focus and format and are often not user friendly.
Results: We present DisPhaseDB, a database dedicated to disease-related variants of liquid-liquid phase separation proteins. It integrates 10 databases and contains 5,741 proteins, 1,660,059 variants, and 4,051 disease terms. It also offers intuitive navigation and an informative display. It constitutes a pivotal starting point for further analysis, encouraging the development of new computational tools. The database is freely available at http://disphasedb.leloir.org.ar.


Keywords: Database; Disease variations; Diseases; LLPS; Liquid–liquid phase separation proteins; MLO; Membrane-less organelles; Web server

Year: 2022 PMID: 35685370 PMCID: PMC9156858 DOI: 10.1016/j.csbj.2022.05.004

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Cells compartmentalize biological processes to achieve spatial and temporal control over biochemical reactions. This is accomplished through both membrane-bound and membraneless organelles (MLOs). MLOs form through liquid–liquid phase separation (LLPS), in which a liquid demixes into two phases, one of them enriched in particular macromolecules and depleted in others [1], [2], [3]. Examples include the nucleolus, Cajal bodies and nuclear speckles in the nucleus, and stress granules, P granules and P-bodies in the cytoplasm [1], [3], among others. These structures play diverse roles in biological processes such as organization of the cytoplasm and nucleoplasm, regulation of gene expression, signaling, transport and compartmentalization [4], [5]. However, they are also increasingly implicated in several complex human diseases [5], [6], [7]. Abnormal LLPS has been implicated in cancer, neurodegenerative and infectious diseases, among others [8], [9], [10], [11], [12]. It is therefore not surprising that perturbations in proteins that undergo LLPS, such as single nucleotide variants (SNVs), gene copy number variations (CNVs), protein mutations and post-translational modifications (PTMs), can upset the fine-tuned process of MLO formation, stability and dynamics [6], [13], [14], [15], [16], [17], [18], [19], [20].

Proteins that undergo LLPS are often intrinsically disordered or contain disordered regions (IDPs and IDRs, respectively). They may also have a biased amino acid composition or low-complexity regions (LCRs), and are therefore highly dynamic [21], [22], [23]. To a large extent, these regions are responsible for phase separation, although other types of regions or domains are also found in phase-separating proteins [21], [24], [25]. Several types of molecular interaction contribute to LLPS, such as multivalent protein–protein and protein–DNA/RNA interactions, dynamic and transient interactions mediated by IDRs, LCRs and prion-like regions, as well as aggregation, coacervation, and electrostatic, cation-π and π-π interactions, among others [26]. Mapping mutations to structural features could help to understand the mechanisms involved in the formation of pathological aggregates. For example, mutations in the prion-like domains (PLDs) of several proteins have been shown to be involved in neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS), frontotemporal dementia (FTD), and multisystem proteinopathy [11].

Numerous experimental methods have been developed or repurposed to study the LLPS process and the proteins involved, such as fluorescence recovery after photobleaching (FRAP), nuclear magnetic resonance spectroscopy (NMR), immunofluorescence, fluorescence correlation spectroscopy (FCS), and many others [27], [28], [29], [30]. However, there are still not enough bioinformatics tools and databases to study them, much less in the context of human diseases. We hypothesize that this is partly due to the lack of centralized data repositories, the low agreement among existing ones, the scarcity of dedicated cross-referenced databases and the poor scalability offered for large integrative analyses of phase-separating proteins. The agreement among four dedicated databases of LLPS proteins [31], [32], [33], [34] was shown to be rather poor: they share 42 human proteins out of 4,367, indicating that none of the four databases taken alone provides enough data for a meaningful analysis [35]. Moreover, none of them focuses on protein variations in diseases.

To fill this gap, we present DisPhaseDB, an integrative database focused on disease variations in LLPS proteins. The database encompasses all known phase-separating proteins, including Drivers, Clients, Regulators and other proteins experimentally associated with MLOs, together with their disease-related variations. We expect our database to be of interest to researchers studying MLOs, LLPS proteins, diseases, proteins as therapeutic targets and specific MLO components in a disease, and also to computational groups developing methods to understand sequence-function relationships and mutational impact.

Methods

Selection of LLPS-involved and MLO-associated proteins

Our starting point was an integrated set of MLO-associated human proteins collected in a previous group effort [35]. It consists of the entries of four databases of LLPS and MLO-associated proteins, which were compiled, merged, completed and stored in a local database: PhaSePro [31], PhaSepDB [32], DrLLPS [33] and LLPSDB [34]. This set is periodically updated with new releases of the databases. The consolidated dataset is available at https://mlos.leloir.org.ar/ [35]. The role of each protein in the LLPS process and its association with MLOs are taken from the annotation of the source database. There are four types of protein role: Driver, Regulator, Client and Unassigned, the last being used when no database describes the role. In addition, we grouped the experimental evidence supporting the roles as low throughput or high throughput, so that users can evaluate their confidence.
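The merging of role annotations described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the function names `merge_sources` and `roles_of`, and the role and evidence values assigned to the two accessions are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical per-database records: (UniProt accession, MLO, role, evidence).
# The roles and evidence values below are made up for illustration.
SOURCE_RECORDS = {
    "PhaSePro": [("P35637", "stress granule", "Driver", "low throughput")],
    "PhaSepDB": [("P35637", "stress granule", "Client", "high throughput"),
                 ("Q06787", "stress granule", None, "high throughput")],
    "DrLLPS": [("Q06787", "P-body", "Regulator", "low throughput")],
    "LLPSDB": [],
}


def merge_sources(source_records):
    """Group annotations by protein, keeping the source database of each one."""
    merged = defaultdict(list)
    for database, records in source_records.items():
        for accession, mlo, role, evidence in records:
            merged[accession].append({
                "database": database,
                "mlo": mlo,
                # A role missing in the source record becomes "Unassigned".
                "role": role or "Unassigned",
                "evidence": evidence,
            })
    return dict(merged)


def roles_of(merged, accession):
    """Distinct roles reported for one protein.

    "Unassigned" is returned only when no database describes any role.
    """
    roles = {a["role"] for a in merged.get(accession, [])}
    assigned = roles - {"Unassigned"}
    return assigned or {"Unassigned"}


merged = merge_sources(SOURCE_RECORDS)
print(sorted(roles_of(merged, "P35637")))
print(sorted(roles_of(merged, "Q06787")))
```

Keeping one record per source database, rather than collapsing to a single role, preserves cases where the same protein is annotated differently across databases or MLOs.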

Mutation collection

We obtained human coding variants from ClinVar release 20200402 [36], a public archive of human genetic variants and their interpretation with respect to clinical conditions or phenotypes, along with the evidence supporting each association. DisGeNET [37] offers several datasets based on gene-disease associations (GDAs) and variant-disease associations (VDAs). For our database we took mutations from the curated VDA dataset (October 2020), which in turn integrates variants from UniProt, ClinVar, GWASdb [38] and the GWAS Catalog [39]. From UniProt [40] we used the dataset of human variants that are manually annotated in UniProtKB/Swiss-Prot (release-2021_02). Lastly, COSMIC release v94 was used to obtain the coding point mutations in human cancers [41]. In all cases, we mapped variants with genomic coordinates from the human genome assembly GRCh38 onto the canonical protein sequence.

Annotations of diseases and other altered phenotypic effects in ClinVar, COSMIC, DisGeNET and UniProt are consistent neither between databases nor within the same database. They are frequently cross-referenced to one or more ontologies that collect medical terms and/or diseases, such as the Disease Ontology (DO) [42], the Human Phenotype Ontology (HPO) [43], Medical Subject Headings (MeSH) [44], Medical Genetics (MedGen, https://www.ncbi.nlm.nih.gov/medgen/), the Monarch Merged Disease Ontology (MONDO) [45], the National Cancer Institute Thesaurus (NCI, https://ncim.nci.nih.gov/) and Online Mendelian Inheritance in Man (OMIM) [46], among others. In some cases there is no reference to any ontology. Furthermore, a mutation can be associated with several diseases and vice versa. In this context, studying a variant, a protein or a disease is challenging.

For example, the mutation R521C in the FUS protein is associated with different diseases in different ontologies: Melanoma of skin (SNOMEDCT_US: 93655004), amyotrophic lateral sclerosis ALS6 (MEDCIN: 315716 and MedGen: C1842675) and Gastric Carcinoma (NCI: C4911). In addition, many synonyms are annotated for the same disease within one ontology; for example, “Cancer of Stomach”, “Cancer of the Stomach”, “Carcinoma of Stomach”, “Gastric Cancer”, etc., all refer to the same disease in NCI. Synonyms also occur across ontologies, for example Cutaneous Melanoma (MedGen: C0151779), Melanoma of skin (SNOMEDCT_US: 93655004) and “Melanoma, Cutaneous Malignant” (OMIM: 155600). Finally, different levels of specificity of a disease are recorded as different terms; for example, “Acanthoma” is a type of “Skin Neoplasms”. Therefore, mapping all disease terminology onto a single ontology is not feasible. To help users navigate this tangle of terms across dozens of ontologies when studying a variation or a protein, DisPhaseDB includes all available disease annotations and, when there is no reference to an ontology, a reference to the source mutation database.
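One way to picture the annotation strategy just described (keep every ontology cross-reference, and fall back to the source mutation database when there is none) is the following Python sketch. The class `DiseaseAnnotation`, its fields and the `source_db` values attached to each term are hypothetical; only the disease terms and ontology identifiers for FUS R521C come from the text, and the last entry is invented to show the fallback.

```python
from dataclasses import dataclass, field


@dataclass
class DiseaseAnnotation:
    term: str
    source_db: str  # variant database reporting the term (assumed values below)
    xrefs: dict = field(default_factory=dict)  # ontology name -> identifier

    def references(self):
        """Ontology cross-references, or the source database if there are none."""
        if self.xrefs:
            return [f"{onto}:{ident}" for onto, ident in sorted(self.xrefs.items())]
        return [f"source:{self.source_db}"]


# Disease terms listed in the text for FUS R521C; source_db values are assumptions.
annotations = [
    DiseaseAnnotation("Melanoma of skin", "COSMIC", {"SNOMEDCT_US": "93655004"}),
    DiseaseAnnotation("Amyotrophic lateral sclerosis ALS6", "ClinVar",
                      {"MEDCIN": "315716", "MedGen": "C1842675"}),
    DiseaseAnnotation("Gastric Carcinoma", "COSMIC", {"NCI": "C4911"}),
    # Hypothetical term with no ontology reference, to show the fallback.
    DiseaseAnnotation("Unmapped phenotype (hypothetical)", "UniProt"),
]

for annotation in annotations:
    print(annotation.term, "->", ", ".join(annotation.references()))
```

Storing the cross-references as given, instead of forcing them into one ontology, avoids losing terms that have no equivalent elsewhere.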

Additional information

We also included molecular features such as structural domains from the Pfam database (Mistry et al., 2020), intrinsically disordered regions (IDRs) and low-complexity regions (LCRs) from MobiDB [47], post-translational modifications (PTMs) retrieved from PhosphoSitePlus [48] and prion-like domains (PLDs) predicted by PLAAC [48], [49]. These features are displayed on the protein sequence using the “Feature-Viewer” tool for visualizing positional data [50].
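Relating a mutation to the positional features listed above amounts to an interval lookup on the protein sequence. The Python sketch below illustrates the idea under stated assumptions: the region coordinates in `REGIONS` are invented, and `parse_mutation` and `features_at` are hypothetical helpers that handle only simple single-residue substitutions written like R521C.

```python
import re

# Hypothetical region annotations as 1-based inclusive (start, end, label).
# The coordinates are invented and do not describe any real protein.
REGIONS = [
    (1, 165, "PLD"),
    (1, 285, "IDR"),
    (285, 371, "Pfam domain"),
    (422, 526, "IDR"),
]

# One-letter reference residue, position, one-letter alternative (or * for stop).
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z*])$")


def parse_mutation(mutation):
    """Split a protein-level substitution such as 'R521C' into (ref, pos, alt)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unsupported mutation format: {mutation!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def features_at(position, regions):
    """Labels of every region covering a 1-based sequence position."""
    return [label for start, end, label in regions if start <= position <= end]


ref, pos, alt = parse_mutation("R521C")
print(pos, features_at(pos, REGIONS))
```

A linear scan is enough for the handful of regions on a single protein; an interval tree would only matter for bulk queries over many proteins.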

Server construction and access

The server backend consists of an HTTP web server developed in Python 3.8+ using the Flask framework and MySQL. The client web application was developed with the AngularJS framework.

Results

DisPhaseDB in numbers

We present DisPhaseDB, available at https://disphasedb.leloir.org.ar. DisPhaseDB contains 5,741 LLPS proteins, all of them with experimental evidence supporting their association with MLOs. For these proteins we collected human disease mutations from up-to-date databases including UniProt, ClinVar, DisGeNET and COSMIC. After merging the four databases, the total number of unique coding variants (protein mutations) is 1,660,059. COSMIC contributes 1,464,124, ClinVar 221,097, DisGeNET 56,813 and UniProt 22,965. Supplementary Fig. 1 shows the overlap of the four protein variation resources, indicating that all of them are needed to obtain a better landscape of mutations in LLPS proteins. The most common types are missense mutations, followed by synonymous mutations (66.57% and 23.41%, respectively) (Supplementary Fig. 2). It is evident that an amino acid change due to a missense mutation could influence protein structure, function and LLPS behavior. However, synonymous SNPs can also contribute substantially to disease risk and other complex traits. Various molecular mechanisms underlie these effects, such as altering splicing efficiency and/or accuracy, losing information on exon–intron boundaries [51], affecting post-transcriptional processing and regulation of RNA, influencing the kinetics of mRNA translation [52] and affecting the timing of cotranslational folding due to rare codons [53], among others.

On average, proteins in DisPhaseDB have around 200 mutations, although a few proteins are exceptionally highly mutated (Fig. 1). For example, titin (20,552 mutations) is a key component of striated muscles, and mutations in this protein are related to different types of cardiomyopathies and muscular dystrophies [54], [55], [56]. BRCA1, BRCA2 and APC (9,172, 12,063 and 9,237 mutations, respectively) are proteins involved in DNA repair and tumor suppression [57], [58], [59]. It is well known that mutations in these proteins increase the risk of different types of cancer, especially breast, ovarian and colorectal cancer [60], [61], [62]. Mutations are not evenly distributed across protein regions: IDRs and LCRs have more mutations than the ordered portion of the protein (Supplementary Fig. 3).
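The per-source counts reported above sum to more than the merged total, and the difference is the number of entries that are redundant across sources. The short Python sketch below works through that arithmetic and shows a set-union deduplication on toy data. Defining a unique variant as an (accession, mutation) pair is an assumption, as are the function name `unique_variants` and the placeholder accessions; the text does not state how uniqueness was determined.

```python
# Per-source counts and the merged total, as reported in the text.
SOURCE_COUNTS = {
    "COSMIC": 1_464_124,
    "ClinVar": 221_097,
    "DisGeNET": 56_813,
    "UniProt": 22_965,
}
UNIQUE_TOTAL = 1_660_059

summed = sum(SOURCE_COUNTS.values())
redundant = summed - UNIQUE_TOTAL
print(summed)     # 1764999
print(redundant)  # 104940


def unique_variants(sources):
    """Union of variant keys across sources; a key is (accession, mutation)."""
    merged = set()
    for variants in sources.values():
        merged |= set(variants)
    return merged


# Toy input with placeholder accessions, only to show the deduplication.
toy = {
    "ClinVar": [("PROT1", "A10V"), ("PROT1", "G25R")],
    "UniProt": [("PROT1", "A10V")],
}
print(len(unique_variants(toy)))  # 2
```

Note that a variant present in three sources counts twice towards `redundant`, so this figure is an upper bound on the number of shared variants rather than an exact count.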

Fig. 1

A) Number of mutations by protein (only the first 1000 most mutated proteins are shown). B) Distribution of proteins by the number of mutations.

Each protein is associated with one or more LLPS source databases and, when possible, with its role in the LLPS process. Protein roles can vary depending on the MLO and the source database, leading to diverse situations. A protein can be annotated as a Driver in one MLO and as a Client in another; a protein can also have a role in one database and be unassigned, or have a different role, in another database for the same MLO. There are 285 proteins classified as Drivers, 357 as Regulators and 3,157 as potential Clients, while 4,105 have no role assigned in their source databases or MLO (Supplementary Fig. 4 shows the distribution of proteins by role, disaggregated by MLO). Mutated proteins in DisPhaseDB are associated with 103 MLOs, and the number of proteins varies across them. For example, the nucleolus has 3,315 associated proteins while the synaptosome has only 1. Most proteins are associated with a single MLO (3,729), and the maximum is 13 MLOs (1 protein) (Fig. 2).

Fig. 2

A) Protein distribution by MLO in DisPhaseDB, showing only MLOs with more than 10 associated proteins. B) Number of MLOs in which a protein can be present.

Mutated proteins are also associated with one or more diseases. Fig. 3 (upper panel) shows the number of DisPhaseDB mutated proteins associated with each of the MeSH ontology subheadings in the disease category. These headings are nodes near the root of the ontology, but the annotations allow drilling down to more specific terms; for example, Supplementary Fig. 5 shows the terms under the “neoplasms” subheading disaggregated by site. Since 80% of the mutations in DisPhaseDB are contributed by COSMIC (somatic mutations in cancer), Fig. 3 (lower panel) shows the distribution of mutated proteins by disease after removing the mutations contributed by COSMIC. Even after removing COSMIC mutations, proteins associated with neoplasms are still predominant.

Fig. 3

A) Distribution of the total mutated proteins among all the subheadings in the MeSH ontology. B) Same as A, but excluding COSMIC-contributed mutations to show the tendency (COSMIC contributes 1,464,124 mutations out of 1,660,059).

Server usage

DisPhaseDB offers either a quick search by protein, MLO or disease, or an advanced search applying one or several filters. Available filters are protein, role, MLO, disease name or keyword, evidence (low or high throughput experiments), protein disorder content and mutation type (missense, frameshift, nonsense, etc.). Filters can be combined so that users can customize the set of proteins according to their needs or interests. As an example, Fig. 4 shows a search by a particular disease: Hepatobiliary Neoplasm. The output is a list of proteins involved in this disease with relevant annotations. Clicking a protein expands further characteristics. In the example, synaptic functional regulator FMR1 (UP: Q06787) is selected. The information displayed is divided into the following sections: I) a protein summary with general information and the FASTA sequence; II) protein MLO location; III) protein features mapped onto the sequence, such as regions, domains, disorder content and mutations (disaggregated by type), among others; IV) a mutation summary; and V) a mutation table to download. Fig. 4 is a composite of different parts of the search results and output for illustrative purposes.
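Combining filters as described above can be thought of as a logical AND over independent predicates. The following Python sketch illustrates that idea only; it does not reflect the server's actual query code, and the `ProteinEntry` fields, the filter helpers and the three entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinEntry:
    accession: str
    roles: frozenset
    mlos: frozenset
    diseases: frozenset
    disorder_content: float  # fraction of disordered residues, 0.0 to 1.0


def by_role(role):
    return lambda p: role in p.roles


def by_mlo(mlo):
    return lambda p: mlo in p.mlos


def by_disease_keyword(keyword):
    kw = keyword.lower()
    return lambda p: any(kw in d.lower() for d in p.diseases)


def by_min_disorder(threshold):
    return lambda p: p.disorder_content >= threshold


def search(entries, *filters):
    """Keep entries that satisfy every filter (logical AND)."""
    return [p for p in entries if all(f(p) for f in filters)]


# Made-up entries for illustration only.
ENTRIES = [
    ProteinEntry("PROT1", frozenset({"Driver"}), frozenset({"stress granule"}),
                 frozenset({"Hepatobiliary Neoplasm"}), 0.62),
    ProteinEntry("PROT2", frozenset({"Client"}), frozenset({"nucleolus"}),
                 frozenset({"Hepatobiliary Neoplasm"}), 0.15),
    ProteinEntry("PROT3", frozenset({"Driver"}), frozenset({"nucleolus"}),
                 frozenset({"Cardiomyopathy"}), 0.40),
]

hits = search(ENTRIES, by_disease_keyword("neoplasm"), by_min_disorder(0.3))
print([p.accession for p in hits])
```

Expressing each filter as a small predicate keeps them independent, so any subset can be applied in any order.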

Fig. 4

Example of a search by the disease Hepatobiliary Neoplasm. The top panel shows the first two proteins of a list of 87 related to the query disease. The middle panel shows the protein features mapped onto the sequence, and the bottom panel shows a portion of the list of disease-related mutations in which the protein is involved (three out of 352 mutations).

Discussion

To the best of our knowledge, there is no integrated and comprehensive resource for mutations in MLO-associated proteins. For this reason, we integrated all state-of-the-art resources of proteins involved in LLPS and MLOs with four relevant disease databases that annotate medical terms and phenotypic effects. The selected clinically relevant variant databases are not redundant, showing very little overlap among them, so together they cover the range of diseases and variant effects. Variant databases are often not user friendly, and they cross-reference different disease ontologies and many other databases. This highlights the need to unify these resources. DisPhaseDB also provides mutation mapping onto the protein sequence and associated metadata, such as disordered, low-complexity and ordered regions and post-translational modifications, among other features. This resource will therefore be helpful for investigating sequence-function relationships and mutational impact on LLPS proteins, and for helping researchers better understand complex human diseases through the lens of phase separation.

Funding

AMN, FO and MA are PhD fellows, EMP is a postdoctoral fellow, and FS, JI and CMB are researchers of CONICET, Argentina. This work was partially funded by PICT-2018-01015.

CRediT authorship contribution statement

Alvaro M. Navarro: Formal analysis, Investigation, Methodology, Software, Visualization, Validation. Fernando Orti: Data curation, Methodology, Software. Elizabeth Martínez-Pérez: Data curation, Investigation, Methodology, Software. Macarena Alonso: Data curation, Methodology. Franco Simonetti: Conceptualization, Methodology, Supervision, Writing - review. Javier Iserte: Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Writing - original draft. Cristina Marino-Buslje: Conceptualization, Formal analysis, Funding acquisition, Resources, Investigation, Supervision, Writing - original draft, review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

61 in total

1. Sizing subcellular organelles and nanoparticles confined within aqueous droplets.

Authors: Jennifer C Gadd; Christopher L Kuyper; Bryant S Fujimoto; Richard W Allen; Daniel T Chiu
Journal: Anal Chem Date: 2008-03-26 Impact factor: 6.986

Review 2. Biogenesis and function of nuclear bodies.

Authors: Yuntao S Mao; Bin Zhang; David L Spector
Journal: Trends Genet Date: 2011-06-15 Impact factor: 11.639

Review 3. Liquid-liquid phase separation: Orchestrating cell signaling through time and space.

Authors: Qi Su; Sohum Mehta; Jin Zhang
Journal: Mol Cell Date: 2021-10-06 Impact factor: 19.328

4. Tibial muscular dystrophy is a titinopathy caused by mutations in TTN, the gene encoding the giant skeletal-muscle protein titin.

Authors: Peter Hackman; Anna Vihola; Henna Haravuori; Sylvie Marchand; Jaakko Sarparanta; Jerome De Seze; Siegfried Labeit; Christian Witt; Leena Peltonen; Isabelle Richard; Bjarne Udd
Journal: Am J Hum Genet Date: 2002-07-26 Impact factor: 11.025

5. GWASdb: a database for human genetic variants identified by genome-wide association studies.

Authors: Mulin Jun Li; Panwen Wang; Xiaorong Liu; Ee Lyn Lim; Zhangyong Wang; Meredith Yeager; Maria P Wong; Pak Chung Sham; Stephen J Chanock; Junwen Wang
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

6. Nup98 FG domains from diverse species spontaneously phase-separate into particles with nuclear pore-like permselectivity.

Authors: Hermann Broder Schmidt; Dirk Görlich
Journal: Elife Date: 2015-01-06 Impact factor: 8.140

7. DrLLPS: a data resource of liquid-liquid phase separation in eukaryotes.

Authors: Wanshan Ning; Yaping Guo; Shaofeng Lin; Bin Mei; Yu Wu; Peiran Jiang; Xiaodan Tan; Weizhi Zhang; Guowei Chen; Di Peng; Liang Chu; Yu Xue
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

8. PhaSepDB: a database of liquid-liquid phase separation related proteins.

Authors: Kaiqiang You; Qi Huang; Chunyu Yu; Boyan Shen; Cristoffer Sevilla; Minglei Shi; Henning Hermjakob; Yang Chen; Tingting Li
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

9. COSMIC: the Catalogue Of Somatic Mutations In Cancer.

Authors: John G Tate; Sally Bamford; Harry C Jubb; Zbyslaw Sondka; David M Beare; Nidhi Bindal; Harry Boutselakis; Charlotte G Cole; Celestino Creatore; Elisabeth Dawson; Peter Fish; Bhavana Harsha; Charlie Hathaway; Steve C Jupe; Chai Yin Kok; Kate Noble; Laura Ponting; Christopher C Ramshaw; Claire E Rye; Helen E Speedy; Ray Stefancsik; Sam L Thompson; Shicai Wang; Sari Ward; Peter J Campbell; Simon A Forbes
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971