Ioannis Kirmitzoglou, Vasilis J Promponas. Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, CY 1678, Nicosia, Cyprus.
Abstract
MOTIVATION: Local compositionally biased and low complexity regions (LCRs) in amino acid sequences initially attracted the interest of researchers because they generate artifacts in sequence database searches. There is accumulating evidence of the biological significance of LCRs in both physiological and pathological situations. Nonetheless, LCR-related algorithms and tools have not gained wide appreciation across the research community, partly because only a handful of user-friendly software tools are currently freely available.
RESULTS: We developed LCR-eXXXplorer, an extensible online platform attempting to fill this gap. LCR-eXXXplorer offers tools for displaying LCRs from the UniProt/SwissProt knowledgebase, in combination with other relevant protein features, predicted or experimentally verified. Moreover, users may perform powerful queries against a custom-designed sequence/LCR-centric database. We anticipate that LCR-eXXXplorer will be a useful starting point in research efforts for the elucidation of the structure, function and evolution of proteins with LCRs.
AVAILABILITY AND IMPLEMENTATION: LCR-eXXXplorer is freely available at the URL http://repeat.biol.ucy.ac.cy/lcr-exxxplorer.
CONTACT: vprobon@ucy.ac.cy
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
During the past 30 years, research on regions of local compositional extremes (low complexity regions; LCRs) has focused mainly on identifying them for sequence masking (Altschul et al.; Wootton and Federhen, 1993; Ye et al.), in order to eliminate spurious hits in database searches (Promponas et al.; Tsoka et al.). Several studies have showcased the abundance and importance of such regions at the molecular/structural (e.g. Radivojac et al.; Tamana et al.), functional (e.g. Andrade et al.; Haerty and Golding, 2010), organismic (e.g. Miskinyte et al., 2013; Pizzi and Frontali, 2001) and habitat level (e.g. Nandi et al.). Despite the apparent biological importance of LCRs, there is a distinct lack of tools or services that help biologists study them in depth. Most methods capable of detecting LCRs were developed solely for masking them and are meant to be run from the command line as part of a sequence analysis or search pipeline. While some tools, such as SEG (Wootton and Federhen, 1993), CAST (Promponas et al.) or BIAS (Kuznetsov and Hwang, 2006), do offer more advanced reports as an option, their output is mostly meant to be parsed by software rather than read by a biologist.
In this work, we present LCR-eXXXplorer, an online service to search, visualize and share LCRs in protein sequences. We highlight its unique features that may facilitate research efforts towards understanding the biological roles of proteins with LCRs.
2 Functionality
2.1 General description
LCR-eXXXplorer is built upon a customized instance of GBrowse (Stein et al., 2002), modified to work properly with protein sequences. It currently contains 545 000 sequences (retrieved from UniProt/SwissProt) carrying over 16 million LCR-related annotations. Along with information about sequence complexity, LCR-eXXXplorer displays external annotations from UniProt, as well as disordered and binding regions predicted with IUPRED (Dosztányi et al.) and ANCHOR (Dosztányi et al.; Mészáros et al.), respectively. Data are stored in a MySQL database, using a schema based on the SeqFeature schema used internally by GBrowse (see Supplementary Methods and Supplementary Fig. S1).
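To make the notion of sequence complexity concrete, the following Python sketch flags low-entropy windows in a protein sequence and masks them with X. It is only an illustration: the Shannon-entropy criterion, the window length of 12, the 2.2-bit threshold and all function names are assumptions chosen for this example, and the sketch does not reproduce SEG, CAST or the pipeline behind LCR-eXXXplorer.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_intervals(seq, window=12, threshold=2.2):
    """Return merged 1-based inclusive (start, end) intervals covered by at
    least one window whose entropy is at or below the threshold.
    The default window and threshold are illustrative assumptions."""
    flagged = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            # Mark every residue of the low-entropy window.
            for j in range(i, i + window):
                flagged[j] = True

    # Merge consecutive flagged residues into intervals.
    intervals = []
    start = None
    for pos, is_low in enumerate(flagged, start=1):
        if is_low and start is None:
            start = pos
        elif not is_low and start is not None:
            intervals.append((start, pos - 1))
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals


def mask_intervals(seq, intervals, mask_char="X"):
    """Replace residues inside the given 1-based inclusive intervals with mask_char."""
    residues = list(seq)
    for start, end in intervals:
        for k in range(start - 1, end):
            residues[k] = mask_char
    return "".join(residues)


if __name__ == "__main__":
    # Toy sequence: a run of 12 glutamines flanked by diverse residues.
    toy = "ACDEFGHIKLMN" + "Q" * 12 + "PRSTVWYACDEF"
    regions = low_complexity_intervals(toy)
    print(regions)
    print(mask_intervals(toy, regions))
```

On this toy sequence the flagged interval extends a few residues beyond the glutamine run, because every residue of a low-entropy window is marked; dedicated tools apply more elaborate criteria to delimit LCRs.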
2.2 Key functionality
A basic keyword-based search (allowing wildcards) retrieves protein sequences with matching UniProtKB Accession(s)/Entry Name(s) or gene name(s). Moreover, the ‘Advanced Search’ option (implemented specifically for this purpose as a custom GBrowse plug-in) supports more fine-tuned queries. In basic search mode, users can retrieve up to 500 entries with a simple keyword search (e.g. a single UniProt identifier or accession number). An ‘Advanced Search’ may be initiated by querying a suitable combination of UniProt fields (e.g. gene or protein name, source organism) or LCR properties (e.g. type of LCR, percent of masked residues); however, only the AND Boolean operator is currently supported for combining search criteria. In this mode, batch search is also available using a list of UniProt accession numbers: users can thus take advantage of the powerful UniProt search engine to compile a list of entries matching complex search criteria. Results can be displayed in the browser (with a limit of 15 000 entries) or downloaded as a plain-text, tab-delimited file providing statistics on the LCR content for further processing (with a limit of 50 000 entries). Different options for masking protein sequences are provided for each individual sequence from the graphical GBrowse ‘protein details’ view, and sequences are available in FASTA format.
The Downloads section of LCR-eXXXplorer offers the option of downloading the complete set of sequences in FASTA-formatted files masked for LCRs, the complete set of annotations in GFF3 format, or a CSV-formatted table with LCR statistics for each sequence in the database.
Users may also search for data in LCR-eXXXplorer using BLASTP (Ye et al.) powered by the user-friendly SequenceServer (Priyam et al., manuscript in preparation). Three underlying databases (unmasked, or masked with SEG or CAST using default parameters) are provided, with the masked databases being a unique feature of this service; this configuration has been shown to improve database search results (Kirmitzoglou, 2014; Kirmitzoglou et al., in preparation). Furthermore, users may initiate BLASTP searches against the sequence databases hosted at the NCBI web servers (http://www.ncbi.nlm.nih.gov/) using the currently displayed sequence as the query; several masking options are available, using any combination of amino acid residue types and detection algorithm.
The main strength of LCR-eXXXplorer, which sets it apart from similar services, is its visualization capabilities. Displaying LCRs in a protein sequence is more informative when information regarding other functional or structural features is also shown (Supplementary Fig. S2). By taking advantage of the underlying GBrowse capability to display features stored on a remote web-accessible server, LCR-eXXXplorer incorporates selected annotations from UniProt into the main browser interface. UniProt annotations displayed in LCR-eXXXplorer are of two major types: (i) general annotations associated with the protein sequence (e.g. protein name, gene ontology terms, PDB accession IDs) and (ii) position-specific annotations, which may include domains, sites, secondary structure etc. These annotations are fetched from UniProt/SwissProt on the fly for the protein sequence of interest. This is facilitated by a custom-designed cgi-bin script, and the retrieved features are post-processed into a format suitable for LCR-eXXXplorer.
Using the same underlying mechanism, LCR-eXXXplorer can display tracks generated by another instance of GBrowse, a Distributed Annotation System (DAS) server or valid GFF3 files generated by the user. The only requirement is that the remote tracks use the same coordinate system, which in the case of LCR-eXXXplorer is the protein sequence itself. Thus, users may display results from practically any LCR-detection tool (or any other protein sequence analysis tool) alongside the data provided by LCR-eXXXplorer.
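As a rough illustration of the last point, the Python sketch below writes user-defined regions as GFF3 features in protein coordinates (1-based, inclusive). The accession P12345, the source label my_lcr_tool, the feature type region and the function name are placeholders for this example; which sequence identifiers and feature types the server expects in user-supplied tracks is not specified here and would need to be checked against the LCR-eXXXplorer documentation.

```python
from urllib.parse import quote


def to_gff3(seqid, regions, source="my_lcr_tool", feature_type="region"):
    """Format regions as GFF3 text.

    regions: iterable of (start, end, name) tuples in 1-based inclusive
    protein coordinates. The seqid, source and feature_type values are
    hypothetical placeholders, not values prescribed by LCR-eXXXplorer.
    """
    lines = ["##gff-version 3"]
    for start, end, name in regions:
        if start < 1 or end < start:
            raise ValueError(f"invalid region {start}-{end} on {seqid}")
        # Percent-encode characters that are reserved in GFF3 attributes
        # (e.g. ';', '=', ',', '&'); spaces are left as they are.
        attributes = "Name=" + quote(name, safe=" ")
        # Nine tab-separated columns: seqid, source, type, start, end,
        # score, strand, phase, attributes. Score, strand and phase are
        # left undefined ('.') for a plain protein region.
        columns = [seqid, source, feature_type, str(start), str(end),
                   ".", ".", ".", attributes]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical accession and region, for illustration only.
    print(to_gff3("P12345", [(8, 29, "polyQ-rich")]), end="")
```

Keeping coordinates relative to the protein sequence, rather than to a genome, is what allows such a track to line up with the features served by LCR-eXXXplorer.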
2.3 Comparison to similar services
Two services providing access to protein sequence LCR-related data are currently available online. The one most closely related to LCR-eXXXplorer is LPS-annotate (Harbi et al.), which identifies LCRs based on the LPS algorithm (Harrison and Gerstein, 2003), compared to SEG. These LCR annotations are accompanied by disordered region predictions from DISOPRED (Buchan et al.). Even though LPS-annotate is an invaluable resource for researchers interested in compositionally biased proteins, its main drawback is the lack of any effective visualization options. Moreover, the underlying database (according to data available at the LPS-annotate website) has not been updated since 2009. Recently, the HRaP server (Lobanov et al.) was developed, specializing in the study of homopolymeric repeats; since these comprise a highly specialized case of LCRs, it is not discussed further here. A detailed account of web-based services providing information related to LCRs is given in Kirmitzoglou (2014).
3 Future Developments
The current version of the LCR-eXXXplorer web server offers several tools for facilitating research on proteins with LCRs, including BLAST search and interactive visualization by exploiting inherent GBrowse features. Given the genuine interest of our research group in LCR-containing proteins, we plan to expand this service in the near future.
More specifically, we are in the process of automating the LCR-eXXXplorer update procedure so that it regularly synchronizes with UniProt updates. Moreover, the customizations made to different GBrowse modules require some additional work (and appropriate documentation) to enable full programmatic access to our service through the REST interface already available for GBrowse. An important improvement planned for the next version of LCR-eXXXplorer is full support for Boolean queries against fields in the underlying database. The modular architecture of LCR-eXXXplorer (in terms of both data and software) enables easy incorporation of novel datasets (e.g. complete genome sequences) and LCR detection tools in future versions.
References
Lincoln D Stein, Christopher Mungall, ShengQiang Shu, Michael Caudy, Marco Mangone, Allen Day, Elizabeth Nickerson, Jason E Stajich, Todd W Harris, Adrian Arva, Suzanna Lewis. Genome Res, 2002.
Migla Miskinyte, Ana Sousa, Ricardo S Ramiro, Jorge A Moura de Sousa, Jerzy Kotlinowski, Iris Caramalho, Sara Magalhães, Miguel P Soares, Isabel Gordo. PLoS Pathog, 2013.