
Hypothetical Proteins as Predecessors of Long Non-coding RNAs.

Girik Malik¹, Tanu Agarwal¹, Utkarsh Raj¹, Vijayaraghava Seshadri Sundararajan¹, Obul Reddy Bandapalli¹, Prashanth Suravajhala¹.

Abstract

Hypothetical Proteins [HPs] are transcripts predicted to be expressed in an organism, but for which no evidence exists in gene banks. Long non-coding RNAs [lncRNAs], on the other hand, are transcripts longer than 200 bases that might be present in the 5' UTR or intergenic regions of genes. With known unknown [KU] regions of genomes rapidly accumulating in gene banks, there is a need to understand the role of open reading frames in the context of annotation. In this commentary, we emphasize that HPs could indeed be the predecessors of lncRNAs.


Keywords: Hypothetical proteins; annotation; aptamers; functional genomics; lncRNA; transcripts

Year: 2020 PMID: 33214769 PMCID: PMC7604745 DOI: 10.2174/1389202921999200611155418

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

There are known knowns [KK], known unknowns [KU] and unknown knowns [UK] in the genome, as aptly put by David Logan [1]. Fully sequenced genomes contain KUs that encode Hypothetical Proteins [HPs] or domains of unknown function [DUF]. These proteins are predicted through in silico approaches, and because their biological activities are not substantiated by experimental evidence, they are referred to as HPs [2, 3]. Despite their lack of functional characterization, HPs play an important role in understanding biochemical and physiological pathways [4], for instance, in finding new structures and functions [5], biomarkers and relevant targets [6], and in early detection for proteomic and genomic research [2]. Importantly, structural and functional characterization of HPs has revealed crucial roles in microorganisms, particularly in pathogens associated with human diseases [7, 8]. In the recent past, many robust in silico approaches have been developed, and these tools are publicly available to predict the putative function of HPs [9]. As HPs are implicated in a myriad of biological functions, it is important to study their Protein-Protein Interactions [PPIs], as this would allow us to see how many of them are the products of pseudogenes [10]. Further, HP function could also be inferred from motif similarity data and homology references [11]. With the advancement of Next-Generation Sequencing [NGS], there is an inherent need to further annotate, classify and curate HPs, which will help not only in understanding their functions but also in using them as biomarkers [12]. Across the whole genome, ca. 2% encodes proteins, while the remainder is non-coding or still functionally unknown [13]. In our recent study, to examine the role of HPs and long non-coding RNAs [lncRNAs] in metabolic diseases and cancers, we deciphered the function of HPs using a nine-point classification-scoring schema [14].
Most HPs from GenBank lack protein-coding capacity; therefore, we argue that they could essentially be a part of non-coding RNA [ncRNA] transcripts. We earlier proposed an interactome of HPs [Hypothome] to define a network of Protein-Protein Interactions [PPIs] wherein a connotation to the interactomes for predicted proteins would devalue the integrity of the interactome [15]. In this commentary, we give a gist of what the ‘hypothome’ is all about and argue that HPs also serve as predecessors of lncRNAs. The HP annotation we describe applies largely to eukaryotes, as lncRNAs are the subjects considered.

CHALLENGES

Essential HPs

We developed a database of HPs, a.k.a. Hypo2 [12], and a “quick search” enabled us to retrieve the list of HPs with the new annotation records from NCBI. Similarly, searching NCBI with the keywords “Hypothetical protein” AND “Homo sapiens”, combined using Boolean logic, enabled us to filter 8653 accessions. After careful validation of “M” or “L” [Methionine/Leucine] start sites through an in-house Python script, the sequences were mapped to UniProt IDs using the “ID mapping” tool, wherein 4192 out of 6129 linked Gene Indices [GI] were successfully mapped to 3212 UniProtKB identities (Supplementary Table; labelled as All proteins). Another 787 (Supplementary Table; labelled as 787) were mapped to the UniParc sequence archive. Bearing in mind that HPs may turn out to be non-coding RNAs or pseudogenes, we ran BLAST on our dataset of 3212 proteins against the human Noncode dataset [http://www.noncode.org last accessed, September 11, 2019] and found two hits with an E value of 0 (Supplementary Table; highlighted in yellow and labelled as Noncode). These two non-coding RNAs eventually turned out to be pseudogenes [marked in yellow in the LncRNAs sheet]. A further search using filters set with keywords like ‘ncRNA’ and ‘LINC’ for the genes yielded 12 ncRNAs and lncRNAs (Supplementary Table; labelled as LncRNAs), which were further validated using the Noncode [16] and Lncipedia [17] databases. A total of 809 matched interactions formed the primary basis of the ‘hypothome’, and among them, 73 (Supplementary Table; labelled as 73) were available as Noncode BLAST results, which were further used for downstream annotation to check for consolidated pathways and drug interactions. A representation of how an HP could prospectively be annotated as a lncRNA is shown in Fig. (1). Further, a contextual hub analysis tool [CHAT] [18] network was created using these matched entries and, in order to reach a consensus, the same protein interactions were also visualized in Osprey [19] (Fig. ). Interestingly, these were associated with diseases like Alzheimer’s and with pathways such as arginine and proline metabolism, DNA repair, etc. Thus, we believe that HPs, if annotated, could prove to be lncRNAs and could be used as putative biomarkers for various diseases. With further validation, they could also be used for identifying specific antibodies/aptamers and novel drug targets, and could finally serve as ‘essential HPs’.
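The start-site validation step described above can be illustrated with a minimal sketch. The in-house script itself is not reproduced in this commentary, so the function names, the FASTA parsing and the toy accessions below are assumptions made purely for illustration; only the rule (keep sequences that begin with “M” or “L”) comes from the text.

```python
# Hypothetical sketch of the "M"/"L" start-site check; this is not the
# authors' in-house script, and all names and records here are assumptions.

VALID_START_RESIDUES = ("M", "L")  # Methionine / Leucine, as stated in the text


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {accession: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # keep only the accession token
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def has_valid_start(sequence, allowed=VALID_START_RESIDUES):
    """Return True if the protein sequence begins with an allowed residue."""
    return bool(sequence) and sequence[0] in allowed


def filter_by_start_site(records):
    """Keep only the records whose sequence starts with M or L."""
    return {acc: seq for acc, seq in records.items() if has_valid_start(seq)}


# Toy input: accessions and sequences are made up for illustration.
example_fasta = """>HP_A hypothetical protein [Homo sapiens]
MKTAYIAKQR
>HP_B hypothetical protein [Homo sapiens]
LSPADKTNVK
>HP_C hypothetical protein [Homo sapiens]
XAAGGT
"""

kept = filter_by_start_site(parse_fasta(example_fasta))
print(sorted(kept))  # accessions that pass the start-site check
```

Records that pass such a filter would then go on to the identifier-mapping and BLAST steps; those steps depend on external services and are not sketched here.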

Fig. (1)

The figure shows the difference between a known known and a known unknown protein. Specifically, characteristic domains such as domains of unknown function (DUF), unrelated ORFs or KIAA domains are associated with hypothetical proteins and are usually present in the C-terminal region of the protein. We show a classic example of how CAC92745, an HP, could be annotated as a lncRNA, viz. LINC00208. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Large-Scale Protein Interactions for Structural ‘mer’ Studies

Given the aforementioned reason for calling them ‘essential HPs’, we believe that these protein interactions can be processed as graph structures. HPs could be ideal candidates for graph representation, wherein one can easily find metrics like cliques, communities, small worlds, etc., that lead to an abstract level of understanding of interactions, drug delivery, etc. With growing numbers of pseudogenes, the hypothome problem could be scaled by running existing algorithms in parallel using methods like MapReduce [20]. For the initial step of using BLAST to find sequence matches, there are other approaches that exploit parallel infrastructure [21-23]; however, these approaches are limited and often underutilised because their setup is complex compared with sequential processing. Lately, there has been a sudden increase in tools and techniques for data analysis that have helped solve some interesting problems in the bioinformatics domain. Such techniques have also incited the interest of large community projects in using novel methods for crunching data. Google DeepMind’s AlphaFold [https://deepmind.com/blog/article/alphafold last accessed, December 16, 2019] is one such example, which outperformed other techniques in the popular Critical Assessment of Structure Prediction [CASP] competition. Deep Learning, in general, is picking up rapidly because of the gargantuan amounts of data now accessible, which gives it the potential to address the data analysis and learning issues found in enormous volumes of input data. More specifically, it helps in automatically extracting complex data representations from a massive volume of unlabelled data. This makes it a valuable means for big data analytics, which encompasses the analysis of huge accumulations of raw data that are usually unlabelled and uncategorized.
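The graph view of HP interactions mentioned above can be made concrete with a small sketch. It assumes interactions are available as simple protein pairs; the toy edge list and function names are hypothetical. It enumerates maximal cliques with the basic Bron-Kerbosch recursion and uses connected components as the crudest possible stand-in for communities, which is a simplification and not a community-detection method used in the commentary.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    if graph:
        expand(set(), set(graph), set())
    return sorted(cliques)


def connected_components(graph):
    """Group proteins into connected components via depth-first search."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy interaction list: identifiers are placeholders, not real HP accessions.
toy_edges = [("HP1", "HP2"), ("HP2", "HP3"), ("HP1", "HP3"),
             ("HP3", "HP4"), ("HP5", "HP6")]

graph = build_graph(toy_edges)
cliques = maximal_cliques(graph)
components = connected_components(graph)
hub = max(sorted(graph), key=lambda node: len(graph[node]))  # highest degree
print(cliques, components, hub)
```

For interaction sets of realistic size, a dedicated graph library would be the more practical choice; the plain-Python version is kept here only so the sketch has no external dependencies.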
With the increased number of HPs turning out to be pseudogenes, there is an acute need to develop novel processing techniques. Very recently, these techniques have not only started to make their way into functional prediction [24-26] but have also proved useful for what is described by Logan DC [1], namely finding KUs, given the other categories and their combinations. Such techniques are generally rate-limited by the processing pipeline, wherein the difference between the amount of data taken as input and the amount given as output is far from insignificant. Eliminating this bottleneck generally results in faster data processing and a better-performing system [27, 28].
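As a rough illustration of how a hypothome computation could be phrased in MapReduce style, the sketch below counts the number of interactions per protein across separately processed chunks. It is a single-process imitation of the map, shuffle and reduce phases written with the standard library only; it does not use any real MapReduce framework, and the chunked toy data and function names are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit (protein, 1) for each endpoint of each interaction."""
    for a, b in chunk:
        yield a, 1
        yield b, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values per key to get the interaction count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each inner list stands for a chunk that a separate worker could process.
# Identifiers are placeholders, not real HP accessions.
chunks = [
    [("HP1", "HP2"), ("HP2", "HP3")],
    [("HP1", "HP3"), ("HP3", "HP4")],
]

mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
degrees = reduce_phase(shuffle(mapped))
print(degrees)
```

Because the map step touches each chunk independently, the chunks could in principle be distributed across workers, with only the grouped counts being combined at the end.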

A Case Study on Targeting Domains of Unknown Function

Moving forward to structures, the 3D swap database has a couple of DUFs, which we would like to consider as a case study to infer the role of aptamers for effective specificity. Assuming that DUFs and HPs can be used as targets for diagnostics, the most common entities used are antibodies that could possibly circumvent the effect/targets. While the experimental characterization of antibodies is cumbersome, it is assumed that aptamer-protein prediction methods may serve as a benchmark besides providing cost-effective measures [29, 30]. What remains to be elucidated is whether the aptamer is bound; if bound, whether the 3D domains are swapped; and if swapped, whether this could also be applied to domains arising from extensive multimerization. To answer this, we analysed the structure and found that there is a dimer interface communicating with the catalytic domain of 2A9U. Therefore, we assume that aptamers specific to this variable fragment could be used. With this, we expect that, through the antigen-binding capacity of the aptamer with the molecule, a wide number of HPs or DUFs that could be associated with diseases can be targeted. Thus, we hope that active conformations and aptamers as small molecules for therapies could prove very useful in the development of medical technology. We therefore hypothesize that the role of aptamers over antibody isotypes can be inferred based on the affinity of aptamers bound to swapped domains particular to HPs or DUFs.

CONCLUSION

Automated genome sequence analysis and annotation may provide ways to understand genomes, although with limited precision. Given the challenge of determining protein functions, bioinformatics algorithms have not only allowed us to predict approximate functions for these HPs but have also allowed us to benchmark these methods for developing efficient tools. From experimentally determined partners or interologs [orthologous interacting partner pairs], it is, in principle, possible to suggest a role for HPs in a biological context. In recent years, lncRNAs have emerged as key regulators of cellular processes, including transcription, splicing, translation and DNA repair, and their role in regulation and their relation to HPs have not only provided great hope for characterising the KUs but have also revealed new insights into modern biology. The experimental characterization of these and other ‘conserved hypothetical’ proteins is expected to reveal new directions, for example, in understanding crucial aspects of microbial biology, and could additionally lead to better functional prediction for characterizing medically relevant human homologs.

23 in total

1. Biological function made crystal clear - annotation of hypothetical proteins via structural genomics. (Review)

Authors: E Eisenstein; G L Gilliland; O Herzberg; J Moult; J Orban; R J Poljak; L Banerjei; D Richardson; A J Howard
Journal: Curr Opin Biotechnol Date: 2000-02 Impact factor: 9.740

2. A WW domain protein TAZ is a critical coactivator for TBX5, a transcription factor implicated in Holt-Oram syndrome.

Authors: Masao Murakami; Masayo Nakagawa; Eric N Olson; Osamu Nakagawa
Journal: Proc Natl Acad Sci U S A Date: 2005-12-06 Impact factor: 11.205

3. Detection of functionally important regions in "hypothetical proteins" of known structure.

Authors: Guy Nimrod; Maya Schushan; David M Steinberg; Nir Ben-Tal
Journal: Structure Date: 2008-12-10 Impact factor: 5.006

4. Sequence homology and expression profile of genes associated with DNA repair pathways in Mycobacterium leprae.

Authors: Mukul Sharma; Sundeep Chaitanya Vedithi; Madhusmita Das; Anindya Roy; Mannam Ebenezer
Journal: Int J Mycobacteriol Date: 2017 Oct-Dec

5. Functional Annotation of Proteins Encoded by the Minimal Bacterial Genome Based on Secondary Structure Element Alignment.

Authors: Zhiyuan Yang; Stephen Kwok-Wing Tsui
Journal: J Proteome Res Date: 2018-05-24 Impact factor: 4.466