
Transmembrane helix: simple or complex.

Wing-Cheong Wong, Sebastian Maurer-Stroh, Georg Schneider, Frank Eisenhaber.

Abstract

Transmembrane helical segments (TMs) can be classified into two groups of so-called 'simple' and 'complex' TMs. Whereas the first group represents mere hydrophobic anchors with an overrepresentation of aliphatic hydrophobic residues that is likely attributable to convergent evolution in many cases, the complex ones embody ancestral information and tend to have structural and functional roles beyond just membrane immersion. Hence, the sequence homology concept is not applicable to simple TMs. In practice, these simple TMs can attract statistically significant but evolutionarily unrelated hits during similarity searches (whether through BLAST- or HMM-based approaches). This is especially problematic for membrane proteins that contain both globular segments and TMs. As such, we have developed the transmembrane helix: simple or complex (TMSOC) webserver for the identification of simple and complex TMs. By masking simple TM segments in seed sequences prior to sequence similarity searches, the false-discovery rate decreases without sacrificing sensitivity. Therefore, TMSOC is a novel and necessary sequence analytic tool for both the experimentalists and the computational biology community working on membrane proteins. It is freely accessible at http://tmsoc.bii.a-star.edu.sg or available for download.

Year: 2012 PMID: 22564899 PMCID: PMC3394259 DOI: 10.1093/nar/gks379

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The ‘modus operandi’ of the sequence homology concept is governed by two principles. First is the inference of evolutionary history from sets of homologous protein sequences for building believable phylogenetic trees (1,2) [e.g. 1964, fibrinopeptides (3); 1967, cytochrome c (4)]. Second is the inference of sequence–structure–function relationships from well-studied proteins to uncharacterized sequences [e.g. 1967, lactalbumin (5); 1986, angiogenin (6,7)]. The overall concept can be formally rationalized as follows: similarity in amino acid sequence implies, to a certain degree, similarity in 3D structure and, hence, biological function, where the conservation of the hydrophobic pattern in the amino acid sequence of globular proteins is required to form the tightly packed hydrophobic core of the tertiary structure (8–11). A high level of sequence similarity is thought to have originated from common ancestry under the pressure of selection at each step of mutational divergence with rare, alternative instances of convergent evolution (12,13). When applying the sequence homology concept, there are two important caveats. First, homology (as a hypothesis about common ancestry) can only be inferred via similarity measures. While similarity by chance can be eliminated through strict statistical criteria (e.g. E-value cutoff), ambiguity remains between convergent evolution and common ancestry for the high similarity scores (14–16). In practice, alignment tools [e.g. BLAST (17), HMMER (18–20)] do not differentiate between common ancestry and convergent evolution for high similarity scores. Therefore, one must be mindful in distinguishing long stretches of similarity from local resemblances that are physiologically constrained (e.g. membrane-spanning stretches from non-polar residues; linkers between globular domains from polar ones) (12). 
Second, proof of the sequence homology concept stems from cases of globular sequence segments and it is not directly applicable to non-globular ones. In particular, signal-peptides (SP) and transmembrane helices (TM) belong to a special class of non-globular sequences. Their mimicry of hydrophobic core patterns in similarity searches can attract unrelated spurious hits with impressive similarity scores. Essentially, these hits are unrelated to the seed sequence other than some hydrophobic pattern matches via their SP/TM segments (21). As collateral damage, such unjustified application of the sequence homology concept to infer homology will result in wrongful annotation, especially in automated annotation pipelines. With regard to the SPs, their necessary exclusion from seed sequences prior to similarity searches is uncontested since these segments are cleaved away from the mature proteins. However, the exclusion of all TMs is unsatisfactory due to the diverse architecture of membrane proteins (from the single-spanning TM proteins with some globular segments to the multi-spanning TM ones that are connected via loops with essentially no globular segments). In fact, not all TM helices need to be excluded. This is because a TM helix can either be simple or complex (22). Specifically, simple TMs have low sequence complexity but high hydrophobicity and are enriched in aliphatic hydrophobic residues. They merely serve as membrane anchors and can be a result of convergent evolution. In contrast, the complex TMs have higher sequence complexity, lower hydrophobicity and are enhanced with structural, charged and aromatic residues. They have additional functional roles (e.g. ligand binding, active sites, signal transduction) aside from membrane insertion and are likely derived from common ancestry (22). Most importantly, the simple TMs which can be present in membrane proteins regardless of any topology cause spurious hits in similarity searches. 
This necessitates their identification and exclusion from the seed sequences prior to similarity searches. To offer a simple way of identifying and masking simple TMs within a membrane protein sequence, we provide a user-friendly web interface, transmembrane helix: simple or complex (TMSOC). In a nutshell, TMSOC first predicts any TM segments within the sequence if they are not defined by the users. Next, based on the sequence complexity and hydrophobicity of each TM segment, TMSOC will identify the simple TM segments [in accordance with criteria in (22)] and mask them in the fasta-formatted protein sequence that can serve as an input to the BLAST (17) suite or other sequence similarity search routines.

THE WEBSERVER

Input description

TMSOC requires: (i) a fasta-formatted sequence as a mandatory input and (ii) the associated TM segments as an optional input.

Output description

TMSOC produces four sections in the output. First, TMSOC displays the sequence with complex, twilight and simple TMs colored in red, orange and blue, respectively (see Figure 1A). Next, a summary table that contains: (i) the indices and (ii) sequences of the TM segments, (iii) the positions of the predicted/user-defined TM segments, (iv) the sequence complexity, (v) hydrophobicity, (vi) z-score and (vii) classification [simple/twilight/complex based on (22)] for each TM segment, is given (see Figure 1B). The third section outputs a sequence complexity/hydrophobicity plot of the predicted/user-defined TM segments (in black) against the background of membrane anchors (in blue), functional TMs (in red) and α-helices (in green) from the SCOP (23,24) database (see Figure 1C). Finally, the last section displays the fasta-formatted input sequence with the masked simple TMs (replaced by a continuum of ‘X’). This output sequence serves as an input into any appropriate similarity search routines (see Figure 1D).

Figure 1.

Example output of TMSOC analysis for the bovine rhodopsin (P02699) sequence. Generally, TMSOC produces four sections (see A–D of Figure 1) for each analysis. In Figure 1A, the sequence of the bovine rhodopsin reveals six complex TMs (in red) and one simple TM (in blue). There are no twilight TMs in this case; otherwise, they would be colored in orange. In Figure 1B, a summary table that contains: (i) the indices and (ii) sequences of the TM segments, (iii) the positions of the predicted or user-defined TM segments, (iv) the sequence complexity, (v) hydrophobicity, (vi) z-score and (vii) classification [simple/twilight/complex based on (22)] for each TM segment in the bovine rhodopsin sequence is shown. In addition, enriched functional residues (aromatic/charged/structurally related) in the complex TMs are coded with the ClustalX color scheme. Figure 1C depicts the sequence complexity/hydrophobicity plot of the predicted TM segments of the bovine rhodopsin (in black) against the background of membrane anchors (in blue), functional TMs (in red) and α-helices (in green) from the SCOP (23,24) database. Figure 1D shows the fasta-formatted bovine rhodopsin sequence with its simple TM masked by a continuum of ‘X’ which can serve as an input into any appropriate similarity search routine.

Workflow description

Behind the web-interface, TMSOC comprises two main computational steps (see ‘Materials and Methods’ section for detail). In the first step, if the user does not input any TM segments, the presence and length of any TM helices within the input protein sequence will be derived from a set of five TM predictors [DASTM (25,26), TMHMM (27), HMMTOP (28), SAPS (29), PhobiusTM (30,31)] where the TM prediction results are statistically combined as described in (21). In most situations, a predicted TM segment will correspond to a TM helix. However, a predicted TM segment may contain more than one TM helix in situations where the various TM predictors output varying TM helix borders. Users are strongly recommended to enter the TM segments and to use the TM prediction option only as the next best alternative since the predicted TM boundaries might be inaccurate. In the second step, each user-defined or predicted TM segment will be assigned a z-score that is calculated from the sequence complexity and hydrophobicity of each segment in accordance with Equations (1–3) in (22). A z-score criterion that is associated with some preset false-negative rates (FNRs) will then be applied to determine if each TM segment is simple, twilight or complex (22). Subsequently, only the simple TMs will be masked (replaced by a continuum of ‘X’) in the input protein sequence. If the application of Phobius (30,31) generates a predicted signal peptide that is non-overlapping with a TM region, this signal peptide will be removed from the masked sequence. In the case of an overlap, a warning will be issued. The backend code for TMSOC was developed as PERL modules while the web-interface was written as a CGI script. The WWW server is available via http://tmsoc.bii.a-star.edu.sg or http://mendel.bii.a-star.edu.sg/METHODS/TMSOC/cgi-bin/. Alternatively, the TMSOC program is freely available for download as a command-line version at the WWW server site. 
Note that the command-line TMSOC will contain only the TM classification module.
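
The masking and signal-peptide handling described in the workflow can be illustrated with a minimal Python sketch. It is not the TMSOC code (which is written in PERL); the function name, the `(start, end, label)` tuple layout and the assumption that the signal peptide spans positions 1 to `signal_peptide_end` and is cut from the N-terminus are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# (start, end, label) with 1-based inclusive positions; label is
# "simple", "twilight" or "complex". This tuple layout is hypothetical.
TMSegment = Tuple[int, int, str]


def mask_sequence(sequence: str,
                  tms: List[TMSegment],
                  signal_peptide_end: Optional[int] = None) -> Tuple[str, List[str]]:
    """Mask simple TMs with 'X'; drop a non-overlapping N-terminal signal peptide."""
    residues = list(sequence)
    for start, end, label in tms:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"segment {start}-{end} lies outside the sequence")
        if label == "simple":
            # Only simple TMs are masked; twilight and complex TMs are kept.
            residues[start - 1:end] = "X" * (end - start + 1)
    masked = "".join(residues)

    warnings: List[str] = []
    if signal_peptide_end is not None:
        # Assumption: the signal peptide covers positions 1..signal_peptide_end.
        if any(start <= signal_peptide_end for start, _end, _label in tms):
            warnings.append("signal peptide overlaps a TM segment; not removed")
        else:
            masked = masked[signal_peptide_end:]
    return masked, warnings


# Toy sequence, not a real protein.
toy_tms = [(4, 6, "simple"), (8, 9, "complex")]
print(mask_sequence("ABCDEFGHIJ", toy_tms))
print(mask_sequence("ABCDEFGHIJ", toy_tms, signal_peptide_end=2))
print(mask_sequence("ABCDEFGHIJ", toy_tms, signal_peptide_end=4))
```

The masked string keeps the original coordinates unless a signal peptide is removed, which is why the sketch applies the masking before the N-terminal cut.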

MATERIALS AND METHODS

The algorithms used in TMSOC are described in detail in our previous publications (21,22). For brevity, only a simplified outline is given here.

Statistical quantification of TM segments

The input sequence is first analyzed by five TM predictors [DASTM (25,26), TMHMM (27), HMMTOP (28), SAPS (29), PhobiusTM (30,31)]. For every j-th position in the sequence, the total logarithmic probability for M predictors is given as:

$$\log P_j = \sum_{i=1}^{M} \log X_{i,j}$$

where $X_{i,j}$ is a Bernoulli random variable for the i-th predictor at position j that takes either 1 for positive TM detection or 0 (in the implementation, it is set as 0.01 so that the logarithm can be evaluated) for negative TM prediction. Then, for each TM segment, the average logarithmic probability is given as:

$$\overline{\log P} = \frac{1}{R} \sum_{j=r}^{r+R-1} \log P_j$$

where R is the total number of predicted residues for the TM segment and r is the starting position of the TM segment. The cutoff criterion for a valid TM segment is set on this average at a threshold that corresponds to an approximate false-positive rate of 5% and FNR of 8% (21).
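
As a worked illustration of this statistical combination, the following Python sketch sums the per-predictor log terms at each position and averages them over a segment. It assumes the natural logarithm (the base is not stated here), uses a hypothetical input layout of one 0/1 list per predictor, and deliberately leaves out the cutoff value, which is given in (21).

```python
import math
from typing import List, Sequence

NEGATIVE_VALUE = 0.01  # stand-in for 0 so that the logarithm can be evaluated


def position_log_probs(calls: Sequence[Sequence[int]]) -> List[float]:
    """calls[i][j] is 1 if predictor i marks position j as TM, else 0."""
    n_positions = len(calls[0])
    totals = []
    for j in range(n_positions):
        # Sum the log terms of all M predictors at position j.
        total = sum(math.log(1.0 if predictor[j] else NEGATIVE_VALUE)
                    for predictor in calls)
        totals.append(total)
    return totals


def segment_average(log_probs: Sequence[float], start: int, length: int) -> float:
    """Average log-probability over `length` residues from 1-based `start`."""
    window = log_probs[start - 1:start - 1 + length]
    return sum(window) / length


# Five hypothetical predictors over four positions: full agreement on
# positions 1-2, three of five on position 3, none on position 4.
toy_calls = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
log_probs = position_log_probs(toy_calls)
print(log_probs)
print(segment_average(log_probs, start=1, length=3))
```

Each dissenting predictor contributes log(0.01) to a position, so the segment average drops quickly as agreement among the predictors decreases.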

Quantitative criteria for identifying simple and complex TM segments

The z-score of each TM segment is calculated from its associated sequence complexity and hydrophobicity according to Equations (1–3) in (22). In these equations, x_c and x_Φ are the moving window averages for sequence complexity c (32) and hydrophobicity Φ (33,34) for a given segment, and μ and σ are the mean and SD of sequence complexity and hydrophobicity for the functional TM set [defined as TMs containing active residues. See ‘Methods and Materials’ section in (22) for details]. The exponent s is set to one if the condition specified in (22) holds and zero otherwise. ρ_{c,Φ} is the correlation between sequence complexity and hydrophobicity for the set of functional TMs. For determining simple, twilight or complex TMs, the cutoff criterion is expressed in terms of a threshold f, where f = 0.840, 1.000, 1.282, 1.645, 1.980 [corresponding to FNRs of 20, 16, 10, 5 and 2.5%]. Note that simple TMs are declared at FNR of 5% and below, twilight TMs at FNR of between 5% and 10% and complex TMs at FNR of 10% and above.
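
The listed f values sit close to one-sided quantiles of a standard normal distribution. This is a reading of the numbers rather than something the text states, so the following Python sketch should be taken only as a plausibility check under that assumption; it compares each quoted FNR with the normal tail area beyond f.

```python
from statistics import NormalDist

# Thresholds f and the false-negative rates (in %) quoted in the text.
QUOTED = {0.840: 20.0, 1.000: 16.0, 1.282: 10.0, 1.645: 5.0, 1.980: 2.5}


def upper_tail_percent(f: float) -> float:
    """Percentage of a standard normal distribution lying above f."""
    return 100.0 * (1.0 - NormalDist().cdf(f))


for f, quoted in QUOTED.items():
    print(f"f = {f:.3f}: quoted FNR {quoted:4.1f}%, "
          f"normal upper tail {upper_tail_percent(f):.2f}%")
```

Under this assumption, the thresholds can be read as distances, in SD units, from the functional TM set.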

PROOF OF CONCEPT

The workflow in TMSOC was previously applied to domain and sequence databases for generating the results in Refs (21,22), collectively showing: (i) the importance of identifying ‘simple’ and ‘complex’ TMs, (ii) the necessity of ‘simple’ TM removal prior to similarity searches without sacrificing sensitivity and (iii) the expected number of simple/complex TMs per protein. Specifically, simple and complex TMs were successfully identified by TMSOC in the 7-TM rhodopsin (P02699), 6-TM bacterial rhomboid protease (P09391), E. coli aspartate receptor (P07010) and colicin (PDB:1COL) where only the complex ones were experimentally shown to be functionally important (22). Here, we further illustrate useful insights that TMSOC can provide for the bovine rhodopsin sequence (P02699). To recapitulate, the functional role of TM-5 in the latter has not been established whereas the Gly51 in TM-1 and Gly89 in TM-2 have been linked to the retinal degenerative disease autosomal dominant retinitis pigmentosa (35) while Glu113 in TM-3, Ala169 in TM-4, Trp265 in TM-6 and Lys296 in TM-7 are functionally important (36,37). Indeed, only TM-5 [positions 200–225 (38); z-score of −6.12] in bovine rhodopsin is considered simple by TMSOC and was masked in the sequence. The PSI-BLAST search results of the original and masked bovine rhodopsin (P02699) were then generated (see Supplementary Data S1 and S2) for further investigations.

TMSOC detects heterogeneity in transmembrane helix 5 among the rhodopsins

Orthologous rhodopsin hits (107) were detected in our PSI-BLAST runs (five iterations with standard parameters against Swiss-Prot). They were subsequently analyzed by TMSOC for simple and complex TMs as summarized in Table 1. Based on the results, TM-5 shows the highest percentage of simple TM at ∼10%, followed by TM-1 at <1% while the rest contain no simple TMs. In a nutshell, besides bovine, many species of fishes (e.g. OPSD_DANRE, OPSD_CYPCA, OPSD_CARAU, OPSD_LEOKE) and amphibians (e.g. OPSD_RANCA, OPSD_RANPI, OPSD_BUFMA) also possess simple TM-5. Essentially, these collective findings from TMSOC suggest heterogeneity in TM-5 among rhodopsins and this is in agreement with a previous report that TM-5 exhibits the highest level of sequence divergence among the seven TM helices in GPCR (39). Indeed, a comparison of crystal structures between bovine and squid rhodopsin reveals that TM-5 can be different. Notably, TM-5 and TM-6 of squid rhodopsin (P31356) extend into the cytoplasmic medium. This unusual structure, which is not observed in bovine, is regarded as an important structural motif for coupling with Gq-type G protein. In particular, TM-5 is divided into a membrane embedded region and a medium-exposed region that has motional freedom due to a flexible joint at Ser266 (40). This TM helix [positions 195–239 (38); z-score is 0.41] is considered complex by TMSOC.

Table 1.

Percentage of simple TMs found within each TM helix (1–7) among the 108 (including bovine seed sequence) PSI-BLAST rhodopsin hits

	TM-1	TM-2	TM-3	TM-4	TM-5	TM-6	TM-7
Complex	76	108	108	105	80	99	104
Twilight	31	0	0	3	18	9	0
Simple	1	0	0	0	10	0	0
Percentage of simple TMs	0.93	0.00	0.00	0.00	9.25	0.00	0.00

The first column describes the type of TM helices and the percentage of simple TMs for each of the seven helices across all 108 rhodopsins. Specifically, the second to last columns detail the specific numbers and percentages of each TM helix. Note that TM-7 of some hits was undetected.
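
The percentages in Table 1 can be re-derived from the counts with a short Python sketch. It assumes that the denominator is the full set of 108 sequences for every helix, including TM-7, for which only 104 helices were detected. Rounded to two decimals, TM-5 comes out at 9.26, against the 9.25 listed in the table.

```python
# Counts per helix copied from Table 1 (108 rhodopsin sequences in total).
TOTAL_SEQUENCES = 108
TABLE_1 = {
    "TM-1": {"complex": 76, "twilight": 31, "simple": 1},
    "TM-2": {"complex": 108, "twilight": 0, "simple": 0},
    "TM-3": {"complex": 108, "twilight": 0, "simple": 0},
    "TM-4": {"complex": 105, "twilight": 3, "simple": 0},
    "TM-5": {"complex": 80, "twilight": 18, "simple": 10},
    "TM-6": {"complex": 99, "twilight": 9, "simple": 0},
    "TM-7": {"complex": 104, "twilight": 0, "simple": 0},
}


def percent_simple(helix: str) -> float:
    """Simple TMs as a percentage of all 108 sequences (assumed denominator)."""
    return 100.0 * TABLE_1[helix]["simple"] / TOTAL_SEQUENCES


def detected(helix: str) -> int:
    """Number of sequences in which this helix was classified at all."""
    return sum(TABLE_1[helix].values())


for helix in TABLE_1:
    print(f"{helix}: {detected(helix)} detected, {percent_simple(helix):.2f}% simple")
```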


Exclusion of simple TM in bovine rhodopsin clarifies the sequence similarity distance between rhodopsin and cholecystokinin-1 receptors

A comparison between the PSI-BLAST hit lists of the original and of the masked bovine rhodopsin sequence revealed that a cluster of cholecystokinin-1 receptors (CCKAR_CAVPO, CCKAR_RAT, CCKAR_MOUSE, CCKAR_HUMAN, CCKAR_CANFA, CCKAR_RABIT), which was present in the original list of top 500 hits, was excluded from that of the masked one. This cluster of CCKAR was analyzed by TMSOC and all the respective sequences have in common simple TM-1, TM-5, TM-6 but complex TM-2, TM-3, TM-4 and TM-7. These findings suggest that the sequence similarity between rhodopsins and cholecystokinin-1 receptors is overestimated if their simple TM-5s are included in the alignment. As it turns out, classical homology modeling of CCKAR using rhodopsin as the reference structure leads to a model that cannot correctly accommodate the cholecystokinin (CCK) ligand in its binding site due to obvious structural divergence. Notably, the binding site of CCK requires a number of residues located in the extracellular surface as well as the upper third of the TM helices whereas most binding residues in rhodopsin are buried in the TM helices (41,42).
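
The hit-list comparison amounts to an ordered set difference, which can be sketched in Python as follows. The two short lists are toy data that reuse identifiers from the text; they are not the actual PSI-BLAST output in Supplementary Data S1 and S2, and the function name is hypothetical.

```python
from typing import Iterable, List


def lost_hits(original_hits: Iterable[str], masked_hits: Iterable[str]) -> List[str]:
    """Hits found with the original seed but absent after masking, in original order."""
    masked = set(masked_hits)
    return [hit for hit in original_hits if hit not in masked]


# Toy hit lists for illustration only.
original = ["OPSD_BOVIN", "OPSD_DANRE", "CCKAR_RAT", "CCKAR_HUMAN"]
masked = ["OPSD_BOVIN", "OPSD_DANRE"]

print(lost_hits(original, masked))
```

Hits that disappear after masking are candidates for similarity that was carried mainly by the simple TM segment, and they merit a closer look before any homology is inferred.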

CONCLUSION

TMSOC enables researchers to identify simple and complex TMs in membrane proteins for differentially treating them in sequence similarity searches and for planning further functional characterization of membrane proteins.

SUPPLEMENTARY DATA

Supplementary Data is available at NAR Online: Supplementary Data 1 and 2.

FUNDING

Agency of Science, Technology and Research (A*STAR). Funding for open access charge: Biomedical Research Council (A*STAR). Conflict of interest statement. None declared.

40 in total

1. Evolution of function in protein superfamilies, from a structural perspective.

Authors: A E Todd; C A Orengo; J M Thornton
Journal: J Mol Biol Date: 2001-04-06 Impact factor: 5.469

2. Movement of retinal along the visual transduction path.

Authors: B Borhan; M L Souto; H Imai; Y Shichida; K Nakanishi
Journal: Science Date: 2000-06-23 Impact factor: 47.728

3. On filtering false positive transmembrane protein predictions.

Authors: Miklos Cserzö; Frank Eisenhaber; Birgit Eisenhaber; Istvan Simon
Journal: Protein Eng Date: 2002-09

4. TM or not TM: transmembrane protein prediction with low false positive rate using DAS-TMfilter.

Authors: Miklos Cserzo; Frank Eisenhaber; Birgit Eisenhaber; Istvan Simon
Journal: Bioinformatics Date: 2004-01-01 Impact factor: 6.937

5. A combined transmembrane topology and signal peptide prediction method.

Authors: Lukas Käll; Anders Krogh; Erik L L Sonnhammer
Journal: J Mol Biol Date: 2004-05-14 Impact factor: 5.469

Review 6. What is a hidden Markov model?

Authors: Sean R Eddy
Journal: Nat Biotechnol Date: 2004-10 Impact factor: 54.908

7. Applying motif and profile searches.

Authors: P Bork; T J Gibson
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

8. Structural and functional role of helices I and II in rhodopsin. A novel interplay evidenced by mutations at Gly-51 and Gly-89 in the transmembrane domain.

Authors: Laia Bosch; Eva Ramon; Luis J Del Valle; Pere Garriga
Journal: J Biol Chem Date: 2003-03-26 Impact factor: 5.157

9. Structure of bovine rhodopsin in a trigonal crystal form.

Authors: Jade Li; Patricia C Edwards; Manfred Burghammer; Claudio Villa; Gebhard F X Schertler
Journal: J Mol Biol Date: 2004-11-05 Impact factor: 5.469

Review 10. Rhodopsin crystal: new template yielding realistic models of G-protein-coupled receptors?

Authors: Elodie Archer; Bernard Maigret; Chantal Escrieut; Lucien Pradayrol; Daniel Fourmy
Journal: Trends Pharmacol Sci Date: 2003-01 Impact factor: 14.819
