
A method for identification of selenoprotein genes in archaeal genomes.

Abstract

The genetic codon UGA has a dual function: it serves as a terminator and also encodes selenocysteine. However, most popular gene annotation programs treat it only as a stop signal, so selenoprotein genes are misannotated or missed entirely. We developed a computational method, named Asec-Prediction, that is specific for the prediction of archaeal selenoprotein genes. To evaluate its effectiveness, we first applied it to 14 archaeal genomes that had previously been examined for selenoprotein genes, and Asec-Prediction identified all reported selenoprotein genes without redundant results. When we applied it to 12 archaeal genomes that had not been examined for selenoprotein genes, Asec-Prediction detected a novel selenoprotein gene in Methanosarcina acetivorans. Further evidence supports the conclusion that the predicted gene is a real selenoprotein gene. The result shows that Asec-Prediction is effective for the prediction of archaeal selenoprotein genes.

Substances：
Selenoproteins

Year: 2009 PMID: 19591793 PMCID: PMC5054222 DOI: 10.1016/S1672-0229(08)60034-0

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN： 1672-0229 Impact factor: 7.691

Introduction

The genetic code governs translation from mRNA to protein. Sixty-one codons encode the twenty standard amino acids, and the remaining three codons (UGA, UAG and UAA) act as terminators. However, a growing body of work challenges this universal rule. Several reports have shown that the UGA codon has a dual function: it serves as a terminator and also encodes selenocysteine (Sec), the twenty-first amino acid (1-5). A similar phenomenon has been found for the UAG codon, which is translated as pyrrolysine (Pyl) in several methanogenic archaea (6, 7). However, most available gene prediction programs annotate UGA only as a terminator, which can lead to misannotated or entirely missed selenoprotein genes in databases (8-11). Several programs already account for the dual meaning of UGA, for example the prediction pipelines for mammalian and insect genomes in Castellano's work, but their application has not been extended to archaeal genomes. Translation of UGA as Sec depends on two components. One is the Sec insertion sequence (SECIS) element, which is defined by characteristic nucleotide sequences and secondary-structure base-pairing patterns; the other is the trans-acting factor SelB, a GTP-dependent elongation factor specific for Sec incorporation. The mechanism by which SelB participates in Sec synthesis has been elucidated extensively in the literature (14, 15). SECIS elements occupy different positions relative to the coding sequence in the three domains of life. In bacteria, SECIS elements lie immediately downstream of the UGA codon and are part of the coding region (16, 17), whereas in archaea and eukaryotes they are usually located in the 3′-untranslated regions (3′-UTRs) (18-24). Gladyshev and co-workers have comprehensively studied their different features (17, 22-24).
In archaea, they investigated 14 genomes, building on earlier studies by Böck's group (20, 25, 26), and proposed a consensus model of the SECIS element. Using this model, they successfully predicted 15 archaeal selenoprotein genes. In addition to those 14 genomes, another 12 archaeal genomes have since been completely sequenced and deposited in the NCBI database, but possible misannotation of their UGA codons has not been studied. In this study, we developed a computational method, named Asec-Prediction, to identify potentially misannotated selenoprotein genes in these genomes. Because Asec-Prediction adjusts previous prediction strategies, it achieves fast and correct prediction. Moreover, Asec-Prediction identified a novel selenoprotein gene in Methanosarcina acetivorans.

Method

Framework of Asec-Prediction

Asec-Prediction contains four modules (Figure 1A). The first module has five functional parts. Its main purpose is to locate the non-UGA-ending open reading frames (ORFs) and the candidate SECIS elements and selenoprotein genes. A non-UGA-ending ORF is obtained by extending an ORF that terminates at a UGA codon until a UAG or UAA terminator is reached (Figure 1B). Both the distance between any pair of UGA codons and the distance between the start codon and the UAA/UAG terminator must be multiples of 3 nt. The second module predicts the RNA secondary structure and calculates the free energy of each putative SECIS element using Vienna RNAfold 1.4. The third module uses Glimmer 2.13 to predict coding genes in the nucleotide sequences whose UGA codons have been replaced. The fourth module, which integrates BLAST 2.2.14, searches the NCBI non-redundant (NR) protein database for Cys-containing homologs of the predicted selenoprotein genes. Asec-Prediction is written in ANSI C and runs on Linux. The source code is freely available from the corresponding author upon request.
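
The definition of a non-UGA-ending ORF can be illustrated with a short Python sketch. This is not the authors' ANSI C code. It assumes a DNA sequence (so UGA appears as TGA), scans the forward strand only, and accepts ATG, GTG and TTG as start codons; the function name, the start-codon set and the min_codons parameter are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; not the authors' ANSI C implementation.
STARTS = {"ATG", "GTG", "TTG"}   # assumption: start codons accepted
HARD_STOPS = {"TAA", "TAG"}      # terminators that end a non-UGA-ending ORF


def non_uga_ending_orfs(seq, min_codons=1):
    """Return (start, end, tga_positions) for forward-strand ORFs that read
    through at least one in-frame TGA and end at TAA/TAG.

    Coordinates are 0-based; end is exclusive and includes the terminator.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start, tga = None, []
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon in STARTS:
                    start, tga = i, []
            elif codon == "TGA":
                tga.append(i)          # read through instead of stopping
            elif codon in HARD_STOPS:
                n_codons = (i + 3 - start) // 3
                if tga and n_codons >= min_codons:
                    found.append((start, i + 3, tga))
                start = None
    return found
```

Because every codon is visited in steps of 3 nt from the start codon, the in-frame requirement on the UGA codons and on the final terminator holds by construction. For the toy sequence "ATGGCCGCCTGAGCCGCCTAA", the function reports a single ORF covering positions 0 to 21 with one TGA at offset 9.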

Figure 1

The flowchart of Asec-Prediction (A) and the defined non-UGA-ending ORF (B). The modules are linked by arrows according to their executing orders. The function descriptions of each part are indicated via several keywords in the rectangle boxes.

Flow of selenoprotein gene prediction

We describe the Asec-Prediction procedure in detail, following its flowchart (Figure 1A); the genes in the final output are the predicted selenoprotein genes. For a given archaeal genome, Asec-Prediction first finds all non-UGA-ending ORFs. It then searches for candidate SECIS elements in the 500 nt upstream and the 500 nt downstream of each non-UGA-ending ORF. If a non-UGA-ending ORF contains more than one UGA codon, the downstream search region starts from the second UGA codon. RNAfold 1.4 predicts the secondary structures and calculates the free energies of the putative SECIS elements, and the free energy cutoff is -16 kcal/mol. For each SECIS element that satisfies the cutoff, Asec-Prediction determines its position relative to the non-UGA-ending ORF. If the element lies in the 3′-UTR, Asec-Prediction collects all UGA codons in the non-UGA-ending ORF. Otherwise, it collects all UGA codons between the start codon and the SECIS element, except the last UGA codon, which acts as the terminator. Asec-Prediction then replaces these UGA codons in the nucleotide sequence with sense codons. To avoid creating additional start or stop codons, it chooses either GGG or CGG as the replacement, depending on the situation. Glimmer 2.13 predicts genes in the modified sequences, and Asec-Prediction extracts the predicted archaeal selenoprotein genes. Finally, it uses BLAST 2.2.14 to find homologous sequences in the NR protein database.
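
Two steps of this procedure, the SECIS search windows and the UGA replacement, can be sketched in Python under stated assumptions. The text does not say when GGG is chosen over CGG, so the rule below (try GGG, fall back to CGG if GGG would create a start or stop codon in an overlapping forward-strand frame) is only one plausible reading. The function names, the 0-based coordinates, the clamping of windows at the ends of a linear sequence, and the choice to end the downstream window 500 nt past the ORF are all hypothetical.

```python
# Illustrative sketch; the selection rule between GGG and CGG is an assumption.
SIGNALS = {"ATG", "GTG", "TTG", "TAA", "TAG", "TGA"}  # start and stop codons


def secis_windows(orf_start, orf_end, tga_positions, genome_len, flank=500):
    """Return (upstream, downstream) search windows as (begin, end) pairs.

    Assumption: with more than one TGA, the downstream window begins at the
    second TGA; otherwise it begins at the ORF end.
    """
    upstream = (max(0, orf_start - flank), orf_start)
    down_from = tga_positions[1] if len(tga_positions) > 1 else orf_end
    downstream = (down_from, min(genome_len, orf_end + flank))
    return upstream, downstream


def _signal_positions(seq, i):
    """Positions of start/stop codons in frames overlapping codon i."""
    return {j for j in (i - 2, i - 1, i + 1, i + 2)
            if j >= 0 and j + 3 <= len(seq) and seq[j:j + 3] in SIGNALS}


def replace_tga(seq, i):
    """Replace the TGA at index i with GGG or CGG without adding signals."""
    if seq[i:i + 3] != "TGA":
        raise ValueError("no TGA codon at this position")
    before = _signal_positions(seq, i)
    for sub in ("GGG", "CGG"):
        candidate = seq[:i] + sub + seq[i + 3:]
        if _signal_positions(candidate, i) <= before:
            return candidate
    raise ValueError("both replacements would create a start or stop codon")
```

In the toy sequence "GCCATTGACCC" the TGA sits at index 5. Substituting GGG would create an out-of-frame ATG at index 3, so the sketch falls back to CGG and returns "GCCATCGGCCC".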

Flexibility of Asec-Prediction

Asec-Prediction takes account of the possible cases of archaeal selenoprotein genes. For example, some archaeal selenoprotein genes have more than one Sec residue, and Asec-Prediction introduces the non-UGA-ending ORF in order to find them; some archaeal selenoprotein genes are associated with SECIS elements located in the 5′-UTR, so Asec-Prediction searches for SECIS elements in both the 3′-UTRs and the 5′-UTRs (20, 23). In addition, Asec-Prediction adjusts the previous prediction strategy and thereby achieves higher efficiency. It searches first for the non-UGA-ending ORFs and then for candidate SECIS elements in the UTRs, which greatly reduces the search space compared with the approach of Kryukov and Gladyshev, who performed these steps in the reverse order. Moreover, Asec-Prediction searches for candidate SECIS elements only in the UTRs of the non-UGA-ending ORFs rather than across the whole genome sequence, which reduces the search space further. The consensus model of the SECIS element is not an exclusive sequence motif: numerous sequences can satisfy it. Shrinking the search regions therefore greatly decreases the number of candidates. The most time-consuming part of Asec-Prediction is judging whether each candidate satisfies the SECIS model, so these two reductions of the search space accelerate prediction. For example, for Methanococcus jannaschii, with a genome of ~1.6 million nt, prediction takes only about 10 s on a common personal computer. The programs integrated in Asec-Prediction can be freely replaced with other programs that give it higher efficiency. At present, Asec-Prediction integrates RNAfold 1.4 and Glimmer 2.13.
Its prediction precision is sometimes limited by deficiencies of these integrated programs. For example, RNAfold 1.4 cannot produce the correct stem-loop structure for the SECIS element “CGCCCGGGGGGAACCCCGCAAGGAGGGGACCCCCGGGTC”, whereas RNAstructure 4.2 predicts it correctly. This may result from slight differences in the implementation of the energy models in RNAfold and RNAstructure. Similarly, we found that Glimmer 2.13 sometimes fails to predict existing genes, such as the HesB-like genes in M. jannaschii. To avoid these potential deficiencies, RNAfold 1.4 and Glimmer 2.13 can be replaced with later versions such as RNAfold 1.5 and Glimmer 3.01. Other tools can also be integrated into Asec-Prediction if they improve its predictions. For example, Lambert et al. reported that ERPIN is effective at detecting SECIS elements. Asec-Prediction can thus be updated over time to achieve higher prediction accuracy.
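
The idea of a replaceable folding program can be sketched by treating the folding step as a plain callable that returns a structure and a free energy, so that the cutoff test does not depend on which tool produced the numbers. The Python sketch below is an illustration, not the Asec-Prediction interface. The names FoldFn, parse_fold_line and passes_cutoff are hypothetical, and the parsed line layout (a dot-bracket string followed by an energy in parentheses) is assumed to resemble what a folding program such as RNAfold prints. Treating an energy exactly equal to the cutoff as passing is also an assumption.

```python
import re
from typing import Callable, Tuple

# A folding backend maps a sequence to (dot-bracket structure, energy).
FoldFn = Callable[[str], Tuple[str, float]]

# Assumed layout: "((((....)))) (-16.50)"; spaces inside the parentheses
# before the number are tolerated.
_FOLD_LINE = re.compile(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)$")


def parse_fold_line(line: str) -> Tuple[str, float]:
    """Split one structure line into (structure, free energy in kcal/mol)."""
    match = _FOLD_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised structure line: %r" % line)
    return match.group(1), float(match.group(2))


def passes_cutoff(seq: str, fold: FoldFn, cutoff: float = -16.0) -> bool:
    """True if the backend's free energy is at or below the cutoff."""
    _, energy = fold(seq)
    return energy <= cutoff
```

Swapping one folding tool for another then amounts to passing a different callable to passes_cutoff, while the -16 kcal/mol threshold stays in one place.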

Evaluation

Archaeal genome dataset

We downloaded 26 completely sequenced archaeal genomes from the NCBI database (updated on February 27, 2006) as our dataset (Table 1, Table 2). Of these, 14 genomes had previously been examined for selenoprotein genes by Kryukov and Gladyshev, who reported 8 selenoprotein genes in M. jannaschii and 7 in Methanopyrus kandleri and found none in the remaining genomes. To our knowledge, the other 12 genomes have not been examined for selenoprotein genes.

Table 1

Selenoprotein genes predicted by Asec-Prediction in fourteen researched archaeal genomes

Species	Accession No.	Selenoprotein genes (previously reported)	Selenoprotein genes (Asec-Prediction)
Aeropyrum pernix	NC_000854	0	0
Archaeoglobus fulgidus	NC_000917	0	0
Halobacterium sp.	NC_002608	0	0
Methanobacterium therm.	NC_000916	0	0
Methanococcus jannaschii	NC_000909	8	8
Methanopyrus kandleri	NC_003551	7	7
Pyrobaculum aerophilum	NC_003364	0	0
Pyrococcus abyssi	NC_000868	0	0
Pyrococcus furiosus	NC_003413	0	0
Pyrococcus horikoshii	NC_000961	0	0
Sulfolobus solfataricus	NC_002754	0	0
Sulfolobus tokodaii	NC_003106	0	0
Thermoplasma acidophilum	NC_002578	0	0
Thermoplasma volcanium	NC_002689	0	0

Table 2

Selenoprotein genes predicted by Asec-Prediction in twelve un-researched archaeal genomes

Species	Accession No.	Selenoprotein genes by Asec-Prediction
Haloarcula marismortui	NC_006396	0
Methanococcus maripaludis	NC_005791	0
Methanosarcina acetivorans	NC_003552	1
Methanosarcina barkeri str.	NC_007355	0
Methanosarcina mazei	NC_003901	0
Methanosphaera stadtmanae	NC_007681	0
Methanospirillum hungatei	CP000254	0
Nanoarchaeum equitans	NC_005213	0
Natronomonas pharaonis	NC_007426	0
Picrophilus torridus	NC_005877	0
Sulfolobus acidocaldarius	NC_007181	0
Thermococcus kodakaraensis	NC_006624	0

Prediction of previously known archaeal selenoprotein genes

We first applied Asec-Prediction to the 14 archaeal genomes that had previously been examined for selenoprotein genes (Table 1). Asec-Prediction predicted all 15 reported selenoprotein genes in the two genomes that contain them (M. jannaschii and M. kandleri), without redundant results. In the other 12 genomes, Asec-Prediction found no selenoprotein genes, in agreement with the previous report. These results show that Asec-Prediction is effective for the prediction of archaeal selenoprotein genes.

Identification of a novel archaeal selenoprotein gene

Recognition of SECIS element

We then applied Asec-Prediction to the 12 archaeal genomes that have been completely sequenced but not yet examined for selenoprotein genes (Table 2). Among them, Asec-Prediction predicted a novel archaeal selenoprotein gene in M. acetivorans. Its ORF starts at 4,550,549 nt and ends at 4,550,942 nt in the nucleotide sequence. The ORF contains one TGA codon (4,550,732–4,550,734 nt) that is translated as Sec. The corresponding SECIS element starts at 4,550,481 nt and ends at 4,550,513 nt, 36 nt upstream of the coding region. This differs from the typical archaeal selenoprotein gene, in which the SECIS element is downstream of the coding region. The arrangement resembles that of the fdhA gene and seems to be the reason the gene was previously missed (20, 23). The secondary structure of this SECIS element is a stem-loop (Figure 2). In detail, the stem has eight base pairs, six GC pairs and two AU pairs, compared with ten base pairs in the consensus model. The bulge contains the GAA_A pattern and, in addition, two pairs of unpaired bases: an AA pair and a GA pair. The features of the upper helix and the apical loop are unchanged: the upper helix has three GC pairs and the apical loop has three nucleotides. The minimum free energy of this SECIS element is -16.5 kcal/mol, below the cutoff. Taken together, despite some differences from the consensus model, this SECIS element still forms a stem-loop structure and can reasonably be considered a valid SECIS element.
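
The reported coordinates can be checked with simple arithmetic. The Python sketch below uses only the numbers given in the text; the variable and function names are hypothetical, and it assumes 1-based coordinates on a single strand.

```python
# Coordinates as reported in the text (assumed 1-based, same strand).
ORF_START = 4_550_549
TGA_START, TGA_END = 4_550_732, 4_550_734
SECIS_START, SECIS_END = 4_550_481, 4_550_513


def codon_number(pos, orf_start):
    """1-based codon index of a codon starting at pos; error if out of frame."""
    offset = pos - orf_start
    if offset % 3 != 0:
        raise ValueError("position is not in frame with the ORF start")
    return offset // 3 + 1


def check_coordinates():
    return {
        "tga_codon": codon_number(TGA_START, ORF_START),
        "tga_length": TGA_END - TGA_START + 1,
        "secis_length": SECIS_END - SECIS_START + 1,
        "secis_to_orf": ORF_START - SECIS_END,
    }
```

Under these assumptions the TGA begins 183 nt (61 codons) after the ORF start, so it is in frame and corresponds to codon 62. The SECIS element spans 33 nt, and the difference between its end coordinate and the ORF start is 36, which matches the stated 36 nt.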

Figure 2

The secondary structure of the SECIS element presented upstream of the novel selenoprotein gene in M. acetivorans. Numbers indicate the locations of some nucleotides in the SECIS element

During the Sec/Cys-containing homology search of the NR protein database, Asec-Prediction found eight Cys-containing homologs of the predicted selenoprotein (Figure 3). Multiple sequence alignment showed that their Cys residues align with the Sec residue. On both sides of the aligned Sec/Cys position, the alignment shows symmetrical patterns of conservation, such as the preceding “PD” pattern and the following “C” pattern (Figure 3). A genuine selenoprotein is generally expected to show symmetrical conservation around the Sec residue, as a result of similar strength of purifying selection on both sides of the recoded codon, together with Sec- and Cys-containing orthologs across the phylogeny. The symmetrical conservation therefore supports translation of this UGA codon as Sec. Furthermore, around the Sec residue the predicted selenoprotein contains a UXXC motif (a Sec residue separated from a Cys residue by two other residues), and the homologous sequences contain CXXC motifs. As reported in the literature (32-34), the [U/C]XXC motif is typically associated with redox function and occurs in a variety of thiol-dependent redox enzymes. We also noticed that the eight homologous proteins come from two kinds of organisms. Five belong to Escherichia coli (GenBank Accession No. AAA58080, AAG58405, AAN82482, AAT48175, and ZP_00720452) and the other three belong to Shigella (GenBank Accession No. AAN44778, ABB63439, and ABB67765). Most of them have the same functional annotation in the NCBI database, namely topoisomerase. It therefore seems that the novel selenoprotein has redox and enzymatic functions, like known selenoproteins.
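
Scanning a protein sequence for the [U/C]XXC motif is straightforward to sketch in Python. The function name and the toy sequences in the usage note are hypothetical and are not taken from the alignment in Figure 3; U denotes Sec in the one-letter code.

```python
import re

# A lookahead is used so that overlapping motifs (e.g. CxxCxxC) are all found.
_UC_XXC = re.compile(r"(?=[UC]..C)")


def find_uc_xxc(protein):
    """Return (0-based position, 4-residue motif) for every [U/C]XXC match."""
    seq = protein.upper()
    return [(m.start(), seq[m.start():m.start() + 4])
            for m in _UC_XXC.finditer(seq)]
```

For the toy sequence "MKPDUAGCLV" the function returns [(4, "UAGC")], a UXXC motif. For "ACGHCKLC" it returns both overlapping CXXC motifs, at positions 1 and 4.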

Figure 3

Multiple sequence alignment of the predicted novel selenoprotein. The novel gene starts at 4,550,549 nt and ends at 4,550,942 nt. The arrow at 4,550,732 nt indicates the position of the predicted Sec (U) and the corresponding Cys in the homologs. Rectangular boxes in the alignment mark identical residues. The alignment was generated using ClustalW and edited manually.

Upstream of the SECIS element there is another non-UGA-ending ORF, which starts at 4,549,509 nt and ends at 4,549,940 nt and contains three TGA codons. However, we could not find aligned Sec/Cys pairs in its homologous sequences, because these homologs all terminate at the Sec positions (data not shown). In a sequence alignment, the sharpest decrease in conservation usually appears after the real stop codon, owing to the reduced evolutionary constraint on the UTR. We therefore conclude that these three TGA codons are not recoded as Sec. A similar exception is known: glycine reductase selenoprotein A in bacteria has no homologs with a Sec/Cys pair in the NCBI database (17, 24). In the present study, however, we have no other definite evidence to declare this ORF a selenoprotein gene.

Recognition of SelB

The decoding of UGA as Sec depends on interaction with the translation elongation factor SelB, as characterized especially in E. coli (35, 36). In archaea and animals, the mechanisms of Sec incorporation are more complex and less well understood. Böck and colleagues previously showed that ORF MJ0495 in M. jannaschii shares some functions with bacterial SelB and acts as the Sec-specific translation factor SelB (37, 38). In M. acetivorans, ORF MA4654 is annotated as a SelB gene in the NCBI database, although some researchers doubt this annotation (Methanosarcina Sequencing Project; http://www.broad.mit.edu/). We aligned MA4654 with MJ0495 and with the elongation factors SelB and EF-Tu of E. coli. The alignment shows that MA4654 shares significant homology with them (Figure 4). Unlike MJ0495, however, MA4654 does not contain a C-terminal extension. Böck and colleagues demonstrated that the C-terminal extension of MJ0495 does not contribute to binding of the SECIS element as that of SelB does in E. coli. They speculated that another protein might exist in archaea and contribute to SECIS element binding, but they did not find it. Since the C-terminal extension does not appear to be needed for SECIS binding in archaea, we think that MA4654 may play a role like that of MJ0495 and act as the Sec-specific translation factor SelB in M. acetivorans.

Figure 4

Multiple sequence alignment of ORF MA4654 in M. acetivorans with ORF MJ0495 in M. jannaschii, SelB sequence (SelB-E.c.) and EF-Tu sequence (EF-Tu-E.c.) in E. coli. Residues are shaded in black for more conservation and in grey for less conservation. The alignment was generated with ClustalW and edited with GeneDoc.

In eukaryotes, SelB does not have a SECIS-binding function. Instead, another protein, SECIS-binding protein 2 (SBP2), was found to contain an RNA-binding motif (40, 41). We tried to find a functionally similar protein in M. acetivorans by homology search. We downloaded all currently published SBP2 sequences and aligned them with the currently annotated proteins of M. acetivorans, but found no clear homologs. One possible reason is that the SBP2 dataset is not large enough; another is that the SECIS-binding protein in archaea differs from that in eukaryotes.

Conclusion

Asec-Prediction adopts a prediction strategy that avoids some redundant searching and hence achieves fast and correct prediction. It correctly predicted the 15 previously reported archaeal selenoprotein genes in the 14 previously examined archaeal genomes without any redundant results. In the 12 unexamined archaeal genomes, Asec-Prediction identified a novel selenoprotein gene in M. acetivorans, supported by further evidence: (1) the SECIS element forms a stem-loop structure and is located in the 5′-UTR; (2) the Sec/Cys pair is aligned in the multiple sequence alignment; (3) the [U/C]XXC motif suggests redox and enzymatic functions like those of known selenoproteins; (4) ORF MA4654 in M. acetivorans may act as the Sec-specific translation factor SelB, like ORF MJ0495 in M. jannaschii. Together, these observations suggest that the predicted gene is a real selenoprotein gene. The result shows that Asec-Prediction is effective for the prediction of archaeal selenoprotein genes.

Authors’ contributions

ML collected the datasets, conducted data analyses, and prepared the manuscript. YH and YX supervised the project and co-wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

39 in total

Review 1. Biosynthesis of selenoproteins--an overview.

Authors: A Böck
Journal: Biofactors Date: 2000 Impact factor: 6.113

2. A novel RNA binding protein, SBP2, is required for the translation of mammalian selenoprotein mRNAs.

Authors: P R Copeland; J E Fletcher; B A Carlson; D L Hatfield; D M Driscoll
Journal: EMBO J Date: 2000-01-17 Impact factor: 11.598

3. Structural analysis of new local features in SECIS RNA hairpins.

Authors: D Fagegaltier; A Lescure; R Walczak; P Carbon; A Krol
Journal: Nucleic Acids Res Date: 2000-07-15 Impact factor: 16.971

4. In silico identification of novel selenoproteins in the Drosophila melanogaster genome.

Authors: S Castellano; N Morozova; M Morey; M J Berry; F Serras; M Corominas; R Guigó
Journal: EMBO Rep Date: 2001-08 Impact factor: 8.807

5. Prediction of pKa and redox properties in the thioredoxin superfamily.

Authors: Efrosini Moutevelis; Jim Warwicker
Journal: Protein Sci Date: 2004-08-31 Impact factor: 6.725

6. The prokaryotic selenoproteome.

Authors: Gregory V Kryukov; Vadim N Gladyshev
Journal: EMBO Rep Date: 2004-04-23 Impact factor: 8.807

7. Selenoprotein synthesis in archaea: identification of an mRNA element of Methanococcus jannaschii probably directing selenocysteine insertion.

Authors: R Wilting; S Schorling; B C Persson; A Böck
Journal: J Mol Biol Date: 1997-03-07 Impact factor: 5.469

8. Domain structure of the prokaryotic selenocysteine-specific elongation factor SelB.

Authors: M Kromayer; R Wilting; P Tormay; A Böck
Journal: J Mol Biol Date: 1996-10-04 Impact factor: 5.469

9. Pyrrolysine encoded by UAG in Archaea: charging of a UAG-decoding specialized tRNA.

Authors: Gayathri Srinivasan; Carey M James; Joseph A Krzycki
Journal: Science Date: 2002-05-24 Impact factor: 47.728

10. Recognition of UGA as a selenocysteine codon in type I deiodinase requires sequences in the 3' untranslated region.

Authors: M J Berry; L Banu; Y Y Chen; S J Mandel; J D Kieffer; J W Harney; P R Larsen
Journal: Nature Date: 1991-09-19 Impact factor: 49.962
