
snoSeeker: an advanced computational package for screening of guide and orphan snoRNA genes in the human genome.

Jian-Hua Yang, Xiao-Chen Zhang, Zhan-Peng Huang, Hui Zhou, Mian-Bo Huang, Shu Zhang, Yue-Qin Chen, Liang-Hu Qu.

Abstract

Small nucleolar RNAs (snoRNAs) represent an abundant group of non-coding RNAs in eukaryotes. They can be divided into guide and orphan snoRNAs according to the presence or absence of antisense sequence to rRNAs or snRNAs. Current snoRNA-searching programs, which are essentially based on sequence complementarity to rRNAs or snRNAs, exist only for the screening of guide snoRNAs. In this study, we have developed an advanced computational package, snoSeeker, which includes the CDseeker and ACAseeker programs, for the highly efficient and specific screening of both guide and orphan snoRNA genes in mammalian genomes. By using these programs, we have systematically scanned four human–mammal whole-genome alignment (WGA) sequences and identified 54 novel candidates, including 26 orphan candidates, as well as 266 known snoRNA genes. Eighteen novel snoRNAs were further experimentally confirmed, with four snoRNAs exhibiting a tissue-specific or restricted expression pattern. The results of this study provide the most comprehensive listing of two families of snoRNA genes in the human genome to date.

Year:  2006        PMID: 16990247      PMCID: PMC1636440          DOI: 10.1093/nar/gkl672

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The small nucleolar RNAs (snoRNAs) represent an abundant group of small non-coding RNAs (ncRNAs) in eukaryotes (1). With the exception of RNase MRP, all the snoRNAs fall into two major families, box C/D and box H/ACA snoRNAs, on the basis of common sequence motifs and structural features. A large number of snoRNAs characterized to date are box C/D snoRNAs that share two conserved motifs, the 5′ end box C and the 3′ end box D, and the box H/ACA snoRNAs that exhibit a common hairpin–hinge–hairpin–tail secondary structure with the box H and ACA (2). Although several snoRNAs, such as U3, snR30 and RNase MRP, are required for specific cleavage of pre-rRNAs, the majority of box C/D snoRNAs function as guides for site-specific 2′-O-ribose methylation, and most box H/ACA snoRNAs function as guides for pseudouridylation in the post-transcriptional processing of rRNAs (2). Studies have shown that some snoRNAs and Small Cajal body-specific RNAs (scaRNAs) participate in the modification of snRNAs (3,4). Some modifications in Archaea tRNAs are also introduced by box C/D small RNAs, which are the homologs of snoRNAs in eukaryotes (5). Notably, an increasing number of orphan snoRNAs, which lack antisense to rRNAs or snRNAs, has been experimentally identified along with the numerous guide snoRNAs from different eukaryotes (6,7). The finding of an orphan snoRNA, HBII-52, being associated with human disease, has triggered a great interest in orphan snoRNAs (8,9). Many studies have proven that computational analysis of genomic databases is a useful way to identify snoRNAs from eukaryotes on a large-scale (7,10,11). To date, several searching programs based on pattern recognition scan algorithms have been developed, such as Snoscan (10) for box C/D snoRNA, and SnoGPS (11) and MFE (12) methods for box H/ACA snoRNA. 
In comparison to experimental approaches that tend to favor the discovery of the most abundant RNAs (13,14), computational analysis provides an unbiased genome-wide search for snoRNA genes. However, the current snoRNA-searching programs are essentially based on sequence complementarity to rRNAs or snRNAs and are therefore limited to the detection of guide snoRNAs and not orphan snoRNAs. Another limitation of the existing programs is that it is difficult to systematically search the human genome for snoRNAs because of the vast data source. Although two recent studies have performed a computational detection of human snoRNAs, they focused only on particular segments of the human genome (15,16). It is therefore important to develop an advanced search program for genome-wide screening of all snoRNAs, including the orphan snoRNAs. In this study, we developed a computational package which includes two novel snoRNA-searching programs, CDseeker and ACAseeker, specific to the detection of C/D snoRNAs and H/ACA snoRNAs, respectively. Based on new algorithms, our programs detected both guide and orphan snoRNA genes in a genome-wide analysis. We also systematically searched the human genome for snoRNAs with four human–mammal (human/mouse, human/rat, human/dog and human/cow) whole-genome alignment (WGA) sequences using these programs. As a result, a majority of known C/D and H/ACA snoRNA genes were detected. In addition, 54 novel genes were identified, with 18 being experimentally confirmed. This study presents the most complete list of two families of snoRNA genes in the human genome to date.

MATERIALS AND METHODS

Data source

WGAs for human (May 2004), mouse (March 2005), rat (June 2003), dog (July 2004), cow (September 2004), chimp (November 2003) and monkey (January 2005) were downloaded from the UCSC Genome Bioinformatics site. The repeat families were removed by RepeatMasker. Sequences and annotation data for known human snoRNA genes (which were used in program training) were downloaded from snoRNA-LBME-db (17) in August 2005. UCSC KnownGene, RefGene and AceView annotation data for human protein genes and transcript units were downloaded from the UCSC Genome Bioinformatics site. rRNA and snRNA sequences were downloaded from snoRNA-LBME-db (17). 2′-O-methylation and pseudouridylation sites of human rRNA and snRNA were taken from snoRNA-LBME-db (17) and other reports (18,19). Sequences 4 nt upstream to 25 nt downstream of known methylation sites were extracted from the rRNA and snRNA sequences as target sequences for the CDseeker program. Sequences 15 nt upstream to 15 nt downstream of the known pseudouridylation sites were extracted from the rRNA and snRNA sequences as target sequences for the ACAseeker program.

Algorithm description

The CDseeker program combines probabilistic model (20), conserved primary and secondary structure motifs to search orphan and guide C/D snoRNAs in WGA sequences. Common algorithm components for guide and orphan snoRNA genes are box C, box D and the terminal stem base pairing. Two additional components are applicable to guide snoRNA genes, a region of sequence complementary to rRNA or snRNA and box D′ if the rRNA or snRNA complementary region is not directly adjacent to box D. For orphan snoRNA genes, two additional components are (i) predicted conserved functional region next to box D or box D′ and (ii) box D′ if the conserved functional sequence is not next to box D. The distance between components (e.g. the maximum distance is 100 nt between box C and box D) was also taken into account. The program searches box C, box D, terminal stem pairing and the antisense element step-by-step in WGA consensus sequences, and scores the corresponding elements with probabilistic models, then evaluates them based on the standard cutoff score of the training set. Candidates progress to the next evaluation only if the element score is higher than the cutoff. The examination of antisense is an optional criterion in the CDseeker program. The program assigns the candidates as guide snoRNAs or orphan snoRNAs using this evaluation. Finally, in order to rank the candidates, the program sums the scores of the motifs resulting in a final score. The standard cutoff score of a training set is also applied for selecting candidates. The structural model of C/D snoRNA genes for CDseeker is based on the canonical C/D structure shown in Figure 1A and the core algorithm workflow is shown using a schematic diagram in Figure 1B.
Figure 1

CDseeker and ACAseeker core algorithm workflow. (A) The C/D snoRNA model. The C/D box snoRNAs carry the conserved boxes C (RUGAUGA, R = purine) and D (CUGA) near their 5′ and 3′ ends, respectively. The two boxes are frequently folded together by a short (4–5 bp) terminal helix, to form a structure similar to a kink-turn. Often, imperfect copies of the C and D boxes, named C′ and D′, are located internally, in the order C/D′/C′/D. The 2′-O-ribose methylation of target RNAs is guided by one or two 10–21 antisense elements located upstream of the D and/or D′ boxes, so that the modified base is paired with the snoRNA nucleotide located precisely 5 nt upstream of the D or D′ box (17). (B) Schematic representation of the CDseeker algorithms. (C) The H/ACA snoRNA model. The H/ACA box snoRNAs consist of two hairpins and two short single-stranded regions, which contain the H box (ANANNA) and the ACA box. The latter is always located 3 nt upstream of the 3′ end of the snoRNA. The hairpins contain bulges, or recognition loops that form complex pseudoknots with the target RNA, where the target uridine is the first unpaired base. The position of the substrate uridine always resides 13–16 nt upstream of the H box (left recognition pocket) or of the ACA box (right recognition pocket) (17). (D) Schematic representation of the ACAseeker algorithms.

The ACAseeker program combines probabilistic model (20), conserved primary and secondary structure motifs to search orphan and guide H/ACA snoRNAs in WGA sequences. Common algorithm components for guide and orphan snoRNA genes are box H, box ACA, stem1 (hairpin I), stem2 (hairpin II) and hairpin–hinge–hairpin. For guide snoRNA genes, another component is taken into account which is the two regions of sequence complementary to rRNA or snRNA in a hairpin. The program searches box H and ACA in conserved sequences of WGA consensus sequences and scores them using probabilistic models. The candidates having a score higher than the standard cutoff progress to the next step which is an evaluation of secondary structure feature using a slightly arbitrary standard observed from known H/ACA snoRNAs. Similar to CDseeker, the final examination of antisense by ACAseeker program is an optional criterion. The program assigns the candidates as guide snoRNAs or orphan snoRNAs by this evaluation. The structural model of H/ACA snoRNA genes used by ACAseeker is based on the canonical H/ACA structure shown in Figure 1C and the core algorithm workflow is shown using a schematic diagram in Figure 1D.

Generating consensus sequences for searching

We generated consensus sequences from UCSC whole-genome pairwise alignments. The following annotations were assigned: (i) the same letter if the sequences were identical between two alignment sequences; (ii) a dot (.) if there was a point mutation between two alignment sequences; and (iii) a dash (-) if there was an insertion or deletion (indel) between alignment sequences. The two alignment sequences were thereby transformed into one consensus sequence (Supplementary Figure S1). To scan for snoRNA genes efficiently, consensus motifs were searched in the consensus sequences. From alignments of known human–mammal snoRNA sequences, we found that the regular expressions for box C, box D, box H and box ACA are as follows: (i) Box C, [ACGT.][ATG]GA[TG]G[ATG.]; (ii) Box D, CTGA; (iii) Box H, A[ACGT.]A[ACGT.][ACGT.][G.]?A; and (iv) Box ACA, A[CT.]A. A dot (.) indicates a point mutation between human and the other mammal, letters are nucleotides identical between human and the other mammal, the symbol ‘?’ indicates that the preceding symbol appears 0 or 1 time, and ‘[]’ means that exactly one of the symbols within the brackets appears at that position.
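The consensus encoding and motif scan described above can be illustrated with a short Python sketch. It is a minimal illustration, not the snoSeeker implementation: the function names `consensus` and `find_boxes` are hypothetical, the inputs are assumed to be two equal-length aligned strings with `-` marking gaps, and the four regular expressions are copied from the text (inside a character class, `.` matches a literal dot, which is the intended meaning here).

```python
import re

def consensus(human: str, other: str) -> str:
    """Collapse one pairwise alignment into a single consensus string."""
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for h, o in zip(human.upper(), other.upper()):
        if h == "-" or o == "-":
            out.append("-")   # insertion or deletion
        elif h == o:
            out.append(h)     # identical nucleotide
        else:
            out.append(".")   # point mutation
    return "".join(out)

# Box patterns as given in the text.
BOX_PATTERNS = {
    "C": r"[ACGT.][ATG]GA[TG]G[ATG.]",
    "D": r"CTGA",
    "H": r"A[ACGT.]A[ACGT.][ACGT.][G.]?A",
    "ACA": r"A[CT.]A",
}

def find_boxes(cons: str, box: str):
    """Return (start, matched_text) for every, possibly overlapping, box match."""
    lookahead = re.compile("(?=(" + BOX_PATTERNS[box] + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(cons)]

cons = consensus("AGTGATGATTCTGA", "AGTGATGACTCTGA")
print(cons)
print(find_boxes(cons, "C"), find_boxes(cons, "D"))
```

A lookahead is used so that overlapping candidate boxes are all reported; whether the original programs consider overlapping matches is not stated in the text.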

Indel and substitution models

As H/ACA snoRNAs are structural and functional RNAs, their sequences are constrained to maintain the hairpin–hinge–hairpin–tail secondary structure. We surveyed 100 known H/ACA snoRNAs and found that successive mutations in these RNAs were mostly <7 nt and successive indels were mostly <5 nt in the alignment sequences (Supplementary Figure S2). We, therefore, defined maximum successive mutations as 7 nt and maximum successive indels as 5 nt. We applied these models to extract conserved segments from whole-genome pairwise alignments and found that all known box H/ACA snoRNAs were located in conserved segments of 120–1000 nt (Supplementary Figure S3). The ACAseeker program was then applied to search for H/ACA genes whose lengths varied between 120 and 1000 nt in the conserved sequences of WGA.
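One way to read the indel and substitution models is as a segmentation rule on the consensus string: a conserved segment ends wherever a run of mutations or indels exceeds the stated maximum. The sketch below follows that reading. It assumes the consensus encoding described earlier (`.` for a point mutation, `-` for an indel), treats the maxima as inclusive, and judges runs of dots and dashes separately; these choices and the name `conserved_segments` are assumptions, not details given in the text.

```python
import re

def conserved_segments(cons, max_mut=7, max_indel=5, min_len=120, max_len=1000):
    """Split a consensus string at over-long mutation or indel runs.

    Returns (start, segment) pairs whose length lies in [min_len, max_len].
    """
    # A run longer than the allowed maximum breaks the conserved segment.
    breaker = re.compile(r"\.{%d,}|-{%d,}" % (max_mut + 1, max_indel + 1))
    pieces = []
    pos = 0
    for m in breaker.finditer(cons):
        pieces.append((pos, cons[pos:m.start()]))
        pos = m.end()
    pieces.append((pos, cons[pos:]))
    return [(s, seg) for s, seg in pieces if min_len <= len(seg) <= max_len]

# Toy example with a small minimum length so the behavior is visible.
toy = "ACGT" + "." * 7 + "ACGT" + "." * 8 + "TTTT" + "-" * 6 + "GG"
print(conserved_segments(toy, min_len=3))
```

With the default arguments, only segments of 120–1000 nt are kept, matching the length window reported for known H/ACA snoRNAs.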

Training for scoring standard with probabilistic model

(i) Training for scoring standard of box element with hidden Markov model. The hidden Markov model (HMM) has been widely used for searching protein-coding genes and non-coding RNA genes (10,20–22). The fixed-length HMM (or zero-order HMM) was used in training the box elements (including box C, box D, box D′, box H and box ACA) of known snoRNAs. In detail, box elements are trained by calculating the probability of test sequences with the HMM. According to the alignment sequences, the box H being trained, is ANANNA and ANANNGA, so there are two transition probabilities from position 5 to 6 of 0.90 and 0.10, respectively (the logarithm of the probability is −0.152 and −3.322, respectively; Supplementary Figure S6). The other transition probability is 1 (the logarithm of the probability is 0; Supplementary Figure S6). The emission probability is constructed by the following equation: The E is an emission score for the j-th symbol (base) of the i-th position of sequence S. f(S) is a frequency for j-th symbol (base) of the i-th position of sequence S. p is a pseudo-count. f(S) is a background frequency of the j-th symbol. Consequently, the HMM consists of an n × m matrix when the length of sequence S is n and the number of symbols is m. (ii) Training for scoring standard of terminal stem pairing with stochastic context-free grammars (SCFGs). Secondary structures in RNA are not local, similar to proteins; thus, it is necessary to use a more complex model than an HMM for modeling terminal stem pairing. Stochastic context-free grammars (SCFGs) (10,20), which are used here, can describe some long-range interactions, including most of those in RNA secondary structure. The SCFG production probabilities were estimated from a training set of C/D snoRNAs with terminal stem pairing. After surveying the terminal stem pairings of all known C/D snoRNA, we found that the pairs reached a maximum of 15 bp in the terminal stem. 
Hence the 15 nt flanking sequences of known snoRNA were extracted for folding the terminal stem and training. The score for terminal stem pairing is from 1.149 bits to 21.252 bits. (iii) Training for scoring standard of complementary regions with HMM. The HMM model was also used in training the complementary regions between ribosomal RNA and snoRNAs (10,20). The probability of matches (Watson–Crick matches and GU matches) and mismatches in the complementarity is emission probability. After surveying the antisense–target–helix of all known snoRNA, we defined criteria for complementary region evaluation. For box C/D snoRNA genes, the criteria are as follows: (a) we selected the lowest score (13.0 bits) of the complementary region as a cutoff score; (b) the first mismatch next to box D or box D′ did not affect the stability of the complementary helix (the antisense and target; 0 bit was given for the first mismatch next to box D or D′); (c) the maximum mismatch was 1 nt except for the first mismatch; (d) the maximum GU pairs were 3 bp; and (e) the minimum complementary length was 10 nt. For box H/ACA snoRNA genes, the criteria are as follows: (a) we selected the lowest score (16.2 bits) of the complementary region as a cutoff score; (b) the maximum mismatch was 1 nt; (c) the minimum complementary length was 9 nt; (d) the minimum upper stem pairing was 4 bp and maximum bulge in upper stem pairing was 5 nt; and (e) the minimum lower stem pairing was 4 bp and maximum bulge in lower stem pairing was 2 nt.
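The structural criteria (b) to (e) for box C/D complementary regions can be expressed as a simple check. The sketch below is illustrative only: it does not compute the HMM score in bits (criterion a), because the trained emission probabilities are not given in the text. The function name and the input convention are assumptions: `guide[i]` is taken to pair with `target[i]`, and index 0 is the pair directly adjacent to box D or D′.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU_WOBBLE = {("G", "U"), ("U", "G")}

def passes_cd_duplex_criteria(guide: str, target: str) -> bool:
    """Check criteria (b)-(e) for a box C/D antisense-target duplex."""
    if len(guide) != len(target):
        raise ValueError("guide and target must be the same length")
    guide = guide.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    if len(guide) < 10:                 # (e) minimum complementary length
        return False
    mismatches = 0
    gu_pairs = 0
    for i, pair in enumerate(zip(guide, target)):
        if pair in WATSON_CRICK:
            continue
        if pair in GU_WOBBLE:
            gu_pairs += 1
        elif i == 0:
            continue                    # (b) first mismatch next to box D/D' is free
        else:
            mismatches += 1
    return mismatches <= 1 and gu_pairs <= 3   # (c) and (d)

print(passes_cd_duplex_criteria("ACGUACGUAC", "UGCAUGCAUG"))
```

In a full implementation this check would be combined with the 13.0-bit score cutoff stated in criterion (a).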

Evaluating box H/ACA secondary structure

For evaluating the hairpin structure, we folded the hairpin II and hairpin I step-by-step using RNAfold (23) and then evaluated the structure with the following standards concluded from surveying all the known H/ACA snoRNAs: (i) maximum distance between H box and hairpin I was 5 nt; (ii) maximum distance between H box and hairpin II was 7 nt; (iii) maximum distance between ACA box and hairpin II was 5 nt; (iv) maximum pocket was 12 nt; (v) maximum bulge in lower and upper stems were 2 and 5 nt, respectively; (vi) minimum stem pairs were 14 bp; (vii) minimum and maximum loop sizes were 3 and 17 nt, respectively; (viii) maximum tail length was 4 nt; and (ix) hairpin MFE was less than −11.0 (kcal/mol) and larger than −43.0 (kcal/mol). For evaluating the hairpin–hinge–hairpin structure, we folded the whole candidate using MFOLD (24), and evaluated the candidate with the following standards concluded from surveying all the known H/ACA snoRNAs: (i) maximum hinge size was 15 nt and (ii) hairpin–hinge–hairpin MFE was less than −29.0 (kcal/mol).
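The structural standards listed above amount to a set of threshold tests on features measured from a predicted fold. A minimal sketch of such a filter follows. The class and function names are hypothetical, and the features are assumed to have been extracted already from the folding output (for example from RNAfold or MFOLD); parsing that output is not shown.

```python
from dataclasses import dataclass

@dataclass
class Hairpin:
    pocket: int        # recognition pocket size (nt)
    lower_bulge: int   # largest bulge in the lower stem (nt)
    upper_bulge: int   # largest bulge in the upper stem (nt)
    stem_pairs: int    # total base pairs in the stem
    loop: int          # apical loop size (nt)
    mfe: float         # hairpin minimum free energy (kcal/mol)

def hairpin_violations(hp):
    """Return the names of the per-hairpin standards that are not met."""
    rules = {
        "pocket <= 12 nt": hp.pocket <= 12,
        "lower bulge <= 2 nt": hp.lower_bulge <= 2,
        "upper bulge <= 5 nt": hp.upper_bulge <= 5,
        "stem pairs >= 14 bp": hp.stem_pairs >= 14,
        "loop 3-17 nt": 3 <= hp.loop <= 17,
        "-43.0 < MFE < -11.0": -43.0 < hp.mfe < -11.0,
    }
    return [name for name, ok in rules.items() if not ok]

def candidate_violations(h_to_hp1, h_to_hp2, aca_to_hp2, tail, hinge, total_mfe):
    """Return the names of the whole-candidate standards that are not met."""
    rules = {
        "H box to hairpin I <= 5 nt": h_to_hp1 <= 5,
        "H box to hairpin II <= 7 nt": h_to_hp2 <= 7,
        "ACA box to hairpin II <= 5 nt": aca_to_hp2 <= 5,
        "tail <= 4 nt": tail <= 4,
        "hinge <= 15 nt": hinge <= 15,
        "total MFE < -29.0": total_mfe < -29.0,
    }
    return [name for name, ok in rules.items() if not ok]

good = Hairpin(pocket=10, lower_bulge=1, upper_bulge=3, stem_pairs=16, loop=6, mfe=-20.5)
print(hairpin_violations(good))
```

Returning the list of failed rules, rather than a single boolean, makes it easier to see why a candidate was rejected.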

Selecting candidates using locateGenome program

With the exception of U3, mgU2-25/61, mgU12-22/U4-8, mgU12 and U13, all snoRNAs are located in introns of spliced mRNAs. We investigated the size distribution of all known snoRNA-hosted introns and found their lengths mainly fell into the range of 100–600 nt (Supplementary Figure S4). The length of introns containing C/D RNAs varied from 157 to 44 874 nt (Supplementary Figure S4). The length of introns containing H/ACA RNAs varied from 213 to 103 658 nt (Supplementary Figure S4). The locateGenome program first locates the candidates according to the UCSC AceView gene data of human genomic coordinates and orientations which includes >250 000 alt-splicings. The program then selects C/D candidates which are located within introns <50 000 bp and H/ACA candidates which are located within introns <110 000 bp.
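The intron-length selection performed by locateGenome can be sketched as a containment test against annotated introns. The sketch below is a simplified illustration under several assumptions: coordinates are half-open `(start, end)` pairs on one chromosome, strand is ignored (the text states that orientation is also considered), and the function and constant names are hypothetical. The two length limits are the ones stated in the text.

```python
# Maximum host-intron length per snoRNA family, as stated in the text (bp).
MAX_INTRON = {"C/D": 50_000, "H/ACA": 110_000}

def in_allowed_intron(cand_start, cand_end, family, introns):
    """True if the candidate lies inside an intron shorter than the family limit.

    introns: iterable of (start, end) pairs on the same chromosome.
    """
    limit = MAX_INTRON[family]
    return any(
        start <= cand_start and cand_end <= end and (end - start) < limit
        for start, end in introns
    )

introns = [(1_000, 3_000), (10_000, 100_000)]
print(in_allowed_intron(1_200, 1_300, "C/D", introns))
print(in_allowed_intron(20_000, 20_130, "C/D", introns))
print(in_allowed_intron(20_000, 20_130, "H/ACA", introns))
```

The second candidate sits in a 90 000 bp intron, so it is rejected as a C/D candidate but accepted as an H/ACA candidate.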

Selecting candidates with conservation filter

We surveyed all the training snoRNA genes and found snoRNA sequences were more conserved than their flanking sequences (Figure 2A and B). We then evaluated candidates through the conservation percentage of flanking upstream and downstream 50 bp sequences and the UCSC genome browser high conserved track (25). We evaluated candidates by the following standards concluded from surveying all known snoRNAs:
Figure 2

(A) SnoRNA conserved features on the human UCSC Genome Browser. C/D and H/ACA snoRNAs are colored blue and green, respectively. Conservation track reveals that sequences corresponding to snoRNAs are more highly conserved than those of flanking sequences. (B) A candidate box H/ACA RNA by computational method does not fit the conserved pattern. UCSC conservation track reveals that sequences corresponding to candidate box H/ACA RNA are less highly conserved than those of flanking sequences. Although this candidate can fold into a hairpin–hinge–hairpin–tail structure, its expression cannot be confirmed by northern blot and reverse transcription.

The IDhp1 and IDhp2 are the conservation percentages of hairpins 1 and 2, IDup and IDdown are the conservation percentages of flanking upstream and downstream 50 bp sequences.

RNA isolation and northern blot

Total cellular RNA was isolated from adult male rat tissues and purified according to the guanidine thiocyanate/phenol–chloroform method (26). Small RNA was purified from total RNA according to the PEG8000/NaCl protocol (27). Small RNA (25 μg/lane) from brain, thymus, heart, lung, liver, spleen, kidney and testis was analyzed by electrophoresis on 8% acrylamide/7 mol/l urea gels. RNAs were transferred onto nylon membranes (Hybond-N+; Amersham) followed by UV irradiation for 5 min. Hybridization with 5′-labeled probes was performed as described previously (28).

Oligodeoxynucleotides

Oligonucleotides were synthesized and purified by Sangon Co. (Shanghai, China). The sequences of oligonucleotide probes for northern blotting and oligonucleotide primers for cDNA libraries construction and screening are shown in Supplementary Table S1. The probes used in northern blotting were 5′-end-labeled with [γ-32P]ATP (Yahui Co.) and submitted to purification according to standard laboratory protocols as described previously (26).

RESULTS

The strategy and efficiency of the CDseeker program for screening of mammalian C/D snoRNA genes

In this study, we developed a computational program, CDseeker, specific to the screening of box C/D snoRNAs that are conserved in human–mammal genomes (human/mouse, human/rat, human/dog and human/cow). The CDseeker program searched and scored the conserved motifs, terminal stem pairing and antisense in a consensus alignment sequence. The output candidates were assigned as ‘guide’ and ‘orphan’ C/D RNAs according to the score assignment. The whole procedure is outlined in Figure 1B and described in detail in Materials and Methods. To date, 252 C/D snoRNAs, including 119 imprinted C/D snoRNAs and excluding U3, have been identified from the human genome through experimental and conservative (homologous to mouse snoRNAs) detections (17). To test the sensitivity of this computational program on C/D snoRNA genes in human, we chose 133 C/D snoRNA genes and 6 imprinted C/D snoRNA genes as a dataset for training the CDseeker program (if the imprinted snoRNA had multiple gene isoforms, only one isoform was selected to avoid overfitting the program to these highly repeated imprinted snoRNA genes). We mainly focused on the score assignment and feature evaluation of the known C/D snoRNAs with respect to the following elements: conserved box C, box D and box D′ motifs, terminal stem pairing, conserved antisense elements, indel and substitution pattern and conservation of flanking sequence (Figure 2A). We then applied CDseeker to the training set of snoRNA gene alignments, which were extracted from four human–mammal WGAs. As a result, 124 out of 139 known human snoRNA genes (90%), including 6 imprinted snoRNA genes, were detected by CDseeker. Fifteen box C/D snoRNA genes were missed for various reasons, including lack of conservation in any of the human–mammal genome alignments, location within repeat-masked regions, lack of conserved motifs or a large length (>120 nt) (Supplementary Table S2).
The training result showed that most of the orphan C/D snoRNAs had scores >26.53 bits and the guide C/D snoRNAs had scores >39.0 bits (Figure 3A and B). We therefore established the cutoff scores of 26.53 bits for orphan C/D snoRNAs and 39.0 bits for guide C/D snoRNAs to identify novel candidates of human C/D snoRNA genes.
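The two cutoffs can be applied as a simple decision rule once a candidate has been scored. The sketch below is one plausible reading, not the documented CDseeker logic: it assumes that the final score is the sum of a core score (boxes and terminal stem) and, when an antisense element is found, an antisense score, and that a candidate failing the guide cutoff may still qualify as an orphan on its core score alone. The function name and this fallback behavior are assumptions.

```python
ORPHAN_CUTOFF = 26.53   # bits, from the training result above
GUIDE_CUTOFF = 39.0     # bits, from the training result above

def classify_cd(core_score, antisense_score=None):
    """Label a C/D candidate as 'guide', 'orphan', or None (rejected)."""
    if antisense_score is not None and core_score + antisense_score > GUIDE_CUTOFF:
        return "guide"
    if core_score > ORPHAN_CUTOFF:
        return "orphan"
    return None

print(classify_cd(28.0, 13.0))
print(classify_cd(28.0))
print(classify_cd(20.0, 13.0))
```

Keeping the cutoffs as named constants makes it straightforward to retrain them on a different set of known snoRNAs.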
Figure 3

Computational identification of box C/D snoRNAs. (A) The distribution of CDseeker scores for known ‘orphan’ C/D snoRNA genes. (B) The distribution of CDseeker scores for known guide C/D snoRNA genes. (HS, MM, RN, CF and BT are abbreviations of human, mouse, rat, dog and cow, respectively. HS-MM represent for human–mouse WGA sequences.)

The strategy and efficiency of the ACAseeker program for screening of mammalian H/ACA snoRNA genes

We also developed another computational program, ACAseeker, specific to the screening of box H/ACA snoRNAs that are conserved in human–mammal genomes. Similar to CDseeker, ACAseeker first searched conserved motifs in a consensus alignment sequence before evaluating the secondary structure features of box H/ACA snoRNAs. The program then searched and scored the functional antisense elements to define the candidates as ‘guide’ H/ACA RNAs or ‘orphan’ H/ACA RNAs. The whole procedure is outlined in Figure 1D and described in detail in Materials and Methods. The same strategy as in CDseeker was applied to test the sensitivity of this computational program on human H/ACA snoRNA genes. In brief, 100 H/ACA snoRNA genes, which were previously identified through experimental and conservative (homologous to mouse snoRNAs) detection (17), were used as a dataset for training the ACAseeker program. The score assignment and feature evaluation of the known H/ACA snoRNAs were considered according to the following elements: box H, box ACA motifs, secondary structure, minimum free energy, conserved antisense elements, indel and substitution pattern and conservation of flanking sequence (Figure 2A). ACAseeker was then applied to the training set of snoRNA gene alignments, which were extracted from four human–mammal WGAs. The test results showed that 75 out of 100 known box H/ACA snoRNA genes (75%) were detected by ACAseeker. The remaining 25 box H/ACA RNAs were missed for reasons similar to those for the box C/D genes undetected by the CDseeker program (Supplementary Table S3). We found that most of the known orphan and guide H/ACA snoRNAs had a standard hairpin–hinge–hairpin–tail structure, an H box score >0.7 bits, an ACA box score >1.63 bits and, for guide H/ACA snoRNAs, a target score >16.2 bits. These results provided the cutoff scores and standards for the search for novel candidates of human H/ACA snoRNA genes.

A genome-wide search of mammalian snoRNA genes with snoSeeker identifies 37 novel human snoRNA genes and 17 novel isoforms of known snoRNA genes

After the training tests of the two programs on known snoRNA genes, we applied the programs to the human genome for an overall search for snoRNA genes of the two families. The whole procedure is outlined in Figure 4A and B.
Figure 4

Flowchart of the CDseeker and ACAseeker algorithms. (A) The flowchart of the CDseeker algorithm is divided into three main stages. The initial stage is a scan of the four WGA sequences by the CDseeker core program. The second stage is the location of candidates in the genome using the locateGenome program. The final stage is to intersect the four results and filter the candidate sequences with an evolutionary conservation pattern. The number of known snoRNAs found at different stages of analysis is shown in parentheses. (B) The flowchart of the ACAseeker algorithm is divided into three main stages. The initial stage is a scan of the four WGA sequences by the ACAseeker core program. The second stage is the location of candidates in the genome using the locateGenome program. The final stage is to intersect the four results and filter the candidate sequences with an evolutionary conservation pattern. The number of known snoRNAs found at different stages of analysis is shown in parentheses.

To search for C/D snoRNA genes, we applied CDseeker to four human–mammal WGA sequences (Figure 4A). About 300 candidates were obtained from the analysis of each human–mammal WGA. The second step was to locate the candidates in the human genome with the locateGenome program as described in detail in Materials and Methods. Only candidates located within introns whose lengths were <50 000 nt were accepted, and the number of candidates for each human–mammal WGA was reduced to <200 after this step. Finally, we intersected the results of the four human–mammal WGA analyses and applied another filter on the results to eliminate false positive candidates. The filter was focused on the conservation of the flanking sequences of the candidates. In total, 212 C/D candidates including 86 orphan candidates were computationally identified from the human genome (Supplementary Table S5; sequences and annotated alignments are available online). Of these candidates, 191 were previously identified snoRNA genes and the remaining 21 candidates, including 5 novel orphan genes and 2 novel isoforms of known orphan snoRNA genes, were novel snoRNA candidates. We similarly applied ACAseeker to search for the other family of snoRNA genes, H/ACA genes, in the human genome. As H/ACA snoRNAs are structural RNAs with fewer successive mutations and indels, we used the conserved alignment sequences as source data to reduce the running time. The ACAseeker program was then applied to the conserved alignment sequences in each of the four WGAs (Figure 4B). About 100 candidates were obtained from each human–mammal WGA dataset. With the same strategy applied in CDseeker, we located the H/ACA candidates in the human genome. All the candidates within introns with lengths <110 000 nt were selected. We also intersected the four human–mammal WGA results and applied another filter in a manner similar to the search of C/D snoRNA genes with CDseeker.
In total, 108 human H/ACA candidates including 11 novel orphan genes and 8 novel isoforms of known orphan snoRNA were obtained (Supplementary Table S5; sequences and annotated alignments are available online). Among them, 75 candidates were previously identified snoRNA genes and the remaining 33 candidates, including 13 orphan ones, were novel candidates. In summary, a total of 54 novel candidates were reported in this study (Tables 1 and 2, and Supplementary Table S4). In addition, >75% of the known snoRNA genes of the two families were detected in our computational scans. More importantly, a large number of orphan snoRNAs, including 26 novel orphan candidates and 94 known orphan genes, were detected with the programs developed in this study, showing the efficiency of the programs as snoRNA screening tools. In addition to detecting snoRNA genes, these two programs can also be applied to predict the guiding function of snoRNAs. Twenty-four novel guiding functions were presented in this study, with the sequence pairing of the guide RNAs and the target RNAs shown in Supplementary Figure S5. Along with the previous results (29–32), 107 out of 110 ribosomal RNA 2′-O-methylations and 17 out of 25 spliceosomal RNA 2′-O-methylations in human have been assigned to C/D guide RNAs; 86 out of 97 ribosomal RNA pseudouridines and 20 out of 32 spliceosomal RNA pseudouridines in human have been assigned to H/ACA guide RNAs. Our results provide a more comprehensive understanding of the post-transcriptional modification-guided RNA machinery in the human genome.
Table 1

Novel box C/D snoRNA genes

| snoRNA name | Len | Location | Exp. | Homology | Modification | Antisense element | Host gene/comments |
|---|---|---|---|---|---|---|---|
| Novel guide | | | | | | | |
| SNORD117 | 94 | chr3:52699794–52699886 | N.blot | MM RN CF BT | 18S-Gm683 | 15 nt (3′) | GNL3 |
| SNORD118 | 101 | chr14:44649828–44649928 | N.blot | MM RN CF BT | 18S-Gm1447 | 11 nt (3′) | PRPF39 |
| SNORD119 | 96 | chr20:2391598–2391693 | N.blot | MM RN CF BT | 28S-Am4560 | 16 nt (3′) | SNRPB |
| SNORD120 | 84 | chrX:20064093–20064185 | N.blot | MM RN BT | U2-Am30 | 13 nt (3′) | EIF1AX |
| SNORD121A | 91 | chr9:33942762–33942852 | N.blot | MM RN CF | 28S-Gm4607 | 12 nt (5′) | UBAP2 |
| SNORD121B | 93 | chr9:33924286–33924378 | N.blot | MM RN CF | 28S-Gm4607 | 12 nt (5′) | UBAP2 |
| Novel isoform | | | | | | | |
| SNORD41B | 94 | chr19:12675401–12675494 | Iso(U41) | MM RN CF BT | 28S-Um4276 | 14 nt (3′) | TNPO2 |
| SNORD12B | 103 | chr20:47330257–47330359 | Iso(HBII-99) | MM RN CF BT | 28S-Gm3878 | 13 nt (5′) | HSUP1 |
| SNORD111B | 94 | chr16:69120906–69120999 | Iso(HBII-82) | MM CF BT | 28S-Gm3923 | 16 nt (5′) | SF3B3 |
| SNORD58C | 79 | chr18:45269603–45269692 | Iso(U58) | RN BT | 28S-Um4197 | 15 nt (5′) | RPL17 |
| SNORD11B | 112 | chr2:202864285–202864396 | Iso(HBII-95) | MM RN CF BT | 18S-Gm509 | 13 nt (3′) | NOP5/NOP58 |
| SNORD105B | 92 | chr19:10081425–10081516 | Iso(U105) | MM RN CF | 18S-Um799 | 15 nt (3′) | P2Y11 |
| Novel orphan | | | | | | | |
| SNORD122 | 100 | chr2:29004342–29004441 | N.blot | MM RN CF BT | Unknown | Unknown | WDR43 |
| SNORD123 | 88 | chr5:9601939–9602026 | N.blot | MM RN CF BT | Unknown | Unknown | Hs.34447 |
| SNORD124 | 104 | chr17:35437321–35437424 | N.blot | MM RN CF BT | Unknown | Unknown | THRAP4 |
| SNORD125 | 96 | chr22:28059152–28059247 | N.blot | CF BT | Unknown | Unknown | AP1B1 |
| SNORD126 | 99 | chr14:19864440–19864538 | N.blot | RN CF | Unknown | Unknown | CCNB1IP |
| Novel isoform | | | | | | | |
| SNORD116@ | 106 | chr15:22881615–22881718 | Iso(HBII-85) | MM RN CF BT | Unknown | Unknown | SNURF-SNRNP-UBE3A |
| SNORD114@ | 93 | chr14:100534548–100534640 | Iso(14q(II)) | CF BT | Unknown | Unknown | MEG8 |

‘Iso’: isoform; ‘Len’: length of the snoRNA gene (as the program extends 5′ and 3′ stems by 15 nt, the predicted snoRNAs may be 20 nt larger than corresponding snoRNAs confirmed by northern blot); ‘Exp’: expression evidence. ‘N.blot’ indicates that the snoRNA was identified by northern blotting analysis in our work. In the column ‘host gene’, the protein-coding host genes are denoted by their symbols. In the column ‘location’, the genomic locations are shown. In the column ‘modification’, a nucleotide with ‘m’ represents the rRNA or snRNA methylation site that is conserved in mammals. HS, MM, RN, CF and BT are abbreviations of human (hg18, March 2006), mouse, rat, dog and cow, respectively.

Table 2

Novel box H/ACA snoRNA genes

snoRNA name | Len. (nt) | Location | Exp. | Homology | Modification | Antisense element | Host gene/comments
Novel guide
    SCARNA26 | 145 | chr1:153915523–153915667 | N.blot | CF | U4-Ψ78 | 6 + 7 nt (5′) | YY1AP1
    SNORA82 | 123 | chr3:187986808–187986930 | N.blot | MM RN | 28S-Ψ4491 | 3 + 7 nt (5′) | EIF4A2
    SCARNA27 | 126 | chr6:8031640–8031765 | N.blot | CF BT | U4-Ψ4 | 7 + 3 nt (5′) | EEF1E1
    SNORA83A | 135 | chr7:64168351–64168485 | N.blot | RN | 18S-Ψ1367 | 5 + 7 nt (5′) | LOC441242
    SNORA83B | 135 | chr7:64862474–64862608 | N.blot | RN | 18S-Ψ1367 | 5 + 7 nt (5′) | LOC441241
    SNORA77B | 122 | chr22:18493925–18494047 | N.blot | MM RN CF | 18S-Ψ814 | 6 + 5 nt (5′) | RANBP1
    SNORA77A | 123 | chr1:201965332–201965454 | N.blot (Schattner, ACA63) | MM RN CF BT | 18S-Ψ814 | 6 + 5 nt (5′) | ATP2B4
    SNORA80B | 135 | chr7:6023034–6023168 | Schattner. | CF | 18S-Ψ109-Ψ572 | 6 + 4 nt (5′), 7 + 4 nt (3′) | JTV1
    SNORA80A | 136 | chr21:32671367–32671502 | Schattner. (ACA67) | MM BT | 18S-Ψ109-Ψ572 | 6 + 4 nt (5′), 7 + 4 nt (3′) | C21orf108
    SNORA79 | 140 | chr14:80738792–80738931 | Schattner. (ACA65) | MM RN | U6-Ψ31-Ψ86 | 5 + 6 nt (5′), 5 + 7 nt (3′) | GTF2A1
    SCARNA21 | 138 | chr17:7750166–7750303 | Schattner. (ACA68) | CF | U12-Ψ19 | 6 + 4 nt (5′) | CHD3
    SNORA76 | 132 | chr17:59577431–59577562 | Schattner. (ACA62) | MM RN CF BT | 18S-Ψ34-Ψ105 | 7 + 3 nt (5′), 7 + 5 nt (3′) | EST cluster
    SCARNA22 | 132 | chr5:82395779–82395910 | Gu. (U109) | MM RN CF | U1-Ψ6 | 4 + 5 nt (3′) | MGC23909
Novel isoform
    SNORA58B | 134 | chr1:152498829–152498962 | Iso(ACA58) | RN BT | 28S-Ψ3823 | 5 + 9 nt (5′) | UBAP2L
Novel orphan
    SNORA84 | 133 | chr9:94094564–94094696 | N.blot (Washietl) | MM RN CF | Unknown | Unknown | IARS
    SNORA85 | 130 | chr15:63364852–63364981 | N.blot | MM RN CF BT | Unknown | Unknown | PARP16
    SNORA86 | 132 | chr7:64163814–64163945 | N.blot | MM RN CF BT | Unknown | Unknown | BX649060
    SNORA45 | 127 | chr11:8663564–8663690 | Gu. (ACA3-2) | MM CF BT | Unknown | Unknown | RPL27A
    SNORA12 | 144 | chr10:101986903–101987046 | Gu. (U108) | CF | Unknown | Unknown | CWF19L1
Novel isoform
    SNORA36C | 129 | chr2:69600679–69600807 | Iso(ACA36) | CF | Unknown | Unknown | AAK1
    SNORA38B | 132 | chr17:63167248–63167379 | Iso(ACA38) | BT | Unknown | Unknown | NOL11
    SNORA70B | 134 | chr2:61497883–61498016 | Iso(U70) | MM BT | Unknown | Unknown | USP34
    SNORA70C | 134 | chr9:118983166–118983299 | Iso(U70) | BT | Unknown | Unknown | ASTN2
    SNORA11B | 128 | chr14:90662523–90662650 | Iso(U107) | MM BT | Unknown | Unknown | C14orf159
    SNORA11C | 128 | chrX:47132993–47133120 | Iso(U107) | MM CF BT | Unknown | Unknown | ZNF157
    SNORA11D | 127 | chrX:51823183–51823309 | Iso(U107) | BT | Unknown | Unknown | MAGED4
    SNORA11E | 127 | chrX:51950458–51950584 | Iso(U107) | BT | Unknown | Unknown | MAGED4

‘Iso’: isoform; ‘Len’: length of the snoRNA gene; ‘Exp’: expression evidence. ‘N.blot’ indicates that the snoRNA was identified by northern blotting analysis in our work. ‘Schattner’, ‘Gu’ and ‘Washietl’ indicate that expression of the snoRNA was confirmed in other works (ref. 13, 16, 21). In the ‘host gene’ column, protein-coding host genes are denoted by their symbols. The ‘location’ column gives the genomic location. In the ‘modification’ column, a nucleotide preceded by ‘Ψ’ marks the rRNA or snRNA pseudouridine site that is conserved in mammals. HS, MM, RN, CF and BT are abbreviations for human (hg18, March 2006), mouse, rat, dog and cow, respectively.
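
The species abbreviations in the ‘Homology’ column can be expanded mechanically. This is a small illustrative sketch; the dictionary and the function name `expand_homology` are hypothetical helpers, not part of snoSeeker.

```python
# Abbreviations as defined in the table footnote.
SPECIES = {"HS": "human", "MM": "mouse", "RN": "rat", "CF": "dog", "BT": "cow"}

def expand_homology(cell):
    """Turn a space-separated 'Homology' cell into a list of species names."""
    return [SPECIES[code] for code in cell.split()]

print(expand_homology("MM RN CF BT"))  # ['mouse', 'rat', 'dog', 'cow']
```

An unknown code raises a `KeyError`, which makes typos in the abbreviation column easy to spot.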

Expression of the novel snoRNA genes in different tissues

To further confirm the computationally identified candidates, all novel snoRNA candidates were selected for northern blotting analyses, with the exception of eight candidate H/ACA snoRNAs confirmed by three recently published articles (14,16,33) and 20 isoforms of known snoRNAs. These included 15 novel candidates conserved between human and rat (rat was chosen to facilitate examination of a diverse range of tissues) and 11 novel candidates conserved between human and mouse, dog or cow but not rat; the latter were examined in four human cell lines (293T, HeLa, Jurkat and K562). Total RNAs from eight rat tissues and four human cell lines were used in the experiments. Positive northern blots were obtained for 18 novel snoRNAs (Figure 5A–C). Of these, 8 were orphan snoRNAs, 2 of which (SNORD126 and SNORD123) exhibited tissue-specific or restricted expression. SNORD126 was expressed only in thymus and spleen, whereas SNORD123 was strongly expressed in lung and heart, with weak expression in spleen and testis (Figure 5A). The remaining six orphan snoRNAs were ubiquitously expressed, although their accumulation levels differed between tissues. It has generally been thought that all guide snoRNAs are expressed ubiquitously as housekeeping RNA genes. Surprisingly, our study showed that one guide snoRNA, SNORD121, was specifically expressed in the thymus (Figure 5A). Another guide snoRNA, SNORD120, was strongly expressed in thymus, with lower levels of expression in spleen, lung, kidney and testis.
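
The candidate counts in this paragraph can be checked with simple arithmetic. The sketch below uses only numbers given in the text and the abstract; the variable names are illustrative.

```python
novel_candidates = 54      # novel candidates reported in the abstract
confirmed_elsewhere = 8    # H/ACA candidates already confirmed in other articles
isoforms_of_known = 20     # isoforms of known snoRNAs, not blotted

# Candidates left for northern blotting.
blotted = novel_candidates - confirmed_elsewhere - isoforms_of_known

rat_conserved = 15         # tested in rat tissues
other_mammal_conserved = 11  # tested in human cell lines
assert blotted == rat_conserved + other_mammal_conserved

orphan_positives = 8       # orphans among the 18 positive blots
restricted_orphans = 2     # SNORD126 and SNORD123
ubiquitous_orphans = orphan_positives - restricted_orphans

print(blotted, ubiquitous_orphans)  # 26 6
```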
Figure 5

Northern blotting analysis of the expression patterns of novel snoRNAs. Lane M, molecular weight markers (pBR322 digested with HaeIII and 5′-end-labeled with [γ-32P]ATP). Samples from different rat tissues are labeled with the tissue names. U6 snRNA was analyzed as a control. (A) Expression pattern of novel C/D snoRNAs. (B) Expression pattern of novel H/ACA snoRNAs. (C) Expression pattern of novel snoRNAs in human cell lines. The names of the human cell lines are indicated.

DISCUSSION

Recently, numerous orphan snoRNAs have been experimentally identified in different eukaryotes, notably mammals, suggesting that a large group of snoRNAs with unknown function remains hidden in mammalian genomes. Since the discovery of a novel function for the orphan snoRNA HBII-52, research on orphan snoRNAs in mammalian genomes has grown steadily. In this study, a computational package, snoSeeker, which includes two programs, CDseeker and ACAseeker, was developed for the screening of both guide and orphan snoRNAs. Using these programs, we detected 120 orphan snoRNA genes as well as 200 guide snoRNA genes in the human genome. The programs incorporate a new strategy that treats complementary antisense as an optional criterion for candidate detection. Candidates that score above the cutoff and carry a complementary antisense element are assigned as guide snoRNAs, whereas candidates that score above the cutoff but lack a complementary antisense element by computational evaluation are assigned as orphan snoRNAs. In comparison with existing snoRNA-screening programs, which essentially focus on guide snoRNAs, our package provides an advanced search engine for an overall analysis of snoRNA genes in mammalian genomes.

It has been well demonstrated that, under selection pressure, functional RNAs [both protein-coding RNAs (ORF) and non-coding RNAs] exhibit highly conserved sequences between phylogenetically related species. Searching for functional RNAs within conserved intronic and/or intergenic sequences has been a key strategy for the systematic identification of snoRNAs (16), miRNAs (34,35) and other structural non-coding RNAs (33,36). Generally, non-coding RNAs are under stronger selection pressure than their flanking sequences. 
Consistent with this point, in the first survey of the training snoRNA set we found that snoRNA coding regions were more highly conserved than their flanking sequences and resembled a ‘hill’ in the UCSC genome browser (Figure 2A). Our experimental analyses further supported this notion. For example, some false positive candidates, in which the coding regions and the flanking regions were both highly conserved, could not be detected by northern blot analysis (Figure 2B and other experimental data not shown). Therefore, we used a conservation filter in the programs to distinguish false positive candidates from authentic candidates based on the difference in conservation between the coding region of a snoRNA and its flanking regions. By using this filter, false positive snoRNA gene candidates were efficiently eliminated in our analyses.

The programs developed in this study were also time-efficient and did not require sophisticated computational equipment. The whole process, including searching the four WGAs of human/mouse, human/rat, human/dog and human/cow, took ∼96 h on a personal computer (3 GHz CPU). Given these modest hardware requirements, the programs can be readily adopted.

Frequently, two or even more snoRNAs are encoded within different introns of the same host gene. Interestingly, some isoforms are located in the flanking introns of the same host gene (37,38). The new isoforms of HBII-95, U105, HBII-99, HBII-82, U58 and U41 predicted by the computational methods developed in this study, and the two novel snoRNAs SNORD121A and SNORD121B, are new examples of intragenic snoRNA duplication. Most, if not all, of the snoRNAs in mammals are intron-encoded. Their host genes, such as ribosomal protein genes and snoRNA-binding protein genes, are ubiquitously expressed housekeeping genes that are mainly involved in ribosome biogenesis. 
Only a few imprinted orphan snoRNAs have been found to have a brain-specific expression pattern (39). Surprisingly, we found that two newly identified guide snoRNAs (SNORD121 and SNORD120) exhibited an obvious tissue-specific or restricted expression pattern in this study (Figure 5A). This observation was further supported by the expression patterns of the host genes of these two snoRNAs, UBAP2 and EIF1AX (in the UCSC Gene Sorter). The increasing number of snoRNAs with tissue-specific expression implies a regulatory role for snoRNAs in gene expression, as has been shown for HBII-52 (8,9).
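
The conservation filter discussed above compares conservation inside a candidate with conservation in its flanks. The following is a minimal sketch of that idea, not the snoSeeker implementation: it assumes per-column identity flags (1 for a match, 0 otherwise) from a pairwise alignment, and the function name, the flank width and the `min_drop` threshold are all hypothetical.

```python
from statistics import mean

def passes_conservation_filter(identity, start, end, flank=50, min_drop=0.2):
    """Keep a candidate only if it is more conserved than its flanks.

    identity : list of 0/1 flags, one per alignment column (assumed input)
    start, end : half-open candidate interval [start, end)
    flank, min_drop : illustrative parameters, not values from the paper
    """
    coding = identity[start:end]
    flanks = identity[max(0, start - flank):start] + identity[end:end + flank]
    if not coding or not flanks:
        return False  # nothing to compare against
    return mean(coding) - mean(flanks) >= min_drop

# A 'hill': conserved candidate, poorly conserved flanks.
hill = [0] * 50 + [1] * 90 + [0] * 50
# A 'plateau': candidate and flanks equally conserved.
plateau = [1] * 190

print(passes_conservation_filter(hill, 50, 140))     # True
print(passes_conservation_filter(plateau, 50, 140))  # False
```

The plateau case corresponds to the false positives described in the text, where coding and flanking regions are both highly conserved.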

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.