
snoSeeker: an advanced computational package for screening of guide and orphan snoRNA genes in the human genome.

Jian-Hua Yang¹, Xiao-Chen Zhang, Zhan-Peng Huang, Hui Zhou, Mian-Bo Huang, Shu Zhang, Yue-Qin Chen, Liang-Hu Qu.

Abstract

Small nucleolar RNAs (snoRNAs) represent an abundant group of non-coding RNAs in eukaryotes. They can be divided into guide and orphan snoRNAs according to the presence or absence of antisense sequence to rRNAs or snRNAs. Current snoRNA-searching programs, which are essentially based on sequence complementarity to rRNAs or snRNAs, exist only for the screening of guide snoRNAs. In this study, we have developed an advanced computational package, snoSeeker, which includes the CDseeker and ACAseeker programs, for the highly efficient and specific screening of both guide and orphan snoRNA genes in mammalian genomes. Using these programs, we have systematically scanned four human–mammal whole-genome alignment (WGA) sequences and identified 54 novel candidates, including 26 orphan candidates, as well as 266 known snoRNA genes. Eighteen novel snoRNAs were further experimentally confirmed, with four snoRNAs exhibiting a tissue-specific or restricted expression pattern. The results of this study provide the most comprehensive listing of two families of snoRNA genes in the human genome to date.


Year: 2006; PMID: 16990247; PMCID: PMC1636440; DOI: 10.1093/nar/gkl672

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

The small nucleolar RNAs (snoRNAs) represent an abundant group of small non-coding RNAs (ncRNAs) in eukaryotes (1). With the exception of RNase MRP, all the snoRNAs fall into two major families, box C/D and box H/ACA snoRNAs, on the basis of common sequence motifs and structural features. The snoRNAs characterized to date are largely either box C/D snoRNAs, which share two conserved motifs, the 5′ end box C and the 3′ end box D, or box H/ACA snoRNAs, which exhibit a common hairpin–hinge–hairpin–tail secondary structure with the box H and box ACA (2). Although several snoRNAs, such as U3, snR30 and RNase MRP, are required for specific cleavage of pre-rRNAs, the majority of box C/D snoRNAs function as guides for site-specific 2′-O-ribose methylation, and most box H/ACA snoRNAs function as guides for pseudouridylation in the post-transcriptional processing of rRNAs (2). Studies have shown that some snoRNAs and small Cajal body-specific RNAs (scaRNAs) participate in the modification of snRNAs (3,4). Some modifications in archaeal tRNAs are also introduced by box C/D small RNAs, which are the homologs of snoRNAs in eukaryotes (5). Notably, an increasing number of orphan snoRNAs, which lack antisense to rRNAs or snRNAs, have been experimentally identified along with the numerous guide snoRNAs from different eukaryotes (6,7). The finding that an orphan snoRNA, HBII-52, is associated with human disease has triggered great interest in orphan snoRNAs (8,9). Many studies have shown that computational analysis of genomic databases is a useful way to identify snoRNAs from eukaryotes on a large scale (7,10,11). To date, several searching programs based on pattern recognition scan algorithms have been developed, such as Snoscan (10) for box C/D snoRNAs, and SnoGPS (11) and MFE (12) methods for box H/ACA snoRNAs.
In comparison to experimental approaches that tend to favor the discovery of the most abundant RNAs (13,14), computational analysis provides an unbiased genome-wide search for snoRNA genes. However, the current snoRNA-searching programs are essentially based on sequence complementarity to rRNAs or snRNAs and are therefore limited to detecting guide snoRNAs; they cannot detect orphan snoRNAs. Another limitation of the existing programs is that it is difficult to systematically search the human genome for snoRNAs because of the sheer volume of data. Although two recent studies have performed a computational detection of human snoRNAs, they focused only on particular segments of the human genome (15,16). It is therefore important to develop an advanced search program for genome-wide screening of all snoRNAs, including the orphan snoRNAs. In this study, we developed a computational package which includes two novel snoRNA-searching programs, CDseeker and ACAseeker, specific to the detection of C/D snoRNAs and H/ACA snoRNAs, respectively. Based on new algorithms, our programs detected both guide and orphan snoRNA genes in a genome-wide analysis. We also systematically searched the human genome for snoRNAs with four human–mammal (human/mouse, human/rat, human/dog and human/cow) whole-genome alignment (WGA) sequences using these programs. As a result, a majority of known C/D and H/ACA snoRNA genes were detected. In addition, 54 novel genes were identified, 18 of which were experimentally confirmed. This study presents the most complete list of two families of snoRNA genes in the human genome to date.

MATERIALS AND METHODS

Data source

WGAs for human (May 2004), mouse (March 2005), rat (June 2003), dog (July 2004), cow (September 2004), chimp (November 2003) and monkey (January 2005) were downloaded from the UCSC Genome Bioinformatics site. The repeat families were removed by RepeatMasker. Sequences and annotation data for known human snoRNA genes (which were used in program training) were downloaded from snoRNA-LBME-db (17) in August 2005. UCSC KnownGene, RefGene and AceView annotation data for human protein genes and transcript units were downloaded from the UCSC Genome Bioinformatics site. rRNA and snRNA sequences were downloaded from snoRNA-LBME-db (17). 2′-O-methylation and pseudouridylation sites of human rRNA and snRNA were taken from snoRNA-LBME-db (17) and other reports (18,19). Sequences from 4 nt upstream to 25 nt downstream of known methylation sites were extracted from the rRNA and snRNA sequences as target sequences for the CDseeker program. Sequences from 15 nt upstream to 15 nt downstream of the known pseudouridylation sites were extracted from the rRNA and snRNA sequences as target sequences for the ACAseeker program.
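
The target-window extraction described above can be illustrated with a minimal Python sketch. The function names and the 1-based site convention are assumptions made for illustration, as is the choice to include the modified nucleotide itself in the window and to clip windows at the sequence ends; the source states only the upstream and downstream extents.

```python
def target_window(seq, site, upstream, downstream):
    """Return the region around a 1-based modification site.

    The window runs from `upstream` nt before the site to `downstream` nt
    after it and includes the site itself (an assumption). It is clipped
    at the ends of the sequence.
    """
    start = max(0, site - 1 - upstream)   # 0-based, inclusive
    end = min(len(seq), site + downstream)  # 0-based, exclusive
    return seq[start:end]


def cd_target(seq, site):
    """Window for a 2'-O-methylation site: 4 nt upstream to 25 nt downstream."""
    return target_window(seq, site, upstream=4, downstream=25)


def aca_target(seq, site):
    """Window for a pseudouridylation site: 15 nt upstream to 15 nt downstream."""
    return target_window(seq, site, upstream=15, downstream=15)
```

With these conventions, an internal methylation site yields a 30 nt window and an internal pseudouridylation site a 31 nt window.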

Algorithm description

The CDseeker program combines a probabilistic model (20) with conserved primary and secondary structure motifs to search for orphan and guide C/D snoRNAs in WGA sequences. The algorithm components common to guide and orphan snoRNA genes are box C, box D and the terminal stem base pairing. Two additional components apply to guide snoRNA genes: a region of sequence complementary to rRNA or snRNA, and box D′ if the rRNA or snRNA complementary region is not directly adjacent to box D. For orphan snoRNA genes, the two additional components are (i) a predicted conserved functional region next to box D or box D′ and (ii) box D′ if the conserved functional sequence is not next to box D. The distance between components (e.g. the maximum distance between box C and box D is 100 nt) was also taken into account. The program searches for box C, box D, terminal stem pairing and the antisense element step-by-step in WGA consensus sequences, scores the corresponding elements with probabilistic models, and then evaluates them against the standard cutoff scores of the training set. Candidates progress to the next evaluation only if the element score is higher than the cutoff. The examination of antisense is an optional criterion in the CDseeker program. The program assigns the candidates as guide snoRNAs or orphan snoRNAs using this evaluation. Finally, in order to rank the candidates, the program sums the scores of the motifs to give a final score. The standard cutoff score of the training set is also applied to the final score for selecting candidates. The structural model of C/D snoRNA genes for CDseeker is based on the canonical C/D structure shown in Figure 1A, and the core algorithm workflow is shown as a schematic diagram in Figure 1B.

Figure 1

CDseeker and ACAseeker core algorithm workflow. (A) The C/D snoRNA model. The C/D box snoRNAs carry the conserved boxes C (RUGAUGA, R = purine) and D (CUGA) near their 5′ and 3′ ends, respectively. The two boxes are frequently brought together by a short (4–5 bp) terminal helix, to form a structure similar to a kink-turn. Often, imperfect copies of the C and D boxes, named C′ and D′, are located internally, in the order C/D′/C′/D. The 2′-O-ribose methylation of target RNAs is guided by one or two 10–21 nt antisense elements located upstream of the D and/or D′ boxes, so that the modified base is paired with the snoRNA nucleotide located precisely 5 nt upstream of the D or D′ box (17). (B) Schematic representation of the CDseeker algorithms. (C) The H/ACA snoRNA model. The H/ACA box snoRNAs consist of two hairpins and two short single-stranded regions, which contain the H box (ANANNA) and the ACA box. The latter is always located 3 nt upstream of the 3′ end of the snoRNA. The hairpins contain bulges, or recognition loops, that form complex pseudoknots with the target RNA, where the target uridine is the first unpaired base. The substrate uridine always resides 13–16 nt upstream of the H box (left recognition pocket) or of the ACA box (right recognition pocket) (17). (D) Schematic representation of the ACAseeker algorithms.

The ACAseeker program combines a probabilistic model (20) with conserved primary and secondary structure motifs to search for orphan and guide H/ACA snoRNAs in WGA sequences. The algorithm components common to guide and orphan snoRNA genes are box H, box ACA, stem1 (hairpin I), stem2 (hairpin II) and the hairpin–hinge–hairpin structure. For guide snoRNA genes, one further component is taken into account: the two regions of sequence complementary to rRNA or snRNA in a hairpin. The program searches for box H and box ACA in conserved sequences of WGA consensus sequences and scores them using probabilistic models. Candidates with a score higher than the standard cutoff progress to the next step, an evaluation of secondary structure features using a slightly arbitrary standard observed from known H/ACA snoRNAs. As in CDseeker, the final examination of antisense by the ACAseeker program is an optional criterion. The program assigns the candidates as guide snoRNAs or orphan snoRNAs by this evaluation. The structural model of H/ACA snoRNA genes used by ACAseeker is based on the canonical H/ACA structure shown in Figure 1C, and the core algorithm workflow is shown as a schematic diagram in Figure 1D.

Generating consensus sequences for searching

We generated consensus sequences from UCSC whole-genome pairwise alignments. The following annotations were assigned: (i) the same letter if the sequences were identical between the two aligned sequences; (ii) a dot (.) if there was a point mutation between the two aligned sequences; and (iii) a dash (-) if there was an insertion or deletion (indel) between the aligned sequences. The two aligned sequences were thus transformed into one consensus sequence (Supplementary Figure S1). To scan for snoRNA genes efficiently, consensus motifs were searched in the consensus sequences. From alignments of known human–mammal snoRNA sequences, we found that the regular expressions for box C, box D, box H and box ACA are as follows: (i) Box C, [ACGT.][ATG]GA[TG]G[ATG.]; (ii) Box D, CTGA; (iii) Box H, A[ACGT.]A[ACGT.][ACGT.][G.]?A; and (iv) Box ACA, A[CT.]A. A dot (.) indicates a point mutation between human and the other mammal, letters are nucleotides identical between human and the other mammal, the symbol ‘?’ indicates that the preceding symbol appears 0 or 1 time, and ‘[]’ means that exactly one of the characters listed within the brackets appears at that position.
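
As an illustration of the consensus encoding and motif search described above, the following Python sketch builds a consensus string from two aligned sequences and scans it with the four regular expressions given in the text. The function names are hypothetical, and the use of a lookahead to report overlapping matches is an assumption; the source does not say how overlapping hits were handled.

```python
import re


def consensus(human, other):
    """Collapse a pairwise alignment into one consensus string.

    Identical columns keep their letter, substitutions become '.',
    and columns with a gap in either sequence become '-'.
    """
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for h, o in zip(human.upper(), other.upper()):
        if h == "-" or o == "-":
            out.append("-")
        elif h == o:
            out.append(h)
        else:
            out.append(".")
    return "".join(out)


# Regular expressions quoted from the text. Inside [...] the dot is a
# literal character, i.e. the substitution symbol of the consensus.
BOX_PATTERNS = {
    "C": r"[ACGT.][ATG]GA[TG]G[ATG.]",
    "D": r"CTGA",
    "H": r"A[ACGT.]A[ACGT.][ACGT.][G.]?A",
    "ACA": r"A[CT.]A",
}


def find_boxes(cons, box):
    """Return (start, matched_text) for every hit, including overlapping ones."""
    lookahead = re.compile("(?=(" + BOX_PATTERNS[box] + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(cons)]
```

For example, aligning `AGTGATGATTTCTGA` with `AGTGATGACTTCTGA` gives the consensus `AGTGATGA.TTCTGA`, in which box C is found at offset 1 and box D at offset 11.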

Indel and substitution models

As H/ACA snoRNAs are structural and functional RNAs, their sequences are constrained to maintain the hairpin–hinge–hairpin–tail secondary structure. We surveyed 100 known H/ACA snoRNAs and found that successive mutations in these RNAs were mostly <7 nt and successive indels were mostly <5 nt in the alignment sequences (Supplementary Figure S2). We, therefore, defined maximum successive mutations as 7 nt and maximum successive indels as 5 nt. We applied these models to extract conserved segments from whole-genome pairwise alignments and found that all known box H/ACA snoRNAs were located in conserved segments of 120–1000 nt (Supplementary Figure S3). The ACAseeker program was then applied to search for H/ACA genes in the conserved WGA segments whose lengths ranged from 120 to 1000 nt.
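
One way to apply these limits to a consensus string (using '.' for substitutions and '-' for indels) is sketched below in Python. It is only an illustration under stated assumptions: a segment is broken wherever a run of '.' exceeds 7 columns or a run of '-' exceeds 5 columns, segment length is counted in alignment columns, and mixed runs of '.' and '-' are not treated specially. The source does not specify these details, and the function name is hypothetical.

```python
import re

MAX_MUTATION_RUN = 7   # maximum successive substitutions ('.')
MAX_INDEL_RUN = 5      # maximum successive indel columns ('-')
MIN_SEGMENT, MAX_SEGMENT = 120, 1000  # allowed segment length

# A run longer than the allowed maximum ends the current conserved segment.
BREAK = re.compile(r"\.{%d,}|-{%d,}" % (MAX_MUTATION_RUN + 1, MAX_INDEL_RUN + 1))


def conserved_segments(cons):
    """Return (start, end) half-open intervals of conserved segments."""
    segments = []
    start = 0
    for m in BREAK.finditer(cons):
        segments.append((start, m.start()))
        start = m.end()
    segments.append((start, len(cons)))
    return [(s, e) for s, e in segments if MIN_SEGMENT <= e - s <= MAX_SEGMENT]
```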

Training for scoring standard with probabilistic model

(i) Training for scoring standard of box element with hidden Markov model. The hidden Markov model (HMM) has been widely used for searching protein-coding genes and non-coding RNA genes (10,20–22). A fixed-length HMM (or zero-order HMM) was used in training the box elements (including box C, box D, box D′, box H and box ACA) of known snoRNAs. In detail, box elements are trained by calculating the probability of test sequences with the HMM. According to the alignment sequences, the box H being trained is either ANANNA or ANANNGA, so there are two transition probabilities from position 5 to 6, 0.90 and 0.10, respectively (the logarithms of the probabilities are −0.152 and −3.322, respectively; Supplementary Figure S6). Every other transition probability is 1 (the logarithm of the probability is 0; Supplementary Figure S6). The emission probability is constructed from the following quantities: E_ij(S) is the emission score for the j-th symbol (base) at the i-th position of sequence S; f_ij(S) is the frequency of the j-th symbol (base) at the i-th position of sequence S; p is a pseudo-count; and f_j is the background frequency of the j-th symbol. Consequently, the HMM consists of an n × m matrix when the length of sequence S is n and the number of symbols is m.

(ii) Training for scoring standard of terminal stem pairing with stochastic context-free grammars (SCFGs). As in proteins, secondary-structure interactions in RNA are not local; thus, it is necessary to use a more complex model than an HMM for modeling terminal stem pairing. Stochastic context-free grammars (SCFGs) (10,20), which are used here, can describe some long-range interactions, including most of those in RNA secondary structure. The SCFG production probabilities were estimated from a training set of C/D snoRNAs with terminal stem pairing. After surveying the terminal stem pairings of all known C/D snoRNAs, we found that the pairs reached a maximum of 15 bp in the terminal stem. Hence the 15 nt flanking sequences of known snoRNAs were extracted for folding the terminal stem and training. The score for terminal stem pairing ranges from 1.149 bits to 21.252 bits.

(iii) Training for scoring standard of complementary regions with HMM. The HMM was also used in training the complementary regions between ribosomal RNA and snoRNAs (10,20). The probabilities of matches (Watson–Crick matches and GU matches) and mismatches in the complementary region serve as the emission probabilities. After surveying the antisense–target helices of all known snoRNAs, we defined criteria for complementary region evaluation. For box C/D snoRNA genes, the criteria are as follows: (a) we selected the lowest score (13.0 bits) of the complementary region as a cutoff score; (b) the first mismatch next to box D or box D′ did not affect the stability of the complementary helix (the antisense and target; 0 bit was given for the first mismatch next to box D or D′); (c) the maximum mismatch was 1 nt except for the first mismatch; (d) the maximum GU pairs were 3 bp; and (e) the minimum complementary length was 10 nt. For box H/ACA snoRNA genes, the criteria are as follows: (a) we selected the lowest score (16.2 bits) of the complementary region as a cutoff score; (b) the maximum mismatch was 1 nt; (c) the minimum complementary length was 9 nt; (d) the minimum upper stem pairing was 4 bp and maximum bulge in upper stem pairing was 5 nt; and (e) the minimum lower stem pairing was 4 bp and maximum bulge in lower stem pairing was 2 nt.
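
To make the position-specific scoring in (i) concrete, here is a small Python sketch of a zero-order (position-independent transitions) scoring table. The exact emission formula is not reproduced in this text, so the log2-odds form with a pseudo-count used below is an assumption (a common choice for such models), as are the function names and the default pseudo-count. The transition values are the two probabilities quoted above, and base-2 logarithms reproduce the quoted −0.152 and −3.322.

```python
import math
from collections import Counter

BASES = "ACGT"

# Transition log-probabilities for the two box H lengths quoted in the text.
BOX_H_TRANSITION_LOG2 = {
    "ANANNA": math.log2(0.90),   # about -0.152
    "ANANNGA": math.log2(0.10),  # about -3.322
}


def train_emissions(seqs, background, pseudo=0.01):
    """Build one log2-odds column per position from equal-length training motifs.

    Assumed form: log2( ((count + pseudo) / (N + 4 * pseudo)) / background ).
    """
    n_seqs = len(seqs)
    table = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        column = {}
        for b in BASES:
            freq = (counts[b] + pseudo) / (n_seqs + len(BASES) * pseudo)
            column[b] = math.log2(freq / background[b])
        table.append(column)
    return table


def score(seq, table):
    """Sum the per-position emission scores (in bits) for a candidate motif."""
    return sum(column[base] for column, base in zip(table, seq))
```

The resulting table has one row per motif position and one entry per base, matching the n × m matrix described in the text.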

Evaluating box H/ACA secondary structure

For evaluating the hairpin structure, we folded the hairpin II and hairpin I step-by-step using RNAfold (23) and then evaluated the structure with the following standards concluded from surveying all the known H/ACA snoRNAs: (i) maximum distance between H box and hairpin I was 5 nt; (ii) maximum distance between H box and hairpin II was 7 nt; (iii) maximum distance between ACA box and hairpin II was 5 nt; (iv) maximum pocket was 12 nt; (v) maximum bulge in lower and upper stems were 2 and 5 nt, respectively; (vi) minimum stem pairs were 14 bp; (vii) minimum and maximum loop sizes were 3 and 17 nt, respectively; (viii) maximum tail length was 4 nt; and (ix) hairpin MFE was less than −11.0 (kcal/mol) and larger than −43.0 (kcal/mol). For evaluating the hairpin–hinge–hairpin structure, we folded the whole candidate using MFOLD (24), and evaluated the candidate with the following standards concluded from surveying all the known H/ACA snoRNAs: (i) maximum hinge size was 15 nt and (ii) hairpin–hinge–hairpin MFE was less than −29.0 (kcal/mol).
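
The structural standards listed above amount to a set of threshold checks. The Python sketch below encodes them, assuming the features have already been measured from the RNAfold and MFOLD outputs (parsing those outputs is not shown). The class and field names are hypothetical, and treating each "maximum" and "minimum" as inclusive is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass
class Hairpin:
    pocket: int        # size of the recognition pocket (nt)
    lower_bulge: int   # largest bulge in the lower stem (nt)
    upper_bulge: int   # largest bulge in the upper stem (nt)
    stem_pairs: int    # number of base pairs in the stem
    loop: int          # loop size (nt)
    mfe: float         # hairpin minimum free energy (kcal/mol)


@dataclass
class Candidate:
    hp1: Hairpin
    hp2: Hairpin
    h_to_hp1: int      # distance between H box and hairpin I (nt)
    h_to_hp2: int      # distance between H box and hairpin II (nt)
    aca_to_hp2: int    # distance between ACA box and hairpin II (nt)
    tail: int          # tail length (nt)
    hinge: int         # hinge size (nt)
    total_mfe: float   # MFE of the whole hairpin-hinge-hairpin fold (kcal/mol)


def hairpin_ok(hp):
    """Per-hairpin standards (iv) to (vii) and (ix)."""
    return (hp.pocket <= 12
            and hp.lower_bulge <= 2 and hp.upper_bulge <= 5
            and hp.stem_pairs >= 14
            and 3 <= hp.loop <= 17
            and -43.0 < hp.mfe < -11.0)


def candidate_ok(c):
    """All hairpin standards plus the hairpin-hinge-hairpin standards."""
    return (hairpin_ok(c.hp1) and hairpin_ok(c.hp2)
            and c.h_to_hp1 <= 5 and c.h_to_hp2 <= 7 and c.aca_to_hp2 <= 5
            and c.tail <= 4
            and c.hinge <= 15 and c.total_mfe < -29.0)
```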

Selecting candidates using locateGenome program

With the exception of U3, mgU2-25/61, mgU12-22/U4-8, mgU12 and U13, all snoRNAs are located in introns of spliced mRNAs. We investigated the size distribution of all known snoRNA-hosted introns and found their lengths mainly fell into the range of 100–600 nt (Supplementary Figure S4). The length of introns containing C/D RNAs varied from 157 to 44 874 nt (Supplementary Figure S4). The length of introns containing H/ACA RNAs varied from 213 to 103 658 nt (Supplementary Figure S4). The locateGenome program first locates the candidates according to the UCSC AceView gene data of human genomic coordinates and orientations which includes >250 000 alt-splicings. The program then selects C/D candidates which are located within introns <50 000 bp and H/ACA candidates which are located within introns <110 000 bp.
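
The intron-length selection can be sketched as follows in Python. This is not the locateGenome program itself: the function name, the tuple layout for introns, the half-open coordinate convention and the requirement that the candidate lie on the same strand as the host intron are all assumptions made for illustration. Only the two length limits come from the text.

```python
# Maximum host-intron lengths (bp) stated in the text.
MAX_INTRON_LEN = {"C/D": 50_000, "H/ACA": 110_000}


def in_allowed_intron(family, start, end, strand, introns):
    """Return True if the candidate lies inside an intron below the length limit.

    `introns` is an iterable of (start, end, strand) tuples with half-open
    coordinates; the candidate must be fully contained in the intron.
    """
    limit = MAX_INTRON_LEN[family]
    for i_start, i_end, i_strand in introns:
        contained = i_start <= start and end <= i_end
        if contained and i_strand == strand and (i_end - i_start) < limit:
            return True
    return False
```

Under these assumptions, an H/ACA candidate inside a 105 000 bp intron is kept, whereas a C/D candidate in the same intron is rejected.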

Selecting candidates with conservation filter

We surveyed all the training snoRNA genes and found snoRNA sequences were more conserved than their flanking sequences (Figure 2A and B). We then evaluated candidates through the conservation percentage of flanking upstream and downstream 50 bp sequences and the UCSC genome browser high conserved track (25). We evaluated candidates by the following standards concluded from surveying all known snoRNAs:

Figure 2

(A) SnoRNA conserved features on the human UCSC Genome Browser. C/D and H/ACA snoRNAs are colored blue and green, respectively. Conservation track reveals that sequences corresponding to snoRNAs are more highly conserved than those of flanking sequences. (B) A candidate box H/ACA RNA by computational method does not fit the conserved pattern. UCSC conservation track reveals that sequences corresponding to candidate box H/ACA RNA are less highly conserved than those of flanking sequences. Although this candidate can fold into a hairpin–hinge–hairpin–tail structure, its expression cannot be confirmed by northern blot and reverse transcription.

The IDhp1 and IDhp2 are the conservation percentages of hairpins 1 and 2, IDup and IDdown are the conservation percentages of flanking upstream and downstream 50 bp sequences.

RNA isolation and northern blot

Total cellular RNA was isolated from adult male rat tissues and purified according to the method of guanidine thiocyanate/phenol–chloroform (26). Small RNA was purified from total RNA according to the PEG8000/NaCl protocol (27). Small RNA (25 μg/lane) from brain, thymus, heart, lung, liver, spleen, kidney and testis was analyzed by electrophoresis on 8% acrylamide/7 mol/l urea gels. Total RNAs were transferred on to nylon membranes (Hybond-N+; Amersham) followed by UV irradiation for 5 min. Hybridization with 5′-labeled probes was performed as described previously (28).

Oligodeoxynucleotides

Oligonucleotides were synthesized and purified by Sangon Co. (Shanghai, China). The sequences of oligonucleotide probes for northern blotting and oligonucleotide primers for cDNA libraries construction and screening are shown in Supplementary Table S1. The probes used in northern blotting were 5′-end-labeled with [γ-32P]ATP (Yahui Co.) and submitted to purification according to standard laboratory protocols as described previously (26).

RESULTS

The strategy and efficiency of the CDseeker program for screening of mammalian C/D snoRNA genes

In this study, we developed a computational program, CDseeker, specific to the screening of box C/D snoRNAs that are conserved in human–mammal genomes (human/mouse, human/rat, human/dog and human/cow). The CDseeker program searched and scored the conserved motifs, terminal stem pairing and antisense in a consensus alignment sequence. The output candidates were assigned as ‘guide’ and ‘orphan’ C/D RNAs according to the score assignment. The whole procedure is outlined in Figure 1B and described in detail in Materials and Methods. To date, 252 C/D snoRNAs, including 119 imprinted C/D snoRNAs and excluding U3, have been identified from the human genome through experimental and conservative (homologous to mouse snoRNAs) detections (17). To test the sensitivity of this computational program on C/D snoRNA genes in human, we chose 133 C/D snoRNA genes and 6 imprinted C/D snoRNA genes as a dataset for training the CDseeker program (if the imprinted snoRNA had multiple gene isoforms, only one isoform was selected to avoid the overfit of the program to these large-amount-repeated imprinted snoRNA genes). We mainly focused on the score assignment and feature evaluation of the known C/D snoRNAs with respect to the following elements: conserved box C, box D and box D′ motifs, terminal stem pairing, conserved antisense elements, indel and substitution pattern and conservation of flanking sequence (Figure 2A). We then applied CDseeker to the training set of snoRNA gene alignments, which were extracted from four human–mammalian WGAs. As a result, 124 out of 139 known human snoRNA genes (90%) including 6 imprinted snoRNA genes were detected by CDseeker. Fifteen box C/D snoRNA genes were missed due to various reasons, including lack of conservation among any human–mammal genome alignments, location within repeatmask regions, lack of conserved motifs or a large length (>120 nt) (Supplementary Table S2). 
The training result showed that most of the orphan C/D snoRNAs had scores >26.53 bits and the guide C/D snoRNAs had scores >39.0 bits (Figure 3A and B). We therefore established the cutoff scores of 26.53 bits for orphan C/D snoRNAs and 39.0 bits for guide C/D snoRNAs to identify novel candidates of human C/D snoRNA genes.
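
How the two cutoffs could be used to label a candidate is sketched below in Python. The decision logic is an assumption: the source gives the cutoff values and states that the antisense examination is optional, but it does not spell out the rule, for instance whether a candidate with an antisense element that falls below the guide cutoff is reconsidered as an orphan, as done here. The function name is hypothetical.

```python
ORPHAN_CUTOFF_BITS = 26.53  # cutoff for orphan C/D snoRNAs
GUIDE_CUTOFF_BITS = 39.0    # cutoff for guide C/D snoRNAs


def classify_cd(total_score, has_antisense):
    """Label a C/D candidate from its summed score (bits).

    Returns "guide", "orphan", or None when the candidate is rejected.
    """
    if has_antisense and total_score > GUIDE_CUTOFF_BITS:
        return "guide"
    if total_score > ORPHAN_CUTOFF_BITS:
        return "orphan"
    return None
```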

Figure 3

Computational identification of box C/D snoRNAs. (A) The distribution of CDseeker scores for known ‘orphan’ C/D snoRNA genes. (B) The distribution of CDseeker scores for known guide C/D snoRNA genes. (HS, MM, RN, CF and BT are abbreviations of human, mouse, rat, dog and cow, respectively. HS-MM represents human–mouse WGA sequences.)

The strategy and efficiency of the ACAseeker program for screening of mammalian H/ACA snoRNA genes

We also developed another computational program, ACAseeker, specific to the screening of box H/ACA snoRNAs that are conserved in human–mammal genomes. Similar to CDseeker, ACAseeker first searched conserved motifs in a consensus alignment sequence before evaluating the secondary structure features of box H/ACA snoRNAs. The program then searched and scored the functional antisenses to define the candidates as ‘guide’ H/ACA RNAs or ‘orphan’ H/ACA RNAs. The whole procedure is outlined in Figure 1D and described in detail in Materials and Methods. The same strategy as in CDseeker was applied to test the sensitivity of this computational program on human H/ACA snoRNA genes. In brief, 100 H/ACA snoRNA genes, which were previously identified through experimental and conservative (homologous to mouse snoRNAs) detection (17), were used to serve as a dataset for training the ACAseeker program. The score assignment and feature evaluation of the known H/ACA snoRNAs were considered according to the following elements: box H, box ACA motifs, secondary structure, minimum free energy, conserved antisense elements, indel and substitution pattern and conservation of flanking sequence (Figure 2A). ACAseeker was then applied to the training set of snoRNA gene alignments which were extracted from four human–mammalian WGAs. The test results showed that 75 out of 100 known box H/ACA snoRNA genes (75%) were detected by ACAseeker. The remaining 25 box H/ACA RNAs were missed due to reasons similar to those box C/D genes undetected in the CDseeker program (Supplementary Table S3). We found that most of known orphan H/ACA snoRNAs and guide H/ACA snoRNAs had a standard hairpin–hinge–hairpin–tail structure, H box score >0.7 bits, ACA box score >1.63 bits and target score >16.2 bits for guide H/ACA snoRNA. These results provided the cutoff scores and standards for the searching of novel candidates of human H/ACA snoRNA genes.

A genome-wide search of mammalian snoRNA genes with snoSeeker identifies 37 novel human snoRNA genes and 17 novel isoforms of known snoRNA genes

After the training tests of the two programs on known snoRNA genes, we applied the programs to the human genome for an overall search for snoRNA genes of the two families. The whole procedure is outlined in Figure 4A and B.

Figure 4

Flowchart of the CDseeker and ACAseeker algorithms. (A) The flowchart of the CDseeker algorithm is divided into three main stages. The initial stage is a scan of the four WGA sequences by the CDseeker core program. The second stage is location of the genome using the locateGenome program. The final stage is to intersect the four results and filter the candidate sequence with an evolution conservation pattern. The number of known snoRNAs found at different stages of analysis is shown in parentheses. (B) The flowchart of the ACAseeker algorithm is divided into three main stages. The initial stage is a scan of the four high WGA sequences by the ACAseeker core program. The second stage is location of the genome using the locateGenome program. The final stage is to intersect the four results and filter the candidate sequence with an evolution conservation pattern. The number of known snoRNAs found at different stages of analysis is shown in parentheses.

To search for C/D snoRNA genes, we applied CDseeker to four human–mammal WGA sequences (Figure 4A). About 300 candidates were obtained from the analysis of each human–mammal WGA. The second step was to locate the candidates in the human genome with the locateGenome program as described in detail in Materials and Methods. Only candidates located within introns whose lengths were <50 000 nt were accepted, and the number of candidates for each human–mammal WGA was reduced to <200 after this step. Finally, we intersected the results of the four human–mammal WGA analyses and applied another filter to the results to eliminate false positive candidates. The filter focused on the conservation of the flanking sequences of the candidates. In total, 212 C/D candidates including 86 orphan candidates were computationally identified from the human genome (Supplementary Table S5; sequences and annotated alignments are available online). Of these candidates, 191 were previously identified snoRNA genes and the remaining 21 candidates, including 5 novel orphan genes and 2 novel isoforms of known orphan snoRNA genes, were novel snoRNA candidates. We similarly applied ACAseeker to search for the other family of snoRNA genes, H/ACA genes, in the human genome. As H/ACA snoRNAs are structural RNAs with fewer successive mutations and indels, we used the conserved alignment sequences as source data to reduce the running time. The ACAseeker program was then applied to the conserved alignment sequences in each of the four WGAs (Figure 4B). About 100 candidates were obtained from each dataset of human–mammal WGA. With the same strategy applied in CDseeker, we located the H/ACA candidates in the human genome. All the candidates within introns with lengths <110 000 nt were selected. We also intersected the four human–mammal WGA results and applied another filter in a manner similar to the search of C/D snoRNA genes with CDseeker. 
In total, 108 human H/ACA candidates including 11 novel orphan genes and 8 novel isoforms of known orphan snoRNAs were obtained (Supplementary Table S5; sequences and annotated alignments are available online). Among them, 75 candidates were previously identified snoRNA genes and the remaining 33 candidates, including 13 orphan ones, were novel candidates. In summary, a total of 54 novel candidates were reported in this study (Tables 1 and 2, and Supplementary Table S4). In addition, >75% of the known snoRNA genes of the two families were detected in our computational scans. More importantly, a large number of orphan snoRNAs, including 26 novel orphan candidates and 94 known orphan genes, were detected with the programs developed in this study, showing the efficiency of the programs as snoRNA screening tools. In addition to detecting snoRNA genes, these two programs can also be applied to predict the guiding functions of snoRNAs. Twenty-four novel guiding functions were presented in this study, with the sequence pairing of the guide RNAs and the target RNAs shown in Supplementary Figure S5. Along with the previous results (29–32), 107 out of 110 ribosomal RNA 2′-O-methylations and 17 out of 25 spliceosomal RNA 2′-O-methylations in human have been assigned to C/D guide RNAs; 86 out of 97 ribosomal RNA pseudouridines and 20 out of 32 spliceosomal RNA pseudouridines in human have been assigned to H/ACA guide RNAs. Our results provide a more comprehensive understanding of the post-transcriptional modification-guided RNA machinery in the human genome.
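
The counts reported in this section can be cross-checked with a few lines of Python. All numbers are taken from the text; the variable names are illustrative, and the percentages are simple ratios computed here rather than values reported by the authors.

```python
# (total candidates, previously known, novel) per family, as reported above.
candidates = {"C/D": (212, 191, 21), "H/ACA": (108, 75, 33)}

for total, known, novel in candidates.values():
    assert known + novel == total

known_total = sum(known for _, known, _ in candidates.values())
novel_total = sum(novel for _, _, novel in candidates.values())

# Modification sites assigned to guide RNAs: (assigned, all known sites).
assigned = {
    "rRNA 2'-O-methylation": (107, 110),
    "snRNA 2'-O-methylation": (17, 25),
    "rRNA pseudouridine": (86, 97),
    "snRNA pseudouridine": (20, 32),
}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


coverage = {name: percent(a, n) for name, (a, n) in assigned.items()}

print(known_total, novel_total)
print(coverage)
```

The totals agree with the 266 known genes and 54 novel candidates stated in the abstract.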

Table 1

Novel box C/D snoRNA genes

snoRNA name	Len	Location	Exp.	Homology	Modification	Antisense element	Host gene/comments
Novel guide
SNORD117	94	chr3:52699794–52699886	N.blot	MM RN CF BT	18S-Gm683	15 nt (3′)	GNL3
SNORD118	101	chr14:44649828–44649928	N.blot	MM RN CF BT	18S-Gm1447	11 nt (3′)	PRPF39
SNORD119	96	chr20:2391598–2391693	N.blot	MM RN CF BT	28S-Am4560	16 nt (3′)	SNRPB
SNORD120	84	chrX:20064093–20064185	N.blot	MM RN BT	U2-Am30	13 nt (3′)	EIF1AX
SNORD121A	91	chr9:33942762–33942852	N.blot	MM RN CF	28S-Gm4607	12 nt (5′)	UBAP2
SNORD121B	93	chr9:33924286–33924378	N.blot	MM RN CF	28S-Gm4607	12 nt (5′)	UBAP2
Novel isoform
SNORD41B	94	chr19:12675401–12675494	Iso(U41)	MM RN CF BT	28S-Um4276	14 nt (3′)	TNPO2
SNORD12B	103	chr20:47330257–47330359	Iso(HBII-99)	MM RN CF BT	28S-Gm3878	13 nt (5′)	HSUP1
SNORD111B	94	chr16:69120906–69120999	Iso(HBII-82)	MM CF BT	28S-Gm3923	16 nt (5′)	SF3B3
SNORD58C	79	chr18:45269603–45269692	Iso(U58)	RN BT	28S-Um4197	15 nt (5′)	RPL17
SNORD11B	112	chr2:202864285–202864396	Iso(HBII-95)	MM RN CF BT	18S-Gm509	13 nt (3′)	NOP5/NOP58
SNORD105B	92	chr19:10081425–10081516	Iso(U105)	MM RN CF	18S-Um799	15 nt (3′)	P2Y11
Novel orphan
SNORD122	100	chr2:29004342–29004441	N.blot	MM RN CF BT	Unknown	Unknown	WDR43
SNORD123	88	chr5:9601939–9602026	N.blot	MM RN CF BT	Unknown	Unknown	Hs.34447
SNORD124	104	chr17:35437321–35437424	N.blot	MM RN CF BT	Unknown	Unknown	THRAP4
SNORD125	96	chr22:28059152–28059247	N.blot	CF BT	Unknown	Unknown	AP1B1
SNORD126	99	chr14:19864440–19864538	N.blot	RN CF	Unknown	Unknown	CCNB1IP
Novel isoform
SNORD116@	106	chr15:22881615–22881718	Iso(HBII-85)	MM RN CF BT	Unknown	Unknown	SNURF-SNRNP-UBE3A
SNORD114@	93	chr14:100534548–100534640	Iso(14q(II))	CF BT	Unknown	Unknown	MEG8

‘Iso’: isoform; ‘Len’: length of the snoRNA gene (as the program extends 5′ and 3′ stems by 15 nt, the predicted snoRNAs may be 20 nt larger than corresponding snoRNAs confirmed by northern blot); ‘Exp’: expression status. ‘N.blot’ indicates that the snoRNA was identified by northern blotting analysis in our work. In the column ‘host gene’, the protein-coding host genes are denoted by their symbols. In the column ‘location’, the genomic locations are shown. In the column ‘modification’, a nucleotide with ‘m’ represents the rRNA or snRNA methylation site that is conserved in mammals. HS, MM, RN, CF and BT are abbreviations of human (hg18, March 2006), mouse, rat, dog and cow, respectively.

Table 2

Novel box H/ACA snoRNA genes

snoRNA name	Len. (nt)	Location	Exp.	Homology	Modification	Antisense element	Host gene/comments
Novel guide
SCARNA26	145	chr1:153915523–153915667	N.blot	CF	U4-Ψ78	6 + 7 nt (5′)	YY1AP1
SNORA82	123	chr3:187986808–187986930	N.blot	MM RN	28S-Ψ4491	3 + 7 nt (5′)	EIF4A2
SCARNA27	126	chr6:8031640–8031765	N.blot	CF BT	U4-Ψ4	7 + 3 nt (5′)	EEF1E1
SNORA83A	135	chr7:64168351–64168485	N.blot	RN	18S-Ψ1367	5 + 7 nt (5′)	LOC441242
SNORA83B	135	chr7:64862474–64862608	N.blot	RN	18S-Ψ1367	5 + 7 nt (5′)	LOC441241
SNORA77B	122	chr22:18493925–18494047	N.blot	MM RN CF	18S-Ψ814	6 + 5 nt (5′)	RANBP1
SNORA77A	123	chr1:201965332–201965454	N.blot (Schattner, ACA63)	MM RN CF BT	18S-Ψ814	6 + 5 nt (5′)	ATP2B4
SNORA80B	135	chr7:6023034–6023168	Schattner.	CF	18S-Ψ109-Ψ572	6 + 4 nt (5′), 7 + 4 nt (3′)	JTV1
SNORA80A	136	chr21:32671367–32671502	Schattner. (ACA67)	MM BT	18S-Ψ109-Ψ572	6 + 4 nt (5′), 7 + 4 nt (3′)	C21orf108
SNORA79	140	chr14:80738792–80738931	Schattner. (ACA65)	MM RN	U6-Ψ31-Ψ86	5 + 6 nt (5′), 5 + 7 nt (3′)	GTF2A1
SCARNA21	138	chr17:7750166–7750303	Schattner. (ACA68)	CF	U12-Ψ19	6 + 4 nt (5′)	CHD3
SNORA76	132	chr17:59577431–59577562	Schattner. (ACA62)	MM RN CF BT	18S-Ψ34-Ψ105	7 + 3 nt (5′), 7 + 5 nt (3′)	EST cluster
SCARNA22	132	chr5:82395779–82395910	Gu. (U109)	MM RN CF	U1-Ψ6	4 + 5 nt (3′)	MGC23909
Novel isoform
SNORA58B	134	chr1:152498829–152498962	Iso(ACA58)	RN BT	28S-Ψ3823	5 + 9 nt (5′)	UBAP2L
Novel orphan
SNORA84	133	chr9:94094564–94094696	N.blot (Washietl)	MM RN CF	Unknown	Unknown	IARS
SNORA85	130	chr15:63364852–63364981	N.blot	MM RN CF BT	Unknown	Unknown	PARP16
SNORA86	132	chr7:64163814–64163945	N.blot	MM RN CF BT	Unknown	Unknown	BX649060
SNORA45	127	chr11:8663564–8663690	Gu.(ACA3-2)	MM CF BT	Unknown	Unknown	RPL27A
SNORA12	144	chr10:101986903–101987046	Gu.(U108)	CF	Unknown	Unknown	CWF19L1
Novel isoform
SNORA36C	129	chr2:69600679–69600807	Iso(ACA36)	CF	Unknown	Unknown	AAK1
SNORA38B	132	chr17:63167248–63167379	Iso(ACA38)	BT	Unknown	Unknown	NOL11
SNORA70B	134	chr2:61497883–61498016	Iso(U70)	MM BT	Unknown	Unknown	USP34
SNORA70C	134	chr9:118983166–118983299	Iso(U70)	BT	Unknown	Unknown	ASTN2
SNORA11B	128	chr14:90662523–90662650	Iso(U107)	MM BT	Unknown	Unknown	C14orf159
SNORA11C	128	chrX:47132993–47133120	Iso(U107)	MM CF BT	Unknown	Unknown	ZNF157
SNORA11D	127	chrX:51823183–51823309	Iso(U107)	BT	Unknown	Unknown	MAGED4
SNORA11E	127	chrX:51950458–51950584	Iso(U107)	BT	Unknown	Unknown	MAGED4

‘Iso’: isoform; ‘Len’: length of the snoRNA gene; ‘Exp’: expression status. ‘N.blot’ indicates that the snoRNA was identified by northern blotting analysis in our work. ‘Schattner’, ‘Gu’ and ‘Washietl’ indicate that expression of the snoRNA was confirmed in other works (refs 13, 16, 21). In the ‘Host gene’ column, protein-coding host genes are denoted by their symbols. The ‘Location’ column gives the genomic location. In the ‘Modification’ column, a nucleotide marked with ‘Ψ’ represents the rRNA or snRNA pseudouridine site that is conserved in mammals. HS, MM, RN, CF and BT are abbreviations for human (hg18, March 2006), mouse, rat, dog and cow, respectively.
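For readers who want to work with the ‘Location’ column programmatically, the following is a minimal Python sketch for parsing a location string and deriving its span. The function names `parse_location` and `span_length` are hypothetical and not part of snoSeeker. The sketch assumes that coordinates are 1-based and inclusive, an assumption that reproduces the ‘Len’ value of 145 nt listed for SCARNA26.

```python
import re

# Matches strings such as "chr1:153915523–153915667".
# The separator may be an en dash (as printed in the tables) or a plain hyphen.
_LOCATION = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)[–-](\d+)$")


def parse_location(text):
    """Split a 'chrN:start–end' string into (chromosome, start, end)."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if end < start:
        raise ValueError(f"end precedes start in {text!r}")
    return chrom, start, end


def span_length(text):
    """Length of the interval, assuming 1-based inclusive coordinates."""
    _, start, end = parse_location(text)
    return end - start + 1


if __name__ == "__main__":
    # SCARNA26 row of Table 2
    print(parse_location("chr1:153915523–153915667"))
    print(span_length("chr1:153915523–153915667"))
```

Comparing `span_length` with the ‘Len’ column is a quick consistency check when the table is re-used in downstream analyses.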

Expression of the novel snoRNA genes in different tissues

To further confirm the candidates identified computationally, all novel snoRNA candidates, with the exception of eight candidate H/ACA snoRNAs confirmed by three recently published articles (14,16,33) and 20 isoforms of known snoRNAs, were selected for northern blotting analyses. These included 15 novel snoRNA candidates that were conserved between human and rat (rat was chosen for this analysis to facilitate examination of a diverse range of tissues) and 11 novel snoRNA candidates that were conserved between human and mouse, dog or cow (the human cell lines 293T, HeLa, Jurkat and K562 were used to confirm novel candidates not conserved between human and rat). Total RNAs from eight rat tissues and four human cell lines were used in the experiments. Positive northern blots were obtained for 18 novel snoRNAs (Figure 5A–C). Of these 18, 8 were orphan snoRNAs, 2 of which (SNORD126 and SNORD123) exhibited tissue-specific or restricted expression. SNORD126 was expressed only in thymus and spleen, while SNORD123 was strongly expressed in lung and heart, with weak expression in spleen and testis (Figure 5A). The remaining six orphan snoRNAs were ubiquitously expressed, although their accumulation levels differed between tissues. It has generally been thought that all guide snoRNAs are expressed ubiquitously as housekeeping RNA genes. Surprisingly, our study showed that one guide snoRNA, SNORD121, was specifically expressed in the thymus (Figure 5A). Another guide snoRNA, SNORD120, was strongly expressed in thymus, with lower levels of expression in spleen, lung, kidney and testis.

Figure 5

Northern blotting analysis of the expression patterns of novel snoRNAs. Lane M, molecular weight markers (pBR322 digested with HaeIII and 5′-end-labeled with [γ-32P]ATP). The rat tissue samples are indicated by tissue name. U6 snRNA was analyzed as a control. (A) Expression pattern of novel C/D snoRNAs. (B) Expression pattern of novel H/ACA snoRNAs. (C) Expression pattern of novel snoRNAs in human cell lines. The names of the human cell lines are indicated.

DISCUSSION

Recently, numerous orphan snoRNAs have been experimentally identified in different eukaryotes, notably mammals, suggesting that a large group of snoRNAs with unknown function is still hidden in mammalian genomes. Since the finding of a novel function for the orphan snoRNA HBII-52, research on orphan snoRNAs in mammalian genomes has been increasing. In this study, a computational package, snoSeeker, which includes two programs, CDseeker and ACAseeker, was developed for the screening of both guide and orphan snoRNAs. Using the new programs, we successfully detected 120 orphan snoRNA genes as well as 200 guide snoRNA genes in the human genome. The programs incorporate a new strategy by taking complementary antisense as an optional criterion for candidate detection. Candidates that score above the cutoff and have complementary antisense elements are assigned as guide snoRNAs, while those that score above the cutoff but lack complementary antisense elements by computational evaluation are assigned as orphan snoRNAs. In comparison to the existing programs for snoRNA screening, which essentially focus on searching for guide snoRNAs, our package provides an advanced search engine for an overall analysis of snoRNA genes in mammalian genomes.

It has been well demonstrated that, under selection pressure, functional RNAs [both protein-coding RNAs (ORFs) and non-coding RNAs] exhibit highly conserved sequences between phylogenetically related species. Searching for functional RNAs within conserved intronic and/or intergenic sequences has been a key strategy to systematically identify snoRNAs (16), miRNAs (34,35) and other structural non-coding RNAs (33,36). Generally, non-coding RNAs are under stronger selection pressure than their flanking sequences.
Consistent with this point, we found in the first survey of the training snoRNA set that snoRNA coding regions were more highly conserved than their flanking sequences and resembled a ‘hill’ in the UCSC genome browser (Figure 2A). Our experimental analyses further supported this notion. For example, some false positive candidates, which had highly conserved coding regions and flanking regions, could not be detected by northern blot analysis (Figure 2B and other experimental data not shown). Therefore, we used a conservation filter in the programs to distinguish false positive candidates from authentic candidates based on the difference in conservation between the coding region of a snoRNA and its flanking regions. By using this filter, false positive snoRNA gene candidates were efficiently eliminated in our analyses.

The programs developed in this study were also time-efficient and did not require sophisticated computational equipment. The whole process, including searching the four WGAs of human/mouse, human/rat, human/dog and human/cow, took ∼96 h on a personal computer (3 GHz CPU). Given these modest hardware requirements, the programs can be readily adopted.

Frequently, two or even more snoRNAs are encoded within different introns of the same host gene. Interestingly, some isoforms are located in the flanking introns of the same host gene (37,38). The new isoforms of HBII-95, U105, HBII-99, HBII-82, U58 and U41 predicted by the computational methods developed in this study and the two novel snoRNAs, SNORD121A and SNORD121B, are new examples of intragenic snoRNA duplication.

Most, if not all, of the snoRNAs in mammals are intron-encoded. Their host genes, such as ribosomal protein genes and snoRNA-binding protein genes, are ubiquitously expressed housekeeping genes that are mainly involved in ribosome biogenesis.
Only a few imprinted orphan snoRNAs have been found to have a brain-specific expression pattern (39). Surprisingly, we found in this study that two newly identified guide snoRNAs (SNORD121 and SNORD120) exhibited an obvious tissue-specific or restricted expression pattern (Figure 5A). This observation was further supported by the expression patterns of their host genes, UBAP2 and EIF1AX (in the UCSC Gene Sorter). The increasing number of snoRNAs with tissue-specific expression implies a regulatory role for snoRNAs in gene expression, as has been shown for HBII-52 (8,9).
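The conservation filter discussed above can be illustrated with a short Python sketch. It is not the snoSeeker implementation: the function names, the use of a simple difference of mean per-base conservation scores, and the `min_contrast` threshold of 0.3 are all assumptions made for illustration. The idea is only that an authentic candidate shows a ‘hill’ (a conserved gene body with less conserved flanks), whereas a candidate whose flanks are as conserved as its gene body is treated as a likely false positive.

```python
from statistics import mean


def conservation_contrast(scores, gene_start, gene_end):
    """Mean conservation of the gene body minus mean conservation of the flanks.

    scores     -- list of per-base conservation scores (hypothetical, 0..1)
                  covering the candidate and its flanking regions
    gene_start -- index of the first base of the predicted gene
    gene_end   -- index one past the last base of the predicted gene
    """
    gene = scores[gene_start:gene_end]
    flanks = scores[:gene_start] + scores[gene_end:]
    if not gene or not flanks:
        raise ValueError("need both a gene region and flanking positions")
    return mean(gene) - mean(flanks)


def passes_conservation_filter(scores, gene_start, gene_end, min_contrast=0.3):
    """True if the gene body stands out from its flanks (assumed threshold)."""
    return conservation_contrast(scores, gene_start, gene_end) >= min_contrast


if __name__ == "__main__":
    # A 'hill': weakly conserved flanks around a strongly conserved gene body.
    hill = [0.1] * 20 + [0.9] * 80 + [0.1] * 20
    # A 'plateau': gene body and flanks are equally conserved.
    plateau = [0.9] * 120
    print(passes_conservation_filter(hill, 20, 100))     # hill is kept
    print(passes_conservation_filter(plateau, 20, 100))  # plateau is rejected
```

A difference of means is the simplest possible contrast measure; a real filter could equally use a ratio or a windowed statistic, and the threshold would need to be tuned on a training set of known snoRNAs.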

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
