Luca Cozzuto, Mauro Petrillo, Giustina Silvestro, Pier Paolo Di Nocera, Giovanni Paolella.
Abstract
BACKGROUND: Analysis of non-coding sequences in several bacterial genomes has led to the identification of families of repeated sequences able to fold into secondary structures. These sequences have often been claimed to be transcribed and to fulfill a functional role. A previous systematic analysis of a representative set of 40 bacterial genomes produced a large collection of sequences potentially able to fold as stem-loop structures (SLSs). Computational analysis of these sequences was carried out by searching for families of repetitive nucleic acid elements sharing a common secondary structure.
Year: 2008 PMID: 18201379 PMCID: PMC2267715 DOI: 10.1186/1471-2164-9-20
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Table 1. Sequence-based clustering of SLSs.
| Group | SLSs | Clusters | Clustered SLSs | SCRs |
| low-GC Firmicutes | 65,220 | 4 | 105 | 38 |
| | 55,624 | 6 | 182 | 93 |
| | 56,622 | 2 | 32 | 16 |
| | 35,027 | 6 | 149 | 81 |
| | 29,883 | 14 | 178 | 123 |
| | 40,991 | 7 | 317 | 142 |
| | 25,668 | 3 | 173 | 26 |
| | 32,372 | 11 | 275 | 144 |
| | 25,095 | 28 | 825 | 386 |
| Mollicutes | 8,953 | 1 | 21 | 8 |
| | 13,926 | 20 | 372 | 165 |
| high-GC Firmicutes | 54,254 | 9 | 282 | 120 |
| | 83,094 | 29 | 1,721 | 537 |
| | 170,502 | 59 | 2,182 | 636 |
| α-Proteobacteria | 69,899 | 11 | 399 | 219 |
| | 14,933 | 19 | 797 | 383 |
| β-Proteobacteria | 214,459 | 26 | 2,009 | 470 |
| | 188,237 | 30 | 1,513 | 518 |
| | 158,592 | 52 | 7,212 | 4,602 |
| | 56,605 | 44 | 3,595 | 991 |
| γ-Proteobacteria | 86,339 | 12 | 1,152 | 431 |
| | 25,055 | 3 | 39 | 25 |
| | 31,209 | 1 | 24 | 8 |
| | 206,492 | 9 | 526 | 129 |
| | 175,088 | 75 | 3,640 | 1,352 |
| | 90,027 | 8 | 177 | 116 |
| | 91,844 | 7 | 157 | 94 |
| | 45,824 | 7 | 250 | 122 |
| | 78,372 | 20 | 600 | 279 |
BLAST-MCL clustering of SLSs identified in a representative set of bacterial genomes, as described in Petrillo et al. [18]. Only species with at least one cluster of a minimum of 7 elements are listed. For each species, the number of elements in the starting population, the number of clusters and the number of clustered SLSs are reported. The number of SLS-containing regions (SCRs), obtained by fusing overlapping clustered SLSs, is also reported.
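Fusing overlapping clustered SLSs into SCRs amounts to merging overlapping intervals. The following is a minimal sketch, not the authors' implementation: it assumes each SLS is given as a hypothetical `(start, end)` pair of inclusive coordinates on a single replicon and ignores strand; the name `merge_into_scrs` is illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates (assumed layout)


def merge_into_scrs(sls: List[Interval]) -> List[Interval]:
    """Fuse overlapping SLS intervals into SLS-containing regions (SCRs)."""
    scrs: List[Interval] = []
    for start, end in sorted(sls):
        if scrs and start <= scrs[-1][1]:
            # The interval overlaps the last region: extend that region.
            scrs[-1] = (scrs[-1][0], max(scrs[-1][1], end))
        else:
            # No overlap: open a new region.
            scrs.append((start, end))
    return scrs
```

Sorting first means each interval only has to be compared with the most recently opened region, so the merge runs in a single pass after the sort.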
Figure 1. Fraction of sequence elements positive to the RANDFOLD test. The RANDFOLD test was run on groups of clustered SLSs (panel A), total SLSs (panel B) and random sequences (panel C) from the 29 genomes listed in Table 1. The fraction of elements scoring positive at the indicated probability is plotted. Standard deviation bars are shown in panels B and C.
Table 2. Regrouping of SLS clusters.
| 4 | 3 | 2 | 2 | |||
| 6 | 6 | 4 | 3 | 1 | ||
| 2 | 2 | 1 | 1 | 1 | ||
| 6 | 2 | 1 | 1 | |||
| 14 | 13 | 10 | 6 | 3 | ||
| 7 | 5 | 3 | 3 | 1 | ||
| 3 | 3 | 2 | 2 | 1 | ||
| 11 | 7 | 5 | 4 | |||
| 28 | 22 | 13 | 9 | 6 | ||
| 1 | 1 | 1 | 1 | |||
| 20 | 20 | 18 | 12 | |||
| 9 | 7 | 5 | 4 | 1 | ||
| 29 | 18 | 11 | 5 | |||
| 59 | 36 | 21 | 15 | 3 | ||
| 11 | 7 | 5 | 4 | |||
| 19 | 6 | 4 | 4 | |||
| 26 | 8 | 5 | 4 | |||
| 30 | 16 | 10 | 5 | 4 | ||
| 52 | 28 | 16 | 4 | 3 | ||
| 44 | 9 | 7 | 6 | |||
| 12 | 8 | 6 | 6 | 2 | ||
| 3 | 1 | 1 | 1 | |||
| 1 | 1 | 1 | 1 | |||
| 9 | 5 | 4 | 4 | |||
| 75 | 35 | 26 | 14 | 4 | 2 | |
| 8 | 4 | 3 | 3 | 2 | ||
| 7 | 6 | 4 | 4 | 1 | ||
| 7 | 7 | 5 | 4 | 2 | ||
| 20 | 15 | 11 | 5 | 2 | ||
Clusters reported in Table 1 were regrouped according to sequence similarity, strand reciprocity and relative genomic position of their elements, as described in Methods. The number of groups obtained by each criterion is reported in the three columns labelled "Grouped by". Several groups are composed of sequences contained within ISs or rRNA genes; their number is shown, for each genome, in the last two columns.
Table 3. Families of SLS containing repeated sequences.
| Family | This work | This work | Literature | Literature | Literature | Type | Notes |
| | size | copies | size | copies | ref. | | |
| Bant-1 | 72 | 104(29) | | | | I | |
| Bcr1 | 167 | 31(21) | 147 | 12 | [24] | I | |
| Bhal-1 | 74 | 36(32) | | | | I | |
| Bhal-2 | 76 | 50(41) | | | | I | contains CRISPR repeats |
| Clop-1 | 93 | 44(28) | | | | I | |
| Clot-1 | 74 | 19(16) | | | | I | |
| Clot-2 | 31 | 34(32) | | | | | contains CRISPR repeats |
| Clot-3 | 90 | 24(17) | | | | I | contains CRISPR repeats |
| Efa-1 | 163 | 65(18) | | | | I | |
| Efa-2 | 292 | 11(9) | | | | G | |
| Lac-1 | 231 | 34(6) | | | | G | |
| Sta-1 | 105 | 25(25) | | | | I | |
| Sta-2 | 460 | 9(8) | | | | S | |
| Sta-3 | 136 | 24(15) | | | | I | |
| Sta-4 | 99 | 46(27) | | | | I | |
| BOX | 84 | 205(105) | 100–200 | 127 | [25] | I | |
| RUP | 63 | 110(99) | 108 | 54 | [26] | I | |
| Stre-1 | 45 | 241(225) | | | | G | |
| Bru-RS | 118 | 222(69) | 103–105 | 35–40 | [27] | I | |
| Rpe-4 | 100 | 97(74) | 95 | 94 | [28] | I | |
| Rpe-5 | 115 | 45(35) | 115 | 55 | [28] | I | |
| Rpe-6 | 108 | 123(74) | 136 | 168 | [28] | | |
| Rpe-7 | 123 | 186(144) | 99 | 223 | [28] | | |
| Myg-1 | 259 | 10(7) | | | | I | |
| Myp-1 | 143 | 25(18) | | | | G | part of REPMP1 repeat |
| Myp-2 | 158 | 42(16) | | | | G | part of REPMP4 repeat |
| Myp-3 | 558 | 11(8) | | | | G | part of REPMP5 repeat |
| Myp-4 | 364 | 8(7) | | | | G | part of REPMP5 repeat |
| Myp-5 | 426 | 8(8) | | | | G | part of REPMP5 repeat |
| Myp-6 | 468 | 11(11) | | | | G | part of REPMP2/3 repeat |
| Myp-8 | 674 | 9(9) | | | | G | part of REPMP2/3 repeat |
| Myp-9 | 226 | 9(9) | | | | G | part of REPMP2/3 repeat |
| Myp-10 | 330 | 12(12) | | | | G | part of REPMP2/3 repeat |
| Myp-7 | 131 | 42(22) | | | | G | |
| Cod-1 | 140 | 17(16) | | | | I | |
| Cod-2 | 32 | 43(39) | | | | G | |
| Cod-3 | 170 | 23(20) | | | | | |
| Cod-5 | 74 | 35(29) | | | | I | |
| Myt-1 | 72 | 75(70) | | | | | |
| Myt-2 | 115 | 769(223) | | | | G | located within PE genes |
| Myt-3 | 81 | 81(77) | | | | G | located within PE genes |
| Myt-4 | 83 | 196(68) | | | | G | located within PE genes |
| Myt-5 | 71 | 41(2) | | | | G | contains CRISPR repeats |
| Myt-7 | 136 | 278(68) | | | | G | located within PE genes |
| Myt-8 | 92 | 33(25) | | | | | |
| Myt-9 | 67 | 53(15) | | | | | |
| Myt-10 | 154 | 62(59) | | | | G | located within PE genes |
| Myt-11 | 65 | 56(21) | | | | | contains MIRU repeats |
| REPLEP | 740 | 29(9) | 400–880 | 15 | [29] | I | |
| RLEP | 641 | 38(30) | 601–1075 | 37 | [29] | S | |
| Myl-1 | 371 | 7(4) | | | | S | part of LEPREP repeat |
| Myl-2 | 1979 | 9(7) | | | | S | part of LEPREP repeat |
| Bor-1 | 117 | 196(92) | | | | I | |
| Bor-2 | 167 | 17(6) | | | | I | |
| Bor-3 | 134 | 34(32) | | | | G | |
| Bor-4 | 81 | 164(114) | | | | G | |
| Bor-5 | 112 | 135(101) | | | | G | |
| Bor-6 | 147 | 37(31) | | | | G | |
| Bor-1 | 93 | 128(78) | | | | I | |
| ATR | 206 | 14(9) | 183 | 13 | [30] | I | |
| Nem-2 | 341 | 11(7) | | | | | |
| Nem-3 | 127 | 10(9) | | | | G | |
| Nem-4 | 36 | 412(362) | | | | I | contains DUS repeats |
| dRS3 | 33 | 755(708) | 20 | 770 | [30] | I | |
| NEMIS | 46 | 262(81) | 106–158 | 250 | [13] | I | |
| Rep2 | 65 | 22(18) | 59–154 | 26 | [30] | I | |
| Pam-1 | 155 | 12(12) | | | | S | contains DUS repeats |
| BoxC | 50 | 22(20) | 56 | 32 | [31] | | |
| Eco-1 | 734 | 9(7) | | | | G | |
| ERIC | 140 | 19(19) | 127 | 21 | [32] | S | |
| PU-BIME | 108 | 301(199) | 40 | 485 | [31] | | |
| Hin-1 | 31 | 53(51) | | | | I | contains DUS repeats |
| Pae-1 | 84 | 133(61) | | | | I | |
| Pae-2 | 287 | 65(24) | | | | G | |
| Pae-3 | 220 | 16(13) | | | | G | |
| Pae-4 | 52 | 41(35) | | | | | |
| Ppu-1 | 617 | 39(28) | | | | I | |
| Ppu-2 | 2056 | 10(8) | | | | S | |
| Ppu-3 | 251 | 27(23) | | | | G | |
| Ppu-4 | 81 | 41(24) | | | | I | |
| Ppu-9 | 124 | 57(31) | | | | I | |
| REP | 39 | 588(496) | 30 | 804 | [33] | I | |
| PU-BIME | 43 | 146(126) | 40 | 100 | [31] | I | |
| PU-BIME* | 80 | 59(37) | 40 | >100 | [31] | | |
| PU-BIME | 78 | 142(94) | 40 | 82 | [31] | | |
| Sal-1 | 115 | 27(17) | | | | I | |
| Sal-2 | 120 | 33(3) | | | | G | contains CRISPR repeats |
| ERIC | 103 | 97(66) | 127 | 80 | [31] | I | |
| Vic-1 | 184 | 14(1) | | | | I | |
| ERIC | 115 | 241(128) | 69–127 | 167 | [16] | I | |
| YPAL | 168 | 101(68) | 169 | 30 | [17] | I | |
| YPAL* | 136 | 26(13) | 130 | 10 | [17] | I | |
The final set of 92 families of repeated sequences is reported, grouped by species. For each family, the length of the model and the number of sequences fitting the model are given. The number of complete sequences, i.e. those covering the model from end to end, is reported in parentheses. Previously described sequence families are named in column "Family" according to the current literature; for each of them, the number and typical size of its members are also provided, together with references. For novel families, a systematic name was built by fusing a shortened species name to a progressive number. In column "Type", I, G and S indicate the prevalent genomic location of the members of each family: within intergenic, genic or border-spanning sequences, respectively. For some families, small previously described sequence motifs contribute to the formation of a substantially larger model; for others, the members are frequently located within larger previously described sequences. In both cases, a note is reported in the rightmost column.
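Each "copies" cell packs two counts into one string: the number of sequences fitting the model, followed in parentheses by the number of complete sequences. The sketch below shows one way to split such a cell. The names `Copies` and `parse_copies` are hypothetical, and the tolerance for stray spaces (as in `65 (18)`) is an assumption about the input rather than part of the original table format.

```python
import re
from typing import NamedTuple


class Copies(NamedTuple):
    total: int     # sequences fitting the family model
    complete: int  # sequences covering the model from end to end


# Matches cells such as "104(29)" or "65 (18)".
_COPIES_RE = re.compile(r"^\s*(\d+)\s*\(\s*(\d+)\s*\)\s*$")


def parse_copies(cell: str) -> Copies:
    """Split a 'total(complete)' table cell into its two counts."""
    match = _COPIES_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized copies cell: {cell!r}")
    total, complete = int(match.group(1)), int(match.group(2))
    if complete > total:
        # Complete copies are a subset of all copies fitting the model.
        raise ValueError(f"complete count exceeds total in {cell!r}")
    return Copies(total, complete)
```

For example, `parse_copies("104(29)")` yields a total of 104 with 29 complete copies, so roughly 28% of that family's members span the whole model.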
Table 4. Secondary structure prediction analysis of families.
| Family | P | Conserved structure | Conserved SLS position | SLS folding aptitude | Type |
| Bcr1 | 0.99 | s | + | + | I |
| Bhal-1 | 0.98 | s | + | ++ | I |
| Bhal-2 | 0.99 | c | | - | I |
| Clop-1 | 0.96 | s | + | + | I |
| Clot-1 | 0.95 | s | + | ++ | I |
| Efa-1 | 0.85 | s | + | +++ | I |
| Efa-2 | 1.00 | s | + | - | G |
| Lac-1 | 0.97 | c | +° | - | G |
| Sta-1 | 0.84 | s | + | +++ | I |
| Sta-2 | 1.00 | s | + | ++ | S |
| Sta-3 | 0.97 | s | + | + | I |
| Bru-RS | 0.98 | s | + | + | I |
| Rpe-4 | 0.73 | s | + | - | I |
| Rpe-5 | 1.00 | s | + | + | I |
| Rpe-6 | 0.45 | - | +° | + | |
| Rpe-7 | 0.99 | s | + | ++ | |
| Myg-1 | 0.06 | - | +° | - | I |
| Myp-1 | 0.00 | - | +° | - | G |
| Myp-2 | 0.95 | s | + | ++ | G |
| Myp-3 | 0.89 | s | + | - | G |
| Myp-4 | 0.09 | - | +° | - | G |
| Myp-5 | 0.74 | s | + | - | G |
| Myp-6 | 0.55 | c | | - | G |
| Myp-7 | 0.67 | s | + | - | G |
| Cod-1 | 0.97 | s | + | +++ | I |
| Cod-2 | 0.98 | s | | - | G |
| Cod-3 | 0.99 | s | + | +++ | |
| Myt-1 | 0.74 | s | + | +++ | |
| Myt-8 | 0.90 | s | + | ++ | |
| REPLEP | 1.00 | c | +° | - | I |
| RLEP | 1.00 | s | + | ++ | S |
| Myl-1 | 0.61 | s | + | ++ | S |
| Myl-2 | 0.97 | s | + | + | S |
| Bor-1 | 0.86 | s | + | ++ | I |
| Bor-2 | 1.00 | s | + | - | I |
| Bor-1 | 0.93 | s | + | ++ | I |
| ATR | 1.00 | s | + | - | I |
| Nem-2 | 0.93 | s | + | + | |
| Nem-4 | 0.93 | s | + | +++ | I |
| dRS3 | 0.98 | c | | - | I |
| NEMIS | 1.00 | s | + | + | I |
| Rep2 | 0.98 | s | + | + | I |
| Pam-1 | 0.96 | s | + | +++ | S |
| BoxC | 0.99 | c | +° | - | |
| Eco-1 | 0.18 | - | +° | - | G |
| ERIC | 0.94 | s | + | ++ | S |
| PU-BIME | 0.94 | s | + | + | |
| Hin-1 | 0.96 | s | + | + | I |
| Pae-1 | 0.97 | s | + | ++ | I |
| Pae-3 | 0.26 | - | +° | - | G |
| Pae-4 | 0.93 | s | + | ++ | |
| Ppu-1 | 0.97 | s | + | + | I |
| Ppu-2 | 1.00 | s | + | +++ | S |
| Ppu-4 | 0.95 | s | + | - | I |
| Ppu-9 | 0.54 | s | + | - | I |
| PU-BIME | 0.97 | c | | - | I |
| PU*-BIME | 0.98 | s | + | - | |
| PU-BIME | 0.98 | s | + | - | |
| Sal-1 | 0.94 | c | | - | I |
| Sal-2 | 1.00 | c | | - | G |
| ERIC | 0.90 | s | + | - | I |
| YPAL | 1.00 | s | + | +++ | I |
| YPAL* | 0.96 | c | | - | I |
The ability to form a consensus secondary structure was evaluated by RNAz: the prediction scores are reported in column "P" for each family. The type of predicted structure is indicated in column "conserved structure", where "s" indicates a stem-loop based structure, while "c" indicates a more complex structure, where a stem-loop compatible with the original search is not present. For each family, the aligned localization of the original SLSs is indicated by '+' in column "conserved SLS position"; when SLS alignment is not in agreement with the RNAz prediction, a '°' is added to the '+' symbol. The column marked "SLS folding aptitude" reports the behavior of family elements in the RANDFOLD test: the number of '+' symbols describes the percent of positive elements ('+++' if 90% or above; '++' if 70–90%; '+' if 50–70%; '-' if less than 50%). The localization of family members, as already described in Table 3, is also reported in the last column.
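The "SLS folding aptitude" symbols follow a simple threshold rule on the percentage of family elements that score positive in the RANDFOLD test. A minimal sketch of that mapping is given below. The function name `folding_aptitude` is hypothetical, and because the legend's ranges share their end points (70% and 50%), assigning each boundary value to the higher class is an assumption; only "90% or above" is stated explicitly.

```python
def folding_aptitude(percent_positive: float) -> str:
    """Map the percent of RANDFOLD-positive elements to the legend symbol."""
    if not 0.0 <= percent_positive <= 100.0:
        raise ValueError("percent_positive must lie between 0 and 100")
    if percent_positive >= 90.0:
        return "+++"
    if percent_positive >= 70.0:  # boundary assigned upward (assumption)
        return "++"
    if percent_positive >= 50.0:  # boundary assigned upward (assumption)
        return "+"
    return "-"
```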
Table 5. Structural properties of the described SLS families in relation to genomic location.
| | Sec. Struct. + | Sec. Struct. + | Sec. Struct. - | Sec. Struct. - | |
| Genomic location | SLS + | SLS - | SLS + | SLS - | Total |
| Genic | 5 | 4 | 4 | 17 | 30 |
| Border spanning | 7 | 0 | 0 | 0 | 7 |
| Intergenic | 25 | 6 | 1 | 9 | 41 |
| Others | 9 | 1 | 1 | 3 | 14 |
| Total | 46 | 11 | 6 | 29 | 92 |
Columns under "Sec. Struct. +/-" report the number of families, characterized by the presence or absence of a conserved secondary structure predicted by RNAz; the labels "SLS +/-" indicate the presence or absence of aligned SLSs; "Total" means the sum of rows or columns.
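The row and column totals can be recomputed directly from the four count columns of the table. The sketch below does so, with the counts copied from the table and illustrative variable names; the grand total should equal the 92 families of the final set.

```python
# Counts per genomic location, in column order:
# (Sec. Struct. +, SLS +), (Sec. Struct. +, SLS -),
# (Sec. Struct. -, SLS +), (Sec. Struct. -, SLS -)
counts = {
    "Genic": [5, 4, 4, 17],
    "Border spanning": [7, 0, 0, 0],
    "Intergenic": [25, 6, 1, 9],
    "Others": [9, 1, 1, 3],
}

# Sum across each row (one total per genomic location).
row_totals = {location: sum(values) for location, values in counts.items()}

# Sum down each column (one total per structure/SLS combination).
col_totals = [sum(column) for column in zip(*counts.values())]

grand_total = sum(row_totals.values())
print(row_totals, col_totals, grand_total)
```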
Figure 2. Alignment of ERIC, Pae-1, Sta-1 and Efa-1 family members. (A) A representative set of elements from each family was aligned using the HMM model as a guide. In each panel, one row corresponds to one family member (indicated on the right with its genomic position). Within each row, sequence conservation is indicated by increasing gray levels and gaps by dotted spaces; overlapping SLSs are shown as red and blue lines, the red ones indicating SLSs used to define the original HMM model for the family, the blue ones all the others. Darker colors indicate SLS folding aptitude, i.e. positivity to RANDFOLD for P <= 0.005. Common secondary structures predicted by RNAz are reported at the bottom, just above the ruler in nucleotides: green triangles indicate stems produced by pairing complementary regions on the same strand as the identified SLSs, while brown triangles indicate stems from the opposite strand. The boxed regions highlight areas where aligned SLSs and predicted structures are in agreement. (B) Graphic representation of the RNAz predicted secondary structures.
Figure 3. Alignment of PU-BIME, dRS3 and Myt-10 family members. Panels A and B legends are as in Figure 2.
Figure 4. Schematic representation of the overall procedure.