Abstract
BACKGROUND: For many RNA molecules, secondary structure rather than primary sequence is the evolutionarily conserved feature. No programs have yet been published that allow searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure.
Year: 2003 PMID: 14499004 PMCID: PMC239859 DOI: 10.1186/1471-2105-4-44
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example SCFG architecture. The sequence at the top folds into the specified secondary structure. At the bottom, the nodal architecture of the model that would produce this sequence is shown. Shaded triangles represent base pair emitting nodes, and point to the base pair they emit. Open triangles represent single nucleotide emitting nodes, and point to the nucleotide they emit.
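The nodal architecture described in this caption can be sketched in code. The block below is a minimal illustration, not the RSEARCH implementation: it assumes the structure is given as a dot-bracket string, and the function names `parse_pairs` and `guide_tree` are hypothetical. The rule for choosing between left-emitting and right-emitting nodes (consume unpaired positions on the left first) is also an assumption made for this example.

```python
def parse_pairs(structure):
    """Map each paired position to its partner for a dot-bracket string."""
    partner = {}
    stack = []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            left = stack.pop()
            partner[left] = pos
            partner[pos] = left
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def guide_tree(structure):
    """Return the node types of a guide tree in preorder.

    Node names follow the node-state table (ROOT, MATP, MATL, MATR,
    BIF, BEGL, BEGR, END); the construction order is an assumption.
    """
    partner = parse_pairs(structure)
    nodes = ["ROOT"]

    def build(i, j):
        if i > j:
            nodes.append("END")           # nothing left in this span
        elif partner.get(i) == j:
            nodes.append("MATP")          # i and j form a base pair
            build(i + 1, j - 1)
        elif i not in partner:
            nodes.append("MATL")          # unpaired nucleotide on the left
            build(i + 1, j)
        elif j not in partner:
            nodes.append("MATR")          # unpaired nucleotide on the right
            build(i, j - 1)
        else:
            split = partner[i]            # two adjacent stems: bifurcate
            nodes.append("BIF")
            nodes.append("BEGL")
            build(i, split)
            nodes.append("BEGR")
            build(split + 1, j)

    build(0, len(structure) - 1)
    return nodes


if __name__ == "__main__":
    print(guide_tree("((..))"))
    print(guide_tree("(.)(.)"))
```

The recursion is kept simple for readability; a long query would need an iterative version to stay within Python's recursion limit.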
All possible node-states and their emission scores.
| Node-state | Description | Profile emission score | Single-sequence emission score | Gap class |
| ROOT_S | Start of model | 0 | 0 | M_cl |
| ROOT_IL | Gap in query at left end | | 0 | IL_cl |
| ROOT_IR | Gap in query at right end | | 0 | IR_cl |
| BEGL_S | Start of left branch of bifurcation | 0 | 0 | M_cl |
| BEGR_S | Start of right branch of bifurcation | 0 | 0 | M_cl |
| BEGR_IL | Gap in query at bifurcation | | 0 | IL_cl |
| MATP_MP | Matched base pair | | | M_cl |
| MATP_ML | Match on left side of base pair; gap in target on right | | | DR_cl |
| MATP_MR | Match on right side of base pair; gap in target on left | | | DL_cl |
| MATP_D | Two gaps in target, for each side of base pair | 0 | 0 | DB_cl |
| MATP_IL | Gap in query just after left side of base pair | | 0 | IL_cl |
| MATP_IR | Gap in query just before right side of base pair | | 0 | IR_cl |
| MATL_ML | Match to single nucleotide on left | | | M_cl |
| MATL_D | Gap in target on left | 0 | 0 | DL_cl |
| MATL_IL | Gap in query on left | | 0 | IL_cl |
| MATR_MR | Match to single nucleotide on right | | | M_cl |
| MATR_D | Gap in target on right | 0 | 0 | DR_cl |
| MATR_IR | Gap in query on right | | 0 | IR_cl |
| END_E | End of stem-loop | 0 | 0 | M_cl |
| BIF_B | Bifurcation | 0 | 0 | M_cl |
v is the current state. a is the nucleotide present in the query on the left; b is the nucleotide present in the query on the right. j is any nucleotide in the target for a single nucleotide alignment, while k, l is a base pair in the target for a base pair alignment. g is the background frequency of a nucleotide, and s and s' are the substitution matrices defined in the text. Node-states with an M gap class are in the "mainline" path through the model that an exact match would follow. Node-states with an IL or IR gap class represent a gap in the query sequence, while node-states with a DL, DR, or DB gap class represent gaps in the target sequence.
Parameterization of negative transition scores from gap penalties.
| From class (rows) / To class (columns) | M_cl | IL_cl | DL_cl | IR_cl | DR_cl | DB_cl |
| M_cl | 0 | α/2 | α/2 | α/2 | α/2 | α |
| IL_cl | β + α/2 | β | β + α | β + α | β + α | β + 3α/2 |
| DL_cl | β + α/2 | β + α | β | β + α | β + α | β + α/2 |
| IR_cl | β + α/2 | N.A. | β + α | β | β + α | β + 3α/2 |
| DR_cl | β + α/2 | β + α | β + α | β + α | β | β + 3α/2 |
| DB_cl | 2β + α | 2β + 3α/2 | 2β + α/2 | 2β + 3α/2 | 2β + α/2 | 2β |
α and β are replaced with α' and β' for the specific cases described in the text. The IR_cl to IL_cl transition is never allowed in these models.
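The table above can be read as a lookup from a pair of gap classes to a penalty that is linear in α and β. The sketch below encodes each cell as a (β multiplier, α multiplier) pair, copied from the table as printed. The function names `transition_penalty` and `transition_score` and the numeric values used in the example are hypothetical and are not RSEARCH defaults. The α' and β' substitutions mentioned in the note are not modelled.

```python
GAP_CLASSES = ["M_cl", "IL_cl", "DL_cl", "IR_cl", "DR_cl", "DB_cl"]

# Each cell is (beta multiplier, alpha multiplier); None marks the
# disallowed IR_cl -> IL_cl transition. Rows are "from", columns are "to".
PENALTY_TABLE = {
    "M_cl":  [(0, 0.0), (0, 0.5), (0, 0.5), (0, 0.5), (0, 0.5), (0, 1.0)],
    "IL_cl": [(1, 0.5), (1, 0.0), (1, 1.0), (1, 1.0), (1, 1.0), (1, 1.5)],
    "DL_cl": [(1, 0.5), (1, 1.0), (1, 0.0), (1, 1.0), (1, 1.0), (1, 0.5)],
    "IR_cl": [(1, 0.5), None,     (1, 1.0), (1, 0.0), (1, 1.0), (1, 1.5)],
    "DR_cl": [(1, 0.5), (1, 1.0), (1, 1.0), (1, 1.0), (1, 0.0), (1, 1.5)],
    "DB_cl": [(2, 1.0), (2, 1.5), (2, 0.5), (2, 1.5), (2, 0.5), (2, 0.0)],
}


def transition_penalty(from_class, to_class, alpha, beta):
    """Return the penalty for a transition, or None if it is not allowed."""
    cell = PENALTY_TABLE[from_class][GAP_CLASSES.index(to_class)]
    if cell is None:
        return None
    beta_mult, alpha_mult = cell
    return beta_mult * beta + alpha_mult * alpha


def transition_score(from_class, to_class, alpha, beta):
    """The table holds negative scores, so the score is minus the penalty."""
    penalty = transition_penalty(from_class, to_class, alpha, beta)
    return None if penalty is None else -penalty


if __name__ == "__main__":
    # Illustrative values only.
    alpha, beta = 10.0, 5.0
    for src in GAP_CLASSES:
        print(src, [transition_penalty(src, dst, alpha, beta) for dst in GAP_CLASSES])
```

Storing multipliers rather than precomputed numbers keeps the table independent of any particular choice of α and β.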
Figure 2. The two classes of local alignment. Each example shows how the nodal guide tree best aligns to the target sequence. At the bottom is the RSEARCH output for the alignment. On the left is an example of begin locality, while on the right is an example of end locality. The numbers next to the query sequence represent positions relative to the entire query; the numbers next to the target sequence represent positions relative to the subsequence defined in the "Target =" line.
Figure 3. The RIBOSUM85-60 matrix. The 16 × 16 matrix is used to get scores for aligning base pairs. The 4 × 4 matrix is used to get scores for aligning single-stranded regions. Positive scores are shaded.
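To make the two lookups concrete, the sketch below shows one way a 16 × 16 base-pair matrix and a 4 × 4 single-nucleotide matrix could be indexed when scoring an ungapped alignment. The nucleotide ordering `ACGU`, the function names, and above all the matrix values are assumptions: the toy matrices here are placeholders (identity rewarded, everything else penalized) and are not the RIBOSUM85-60 scores.

```python
NUCS = "ACGU"  # assumed row/column order; the published matrix may differ


def pair_index(left, right):
    """Index of a base pair (left, right) in a 16 x 16 matrix."""
    return 4 * NUCS.index(left) + NUCS.index(right)


def toy_matrices():
    """Placeholder matrices, NOT the real RIBOSUM85-60 values."""
    single = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
    pair = [[2 if i == j else -1 for j in range(16)] for i in range(16)]
    return single, pair


def score_ungapped(query, target, structure, single, pair):
    """Score an equal-length, gap-free alignment using the query structure."""
    if not (len(query) == len(target) == len(structure)):
        raise ValueError("query, target and structure must have equal length")
    total = 0
    stack = []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            left = stack.pop()
            # Base pair: one lookup in the 16 x 16 matrix.
            q = pair_index(query[left], query[pos])
            t = pair_index(target[left], target[pos])
            total += pair[q][t]
        else:
            # Single-stranded position: one lookup in the 4 x 4 matrix.
            total += single[NUCS.index(query[pos])][NUCS.index(target[pos])]
    return total


if __name__ == "__main__":
    single, pair = toy_matrices()
    print(score_ungapped("GGAACC", "GGAACC", "((..))", single, pair))
    print(score_ungapped("GGAACC", "GCAAGC", "((..))", single, pair))
```

The point of the example is that a base pair is scored as a unit, so a compensatory change on both sides of a stem is evaluated by a single matrix entry rather than as two independent substitutions.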
Figure 4. RSEARCH statistics. Distribution of scores for a search against random sequences. We searched a database of 10,000 random sequences of 10,000 nucleotides each, with a GC composition of 50%, using the precursor to the C. elegans miRNA mir-40 as the query [60]. We took the best score found for each of the 10,000 sequences in the database and plotted their distribution. We then calculated the mean and standard deviation and plotted the Gaussian distribution for those values. We also calculated K and λ for the Gumbel distribution and plotted that distribution. Average observed number of hits with E-value less than a cutoff versus reported E-value for searches of various RNase P queries against a database of archaeal genomes. E-values were computed using partition points of 40% and 60% G+C content.
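The comparison in this caption, a Gaussian and a Gumbel distribution fitted to the same set of best scores, can be sketched as follows. This is an illustration under stated assumptions, not the fitting procedure used by RSEARCH: the Gumbel parameters are estimated here by the method of moments, the function names are hypothetical, and the conversion to K assumes the common parameterization P(S ≥ x) = 1 − exp(−K N e^(−λx)) with N taken as the search-space size.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gaussian(scores):
    """Mean and sample standard deviation of the scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of Gumbel location mu and rate lambda."""
    mean, sd = fit_gaussian(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    return mu, lam


def k_from_mu(mu, lam, search_space):
    """K under the assumed form mu = ln(K * N) / lambda."""
    return math.exp(lam * mu) / search_space


def gumbel_pvalue(x, mu, lam):
    """P(S >= x) for a Gumbel (extreme value) distribution."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def gaussian_pvalue(x, mean, sd):
    """P(S >= x) for a Gaussian distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Synthetic best scores drawn from a Gumbel distribution (made-up parameters).
    scores = [20.0 - math.log(-math.log(rng.uniform(1e-12, 1 - 1e-12))) / 0.7
              for _ in range(10000)]
    mean, sd = fit_gaussian(scores)
    mu, lam = fit_gumbel_moments(scores)
    x = mean + 4 * sd
    print("Gaussian tail:", gaussian_pvalue(x, mean, sd))
    print("Gumbel tail:  ", gumbel_pvalue(x, mu, lam))
```

On extreme-value data the Gaussian tail falls off much faster than the Gumbel tail, so a Gaussian fit would tend to overstate the significance of high scores.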