Amy M Hauth, Gertraud Burger.
Abstract
MOTIVATION: A recurrent criticism is that certain bioinformatics tools do not account for crucial biology and therefore fail to answer the targeted biological question. We posit that the single most important reason for such shortcomings is an inaccurate formulation of the computational problem.
Keywords: guidelines; problem formulation; tool development
Year: 2008 PMID: 19812779 PMCID: PMC2735962 DOI: 10.4137/bbi.s706
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Methodology for formulating a bioinformatics model.
| Biological question |
| Biological knowledge |
| Describe biological phenomenon based on observed occurrences |
| Construct dataset of observed instances |
| Unobserved yet conceivable occurrences |
| Biological criteria (BCs) |
| Convert description of biological knowledge |
| Remove ambiguity |
| Computational rules (CRs) |
| Formulate rules based on BCs |
| Identify key rules |
| Global computational problem restating the biological question |
| General computational approach |
| Detail approach through a set of smaller problem definitions |
Sample of biological model for tRNA gene identification model.
| Biological model |
|---|
| Question: Which tRNAs are encoded in a genome? |
| Relevant knowledge on tRNAs (description of gene, gene product and function)
1. Brief definition
“Transfer RNA (tRNA) … is a small RNA molecule (70–90 nucleotides). The tRNAs, by binding at one end to a specific codon in the mRNA and at their other end to the amino acid specified by that codon, enable amino acids to line up according to the sequence of nucleotides in the mRNA. Each tRNA is designed to carry only one of the 20 amino acids …. Each of the 20 amino acids has at least one type of tRNA assigned to it, and most have several tRNAs. Before an amino acid is incorporated into a protein chain, it is attached by its carboxyl end to the 3′ end of … a tRNA containing the correct
2. Description
2.1 Observed tRNAs (gene product) and genes
2.1.1 tRNA structure
“tRNAs can form the loops and base-paired stems of a cloverleaf structure, and all are thought to fold further to adopt the L-shaped conformation”. The cloverleaf structure is composed of three arms (D, anticodon (AC) and T) and a highly variable (V) loop between the AC- and T-arms, enclosed by the aminoacyl (AA) stem (Fig. 1). Generally, the D-stem forms four base pairs, but a stem of three is possible. Likewise, the D-loop is typically 8 nt long but may expand to 9 or 10 nt. Overlapping the D-arm is one of two promoters recognized by transcription factor TFIIIC; it has the conserved sequence 5′-GTGGCNNAGT-3′.
A major alternative to the cloverleaf structure is composed of two instead of three arms, lacking either the D- or the T-arm (Fig. 2). Often, the V-loop expands and establishes extra stabilizing interactions. Such tRNAs lacking entire domains have been documented in certain animal lineages.
In Archaea, the ‘strictly invariant’ nucleotide U at position 8 is replaced by a C.
Available general resource:
2.1.2 tRNA genes
The sequence of the tRNA (gene product) differs from that of its gene. The transcript of the gene is subjected to various processes that change its sequence relative to that of the gene. Most common are post-transcriptional nucleotide modifications. For example, the T-loop contains the modified base pseudouridine (Ψ, psi), which is encoded in the gene as T. In addition, the CCA tail at the 3′ end of tRNAs is added post-transcriptionally. Less common are changes incurred by RNA editing, by which nucleotides are replaced, inserted or deleted. For example, mis-pairings in the AA-arm portion of the gene between positions 1–72, 2–71 and 3–70 are corrected post-transcriptionally by RNA editing.
2.2 Sample instances
Below is a list of sequences representative of tRNA genes. This list is sufficiently broad to serve as a benchmark that measures the effectiveness of tRNA identification software. Journal references are provided where appropriate.
GenBank Acc. No. DQ256197, positions 78–145 and references therein. A compilation of tRNA genes identified by tRNAscan-SE.
2.3 Conceivable genes
The unifying structure is the L-shaped tertiary structure that a tRNA requires to perform its translational function (Fig. 3). Nucleotides are also important for processing, amino-acylation, binding of initiation and elongation factors, etc. The constraints on the shape are …. It is conceivable that the gene is encoded by multiple gene pieces that are transcribed independently, similar to ribosomal RNA. |
Figure S1. Canonical schematic of the tRNA cloverleaf secondary structure (Marck and Grosjean, 2002).
Figure S2. The two-arm, tRNA-Arg molecule in Caenorhabditis elegans (Okimoto and Wolstenholme, 1990).
Figure S3. The L-shaped tRNA tertiary structure. AA-stem (orange), D-arm (green), AC-arm (blue), V-loop (purple), T-arm (red).
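Section 2.1.2 of the biological model notes that the tRNA differs from its gene: the transcript carries U where the gene carries T, and the CCA tail is added after transcription. A minimal Python sketch of that relationship follows. The function name `gene_to_mature_trna` is hypothetical, and the sketch assumes the CCA tail is never gene-encoded and ignores base modifications and RNA editing.

```python
CCA_TAIL = "CCA"  # added post-transcriptionally (section 2.1.2)


def gene_to_mature_trna(gene_dna: str) -> str:
    """Approximate the mature tRNA sequence from its gene sequence.

    Assumptions (not from the source): the gene does not encode the
    CCA tail, and modifications such as pseudouridine are not modeled.
    """
    transcript = gene_dna.upper().replace("T", "U")  # DNA alphabet -> RNA alphabet
    return transcript + CCA_TAIL


def strip_cca(trna_rna: str) -> str:
    """Inverse direction: drop a trailing CCA before comparing to a gene."""
    return trna_rna[:-3] if trna_rna.endswith(CCA_TAIL) else trna_rna
```

A tool that compares known tRNA sequences against a genome would apply the inverse mapping first, which is why the model lists gene and gene product separately.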
Sample of bioinformatics transformation for tRNA gene identification model.
| Bioinformatics transformation |
|---|
| Biological criteria (BC)
1. Cloverleaf variant
1.1. This variant is composed of three arms, a stem and a highly variable “bulge” loop: the D-arm (positions 10..25), the anticodon (AC) arm (pos. 27..43), the variable (V) loop (pos. 44..48) and the T-arm (pos. 49..65), enclosed by the acceptor (AA) stem (pos. 1..7 and 66..72). More details in Fig. 1.
1.2. Stems represent the major source of tertiary structure stabilization. Individual nucleotide interactions between loop regions provide some additional stabilization.
2. Two-arm variant
2.1. This variant is composed of two arms, a stem and a highly variable “bulge” loop. The order is similar to the cloverleaf variant, except that either the D-arm or the T-arm is absent. See Fig. 2.
2.2. Both stems and individual nucleotide interactions between loops stabilize the tertiary structure. Compensation for the missing stem is provided by a larger V-loop and an increase in non-stem nucleotide interactions.
3. Stems
3.1. Bulges of one or two nucleotides may occur in a stem.
3.2. Allowable nucleotide pairs: A-U, C-G and G-U.
4. D-arm
4.1. The D-arm forms a hairpin closed by a stem (pos. 10 to 25).
4.2. The D-stem length is 3 or 4 base pairs.
4.3. The D-stem pairing positions: 10–25, 11–24, 12–23 and 13–22. Note: if 13–22 do not pair, the numbering remains as though they are in the stem.
4.4. The D-loop length is 8 to 11 nt. If positions 13 and 22 do not pair, it increases to 10 to 13 nt.
4.5. The D-loop positions: 14, 15, 16, 17, 17a, 18, 19, 20, 20a, 20b, 21. Optional positions: 17a, 20a and 20b.
4.6. Detailed nucleotide and base-pairing distributions are available.
5. Conserved sequences
5.1. In eukaryotes, the conserved sequence, 5′-GTGGCNNAGT-3′, is found at position 8. |
| Computational rules (CR)
1. tRNA genes: grammar (partial)
1.1. <tRNA gene> ::= <AA-stem begin><ss-loop><three-arm><ss-loop><AA-stem end> | <AA-stem begin><ss-loop><two-arm><ss-loop><AA-stem end> (BC 1.1, 2.1)
1.2. <three-arm> ::= <D-arm><ss-loop><AC-arm><V-loop><T-arm> (BC 1.1)
1.3. <two-arm> ::= <D-arm><ss-loop><AC-arm><V-loop> | <AC-arm><V-loop><T-arm> (BC 2.1)
1.4. <D-arm> ::= <stem-loop> (BC 4.2)
1.5. <stem-loop> ::= <stem begin><ss-loop><stem end> | <stem begin><stem-loop-with-bulge><stem end> (BC 3.1)
1.6. <stem-loop-with-bulge> ::= <ss-bulge><stem-loop><ss-bulge> | <ss-bulge><stem-loop> | <stem-loop><ss-bulge> (BC 3.1)
1.7. <ss-loop> ::= <sequence> (general)
1.8. <sequence> ::= <nucleotide>* (general)
1.9. <ss-bulge> ::= <nucleotide> | <dinucleotide> (BC 3.1)
1.10. <dinucleotide> ::= <nucleotide><nucleotide> (general)
1.11. <nucleotide> ::= A | C | G | T (general)
2. Stems
2.1.
3. D-arm
3.1. 16 <= |
3.2. 3 <= |
3.3. 8 <= |
4. Conserved sequence near D-arm
4.1. D conserved sequence pattern: GTGGCNNAGT (BC 5.1)
4.2. 2 <= | First position of D-stem relative to first position of D-consensus pattern | <= 3 (BC 5.1, 4.2) |
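CR 4.1 and CR 4.2 translate fairly directly into a pattern search. The following Python sketch is one possible reading, not the authors' implementation: it treats N as any of A, C, G or T, reports 0-based positions, and assumes the D-stem starts downstream of the pattern start (the direction of the offset in CR 4.2 is an assumption). All function names are hypothetical.

```python
import re

D_CONSENSUS = "GTGGCNNAGT"   # CR 4.1
D_STEM_OFFSETS = (2, 3)      # CR 4.2: 2 <= offset <= 3


def consensus_to_regex(pattern: str) -> str:
    """Expand the wildcard N into a character class; keep other letters."""
    return "".join("[ACGT]" if base == "N" else base for base in pattern)


def find_d_consensus(seq: str) -> list[int]:
    """Return 0-based start positions of all matches, including overlaps."""
    # A lookahead makes finditer report overlapping matches as well.
    rx = re.compile("(?=" + consensus_to_regex(D_CONSENSUS) + ")")
    return [m.start() for m in rx.finditer(seq.upper())]


def d_stem_start_candidates(consensus_start: int) -> list[int]:
    """Candidate D-stem start positions implied by CR 4.2."""
    return [consensus_start + off for off in D_STEM_OFFSETS]
```

For example, `find_d_consensus("AAGTGGCTCAGTAA")` locates the pattern at index 2, and `d_stem_start_candidates(2)` then proposes indices 4 and 5 as D-stem starts.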
Sample of computational model for tRNA gene identification model.
| Computational model |
|---|
| Global problem |
| Partial problem-set |
| 1. Identify conserved sequences |
| 2. Identify candidate arms (form hairpins for stems overlapping conserved patterns) |
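Problem 2 of the partial problem-set (identify candidate arms) can be illustrated for the D-arm using BC 3.2 and BC 4.2 to 4.4. The Python sketch below is a simplified illustration under stated assumptions: it works on the gene (DNA) alphabet, so the allowable pairs A-U, C-G and G-U become A-T, C-G and G-T; it ignores the bulges permitted by BC 3.1; and the name `d_arm_candidates` is hypothetical.

```python
# BC 3.2 pairs, written in the DNA alphabet of the gene (U -> T).
ALLOWED_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"),
                 ("G", "C"), ("G", "T"), ("T", "G")}

# BC 4.2 and 4.4: a 4-bp stem encloses a loop of 8..11 nt; if positions
# 13 and 22 do not pair, the stem is 3 bp and the loop grows to 10..13 nt.
LOOP_LENGTHS = {4: range(8, 12), 3: range(10, 14)}


def d_arm_candidates(seq: str, start: int) -> list[tuple[int, int]]:
    """Return (stem_len, loop_len) for every bulge-free hairpin at `start`."""
    seq = seq.upper()
    hits = []
    for stem_len, loop_lens in LOOP_LENGTHS.items():
        for loop_len in loop_lens:
            end = start + 2 * stem_len + loop_len
            if end > len(seq):
                continue  # hairpin would run past the sequence end
            left = seq[start:start + stem_len]
            right = seq[end - stem_len:end]
            # The first base of the left strand pairs with the last base
            # of the right strand, and so on inward.
            if all((a, b) in ALLOWED_PAIRS for a, b in zip(left, reversed(right))):
                hits.append((stem_len, loop_len))
    return hits
```

Both settings give a minimum arm length of 16 nt (4 + 8 + 4 or 3 + 10 + 3). In a fuller pipeline, `start` would come from positions near a conserved-sequence match, as problem 2 specifies stems overlapping conserved patterns.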
Introns in tRNA genes and the effect on the bioinformatics model.
| Bioinformatics model |
|---|
| Relevant Knowledge |
| Addition to tRNA gene description |
| Introns … |
| New biological criteria (NBC)
6. Introns
6.1. Infrequent occurrence in tRNA genes
6.2. Location of introns
6.2.1. Most frequent in loops
6.2.1.1. Most prevalent in V-loop between AC-arm and T-arm
6.2.2. Typically, infrequent in stems
6.2.2.1. In Archaea, introns are predominantly present in stems
6.3. Group I introns
6.3.1. Observed occurrences vary from 140 to over 2,000 nt in length
6.4. Group II introns
6.4.1. Intron structure contains six domains: I–VI
6.4.2. Typical length is 600 to 2,500 nt. Smallest is 389 nt. Largest is 3,400 nt.
6.4.3. Large introns most often contain an ORF, typically in the loop of domain IV.
6.4.4. Domain V sequence: 5′-RAGCYNNRURMrNNrAAANNYKYayGYNNRGUUY-3′ |
| New computational rules (NCR)
5. tRNA genes: additional grammar for introns
5.1. <ss-loop> ::= <sequence> | <sequence-or-empty><sequence-with-intron> (NBC 6.2.1, replaces CR 1.7)
5.2. <sequence-or-empty> ::= <sequence> | ε (empty string) (general)
5.3. <sequence-with-intron> ::= <intron> | <intron><ss-loop> (NBC 6.2.1)
5.4. <ss-bulge> ::= <nucleotide> | <dinucleotide> | <intron-in-bulge> (NBC 6.2.2, replaces CR 1.9)
5.5. <intron-in-bulge> ::= <intron><nucleotide> | <intron><dinucleotide> | <nucleotide><intron> | <dinucleotide><intron> | <nucleotide><intron><nucleotide> | <intron> (NBC 6.2.2)
5.6. <intron> ::= <sequence><domainV><sequence> (NBC 6.3.2)
5.7. <domainV> ::= RAGCYNNRURMrNNrAAANNYKYayGYNNRGUUY (NBC 6.3.4)
6. Introns
6.1. 140 <= |
6.2. 389 <= |
7. D-arm (in addition to CR 3)
7.1. 16 <= | |
| Partial set of problems |
| The analytical approach must be reworked completely, as the previous approach is no longer feasible. |
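The `<domainV>` rule (NCR 5.7) is written in IUPAC ambiguity codes, so a search for it needs the codes expanded first. The Python sketch below shows one way to do that. It rests on several assumptions that are not in the source: lowercase letters in the pattern are treated like their uppercase forms, U in the pattern is matched as T because the search runs on gene (DNA) sequence, and the length bounds are taken from NBC 6.4.2. All names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes used by the pattern (U mapped to T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
         "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
         "N": "[ACGT]"}

DOMAIN_V = "RAGCYNNRURMrNNrAAANNYKYayGYNNRGUUY"  # NCR 5.7 / NBC 6.4.4
GROUP_II_LENGTH = (389, 3400)                     # NBC 6.4.2, smallest and largest


def iupac_to_regex(pattern: str) -> str:
    """Expand each IUPAC code into a regex character class."""
    return "".join(IUPAC[code] for code in pattern.upper())


def find_domain_v(seq: str) -> list[int]:
    """Return 0-based start positions of domain V matches in a DNA sequence."""
    rx = re.compile("(?=" + iupac_to_regex(DOMAIN_V) + ")")
    return [m.start() for m in rx.finditer(seq.upper().replace("U", "T"))]


def plausible_group_ii_length(intron_len: int) -> bool:
    """Check an intron length against the observed group II range."""
    low, high = GROUP_II_LENGTH
    return low <= intron_len <= high
```

Even with such a filter in place, an intron of several hundred nucleotides inside a loop or bulge moves the two halves of a stem far apart, which is consistent with the note above that the earlier approach has to be reworked.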