Manuel A Candales, Adrian Duong, Keyar S Hood, Tony Li, Abat Shakenov, Runda Sun, Michael Abebe, Ryan A E Neufeld, Li Wu, Ashley M Jarding, Cameron Semper, Steven Zimmerly.
Abstract
BACKGROUND: Accurate and complete identification of mobile elements is a challenging task in the current era of sequencing, given their large numbers and frequent truncations. Group II intron retroelements, which consist of a ribozyme and an intron-encoded protein (IEP), are usually identified in bacterial genomes through their IEP; however, the RNA component that defines the intron boundaries is often difficult to identify because of a lack of strong sequence conservation corresponding to the RNA structure. Compounding the problem of boundary definition is the fact that a majority of group II intron copies in bacteria are truncated.
Year: 2013 PMID: 24359548 PMCID: PMC4028801 DOI: 10.1186/1759-8753-4-28
Source DB: PubMed Journal: Mob DNA
Figure 1. Example group II intron structure. (A) DNA structure of a group II intron. The intron RNA portion is denoted by red boxes, while conserved ORF domains are in blue. The IEP contains an RT (reverse transcriptase) domain, including conserved sub-domains (0, 1, 2, 2a, 3, 4, 5, 6, 7), an X domain, a D (DNA-binding) domain and an optional En (endonuclease) domain. Intron RNA domains are shown underneath in Roman numerals, and exon 1 and 2 sequences are in black. (B) An example group II intron RNA secondary structure (IIC). The intron sequence is depicted in red lettering, with exon sequences in blue and black. The ORF sequence is represented by the dotted loop in domain IV. IBS1/EBS1 and IBS3/EBS3 (blue and orange shading) represent base pairings between the intron and exons that help to define the intron boundaries during splicing. The sequence shown is for B.h.I1 of Bacillus halodurans.
Summary of programs
| Program | Function |
| --- | --- |
| blast_and_parse | • A tblastn search is done against NCBI’s ‘nr’ database with a set of representative group II intron ORF sequences as queries |
| | • A list of unique, non-overlapping candidate hits is assembled, along with accession number and coordinates |
| DNA_sequence_download | • The GenBank entry for each candidate DNA sequence is downloaded |
| | • Candidates are separated by taxonomic classification, with bacterial and archaeal candidates proceeding to the next step by default |
| create_storage | • Creates a FASTA file for each candidate’s DNA sequence |
| | • Creates storable files containing information about each candidate, to be used in later programs |
| filter_out_non_gpII_rts | • A blastx search of candidate sequences is done against a local database of known, categorized bacterial RT sequences; candidate RTs whose closest relatives are not group II introns are separated out |
| find_intron_class | • A blastx search of candidate sequences is done against a local database of known and classified group II intron ORF sequences; based on the top matches, the ORF class is assigned, and the closest relative in the curated set is identified |
| find_orf_domains | • A blastx alignment is done between a candidate sequence and a representative IEP of the same class, whose IEP is mapped for the domains characteristic of group II introns |
| | • The domains present for each IEP are tabulated, and the candidate is categorized as having complete domains or missing domains; candidate sequences with complete IEP domains continue to be analyzed |
| find_orf | • A blastx alignment is done between each candidate sequence and its closest relative among curated group II introns |
| | • From the alignment, it is decided whether the candidate sequence contains frame shifts, premature stops or other problems within its IEP |
| | • If the ORF appears intact, then a predicted amino acid sequence is assigned |
| find_intron_boundaries | • Information on possible boundary positions is acquired using class-specific HMM profiles of boundary sequences |
| generate_rna_sequence | • Boundary sequence data are evaluated and the most probable intron boundaries are predicted, along with the complete sequence of the intron |
| | • Candidates with ambiguous boundaries are noted |
| group_candidates | • All ORF sequences assigned to a given class are aligned using ClustalW, and pair-wise distances are calculated using PROTDIST of the Phylip package |
| | • Sequences differing by less than 0.061 units are assigned to a group of 95% identity |
| | • For each group of 95% identity, the complete intron DNA sequence of each member is aligned using ClustalW |
| select_prototypes | • For each group of 95% identity, one candidate sequence is selected as the prototype, or representative of the group |
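
The grouping rule in group_candidates (sequences differing by less than 0.061 distance units fall into the same 95% identity group) can be illustrated with a small Python sketch. This is not the pipeline's own code. It assumes the pair-wise distances have already been computed and parsed into a dictionary, and it assumes single-linkage grouping (any pair under the cutoff joins two groups), which the table does not specify. The function name `group_by_distance` and the candidate names are hypothetical.

```python
from typing import Dict, List, Tuple

# Distance cutoff quoted in the table for a "95% identity" group.
THRESHOLD = 0.061


def group_by_distance(
    names: List[str],
    distances: Dict[Tuple[str, str], float],
    threshold: float = THRESHOLD,
) -> List[List[str]]:
    """Group sequences whose pair-wise distance is below the threshold.

    ASSUMPTION: single-linkage grouping, implemented with union-find.
    The source does not state which linkage rule the pipeline uses.
    """
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist < threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Hypothetical candidates and distances, for illustration only.
    names = ["cand_A", "cand_B", "cand_C", "cand_D"]
    distances = {
        ("cand_A", "cand_B"): 0.020,
        ("cand_A", "cand_C"): 0.300,
        ("cand_A", "cand_D"): 0.450,
        ("cand_B", "cand_C"): 0.310,
        ("cand_B", "cand_D"): 0.440,
        ("cand_C", "cand_D"): 0.050,
    }
    for group in group_by_distance(names, distances):
        print(group)
```

Under single linkage, a chain of close pairs can pull together two sequences whose direct distance exceeds the cutoff; a stricter rule (for example, requiring every pair in a group to be under the cutoff) would give smaller groups.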
Figure 2. Pipeline flowchart. The pipeline proceeds through a series of steps in which data are collected and put into eight storage folders. Each storage folder feeds data into a subsequent program, which produces the next storage folder. The number of candidate introns decreases at each step, while more information accumulates for the smaller set of introns. To summarize the overall process briefly, a BLAST search identifies candidate IEPs in GenBank and DNA sequences are downloaded. RTs that are not IEPs are filtered out, and retained candidates are assigned to an intron class. ORF domains (0, 1, 2, 2a, 3, 4, 5, 6, 7, X, En) are identified and ORF boundaries are annotated. The intron boundaries are then identified and an RNA structure is generated. Candidates with >95% similarity are grouped and a prototype from each group is identified.
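
The first step of the flowchart reduces the BLAST hits to a list of unique, non-overlapping candidates with accession numbers and coordinates. A minimal Python sketch of one way to do this follows. It assumes the hits have already been parsed into (accession, start, end) records, that overlapping hits on the same accession are merged into a single span, and that strand is ignored; none of these details is stated in the source. The names `Hit` and `merge_hits` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One BLAST hit on a subject sequence (hypothetical record layout)."""
    accession: str
    start: int
    end: int


def merge_hits(hits: List[Hit]) -> List[Hit]:
    """Collapse overlapping hits on each accession into single spans.

    ASSUMPTIONS: overlaps are resolved by merging, and strand is ignored.
    """
    by_accession: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for hit in hits:
        # Minus-strand hits may be reported with start > end; normalize.
        low, high = sorted((hit.start, hit.end))
        by_accession[hit.accession].append((low, high))

    merged: List[Hit] = []
    for accession in sorted(by_accession):
        spans = sorted(by_accession[accession])
        cur_low, cur_high = spans[0]
        for low, high in spans[1:]:
            if low <= cur_high:
                # Overlaps the current span: extend it.
                cur_high = max(cur_high, high)
            else:
                # Gap found: close the current span and start a new one.
                merged.append(Hit(accession, cur_low, cur_high))
                cur_low, cur_high = low, high
        merged.append(Hit(accession, cur_low, cur_high))
    return merged


if __name__ == "__main__":
    # Hypothetical accessions and coordinates, for illustration only.
    example = [
        Hit("ACC_X", 100, 500),
        Hit("ACC_X", 400, 900),
        Hit("ACC_X", 2000, 1500),
        Hit("ACC_Y", 10, 50),
    ]
    for candidate in merge_hits(example):
        print(candidate.accession, candidate.start, candidate.end)
```

Merging is only one possible policy; keeping the best-scoring hit among an overlapping set would also yield non-overlapping candidates, but requires the hit scores to be carried along.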