Pravech Ajawatanawong, Gemma C Atkinson, Nathan S Watson-Haigh, Bryony Mackenzie, Sandra L Baldauf.
Abstract
Analyses of multiple sequence alignments generally focus on well-defined conserved sequence blocks, while the rest of the alignment is largely ignored or discarded. This is especially true in phylogenomics, where large multigene datasets are produced through automated pipelines. However, some of the most powerful phylogenetic markers have been found in the variable-length regions of multiple alignments, particularly insertions/deletions (indels) in protein sequences. We have developed Sequence Feature and Indel Region Extractor (SeqFIRE) to enable the automated identification and extraction of indels from protein sequence alignments. The program can also extract conserved blocks and identify fast-evolving sites using a combination of conservation and entropy. All major variables can be adjusted by the user, allowing them to identify the sets of variables most suited to a particular analysis or dataset. Thus, all major tasks in preparing an alignment for further analysis are combined in a single flexible and user-friendly program. The output includes a numbered list of indels, alignments in NEXUS format with indels annotated or removed, and indel-only matrices. SeqFIRE is a user-friendly web application, freely available online at www.seqfire.org/.
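As a rough illustration of what "a combination of conservation and entropy" can mean for a single alignment column, the sketch below computes the frequency of the most common residue and the Shannon entropy of the column, then flags columns that are both poorly conserved and high in entropy. This is not SeqFIRE's implementation: the function names, the gap handling (gaps are ignored) and the two thresholds are assumptions made for the example.

```python
import math
from collections import Counter


def column_stats(alignment, col):
    """Return (conservation, entropy) for one column, ignoring gap characters.

    conservation: fraction of non-gap residues equal to the most common residue.
    entropy: Shannon entropy (bits) of the non-gap residue distribution.
    """
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    if not residues:
        return 0.0, 0.0
    counts = Counter(residues)
    total = len(residues)
    conservation = counts.most_common(1)[0][1] / total
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return conservation, entropy


def fast_evolving_sites(alignment, min_conservation=0.5, max_entropy=1.0):
    """Flag columns with low conservation AND high entropy.

    Both thresholds are hypothetical defaults chosen for this example only.
    """
    sites = []
    for col in range(len(alignment[0])):
        conservation, entropy = column_stats(alignment, col)
        if conservation < min_conservation and entropy > max_entropy:
            sites.append(col)
    return sites


# Toy alignment: four sequences of equal length.
toy = ["ACDE", "ACEF", "AC-G", "ADGH"]
print(fast_evolving_sites(toy))
```

In the toy alignment the first column is invariant and the last has four different residues, so only the variable columns at the end are flagged. Requiring both criteria is one possible way to combine them; the actual rule used by the program may differ.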
Year: 2012 PMID: 22693213 PMCID: PMC3394284 DOI: 10.1093/nar/gks561
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Workflow for SeqFIRE, a user-friendly web application for automated identification and extraction of indels and conserved blocks from MSAs. The workflows for the (A) indel and (B) conserved block modules of SeqFIRE are shown on the left and right, respectively. Boxes indicate processes and diamonds indicate suggested parameters for specific steps. Numbers in the upper right-hand corners of boxes indicate different steps in the process as described in the text. For the conserved block module, MNS refers to the minimum non-conserved site threshold and MCS to the minimum conserved site threshold.
Figure 2. Example output from the SeqFIRE indel module. The module produces five different outputs (A–E). The alignment with indel annotation is visualized in Jalview (A) and text mode (B). The indels list is a numbered sequential list of all indels including the location of the indel in the alignment and the full sequence of the indel region for all taxa (C). The simple indel matrix is a NEXUS-formatted matrix with all simple indels scored as 0 or 1 (absence or presence) for all taxa (D). The indel module also outputs an alignment with all indel regions removed, also in NEXUS format (E). Outputs B–E can be downloaded as a single file or separately using links at the top of the output page.
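The 0/1 scoring described for the simple indel matrix can be sketched as follows. The code assumes that indel regions have already been located and are supplied as half-open column ranges, and that a taxon scores 0 when its sequence is entirely gaps within the region and 1 otherwise. The function names, the toy taxa and the minimal NEXUS layout are illustrative assumptions, not a reproduction of SeqFIRE's own output.

```python
def score_indel(alignment, start, end):
    """Score one region per taxon: '0' if all gaps (absence), '1' otherwise."""
    return {
        taxon: "0" if set(seq[start:end]) <= {"-"} else "1"
        for taxon, seq in alignment.items()
    }


def indel_matrix(alignment, regions):
    """Build one row of 0/1 characters per taxon, one character per region."""
    rows = {taxon: "" for taxon in alignment}
    for start, end in regions:
        for taxon, state in score_indel(alignment, start, end).items():
            rows[taxon] += state
    return rows


def to_nexus(rows):
    """Write the matrix as a minimal NEXUS DATA block (layout is an assumption)."""
    nchar = len(next(iter(rows.values())))
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(rows)} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    lines += [f"    {taxon}  {states}" for taxon, states in rows.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical three-taxon alignment with two regions of interest.
toy = {
    "taxonA": "MKT--LV",
    "taxonB": "MKTAGLV",
    "taxonC": "MK---LV",
}
regions = [(3, 5), (2, 3)]  # half-open column ranges, chosen by hand here
print(to_nexus(indel_matrix(toy, regions)))
```

Each matrix column corresponds to one indel region, so the result can be analysed as a set of binary presence/absence characters alongside, or instead of, the sequence data.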
Performance of SeqFIRE and GBlocks in detecting conserved blocks within BAliBASE (16) benchmark alignments
| Test alignment | Original alignment (sites) | GBlocks, less stringency | GBlocks, more stringency | SeqFIRE, low stringency | SeqFIRE, medium stringency | SeqFIRE, high stringency |
|---|---|---|---|---|---|---|
| BB1103 | 582 | 162 (27.8%) | 42 (7.2%) | 218 (37.5%) | 215 (36.9%) | 49 (8.4%) |
| BB1105 | 609 | 14 (2.3%) | 0 (0.0%) | 336 (55.2%) | 314 (51.6%) | 0 (0.0%) |
| BB1106 | 385 | 29 (7.5%) | 0 (0.0%) | 205 (53.2%) | 193 (50.1%) | 0 (0.0%) |
| BB11031 | 882 | 26 (2.6%) | 0 (0.0%) | 278 (31.5%) | 253 (28.7%) | 6 (0.7%) |
| BB11036 | 525 | 69 (13.1%) | 0 (0.0%) | 322 (61.3%) | 304 (57.9%) | 39 (7.4%) |
| BB12001 | 623 | 193 (31.0%) | 83 (13.3%) | 372 (59.7%) | 361 (57.9%) | 107 (17.2%) |
| BB12004 | 312 | 152 (48.7%) | 40 (12.8%) | 226 (72.4%) | 226 (72.4%) | 92 (29.5%) |
| BB12017 | 586 | 318 (54.3%) | 229 (39.1%) | 425 (72.5%) | 414 (70.6%) | 233 (39.8%) |
| BB12030 | 1247 | 279 (22.4%) | 83 (6.7%) | 738 (59.2%) | 738 (59.2%) | 192 (15.4%) |
| BB12043 | 786 | 120 (15.3%) | 13 (1.7%) | 211 (26.8%) | 210 (26.7%) | 91 (11.6%) |
| BB30008 | 1413 | 158 (11.2%) | 28 (2.0%) | 333 (23.6%) | 323 (22.9%) | 93 (6.6%) |
| BB30009 | 278 | 48 (17.3%) | 0 (0.0%) | 184 (66.2%) | 155 (55.8%) | 5 (1.8%) |
| BB30021 | 631 | 25 (4.0%) | 0 (0.0%) | 151 (23.9%) | 131 (20.8%) | 12 (1.9%) |
| BB30027 | 239 | 49 (20.5%) | 0 (0.0%) | 67 (28.0%) | 60 (25.1%) | 5 (2.1%) |
| BB30030 | 2015 | 129 (6.4%) | 21 (1.0%) | 254 (12.6%) | 236 (11.7%) | 39 (1.9%) |
Test alignment numbers refer to BAliBASE accession numbers for three different levels of sequence conservation: Ref 1V1, Ref 1V2 and Ref 3. GBlocks was tested at the two stringency levels provided by the web server, while SeqFIRE was tested at three levels using a combination of user-defined options (for details see text). For each stringency level, the number of conserved positions is listed with the percentage of retained sites in parentheses.
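The percentages in the table are the retained positions divided by the number of sites in the original alignment. A minimal sketch of that arithmetic, using a hypothetical helper name and the BB1103 row as input:

```python
def percent_retained(retained_sites, original_sites):
    """Percentage of original alignment sites kept, rounded to one decimal."""
    return round(100 * retained_sites / original_sites, 1)


# BB1103: 582 original sites; GBlocks (less stringent) keeps 162,
# SeqFIRE (low stringency) keeps 218.
print(percent_retained(162, 582))  # 27.8
print(percent_retained(218, 582))  # 37.5
```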
Figure 3. Example of SeqFIRE and GBlocks detection of conserved alignment regions under high stringency criteria. A fragment of BAliBASE reference 1V2 alignment number BB12001 is shown between positions 129 and 187. The gray bars below the alignment indicate the conserved blocks detected by SeqFIRE and the black bars show the conserved blocks detected by GBlocks. The dark background within the alignment indicates conserved amino acids.