Abstract
BACKGROUND: Ribosomal 16S DNA sequences are an essential tool for identifying and classifying microbes. High-throughput DNA sequencing now makes it economically possible to produce very large datasets of 16S rDNA sequences in short time periods, necessitating new computer tools for analyses. Here we describe FastGroup, a Java program designed to dereplicate libraries of 16S rDNA sequences. By dereplication we mean to: 1) compare all the sequences in a data set to each other, 2) group similar sequences together, and 3) output a representative sequence from each group. In this way, duplicate sequences are removed from a library.
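The three dereplication steps listed in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FastGroup's code: `percent_identity` and `dereplicate` are hypothetical names, identity is computed position by position without alignment, grouping is greedy, and the first member of each group is assumed to serve as its representative.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Position-wise identity over the shorter length (no alignment, no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def dereplicate(seqs: Dict[str, str], threshold: float = 97.0) -> Dict[str, List[str]]:
    """Greedy grouping (an assumption): each sequence joins the first group
    whose representative it matches at >= threshold, else it founds a new group.
    Keys of the result are the representative sequence names."""
    groups: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in groups:
            if percent_identity(seqs[rep], seq) >= threshold:
                groups[rep].append(name)
                break
        else:
            # No existing representative was similar enough.
            groups[name] = [name]
    return groups
```

Calling `dereplicate` on a dictionary of named sequences returns one entry per group, so the keys alone form the dereplicated library.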
Year: 2001 PMID: 11707150 PMCID: PMC59723 DOI: 10.1186/1471-2105-2-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graphical User Interface (GUI) for FastGroup.
Figure 2. Schematic of bacterial 16S rDNA showing conserved and hypervariable regions. Detailed information about the primers and their superposition on the bacterial 16S rDNA can be found at . Bact27F (5' AGA GTT TGA TCM TGG CTC AG 3') corresponds to positions 9–27 of the E. coli 16S rDNA and is similar to BSF8/20. Bact517 (5' ATT ACC GCG GCT GCT GG 3') corresponds to positions 517–534 of the E. coli 16S rDNA and is similar to BSF517/17. Bact1492R (5' TAC GGY TAC CTT GTT ACG ACT T 3') corresponds to positions 1492–1514 of the E. coli 16S rDNA. The approximate sites of the hypervariable regions (V1–V3) are shown as shaded boxes.
Effects of varying the percent sequence identity (PSI) on the ability of FastGroup to correctly identify the Bact517 site. To determine the number of times that the site was found by FastGroup, the sequences were displayed from the 3' direction and visually analyzed in the group_seqs.txt output file. The number of false positives was determined by looking for significantly truncated sequences (e.g., <400 bp) and then visually confirming that a false site had been identified.
| PSI (%) | Times Bact517 site found | False positives |
|---|---|---|
| 100 | 75 | 0 |
| 90 | 79 | 0 |
| 80 | 81 | 0 |
| 70 | 83 | 0 |
| 60 | 92 | 9 |
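The trade-off in this table (a lower PSI finds the site more often but eventually admits false sites) can be illustrated with a sliding-window search. The sketch below is an assumption-laden illustration, not FastGroup's implementation: `site_identity` and `find_site` are hypothetical names, the primer is matched in the orientation given, degenerate primer bases are expanded with the standard IUPAC codes, and the best-scoring window is reported.

```python
# Standard IUPAC nucleotide codes used in the primers (e.g. M = A or C, Y = C or T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "N": "ACGT",
}


def site_identity(primer, window):
    """Percent of primer positions whose code admits the base in the window."""
    matches = sum(1 for p, b in zip(primer, window) if b in IUPAC.get(p, ""))
    return 100.0 * matches / len(primer)


def find_site(seq, primer, min_psi):
    """Return (position, psi) of the best window scoring >= min_psi, else None."""
    best = None
    for i in range(len(seq) - len(primer) + 1):
        psi = site_identity(primer, seq[i:i + len(primer)])
        if psi >= min_psi and (best is None or psi > best[1]):
            best = (i, psi)
    return best
```

With `min_psi=100` a site containing a single sequencing error is missed, while a lower threshold recovers it; setting the threshold too low lets unrelated windows pass, which is the false-positive behaviour reported at 60% PSI.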
Effects of matching direction and window size on grouping results and time to analyze data using the PSI algorithm.
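| Matching direction | 5' trimming | 3' trimming | Window size | Groups | Time |
|---|---|---|---|---|---|
| 5' | 1 N in 50 bp | Bact517 | 10 | 54 | 8 |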
| 3' | 1 N in 50 bp | Bact517 | 10 | 48 | 4 |
| 5' | 1 N in 50 bp | 1 N in 50 bp | 10 | 92 | 12 |
| 3' | 1 N in 50 bp | 1 N in 50 bp | 10 | 94 | 30 |
| 5' | 500 bp* | 1 N in 50 bp | 10 | 64 | 5 |
| 3' | 500 bp* | 1 N in 50 bp | 10 | 55 | 3 |
| 3' | 1 N in 50 bp | Bact517 | 5 | 49 | <1 |
| 3' | 1 N in 50 bp | Bact517 | 10 | 48 | 4 |
| 3' | 1 N in 50 bp | Bact517 | 25 | 51 | 67 |
* FastGroup is not capable of both using a specific number of bp from one end and trimming the other end using one of the other parameters. In these examples, this limitation was circumvented by first trimming the sequences using the 1 N in 50 bp criterion. The output fasta_groups.txt file was then used as the input file for a second FastGroup analysis in which 500 bp from the 5' end were used for grouping.
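The two-pass workaround in this footnote can be sketched as follows. The exact semantics of the "1 N in 50 bp" criterion are not spelled out here, so the code assumes one plausible reading: bases are dropped from the 3' end until the terminal 50-bp window contains at most one N. The function names `trim_3prime` and `two_pass` are hypothetical.

```python
def trim_3prime(seq, window=50, max_n=1):
    """Drop bases from the 3' end until the last `window` bases
    contain at most `max_n` ambiguous bases (N). Assumed interpretation."""
    end = len(seq)
    while end > 0 and seq[max(0, end - window):end].count("N") > max_n:
        end -= 1
    return seq[:end]


def two_pass(seq, keep=500):
    """Pass 1: quality-trim the 3' end. Pass 2: keep the first `keep` bp
    from the 5' end for grouping."""
    trimmed = trim_3prime(seq)
    return trimmed[:keep]
```

Running the trimming first means the 500-bp slice is always taken from sequence that has already passed the ambiguity filter, which mirrors feeding the first run's output file into a second run.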
Comparison of PSI and MM Algorithms.
| Algorithm | PSI (%) or mismatches allowed | Groups | Time |
|---|---|---|---|
| PSI | 100 | 85 | 7 |
| PSI | 97 | 48 | 4 |
| PSI | 95 | 45 | 4 |
| PSI | 93 | 41 | 3 |
| MM | 1 | 62 | <1 |
| MM | 2 | 42 | <1 |
| MM | 3 | 36 | <1 |
| MM | 4 | 30 | <1 |
Effects of using only partial sequences during the Grouping step.
| Matching direction | Sequence used (%) | Groups | Time |
|---|---|---|---|
| 5' | 100 | 54 | 8 |
| 5' | 90 | 45 | 1 |
| 5' | 80 | 44 | <1 |
| 3' | 100 | 48 | 4 |
| 3' | 90 | 37 | 1 |
| 3' | 80 | 28 | <1 |
Figure 3. Comparison of ClustalX and FastGroup analyses. An alignment of the 16S rDNA library was performed using ClustalX and a neighbor-joining (NJ) tree was constructed. The "ClustalX Clades" were made by grouping end nodes separated by approximately 3% divergence (i.e., the combined branch lengths). Sequences grouped together by FastGroup, using default trimming criteria and 97% PSI, were identified on this tree and color-coded.
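The clade definition in this caption (end nodes separated by roughly 3% combined branch length) can be expressed as a small clustering step. The sketch below is only an illustration: it assumes the leaf-to-leaf (patristic) distances have already been read off the tree, uses single-linkage merging as an assumed rule, and the name `clades` and the input layout are hypothetical.

```python
from itertools import combinations


def clades(leaves, dist, cutoff=0.03):
    """Group leaves whose pairwise distance is <= cutoff (single linkage).
    `dist` maps frozenset({a, b}) to the combined branch length between a and b."""
    parent = {x: x for x in leaves}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(leaves, 2):
        if dist[frozenset((a, b))] <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for x in leaves:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())
```

The resulting groups can then be compared with the 97% PSI groups, since 3% divergence and 97% identity describe the same nominal threshold.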