Julie M Allen, Daisie I Huang, Quentin C Cronk, Kevin P Johnson.
Abstract
BACKGROUND: Assembling genes from next-generation sequencing data is not only time-consuming but also computationally difficult, particularly for taxa without a closely related reference genome. Assembling even a draft genome using de novo approaches can take days, even on a powerful computer, and these assemblies typically require data from a variety of genomic libraries. Here we describe software that alleviates these issues by rapidly assembling genes from distantly related taxa using a single library of paired-end reads: aTRAM, automated Target Restricted Assembly Method. The aTRAM pipeline uses a reference sequence, BLAST, and an iterative approach to target and locally assemble the genes of interest.
Year: 2015 PMID: 25887972 PMCID: PMC4380108 DOI: 10.1186/s12859-015-0515-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graphic of the aTRAM method. A) Formation of the aTRAM database: DNA is sequenced into a paired-end short read dataset (SRD). aTRAM splits the SRD into shards, creates a BLAST-formatted database of the first end of each pair, and indexes the paired ends of the sequences in each shard. B) In iteration 0, a query sequence in either amino acid or DNA format is queried against the aTRAM-formatted database using BLAST. The top hits and their paired ends are selected and assembled de novo. In the following iterations, the contigs from the previous iteration are queried against the same database using BLAST, and the top hits and their paired ends are selected and assembled de novo, until the full locus is assembled.
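The loop described in Figure 1 can be illustrated with a small, self-contained toy. This is only a sketch under stated assumptions, not aTRAM's implementation: a shared k-mer test stands in for BLAST, a greedy overlap merger stands in for the de novo assembler, read orientation is ignored, and every name (`build_shards`, `search_shard`, `greedy_assemble`, `target_restricted_assembly`) is hypothetical.

```python
# Toy illustration of an iterative, target-restricted assembly loop.
# Assumptions: k-mer sharing replaces BLAST, a greedy overlap merger replaces
# the de novo assembler, and mates are stored in the same orientation.

def build_shards(pairs, n_shards):
    """Split (name, first_end, mate) read pairs into shards. Each shard keeps
    the searchable first ends and an index from read name to its mate."""
    shards = [{"first": {}, "mate": {}} for _ in range(n_shards)]
    for i, (name, first_end, mate) in enumerate(pairs):
        shard = shards[i % n_shards]
        shard["first"][name] = first_end
        shard["mate"][name] = mate
    return shards

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def search_shard(shard, queries, k=4):
    """Stand-in for a BLAST search: a first end is a hit if it shares
    at least one k-mer with any query sequence."""
    query_kmers = set()
    for query in queries:
        query_kmers |= kmers(query, k)
    return {name for name, seq in shard["first"].items()
            if kmers(seq, k) & query_kmers}

def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_overlap=4):
    """Stand-in assembler: repeatedly merge the pair of sequences with the
    longest exact suffix-prefix overlap until no overlap remains."""
    contigs = sorted(set(reads))
    while True:
        # Drop sequences fully contained in another sequence.
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best_n, best_pair = 0, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_pair = n, (i, j)
        if best_pair is None:
            return contigs
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)

def target_restricted_assembly(shards, seed, assemble, iterations=5):
    """Iteration 0 searches with the seed; later iterations search with the
    contigs from the previous round, stopping when nothing changes."""
    queries, contigs = [seed], []
    for _ in range(iterations):
        reads = []
        for shard in shards:
            for name in search_shard(shard, queries):
                reads += [shard["first"][name], shard["mate"][name]]
        new_contigs = assemble(reads)
        if not new_contigs or new_contigs == contigs:
            break
        contigs = new_contigs
        queries = contigs
    return contigs

# Made-up 24-base locus and read pairs, for illustration only.
locus = "ACGTGCATTGGACCTAGAGTCAAC"
pairs = [
    ("p1", locus[0:8], locus[8:16]),
    ("p2", locus[4:12], locus[12:20]),
    ("p3", locus[8:16], locus[16:24]),
    ("off_target", "TTTTTTTT", "CCCCCCCC"),
]
shards = build_shards(pairs, n_shards=2)
seed = locus[0:8]
print(target_restricted_assembly(shards, seed, greedy_assemble))
```

In this toy, the seed only reaches the first two read pairs in iteration 0; the resulting contig then recruits the third pair in the next iteration, which is the extension behaviour the figure describes.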
Results from assembling 1,534 protein coding genes using aTRAM, a reference-based and a de novo approach
| | | | |
|---|---|---|---|
| | 0.99 (1-0.20) | 0.093 (0.044) | 1,530 (99.7%) |
| | 0.93 (1-0.19) | 0.077 (0.022) | N/A |
| | 0.92 (1-0.16) | 0.095 (0.052) | 1,512 (98.92%) |
Figure 2. The y-axis is the ratio of the length of the contig assembled with aTRAM to the length of the contig assembled with the reference-based approach. Points below the line at 1 are longer with the reference-based approach, and points above the line are longer in the aTRAM assemblies. The x-axis shows the uncorrected p-distance between the aTRAM contigs and the reference DNA sequence. The graph illustrates that aTRAM assemblies tended to be longer and that the longer genes tended to be the more divergent ones, suggesting that aTRAM can assemble more divergent sections than a reference-based approach.
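The two quantities plotted in Figure 2 can be computed as in the following minimal sketch. It assumes the two sequences are already aligned to equal length and that gap (`-`) and ambiguous (`N`) sites are skipped; how such sites were treated in the study is not stated here, and the function names are hypothetical.

```python
def p_distance(aligned_a, aligned_b):
    """Uncorrected p-distance: the proportion of compared aligned sites at
    which the two sequences differ. Assumption: sites with a gap or N in
    either sequence are skipped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def length_ratio(atram_contig, reference_based_contig):
    """Y-axis value: aTRAM contig length divided by the length of the
    contig from the reference-based approach."""
    return len(atram_contig) / len(reference_based_contig)

# Made-up sequences for illustration only.
print(p_distance("ACGTACGT", "ACGTTCGA"))   # 2 of 8 sites differ
print(length_ratio("A" * 120, "A" * 100))   # value above 1: aTRAM contig is longer
```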
Results from assembling 1,107 1:1 orthologous genes using aTRAM across different species of lice
| | Years divergent from the reference taxon (millions of years) | Contigs | Contigs passing reciprocal best-BLAST |
|---|---|---|---|
| | 25-30a | 1091 (98.6%) | 1068 (96.5%) |
| | 65-70a | 1089 (98.4%) | 1048 (94.7%) |
| | 65-70a | 1082 (97.7%) | 1031 (93.1%) |
| | 75-80a | 1090 (98.5%) | 1026 (92.7%) |
| | ~110b | 1102 (99.5%) | 1060 (95.8%) |
| | ~110b | 1074 (97.0%) | 1053 (95.1%) |
Years of divergence from the reference taxon, in millions of years, were estimated from a) Light et al. [23] and b) Smith et al. [24]. Contigs is the number of the 1,107 queries for which aTRAM assembled contigs. The final column gives the number of contigs that passed a reciprocal best-BLAST test against the entire Pediculus humanus protein-coding genome.
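A reciprocal best-hit test of the kind mentioned in the table note can be sketched as follows. The exact criteria used in the study are not given here, so this shows the generic form only: a contig and a gene pass when each is the other's highest-scoring hit. The hit tuples, scores, and function names are made up for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.
    hits: iterable of (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return the (contig, gene) pairs where the contig's best hit is the
    gene and the gene's best hit is that same contig."""
    forward = best_hits(forward_hits)   # contig -> best gene
    reverse = best_hits(reverse_hits)   # gene -> best contig
    return {(contig, gene) for contig, gene in forward.items()
            if reverse.get(gene) == contig}

# Hypothetical search results: contigs against genes, then genes against contigs.
forward_hits = [
    ("contig1", "geneA", 200), ("contig1", "geneB", 150),
    ("contig2", "geneB", 90), ("contig2", "geneC", 120),
]
reverse_hits = [
    ("geneA", "contig1", 210), ("geneB", "contig1", 140),
    ("geneC", "contig3", 100), ("geneC", "contig2", 80),
]
print(reciprocal_best_hits(forward_hits, reverse_hits))
```

Here `contig2` fails the test because its best gene, `geneC`, has a better-scoring match to a different contig.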