David Minkley, Michael J Whitney, Song-Han Lin, Marina G Barsky, Chris Kelly, Chris Upton.
Abstract
BACKGROUND: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that a wide array of researchers can access the data. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms has been implemented with a graphical user interface, which prevents their incorporation into a feasible laboratory workflow.
Year: 2014 PMID: 25053142 PMCID: PMC4118789 DOI: 10.1186/1756-0500-7-466
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. STS graphical user interface. Panel A: Input window. Panel B: Output window.
Parameters for tree traversals and display of traversal results
| Parameter | Description |
|---|---|
| LCS occurrences | Gather LCSs occurring for |
| LCS inputs | Gather the single LCS in |
| Number of occurrences | For each number |
| Number of inputs | For each number |
| Sets | Fetch the |
| All singles | Gather a separate result set for each input sequence. |
| All pairs | Gather a separate result set for each possible pair of input sequences. |
| Threshold length | Common substrings must be at least this long to appear in result sets. |
| Display length | Display LCSs only up to and including this length in the result sets. |
| Max results | For the non-standard result sets, restrict table size to this many rows. |
| Reference genome | Gather result sets only for queries involving this input sequence. |
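To make the "All pairs" and "Threshold length" parameters concrete, the following is a minimal sketch that gathers one longest common substring (LCS) for every pair of input sequences and keeps only those that reach a threshold length. It uses a simple in-memory dynamic-programming comparison, not the disk-based suffix tree the tool relies on, and the function names `longest_common_substring` and `all_pairs_lcs` are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest common substring of a and b.

    Plain O(len(a) * len(b)) dynamic programming; an illustration only,
    not the suffix-tree traversal used by the tool.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def all_pairs_lcs(seqs, threshold_length=1):
    """Map each pair of 1-based input numbers to its LCS, if long enough."""
    results = {}
    for (i, a), (j, b) in combinations(enumerate(seqs, start=1), 2):
        lcs = longest_common_substring(a, b)
        if len(lcs) >= threshold_length:
            results[(i, j)] = lcs
    return results


if __name__ == "__main__":
    seqs = ["acgtacgt", "ttacgtaa", "ggggcccc"]
    print(all_pairs_lcs(seqs, threshold_length=4))
```

Raising `threshold_length` drops short, uninformative matches, which is the role the threshold plays in the parameter table above.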
Example input formats for search sequences
| Example input | Meaning | Notes |
|---|---|---|
| acgcgaatccgt | Search all input sequences for the pattern | Multiple patterns are separated with a ";" |
| ac*cg*atccgt | Search all input sequences for the pattern. "*" represents a wildcard, which matches all 4 nucleotides | Multiple patterns are separated with a ";" |
| "2, 12961, 12" | Search input 2 for a 12 nt pattern beginning at position 12961 | Multiple patterns are separated with a ";". Tandem repeats of the pattern may be matched if they exist. |
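The input formats above can be sketched as a small parser plus a naive scan. This is an illustration under stated assumptions, not the tool's implementation: it assumes "*" stands for exactly one nucleotide position, that input numbers and positions are 1-based, and that a position-based query is resolved to the literal pattern found at that location and then searched for in all inputs. The helper names `pattern_to_regex`, `parse_queries`, and `search_all` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str):
    """Compile a pattern where '*' matches any one of a, c, g, t (assumed)."""
    body = "".join("[acgt]" if c == "*" else re.escape(c) for c in pattern.lower())
    # A lookahead with a capture group also reports overlapping matches.
    return re.compile("(?=(" + body + "))")


def parse_queries(text: str, inputs):
    """Split on ';' and resolve 'input, position, length' triples to patterns."""
    patterns = []
    for query in text.split(";"):
        query = query.strip().strip('"').strip()
        if not query:
            continue
        parts = [p.strip() for p in query.split(",")]
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            idx, pos, length = map(int, parts)
            seq = inputs[idx - 1]                            # assumed 1-based input number
            patterns.append(seq[pos - 1:pos - 1 + length])   # assumed 1-based position
        else:
            patterns.append(query)
    return patterns


def search_all(text: str, inputs):
    """Return (pattern, input number, 1-based start) for every match."""
    hits = []
    for pat in parse_queries(text, inputs):
        rx = pattern_to_regex(pat)
        for n, seq in enumerate(inputs, start=1):
            for m in rx.finditer(seq.lower()):
                hits.append((pat, n, m.start() + 1))
    return hits


if __name__ == "__main__":
    inputs = ["ttacgcgaatccgtaa", "acacgtatccgt"]
    for hit in search_all('ac*cg*atccgt; "1, 3, 4"', inputs):
        print(hit)
```

A linear scan like this is only practical for small inputs; the point of an index such as a suffix tree is to avoid rescanning every sequence for each query.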
Sequence summary and resource usage for tree construction and traversal of test datasets
| Dataset | Total length (Mbp) | Tree construction time¹ | Construction rate (kbp/s) | | | Traversal time | |
|---|---|---|---|---|---|---|---|
| 10 random 10 kbp seqs | 0.1 | 154 ms | 649.4 | 1.72 | 4.0 Mb | 21 ms | 2.3 |
| 100 random 10 kbp seqs | 1.0 | 733 ms | 1,364.3 | 1.72 | 39.7 Mb | 150 ms | 22.9 |
| 1,000 random 10 kbp seqs | 10.0 | 9.2 s | 1,084.9 | 1.72 | 397.1 Mb | 3.8 s | 24.0 |
| 10,000 random 10 kbp seqs | 100.0 | 1 m 56 s | 861.3 | 1.72 | 3.9 Gb | 5 m 10 s | 24.1 |
| 100,000 random 10 kbp seqs | 1,000.0 | 29 m 52 s | 558.1 | 1.72 | 38.8 Gb | 7 h 23 m | 25.2 |
| 10,000 random 100 bp seqs | 1.0 | 6.0 s | 166.4 | 1.72 | 111.6 Mb | 1.7 s | 22.0 |
| 10,000 random 1 kbp seqs | 10.0 | 13.7 s | 732.1 | 1.72 | 459.7 Mb | 25.5 s | 24.1 |
| 10,000 random 100 kbp seqs | 1,000.0 | 25 m 42 s | 648.6 | 1.72 | 38.5 Gb | 51 m 22 s | 24.3 |
| 10,000 random 1 Mbp seqs² | 10,000.0 | 12 h 29 m | 222.5 | 1.72 | 384.7 Gb | 9 h 18 m | 24.0 |
| 62 | 310.5 | 21 m 55 s | 235.7 | 1.72 | 11.9 Gb | 2 m 33 s | 24.0 |
| 62 random seqs w/ | 310.5 | 7 m 3 s | 734.0 | 1.72 | 11.9 Gb | 2 m 56 s | 24.0 |
| 4 Chlorella virus genomes | 0.9 | 815 ms | 1,074.5 | 1.72 | 41.8 Mb | 203 ms | 24.0 |
| Human genome (hg38)² | 3,209.3 | 3 h 58 m | 223.7 | 2.07 | 117.2 Gb | 45 m 33 s | 24.1 |
¹ Tree construction time includes the time to both Import sequences and Build the suffix index.
² The "10,000 random 1 Mbp" and human genome datasets used an external USB 2.0 hard drive.
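The column headers of the table above did not survive extraction, but the fourth column appears consistent with a construction rate in kbp/s, that is, the total sequence length (second column, in Mbp) divided by the tree construction time (third column). This reading is an inference from the numbers, not something the source states. The short sketch below checks it on the two rows whose times are given to millisecond precision; the function name is hypothetical.

```python
def construction_rate_kbp_per_s(total_mbp: float, seconds: float) -> float:
    """Assumed definition: total length in kbp divided by construction time in s."""
    return total_mbp * 1000.0 / seconds


if __name__ == "__main__":
    # (dataset, total length in Mbp, construction time in s, value in the table)
    rows = [
        ("10 random 10 kbp seqs", 0.1, 0.154, 649.4),
        ("100 random 10 kbp seqs", 1.0, 0.733, 1364.3),
    ]
    for name, mbp, secs, reported in rows:
        rate = construction_rate_kbp_per_s(mbp, secs)
        print(f"{name}: {rate:.1f} kbp/s (table: {reported})")
```

Rows whose times are rounded to seconds or minutes agree only approximately, as expected when the displayed time has less precision than the value used to compute the rate.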
Effect of mismatches and wild cards on search times
| | Mismatches | Matches | Search time |
|---|---|---|---|
| | 0 | 2 | 461 ms |
| | 1 | 61 | 1.4 s |
| | 2 | 1075 | 1.8 s |
| | 3 | 12543 | 6.8 s |
| | 4 | 98684 | 18.9 s |
| | 0 | 4 | 28 ms |
| | 1 | 234 | 1.4 s |
| | 2 | 3757 | 2.5 s |
| | 3 | 39076 | 8.6 s |
| | 4 | 277422 | 33.0 s |
| | 0 | 30 | 76 ms |
| | 1 | 807 | 5.9 s |
| | 2 | 12777 | 6.4 s |
| | 3 | 118420 | 17.4 s |
| | 4 | 754861 | 84.3 s |
| | 0 | 93 | 94 ms |
| | 1 | 2923 | 32.7 s |
| | 2 | 41372 | 1 m 0 s |
| | 3 | 350292 | 1 m 27 s |
| | 4 | 19708731 | 2 m 31 s |
Searches were performed on trees constructed from a dataset of 10,000 randomly-generated 10 kbp sequences.
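As a rough illustration of why the number of results grows so quickly when mismatches are allowed, the sketch below performs a naive search that accepts any window differing from the pattern in at most a given number of positions. It is a brute-force scan under that assumed definition of a mismatch (a substitution at one position), not the suffix tree search that was timed above, and the function names are hypothetical.

```python
def within_mismatches(a: str, b: str, k: int) -> bool:
    """True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the budget is exceeded
    return True


def search_with_mismatches(pattern: str, sequences, max_mismatches: int):
    """Return (input number, 1-based start) for every acceptable window."""
    hits = []
    m = len(pattern)
    for n, seq in enumerate(sequences, start=1):
        for start in range(len(seq) - m + 1):
            if within_mismatches(pattern, seq[start:start + m], max_mismatches):
                hits.append((n, start + 1))
    return hits


if __name__ == "__main__":
    sequences = ["acgtacgt", "aaaa"]
    for k in range(4):
        print(k, search_with_mismatches("acgt", sequences, k))
```

Each additional allowed mismatch enlarges the set of strings that count as a match, so both the result count and the work needed to enumerate the results increase.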