Matthis Ebel, Giovanna Migliorelli, Mario Stanke.
Abstract
BACKGROUND: Seed finding is an important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments. The seed-finding step is crucial for curbing runtime, because potential alignments are restricted to, and anchored at, the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches only at certain positions.
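The spaced-seed idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern encoding (a string in which `1` marks a position that must match and `0` a position that is ignored), the function names `spaced_key` and `find_seeds`, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def spaced_key(seq, start, pattern):
    """Return the characters of seq that lie under the '1' positions of pattern."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")

def find_seeds(seq_a, seq_b, pattern):
    """Return sorted (i, j) pairs where seq_a at i and seq_b at j agree
    on every '1' position of the pattern ('0' positions are ignored)."""
    span = len(pattern)
    # Index every window of seq_a by its spaced key.
    index = defaultdict(list)
    for i in range(len(seq_a) - span + 1):
        index[spaced_key(seq_a, i, pattern)].append(i)
    # Look up every window of seq_b in that index.
    seeds = []
    for j in range(len(seq_b) - span + 1):
        for i in index.get(spaced_key(seq_b, j, pattern), []):
            seeds.append((i, j))
    return sorted(seeds)

# Hypothetical toy input: pattern of weight 3 (three '1's) and span 4.
pattern = "1101"
seeds = find_seeds("ACGTACGT", "ACCTACGT", pattern)
print(seeds)
```

In this toy input the pair `(0, 0)` is reported although the two sequences differ at their third base, because the pattern ignores that position; a contiguous 4-mer would miss this pair.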
Keywords: Alignment; Alignment anchor; Genome alignment; Geometric hashing; Spaced seeds
Year: 2022 PMID: 35689182 PMCID: PMC9188137 DOI: 10.1186/s12859-022-04745-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Geometric hashing. a Idea of geometric hashing. If seeds a, b and d specify homologous site pairs, they map geometrically to the same tile and support each other, even though they are distant and there are indels between them. Seed candidate c maps geometrically to a tile whose significance falls below the threshold, because no other seeds map to the same tile, and is therefore not reported. b Seeds from the human and mouse glutathione synthetase genes (GSS, Ensembl IDs ENSG00000100983 and ENSMUSG00000027610, respectively). Conserved exons (thick blue bars) are hit by many seed matches (orange lines). All seeds from the gene range were collected in a single tile, despite differing intron lengths between corresponding exons. Edited screenshots from UCSC Genome Browser [47, 48]
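The tile idea in Fig. 1a can be sketched roughly as follows. This is a simplified illustration, not the paper's scoring: it assumes that a tile is determined by the integer-divided position difference of a seed, and it replaces the significance threshold with a plain count of seeds per tile. The names `tile_of`, `filter_by_tile_support`, `tile_size` and `min_support`, and all coordinates, are hypothetical.

```python
from collections import defaultdict

def tile_of(seed, tile_size):
    """Map a seed (i, j) to a tile index via its position difference j - i."""
    i, j = seed
    return (j - i) // tile_size

def filter_by_tile_support(seeds, tile_size, min_support):
    """Keep only seeds whose tile contains at least min_support seeds.

    Assumption: a simple count stands in for the significance score.
    """
    by_tile = defaultdict(list)
    for seed in seeds:
        by_tile[tile_of(seed, tile_size)].append(seed)
    return [s for s in seeds
            if len(by_tile[tile_of(s, tile_size)]) >= min_support]

# Hypothetical coordinates: three seeds with similar offsets (1000, 1050, 1200)
# and one isolated candidate with offset 49700.
candidates = [(100, 1100), (5000, 6050), (90000, 91200), (300, 50000)]
kept = filter_by_tile_support(candidates, tile_size=10000, min_support=2)
print(kept)
```

The first three candidates are far apart along the sequences and their offsets differ slightly, as they would after small indels, yet they share a tile and are kept; the isolated candidate is dropped, mirroring seed candidate c in the figure.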
Fig. 2 Distribution of maximal exon offsets in the test set of human-mouse orthologs. Exon offsets measure the cumulative effect that indels have on the relative positions of exons. For most genes, maximal exon offsets are in the range (blue). The offsets whose absolute value is beyond are marked in orange. The distribution leans to the right, apparently because there are more transposable elements inserted in intronic regions in the human lineage. Outliers are not shown. See main text for the definition of
Fig. 3 Comparison of the different methods. The y-axis shows, on a logarithmic scale, the number of false positive seeds, scaled to estimate the total number of false positive seeds per base in the genome if two complete human-sized genomes were compared ( means ). The x-axis is the percentage of coding exons that are supported by seeds. Each data point represents a run of the respective method with a certain weight k (point labels), ranging from 12 (top right points) to 24 (bottom left). Note that data points at 0 are slightly shifted for better visibility. The filter that considers at most 10 seed candidates per k-mer was relaxed to 100 for all weight 12 runs
Runtime and memory requirements of comparable runs of M3 (multiple spaced seed patterns), M4 (neighbouring matches) and M5 (geometric hashing)
| Method | Weight | Patterns | Sensitivity | Additional runtime (s) | |
|---|---|---|---|---|---|
| M3 | 15 | 4 | 0.954 | 12 | – |
| M4 | 15 | 4 | 0.843 | 86 | |
| M5 | 15 | 4 | 0.954 | 0 | 300 |
Fig. 4 Comparison of minimal achievable FP count when sensitivity is required to be at least 0.9. Geometric hashing is the only method that does not report any false positives in our dataset at this sensitivity threshold. The numbers on top of the bars denote the weight of the seed or spaced seed pattern(s). On the y-axis k and M abbreviate and , respectively
Runtime and memory requirements of the methods when run with best weight as determined by Fig. 4
| Method | Weight | Patterns | Sensitivity | | Runtime (min) | Memory (GB) |
|---|---|---|---|---|---|---|
| M1 | 13 | 1 | 0.909 | 34.3 | 32 | 38 |
| M2 | 15 | 1 | 0.914 | 2.99 | 17 | 26 |
| M3 | 17 | 2 | 0.906 | 0.375 | 38 | 64 |
| M3 | 18 | 4 | 0.909 | 0.0116 | 87 | 156 |
| M4 | 12 | 4 | 0.928 | 0.349 | 194 | 80 |
| M5 | 15 | 4 | 0.955 | 0 | 75 | 137 |
The weights were chosen such that each method had the lowest possible but a sensitivity of at least 0.9. Note that the runtime and memory for M4 cannot be directly compared to the other methods as we needed to split the data into batches and do multiple sequential runs because of memory limitations
Comparison of YASS [27] and geometric hashing using the two-step approach (“Two-step geometric hashing” section) on the same dataset as used in Fig. 3. Sensitivity is the fraction of human coding exons covered by an alignment for YASS or by a seed for geometric hashing
| Method | Sensitivity | FP | Runtime (min) | Memory |
|---|---|---|---|---|
| YASS | 0.9931 | 7 | 107 | 590 MB |
| Geometric hashing | 0.9979 | 0 | 49 | 28 GB |
FP denotes the number of alignments between randomly generated sequences in YASS and the number of such seeds for geometric hashing