Jingwen Ren1, Mark J P Chaisson1. 1. Department of Quantitative and Computational Biology (QCB), University of Southern California, Los Angeles, California, the United States of America.
Abstract
It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1kb and when discovering variation from ONT reads, while having runtime that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and github (https://github.com/ChaissonLab/LRA).
It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1kb and when discovering variation from ONT reads, while having runtime that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and github (https://github.com/ChaissonLab/LRA).
This is a PLOS Computational Biology Methods paper.
Introduction
Studies of genetic variation often begin by aligning sequences from a sample back to a reference genome, and inferring variation as differences in the alignment. Long read, single molecule sequencing (LRS) is becoming established as a routine approach for sequencing genomes. The two technologies that produce LRS technologies, Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) generate reads over 50kb at error rate 15% or less. Aligning these sequences is a computationally challenging task for which several methods are available including minimap2, ngmlr, and BLASR [1-3]. They are demonstrated to be quite fast and accurate, but have limitations, particularly when there are large sequence differences between the read and the reference. This problem is amplified in complex, repetitive regions such as variable-number tandem repeats, that only make up 3% of the human genome, but account for nearly 70% of observed structural variation: insertions and deletions at least 50 bases (SV), and in larger SV [4].A common approach for mapping LRS reads is seeding and chaining, where an approximate alignment is formed based on a subset (chain) of exact matches between the sequence and the genome. The exact matches may be found using various data structures including variable-length matches using a BWT-FM index or suffix arrays [3], and minimizer based indexing [1]. The chaining algorithm used by BLASR uses a linear cost gap function for sparse-dynamic programming (SDP) [5]. While the chaining algorithm is efficient, it has long been known that linear-cost gap functions do not accurately reflect biological variation [6], and has been shown to decrease sensitivity for detecting SV from LRS alignments [2].Both the minimap2 and ngmlr aligners use gap penalties that are a concave function of gap length, and are demonstrated to be quite effective for mapping LRS reads across SV with alignment gaps that reflect biological variation. In minimap2, a heuristic algorithm is used for chaining, while ngmlr adopts a Smith-Waterman-like dynamic programming algorithm. An exact solution to sparse dynamic programming with a concave gap (CG-SDP) function exists [7], and is slightly less efficient than linear-cost SDP. However, as presented, the algorithm requires asynchronous processing and has never been implemented for sequence alignment in computational biology. Furthermore, the algorithm does not take into account genome arrangements such as inversions.An additional challenge for variant discovery is efficient alignment of contigs assembled from LRS that now have contiguity on par with the initial release of the human genome [8-10]. Existing methods exist to align whole genomes, but do not implement concave gap penalties [11], are low-resolution [12], or split alignments across large variants [1]. To explore the application of the exact solution of seed chaining sparse dynamic programming with a concave gap function to read and assembly alignment, we developed an alignment method—lra. We have simplified the implementation of the CG-SDP algorithm [7], and extended to allow for inversions and translocations. Finally, we demonstrate that this approach can improve SV discovery while having a runtime on par with state of the art methods.
Results
We compared alignment metrics and variant discovery on simulated data sampling from the human genome build GRCh38, and real sequencing data from HG002 including three sequencing datasets: PacBio consensus reads (HiFi), PacBio single-pass reads (CLR), and ONT reads, as well as on contigs from a haplotype-resolved de novo assembly [10]. The simulated data were mapped to GRCh38 to evaluate mapping accuracy with a genome that includes centromeric sequences. All real data were mapped to build GRCh37 to use curated variants for accuracy analysis [13]. The PacBio HiFi data are characterized by high-accuracy reads (>99%) with an average length of 19kb, compared to the CLR reads that have an accuracy around 80% and an average read length of 21kb. The ONT data: 88% accuracy and 1.2kb average read length. The distribution of the read lengths are shown in S1 and S2 Figs. All read datasets are above 40× coverage of a human genome.The alignment results and structural variant callsets were compared to those generated by minimap2 and ngmlr. When computing SAM formatted alignments with minimap2, the lra and minimap2 runtime are comparable. lra and minimap2 reach similar speed on HiFi and ONT data, although all are within a factor of 1.12. minimap2 is 3.7 times faster than lra on CLR dataset (Table 1). lra is 4–6 times faster than ngmlr. The difference of bases aligned by lra and minimap2 are within 0.06–1%. We compared the mapping accuracy of lra, minimap2 and ngmlr on simulated HiFi, CLR data for read lengths between 5–50kb and ONT data for read lengths between 1–50kb. HiFi and CLR reads were simulated by PBSIM (https://github.com/pfaucon/PBSIM-PacBio-Simulator) and ONT data were simulated by alchemy2, which is distributed with lra source. We used paftools.js mapeval to evaluate the mapping accuracy across three overlap percentages (10%, 40%, 70%) to measure the change of mapping accuracy with overlap percentage evaluation metric. The mapping accuracy of lra for reads with mapping quality at least 20 is 99.89%, 99.97% and 99.97% for HiFi, CLR and ONT respectively, when a 40% overlap with the simulated interval is required for a correct alignment (Fig 1). As the overlap percentage goes up from 10% to 70%, the mapping accuracy of lra alignment decreases 0.1%, 0.09%, 0.03% for HiFi, CLR and ONT at mapqv 20 due to differences in lengths of sequences aligned in repetitive regions. In lra, a second local minimizer index is used to refine chained anchors and improve alignments (Methods). We evaluated the mapping accuracy of lra without the step of refining by local minimizers (S1 Text). For simulated HiFi data, the mapping accuracy has almost no difference on all 10%, 40%, 70% overlap percentage metrics, while mapping accuracy at the 70% dropped below 0.90 and 0.97 for simulated CLR and ONT reads, respectively.
Table 1
Performance of alignment methods.
HG002 HiFi
HG002 CLR
HG002 ONT
HG002 HiFi hifiasm hap1
HG002 HiFi hifiasm hap2
lra
runtime
942m
7428m
1528m
105m
90m
memory
12.01G
16.46G
13.96G
31.10G
28.66G
# of mapped reads
1.84M
4.95M
1.37M
396
411
# of mapped bases
35.85Gb
86.82Gb
25.45Gb
2.87Gb
2.98Gb
minimap2
runtime
890m
1973m
1358m
100m
82m
memory
19.04G
18.49G
22.88G
20.95G
21.90G
# of mapped reads
1.85M
5.06M
1.41M
492
496
# of mapped bases
36.20Gb
87.36Gb
25.73Gb
2.92Gb
3.01Gb
ngmlr
runtime
5087m
33047m
8862m
-
-
memory
13.6G
14.67G
17.00G
-
-
# of mapped reads
1.78M
4.84M
2.07M
-
-
# of mapped bases
34.59Gb
83.75Gb
24.22Gb
-
-
Each dataset was aligned to GRCh37 by all methods with 16 threads, with—indicating a software crash. Total CPU time is reported. The optimal values in each class are given in bold. The HiFi hifiasm assembly is a haplotype-resolved de novo assembly of HG002 using HiFi reads, with haplotype N50 values of 50.55Mb and 42.92Mb.
Fig 1
Mapping accuracy.
Mapping accuracy of lra, minimap2 and ngmlr on simulated HiFi, CLR reads for lengths between 5–50kb and ONT reads for lengths between 1–50kb. Simulated reads were mapped to genome GRCh38. A read is considered as correctly mapping if the reported mapping interval has ≥ 10%, 40%, 70% overlap with the truth interval. paftools.js mapeval was used to evaluate the mapping accuracy.
Mapping accuracy.
Mapping accuracy of lra, minimap2 and ngmlr on simulated HiFi, CLR reads for lengths between 5–50kb and ONT reads for lengths between 1–50kb. Simulated reads were mapped to genome GRCh38. A read is considered as correctly mapping if the reported mapping interval has ≥ 10%, 40%, 70% overlap with the truth interval. paftools.js mapeval was used to evaluate the mapping accuracy.Each dataset was aligned to GRCh37 by all methods with 16 threads, with—indicating a software crash. Total CPU time is reported. The optimal values in each class are given in bold. The HiFi hifiasm assembly is a haplotype-resolved de novo assembly of HG002 using HiFi reads, with haplotype N50 values of 50.55Mb and 42.92Mb.A common application of LRS in human genetics is detecting structural variation [2, 14]. We evaluated SV detection from mapped reads across combinations of technologies, aligners, and SV calling algorithms on both simulated datasets and real dataset. Reads were aligned with lra, pbmm2/minimap2, and ngmlr, and variants were detected with pbsv, Sniffles and cuteSV [2, 15, 16]. The pbmm2 method encapsulates minimap2 with technology and SV discovery specific parameters. The pbsv method did not run using ONT reads, and so pbsv analysis of ONT data is omitted. In the simulation study, we used SUVIVOR (https://github.com/fritzsedlazeck/SURVIVOR) to simulate insertions, deletions, inversions and more complicated nested SVs, including deletion-inversion-deletions (INVDEL) and inverted duplications (INVDUP). We simulated 195 insertions and deletions of lengths between 50–10000 bases, 97 inversions of lengths between 600–2000 bases and 100 INVDEL and INVDUP of lengths between 600–1000 bases respectively using the human genome GRCh37 chromosomes 20–22 as a reference. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2. Truvari (https://github.com/spiralgenetics/truvari) was used to benchmark the insertions, deletions and inversions callset. Both pbsv and cuteSV calling results show similar F1 score (>= 0.974, 0.979, 0.974) for lra, minimap2 and ngmlr alignments on insertions and deletions over all data types. All three aligners achieved comparable F1 scores (>= 0.97) on inversions from cuteSV calling results over all data types (S3 Table). There was a lower F1 score for variants detected by Sniffles on lra alignment of the simulated CLR reads, which is inconsistent with other two callers. Complicated nested SVs like INVDEL presented a greater challenge, particularly for lower accuracy reads. Using simulated HiFi reads, 98 of the INVDEL and 94 of the INVDUP were correctly called by lra, with all alignment methods detecting at least 93 variants in each data set. The majority of the INVDEL variants detected by lra alignments using CLR (79) data missed one deletion, while variant calls from minimap2 and ngmlr correctly detect 94 and 74 INVDEL variants, respectively. A different pattern existed for ONT data, where 1–2 correct calls were detected from minimap2/ngmlr, and 37 by lra. The INVDUP calls were successfully detected from lra and ngmlr alignments, and represented as insertions by minimap2. The lra alignments resulted in 99 and 100 calls from the CLR and ONT simulated reads, and ngmlr 98 calls from both datasets (S4 Table).We used pbsv and Sniffles to detect variation on the HiFi, CLR, and ONT data; the cuteSV variant calls had lower accuracy on PacBio data across multiple alignment methods (S7 Table) and were excluded from analysis. The majority of SVs are under 500 bases [14] and are spanned by LRS alignments. Consequently, variant calls with similar representations of gaps should have converging callsets, which may be confirmed by comparing SV calls from different combinations of algorithms. The most consistent variant calling was found using pbsv on both types of PacBio using lra and pbmm2 alignments. These callsets ranged from 21,439–22,325 SVs, with a difference of 2.6–3.3% between alignment algorithms and 0.7–1.5% between data types (Table 2). Compared to the pbsv calls, there was greater variation between callsets generated using Sniffles across both data types and alignment algorithms, with a range of 16,689–27,001 SVs per callset. There was less variation between the size of callsets of different data types and the same alignment algorithm (average difference of 624 across pbsv callsets), compared to different alignment algorithms applied to the same data (an average difference of 2,290 across pbsv callsets). The differences between callsets not due to gap placement were broadly measured by comparing callsets with a low-stringency filter: within 50% of length and at most 1kb apart. Using this definition of overlap, the callsets from PacBio data using pbsv were most similar with no more than 20% of calls unique to an alignment method. When stratifying PacBio variants unique to a dataset overlap with genomic features, 29–46% of variants overlap tandem repeats, and 15–56% overlap segmental duplications (SD) (S2 Table). This indicates that calls in tandem repeats may have different breakpoints and are difficult to compare using standard overlap approaches [13].
Table 2
Statistics of structural variant call sets.
lra
pbmm2
ngmlr
Number of calls
Average call size
Total size
Number of calls
Average call size
Total size
Number of calls
Average call size
Total size
pbsv
CLR
INS
13177
560
7.38M
12828
523
6.72M
11416
537
6.14M
DEL
9148
688
6.30M
8926
603
5.39M
8569
594
5.10M
HiFi
INS
13133
577
7.58M
12655
488
6.19M
10342
506
5.24M
DEL
9037
619
5.60M
8784
583
5.12M
8445
575
4.86M
Sniffles
CLR
INS
16377
1074
17.59M
11592
785
9.10M
8318
600
5.00M
DEL
10624
820
8.71M
9195
578
5.32M
8371
739
6.19M
HiFi
INS
12004
805
9.67M
10902
459
5.01M
9716
436
4.24M
DEL
8883
1073
9.54M
7907
430
3.40M
7698
672
5.18M
ONT
INS
13052
1084
14.15M
12103
488
5.91M
10314
476
4.91M
DEL
10129
752
7.63M
9238
550
5.08M
8286
637
5.28M
Summary of SV callsets from different aligners (lra/pbmm2/ngmlr) + SV callers (pbsv/Sniffles).
Summary of SV callsets from different aligners (lra/pbmm2/ngmlr) + SV callers (pbsv/Sniffles).A greater call count may reflect more sensitive detection of SV, or simply fragmentation of variants. To assess the accuracy of variant callsets, we compared callsets against the GIAB Tier 1 calls [13] using Truvari. The lra and pbmm2 based callsets outperform ngmlr based calls in precision and recall on all data types (Table 3 and Fig 2). While the F1 scores are effectively equivalent between lra and pbmm2 based variant calls on HiFi (0.970 vs 0.968) and CLR (0.967 vs 0.967) using pbsv, there is a substantial improvement on calls on ONT data made by Sniffles (0.942 vs 0.910). This indicates that the greater call count on ONT data includes is at least partially attributed to increasing recall, particularly in larger (>500 base) insertions, without affecting precision on the high-quality regions that were ascertained.
Table 3
Truvari classification of variant calls.
pbsv
Sniffles
HiFi
CLR
HiFi
CLR
ONT
lra
pbmm2
ngmlr
lra
pbmm2
ngmlr
lra
minimap2
ngmlr
lra
minimap2
ngmlr
lra
minimap2
ngmlr
TP base
9466
9408
8433
9438
9414
9076
8924
8700
8015
8969
8082
6533
9085
8887
8471
TP call
9506
9449
8500
9516
9507
9208
8964
8713
8028
9101
8147
6567
9102
8902
8488
FP
404
390
396
448
418
428
392
357
800
2023
1492
1298
550
940
815
FN
175
233
1208
203
227
565
717
941
1626
672
1559
3108
556
754
1170
s/FN
108
93
176
169
151
268
573
598
573
629
1423
2387
477
464
410
precision
0.959
0.96
0.955
0.955
0.958
0.956
0.958
0.961
0.909
0.818
0.845
0.835
0.943
0.904
0.912
recall
0.982
0.976
0.875
0.979
0.976
0.941
0.926
0.902
0.831
0.930
0.838
0.678
0.942
0.922
0.879
F1 score
0.970
0.968
0.913
0.967
0.967
0.952
0.942
0.931
0.869
0.871
0.842
0.748
0.943
0.91
0.895
Truvari comparisons between lra, pbmm2/minimap2 and ngmlr using the Genome in a Bottle benchmark SV set. Optimal results in each category are shown in bold. TP-base means true positive calls in the benchmark SV curation set, while TP-call means true positive calls in the SV set from each aligner. False positive means the number of non-matching calls from the SV set from each aligner. False negative means the number of non-matching calls from the SV curation set.
Fig 2
The precision and recall of HG002 SV call sets.
The precision and recall of SV call sets from HG002 HiFi, CLR, ONT, hifiasm assembly, measured by Truvari using Genome in a Bottle benchmark SV callsets.
The precision and recall of HG002 SV call sets.
The precision and recall of SV call sets from HG002 HiFi, CLR, ONT, hifiasm assembly, measured by Truvari using Genome in a Bottle benchmark SV callsets.Truvari comparisons between lra, pbmm2/minimap2 and ngmlr using the Genome in a Bottle benchmark SV set. Optimal results in each category are shown in bold. TP-base means true positive calls in the benchmark SV curation set, while TP-call means true positive calls in the SV set from each aligner. False positive means the number of non-matching calls from the SV set from each aligner. False negative means the number of non-matching calls from the SV curation set.While the F1 scores are nearly equivalent on PacBio data, the combination of algorithms may contribute to a more complete evaluation of an LRS genome. Calls from the lra and pbmm2 alignments contribute 83 and 43 unique calls, respectively, that were annotated as true positives in the HiFi/pbsv call sets, and 134, 133 unique calls from the CLR/pbsv call sets. The average length of the uniquely called true positive SV among these callsets is 1986bp, highlighting the challenge of detecting longer SV from read based alignments. To further assess the potential for callsets to be more comprehensive, benchmark variants annotated as false-negatives were compared to variants discovered directly from read alignments. Each call where at least 20% of the reads overlapping the call had an SV of the same type with the ratio of lengths between 0.6–1.4 was annotated as a supported false-negative (s/FN). Between 14.5%-83.2% of annotated false negatives from pbsv variant callsets, and 35.0%-93.6% of false-negative calls in Sniffles callsets were considered s/FN. Furthermore, there were 13% more supported false-negative calls in the PacBio lra/pbsv callsets compared to the pbmm2/pbsv callsets, possibly the result of tuning the pbsv variant discovery algorithm to the input generated by a particular alignment algorithm.The accuracy of breakpoints was measured by comparing the boundaries of true positive SVs to the breakpoints of the GIAB truth data for each aligner/caller combination. We measured the percentage of SVs with perfect breakpoint boundaries and the average shifting distance between the left-most coordinate of SV boundaries. The average shifting distances of lra is 53bp across all data type and SV calling algorithms, compared to an average shifting distance of 46bp across all aligner/data/SV calling methods. The percentages of SVs with perfect breakpoints called from lra alignments ranged between 31–64% across all data and SV calling algorithms (S5 Table), which indicates in addition to alignment, the accuracy of breakpoints also depends on SV caller.A comparison of callsets highlights how detecting inversions and SV in segmental duplications remains challenging. Inversion calls were generated by Sniffles. We filtered calls that overlapped the centromere. Each resulting callset was manually curated using dot-plots of the genome assemblies against the reference. Calls were classified as true positives if at least one haplotype demonstrated the inversion, an inverted duplication if the dot-plot signature was a fixed inverted duplication or inverted transposition, false-positive if both haplotypes did not show an inversion or an inverted duplication, and NA if not possible to validate with the assembly, commonly in pericentric regions. Across all data types, ngmlr based alignments discovered the most inversions and inverted duplications annotated as true-positives, with calls ranging from 8077–44,538 bases (Table 4). On average, 75 inversions and inverted duplicatoins were detected using ngmlr alignments, versus 61 for both minimap2 and lra, indicating that additional development may be required to accurately detect rearranged DNA with minimal computational burden.
Table 4
Classification of inversions detected in HG002 using PacBio and ONT reads.
Dataset
Aligner
TP
Dup
NA
FP
HiFi
lra
30
21
8
26
minimap2
59
4
4
8
ngmlr
58
17
2
16
CLR
lra
65
9
9
36
minimap2
69
1
9
20
ngmlr
66
17
9
17
ONT
lra
53
7
7
18
minimap2
44
7
1
6
ngmlr
58
10
3
9
TP, true positive using manual curation. Dup, inverted duplication misclassified as an inversion on both haplotypes, NA not possible to manually curate, FP—no signature of inversion in read nor assembled haplotype dot-plots.
TP, true positive using manual curation. Dup, inverted duplication misclassified as an inversion on both haplotypes, NA not possible to manually curate, FP—no signature of inversion in read nor assembled haplotype dot-plots.When sequencing depth is sufficiently high coverage, comprehensive haplotype-resolved de novo assemblies can be used to detect variation instead of read alignments. Both lra and minimap2 were used to align contigs of a haplotype-resolved de novo assembly of HG002 constructed by the hifiasm assembler [10]. The haplotype assemblies resolve 2.97Gb and 3.07Gb, with N50 values 50.55 and 42.92 Mb. Both alignment methods have similar runtime to align de novo assembly contigs to GRCh37, although lra requires 1.5-fold the memory. Over 95% of each haplotype assembly was mapped by each method; 2.92/2.97 Gb were mapped by minimap2, and 2.87/2.97Gb mapped by lra. It was not possible to use ngmlr to align contigs. Variants were detected from the lra and minimap2 alignments using htsbox pileup -cuf. Calls from different haplotypes that overlap at least 30% were determined as homozygous calls and the rest were classified as heterozygous calls. Additionally, calls in centromere regions were removed for both minimap2 and lra callsets. This generated 25,982 calls by lra, and 26,341 by minimap2. Truvari analysis of the calls gives an F1 score of 0.955 by lra calls, and 0.933 for minimap2 calls (Table 5). There are 233 calls from the true-positive annotations of the lra HiFi/pbsv callset missing from the lra assembly callset. Upon inspection, the majority of these missed calls are due to a combination of the process used to merge haplotype calls into a diploid callset, shifted breakpoints relative to the GIAB annotation, and haplotype dropout/swapping of de novo assembly sequences.
Table 5
Assembly based calls comparison.
HG002 HiFi hifiasm assembly
lra
minimap2
Insertion, hom
6276
6494
Insertion, het
8852
8783
Deletion, hom
3518
3895
Deletion, het
7336
7169
TP base
9310
9035
TP call
9412
9062
FP
546
692
FN
331
606
precision
0.945
0.929
recall
0.966
0.937
F1 score
0.955
0.933
htsbox pileup -cuf is used to call SVs from contig lra and minimap2 alignment. Calls from lra and minimap2 assembly alignments are classified as homozygous and heterozygous by comparing calls from two haplotypes. Truvari comparison results between these two callsets are shown.
htsbox pileup -cuf is used to call SVs from contig lra and minimap2 alignment. Calls from lra and minimap2 assembly alignments are classified as homozygous and heterozygous by comparing calls from two haplotypes. Truvari comparison results between these two callsets are shown.Diversity studies are increasingly using SMS de novo assemblies to measure variation [17, 18] rather than from sequencing reads. An integrated analysis of the regions of the genome spanned by assembly alignments and the raw-read base calls from multiple alignment and SV discovery methods can help reveal the extent that variation is discovered and potentially missing from a high-quality haplotype-resolved de novo assembly. Across all read technologies, alignment algorithms, and variant discovery methods, 167 unique calls representing 0.6% of an average read-based SV callset were in regions not mapped by either haplotype of the assembly. Of these, 113 are in segmental duplications (SD). Because SDs make up roughly 5% of the human genome, the majority of SD regions that are mapped by SMS reads in this study are assembled in the hifiasm genome. The lra alignments detect 53 and 60 large SV (>20kb) totaling over 3Mb from the haplotype 1 and haplotype 2 assemblies that are not found in the read SV callsets (S2 Text). Because the length of SV negatively correlates with allele frequency, and is associated with enrichment in genome-wide association studies significant loci [19], these larger variants represent a class of variation that is biologically important for discovery.The high-confidence callset used in Truvari analysis contains fewer than 10,000 SVs, less than half what are expected to be found in a human genome. To gauge the specificity of assembly-based calls outside this set, SV contained in read alignments were compared to the SV discovered from assemblies. An SV detected from an individual read supported an assembly based SV if the call was the same type (e.g. insertion or deletion), was within 1kb, and the ratio of SV lengths was between 0.5–2. When using PacBio CLR aligned reads, nearly all (99.998–100%) of SV annotated by Truvari as true-positives had at least four supporting reads aligned by lra or minimap2. Using this approach, 98.3% of lra deletion SV, and 98.7% of minimap2 deletion SV are supported by reads, and 99.1%/99.7% of insertion SV from lra/minimap2, are supported by reads (Table 6). This is greater than the precision measured by Truvari on assembly-based calls: 0.955 for lra, and 0.933 for minimap2, indicating an underestimate of the callset precision. When inspecting the SV calls that are not supported by read alignments, many are due to differential placement of gaps causing disagreement of SV length between the read and assembly-based calls.
Table 6
Read based support of assembly calls.
Assembly
SV type
total
filtered
supported
fraction
Haplotype 1 lra
DEL
6101
5874
5767
0.982
Haplotype 1 minimap2
DEL
6078
5924
5841
0.986
Haplotype 2 lra
DEL
6287
6061
5963
0.984
Haplotype 2 minimap2
DEL
6267
6107
6033
0.989
Haplotype 1 lra
INS
9736
9354
9275
0.992
Haplotype 1 minimap2
INS
9882
9619
9587
0.997
Haplotype 2 lra
INS
10032
9651
9568
0.991
Haplotype 2 minimap2
INS
10183
9890
9859
0.997
The two haplotype assemblies are considered separately to avoid complications of merging into a diploid callset. Calls are produced by lra and minimap2, with only deletion and insertion SV classes considered. The total calls are the original calls produced by each method, and filtered are calls excluding centromeres and 50kb of flanking sequence. The supported SV have at least four reads supporting the call from either lra or minimap2 alignments.
The two haplotype assemblies are considered separately to avoid complications of merging into a diploid callset. Calls are produced by lra and minimap2, with only deletion and insertion SV classes considered. The total calls are the original calls produced by each method, and filtered are calls excluding centromeres and 50kb of flanking sequence. The supported SV have at least four reads supporting the call from either lra or minimap2 alignments.
Materials and methods
The alignment of reads and contigs to a reference are generally defined by the maximally scoring local alignment of a query q to a set of target sequences collectively referred to as a target t with a match bonus and penalties for mismatch, gaps, and inversions/translocations. lra uses a heuristic to find an approximate local alignment employing the commonly used seeding, chaining, and refinement approach that has been applied to all scales of alignment from short-read, long-read, and whole-genome alignment [3, 20, 21]. Each alignment proceeds in four broad steps: seed sequence matching, clustering, chaining, and finally alignment refinement.Many recent advances have been made in sequence mapping using a subsampled index on a reference using minimizers or locally sensitive hashing [22, 23]. A minimizer index is parameterized by a k-mer size k and window size w, and indexes the position of the lexicographically least canonical k-mer in every sequence of length w across the genome. We develop a variant on the approach of minimizers that uses adaptive thresholds to limit the total number of positions sampled in unique regions of a genome, and increase the sampling of positions near paralog-specific variants that distinguish repetitive regions.Optimal sets of matches between the query and reference are selected in two phases using CG-SDP. Given a set of fragments Φ = {α1, …, α}, each fragment is defined by a start point and an end point on a Cartesian plane and a weight: . The basic CG-SDP formulation is to define a chain of fragments C ⊂ [1, …, n] that maximizes
such that is above and to the right of , and gap is a concave function. An algorithm was presented for chaining using a concave gap cost model [7], however there are no alignment methods that implement this approach. This method has been implemented in lra and extended to allow for inversions in the optimal chain. Additionally, the original description of the algorithm requires asynchronous processing, which we have updated to use standard serial computation. The details of determining minimizers and the chaining algorithm are given below.
Building a minimizer index and sequence matching
We add three additional parameters to generate a minimzer index: F, N and W that limit the density of minimizers that are selected. An initial set of minimizers is determined in the standard approach, with minimizer k-mer parameter k and window w [22]. Next, minimizers of multiplicity larger than F are removed. Then the reference is partitioned into intervals of length W, and the remaining minimizers starting in each interval are selected in order of their multiplicity in the genome until the first N minimizers are obtained, in a more simplified version of weighted minimizer sampling [24]. Different parameter settings are used to index a genome for different sequencing technologies and contig alignments. When indexing the genome for aligning HiFi reads, 867M minimizers from total 1015M were left after filtering by frequency threshold F. In total, 117M minimizers were selected after the final filtering based on N and W. All minimizers from a query sequence are matched against the filtered set of minimizers from the reference. The result of the sequence matching is a set of anchors A = {β1, …, β}, where β is a tuple (x, y, k), where q[x, x, …, x] of the query matches t[y, y, …, y] of the target.The chains of anchors for low-accuracy single-pass SMS reads may be sparse or have spurious off-diagonal matches in repetitive sequences. The detailed alignment may be calculated using dynamic programming within the region that is chained, however to limit the computation required for refining the alignment we implement an additional index of local minimizers that are used to refine an alignment once a coarse-chaining has been done. The local index is parameterized by k < k,w < w, and a tiling length W, and is composed of a collection of separate minimizer indexes for sequence intervals of length W tiling across both the query and the target.
Clustering
Although CG-SDP can be applied to all anchors A, for efficiency a greedy approach is used to cluster anchors that would likely be together on an optimal chain. These clusters may be used to filter out spurious matches in low accuracy reads, or may be chained directly on high accuracy reads and contig mapping. When forming alignments from chained clusters, it is necessary to have a cluster refining step that divides rough clusters into non-overlapping fine clusters to avoid chaining that skips biological variation in repetitive sequences.
Rough clustering
Rough clustering partitions anchors into clusters representing approximate intervals on the query and target that are aligned (Fig 3a), and serves to exclude noisy anchors unlikely to be chained in an alignment by CG-SDP. Denoting the forward diagonal of each anchor β as f = y − x, and the reverse diagonal r = x + y, a sorted anchor order O = [o1, …, o] is defined by ordering anchors by forward diagonal and then x coordinate. A reverse sorted order is similarly defined sorting on reverse diagonal and x coordinate. This will be used to detect alignments on the reverse strand, but because the operations are the same as on the forward strand, only subsequent steps using the forward sorted order are given. The set of rough clusters is defined by partitioning O into non-overlapping intervals such that every anchor indexed in an interval has a diagonal within D of the preceding anchor in the interval. Intervals are greedily assigned with first interval starting at the first index in O, and subsequent intervals starting on after a gap of more than D between anchors. Intervals with few elements (defined by a minClusterSize parameter) are discarded, and the rough clusters R = {R1, …, R} are defined from the set of anchors included in each interval. The value of D is chosen so that rough clusters are likely to contain at least minClusterSize true anchors from a read (default to 3 for CLR and ONT, and 10 for HiFi/contigs). For CLR and ONT data, we empirically determined a sufficient D is 200, and 150 for HiFi and assembly contig alignment. For low accuracy reads, chains are formed by running CG-SDP on all matches retained in rough clustering. For high accuracy reads, the clusters must be post-processed with fine-clustering prior to being chained.
Fig 3
Example of clustering anchors prior to CG-SDP.
a, Two rough clusters (blue and orange), which are far from each other on the reference. b, The initial four fine clusters defined from the contiguous stretches of unique anchors. c, fine cluster-1 and fine cluster-2 are merged because their diagonal difference is smaller than D and projected distances between their endpoints is smaller than G. fine cluster-3 and fine cluster-4 are not merged due to the large diagonal difference. d, Non-unique anchors in the trapezoid between fine cluster-1 and fine cluster-2 are added to the merged fine cluster-1, along with non-unique anchors in the trapezoid defined by the start of rough cluster-2 and the start of fine cluster-3. e, Three fine clusters are obtained after rough clustering and fine clustering. f-h, Splitting of overlapping fine-clusters: f, overlap of clusters. g, Boundaries of split clusters defined by a start (red dot) and an end (blue dot). h, the optimal chain of split super-fragments.
Example of clustering anchors prior to CG-SDP.
a, Two rough clusters (blue and orange), which are far from each other on the reference. b, The initial four fine clusters defined from the contiguous stretches of unique anchors. c, fine cluster-1 and fine cluster-2 are merged because their diagonal difference is smaller than D and projected distances between their endpoints is smaller than G. fine cluster-3 and fine cluster-4 are not merged due to the large diagonal difference. d, Non-unique anchors in the trapezoid between fine cluster-1 and fine cluster-2 are added to the merged fine cluster-1, along with non-unique anchors in the trapezoid defined by the start of rough cluster-2 and the start of fine cluster-3. e, Three fine clusters are obtained after rough clustering and fine clustering. f-h, Splitting of overlapping fine-clusters: f, overlap of clusters. g, Boundaries of split clusters defined by a start (red dot) and an end (blue dot). h, the optimal chain of split super-fragments.
Fine clustering
For mapping hich accuracy reads, each rough cluster is processed independently by dividing into non overlapping fine clusters, where each fine cluster consists of anchors on a close diagonal D, with endpoints that do not overlap. The first step of CG-SDP will be applied to chain the fine clusters and find an approximate alignment between q and t. Each fine cluster C is defined by all of the anchors contained in the cluster, and endpoints and , where , are the minimum x, y coordinates of the starting points of all the anchors in C, and , are the maximum x, y coordinates of the ending points of all the anchors in C. To define the fine clusters, anchors in each rough cluster are first sorted by Cartesian coordinate. Within each rough cluster, an anchor is defined as unique when the k-mer of the match is not repeated in the cluster. Fine clusters are initialized as runs of unique anchors in the Cartesian ordering that are on a close diagonal, and the distance between the end of one anchor and the start of the next is small (Fig 3b). Every pair of fine clusters C and C are merged if their endpoints have diagonal differences smaller than D and are within Cartesian distance G (Fig 3c), and all non-unique anchors within the trapezoid defined by are included into the merged fine cluster (Fig 3d and 3e). The remaining non-unique anchors that are not added into the fine cluster are discarded. In the step of fine clustering, we empirically found D = 500 was able to distinguish clustering anchors in different tandem repeats and allowed a sufficient number of repetitive anchors to be included in the fine clusters.
Cluster splitting and chaining
Each fine cluster C ∈ C defines a super-fragment F that unlike the minimizer fragments which have starting and endpoints along the same diagonal, the endpoints of C and may not be along a diagonal. The set of all such fragments is F, and an optimal chain of fragments will be defined using CG-SDP. However, considering the rectangles defined by the boundaries of each fragment, the CG-SDP algorithm only selects fragments with non-overlapping rectangles in the optimal chain. Due to the repetitive nature of genomes, this may result in erroneously skipped fragments. To account for this, fine clusters are split at overlapping boundaries (Fig 3g). The start and end coordinates of all fine clusters are stored in a set that is queried to find boundary points appearing in the range of each fine cluster. The coordinates of each split cluster are set according to the first/last anchors appearing after/before the boundary point. An optimal chain of fragments F ∈ F with corresponding clusters C are then found using CG-SDP on F.
Cluster refinement
The optimal chain of super-fragments C contains the anchors from which the optimal alignment will be defined. A pairwise alignment may be created using dynamic programming on the substrings between the minimizer matches, however for high-error rate reads matches are sparse and the length of substrings may be too large to efficiently compute, or have too large of a diagonal gap to use banded alignment. A second anchoring step using the local minimizer index is used to detect shorter and more dense anchors. The local minimizer index contains and separate minimizer indexes for the target and query sequences that index each tiling substring of length W, accounting for the relative positions of the substrings in each sequence. Every pair of substrings from q and t that have an anchor in C are compared using their local minimizer indexes to generate a resulting set of anchors A. To reduce runtime of CG-SDP on A, anchors that are adjacent in Cartesian sorted order and on the same diagonal are merged. The length of any merged anchor is the difference from the last endpoint to the first starting point. The resulting merged fragments are chained using CG-SDP, as described in section Problem statement of chaining, and the anchors on this chain are denoted as A.
Alignment refinement
Banded alignment is used to create a pairwise alignment between anchors in A. We assume that large gaps between anchors may be modeled using an affine alignment that allows a single large penalty-free gap. Given two sequences q and t, a match matrix M, a gap penalty δ, and an alignment band B, assume that |q| < |t| − B. A lower-score matrix S is calculated for a typical banded alignment band B and gap penalty δ, e.g if |i − j| ≤ B, −∞ otherwise. Next, an upper-diagonal matrix is calculated that allows for a single transition from the lower matrix with banded alignment. Denoting the length of the q and t as l and l:
Problem statement of chaining
We define a set of fragments Φ = {α1, …, α}. Each fragment α is associated with a lower left start point and upper right end point , and a score l. For minimizer fragments, the upper right endpoints are a fixed distance from the lower points, e.g. . For super fragments defined by fine clusters, the endpoints and starting points may not be on the same diagonal. The starting point of a fragment α is denoted s(α), and the end e(α). A point (x, y) is above (x, y) if x > x, y > y (conversely (x, y) is below (x, y)). The chaining score is defined by Eq 1 where is a concave function of , the difference between the forward diagonals of the endpoint of and the starting point of α. The naive way to solve this problem takes O(n2) in time. By applying CG-SDP [7], the runtime is O(n log(n)2). Below, the solution provided by Eppstein, Galil, and Giancarlo is described both with increase clarity from the original description, with a modification that enables calculation with synchronous computation. Finally, the method is extended to allow for inversions, similar to [25].
SDP algorithm with concave gap cost function—Defining subproblems
For simplicity, assume both sequences are the same length l and that all points are in [0, l) (e.g. shifted by the minimum x and y coordinate), and are on a l × l grid. To speed the chaining algorithm, the search for the fragment that precedes another on an optimal chain is divided into multiple overlapping subproblems that may be solved independently and more efficiently than the naive scan, and the globally optimal score for each point is selected from each of the subproblems that overlap it. Each subproblem divides a block of t columns or rows of the search space into an A-part and B-part covering and columns/rows respectively, where A-part contains all the endpoints and B-part contains all the startpoints in the corresponding columns/rows. When the size of a subproblem is only one column/row, the A-part of the corresponding subproblem is set to be empty. Each subproblem is described by the label d ∈ {column, row}, the starting and ending rows and columns of the subproblem A.s, A.e, B.s, B.e, and a set of lists collectively referenced as DATA that are used in the calculation of optimal chains. The full set of subproblems Sub(d, s, e) are generated recursively as:The more detailed pseudocode for this partitioning is given in S1 Algorithm. Fig 4 visualizes the column and row subproblem division of a simple case with six fragments.
Fig 4
Visualization of subproblems divisions.
The data structures needed for each subproblem: D, D, D, E, E, E, E and E after the processing of all the 12 points. Points are numbered in the order of processing (Cartesian sorted order). 12 points are assigned into three column subproblems , , and one row subproblem , where starting points are assigned to A-part and endpoints are assigned to B-part. Leaf subproblems are not shown for simplicity. Start and End are used for the traceback of the optimal chain. Start stores sub—the index of the subproblem which yields the optimal chaining score up to a starting point and ind—φ(E, f), where f is the diagonal of the starting point. End stores the optimal value for each endpoint. For this toy example, the match bonus of every fragment is 2 and the gap cost function of appending fragment α to fragment α is , where is the endpoint of α and is the starting point of α. End indicates that there are three optimal chains achieving the optimal value: 2.8. By tracing back, chain-1: point-1, point-2, point-3, point-6, chain-2: point-1, point-2, point-5, point-9 and chain-3: point-7, point-10, point-11, point-12.
Visualization of subproblems divisions.
The data structures needed for each subproblem: D, D, D, E, E, E, E and E after the processing of all the 12 points. Points are numbered in the order of processing (Cartesian sorted order). 12 points are assigned into three column subproblems , , and one row subproblem , where starting points are assigned to A-part and endpoints are assigned to B-part. Leaf subproblems are not shown for simplicity. Start and End are used for the traceback of the optimal chain. Start stores sub—the index of the subproblem which yields the optimal chaining score up to a starting point and ind—φ(E, f), where f is the diagonal of the starting point. End stores the optimal value for each endpoint. For this toy example, the match bonus of every fragment is 2 and the gap cost function of appending fragment α to fragment α is , where is the endpoint of α and is the starting point of α. End indicates that there are three optimal chains achieving the optimal value: 2.8. By tracing back, chain-1: point-1, point-2, point-3, point-6, chain-2: point-1, point-2, point-5, point-9 and chain-3: point-7, point-10, point-11, point-12.Each subproblem is processed by finding the optimal endpoint from the A-part that precedes each starting point in corresponding B-part. For a starting point p that is assigned to the B-parts of a set of column-based subproblems, the union of the A-parts of those subproblems form the entire plane to the left of the point. Similarly for a starting point assigned to the B-part of a set of row subproblems, the union of the A-parts of those subproblems form the plane below the point. After solving for the optimal preceding endpoint in all of the subproblems in which p is contained, the one with maximum score among these endpoints is the global optimal, which represents the optimal chain from all endpoints below and to the left of p.Once the subproblems are defined, the list elements of DATA are allocated and initialized. The following elements are associated with the A-part of a column/row subproblem:D: The forward diagonals in increasing/decreasing order overlapped by endpoints in the A-part.D: The optimal chaining score for diagonals overlapped by endpoints in A-part. D[s] holds the optimal value of forward diagonal D[s].D: Index of point with optimal score along diagonal. D[s] references the point with the best score of D[s].The following elements are associated with the B-part of a column/row subproblem:E: The forward diagonals in increasing/decreasing order overlapped by starting points in the B-part.E: Similar to D, the locally best chaining value.E: Index of the forward diagonal in D that leads to the best chaining value E[s] for the forward diagonal E[s].E: A block structure used in the calculation of E.E: Index of the most recent value in D used to compute values in E.For each point, we have two lists that reference the subproblems a point is contained in:SA, A list of subproblems that contain a point that is in an A-part of a subproblem.SB, A list of subproblems that contain a point that is in a B-part of a subproblemTo make the description of the method more concise, the operation φ(D, f) and φ(E, f) are defined to give the index of diagonal f in D and E, respectively.
SDP algorithm with concave gap cost function—Conquering subproblems
The original description of CG-SDP, uses asynchronous processing during solving subproblems [7]. Here we give an alternative description of the algorithm, and provide an approach to solve subproblems in a way that allows synchronous processing of points in Cartesian order for a more simple implementation. The framework for solving a subproblem is first described, and then an order of processing points is given to solve for optimal fragment chaining.For any subproblem, the optimal chains between endpoints in the A-part to starting points in the B-part are found using the D, D, D, E, E, E arrays, variable E and block list E. Consider two points p in A-part, and p in B-part that are on diagonals f and f, respectively. An invariant for any subproblem is that if the optimal chain ending with the fragment that has endpoint p has been solved and is referenced as Score(p), the score of chaining p with p is Score(p) + gap(|f − f|). Because the gap cost only depends on diagonal and not the coordinates of a point, all points in A-part on the same diagonal f will have the same cost to chain with p. More generally, the score to chain any starting point in B-part to an ending point in A-part is equal to the score of chaining a diagonal in B to a diagonal in A. The score of an optimal chain up to point p in B-part, E[j], where j = φ(E, f), is found for column-based problems as:For row-based subproblems:The naive approach to solve for all values of E scales quadratically as O(|E||D|). A problem of this form:
where w is a concave function and D[k] is a linear transformation of E[k], was solved in O(|D| log|E|) time using the auxiliary block data structure E [26, 27]. In this solution, iteration is performed with one pass over the D array. After each iteration i, the invariant holds: each element in E has been set to reference which element of the prefix D[1, …, n] gives the minimum value for Eq (6). Efficiency is gained by not updating E explicitly but by storing indexes in the data structure E that may be updated in log(|E|) time by applying a function Update(D[i], E) on each iteration. The data structure E has an operation that can reconstruct E, E[k] = Ω(E, k). The details of E and Update may be derived from [27]. In the application to CG-SDP, D is replaced by D and E by E, and Eqs (4) and (5) solve for the optimal chains between diagonals from an A-part of a subproblem to diagonals in the B-part of the column and row subproblem respectively. An important detail in our implementation is that in column-based problems an Update operation only affects references in E that are on a greater or equal diagonal than the current element in D, and for row-based problems elements in E are only affected for diagonals that are less than the current element in D. However, it is not possible to apply Eqs (4) and (5) directly. Because points are processed in Cartesian order, but D in forward diagonal order, values of D are not solved in increasing order and not all values of D are guaranteed to be solved when values of E are computed. In [7] this is accounted for by asynchronous computation. Below, an approach is described to solve with a standard model of computation by calling Update(D[i], E) using subsets of D as they become known. Contrary to the customary model of divide and conquer, where subproblems are completely solved before combining into a global solution, this solves portions of subproblems on each iteration.Points are processed in order of x and then y coordinate, determining the value of the optimal chain up to the current processed point. The value of a starting point is the value of the optimal chain that chains the starting point to an endpoint below it. When no endpoints are below a starting point, the value of this starting point is trivially set to 0. The value of an endpoint is simply the value of the optimal chain preceding the corresponding starting point plus the value of the fragment.When processing an endpoint p, the starting point p of the corresponding fragment α has been solved because points are processed in Cartesian order. The value of the chain at the endpoint is Score(p) + l, where l is the match bonus of fragment α. In order for p to be used when solving for starting points that are above it, the value of D must be updated in subproblems for which p is a point in the A-part. The SA list is used to determine which subproblems contain p in a A-part. Suppose the forward diagonal of p is f. In each subproblem in SA, Score(p) is passed to D[j] and D[j] is set to max{Score(p), D[j]}, where j = φ(D, f), and D[j] is set to i if D[j] gets updated.When processing a starting point p, the optimal value must be calculated in each of the B-parts of subproblems that include p, and the global optimal value will be selected among them. For any subproblem, this can be achieved by solving for E[j] using Eq (4) for column-based or Eq (5) for row-based subproblems, where j = φ(E, f) and f is the diagonal of p. For a column-based subproblem, this requires that the values of D[k] have been solved where D[k] ≤ f, and for row-based subproblems, the values of D[k] must have been solved where D[k] > f. These correspond diagonals overlapping points that fall in the region below and to the left of the p and are exactly what have been processed when solving for points in Cartesian sorted order. Thus all required values in D arrays (though not the entire D array) are available when solving for p.The value of E[j] is optimal once Update(D[k], E) has been called on all diagonals that contain potential predecessors to points on diagonal E[j]. The function Update(D[k], E) must be called only once per element in D and in increasing order. However because points are processed in Cartesian sorted order, values of E are solved in arbitrary order, and calling Update(D[k], E) can reference elements in D multiple times. To account for this, we maintain an index E for each subproblem that keeps track of the last element of D which has been processed by Update(D[k], E). Before solving any points, E is initialized to -1 in every subproblem, and when processing points will be assigned to reference the most recently updated diagonal from D. When processing starting point p in a column-based subproblem, E[j] must be solved, where j = φ(E, f). If D[E] > f, Update has been called on all values of D that E[j] relies on, and the state of E contains the optimal value for E[j] such that it may be calculated immediately from Ω(E, j). Otherwise, Update(D[k], E) is called for E < k < C, where D[C] is the first diagonal from the left that is larger than f in array D, and E[j] may be calculated from Ω(E, j). Similarly, for row-based subproblems if f ≥ D[E], the value of E[j] may be assigned immediately from Ω(E, j) where j = φ(E, f). Otherwise, Update(D[k], E) is called for E < k < C, where D[C] is the first diagonal from the left that is smaller than or equal to f in array D, and E[j] can be retrieved from Ω(E, j). By comparing values E[j] from all the subproblems in the SB list of startpoint p, the maximum value for p will be obtained and stored. The pseudocode and detailed example of solving points and conquering subproblems are given in the S2 Algorithm and S3 Fig.
Extension to inversion cases
This extension is inspired by the work [25]. As explained in the previous section, when chaining fragments in forward direction, two points—a lower left start point and a upper right endpoint would be associated to each fragment α. can be chained to an endpoint below and to the left of it and can be the predecessor of a starting point above and to the right of it. To allow inversions to happen, fragments must be allowed to be chained in the reverse direction. To account for this, we associate two more points to each fragment α, that are a upper left start point and a lower right endpoint . The start point can only be chained to endpoint of some other fragment α that satisfies . The endpoint can precede a starting point of some other fragment α that satisfies . , , and can be represented as s1(i), e1(i), s2(i) and e2(i) respectively. Chaining fragment α to fragment α in the reverse direction equals to chaining the start point s2(α) to the endpoint e2(α). The cost of appending s2(α) to s2(α) in the reverse direction is gap(α, α), which is a concave function of , the difference between the reverse diagonals of fragments α and α. In a short word, s2(α) and e2(α) of each fragment α would be responsible for the possible chaining of α in the reverse direction, while s1(α) and e1(α) would be responsible for the forward direction.We used the same column and row subproblem dividing scheme to sort all s1 and e1 points and assign them into column-1 and row-1 subproblems. Then all s2 and e2 points are sorted and assigned to column-2 and row-2 subproblems in the same way. Arrays D, D, D, E, E, E, the block structure E and variable E are allocated and initialized for each subproblem. Lists SA1 and SA2 reference column-1/row-1 and column-2/row-2 subproblems that each endpoint e1 and e2 is in respectively. Similarly, lists SB1 and SB2 references column-1/row-1 and column-2/row-2 subproblems that starting points s1 and s2 in respectively. The only difference between column-2/row-2 and column-1/row-1 subproblems is that column-2/row-2 stores reverse diagonal instead of forward diagonal in D and E arrays.The steps to solve subproblems to allow chaining in both forward and reverse directions are highly similar to section SDP algorithm with concave gap cost function—conquering subproblems. Points are processed in order of x and then y coordinate. When a starting point s1(α) is being processed, E[j], where j = φ(E, f) and f is the forward diagonal, will be computed from column-1/row-1 subproblems in SB1. The maximum E[j] of those subproblems is the value of optimal chain up to s1(α), Score(s1(α)), where s1(α) is forwardly chained to an endpoint e1(α) below and to the left of it. Similarly, when a starting point s2(α) is being processed, E[j], where j = φ(E, r) and r is the reverse diagonal, will be computed from column-2/row-2 subproblems in SB2. The maximum E[j] of those subproblems is the value of optimal chain up to s2(α), Score(s2(α)), where s2(α) is reversely chained to an endpoint e2(α) above and to the left of it.After solving s1(α) and s2(α) for fragment α, the value of the optimal chain up to fragment α can be calculated as Score(α) = max{Score(s1(α)), Score(s2(α))} + l(α), where l(α) is the match bonus of fragment α. This optimal chain is chosen from all the possible chains that fragment α is forwardly or reversely chained to the predecessor in the left. When solving e1(α), Score(α), will be passed to array D[j], where j = φ(E, f) and f is the forward diagonal, to column-1/row-1 subproblems in SA1. And D[j] will be updated to the index of point e1(α), if Score(α) > D[j]. Similarly, when e2(α) is being processed, Score(α) will be passed to array D of column-2/row-2 subproblems in SA2 and D will be updated.Therefore, the addition of two points s2(α) and e2(α) for each fragment α make it possible to allow α to be chained in the reverse direction. Meanwhile, the overall time complexity and storage remain bounded by O(n(log(n)2)), where n is the total number of fragments.
Time complexity
Assume there are n fragments in total, list SB/SA contains O(log(n)) subproblems that a point is in. In section SDP algorithm with concave gap cost function—conquering subproblems, we mention that when a starting point p with forward diagonal f is being processed, Update(D[k], E) is called for E < k < C, where D[C] is the first diagonal from the left that is larger than/smaller than or equal to f in column/row subproblems. And E[j] can be retrieved in O(log(n)) time from the block structure E by calling Ω(E, j), where j = φ(E, f). Procedure Update may be called several times for a starting point in each subproblem. In order to make it easy to quantify the total time complexity of Update procedures, we consider that each Update procedure is called right after D[j] is updated by some endpoint p with diagonal f, where j = φ(E, f). When an endpoint is being solved, there are O(log(n)) subproblems associated with it and in each subproblem Update takes O(log(n)) to conduct. Therefore, it takes O((log(n))2) time to solve the subproblems that are associated with an endpoint. When a starting point is being processed, there are O(log(n)) subproblems it is in and in each subproblem E[j] can be computed from the block structure E in O(log(n)) time. Therefore, it takes O((log(n))2) time to tackle subproblems that are associated with a starting point. Since there are n fragments in total, the time complexity of processing all the points and subproblems is bounded by O(n log(n)2).
Discussion
The initial description of CG-SDP was given in 1992 in two publications, one covering an affine-cost gap function, and another with a concave-cost gap function [7, 28]. While there have been many implementations of affine cost SDP, no sequence alignment methods have been implemented using SDP with a concave-cost gap function. In the original description of the algorithm, the processing of a starting point is blocked until all subproblems the point relies on are solved, and then the process is unblocked and processing resumes. We find that this blocking and unblocking strategy is not necessary, and the addition of data structures to keep track of the state of computation of subproblems enables solving the problem with a standard model of computation. We used two additional strategies to effectively employ SDP in lra: an iterative refinement process where a large number of anchors from the initial minimizer search are grouped into a small super-fragments that are chained using SDP, and once a rough alignment has been found a new set of matches with smaller anchors is calculated using the local miminizer indexes. As a result, alignment is of similar speed to state of the art algorithms, without the need for single-instruction multiple-data (SIMD) processing; runtime was slower when using an SIMD alignment library [29] possibly due to the overhead of invoking the library functions.The lra alignments have competitive runtime and memory usage compared to minimap2. Using two different SV discovery algorithms, pbsv and Sniffles, we show it is possible to use lra alignments to discover SV using PacBio HiFi, CLR, and Oxford Nanopore reads, as well as directly from aligned de novo assembly contigs. The performance for SV detection using PacBio reads and the pbsv algorithm is similar between lra and minimap2, with lra demonstrating a small gain in recall over larger SV events. Importantly, the availability of multiple alignment algorithms can help improve the results of studies that require high sensitivity and specificity, such as Mendelian analysis. The greater improvement in SV discovery metrics on ONT data aligned by lra additionally highlights the utility of using multiple algorithms to analyze sequencing data.Finally, as de novo assembly of genomes becomes more routine, it is important to have accurate methods of SV detection from contig alignments. We used lra to align a haplotype-resolved genome assembled from PacBio HiFi reads and detect SV. When compared to the GIAB SV benchmark data, the lra alignments show slightly lower precision and recall than read-based SV detection. However, when the assembly-based SV are compared to SV detected in an orthogonal read dataset, nearly all (> 98.8%) variants discovered from assemblies are supported by read alignments. This indicates few differences between the HiFi based assembly and the reference are due to assembly error, and the annotated precision of the callset is likely an underestimate. A variant callset is effectively a list of operations that may be applied to the reference genome to reconstruct a sample genome. Because the HiFi assembly has few assembly errors on the same size scale as an SV, this supports the development of an alternative model for validating SV callsets in which the reconstructed genome is compared to the haplotype-resolved assembly, rather than by comparing callsets. This may be used to validate calls inside the high-confidence regions defined where a haplotype-resolved assembly has been confidently generated.
The distribution of read lengths from the HG002 HiFi, CLR, and ONT data.
(PDF)Click here for additional data file.
The cumulative number of bases from the HG002 HiFi, CLR, and ONT data.
(PDF)Click here for additional data file.
A detailed example of visualizing subproblems division.
The data structures for each subproblem: D, D, D, E, E, E, E, E and the process of subproblems solving. The horizontal axis represents the query, while the vertical axis represents the target. Points are numbered in Cartesian sorted order, which is the processing order. 12 points are assigned into three column subproblems , , and one row subproblem , where starting points are assigned to A-part and endpoints are assigned to B-part. Leaf subproblems are not shown for simplicity. Start and End are used for the trace-back of the optimal chain. Start stores sub—the index of the subproblem which yields the optimal chaining score up to a starting point and ind—the index of f in array E, that is φ(E, f), where f is the diagonal of the starting point. End stores the optimal value for each endpoint. For this toy example, gap cost of appending fragment α to fragment α is , where is the endpoint of α and is the startpoint of α. m, shows the regions that each subproblem covers and the initialized data structures for each subproblem. There are of three column subproblems and one row subproblems (leaf subproblems are not shown for simplicity): , , and . a-l shows how the data structures of subproblems that are associated with the point being processed in each step are updated. Note that entries that are updated are highlighted by orange. a, shows for startpoint − 1, it is a leaf subproblem that yields the value of the optimal chain up to startpoint − 1. b, shows when processing endpoint − 2, the optimal value up to it is Score(startpoint − 1) + 2, where 2 is the match bonus of the fragment. Array End is then updated. Since endpoint − 2 is in , , , the corresponding D entries are updated to 2 and corresponding D entries update to the index of endpoint − 2. c, shows when processing startpoint − 3, it is in the B-parts of subproblems and . startpoint − 3 is located in E[2] of , so Update(D[0], E) would be called to get the value of E[2]. The purple color highlighting shows what forward diagonals in E would be updated by D[0]. E[2] would be updated to point to D[0]. startpoint − 3 is located in E[2] of and no forward diagonals in D used to update E[2]. Therefore, in Start, sub and ind for startpoint − 3 are updated to , 2. d, shows when processing startpoint − 4, it is in and locates in E[0] of . Since there is no forward diagonal in D can be used to update E[0], it is a leaf subproblem that yields the optimal chaining value up to startpoint − 4 in Start. e, shows when processing startpoint − 5, it is in and . In , there is no forward diagonal can be used to update E[1]. In , Update(D[0], E) is called to update the block structure E, so E[1] and E[2] would be computed from E. In Start, sub and ind for startpoint − 5 are updated to , 1. f, g, h, i, j, k, l: show the subproblems solving and data structures updating for the rest of points. After processing all 12 points, three optimal chains can be obtained by tracing back, which are chain − 1 = [startpoint − 1, endpoint − 2, startpoint − 3, endpoint − 6], chain − 2 = [startpoint − 1, endpoint − 2, startpoint − 5, endpoint − 9] and chain − 3 = [startpoint − 7, endpoint − 10, startpoint − 11, endpoint − 12].(PDF)Click here for additional data file.
Impact of the local minimizer index.
(PDF)Click here for additional data file.
Assembly exclusive calls.
(PDF)Click here for additional data file.
Defining subproblems.
(PDF)Click here for additional data file.
Sparse dynamic programming with convex gap cost.
(PDF)Click here for additional data file.
Shared variant calls.
Considering HiFi/CLR + pbsv callsets and ONT + Sniffles callsets, the size of the intersection of every two callsets are shown. The intersection is calculated using Truvari bench on two called vcfs.(PDF)Click here for additional data file.
Calls unique to a callset in tandem repeats and segmental duplications. Entry with pbmm2 as row and lra as column means the number of unique calls in pbmm2 callset when comparing pbmm2 callset and lra callset.
(PDF)Click here for additional data file.
Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with simulated SVs.
SVs: Indels (insertions and deletions), inversions were simulated by SUVIVOR. All SVs were simulated by SUVIVOR. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2. We simulated 195 Indels (insertions and deletions) of lengths between 50-10000 bases, 97 inversions of lengths between 600-2000 bases.(PDF)Click here for additional data file.
Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with complicated nested SVs.
Complicated nested SVs: deletion-inversion-deletions (INVDEL), inverted-duplications (INVDUP), which were simulated by SUVIVOR. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2, which is distributed with lra source. We simulated 100 deletion-inversion-deletions and inverted-duplications of lengths between 600-1000 bases respectively. Inversion-deletion is a type of nested SV where an inversion is flanked by 2 deletions and inversion-duplication means the duplicated sequence is inverted. For deletion-inversion-deletions, the inversion and two deletions have all been found in order to be counted as a TP. For cases where the inversion and only one flanked deletion are found were counted as Partial. For inverted-duplications, both the inversion and duplication need to be found in order to be counted as a TP. We found that minimap2 alignment find inverted-duplications as insertions, therefore, we didn’t count that as TP.(PDF)Click here for additional data file.
Comparison of the breakpoints on real datasets.
The breakpoints accuracy analysis was conducted by comparing the boundaries of true positive SVs from Truvari result to the curated SVs’ breakpoints for each aligner/caller combination. Breakpoints accuracy is measured by the percentage of SVs with perfect breakpoint boundaries and the average shifting distance between the left-most coordinate of SV boundaries.(PDF)Click here for additional data file.
Comparison of the breakpoints on simulated SVs.
The breakpoints accuracy analysis was conducted by comparing the boundaries of true positive SVs to the bondaries of ground truth SV for each aligner/caller combination. Breakpoints accuracy is measured by the percentage of SVs with perfect breakpoint boundaries and the average shifting distance between the left-most coordinate of SV boundaries.(PDF)Click here for additional data file.
Truvari classification of cuteSV variant calls.
Truvari comparisons between lra, minimap2 and ngmlr using the Genome in a Bottle benchmark SV set. Optimal results in each category are shown in bold. TP-base means true positive calls in the benchmark SV curation set, while TP-call means true positive calls in the SV set from each aligner. False positive means the number of non-matching calls from the SV set from each aligner. False negative means the number of non-matching calls from the SV curation set.(PDF)Click here for additional data file.25 Jan 2021Dear Dr. Chaisson,Thank you very much for submitting your manuscript "lra: A long read aligner for sequences and contigs" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. The most critical points that need to be fully addressed are further evaluation of mapping accuracy, concerns about high false positive rates and improving the documentation of the tool to highlight its benefits.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Ferhat Ay, Ph.DAssociate EditorPLOS Computational BiologyJian MaDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Ren & Chaisson described lra, a new long-read aligner specialized for calling structural variations (SVs). Lra implements a chaining algorithm with concave gap cost functions. Although the algorithm is not new, as the authors said, this is probably the first time the algorithm gets implemented. This manuscript represents an important addition in this field. Critical observations are as follows:1) I tried lra v1.1.0 on reads simulated by PBSIM. At mapping quality 60, lra wrongly mapped 1.3% of reads. This is not as good as NGMLR. While simulation based evaluation may not reflect the accuracy on real data, it is a good sanity check and indicative of low-level algorithmic issues. I wonder if the authors can improve the mapping accuracy further. The file is available at https://github.com/lh3/minimap2/releases/download/v2.0/pb-1.fa.gz and the mapping result can be evaluated by "paftools.js mapeval -r 1e-6".2) For a second test, I mapped the HG002 maternal hifiasm assembly to GRCh38 and call variants with "htsbox -vcf ref.fa -q5 -S10000 -T20 srt.bam | awk '$6==1'". Minimap2 -xasm5 calls 3.75M variants, which is about right. Lra calls 9.53M. The number is too high, suggesting the majority are false positives. I guess this is caused by inaccurate base alignment. I understand the authors are focusing on SVs, but the high false positive rate of small variants is concerning. It would be good if the authors can try to improve that.3) For each query sequence, minimap2 chains all anchors in one round and produces multiple chains. In comparison, lra clusters anchors first and then chains in each cluster. While lra chaining may be theoretically better, clustering is heuristic and may not be as good. Combining clustering and chaining might hurt overall mapping accuracy. The authors said they use clustering "for efficiency". How much efficiency is affected if we use lra chaining for all anchors?Minor comments4) Minimap2+paftools is primarily developed for calling small variants, not for calling SVs. We can increase the recall of SV calling by adding options "-r2k -z2000,200" or even "-r7k -z2000,200". This trades performance and small variant accuracy for SV accuracy.5) The manuscript is using concave gap cost functions, not convex.6) Apparently, lra requires C++14 or C++17 to compile. However, Makefile is not using an option like -std=c++14. I got a compilation error.Reviewer: Heng LiReviewer #2: Long read sequencing is a technology that becomes more important in recent days. Alignment of long reads is the fundamental step for many downstream analyses. In this manuscript, the authors described the design of lra, and compared the output to other programs. The tool is extreme valuable for long reads data analyses. It is challenging to understand all details described in the manuscript, but the paper is well-written, solid and with enough details. I also tested the program. It is easy to install and use. For an actively emerging area, any new tools are welcomed, and lra is one of the excellent ones from my point of view. Thus, I suggest accepting the manuscript with some minor revisions as shown below. I suggest the authors to improve the GitHub page to make it easier for users to understand the benefit of using lra.Minor points:Page 2 line 47, change “translocation” to “translocations”.Page 5 line 150, change “section 3.3” to something else, as currently there is no title like “3.3”. The same for other similar cases.There are a lot of abbreviations in this paper, such as “DATA” (line 206), “SA”, “SB” (line 237). It is better to add some explanations to make it easier for readers to follow. SA means subproblem A part?Line 343, change “can be only be” to “can only be”.Line 413, not a problem but I am surprised that GRCh37 instead of GRCh38 is used. Maybe because the curated variant file is based on GRCh37? Why It is “mapped to GRCh38” in line 427?Line 423, “the number of bases aligned by lra and minimap2 are within 0.03-7%”. I do not understand. Do you mean the difference between the alignments by the two programs?**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #1: YesReviewer #2: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: Yes: Heng LiReviewer #2: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see22 Apr 2021Submitted filename: LRA Review comments.pdfClick here for additional data file.13 May 2021Dear Chaisson,We are pleased to inform you that your manuscript 'lra: A long read aligner for sequences and contigs' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Also, we strongly encourage you to address the few remaining points raised by two reviewers. Since these are very minor, we did not want to make you go through another round of revisions but rather address these together with other editorial edits needed in one final submission.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Ferhat Ay, Ph.DAssociate EditorPLOS Computational BiologyJian MaDeputy EditorPLOS Computational Biology***********************************************************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: I have tried the latest LRA from github on the T2T CHM13 assembly and a HG00733 assembly. LRA crashed on both inputs. Nonetheless, based on the alignments it has outputted, I can see that LRA has greatly improved the alignment quality, which has addressed most of my major concerns in the previous round. Please fix the bug and make sure LRA runs under addressSanitizer. I also have a couple of minor comments. These are just suggestions that the authors may ignore:1) It would be good to use a command-line option parser that is compatible with POSIX getopt. This will be more convenient to users and is less error prone.2) In the SAM output, we use "=" for sequence matches and "X" for mismatches. We don't mix "M" and "X" together.Reviewer #2: The authors have answered all my questions and included a great number of improvements compared to last submission. One tiny flaw may be “when a 40% overlap with the simulated interval is required for a correct alignment (Figure ??)”, where the meaning of “??” is unknown. The authors may consider adding some subtitles for “3 Results”. Otherwise, I think the manuscript is in great shape.Reviewer #3: I have gone through the revision, and the authors have adequatelyaddressed all my comments. I have no further comments.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: Yes: Heng LiReviewer #2: NoReviewer #3: No16 Jun 2021PCOMPBIOL-D-20-02184R1lra: A long read aligner for sequences and contigsDear Dr Chaisson,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Zsofi ZomborPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043
Authors: W James Kent; Robert Baertsch; Angie Hinrichs; Webb Miller; David Haussler Journal: Proc Natl Acad Sci U S A Date: 2003-09-19 Impact factor: 11.205
Authors: Kishwar Shafin; Trevor Pesout; Ryan Lorig-Roach; Marina Haukness; Hugh E Olsen; Colleen Bosworth; Joel Armstrong; Kristof Tigyi; Nicholas Maurer; Sergey Koren; Fritz J Sedlazeck; Tobias Marschall; Simon Mayes; Vania Costa; Justin M Zook; Kelvin J Liu; Duncan Kilburn; Melanie Sorensen; Katy M Munson; Mitchell R Vollger; Jean Monlong; Erik Garrison; Evan E Eichler; Sofie Salama; David Haussler; Richard E Green; Mark Akeson; Adam Phillippy; Karen H Miga; Paolo Carnevali; Miten Jain; Benedict Paten Journal: Nat Biotechnol Date: 2020-05-04 Impact factor: 54.908
Authors: Aaron M Wenger; Paul Peluso; William J Rowell; Pi-Chuan Chang; Richard J Hall; Gregory T Concepcion; Jana Ebler; Arkarachai Fungtammasan; Alexey Kolesnikov; Nathan D Olson; Armin Töpfer; Michael Alonge; Medhat Mahmoud; Yufeng Qian; Chen-Shan Chin; Adam M Phillippy; Michael C Schatz; Gene Myers; Mark A DePristo; Jue Ruan; Tobias Marschall; Fritz J Sedlazeck; Justin M Zook; Heng Li; Sergey Koren; Andrew Carroll; David R Rank; Michael W Hunkapiller Journal: Nat Biotechnol Date: 2019-08-12 Impact factor: 54.908
Authors: Guillaume Marçais; Arthur L Delcher; Adam M Phillippy; Rachel Coston; Steven L Salzberg; Aleksey Zimin Journal: PLoS Comput Biol Date: 2018-01-26 Impact factor: 4.475
Authors: Michelle D Noyes; William T Harvey; David Porubsky; Arvis Sulovari; Ruiyang Li; Nicholas R Rose; Peter A Audano; Katherine M Munson; Alexandra P Lewis; Kendra Hoekzema; Tuomo Mantere; Tina A Graves-Lindsay; Ashley D Sanders; Sara Goodwin; Melissa Kramer; Younes Mokrab; Michael C Zody; Alexander Hoischen; Jan O Korbel; W Richard McCombie; Evan E Eichler Journal: Am J Hum Genet Date: 2022-03-14 Impact factor: 11.043