
LongAGE: defining breakpoints of genomic structural variants through optimal and memory efficient alignments of long reads.

Abstract

SUMMARY: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is a challenging problem due to large gaps in alignment. Previously, Alignment with Gap Excision (AGE) enabled us to define breakpoints of SVs at single-nucleotide resolution; however, AGE requires a vast amount of memory when aligning a pair of long sequences. To address this, we developed a memory-efficient implementation, LongAGE, based on the classical Hirschberg algorithm. We demonstrate an application of LongAGE for resolving breakpoints of SVs embedded in segmental duplications on Pacific Biosciences (PacBio) reads that can be longer than 10 kb. Furthermore, we observed different breakpoints for a deletion and a duplication in the same locus, providing direct evidence that such multi-allelic copy number variants (mCNVs) arise from two or more independent ancestral mutations.
AVAILABILITY AND IMPLEMENTATION: LongAGE is implemented in C++ and available on GitHub at https://github.com/Coaxecva/LongAGE.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2021 PMID: 32777815 PMCID: PMC8128450 DOI: 10.1093/bioinformatics/btaa703

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Recent single-molecule sequencing technologies generate very long reads, enabling the capture of multiple variant types, including structural and copy number variations (SVs/CNVs). However, precise alignment around SVs is challenging because of large gaps in the alignment (Abyzov et al.; Lam et al., 2009; Sedlazeck et al.). Previously, Alignment with Gap Excision (AGE) was described as a precise method that uses dynamic programming to solve the problem (Abyzov and Gerstein, 2011). AGE was not designed for aligning reads against a whole reference genome; its primary purpose is realigning reads around the sites of suspected SVs/CNVs. However, it requires a vast amount of memory, because its implementation uses full dynamic programming matrices. Thus, its application is limited to the alignment of short reads/contigs to relatively small genomic regions. Here, we introduce LongAGE, a memory-efficient implementation of AGE. LongAGE leverages linear-space alignment algorithms based on the idea first presented to solve the longest common subsequence problem (Hirschberg, 1975) and on several later algorithms of this kind for sequence alignment (Chao et al., 1994). LongAGE vastly reduces memory usage compared to AGE, which allows users to realign long reads (PacBio/Oxford Nanopore) or contigs on a regular compute node, desktop or laptop.

2 Materials and methods

2.1 Memory-efficient implementation

Consider two sequences to be aligned, of lengths N and M. Let P denote the optimal score for the left flank and Q the optimal score for the right flank, where each score is the maximum sum of values (for aligning x to y, or either x or y to a gap ‘-’) over the aligned pairs of that flank. The AGE algorithm itself is described in Abyzov and Gerstein (2011).

Recall that the AGE algorithm uses matrices to compute the best score (BS) of aligning n and m nucleotides at the 5′-ends and N − n and M − m nucleotides at the 3′-ends. BS combines the maximum in the leading submatrix with the maximum in the trailing submatrix, and these two maxima are the values of P and Q, respectively. To reduce memory usage, we can use a single array (α, β) in place of each matrix.

Our main implementation is summarized in two steps: (i) compute the maximum scores with linear-space algorithms, following the detailed implementation outlined by Chao et al. (1994); (ii) reconstruct the pairwise alignments based on the maximum scores (a second round of the same procedure used to find the maximum scores).

It is well known that CNVs and SVs can have homologous and identical sequences around their breakpoints (Kidd et al., 2010). Several optimal alignments with the same maximum score can therefore exist (Tran et al.); they differ only by a shift along the identical sequences at the SV breakpoints. By common convention, LongAGE returns the left-shifted solution. LongAGE reduces the space usage from quadratic to linear in the sequence lengths, while increasing computation time by at most four times.
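The linear-space idea behind this section can be illustrated with a minimal Python sketch. It is not LongAGE's C++ code: it assumes plain global alignment with a simplified scoring scheme (match +1, mismatch −1, linear gap −1) rather than AGE's gap-excision scoring, and the function names `last_row_scores` and `hirschberg_split` are hypothetical. The first function keeps only two rows of the score matrix alive; the second shows the Hirschberg-style step that locates where an optimal alignment crosses the middle row, which is what allows the alignment to be reconstructed recursively without storing the full matrix.

```python
def last_row_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Scores of globally aligning all of `a` against every prefix of `b`.

    Only two rows of the score matrix exist at any time, so memory is
    O(len(b)) instead of O(len(a) * len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def hirschberg_split(a, b):
    """Find the column of `b` where an optimal alignment crosses the
    middle row of `a`, together with the optimal total score."""
    mid = len(a) // 2
    # Forward pass on the first half of `a`.
    left = last_row_scores(a[:mid], b)
    # Backward pass on the second half: align the reversed sequences.
    right = last_row_scores(a[mid:][::-1], b[::-1])
    totals = [left[j] + right[len(b) - j] for j in range(len(b) + 1)]
    best_j = max(range(len(b) + 1), key=totals.__getitem__)
    return best_j, totals[best_j]


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGACGT"
    j, score = hirschberg_split(a, b)
    print("split column:", j, "optimal score:", score)
```

Recursing on the two halves defined by the split recomputes scores that a full matrix would have stored, so memory is saved at the cost of extra running time, the same kind of trade-off described above.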

2.2 Resolving breakpoints of mCNVs using long-reads

The steps were as follows:

1. Identify SVs of interest (Fig. 1A): Aligned Illumina HiSeq short reads (Zook et al.) in BAM format are available for three trios from the Genome in a Bottle (GIAB) Consortium. The coverage was for the parents and for the child. CNVs were discovered in children using CNVnator (Abyzov et al.) with default options and 1 kb bins. We then genotyped CNVs in the corresponding parents using the same bin size. CNVnator returned an estimated copy number (CN) for each member of the trio. Applying the condition: [ CN (in one parent) ] and [ CN (in the other parent) ] and [ CN (in child) ] for each GIAB trio, we obtained two candidate mCNVs. The candidate mCNV in the Ashkenazim trio was likely a false positive, as no PacBio reads supported a deletion or duplication in that region. The other mCNV, in the Chinese trio, was around 20 kb in length and contained a deletion in the father (HG006) and a duplication in the mother (HG007) (Fig. 1B).

2. Analyze long reads containing SVs (Fig. 1C): NGMLR (Sedlazeck et al.) was used with the option “−x pacbio” to map the GIAB Mt Sinai PacBio reads of the Chinese son (HG005) (Zook et al.) to the human reference GRCh38. Using SAMtools (Li et al., 2009), we extracted reads from the regions of interest, defined as the coordinate intervals [L−40 kb, R+40 kb], where L and R are the left and right breakpoint coordinates from read depth analysis. Extracted reads were realigned to the reference genome around the breakpoints using LongAGE with either “−indel” or “−tdup”, which specify that the alignment is expected to contain indels or tandem duplications in the read sequence, respectively. Because long reads have, until recently, had high error rates (Lau et al.), we used a lower gap opening penalty, “−go=−1”.

3. Rectify SV breakpoints (Fig. 1C): Realigned reads were grouped based on which haplotype (deletion or duplication) had better support. For the best alignment, we required that: (i) the breakpoints from LongAGE’s alignment are within 1 kb of the estimated breakpoints of the mCNV; (ii) every flank of an aligned read has a minimum length of 1.5 kb or at least a fifth of the read length; (iii) its score is at least 500 higher than that of the alignment in the alternative mode (“−indel” for “−tdup” and vice versa). We assembled the selected reads into two contigs using the long-read assembler wtdbg2 (Ruan and Li, 2020) and then aligned those contigs with the same parameters to precisely resolve the breakpoints.

Further best-practice guidance on using the tools can be found in the Supplementary Material.

Fig. 1.

Defining breakpoints of mCNV on chromosome 19 in Chinese Trio from GIAB. (A) Read depth signals from top to bottom corresponding to father (HG006), mother (HG007) and son (HG005). (B) Haplotypes with deletion and duplication are passed down from both parents to son. (C) Haplotypes with tandem duplication and deletion were assembled from haplotype-assigned PacBio reads. Breakpoints of the deletion and the duplication are different.
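The region padding and the read-selection criteria (i) to (iii) of Section 2.2 can be expressed as a short Python sketch. This is an illustration under stated assumptions, not the authors' pipeline: the function names, argument names and example coordinates are hypothetical, criterion (ii) is read as "each flank is at least 1.5 kb long or at least one fifth of the read length", and the region string follows the usual `chrom:start-end` convention accepted by SAMtools.

```python
def padded_region(chrom, left_bp, right_bp, pad=40_000):
    """Region [L - 40 kb, R + 40 kb] as a 'chrom:start-end' string."""
    start = max(1, left_bp - pad)  # coordinates are 1-based
    return f"{chrom}:{start}-{right_bp + pad}"


def passes_selection(read_len, flank_lens, breakpoints, estimated,
                     score, alt_score,
                     max_bp_dist=1_000, min_flank=1_500, min_gain=500):
    """Apply criteria (i)-(iii) to one realigned read.

    flank_lens  -- (left, right) aligned flank lengths in bp
    breakpoints -- (left, right) breakpoints from the realignment
    estimated   -- (left, right) breakpoints from read depth analysis
    score       -- alignment score in the mode being tested
    alt_score   -- alignment score in the alternative mode
    """
    # (i) both breakpoints lie within 1 kb of the read-depth estimates
    near = all(abs(b - e) <= max_bp_dist
               for b, e in zip(breakpoints, estimated))
    # (ii) every flank is at least 1.5 kb or a fifth of the read length
    #      (assumed interpretation of the 'or' in the text)
    long_enough = all(f >= min_flank or f >= read_len / 5
                      for f in flank_lens)
    # (iii) the score beats the alternative mode by at least 500
    better = score - alt_score >= min_gain
    return near and long_enough and better


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(padded_region("chr19", 100_000, 120_000))
    print(passes_selection(12_000, (3_000, 4_000),
                           (100_200, 120_300), (100_000, 120_000),
                           score=9_000, alt_score=8_000))
```

Keeping the three thresholds as keyword arguments makes it easy to see how sensitive the selected read set is to each criterion.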

3 Results

To study the trade-off between memory usage and running time, we created a synthetic dataset of SVs with lengths varying from 1 to 32 kb, plus one SV of 1 Mbp. Following Abyzov et al. and Lam et al. (2009), we randomly generated the coordinates of a synthetic deletion of a given length, then created the pseudocontig of each deletion allele by joining its left and right flanks, 10 kb in total length. We then aligned each pseudocontig against the reference region spanning from the 5′-end of the left flank to the 3′-end of the right flank. We performed this alignment with AGE and LongAGE for all lengths of synthetic SVs. Table 1 summarizes the run time and memory usage of AGE and LongAGE, measured with Valgrind (Seward et al.), on all pairs of synthesized sequences. With LongAGE, memory usage grows linearly, while computation time is 2.6 to 3.7 times longer than with AGE (Table 1), which is expected under Hirschberg’s method. On a workstation with an Intel Xeon Gold 6148 processor and 192 GB of memory, AGE failed to align sequences with the 1 Mbp SV because it could not allocate enough memory. LongAGE completed the task in less than 20 min and needed a maximum of only 114 megabytes.
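The construction of one synthetic test pair can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' script: the function name is hypothetical, the 10 kb of flanking sequence is assumed to be split evenly into two 5 kb flanks, and a random string stands in for the reference genome.

```python
import random


def make_deletion_pair(reference, sv_len, flank_total=10_000, rng=None):
    """Return (reference_region, pseudocontig, (start, end)) for one
    synthetic deletion of `sv_len` bases placed at a random position.

    The pseudocontig joins the left and right flanks of the deletion;
    the reference region spans from the start of the left flank to the
    end of the right flank, so it still contains the deleted bases.
    """
    rng = rng or random.Random()
    half = flank_total // 2  # assumption: flanks of equal length
    if len(reference) < sv_len + flank_total:
        raise ValueError("reference too short for this SV and flanks")
    start = rng.randrange(half, len(reference) - sv_len - half + 1)
    end = start + sv_len  # deletion covers reference[start:end]
    region = reference[start - half:end + half]
    pseudocontig = reference[start - half:start] + reference[end:end + half]
    return region, pseudocontig, (start, end)


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "".join(rng.choice("ACGT") for _ in range(100_000))
    for sv_len in (1_000, 2_000, 4_000, 8_000, 16_000, 32_000):
        region, contig, _ = make_deletion_pair(ref, sv_len, rng=rng)
        print(sv_len, len(region), len(contig))
```

Each pair then consists of a fixed-length pseudocontig and a reference region that grows with the SV length, which is what makes the memory and run-time trends in Table 1 comparable across SV sizes.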

Table 1.

Memory usage in megabytes and run time in seconds of AGE and LongAGE in controlled experiments on aligning two sequences with various variant lengths

Tools	1 kb	2 kb	4 kb	8 kb	16 kb	32 kb	1 Mbp
Memory usage (megabytes)
AGE	550.83	600.85	700.90	901.04	1301.21	2101.68	⊘
LongAGE	2.71	2.92	3.13	3.55	3.62	5.55	113.29
Running time (s)
AGE	5.05	5.55	6.57	8.37	12.03	19.27	⊘
LongAGE	18.92	20.72	22.80	23.77	32.06	50.63	1159.61

Note: Benchmarks were made on an Intel Xeon(R) Gold 6148 Processor (27.5M Cache, 2.40 GHz) with 192 GB of memory.
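The trade-off in Table 1 can be made explicit with a few lines of Python. The numbers below are copied from the table (the 1 Mbp column is omitted because AGE did not complete); the variable names are only illustrative.

```python
sizes = ["1 kb", "2 kb", "4 kb", "8 kb", "16 kb", "32 kb"]

# Values from Table 1: memory in megabytes, time in seconds.
age_mem = [550.83, 600.85, 700.90, 901.04, 1301.21, 2101.68]
longage_mem = [2.71, 2.92, 3.13, 3.55, 3.62, 5.55]
age_time = [5.05, 5.55, 6.57, 8.37, 12.03, 19.27]
longage_time = [18.92, 20.72, 22.80, 23.77, 32.06, 50.63]

# How many times slower LongAGE is, and how many times less memory it uses.
slowdown = [lt / at for lt, at in zip(longage_time, age_time)]
mem_saving = [am / lm for am, lm in zip(age_mem, longage_mem)]

if __name__ == "__main__":
    for size, s, m in zip(sizes, slowdown, mem_saving):
        print(f"{size:>6}: {s:.2f}x slower, {m:.0f}x less memory")
```

For every SV length up to 32 kb the slowdown stays below four, while the memory footprint shrinks by more than two orders of magnitude.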

Thousands of deletion and duplication polymorphisms larger than 1 kb in human genomes, called copy number variations (CNVs), can impact phenotypes by causing gene dosage and structure to vary among individuals (Usher and McCarroll, 2015). Many CNVs are multi-allelic (mCNVs): their structural alleles have been rearranged multiple times in their ancestors. The origin of such events is not fully understood because their breakpoints are difficult to resolve with short reads, as the breakpoints are often embedded in segmental duplications. To demonstrate the applicability of LongAGE, we resolved the breakpoints of a reciprocal deletion and duplication with long homologies around the breakpoints in the Chinese Trio sequenced by the GIAB Consortium. Such events have been described previously (Abyzov et al.) and were hypothesized to arise from a single non-allelic homologous recombination (NAHR) event (Abyzov et al.; Lam et al., 2009).

First, we identified a copy number neutral region of the mCNV on chromosome 19 of the human reference genome GRCh38, with possible deletion and duplication haplotypes in the child, using Illumina HiSeq short-read data (Zook et al.) (Fig. 1A). Then, assuming that both haplotypes (deletion and duplication) are present in the child (Fig. 1B), we locally realigned PacBio long reads with LongAGE using both the INDEL mode (for alignment with a deletion) and the TDUP mode (for alignment with a tandem duplication). Next, by comparing the alignments in each mode, we selected the reads likely to support the deletion or the tandem duplication. Breakpoints in individual reads can be imprecise due to sequencing errors and homologies, yet they roughly match those identified from read depth analysis. We obtained 26 deletion-supporting reads and 20 duplication-supporting reads (Supplementary Table S1).

We then assembled these reads into two contigs and aligned them to the reference (with LongAGE in the appropriate mode), obtaining a percent identity of over 98%. We observed that the deletion breakpoints are shifted to the left of the duplication breakpoints by 1538 bp for the left breakpoint and 1541 bp for the right breakpoint (Fig. 1C). Such a shift suggests that the deletion and the duplication arose ancestrally from two different events.

4 Conclusion

We have presented LongAGE, a memory-efficient implementation of AGE. Even when aligning megabase-long sequences, LongAGE’s memory footprint remains on the order of a hundred megabytes, while it is at most four times slower than AGE in terms of running time. The tool facilitates the resolution and standardization of SV breakpoints in highly repetitive regions at single-base-pair resolution. It is capable of refining a read alignment once the read has been heuristically mapped to a genomic location that is expected to contain an SV.

References

1. K M Chao, R C Hardison, W Miller (1994). Recent developments in linear-space alignment methods: a survey. J Comput Biol.

2. Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics.

3. Hugo Y K Lam, Xinmeng Jasmine Mu, Adrian M Stütz, Andrea Tanzer, Philip D Cayting, Michael Snyder, Philip M Kim, Jan O Korbel, Mark B Gerstein (2009). Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol.

4. Jeffrey M Kidd, Tina Graves, Tera L Newman, Robert Fulton, Hillary S Hayden, Maika Malig, Joelle Kallicki, Rajinder Kaul, Richard K Wilson, Evan E Eichler (2010). A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell.

5. Alexej Abyzov, Mark Gerstein (2011). AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics.

6. Alexej Abyzov, Shantao Li, Daniel Rhee Kim, Marghoob Mohiyuddin, Adrian M Stütz, Nicholas F Parrish, Xinmeng Jasmine Mu, Wyatt Clark, Ken Chen, Matthew Hurles, Jan O Korbel, Hugo Y K Lam, Charles Lee, Mark B Gerstein (2015). Analysis of deletion breakpoints from 1,092 humans reveals details of mutation mechanisms. Nat Commun.