Literature DB >> 30423092

Fast characterization of segmental duplications in genome assemblies.

Ibrahim Numanagic^1,2, Alim S Gökkaya³, Lillian Zhang¹, Bonnie Berger^1,2, Can Alkan³, Faraz Hach^4,5.

Abstract

Motivation: Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner.
Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome. Availability and implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 30423092 PMCID： PMC6129265 DOI： 10.1093/bioinformatics/bty586

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Segmental duplications (SDs) are defined as genomic segments of size >1 Kbp that are repeated within the genome with at least 90% sequence identity (Bailey ) in either tandem or interspersed organization. Almost all genomes harbor large SDs; e.g. the build 37 release of the human reference genome (GRCh37) contains a total of 159 Mbp gene-rich duplicated sequence, which corresponds to ∼5.5% of its entire length (http://humanparalogy.gs.washington.edu/build37/build37.htm). It is known that SDs played a major role in evolution (Marques-Bonet ; Prado-Martinez ; Sudmant ), and are one of the most important factors that contribute to human disease either directly (Gonzalez ; Hollox ; Yang ), or through leading to other forms of structural variation (SV) (Alkan ; Mills ). Furthermore, human populations show SD diversity that may be used as markers for population genetics studies (Alkan ; Sudmant ). Despite their functional importance, SDs are poorly characterized due to the difficulties they impose on constructing accurate genome assemblies (Alkan ; Chaisson ; Steinberg ), as well as the ambiguities in read mapping (Treangen and Salzberg, 2011; Firtina and Alkan, 2016). Building inaccurate assemblies due to SDs is an important problem, since it may lead to potentially incorrect conclusions about the evolution of the species of interest as in the case of the giant panda assembly (Li ). In the giant panda genome analysis, the authors concluded that the panda genome included substantially less repeats and duplications compared with other mammalian genomes; however, it is shown that this result is likely incorrect due to mis-assembly of these regions. Additionally, duplications give rise to gaps in the assembly, which likely contain genes and other functionally active regions (Alkan ; Bailey , 2002). Accurate assembly of duplicated regions remains a difficult and unsolved problem, which may be ameliorated through the use of ultra-long reads generated by the Oxford Nanopore platform (Jain ) or Linked-Read sequencing (e.g. 10x Genomics; Mostovoy ; Yeo ) if the duplicated segment is shorter than the read or linked-read length, respectively. However, characterization of the SD content in existing assemblies is still important for two reasons: (i) to evaluate the ‘completeness’ of these genome assemblies and (ii) understand genome evolution for comparative genomics studies. SD content in assemblies can be assessed using two strategies, and the overlap between the results of the two methods determines the completeness of the assembly in terms of duplications. The first method, called Whole-Genome Assembly Comparison (WGAC), relies on the alignment of the entire genome to itself to identify repeating segments (Bailey ) within the assembly, except for common repeats which are filtered out. The second strategy is called Whole-genome Shotgun Sequence Detection (WSSD) which relies on the read depth sequence signature (Bailey ). Briefly, the WSSD method aligns the original reads back to the assembled genome, and looks for regions of read depth significantly higher than the average, which signals a putative duplication (Bailey ). Regions that are marked as SDs by WSSD, but not by WGAC, are then classified as likely collapsed duplications (Alkan ). Although the ‘optimal’ alignment of the entire genome to itself can be theoretically computed via standard dynamic programing (e.g. Smith-Waterman algorithm), such an approach remains impractical due to quadratic time and memory complexity, and is likely to remain so (Backurs and Indyk, 2015). Furthermore, high edit distance between the SD paralogs disqualifies the use of most of the available edit distance approximations with theoretical guarantees (Andoni ; Hanada ), as well as the majority of the sequence search tools that operate under the assumption that the similarity between regions is high. Standard whole-genome or long-read aligners [e.g. MUMmer (Marçais ) or Minimap2 (Li, 2018)] are not able to efficiently capture SDs with low similarity rates (i.e. <80%), and also confuse SDs with other repeated elements in the genome (e.g. stretches of short tandem repeats). For these reasons, the WGAC method is composed of a number of heuristics that include several tools and scripts (Bailey ). First, the common repeats are removed from the assembly in a step called fuguization. Remaining regions are partitioned into 400 Kb chunks (due to memory limitations when WGAC was developed), all pairwise alignments are computed using BLAST (Altschul ), and significant alignments are kept as putative duplications. A modified version of BLASTZ (Schwartz ) is also used to find within-chunk (i.e. 400 Kb segments) duplications. Next, common repeats are inserted back, spurious alignments at the end of the sequences are trimmed, and final global alignments are calculated. Note that another method, SDquest (Pu ), was published while this article was under review. The original WGAC implementation as outlined above is difficult to run, and it is time consuming as it relies on several general purpose tools (such as BLAST) and custom Perl scripts. In its current form, the only way to accelerate the WGAC analysis is using a compute cluster to parallelize BLAST alignments. Interestingly, another problem is the modified version of BLASTZ for self alignments: the source code is not available, and only a binary compiled for the Sun Solaris operating system has been released (http://humanparalogy.gs.washington.edu/code/WGAC_HOWTO.pdf), rendering the tool unusable for most other researchers. We note that this self-alignment step might be replaced with another tool such as LASTZ (Harris, 2007); however, the parameter settings for the current release is not yet optimized for alternative aligners. Here we introduce a new algorithm to characterize SDs in genome assemblies. Although we follow a strategy akin to WGAC in aligning a whole genome to itself; we do so in a more efficient way by introducing sophisticated optimizations to both the putative SD detection and global alignment steps. We leverage our knowledge from the biology of human and other genomes that different mutation events contribute unequally to the total value of the error rate (quantification of differences) between segments, in order to better optimize our SD detection algorithms. A key conceptual advance of our work that helps model such events in the genome is to separately consider germline mutation rates (denoted as small mutations), and larger-scale de novo SV rates. This formulation enables us to speed up SD detection and better capture evolutionary events. We implement our algorithms in C++ and provide a single package called SEDEF (SEgmental Duplication Evaluation Framework). In contrast to WGAC, which requires several weeks to complete even on a compute cluster, SEDEF can characterize SDs in the human genome in 10 CPU h. We believe SEDEF will be a powerful tool to characterize SDs for both genome assembly evaluation and comparative genomics studies.

2 Preliminaries

SDs are generated by large-scale copy events that have occurred during the evolution of the genome. After such a copy event, both sites involved in SD may have undergone a number of changes during the evolutionary history of the genome. Formally, consider a genomic sequence of length , where for any i. Let be a substring in G of length n that starts at position i. Furthermore, let X be the set of all k-mers in the substring . We assume that k is predefined and fixed. The Levenshtein (Levenshtein, 1966) edit distance is defined as: of two substrings and (further simplified as G and G) to be a minimal number of edit operations (i.e. single nucleotide substitutions, insertions and deletions) that are needed to convert the string G into G The length of the alignment between G and G is denoted as l, and clearly . Let us define the notion of edit error (further referred to as just error) between two strings G and G as —the edit distance normalized over alignment length. Intuitively, this is the average number of the edits needed to turn G into G Clearly, two strings are identical if . We consider G and G as a SD of length l with error δ if the following SD conditions are met: where l is the length of alignment between G and G, . We also assume that the overlap between the G and G in the genome is at most .

2.1 Edit distance model

Each SD is generated by a past SV event that copied a substring of length n in G from locus i to locus j. This copy was initially perfect, meaning that corresponding strings G and G (initially both of length n) were identical. However, various changes during the evolutionary history of the genome—such as point mutations, small indels, and other structural variants—have altered both the original and duplicate strings independently. Thus it is necessary to take such changes into consideration when identifying the potential SDs. Although previous SD studies focused only on SDs with pairwise error at most 10%, here we aim to focus on SDs whose error rate can go up to 25% (in another words, ). Higher δ allows us to track the evolutionary history of the human genome to earlier periods. However, it also significantly renders the SD detection problem more difficult since the majority of the known filtering techniques that operate on the edit distance metric space assume much lower values of δ (Andoni ). We address this challenge by leveraging our knowledge from the biology of human and other genomes that different mutation events contribute unequally to the total value of δ in order to better optimize our SD detection algorithms. A key conceptual advance of our work that helps model such events in the human genome is to separately consider germline mutation rates (denoted as small mutations), and de novo SV rates. It is estimated that the substitution rate in the human genome is roughly 0.5 × 10–9 per basepair per year (Scally, 2016), and we may assume a similar rate for other mammalian genomes. The evolutionary split of the human and chimpanzee species is estimated to have occurred ∼7 million years ago (Hedges and Kumar, 2009). Thus, we expect that the probability of a basepair being mutated since the split is roughly 3.5 × 10–3. If we also account for small indels [with an even smaller mutation rate than those of substitutions (Montgomery )], the edit error between any two paralogs of an SD that have occurred after the evolutionary split is not >0.1%. Even if we consider SDs that occurred much earlier in history (e.g. after the lowest common ancestor of human and mouse roughly 90 million years ago), the total edit error will not be >10%. However, the edit error between two paralogs of an SD can be much larger due to large SVs. One such example is insertion of transposons within an SD. These events can be visualized as large gaps within an edit distance string of two SDs which contribute a large share towards the total edit error (Fig. 1) (We note that inversion and translocation events contribute more to sequence divergence due to incorrect alignments, but such events are rare). Thus we assume that small mutations contribute at most towards the total edit error δ (both paralogs can be mutated up to 7.5%); this default setting is higher than the estimated rates (above) for human and mouse genomes in order to be able to handle older species as well. Analogously, large-scale events (subsequently referred to as gaps) contribute the remaining of the edit error. We assume that the probability of a large gap occurring at any basepair in the genome is not >0.005 (as estimated by analysis of existing human SDs; similar value can be derived for other species). Note that the gap penalty is typically calculated via affine gap model, where gap openings are heavily penalized while the gap extensions are either ignored or assigned a very low penalty. Many human SDs have if calculated by standard Levenshtein distance metric. SEDEF uses standard (Levenshtein) gap distance metric while locating seed SDs (meaning all seed SDs have ); however, this restriction is lifted in a later step where we switch to the affine gap penalty.

Fig. 1.

(Left) Simplified representation of a SD lifetime. Initially, a large-scale duplication forms an SD, at which point both the original region and the copy are identical. Then, both the original region and copy undergo various independent changes, such as large-scale deletions (in red), insertions (in blue), and small repeat insertions (in fuchsia). Finally, various germline mutations (in yellow) affect both regions. The resulting SD as seen today, defined as the pair , is shown in the third row. (Right) Shows the idealized Jaccard similarity between the k-mer sets X and X corresponding to the G and G, respectively. Note that some repeats also increase the proportion of shared k-mers. Colors denote same as on Left Furthermore, we assume that the mutations within SD paralogs follow a Poisson error model (Jain ; Fan ), and that those mutations occur independently of each other. It follows that any k-mer in X (the set of k-mers of G) has accumulated on average mutations compared with the originating k-mer in X provided that such a k-mer was part of the original copy. By setting a Poisson parameter , we obtain the probability of an event in which a k-mer is preserved in both paralogs of an SD (i.e. that it is error-free): . We call this error model SD error model, and assume that any SD of interest satisfies the error constraints mentioned above.

2.2 Jaccard similarity

Suppose that we ask whether two substrings G and G are similar to each other, where the length of both strings is n. One way to measure the similarity of those substrings is to analyze their respective k-mer sets X and X and to count the number of shared k-mers between them. This metric is known as Jaccard similarity of sets X and X, and is formally defined as Clearly, higher similarity of strings G and G implies a larger value of . The calculation of Jaccard similarity between two sets can be approximated via the MinHash technique developed by Broder (1997), who proved that given a universe of all k-mers U and a random permutation h on U (typically a hash function with no collisions), it follows that where is the minimal member of T with respect to h. Furthermore, can be estimated and efficiently computed by calculating where is the sketch of X and stands for a subset of s elements from X whose hash values are minimal with respect to the hash function h (such elements are called minimizers of set X). This estimate is unbiased as long as h is random, and its accuracy depends on the sketch size s. Since in practice s is much smaller than , calculating a MinHash estimate is substantially faster compared with the calculation of . The performance of the MinHash technique can be further improved in the context of large strings, as shown by Jain . Instead of computing , it is possible to compute , where is a winnowing fingerprint of the corresponding string G is calculated by sliding a window of size w through G and by taking in each window a k-mer of minimal hash value (in case of a tie, the rightmost k-mer is selected). The expected size of W(A) for a random sequence A is (Schleimer ). The main benefit of winnowing, aside from speeding up the construction of sketch S(A), is the fact that winnow W(A) can be computed efficiently in linear time and O(w) space in a streaming fashion with appropriate data structures (Carruthers-Smith, 2013). Moreover, it has been empirically shown that can be efficiently estimated by calculating a winnowed MinHash score of sets X and X (Jain ): Furthermore, given the minimal desired value τ of Jaccard similarity between the two sets, it follows that (Jain ). This estimate can be used to efficiently filter out any sets X and X whose Jaccard similarity is below a given threshold with high confidence.

3 Materials and methods

The SD detection problem can be formulated as follows: find all pairs of loci (i, j) inside a genome G with l ≥1000 such that (i) the edit error (i.e. divergence) between G and G is at most δ; and (ii) the corresponding alignment between G and G is of size at least l and is not contained within a larger alignment satisfying the SD criteria. In other words, for any pair (i, j) we aim to find a maximal valid region alignment of size l between G and G that satisfies the criteria and error model of SDs. A naïve method for locating SDs within any genomic sequence G consists of locally aligning G onto itself, followed by the analysis of all acceptable paths within a local alignment matrix. However, this strategy is impractical for large since the best known algorithms for optimal local alignment require time and space. Another possible approach, which we take, is to solve this problem by iterating through each pair of indices (i, j) within G and testing whether the matching G and G satisfy the SD criteria through global alignment (given a fixed size n of G and G). Although this method, if implemented naïvely, is still too slow for larger genomes and requires quadratic space, it can be significantly accelerated by filtering out any pair (i, j) that is unlikely to form an SD. This iterative approach is the cornerstone of our SD detection framework, SEDEF, which consists of a novel seed and extend algorithm (Fig. 2):

Fig. 2.

Step-by-step depiction of the SEDEF framework. Our contribution is highlighted above the steps in the dark gray boxes

SD seeding: Initially, we aim to find all pairs of strings (G, G)—called seed SD—such that the length of both strings is , and such that G and G are believed to satisfy the SD criteria. We achieve this by iterating through the genome, and for each locus i in the genome rapidly enumerating all feasible pairs j for which winnowed MinHash Jaccard similarity between G and G goes over a pre-defined threshold τ. SD extension: Here we relax the condition that both G and G have the same size n, and keep expanding both seed regions G and G until the winnowed MinHash estimate drops below τ. These enlarged seed SDs are called potential SD regions. We terminate this extension when we either reach the maximal allowed value of SD, or if the extension causes G and G to significantly overlap. SD chaining: Finally, we locate all ‘true’ SDs within any potential SD region and calculate their alignments by locally aligning potential SD regions via local chaining and sparse dynamic programing. Afterwards, we filter out any spurious hits and report the remaining SDs. Step-by-step depiction of the SEDEF framework. Our contribution is highlighted above the steps in the dark gray boxes

3.1 Identifying seed SDs

In order to verify if the strings G and G have edit error under the SD error model, we will calculate the Jaccard similarity of their corresponding k-mer sets X and X and check if it is . For the sake of explanation, we will assume that (analogous reasoning holds if this is not the case). If c is (number of shared k-mers), and (the size of sets X and X), we can express the Jaccard similarity of those sets as . We will also assume that no k-mer occurs twice in these sets; this assumption is sufficient for the calculation of lower bound below. Simply using error δ to calculate the expected lower bound of Jaccard similarity τ is infeasible in practice due to the large value of δ in our setting. However, as noted earlier, differences between duplicated regions are not chosen uniformly at random, and thus we can separate the error δ into the , where δ is the error rate of the small mutations and δ is the error rate of large indels, as defined in the preliminaries. This separation of the two error rates is one of the novel contributions of our work. If there exists a valid SD spanning G and G, set X can be considered as a union of two disjoint sets and , where represents the k-mers initially copied by the SD event and might have undergone small mutations, while contains all ‘new’ k-mers introduced by subsequent large events such as SVs and large indels (analogous separation applies to X as well). In this way we can separate the effects of the small mutations and large-scale events. Ideally shares no k-mers with X, while since we expect some shared k-mers to remain error-free after small mutations (see Fig. 1 for visualization). Now let us also express t as , where and (note that implies because the small mutations keep strings that generate and similar in size; thus we assume w.l.o.g that ). Let be the ratio of k-mers that are not mutated in both and (we assume that and share no common k-mers, which is a valid assumption for a lower bound calculation). Its expected value, provided a Poisson error model introduced above, is (Jain ). Now we proceed to estimate the minimal required Jaccard similarity of X and X Note that so far: ; ; and because (equality holds for the ideal condition where ). It follows that: Since and the expected value of is , it clearly follows that the minimum required expectation of Jaccard similarity τ is at least: To find the seed SDs, we follow a similar strategy as described in (Jain ), where our G and G correspond to the long reads and the genomic hits. We start by indexing a genome G and constructing an index I of genome G that is a sorted list of unique pairs (i, x) where x is a k-mer in the winnow and i is a starting position of x in G. We also construct a reverse index : it provides for any input k-mer x a list of all positions i in G such that . These two tables are computationally inexpensive to calculate and allow us to quickly calculate winnow of any substring G in G. For any locus i within G, we enumerate a list of all pairs . By using the winnowed MinHash lemma, we know that substring G starting at some locus j is a potential SD match for G if and share at least k-mers, where the sketch size s is set to . Since C is sorted by index, we can use this lemma to efficiently select all candidate locations for which by ‘rolling’ a MinHash calculation as follows (Jain ). We start by setting j = j, and then construct an ordered set where each element y is assigned 1 if it belongs to the intersection of and zero otherwise (Such a set can be efficiently implemented with a balanced binary tree where any update operation costs only .). Then we keep ‘rolling’ G by increasing j: this corresponds to checking the similarity between and , wherein we remove any minimizer from L which occurred at position j and add any minimizer that occurs at the position . Note that any such step costs at most operations (where s is the sketch size). With the appropriate auxiliary structures, we can calculate the winnowed MinHash estimate of and in O(1) time. Once we find a j for which the corresponding MinHash estimate is maximal and above τ, we add the pair (i, j) to the list of found SD seeds.

3.2 Finding potential SD regions

So far, we have assumed that the value of n is fixed and that . Now we lift this restriction and attempt to extend any seed SD as much as possible in both directions in order to ensure that we can find the boundaries of ‘true’ SDs. This can be done by iteratively increasing the values of n and m by one [each step takes time], which essentially keeps expanding the sets and : any minimizer which occurs at loci and within G is added to the ordered set L. Here we utilize the same structures as in the previous step (see Section 3.4), and keep extending SD region until the value of the winnowed MinHash estimate goes below τ. We also terminate extension if both n and m become too large (we limit SEDEF to find potential SDs of at most 1 Mbp in length, as per WGAC). Note that the term keeps growing while stays the same if two regions stop being similar after some time, which iteratively lowers the Jaccard estimate. We also interrupt the extension if the strings G and G begin to overlap. Note that we can perform this extension in the reverse fashion, by slowly decreasing the values i and j and applying the same techniques as described above. Finally, we report the largest G and G whose corresponding MinHash estimates are above τ. For each potential SD, we also apply a q-gram filter (Jokinen and Ukkonen, 1991) in order to further reduce the rate of false positives as follows. Define the q-gram similarity of strings G and G to be the total number of q-mers shared by both G and G We adapt the well-known q-gram lemma for our problem as follows: any G and G whose edit error is below δ and satisfies the SD error model will share at least q-grams, where we assume that and where p is the expected number of gaps per basepair in the genome. This modification allows us to losslessly reject any pair of substrings G and G that do not satisfy the SD error model for the given value of p The aforementioned algorithm performed on the whole human genome produces >500 million potential SD regions due to the presence of various small repeats in the genome. In order to alleviate this problem, we only use k-mers that contain at least one non-repeat-masked nucleotide during the detection of seed SDs. In order to allow the case of repeats being inserted in the SD during the evolutionary process, the SD extension step uses any available k-mer to extend seed SDs. Finally, we pad each potential SD region with a pre-defined number of bases (which is a function of the size of the potential SD region) in order to further increase the probability of locating large SDs within the potential regions.

3.3 Detecting final SDs

After finding the potential SD regions, we enumerate all local alignments of size 1000 within those regions that satisfy the SD criteria. In order to do this efficiently, SEDEF employs a two-tiered local chaining algorithm similar to those in (Abouelhoda and Ohlebusch, 2003; Myers and Miller, 1995). In the first part, we use a seed-and-extend method to construct the list of matching seed locations (of size 11 and higher), and proceed by finding the longest chains formed by those seeds via an sparse dynamic programing algorithm as described in Abouelhoda and Ohlebusch (2003) and Myers and Miller (1995). In this step, we restrict the maximum gap size between the seeds to in order to cluster the seeds within a chain as ‘close’ as possible. After finding these initial chains (which might span <1000 bp), we refine them by further chaining them into the large final chains by allowing larger gaps. In order to retain compatibility with WGAC, which allows arbitrary large gaps within the SD (since it does not penalize the gap extension), we use the affine gap penalty during the construction of SD chains; however, we limit gaps to no longer than 10 000 bp in order to avoid low-quality alignments. Chaining is accompanied by the global alignments which are done with the KSW2 library, which utilizes ‘single instruction, multiple data’ (SIMD) parallelization through Streaming SIMD Extensions (SSE) instructions to speed up the global sequence alignment (Li, 2017). Importantly, we report all our alignments in standard BEDPE format, together with corresponding edit strings in CIGAR format (Li ) and various other useful metrics similar to WGAC such as Kimura two parameter genetic distance (Kimura and Ohta, 1972) and Jukes-Cantor distance (Jukes and Cantor, 1969). In our experiments, we used k = 12 for the seed SD stage and k = 11 for chaining step (note that this parameter is configurable by user). While lower values of k may improve the sensitivity, we found that any such improvement is rather negligible and not worth the increase in the running time. On the other hand, higher values of k improve the running time while lowering the sensitivity.

4 Results

We evaluated SEDEF using the human reference genome (UCSC hg19) and mouse reference genome (UCSC mm8), and compared its calls to WGAC calls. Note that as mentioned in the Introduction it is not possible to run WGAC without Sun Solaris operating system; therefore, we were not able to benchmark it ourselves. WGAC calls were obtained from http://humanparalogy.gs.washington.edu and http://mouseparalogy.gs.washington.edu. WGAC calls are the current gold (and only) standard of SDs in both human and mouse genomes, and are used as SD annotations by UCSC Genome Browser. In case of human genome, the entire process took around 10 CPU h with the peak RAM usage of 7 GB in single-CPU mode. SEDEF is also highly parallelizable, and it took only 14 min for the whole process to finalize on 80 CPU cores. This is a significant improvement over WGAC, which takes several weeks to complete (private communication). Similar running times were observed in mouse genome, despite the fact that mouse genome contains significantly more repeats than human genome and thus necessitates longer running times (She . Run times on a single CPU and 80 CPU cores when ran in parallel via GNU Parallel (Tange, 2011) are given in Table 1.

Table 1.

Running time performance of SEDEF in single-core mode and multi-core mode on 80 Intel Xeon E7-4860 v2 cores at 2.60 GHz

	Human (hg19)			Mouse (mm8)
	Total	Seeding and extending	Chaining and aligning	Total	Seeding and extending	Chaining and aligning
1 core	10 h 30 min	7 h 33 min	2 h 57 m	13 h 7 min	7 h 53 min	5 h 14 min
80 cores	0 h 14 min	0 h 10 min	0 h 04 m	0 h 30 min	0 h 10 min	0 h 20 min

Running time performance of SEDEF in single-core mode and multi-core mode on 80 Intel Xeon E7-4860 v2 cores at 2.60 GHz SEDEF initially detected around 2 250 000 seed SD regions in human genome. After the chaining process, the final number of SDs was reduced to ≈186 400. Finally, after filtering out the common repeats and other spurious hits, we report 67 882 final SD pairs that cover 219 Mbp of the human genome. This is a significant increase over WGAC data, which reports 24 477 SD pairs that cover 159 Mbp of the genome. Of this 60 Mbp increase in the duplication content, 30 Mbp belongs to regions in the genome without common repeats. Figure 3 shows the genome coverage, together with size and error distribution of SDs found by SEDEF and WGAC. The majority of SEDEF SDs have cumulative error δ (with affine gap penalty) around 15%. As for the mouse genome, SEDEF found 352 991 final SDs which cover 259 Mbp of the genome, as compared with 140 Mbp covered by 117 213 WGAC SDs. Of the additional 120 Mbp found by SEDEF, 45 Mbp belongs to non common repeat regions.

Fig. 3.

(Left) Performance of SEDEF’s algorithm on simulated SDs. x-axis is the total simulated SD error rate δ, whereas y axis is the number of correctly detected SDs (total 1000 for each δ). Since SEDEF successfully detects more than 995 simulated SDs for any δ, the plot area is cropped. (Right) Venn diagram depicts the SD coverage of the human and mouse genome (in Mbp) as calculated by SEDEF and WGAC. Intersected region stands for the bases covered by both SEDEF and WGAC

4.1 Filter and alignment accuracy

4.1.1 Simulations

We also evaluated the accuracy of the seeding and chaining process based on total error rate δ. For this purpose, we generated 1000 random sequences of sizes 1–100 Kbp for each (i.e. up to 30%), and for each such sequence generated a random SD according to the SD criteria defined above (where δ and δ are randomly chosen such that they are both less than ). All sequences and mutations were randomly generated with uniform distribution. These two sequences (original one and the randomly mutated one) were fed to SEDEF, and then we checked whether SEDEF finds a match between these two sequences, and whether this match covers the original SDs (a match covers SD if >95% of the SD bases are included in the match). As shown in Figure 3, SEDEF’s overall sensitivity is 99.94%, and the sensitivity drops slowly as δ increases. However, even for , sensitivity remains above 99%. We performed a similar experiment on chromosome 1, where we randomly fetched 10 000 sequences (uniform distribution) of various lengths and introduced random mutations to simulate a SD event. In this experiment, SEDEF had only a 0.15% false negative rate (i.e. undetected SDs), where all missed duplications were very small SDs of lengths ≈1000.

4.1.2 WGAC coverage

It is worth mentioning that SEDEF-detected human SDs completely cover ≈98% of the previously reported SD intervals (≈99.6% in basepairs) by WGAC. SEDEF entirely misses <0.3% (70) of SD intervals reported in WGAC results, and for the 1.4% of WGAC SDs, SEDEF reports partial overlap (i.e. <80% reciprocal overlap). All together, SEDEF misses about 0.6 Mbp out of 159 Mbp as reported by WGAC (≈0.4%), where 0.5 Mbp contained short common repeats. We note that several WGAC SDs are in fact common repeats, and that several WGAC alignments contain long gaps. This is likely due to the dependency of WGAC on common repeat annotations, which may not be comprehensive. Additionally, WGAC employs several heuristics to reinsert common repeats to fuguized putative duplications that might ‘glue’ very short non common repeat segments into larger segments with high repeat content that show similar alignment properties to a SD. This effect is much more present in mouse genome, where SEDEF misses 16 471 WGAC SDs (14.1%), and partially covers 2.2% of such SDs. However, in terms of basepairs SEDEF only misses 1.5 Mbp (0.1 Mbp non common repeat elements). After extra validation, we found that most of missed WGAC SDs (≈14 470) are in fact common repeats incorrectly reported as SDs; thus SEDEF misses only 1.7% of the correct WGAC SD calls.

4.2 Comparison to other methods

We also evaluated the SD discovery accuracy of whole-genome aligners Minimap2 (Li, 2018) and MUMmer/nucmer (Marçais ) on the human genome assembly (UCSC hg19). These tools do not support SD detection out of the box; however, a self assembly-to-assembly comparison can be performed in order to identify the repetitive regions in the genome. These regions can be refined into SDs after applying further processing with SDDetector (Dallery ) and filtering out candidate SDs which consist solely of common short repeats. When compared with these tools, SEDEF is an integrated pipeline for identifying SDs from scratch on a given assembly. Note that other similar tools, such as DupMasker (Jiang ), are developed to annotate SDs and require already existing SD database from similar genomes to be able to mark SDs in a given genome (Table 2).

Table 2.

SD coverage of the human genome (hg19) as reported by different tools

Tool	Covers	Misses	Extra	Time (h: m)
WGAC (gold standard)	159.5	0.0	0.0	weeks
SEDEF	218.8	0.6	60.0	0:36
Minimap2	53.3	107.3	1.1	1:30
MUMmer/nucmer	142.6	30.8	13.9	≥20:00
SDDetector	30.1	130.8	1.5	≥1:00^a

Note: Misses and Extra are calculated with respect to the WGAC SD calls, which are currently the gold standard of SD calls. Note that we have filtered out all calls where at least one mate is composed solely of common short repeats (Minimap2, MUMmer/nucmer and SDDetector) as we did on SEDEF. All running times were adjusted for 20 CPU cores (all tools which support parallelization were run on 20 cores).

aAdjusted running time for 20 cores; in reality, SDDetector spends hours in the single threaded pre-processing stage. Furthermore, the reported running time only includes post-processing and does not include initial BLAST alignment calculations.

SD coverage of the human genome (hg19) as reported by different tools Note: Misses and Extra are calculated with respect to the WGAC SD calls, which are currently the gold standard of SD calls. Note that we have filtered out all calls where at least one mate is composed solely of common short repeats (Minimap2, MUMmer/nucmer and SDDetector) as we did on SEDEF. All running times were adjusted for 20 CPU cores (all tools which support parallelization were run on 20 cores). aAdjusted running time for 20 cores; in reality, SDDetector spends hours in the single threaded pre-processing stage. Furthermore, the reported running time only includes post-processing and does not include initial BLAST alignment calculations. We ran these tools on 20 CPU cores using the GNU Parallel (Tange, 2011) by aligning all pairs of chromosomes in hg19. Minimap2-based approach identified only 29% of the SD intervals reported by WGAC, which spanned 33% of the duplicated basepairs (53 Mbp out of 159 Mbp). MUMmer/nucmer approached better SD coverage performance, which identified 98.8% of WGAC regions that spanned 89% of duplicated basepairs (143 out of 159 Mbp), but the analysis was much slower and completed in 20 h in the same compute setting. Minimap2 required 1.5 h of run times using 20 CPU cores (in comparison, SEDEF takes only 36 min on 20 cores). Overall, analysis misses a significant amount of duplications in a self-comparison task when using the recommended parameters of intra-species assembly-to-assembly comparison. Meanwhile MUMmer/nucmer-based approach covers the SD regions more consistently with those reported by WGAC; however it still misses many WGAC calls which are found by SEDEF. Finally, SEDEF is able to find more calls compared with the other tools in much shorter amount of time, as shown in Table 2.

5 Conclusion

SDs are among the most important forms of genomic rearrangements that drive genome evolution. However, their accurate identification is lacking due to the unavailability of necessary computational tools. In this article we presented SEDEF to help fill this gap in methodology. In future work, we aim to characterize the effect of various edit distance embeddings and techniques such as gapped q-grams (Bar-Yossef ; Burkhardt and Kärkkäinen, 2002). Although many of these techniques have been previously implemented (Hanada ), our initial experiments did not show that any such embeddings or techniques are beneficial for strings with large edit distances. SEDEF is designed as a fast, accurate, and user friendly tool to discover duplicated segments in genome assemblies. Therefore it aims to help researchers easily identify duplicated segments in genomes from several organisms, enabling them to extend their ability to perform comparative genomic studies in complex regions of the genome. We aim to extend it with an A-Bruijn graph based analysis (Jiang ) to provide a full view of the evolution of SDs. Armed with the extensions as we mention earlier, we will then use SEDEF to fully analyze reference genome assemblies from various genomes to both evaluate the assembly accuracy, and to better understand the role of SDs in organism evolution.

33 in total

1. Recent segmental duplications in the human genome.

Authors: Jeffrey A Bailey; Zhiping Gu; Royden A Clark; Knut Reinert; Rhea V Samonte; Stuart Schwartz; Mark D Adams; Eugene W Myers; Peter W Li; Evan E Eichler
Journal: Science Date: 2002-08-09 Impact factor: 47.728

Review 2. The mutation rate in human evolution and demographic inference.

Authors: Aylwyn Scally
Journal: Curr Opin Genet Dev Date: 2016-08-31 Impact factor: 5.578

Review 3. Repetitive DNA and next-generation sequencing: computational challenges and solutions.

Authors: Todd J Treangen; Steven L Salzberg
Journal: Nat Rev Genet Date: 2011-11-29 Impact factor: 53.242

4. The sequence and de novo assembly of the giant panda genome.

Authors: Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; Quanfei Huang; Qingle Cai; Bo Li; Yinqi Bai; Zhihe Zhang; Yaping Zhang; Wen Wang; Jun Li; Fuwen Wei; Heng Li; Min Jian; Jianwen Li; Zhaolei Zhang; Rasmus Nielsen; Dawei Li; Wanjun Gu; Zhentao Yang; Zhaoling Xuan; Oliver A Ryder; Frederick Chi-Ching Leung; Yan Zhou; Jianjun Cao; Xiao Sun; Yonggui Fu; Xiaodong Fang; Xiaosen Guo; Bo Wang; Rong Hou; Fujun Shen; Bo Mu; Peixiang Ni; Runmao Lin; Wubin Qian; Guodong Wang; Chang Yu; Wenhui Nie; Jinhuan Wang; Zhigang Wu; Huiqing Liang; Jiumeng Min; Qi Wu; Shifeng Cheng; Jue Ruan; Mingwei Wang; Zhongbin Shi; Ming Wen; Binghang Liu; Xiaoli Ren; Huisong Zheng; Dong Dong; Kathleen Cook; Gao Shan; Hao Zhang; Carolin Kosiol; Xueying Xie; Zuhong Lu; Hancheng Zheng; Yingrui Li; Cynthia C Steiner; Tommy Tsan-Yuk Lam; Siyuan Lin; Qinghui Zhang; Guoqing Li; Jing Tian; Timing Gong; Hongde Liu; Dejin Zhang; Lin Fang; Chen Ye; Juanbin Zhang; Wenbo Hu; Anlong Xu; Yuanyuan Ren; Guojie Zhang; Michael W Bruford; Qibin Li; Lijia Ma; Yiran Guo; Na An; Yujie Hu; Yang Zheng; Yongyong Shi; Zhiqiang Li; Qing Liu; Yanling Chen; Jing Zhao; Ning Qu; Shancen Zhao; Feng Tian; Xiaoling Wang; Haiyin Wang; Lizhi Xu; Xiao Liu; Tomas Vinar; Yajun Wang; Tak-Wah Lam; Siu-Ming Yiu; Shiping Liu; Hemin Zhang; Desheng Li; Yan Huang; Xia Wang; Guohua Yang; Zhi Jiang; Junyi Wang; Nan Qin; Li Li; Jingxiang Li; Lars Bolund; Karsten Kristiansen; Gane Ka-Shu Wong; Maynard Olson; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal: Nature Date: 2009-12-13 Impact factor: 49.962

5. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

6. Psoriasis is associated with increased beta-defensin genomic copy number.

Authors: Edward J Hollox; Ulrike Huffmeier; Patrick L J M Zeeuwen; Raquel Palla; Jesús Lascorz; Diana Rodijk-Olthuis; Peter C M van de Kerkhof; Heiko Traupe; Gys de Jongh; Martin den Heijer; André Reis; John A L Armour; Joost Schalkwijk
Journal: Nat Genet Date: 2007-12-02 Impact factor: 38.330

7. Limitations of next-generation genome sequence assembly.

Authors: Can Alkan; Saba Sajjadian; Evan E Eichler
Journal: Nat Methods Date: 2010-11-21 Impact factor: 28.547

8. Great ape genetic diversity and population history.

Authors: Javier Prado-Martinez; Peter H Sudmant; Jeffrey M Kidd; Heng Li; Joanna L Kelley; Belen Lorente-Galdos; Krishna R Veeramah; August E Woerner; Timothy D O'Connor; Gabriel Santpere; Alexander Cagan; Christoph Theunert; Ferran Casals; Hafid Laayouni; Kasper Munch; Asger Hobolth; Anders E Halager; Maika Malig; Jessica Hernandez-Rodriguez; Irene Hernando-Herraez; Kay Prüfer; Marc Pybus; Laurel Johnstone; Michael Lachmann; Can Alkan; Dorina Twigg; Natalia Petit; Carl Baker; Fereydoun Hormozdiari; Marcos Fernandez-Callejo; Marc Dabad; Michael L Wilson; Laurie Stevison; Cristina Camprubí; Tiago Carvalho; Aurora Ruiz-Herrera; Laura Vives; Marta Mele; Teresa Abello; Ivanela Kondova; Ronald E Bontrop; Anne Pusey; Felix Lankester; John A Kiyang; Richard A Bergl; Elizabeth Lonsdorf; Simon Myers; Mario Ventura; Pascal Gagneux; David Comas; Hans Siegismund; Julie Blanc; Lidia Agueda-Calpena; Marta Gut; Lucinda Fulton; Sarah A Tishkoff; James C Mullikin; Richard K Wilson; Ivo G Gut; Mary Katherine Gonder; Oliver A Ryder; Beatrice H Hahn; Arcadi Navarro; Joshua M Akey; Jaume Bertranpetit; David Reich; Thomas Mailund; Mikkel H Schierup; Christina Hvilsom; Aida M Andrés; Jeffrey D Wall; Carlos D Bustamante; Michael F Hammer; Evan E Eichler; Tomas Marques-Bonet
Journal: Nature Date: 2013-07-03 Impact factor: 49.962

9. Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite gene clusters.

Authors: Jean-Félix Dallery; Nicolas Lapalu; Antonios Zampounis; Sandrine Pigné; Isabelle Luyten; Joëlle Amselem; Alexander H J Wittenberg; Shiguo Zhou; Marisa V de Queiroz; Guillaume P Robin; Annie Auger; Matthieu Hainaut; Bernard Henrissat; Ki-Tae Kim; Yong-Hwan Lee; Olivier Lespinet; David C Schwartz; Michael R Thon; Richard J O'Connell
Journal: BMC Genomics Date: 2017-08-29 Impact factor: 3.969

10. ARCS: scaffolding genome drafts with linked reads.

Authors: Sarah Yeo; Lauren Coombe; René L Warren; Justin Chu; Inanç Birol
Journal: Bioinformatics Date: 2018-03-01 Impact factor: 6.937

14 in total

1. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference.

Authors: Alexis Criscuolo
Journal: F1000Res Date: 2020-11-10

2. Segmental duplications and their variation in a complete human genome.

Authors: Mitchell R Vollger; Xavi Guitart; Philip C Dishuck; Ludovica Mercuri; William T Harvey; Ariel Gershman; Mark Diekhans; Arvis Sulovari; Katherine M Munson; Alexandra P Lewis; Kendra Hoekzema; David Porubsky; Ruiyang Li; Sergey Nurk; Sergey Koren; Karen H Miga; Adam M Phillippy; Winston Timp; Mario Ventura; Evan E Eichler
Journal: Science Date: 2022-04-01 Impact factor: 63.714

3. Origins and evolution of extreme life span in Pacific Ocean rockfishes.

Authors: Sree Rohit Raj Kolora; Gregory L Owens; Juan Manuel Vazquez; Alexander Stubbs; Kamalakar Chatla; Conner Jainese; Katelin Seeto; Merit McCrea; Michael W Sandel; Juliana A Vianna; Katherine Maslenikov; Doris Bachtrog; James W Orr; Milton Love; Peter H Sudmant
Journal: Science Date: 2021-11-11 Impact factor: 63.714

4. Chromosomal inversion polymorphisms shape the genomic landscape of deer mice.

Authors: Olivia S Harringmeyer; Hopi E Hoekstra
Journal: Nat Ecol Evol Date: 2022-10-17 Impact factor: 19.100

5. Gene clustering and copy number variation in alkaloid metabolic pathways of opium poppy.

Authors: Qiushi Li; Sukanya Ramasamy; Pooja Singh; Jillian M Hagel; Sonja M Dunemann; Xue Chen; Rongji Chen; Lisa Yu; Joseph E Tucker; Peter J Facchini; Sam Yeaman
Journal: Nat Commun Date: 2020-03-04 Impact factor: 14.919

6. Genome structure variation analyses of peach reveal population dynamics and a 1.67 Mb causal inversion for fruit shape.

Authors: Jiantao Guan; Yaoguang Xu; Yang Yu; Jun Fu; Fei Ren; Jiying Guo; Jianbo Zhao; Quan Jiang; Jianhua Wei; Hua Xie
Journal: Genome Biol Date: 2021-01-05 Impact factor: 13.583

7. The structure, function and evolution of a complete human chromosome 8.

Authors: Glennis A Logsdon; Mitchell R Vollger; PingHsun Hsieh; Yafei Mao; Mikhail A Liskovykh; Sergey Koren; Sergey Nurk; Ludovica Mercuri; Philip C Dishuck; Arang Rhie; Leonardo G de Lima; Tatiana Dvorkina; David Porubsky; William T Harvey; Alla Mikheenko; Andrey V Bzikadze; Milinn Kremitzki; Tina A Graves-Lindsay; Chirag Jain; Kendra Hoekzema; Shwetha C Murali; Katherine M Munson; Carl Baker; Melanie Sorensen; Alexandra M Lewis; Urvashi Surti; Jennifer L Gerton; Vladimir Larionov; Mario Ventura; Karen H Miga; Adam M Phillippy; Evan E Eichler
Journal: Nature Date: 2021-04-07 Impact factor: 69.504

8. Long-read assembly of a Great Dane genome highlights the contribution of GC-rich sequence and mobile elements to canine genomes.

Authors: Julia V Halo; Amanda L Pendleton; Feichen Shen; Aurélien J Doucet; Thomas Derrien; Christophe Hitte; Laura E Kirby; Bridget Myers; Elzbieta Sliwerska; Sarah Emery; John V Moran; Adam R Boyko; Jeffrey M Kidd
Journal: Proc Natl Acad Sci U S A Date: 2021-03-16 Impact factor: 11.205

9. Modelling segmental duplications in the human genome.

Authors: Eldar T Abdullaev; Iren R Umarova; Peter F Arndt
Journal: BMC Genomics Date: 2021-07-02 Impact factor: 3.969

10. Dog10K_Boxer_Tasha_1.0: A Long-Read Assembly of the Dog Reference Genome.

Authors: Vidhya Jagannathan; Christophe Hitte; Jeffrey M Kidd; Patrick Masterson; Terence D Murphy; Sarah Emery; Brian Davis; Reuben M Buckley; Yan-Hu Liu; Xiang-Quan Zhang; Tosso Leeb; Ya-Ping Zhang; Elaine A Ostrander; Guo-Dong Wang
Journal: Genes (Basel) Date: 2021-05-30 Impact factor: 4.096