Sangwoo Kim, Paul Medvedev, Tara A Paton, Vineet Bafna.
Abstract
Genomic sequence duplication is an important mechanism for genome evolution, often resulting in large sequence variations with implications for disease progression. Although paired-end sequencing technologies are commonly used for structural variation discovery, the discovery of novel duplicated sequences remains an unmet challenge. We analyze duplicons starting from identified high-copy number variants. Given paired-end mapped reads and a candidate high-copy region, our tool, Reprever, identifies (a) the insertion breakpoints where the extra duplicons are inserted into the donor genome and (b) the actual sequence of the duplicon. Reprever resolves ambiguous mapping signatures from existing homologs, repetitive elements and sequencing errors to identify breakpoints. At each breakpoint, Reprever reconstructs the inserted sequence using a profile hidden Markov model (PHMM)-based guided assembly. In a test on 1000 artificial genomes with simulated duplication, Reprever identified novel duplicates in up to 97% of genomes, within 3 bp positional error and 1% sequence error. Validation on 680 fosmid sequences identified and reconstructed eight duplicated sequences with high accuracy. We applied Reprever to reanalyze a re-sequenced data set from the African individual NA18507 and identified >800 novel duplicates, including insertions in genes and insertions with additional variation. Polymerase chain reaction followed by capillary sequencing validated both the insertion locations of the strongest predictions and their predicted sequence.
Year: 2013 PMID: 23658221 PMCID: PMC3695505 DOI: 10.1093/nar/gkt339
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The overall pipeline of Reprever. Insertion breakpoints and consensus sequences of copy count increased regions are assessed through eight steps, starting from donor sample preparation. (1) The donor genome is paired-end sequenced to generate mate pairs. (2) A copy count increased region H is predicted by conventional CNV prediction methods, such as read-depth analysis or comparative hybridization. (3 and 4) Homologs of each copy count increased region H are searched for in the reference. Because read counts and mapping are confounded by the non-unique sequences, we merge and repartition the mate pairs mapped to the homolog set. (5) Each mate pair is further classified into five classes by its mapping information (Supplementary Figure S5). (6) Reads with ambiguous or orphan mapping are realigned to be rescued. (7) Insertion locations of extra copies are analyzed by Reprever tools. Based on discordant read signatures (red mate pairs), putative breakpoints (yellow ovals) are inferred and interrogated by three independent tests: breakpoint shape (Box 1), breakpoint specificity and breakpoint coverage (data not shown; see the ‘Materials and Methods’ section). (8) Once insertion locations are finalized, Reprever tools reconstruct the sequence at the breakpoint, as well as all the existing homologs, using a profile hidden Markov model (PHMM) (Box 2). The PHMMs are trained from boundary to center, as mate pairs are iteratively assigned to train model parameters.
Figure 2. Performance test of Reprever in breakpoint and sequence inference. In total, 1000 simulated genomes were constructed and subjected to random duplication of up to four copies. (A) Reprever identified up to 97% of the breakpoints. In >90% of four-copy insertion cases, Reprever found the exact insertion sites. (B) Distribution of error between true and inferred breakpoints. Errors <0 denote inferred breakpoints located upstream of the true breakpoints; errors >0 denote inferred breakpoints located downstream. The average error size is 3.15 bp. (C) Accuracy of reconstructed duplicate sequences in 969 test genomes. Comparisons between the true and inferred sequences are measured as sequence dissimilarity using BLAST. Compared with the reference sequence A, which is the best estimate of the extra-copy sequence without reconstruction, Reprever-inferred sequences have far fewer mismatches to the true answers (green bars). The performance is consistent regardless of the initial diversity among duplicons (red bars). (D) Reprever can reconstruct multiple duplicates simultaneously without a loss of accuracy.
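The sign convention in panel B can be made concrete with a short sketch. It assumes the signed error is the inferred position minus the true position and that the "average error size" is a mean of absolute errors; both readings are assumptions, and the function names and coordinates below are invented for illustration rather than taken from Reprever.

```python
from statistics import mean

def signed_errors(true_bps, inferred_bps):
    """Signed error per breakpoint (inferred - true).

    Negative: inferred breakpoint lies upstream of the true one.
    Positive: inferred breakpoint lies downstream.
    """
    return [inferred - true for true, inferred in zip(true_bps, inferred_bps)]

def mean_abs_error(errors):
    """Average error size, taken here as the mean absolute error (assumption)."""
    return mean([abs(e) for e in errors])

# Hypothetical coordinates, not data from the paper.
true_bps = [1000, 2500, 4000, 7200]
inferred_bps = [998, 2500, 4005, 7201]

errors = signed_errors(true_bps, inferred_bps)
print(errors)                  # signed distribution, as plotted in panel B
print(mean_abs_error(errors))  # single summary value
```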
Figure 3. Reconstruction of fosmid-validated duplications. (A) Discordant paired-end reads mapped around a validated insertion site chr1:16935160 (from chr1:17069565–17076928 and chr1:16729161–16736531). Only a few discordant reads bridge the insertion site to its template (red arrows). Reprever automatically recruits neighboring orphan reads (blue arrows) to increase read coverage for reconstruction. (B) Comparison of reconstructed and true (fosmid) insertion sequences. The reconstructed sequences (Ins.Reprever) are much closer to the true insertions (Ins.fosmid) than the template sequences (Ref1 and Ref2) are. Reconstructed donor sequences at the template sites (Don1, Don2) are almost identical to the templates. (C) Multiple sequence alignment visualization of sequence variation between templates and reconstructed sequences. Highly varied regions, including deletions, are reconstructed perfectly.
Reprever analysis on NA18507 data given three independent gain calls and one negative call
| Call set (size) | n.accepted (rate) | n.breakpoint |
|---|---|---|
| Each call set | ||
| CNVer (1876) | 1103 (58.8%) | 3993 |
| Yoon | 54 (63.5%) | 168 |
| Conrad (206) | 67 (32.5%) | 173 |
| Random (1000) | 39 (3.9%) | 77 |
| Overlapping call set | ||
| CNVer + Yoon (27) | 22 (81.5%) | 63 |
| CNVer + Conrad (96) | 55 (57.3%) | 194 |
| Yoon + CNVer (16) | 9 (56.3%) | 30 |
| Yoon + Conrad (3) | 3 (100.0%) | 11 |
| Conrad + CNVer (9) | 8 (88.9%) | 9 |
| Conrad + Yoon (2) | 2 (100.0%) | 2 |
| CNVer + Yoon + Conrad (2) | 2 (100.0%) | 4 |
| CNVer*Yoon (12) | 10 (83.3%) | 13 |
| CNVer*Conrad (9) | 8 (88.9%) | 24 |
| Yoon*Conrad (2) | 2 (100.0%) | 3 |
For each call set, Reprever scans potential insertion breakpoints of the given regions.
n.accepted, number of regions that have at least one matching breakpoint; n.breakpoint, number of total breakpoints.
a This call set was targeted only to Chr 1.
b We define an overlapping call set A + B as the subset of call set A containing every region r in A for which there exists a region r′ in B that overlaps at least 50% of r. The overlapping call set A*B is defined similarly but contains only regions where the overlap is satisfied reciprocally. In other words, at least 50% of r is also in r′, and vice versa.
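The two overlap definitions behind the "+" and "*" rows of the table can be sketched as below. This is a minimal illustration, not Reprever's code: the `Region` type, the function names and the half-open coordinate convention are assumptions, and the example regions are invented.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int  # inclusive (assumed convention)
    end: int    # exclusive (assumed convention)

def overlap_fraction(r, other):
    """Fraction of r that is covered by other; 0.0 on different chromosomes."""
    if r.chrom != other.chrom:
        return 0.0
    shared = min(r.end, other.end) - max(r.start, other.start)
    return max(shared, 0) / (r.end - r.start)

def one_way_overlap(a, b, min_frac=0.5):
    """'A + B': regions r in A for which some region in B covers >= min_frac of r."""
    return [r for r in a
            if any(overlap_fraction(r, s) >= min_frac for s in b)]

def reciprocal_overlap(a, b, min_frac=0.5):
    """'A*B': regions r in A with some s in B where each covers >= min_frac of the other."""
    return [r for r in a
            if any(overlap_fraction(r, s) >= min_frac
                   and overlap_fraction(s, r) >= min_frac for s in b)]

# Hypothetical call sets.
a = [Region("chr1", 100, 200), Region("chr1", 1000, 1100)]
b = [Region("chr1", 0, 1000), Region("chr1", 1040, 1120)]

print(one_way_overlap(a, b))     # both regions of a qualify
print(one_way_overlap(b, a))     # only the second region of b qualifies
print(reciprocal_overlap(a, b))  # only the second region of a qualifies
```

The one-way test is asymmetric, which is why the table can report different sizes for CNVer + Yoon and Yoon + CNVer, while each "*" combination appears only once.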
Predicted breakpoints in gene region
| Breakpoint | Region | Gene | Score | CNV source |
|---|---|---|---|---|
| chr1:16766464 | Intron | | 62.07 | CNVer |
| chr1:16775357 | CDS | | 33.02 | CNVer |
| chr1:16778388 | CDS | | 100.07 | CNVer |
| chr1:16784684 | CDS | | 37.8 | CNVer |
| chr1:16792907 | Intron | | 105.0 | CNVer |
| chr1:21665540 | Intron | | 38.91 | CNVer |
| chr1:143665870 | Intron | | 39.2 | CNVer |
| chr1:143989906 | Intron | | 25.3 | CNVer |
| chr1:144008986 | CDS | | 43.25 | CNVer |
| chr1:144011688 | Intron | | 24.3 | CNVer |
| chr1:144013697 | Intron | | 27.01 | CNVer |
| chr1:147009280 | CDS | | 22.42 | CNVer |
| chr1:147965460 | Intron | | 30.57 | CNVer |
| chr2:91467172 | Intron | | 21.46 | CNVer |
| chr2:95963046 | Intron | | 37.5 | CNVer |
| chr2:95965655 | Intron | | 95.25 | CNVer |
| chr2:95967599 | Intron | | 117.56 | CNVer |
| chr3:197148075 | CDS | | 26.93 | CNVer |
| chr3:197149586 | Intron | | 37.05 | CNVer |
| chr3:197150946 | Intron | | 24.0 | CNVer |
| chr4:191100462 | Intron | | 21.48 | CNVer |
| chr4:191105726 | Intron | | 33.2 | CNVer |
| chr5:98898838 | Intron | | 29.67 | CNVer |
| chr6:24791875 | Intron | | 31.0 | Conrad |
| chr9:66729560 | Intron | | 38.62 | CNVer |
| chr11:89298719 | Intron | | 28.56 | Conrad |
| chr12:69819920 | Intron | | 89.0 | Conrad |
| chr15:26137336 | CDS | | 25.98 | CNVer |
| chr16:33492794 | Intron | | 31.81 | CNVer |
| chr22:18976441 | Intron | | 31.7 | CNVer |
Top 30 scored breakpoints are listed here. The entire list including intergenic regions is available in Supplementary Table S3.
CDS, coding DNA sequence.
Figure 4. (A) Four cases are illustrated as examples. (1) A gain call is made at chr1:553585–560412 by CNVer (and chr1:554301–560300 by Yoon et al.). Reprever found that a ∼1.3-kb sub-region is duplicated into chr8:63177985. (2) Similarly, a gain call at chr1:67224800–67226000 by Yoon is explained by Reprever, which shows a duplication at chr16:76346444. Note that the inserted sequence has a partial inversion of a 400-bp 5′ segment (orange block). (3 and 4) Two gain calls were made by CNVer (chr16:16633477–16638823 and chr5:20868352–20870585). The insertion locations found by Reprever lie within larger segmental duplication regions (dotted blue/red blocks). The arrangement of repeat elements (colored diamonds) shows that the copy number increase in the donor genome actually resulted from deletions in the reference genome. (B) Concordant read counts around the putative breakpoints (red arrows). Coverage is calculated using the whole clone (red), the end reads (green) and the gap between end reads (blue) of mate pairs. The relative read count reduction predicts the allele zygosity (cases 1 and 3 heterozygous; cases 2 and 4 homozygous). (C) PCR amplification of duplicated regions. Primers are designed to capture duplication boundaries (encircled red numbers in A). (D) (upper) The inserted sequence of the first case (at chr8:63177988) is identified by Sanger sequencing (annotated with PCR 1/2 and PCR 3/4). In total, 12 SNVs were found. Three consensus sequences reconstructed by Reprever (green) recovered 11 of 12 variations (92%) for the inserted sequence and found four additional variations in the homologs (highlighted in yellow). (lower) The partial inversion of the inserted segment in the second case (at chr16:76346444) is confirmed by Sanger sequencing with two SNVs. Reprever recovered one of them.
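The zygosity reasoning in panel B might be sketched as follows. The sketch assumes that concordant mate pairs spanning a breakpoint can only come from chromosomes that lack the insertion, so concordant coverage should fall to roughly half of the flanking level for a heterozygous insertion and toward zero for a homozygous one. The function name and both cut-offs (0.25 and 0.75) are hypothetical and are not taken from Reprever.

```python
def classify_zygosity(breakpoint_cov, flank_cov, hom_max=0.25, het_max=0.75):
    """Classify an insertion from the relative drop in concordant coverage.

    breakpoint_cov: concordant mate-pair coverage spanning the breakpoint.
    flank_cov: typical concordant coverage in the flanking region.
    hom_max, het_max: hypothetical ratio cut-offs (illustrative only).
    """
    if flank_cov <= 0:
        raise ValueError("flank_cov must be positive")
    ratio = breakpoint_cov / flank_cov
    if ratio <= hom_max:
        return "homozygous"    # almost no pairs span the site intact
    if ratio <= het_max:
        return "heterozygous"  # about half the pairs still span the site
    return "no insertion signal"

# Hypothetical coverage values.
print(classify_zygosity(2, 40))
print(classify_zygosity(21, 40))
print(classify_zygosity(39, 40))
```

In practice the cut-offs would need calibrating against local coverage variance, which this sketch ignores.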