Yinghua Wu, Lifeng Tian, Mario Pirastu, Dwight Stambolian, Hongzhe Li.
Abstract
Copy number variations (CNVs) are associated with many complex diseases. Next generation sequencing data enable one to identify precise CNV breakpoints, to better understand the underlying molecular mechanisms, and to design more efficient assays. Using the CIGAR strings of the reads, we develop a method that can identify the exact CNV breakpoints; when the breakpoints lie in a repeated region, the method reports a range within which the breakpoints can slide. Our method identifies the breakpoints of a CNV using both the positions and the CIGAR strings of the reads that cover those breakpoints. A read with a long soft-clipped part (denoted as S in CIGAR) at its 3' (right) end can be used to identify the 5' (left) side of the breakpoints, and a read with a long S part at its 5' end can be used to identify the breakpoint at the 3' side. To ensure that both types of reads cover the same CNV, we require the overlapped common string to include both of the soft-clipped parts. When a CNV starts and ends in the same repeated region, its breakpoints are not unique; in this case our method reports the leftmost positions for the breakpoints and a range within which the breakpoints can be incremented without changing the variant sequence. We have implemented the methods in a C++ package intended for the current Illumina MiSeq and HiSeq platforms, for both whole genome and exon sequencing. Our simulation studies show that our method compares favorably with other similar methods in terms of true discovery rate, false positive rate, and breakpoint accuracy. Our results from a real application show that the detected CNVs are consistent with zygosity and read depth information. The software package is available at http://statgene.med.upenn.edu/softprog.html.
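The soft-clip rule described in the abstract can be illustrated with a minimal sketch. This is not the MATCHCLIP implementation: the function names, the `min_clip` threshold of 10 bases, and the coordinate convention (1-based SAM `POS`, with the breakpoint reported as the last or first aligned reference base) are assumptions made for illustration only.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = frozenset("MDN=X")  # CIGAR operations that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string such as '60M40S' into [(60, 'M'), (40, 'S')]."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if not ops or "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    return ops


def soft_clip_breakpoint(pos, cigar, min_clip=10):
    """Return (side, reference_position) suggested by a long soft clip, or None.

    pos      -- 1-based leftmost aligned position (SAM POS field)
    min_clip -- hypothetical minimum length for a 'long' soft clip (assumption)
    """
    # Hard clips carry no sequence, so they are ignored when looking at read ends.
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    ref_len = sum(n for n, op in core if op in _REF_CONSUMING)
    if ref_len == 0:
        return None
    left_clip = core[0][0] if core[0][1] == "S" else 0
    right_clip = core[-1][0] if core[-1][1] == "S" else 0
    if right_clip >= min_clip and right_clip >= left_clip:
        # Clip at the 3' (right) end: the last aligned base marks the left-side breakpoint.
        return ("left", pos + ref_len - 1)
    if left_clip >= min_clip:
        # Clip at the 5' (left) end: the first aligned base marks the right-side breakpoint.
        return ("right", pos)
    return None
```

For example, under these assumptions a read at `POS` 1000 with CIGAR `60M40S` yields `("left", 1059)`, and a read at `POS` 2000 with CIGAR `40S60M` yields `("right", 2000)`. Pairing the two read types by matching their overlapped common string, as the abstract requires, is not covered here.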
Keywords: breakpoint; deletion; duplication; exon sequencing; structural variation
Year: 2013 PMID: 23967014 PMCID: PMC3744852 DOI: 10.3389/fgene.2013.00157
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Types of CNVs and their breakpoints.
| Deletion | RNAME[ | |
| | .RNAME[ | ( |
| Tandem duplication | RNAME[ | |
| | .RNAME[ | ( |
| Insertion | RNAME[ | |
| | .RNAME[ | ( |
| | .RNAME[ | ( |
Figure 1. MATCHCLIP algorithm.
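The abstract notes that when a CNV starts and ends in the same repeated region, the leftmost breakpoint is reported together with a range over which it can be incremented without changing the variant sequence. The sketch below shows one way to compute such a pair for a deletion. It is an illustration only: the function name, the 0-based coordinates, and the character-by-character shifting are assumptions, not the published algorithm.

```python
def leftmost_deletion_with_slide(ref, start, length):
    """Return (leftmost_start, slide) for a deletion of ref[start:start + length].

    Coordinates are 0-based (an assumption of this sketch). 'slide' is the number
    of positions the deletion can be shifted to the right of leftmost_start while
    still producing exactly the same variant sequence.
    """
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion lies outside the reference")
    # Shifting left by one keeps the variant sequence unchanged when the base
    # entering the deletion equals the base leaving it.
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    # From the leftmost position, count how far the deletion can move right.
    slide = 0
    while (start + slide + length < len(ref)
           and ref[start + slide] == ref[start + slide + length]):
        slide += 1
    return start, slide


def apply_deletion(ref, start, length):
    """Return the variant sequence obtained by deleting ref[start:start + length]."""
    return ref[:start] + ref[start + length:]
```

With the toy reference `GGACACACTT`, deleting the two bases at index 4 gives `(2, 4)`: the deletion can start at any index from 2 to 6, and each choice produces the same sequence `GGACACTT`.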
Comparison of CNVs detected from simulated sequence reads with the 885 known CNVs of NA12878, for five different methods and different alignment methods.
| bwa PE | 758:17 | 632:2 | 594:80 | 696:158 | 798:291 |
| bwasw | 705:26 | 652:9 | | | |
| bowtie2 PE | 781:18 | 642:6 | 580:76 | 719:165 | 496:146 |
| bowtie2 SE | 728:2 | 635:1 | | | |
| novo PE | 758:8 | 414:2 | 577:26 | 681:123 | 769:223 |
| novo SE | 691:3 | 124:2 | | | |
| bwa PE | 738:12 | 631:32 | 586:42 | 644:71 | 781:301 |
| bwasw | 653:55 | 643:12 | | | |
| bowtie2 PE | 770:26 | 645:21 | 559:59 | 666:85 | 509:154 |
| bowtie2 SE | 723:1 | 633:3 | | | |
| novo PE | 708:4 | 312:2 | 576:21 | 657:60 | 762:226 |
| novo SE | 669:3 | 118:0 | | | |
The numbers in each cell are given in the format “concordant CNVs:false positives”.
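To make the cell format concrete, the sketch below parses one "concordant:false positives" cell and derives two summary fractions. The definitions used here are assumptions chosen for illustration and may differ from the rates reported in the paper: the recovered fraction is the concordant count divided by the 885 known CNVs, and the false-call fraction is the false-positive count divided by all calls, where the total number of calls is taken to be concordant plus false positives.

```python
KNOWN_CNVS = 885  # number of known NA12878 CNVs stated in the table caption


def parse_cell(cell):
    """Parse a cell such as '758:17' into (concordant, false_positives)."""
    concordant, false_pos = (int(x) for x in cell.strip().split(":"))
    return concordant, false_pos


def summarize(cell, known=KNOWN_CNVS):
    """Return illustrative summary fractions for one table cell.

    Assumes total calls = concordant + false positives (not stated in the source).
    """
    concordant, false_pos = parse_cell(cell)
    calls = concordant + false_pos
    return {
        "recovered_fraction": concordant / known,
        "false_call_fraction": false_pos / calls if calls else 0.0,
    }
```

For the cell `758:17`, this gives a recovered fraction of 758/885 and a false-call fraction of 17/775.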
CNVs detected by MATCHCLIP in 20 exome sequenced samples, including 10 samples with long axial length (Long AL) and 10 samples with short axial length (Short AL).
Total, number of CNVs longer than 500 base pairs; New, number of CNVs that do not overlap with any in the estd59 database (1000 Genomes Project Consortium, 2010); D_HET, number of deletion CNVs that have heterozygous sites in the deleted region, where zygosity was called using samtools' mpileup function and bcftools; RDR(DEL/DUP), average read depth ratio (RDR), i.e., the ratio of the read depth inside a CNV region to the read depth outside it. The outer regions include 3000 bases before and 3000 bases after the CNV region. NA indicates that no duplications were detected.
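The read depth ratio defined above can be sketched as follows. This is an illustration, not the authors' code: the function name, the per-base depth list as input, the 0-based half-open interval, and the use of a plain mean over both flanks combined are all assumptions.

```python
def read_depth_ratio(depth, start, end, flank=3000):
    """Ratio of mean depth inside [start, end) to mean depth in the flanking regions.

    depth -- list of per-base read depths along one chromosome (assumed input)
    flank -- number of bases taken before and after the CNV region (3000 per the legend)
    """
    if not 0 <= start < end <= len(depth):
        raise ValueError("CNV region lies outside the depth array")
    inside = depth[start:end]
    # Flanks are truncated at the ends of the array if fewer than 'flank' bases exist.
    outside = depth[max(0, start - flank):start] + depth[end:end + flank]
    if not outside:
        raise ValueError("no flanking bases available")
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    if mean_out == 0:
        raise ValueError("flanking regions have zero depth")
    return mean_in / mean_out
```

On a toy depth profile of 30 outside and 15 inside the region, the function returns 0.5. A ratio near this value is what one would generally expect for a heterozygous deletion, and a ratio above 1 for a duplication.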