Guofeng Meng1, Ying Tan2, Yue Fan2, Yan Wang2, Guang Yang2, Gregory Fanning2, Yang Qiu2.
Abstract
PacBio sequencing is a powerful approach for studying DNA or RNA sequences over longer ranges. It is especially useful for exploring the complex structural variants generated by random integration or multiple rearrangements of endogenous or exogenous sequences. Here, we present a tool, TSD, for complex structural variant discovery using PacBio targeted sequencing data. It allows researchers to identify and visualize the genomic structures of targeted sequences by unlimited splitting, alignment and assembly of long PacBio reads. Application to sequencing data derived from an HBV-integrated human cell line (PLC/PRF/5) indicated that TSD could recover the full profile of HBV integration events, especially for regions with complex human-HBV genome integrations and multiple HBV rearrangements. Compared to other long-read analysis tools, TSD performed better at detecting complex genomic structural variants. TSD is publicly available at: https://github.com/menggf/tsd.
Keywords: structural variants; long reads; genomic structure; PacBio
Year: 2019 PMID: 30850377 PMCID: PMC6505135 DOI: 10.1534/g3.118.200900
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. The flowchart of TSD and its evaluation. (a) TSD is designed to identify the structural organization of complex SVs. In this example, four pieces of exogenous DNA (fragments from the targeted sequences) rearrange and integrate into the host genome. TSD is used to identify their origins, rearrangement and integration location in the host genome. (b) The flowchart of TSD for processing PacBio reads from complex SVs. The long reads are aligned to both the host genome and the targeted sequence using the BWA-MEM tool. If a read is only partially mapped, the unmapped fragments are cut out for a new round of alignment. This is repeated until no unmapped fragment is longer than 200 bp. The final SV structure is determined by assembling the mapped fragments. (c) Building consensus fragments. The error-tolerant setting of BWA frequently causes the reported start and end locations to deviate from the true ones. To overcome this problem, we infer the consensus start and end locations from redundant targeted sequencing reads using a voting strategy. The red dashed lines indicate the acceptable ranges used to select reads with the same start or end location. (d) Evaluation using simulated PacBio reads. 99.4% of reads are correctly mapped to the human genome; using the correctly mapped reads, 100% of simulated SVs are recovered accurately for both breakpoint location and direction.
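The split-and-realign loop in panel (b) and the voting step in panel (c) can be illustrated with a minimal Python sketch. This is not TSD's code. The `align` callback is a hypothetical stand-in for a BWA-MEM call and is assumed to return a single mapped interval per query. The function names and the `tolerance` value of 10 bp are assumptions made for illustration. Only the 200 bp threshold comes from the caption.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

MIN_FRAGMENT = 200  # bp; unmapped pieces longer than this are re-aligned (from the caption)

# Hypothetical aligner interface: returns (start, end) of the mapped part of
# the query in query coordinates, or None if nothing maps.
Aligner = Callable[[str], Optional[Tuple[int, int]]]


def split_align(read: str, align: Aligner, offset: int = 0) -> List[Tuple[int, int]]:
    """Align a read, then recursively re-align unmapped flanks > MIN_FRAGMENT.

    Returns the mapped intervals in read coordinates, ordered left to right.
    """
    hit = align(read)
    if hit is None or hit[1] <= hit[0]:
        return []  # nothing mapped (or an empty hit): stop splitting here
    start, end = hit
    mapped = [(offset + start, offset + end)]
    left, right = read[:start], read[end:]
    if len(left) > MIN_FRAGMENT:
        mapped = split_align(left, align, offset) + mapped
    if len(right) > MIN_FRAGMENT:
        mapped = mapped + split_align(right, align, offset + end)
    return mapped


def consensus_position(positions: List[int], tolerance: int = 10) -> int:
    """Vote for a consensus boundary among positions reported by redundant reads.

    Each candidate is scored by the number of reads within +/- tolerance bp;
    ties go to the most frequent exact position, then to the smallest one.
    The tolerance value is an assumption, not a TSD parameter.
    """
    counts = Counter(positions)

    def support(p: int) -> int:
        return sum(c for q, c in counts.items() if abs(q - p) <= tolerance)

    return max(counts, key=lambda p: (support(p), counts[p], -p))
```

In practice an aligner reports strand, reference name and possibly several hits per query, so a real implementation would carry more than a single interval per fragment; the sketch keeps only what is needed to show the recursion and the vote.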
Figure 2. TSD discovers the HBV integration events in PLC/PRF/5 cells. (a) A genomic region with an HBV integration event identified by TSD, consisting of 6 DNA fragments. The first line shows the assembled HBV-integrated region and the other lines show PacBio reads derived from this region. Each row can represent multiple PacBio reads if they have the same fragment composition; the number on the right side is the count of PacBio reads. The line colors and arrows indicate the mapped strand: dark green, forward strand; brown, reverse strand. (b) The HBV integration events discovered by TSD using PacBio sequencing data. TSD identified 9 HBV-integrated regions, including multiple regions with complex HBV rearrangements. Most of the integration sites are validated by an NGS study.
Evaluation with HGAP and Sniffles
| Region | HGAP Int. | HGAP Rea. | Sniffles Int. | Sniffles Rea. | TSD Int. | TSD Rea. | NGS Int. | NGS Rea. |
|---|---|---|---|---|---|---|---|---|
| Region1 | left | no | right | no | both | yes | right | — |
| Region2 | no | no | both | no | both | yes | both | — |
| Region3 | no | no | right | no | both | yes | both | — |
| Region4 | right | no | both | yes | both | yes | both | — |
| Region5 | no | no | both | yes | both | yes | right | — |
| Region6 | no | no | both | yes | both | yes | both | — |
| Region7 | both | yes | both | yes | both | yes | both | — |
| Region8 | left | yes | both | yes | both | yes | both | — |
| Region9 | left | yes | both | yes | both | yes | both | — |
Int.: integration sites; Rea.: rearrangement sites; left/right: only the left/right site of an integration event is identified.