Simulating the dynamics of targeted capture sequencing with CapSim.

Minh Duc Cao, Devika Ganesamoorthy, Chenxi Zhou, Lachlan J M Coin.

Abstract

Motivation: Targeted sequencing using capture probes has become increasingly popular in clinical applications due to its scalability and cost-effectiveness. The approach also allows for higher sequencing coverage of the targeted regions, resulting in better statistical power for analysis. However, because of the dynamics of the hybridization process, it is difficult to evaluate the efficiency of the probe design prior to the experiments, which are time-consuming and costly.
Results: We developed CapSim, a software package for simulation of targeted sequencing. Given a genome sequence and a set of probes, CapSim simulates the fragmentation, the dynamics of probe hybridization and the sequencing of the captured fragments on Illumina and PacBio sequencing platforms. The simulated data can be used for evaluating the performance of the analysis pipeline, as well as the efficiency of the probe design. Parameters of the various stages in the sequencing process can also be evaluated in order to optimize the experiments.
Availability and implementation: CapSim is publicly available under BSD license at https://github.com/Devika1/capsim.
Contact: l.coin@imb.uq.edu.au.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Year:  2018        PMID: 29092025      PMCID: PMC6192212          DOI: 10.1093/bioinformatics/btx691

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

High-throughput sequencing (HTS) has revolutionized genomic studies by enabling cost- and time-effective characterization of the complete genetic information of a sample. In many clinical applications, only a panel of actionable regions is the subject of investigation (Bellos et al., 2014; Samorodnitsky et al., 2015). In these analyses, investigators often use a targeted capture sequencing protocol in which a pool of synthesized oligonucleotides (probes) is used to selectively capture genomic fragments of interest by hybridization (Gnirke et al., 2009). In an efficient design, only DNA fragments from the targeted loci are sequenced. This allows for deeper sequence coverage than whole genome sequencing at a much lower cost and with a faster time to results, resulting in a scalable approach for use in clinical laboratories.
Computational simulation has been indispensable in developing and benchmarking HTS data analysis tools (Escalona et al., 2016). Data simulated in silico are cheaper and faster to produce than real data; they are generated under controlled conditions and can be perfectly characterized. Furthermore, simulation helps investigators to assess the performance of sequencing protocols and to optimize the design prior to performing experiments. While numerous simulators are available for whole genome sequencing (Escalona et al., 2016) and targeted exome sequencing (Kim et al., 2013), to the best of our knowledge there is currently no tool that can simulate the dynamics of the capture process in silico. We believe such a tool would be useful for assessing not only the computational analysis pipeline, but also the efficiency of a capture design.
Here, we present CapSim, a software package that meets the need for simulating targeted capture sequencing data. Given a set of probes, CapSim simulates the dynamics of probe hybridization in silico to generate a set of fragments to be sequenced. Unlike most existing HTS simulation tools, CapSim emulates all stages of the sequencing process, including fragmentation, fragment capture and sequencing. CapSim is written in Java and is able to run natively on any computing platform, making it easily accessible to the bioinformatics community.

2 Materials and methods

The sequencing process starts with the fragmentation of the DNA. CapSim simulates this process by iteratively sampling fragments from the genome sequence with a given length distribution. We model the fragment length using a log-logistic distribution, which was found to provide the best fit to the fragment size distributions from several datasets (see Supplementary Material).
In the next step of targeted capture sequencing, the target probes bind to the fragments by hybridization, and the bound fragments are then pulled down by beads, which bind specifically to the DNA fragments hybridized to the capture probes. CapSim simulates this process by mapping the probes to the fragments. To be computationally efficient, CapSim first maps the probes to the genome sequence; once a fragment is sampled from the genome, it uses a greedy algorithm to determine the maximum number of probes that can bind to the fragment. We model the stochastic nature of the capture process such that the probability of a fragment being captured is proportional to the number of probes bound and inversely proportional to the length of the fragment. This simulation of the hybridization dynamics is shown to generate more realistic capture sequencing data than Wessim (Kim et al., 2013), the only existing tool for simulating capture sequencing (see Results).
The captured fragments are then subjected to in silico sequencing. For Illumina sequencing, CapSim introduces a size distribution of the fragments that form clusters for sequencing. It uses a log-logistic distribution to sample from the fragments simulated in the capture step (see Supplementary Material). The selected fragments are then used for sequencing simulation, in which reads are copied from the two ends of the fragments with errors introduced. For PacBio sequencing, CapSim simulates the polymerase read length from a given distribution (see Supplementary Material). In simulating the sequencing of a fragment, CapSim alternates between copying the two strands with a PacBio error profile until the polymerase read length is reached.
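The fragmentation and capture steps described in this section can be illustrated with a minimal Python sketch. It is not CapSim's implementation (CapSim is written in Java), and several details are assumptions made only for illustration: the function names, the parameter values (`scale`, `shape`), the scaling constant `k` that turns "proportional to the number of probes bound and inversely proportional to fragment length" into a probability, and the reading of the greedy step as selecting the largest set of non-overlapping probe hits that lie entirely within the fragment.

```python
import random


def sample_log_logistic(scale, shape, rng):
    """Draw one value from a log-logistic distribution by inverting its CDF."""
    u = rng.random()  # uniform in [0, 1)
    return scale * (u / (1.0 - u)) ** (1.0 / shape)


def sample_fragment(genome_length, scale, shape, rng):
    """Sample a half-open fragment [start, end) with a log-logistic length."""
    length = round(sample_log_logistic(scale, shape, rng))
    length = max(1, min(genome_length, length))  # clamp to a valid length
    start = rng.randrange(genome_length - length + 1)
    return start, start + length


def count_bound_probes(fragment, probe_hits):
    """Greedy count of non-overlapping probe hits contained in the fragment.

    Assumption: probes compete for the same bases, so two overlapping hits
    cannot both bind. Sorting by end coordinate and taking each compatible
    hit in turn gives the maximum such set (interval scheduling).
    """
    frag_start, frag_end = fragment
    inside = sorted(
        (h for h in probe_hits if h[0] >= frag_start and h[1] <= frag_end),
        key=lambda h: h[1],
    )
    count, last_end = 0, frag_start
    for start, end in inside:
        if start >= last_end:
            count += 1
            last_end = end
    return count


def capture_probability(n_probes, fragment_length, k):
    """Probability proportional to probes bound, inverse to fragment length.

    k is a hypothetical scaling constant; the result is capped at 1.
    """
    return min(1.0, k * n_probes / fragment_length)


def simulate_capture(genome_length, probe_hits, n_fragments,
                     scale=500.0, shape=4.0, k=100.0, seed=0):
    """Sample fragments and keep each one with its capture probability."""
    rng = random.Random(seed)
    captured = []
    for _ in range(n_fragments):
        frag = sample_fragment(genome_length, scale, shape, rng)
        n = count_bound_probes(frag, probe_hits)
        if rng.random() < capture_probability(n, frag[1] - frag[0], k):
            captured.append(frag)
    return captured
```

For example, `simulate_capture(10000, [(1000, 1120), (1200, 1320)], 200)` returns the subset of 200 sampled fragments that were retained. Fragments containing no probe hit are never captured in this sketch, so it does not by itself reproduce off-target capture.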

3 Results

We evaluated CapSim by comparing its simulated data with real capture sequencing data generated with the same set of probes. We used the Agilent SureSelect Target Enrichment kit to enrich the targeted regions and performed capture sequencing of the NA12878 sample on both Illumina and PacBio sequencing platforms (see Supplementary Material). Sequence coverage distributions of the simulated data and the real capture sequencing data from Illumina sequencing are shown in Figure 1 for various regions. The coverage distribution of the CapSim simulation data closely resembles that of the real sequencing data. In particular, Figure 1b shows an off-target capture (1 kb upstream of the target region) in the test sample that was also detected in the CapSim simulation data. In contrast, Wessim (Kim et al., 2013) did not replicate the sequence coverage distribution of the real capture sequencing data and did not detect the off-target capture. The CapSim simulation detected 67% of the off-target regions found in the real capture data, whereas Wessim detected only 8% (see Supplementary Material). The CapSim simulation for PacBio also closely resembles the real sequencing data (see Supplementary Material). Wessim does not support simulation of long-read capture sequencing data.
Fig. 1

Sequence coverage distribution of Illumina sequencing data from real capture sequencing data on the test sample (first panel), capture simulation data from CapSim (second panel) and Wessim (third panel). Positions of the capture probes are shown in blue in the bottom capture track panels. (a) and (b) are sequence coverage in targeted regions and (c) is sequence coverage in an off-target region.

4 Conclusion

In this manuscript, we have introduced CapSim, a tool for simulation of targeted capture sequencing on both Illumina and PacBio platforms. Unlike most existing HTS data simulators, CapSim provides parameters for controlling each intermediate stage, allowing users to evaluate the effects of the experimental design before running the experiment (details in GitHub). Once sequencing reads are simulated using CapSim, users can align the reads to the reference genome and assess the performance of the capture design. A customized script to identify probes leading to off-target capture is also provided with CapSim (details in GitHub). This allows users to identify probes that need to be removed from the design in order to reduce off-target capture. We believe CapSim is applicable to all hybridization-based capture methods; however, the data presented here were simulated with Agilent XT, and more assessment is needed to determine whether the simulation data are comparable for other capture methods.
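As one illustration of how the performance of a capture design might be assessed after alignment, the sketch below computes the fraction of aligned reads that overlap a target region. It is not the script distributed with CapSim; the function names and the `(chromosome, start, end)` tuple representation with half-open coordinates are assumptions chosen for brevity, and a real analysis would read alignments from BAM files and targets from a BED file.

```python
def overlaps(read, target):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return read[0] == target[0] and read[1] < target[2] and target[1] < read[2]


def on_target_fraction(reads, targets):
    """Fraction of aligned reads overlapping at least one target region."""
    if not reads:
        return 0.0
    on_target = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return on_target / len(reads)


# Hypothetical alignments and a single hypothetical target region.
reads = [("chr1", 100, 250), ("chr1", 5000, 5150), ("chr2", 100, 250)]
targets = [("chr1", 200, 400)]
fraction = on_target_fraction(reads, targets)
```

Comparing this fraction between simulated and real data, or between candidate probe sets, gives a simple summary of capture efficiency. The linear scan over targets is adequate for small panels; an interval index would be preferable for large designs.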

Funding

The research is supported by funding from the National Health and Medical Research Council (APP1052303).
Conflict of Interest: none declared.
References

1.  Comparison of custom capture for targeted next-generation DNA sequencing.

Authors:  Eric Samorodnitsky; Jharna Datta; Benjamin M Jewell; Raffi Hagopian; Jharna Miya; Michele R Wing; Senthilkumar Damodaran; Juliana M Lippus; Julie W Reeser; Darshna Bhatt; Cynthia D Timmers; Sameek Roychowdhury
Journal:  J Mol Diagn       Date:  2015-01       Impact factor: 5.568

2.  Wessim: a whole-exome sequencing simulator based on in silico exome capture.

Authors:  Sangwoo Kim; Kyowon Jeong; Vineet Bafna
Journal:  Bioinformatics       Date:  2013-02-14       Impact factor: 6.937

3.  cnvCapSeq: detecting copy number variation in long-range targeted resequencing data.

Authors:  Evangelos Bellos; Vikrant Kumar; Clarabelle Lin; Jordi Maggi; Zai Yang Phua; Ching-Yu Cheng; Chui Ming Gemmy Cheung; Martin L Hibberd; Tien Yin Wong; Lachlan J M Coin; Sonia Davila
Journal:  Nucleic Acids Res       Date:  2014-09-16       Impact factor: 16.971

Review 4.  A comparison of tools for the simulation of genomic next-generation sequencing data.

Authors:  Merly Escalona; Sara Rocha; David Posada
Journal:  Nat Rev Genet       Date:  2016-06-20       Impact factor: 53.242

5.  Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing.

Authors:  Andreas Gnirke; Alexandre Melnikov; Jared Maguire; Peter Rogov; Emily M LeProust; William Brockman; Timothy Fennell; Georgia Giannoukos; Sheila Fisher; Carsten Russ; Stacey Gabriel; David B Jaffe; Eric S Lander; Chad Nusbaum
Journal:  Nat Biotechnol       Date:  2009-02-01       Impact factor: 54.908
