Bayo Lau1, Marghoob Mohiyuddin1, John C Mu1, Li Tai Fang1, Narges Bani Asadi1, Carolina Dallett2, Hugo Y K Lam1.
Abstract
LongISLND is a software package designed to simulate sequencing data according to the characteristics of third-generation, single-molecule sequencing technologies. The general software architecture is easily extendable, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. We demonstrate its utility by downstream processing with consensus building and variant calling.
Year: 2016 PMID: 27667791 PMCID: PMC5167071 DOI: 10.1093/bioinformatics/btw602
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) Number of 7-mers binned by accuracy, determined to within 1% as discussed in the Supplementary Material. A context-independent error profile would yield a delta-like function peaked at the global accuracy. (b) Fraction of samples of a given sequence length aligned to a homopolymer of true length 6. Compared to the analytical expression derived in the Supplementary Material, a G/C deletion bias is observed in both P5 and P6 chemistries.
Step N to N + 5 of a series of extended-k-mer operations across a hypothetical truth sequence
The number of flanking bases is set to a small value of 2 for illustration only. At step N, the truth has an A flanked by GT and CG, which is characterized by GTACG. The read has a matching A, so a Match event is recorded/simulated. At step N + 1, the truth has a C flanked by TA and GT, which is characterized by TACGT. The read has a deleted C, so a Deletion event is recorded/simulated. At step N + 3, the truth is a length-5 T homopolymer flanked by CG and AC, which is characterized by CGT5AC. Over the stretch of homopolymer, one A is inserted in the read, so an Insertion event is recorded/simulated.
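The walk across the truth sequence described in this caption can be sketched in a few lines of Python. This is only an illustration of how context keys such as GTACG and CGT5AC might be derived, not the LongISLND implementation. The function name `extended_kmer_contexts`, the parameters `flank` and `hp_min`, and the rule that runs of at least `hp_min` identical bases are collapsed into one step are all assumptions made for this example. The truth sequence is constructed to reproduce the contexts named in the caption.

```python
def extended_kmer_contexts(truth, flank=2, hp_min=3):
    """Return one context key per step along a truth sequence.

    Assumption (not from the source): a run of at least `hp_min`
    identical bases is treated as a single homopolymer step and
    written as base + run length, e.g. "T5".
    """
    contexts = []
    i = flank  # the first step needs `flank` bases to its left
    while i < len(truth):
        # Find the end of the run of identical bases starting at i.
        j = i
        while j < len(truth) and truth[j] == truth[i]:
            j += 1
        run = j - i
        if run < hp_min:
            # Short runs are handled one base at a time.
            j = i + 1
            run = 1
        if j + flank > len(truth):
            break  # not enough bases left for the right flank
        left = truth[i - flank:i]
        right = truth[j:j + flank]
        center = truth[i] if run == 1 else f"{truth[i]}{run}"
        contexts.append(left + center + right)
        i = j
    return contexts


# Hypothetical truth sequence matching the caption's example.
steps = extended_kmer_contexts("GTACGTTTTTACG")
print(steps)  # ['GTACG', 'TACGT', 'ACGTT', 'CGT5AC', 'TTACG']
```

In a simulator built along these lines, each key would index an empirical table of Match, Deletion, and Insertion outcomes learned from aligned reads. That lookup is omitted here because the source does not specify its form.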