Ruibang Luo1,2, Fritz J Sedlazeck1, Charlotte A Darby1, Stephen M Kelly3, Michael C Schatz1,2,4.
Abstract
Linked-read sequencing, using highly multiplexed genome partitioning and barcoding, can span hundreds of kilobases to improve de novo assembly, haplotype phasing, and other applications. Based on our analysis of 14 datasets, we introduce LRSim, which simulates linked-reads by emulating the library preparation and sequencing process with fine control over variants, linked-read characteristics, and the short-read profile. From the phasing and assembly of multiple datasets, we derive recommendations on coverage, fragment length, and partitioning for sequencing genomes of different sizes and complexities. These optimizations improve results by orders of magnitude and enable the development of novel methods. LRSim is available at https://github.com/aquaskyline/LRSIM.
Keywords: 10X Genomics; Genome assembly; Linked-read; Molecular barcoding; Phasing; Reads partitioning; Reads simulation
Year: 2017 PMID: 29213995 PMCID: PMC5711661 DOI: 10.1016/j.csbj.2017.10.002
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Distribution of the number of supporting reads per partition. About 1.5 million partitions are supported by more than 100 reads. The rightmost column shows the number of partitions with more than 100 supporting reads.
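A histogram with an overflow column like the one described in this caption could be built as in the following minimal sketch. The function name, the `barcodes` input (one partition barcode per read), and the `cap` parameter are hypothetical and are not taken from LRSim.

```python
from collections import Counter


def reads_per_partition_histogram(barcodes, cap=100):
    """Count partitions by their number of supporting reads.

    barcodes: iterable with one partition barcode per read (assumed input).
    Returns a dict mapping a read count (1..cap) to the number of partitions
    with that many reads, plus the key ">cap" for the overflow column.
    """
    reads_per_partition = Counter(barcodes)
    histogram = Counter()
    for n_reads in reads_per_partition.values():
        # Partitions above the cap are pooled into one rightmost column.
        key = n_reads if n_reads <= cap else f">{cap}"
        histogram[key] += 1
    return dict(histogram)
```

Calling it with `cap=100` pools every partition with more than 100 reads into a single `">100"` entry, matching the rightmost column of the figure.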
Fig. 2. Distribution of the number of molecules per partition for NA12878.
Fig. 3. Weighted molecule length distribution for NA12878. Physical coverage equals l × n, where l is the length of a molecule and n is the number of molecules in that size range.
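As a rough illustration of a length-weighted distribution, the sketch below sums the bases contributed by the molecules in each size range. It assumes that the physical coverage of a size range is the total length of its molecules, which reduces to the product of l and n when all n molecules in the range share length l. The function name and the `bin_size` parameter are hypothetical.

```python
def weighted_length_distribution(molecule_lengths, bin_size=10_000):
    """Total bases contributed by molecules in each length bin.

    molecule_lengths: iterable of molecule lengths in bp (assumed input).
    Returns a dict mapping the bin start (bp) to the summed molecule length,
    ordered by bin start.
    """
    bins = {}
    for length in molecule_lengths:
        start = (length // bin_size) * bin_size  # lower edge of the bin
        bins[start] = bins.get(start, 0) + length
    return dict(sorted(bins.items()))
```

Weighting by length rather than by count shows how much of the genome is physically covered by molecules of each size, so a few long molecules can outweigh many short ones.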
Fig. 4. Distribution of molecule coverage for NA12878.
Fig. 5. Average sequencing coverage of 13 samples per chromosome. The coverages were normalized to the sample with the lowest average coverage (NA24149, 30.36x).
Fig. 6. NG graph showing an overview of phased block sizes of 7 datasets. NG(X) is defined such that X% of the genome is in phased blocks equal to or larger than the NG(X) length.
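The NG(X) definition in this caption can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the function name `ng` and the convention of returning 0 when the blocks never reach X% of the genome are assumptions.

```python
def ng(lengths, x, genome_size):
    """Return NG(x) for a set of block lengths.

    NG(x) is the largest length L such that blocks of length >= L together
    cover at least x% of genome_size. Returns 0 if the blocks never reach
    that fraction of the genome (assumed convention).
    """
    target = genome_size * x / 100
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0
```

Evaluating `ng` for X from 1 to 100 and plotting the results against X gives an NG graph. Because the denominator is the genome size rather than the total block length, curves from different datasets of the same genome are directly comparable.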
Assembly statistics for different numbers of partitions.
| No. of partitions (× 1,000) | Contig N50 | Phase Block N50 | Scaffold N50 |
|---|---|---|---|
| 15 | 198,485 | 1,146,590 | 1,016,017 |
| 20 | 265,543 | 2,881,040 | 2,796,090 |
| 30 | 230,711 | 1,945,051 | 1,880,870 |
| 50 | 215,743 | 1,472,710 | 1,459,816 |
| 100 | 177,635 | 1,471,806 | 1,271,685 |
| 1,500 | 14,597 | 1,588 | 14,685 |
The Contig N50, Phase Block N50 and Scaffold N50 of the A. thaliana genome with 6 different partition numbers.
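For readers unfamiliar with the N50 statistic reported in these tables, the following sketch uses the standard definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. Unlike NG(X) in the figure captions, N50 is measured against the summed length of the assembly itself rather than the genome size. The function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig, scaffold, or block lengths.

    N50 is the length L such that sequences of length >= L account for at
    least half of the summed length. Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids float rounding
            return length
    return 0
```

For example, lengths of 80, 70, 50, 40, 30, and 20 sum to 290; the two longest sequences together reach 150, which passes the halfway mark of 145, so the N50 is 70.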
Assembly statistics for different sequencing coverages.
| No. of read pairs (M) | No. of partitions (× 1,000) | Contig N50 | Phase Block N50 | Scaffold N50 |
|---|---|---|---|---|
| 9 (17-fold) | 20 | 233,233 | 1,027,768 | 899,826 |
| 18 (34-fold) | 20 | 265,543 | 2,881,040 | 2,796,090 |
| 27 (51-fold) | 20 | 221,680 | 1,971,701 | 1,896,517 |
| 27 (51-fold) | 30 | 241,319 | 1,979,723 | 1,688,453 |
Contig N50, Phase Block N50 and Scaffold N50 of the A. thaliana genome with 4 different combinations of number of read pairs and number of partitions.
Assembly statistics for different numbers of molecules per partition.
| No. of molecules per partition | Contig N50 | Phase Block N50 | Scaffold N50 |
|---|---|---|---|
| 1 | 46,993 | 72,400 | 54,769 |
| 4 | 249,371 | 2,105,517 | 2,020,687 |
| 7 | 274,133 | 1,869,856 | 2,074,127 |
| 10 | 265,543 | 2,881,040 | 2,796,090 |
| 15 | 232,708 | 2,860,223 | 2,416,675 |
| 20 | 245,894 | 1,920,175 | 1,878,113 |
Contig N50, Phase Block N50 and Scaffold N50 of the A. thaliana genome with 6 different molecule numbers per partition.
Fig. 7. NG graph showing an overview of scaffold sizes of three datasets: simulated linked-reads for B73, real linked-reads for NC350, and a B73 assembly using PacBio long reads by Jiao et al. NG(X) is defined such that X% of the genome is in scaffolds equal to or larger than the NG(X) length.
Fig. 8. LRSim workflow. Lariat, SuperNova and HapCUT2 are three tools downstream of LRSim. Lariat is the aligner module of LongRanger, designed for linked-read alignment. SuperNova is a genome assembler designed for linked-reads. HapCUT2 is a phasing algorithm that works with linked-reads. LRSim provides an option to skip variant simulation with SURVIVOR and take a user-provided variant file instead.