Marc W Crepeau, Charles H Langley, Kristian A Stevens.
Abstract
We investigate the utility and scalability of new read cloud technologies to improve the draft genome assemblies of the colossal, and largely repetitive, genomes of conifers. Synthetic long read technologies have existed in various forms as a means of reducing complexity and resolving repeats since the outset of genome assembly. Recently, technologies that combine subhaploid pools of high molecular weight DNA with barcoding on a massive scale have brought new efficiencies to sample preparation and data generation. When combined with inexpensive light shotgun sequencing, the resulting data can be used to scaffold large genomes. The protocol is efficient enough to consider routinely for even the largest genomes. Conifers represent the largest reference genome projects executed to date. The largest of these is that of the conifer Pinus lambertiana (sugar pine), with a genome size of 31 billion bp. In this paper, we report on the molecular and computational protocols for scaffolding the P. lambertiana genome using the library technology from 10× Genomics. At 247,000 bp, the NG50 of the existing reference sequence is the highest scaffold contiguity among the currently published conifer assemblies; this new assembly's NG50 is 1.94 million bp, an eightfold increase.
Keywords: 10× genomics; conifer genomes; genome assembly; sugar pine
Year: 2017 PMID: 28341701 PMCID: PMC5427496 DOI: 10.1534/g3.117.040055
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1 (Top) Construction of synthetic long read clouds with 10× Genomics technology. (A) HMW DNA is prepared from a single haploid sugar pine MGP. (B) Within the instrument, emulsion droplets are used to pool HMW DNA. A barcode-containing bead in each droplet is used for indexing the pools. During subsequent thermal cycling, the bead dissolves and the input DNA fragments act as templates for primer extension by the barcoded random hexamers (blue). All extension products within a droplet contain the same 14 bp barcode (magenta). After completion of the primer extension cycles, the emulsion is “broken,” pooled extension products are physically sheared to subkilobase fragments, and p7 and p5 adapters are added to the barcoded terminal fragments by ligation and enrichment PCR. Final molecules have the standard configuration and adapter sequences of Illumina dual-index, paired-end libraries, except that i5 is 14 bp instead of 8 bp. The random hexamer sequence (Nmer) is trimmed from the start of read one during data processing. (C) The oligo-containing gel bead (colored dots) within each droplet contains a single 14 bp barcode and multiple long DNA fragments (wavy lines) that serve as templates for generation of library fragments and then reads (short bicolored bars). Different droplets may contain identical gel beads, increasing the effective size of the pool. All reads with the same barcode form an effective pool. (D) Alignments to a contig will assign the contig to one or more pools. Overlap graph nodes are defined in windows at the contig ends. Because there are a large number of distinct pools (barcodes), any two nodes are unlikely to belong to the same two (or more) pools by chance. However, sharing a single barcode by chance is common. (Bottom) Scaffolding with synthetic long read clouds and fragScaff.
(E) For each node, the read group assignments for that node are compared with the read group assignments for every other node, and the fraction of read groups shared between them (their “shared fraction”) is calculated. (F) A typical distribution of shared fraction for a node in our data. For each node, a normal distribution is fitted to the observed shared fraction data. Link scores for each ordered pair of nodes are computed by taking the negative log10 probability of the observed shared fraction under the fitted normal distribution. After all pairwise link scores are calculated, a global link score threshold is determined. In our example, the pair (n, n′) shares one barcode and falls below the threshold, while the pair (n, n′′) shares more than one and falls above it. (F) Pairs of nodes with a score exceeding the global link score threshold become edges in an overlap graph, with a weight corresponding to the score. (G) Layout linearizes the weighted graph and is accomplished by the greedy algorithm described in (Adey ). From each linearized subgraph, the consensus scaffold is determined by concatenating component scaffolds.
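The link scoring described in panels (E) and (F) can be illustrated with a minimal Python sketch. This is not the fragScaff implementation; it rests on several assumptions that the caption does not specify: the shared fraction of an ordered pair is taken as the shared barcodes divided by the barcodes of the first node, the "probability of the observed shared fraction" is taken as the upper tail of the fitted normal, and the threshold is passed in by hand. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev


def shared_fraction(a: set, b: set) -> float:
    """Fraction of node a's barcodes also seen at node b (assumed definition)."""
    return len(a & b) / len(a) if a else 0.0


def link_scores(node_barcodes: dict) -> dict:
    """Score every ordered pair (n, m) as -log10 of the upper-tail probability
    of their shared fraction under a normal fitted to n's shared fractions."""
    scores = {}
    for n, groups in node_barcodes.items():
        fracs = {m: shared_fraction(groups, other)
                 for m, other in node_barcodes.items() if m != n}
        if not fracs:
            continue  # a single node has nothing to compare against
        mu = mean(fracs.values())
        # Guard against zero spread, where the normal cdf is undefined.
        sigma = pstdev(fracs.values()) or 1e-9
        dist = NormalDist(mu, sigma)
        for m, f in fracs.items():
            p = max(1.0 - dist.cdf(f), 1e-300)  # clamp to avoid log10(0)
            scores[(n, m)] = -math.log10(p)
    return scores


def build_edges(scores: dict, threshold: float) -> dict:
    """Keep pairs whose score exceeds the (hand-chosen) global threshold."""
    return {pair: s for pair, s in scores.items() if s > threshold}


# Toy data: node "n" shares three barcodes with "a" but only one with "b".
toy_nodes = {
    "n": {1, 2, 3, 4},
    "a": {1, 2, 3, 9},
    "b": {1, 10, 11, 12},
    "c": {20, 21},
    "d": {30, 31},
    "e": {40},
}
toy_scores = link_scores(toy_nodes)
toy_edges = build_edges(toy_scores, threshold=1.0)
```

In this toy data the pair that shares several barcodes scores well above the pair that shares one, mirroring the (n, n′′) versus (n, n′) example in the caption.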
Library and sequencing results
(A) Sequence Statistics by Library

| Library | Paired Reads | Read Length (bp) | Raw Sequence Coverage (×) | Aligned Reads | Filtered Aligned Reads |
|---|---|---|---|---|---|
| 1 | 232,879,210 | 88 + 98 | 0.7 | 181,462,035 | 111,854,014 |
| 2 | 232,329,746 | 88 + 98 | 0.7 | 195,340,154 | 125,001,615 |
| 3 | 311,708,094 | 91 + 101 | 0.97 | 288,643,022 | 177,242,573 |
| 4 | 297,865,798 | 91 + 101 | 0.92 | 274,488,516 | 168,110,243 |
| 5 | 310,580,224 | 91 + 101 | 0.96 | 286,623,246 | 177,826,361 |
(A) Sequencing results are presented for each of the five libraries. A total of 4.25× sequence coverage of the 31 Gb genome was obtained. After barcode demultiplexing, reads were aligned to P. lambertiana v1.0 with BWA and subsequently filtered by fragScaff (see Materials and Methods). (B) Pool statistics are presented for each library; >3.18 million barcoded pools were sequenced across all libraries. The mean read cloud length for HMW DNA was ∼60 kb for all size-selected libraries, nearly twice as long as without size selection. We estimated the total physical coverage in read clouds to be 23.8×.
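The per-library coverage figures can be checked with a short Python sketch. It assumes, as an inference from the numbers rather than something the table states, that the "Paired Reads" column counts individual reads and that each read contributes the mean of the two listed read lengths; under that assumption the values reproduce the reported 4.25× total for a 31 Gb genome.

```python
GENOME_SIZE_BP = 31_000_000_000  # 31 Gb, as stated in the abstract

# library: (read count, read 1 length, read 2 length), copied from the table
LIBRARIES = {
    1: (232_879_210, 88, 98),
    2: (232_329_746, 88, 98),
    3: (311_708_094, 91, 101),
    4: (297_865_798, 91, 101),
    5: (310_580_224, 91, 101),
}


def raw_coverage(reads: int, len_r1: int, len_r2: int,
                 genome_size: int = GENOME_SIZE_BP) -> float:
    """Sequenced bases divided by genome size.

    Assumption: `reads` counts individual reads, so the average of the two
    read lengths is used as the per-read length.
    """
    mean_len = (len_r1 + len_r2) / 2
    return reads * mean_len / genome_size


per_library = {lib: raw_coverage(*vals) for lib, vals in LIBRARIES.items()}
total_coverage = sum(per_library.values())

for lib, cov in per_library.items():
    print(f"library {lib}: {cov:.2f}x")
print(f"total: {total_coverage:.2f}x")
```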
Assembly statistics before and after rescaffolding with long read clouds using fragScaff
| | Original Assembly (with N’s) | Original Assembly (no N’s) | Rescaffolded Assembly (with N’s) | Rescaffolded Assembly (no N’s) |
|---|---|---|---|---|
| Maximum scaffold length (bp) | 4,064,336 | 3,809,096 | 23,976,851 | 22,367,058 |
| | 324,201 | 306,897 | 2,668,366 | 2,509,905 |
| | 959,930 | 904,501 | 8,710,993 | 8,182,563 |
| | 72,460 | 69,664 | 406,554 | 399,889 |
The weighted average assembly length (N50) increased approximately eightfold, regardless of whether padding (N’s) was included in the calculation.
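For readers unfamiliar with the statistic, N50 and NG50 can be computed as in the following minimal Python sketch. It uses the standard definitions (the scaffold length at which the cumulative sum of lengths, sorted longest first, reaches half of the assembly size for N50, or half of the estimated genome size for NG50); the function name and the toy lengths are illustrative, not taken from the paper.

```python
def n50(lengths, total=None):
    """Length L such that scaffolds of length >= L cover half of `total`.

    With total=None the assembly size is used (N50); passing an estimated
    genome size instead gives NG50. Returns 0 if the scaffolds never reach
    half of `total`.
    """
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


# Toy scaffold lengths (hypothetical, for illustration only).
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)             # half of 30 is reached at the 8 bp scaffold
toy_ng50 = n50(toy_lengths, total=40)  # half of 40 is reached at the 6 bp scaffold
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total length to be compared on the same footing, which is why the abstract reports NG50 for the comparison across conifer assemblies.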