Christopher Huptas, Siegfried Scherer, Mareike Wenning.
Abstract
BACKGROUND: Next-generation sequencing (NGS) technology has paved the way for rapid and cost-efficient de novo sequencing of bacterial genomes. In particular, the introduction of PCR-free library preparation procedures (LPPs) led to major improvements, as PCR bias is largely reduced. However, to ease the assembly of Illumina paired-end sequence data and to enhance assembly performance, larger insert sizes are needed so that current state-of-the-art assembly tools can better bridge and resolve repeats. In addition, information on the relationships between genomic GC content, library insert size and sequencing quality, as well as on the influence of library insert size, read length and sequencing depth on assembly performance, would help to plan sequencing projects in a targeted way.
Keywords: Assembler; Assembly performance; Average library insert size; Genomic GC content; Illumina; Next generation sequencing; PCR-free sample preparation; Read depth; Read size; Sequencing quality
Year: 2016 PMID: 27176120 PMCID: PMC4864918 DOI: 10.1186/s13104-016-2072-9
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Bacterial strains
| Bacterial strain | Abbreviation | Reference genome sequences | % GC |
|---|---|---|---|
| B. cereus | Bce | NC_016779.1, NC_016794.1, NC_016780.1 | 35.4 |
| E. faecalis | Efa | NC_017316.1 | 37.8 |
| S. enterica | Sen | NC_016856.1, NC_016855.1 | 52.2 |
| P. stutzeri | Pst | NC_015740.1 | 63.9 |
| M. luteus | Mlu | NC_012803.1 | 73.0 |
Sequenced bacterial strains, strain abbreviations, genomic GC-content and corresponding NCBI reference genome sequences
Sequencing library categories
| Library category | Fragmentation parameters | RB:BB ratios | Avg. insert size (raw data) (bps) | Sequenced genomes |
|---|---|---|---|---|
| TS | DF 5 %, PIP 175 W, C/B 200, Du 25 s | TruSeq® DNA PCR-free LPP (LS protocol, 550 bps) | 641 ± 28 | Bce, Efa, Pst, Mlu |
| IS1 | DF 10 %, PIP 175 W, C/B 200, Du 25 s | 2.0:1 + 3.8:2 | 686 ± 33 | Bce, Efa, Sen, Pst, Mlu |
| IS2 | DF 2 %, PIP 175 W, C/B 200, Du 30 s | 2.2:1 + 4.2:2 | 990 ± 79 | Efa, Sen, Mlu50 |
| IS3 | DF 2 %, PIP 175 W, C/B 200, Du 20 s | 2.3:1 + 4.4:2 | 1211 ± 78 | Bce, Pst, Mlu, Mlu50 |
| IS4 | DF 2 %, PIP 175 W, C/B 200, Du 10 s | 2.4:1 + 4.6:2 | 1297 | Mlu50 |
Settings for genomic DNA fragmentation and fragment size selection during library preparation. TS, original Illumina TruSeq® DNA PCR-free LPP (no modification)
IS1-IS4, categories for library average insert size, where IS1 < IS2 < IS3 < IS4
The term Mlu50 refers to the corresponding genome sequenced at 2 × 25 bps. Genomes listed without index were sequenced at 2 × 200 bps throughout
Bce, B. cereus; Efa, E. faecalis; Pst, P. stutzeri; Mlu, M. luteus; Sen, S. enterica; RB, Resuspension Buffer; BB, Bead Buffer; DF, Duty Factor; PIP, Peak Incident Power; C/B, Cycles per Burst; Du, Duration
Fig. 1 Insert size distributions after second size selection. Data originate from analysis with Bioanalyzer instruments. a Insert size distributions of sequencing libraries obtained with the standard Illumina TruSeq® DNA PCR-free LPP have an overall good reproducibility. All sequenced TS libraries are shown. b Modifications during DNA fragmentation and insert size selection enabled the creation of sequencing libraries with sharper and more symmetric insert size distributions. Sequencing libraries Efa_TS and Efa_IS1 are illustrated in red and blue, respectively. c In addition, different RB:BB ratios led to sequencing libraries varying in average insert size for the same genome. Blue: Pst_IS1, red: Pst_IS3. d Insert size distribution reproducibility is maintained when using modified LPPs
Fig. 2 Linear regression of Bioanalyzer-deduced and actual average library insert sizes. Calculation of actual library insert sizes was done after remapping of raw read data to respective reference genome sequences. Linear regression analysis revealed a very strong correlation, but the Bioanalyzer system turned out to overestimate library insert sizes
Fig. 3 Impact of average library insert size and genomic GC content on sequencing quality. Genomic GC content was plotted against the percentage of raw read pairs passing quality filtering (80;20). Then, libraries were grouped according to their category. Group 1 (red) comprises all standard libraries (TS). Group 2 (green) covers all libraries of category IS1. Group 3 (blue) represents the combined set of IS2 and IS3 libraries
Fig. 4 Interplay between insert size, GC content and sequencing quality of a library. Insert size distributions of IS1-3 libraries were plotted prior to and after read quality filtering (80;20). Distributions marked green belong to IS1 libraries, whereas distributions marked blue are from higher category libraries (IS2 or IS3). Dark-coloured distributions are derived from unfiltered libraries. Light-coloured distributions are obtained after quality filtering. To make insert size distributions directly comparable to each other, read counts were normalized by the maximal read count (per insert size) of the unfiltered library (of same category and genome). a Bce_IS1 and Bce_IS3. b Efa_IS1 and Efa_IS2. c Sen_IS1 and Sen_IS2. d Mlu_IS1 and Mlu_IS3
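The normalization described for Fig. 4 divides every per-insert-size read count, filtered or not, by the peak count of the unfiltered library. The following Python sketch illustrates that step under stated assumptions: the function name is hypothetical and the histogram counts are made-up illustration values, not data from the study.

```python
def normalize_by_unfiltered_peak(unfiltered_counts, filtered_counts):
    """Scale both histograms by the largest bin of the unfiltered library.

    Both arguments map insert size (bp) to read count. Using the same
    divisor for both keeps the filtered curve directly comparable to the
    unfiltered one.
    """
    peak = max(unfiltered_counts.values())
    unfiltered_norm = {size: n / peak for size, n in unfiltered_counts.items()}
    filtered_norm = {size: n / peak for size, n in filtered_counts.items()}
    return unfiltered_norm, filtered_norm


# Hypothetical insert-size histograms (illustration only).
unfiltered = {600: 50, 700: 200, 800: 100}
filtered = {600: 45, 700: 150, 800: 60}

unfiltered_norm, filtered_norm = normalize_by_unfiltered_peak(unfiltered, filtered)
for size in sorted(unfiltered):
    print(size, unfiltered_norm[size], filtered_norm[size])
```

With this scaling the unfiltered distribution peaks at 1, and the height of the filtered curve at each insert size shows the share of reads that survived filtering relative to that peak.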
Fig. 5 Effect of quality filtering on average library insert size. Average insert sizes for libraries Mlu50_IS2, Mlu50_IS3 and Mlu50_IS4 were plotted as a function of increasingly stringent quality filtering. The most relaxed setting corresponds to unfiltered libraries (raw). Under the most stringent setting (90;20), a read passes quality control only if at least 90 % of its nucleotides have a Phred quality score ≥20
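The (X;Y) filter notation used in these captions can be read as a per-read rule: keep a read if at least X % of its bases have a Phred quality score of at least Y. A minimal Python sketch of that rule follows. The function names and the Phred+33 quality encoding are assumptions made for illustration; this is not the filtering tool used in the study.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(ch) - 33 for ch in quality_string]


def passes_filter(phred_scores, min_fraction=0.90, min_quality=20):
    """Return True if at least min_fraction of bases have Phred >= min_quality.

    The defaults correspond to the (90;20) setting; (80;20) would be
    min_fraction=0.80.
    """
    if not phred_scores:
        return False  # an empty read cannot pass
    good = sum(1 for q in phred_scores if q >= min_quality)
    return good / len(phred_scores) >= min_fraction


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
nine_of_ten_good = decode_phred33("I" * 9 + "#")
eight_of_ten_good = decode_phred33("I" * 8 + "##")

print(passes_filter(nine_of_ten_good))                      # (90;20)
print(passes_filter(eight_of_ten_good))                     # (90;20)
print(passes_filter(eight_of_ten_good, min_fraction=0.80))  # (80;20)
```

The sketch works on single reads only. How the two mates of a read pair are combined into one pass or fail decision is not stated in the captions and is left out here.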
Relative assembly scores using different assemblers
| Quality metric | Assembler | Bce | Efa | Sen | Pst |
|---|---|---|---|---|---|
| Corrected NG50 | SPAdes | 1 | 1 | 1 | 1 |
| Corrected NG50 | ABySS | 0.69 | 0.87 | 0.82 | 0.70 |
| Corrected NG50 | Velvet | 0.71 | 0.71 | 0.77 | 0.94 |
| Corrected NG50 | Edena | 0.63 | 0.72 | 0.63 | 0.73 |
| Corrected NG50 | max** | 448,776 | 381,370 | 292,477 | 212,702 |
| NGA50 | SPAdes | 1 | 1 | 0.98 | 1 |
| NGA50 | ABySS | 0.93 | 0.97 | 1 | 0.92 |
| NGA50 | Velvet | 0.58 | 0.82 | 0.92 | 0.96 |
| NGA50 | Edena | 0.40 | 0.66 | 0.50 | 0.66 |
| NGA50 | max** | 733,293 | 416,896 | 405,154 | 238,037 |
A maximal relative assembly score of 1 refers to the assembler with the best assembly performance. All other assembly scores are expressed as relative fractions of their corresponding maximum. For genomes Bce, Efa and Pst each relative assembly score was calculated on the basis of 30 assembly sets. Assembly sets originated from same-genome sequencing libraries differing in category (average insert size), read length and sequencing depth. For Sen the number of assembly sets per relative assembly score was 10
max** Absolute value in nucleotides of the maximal relative assembly score of the column
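Because each column lists the absolute value behind its maximal score (max**), the relative scores can be converted back to approximate absolute values and vice versa. The Python sketch below shows both directions for the Bce column of Corrected NG50. The function names are hypothetical, and the recovered absolute values are only approximations limited by the two-decimal rounding of the published scores.

```python
# Values copied from the table above (Corrected NG50, genome Bce).
MAX_BCE_CORRECTED_NG50 = 448_776  # the "max**" entry, in nucleotides
RELATIVE_BCE = {"SPAdes": 1.0, "ABySS": 0.69, "Velvet": 0.71, "Edena": 0.63}


def to_relative(absolute):
    """Express each score as a fraction of the best (largest) score."""
    best = max(absolute.values())
    return {name: value / best for name, value in absolute.items()}


def to_absolute(relative, best):
    """Approximate absolute scores from relative scores and the column maximum."""
    return {name: fraction * best for name, fraction in relative.items()}


approx_absolute = to_absolute(RELATIVE_BCE, MAX_BCE_CORRECTED_NG50)
round_trip = to_relative(approx_absolute)

for name in RELATIVE_BCE:
    print(f"{name}: ~{approx_absolute[name]:,.0f} nt (relative {round_trip[name]:.2f})")
```

The same conversion applies to any other column by swapping in its max** value and relative scores.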
Relative assembly scores for different insert sizes
| Quality metric | Insert size | Bce | Efa | Sen | Pst |
|---|---|---|---|---|---|
| Corrected NG50 | TS | 0.54 | 0.84 | n.d. | 0.96 |
| Corrected NG50 | IS1 | 1 | 0.95 | 0.95 | 0.97 |
| Corrected NG50 | IS2 or IS3 | 0.74 | 1 | 1 | 1 |
| Corrected NG50 | max** | 590,346 | 409,939 | 300,036 | 217,864 |
| NGA50 | TS | 0.53 | 0.85 | n.d. | 0.91 |
| NGA50 | IS1 | 0.85 | 0.97 | 0.96 | 0.93 |
| NGA50 | IS2 or IS3 | 1 | 1 | 1 | 1 |
| NGA50 | max** | 922,992 | 442,857 | 405,417 | 250,853 |
Relative assembly scores rely on SPAdes assembly validations and are summarized per genome and library category. For Sen each relative assembly score refers to 5 assembly sets. Relative assembly scores of all other genomes comprise 10 assembly sets each
max** Absolute value in nucleotides of the maximal relative assembly score of the column
Relative assembly scores for different read lengths
| Quality metric | Read length | Bce | Efa | Sen | Pst |
|---|---|---|---|---|---|
| Corrected NG50 | 100 | 0.52 | 0.68 | 0.54 | 0.996 |
| Corrected NG50 | 125 | 0.82 | 0.90 | 0.79 | 0.995 |
| Corrected NG50 | 150 | 0.90 | 0.96 | 0.81 | 1 |
| Corrected NG50 | 175 | 1 | 1 | 0.95 | 0.995 |
| Corrected NG50 | 189 | 0.91 | 0.996 | 1 | 0.99 |
| Corrected NG50 | max** | 541,348 | 419,532 | 358,280 | 213,754 |
| NGA50 | 100 | 0.55 | 0.95 | 0.71 | 1 |
| NGA50 | 125 | 0.75 | 1 | 0.86 | 0.86 |
| NGA50 | 150 | 0.78 | 0.97 | 0.85 | 0.96 |
| NGA50 | 175 | 0.996 | 0.99 | 0.94 | 0.98 |
| NGA50 | 189 | 1 | 0.999 | 1 | 0.97 |
| NGA50 | max** | 898,406 | 424,840 | 455,293 | 249,244 |
Read-length-associated relative assembly scores are listed per genome for SPAdes assemblies. For Sen, each relative assembly score is based on 3 assembly sets; for every other genome, each score is based on 6 assembly sets
max** Absolute value in nucleotides of the maximal relative assembly score of the column
Fig. 6 Influence of read length on assembly performance. Shown is the fraction of top-performing assemblies as a function of library read length. Assemblies are defined as top-performing if their relative assembly scores are greater than or equal to 0.95. Fractions were calculated across all investigated genomes and assemblers (Table 5; Additional file 1: Tables S7, S10 and S13)
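The top-performing criterion of Fig. 6 (relative assembly score ≥ 0.95) can be applied to any set of relative scores. The Python sketch below applies it to the Corrected NG50 scores for SPAdes from the read-length table only. It therefore covers a subset of the data behind the figure, which also includes the other assemblers from the additional tables, so the resulting fractions are not expected to match the figure. The function name is hypothetical.

```python
# Corrected NG50 relative scores for SPAdes, copied from the read-length
# table above. Each list holds the values for Bce, Efa, Sen, Pst.
SCORES_BY_READ_LENGTH = {
    100: [0.52, 0.68, 0.54, 0.996],
    125: [0.82, 0.90, 0.79, 0.995],
    150: [0.90, 0.96, 0.81, 1.0],
    175: [1.0, 1.0, 0.95, 0.995],
    189: [0.91, 0.996, 1.0, 0.99],
}
TOP_THRESHOLD = 0.95  # threshold stated in the Fig. 6 caption


def top_fraction(scores, threshold=TOP_THRESHOLD):
    """Fraction of scores that are greater than or equal to the threshold."""
    top = sum(1 for score in scores if score >= threshold)
    return top / len(scores)


fractions = {
    length: top_fraction(scores)
    for length, scores in SCORES_BY_READ_LENGTH.items()
}
for length, fraction in fractions.items():
    print(f"read length {length}: {fraction:.2f} top-performing")
```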