Mitsuhiko P Sato, Yoshitoshi Ogura, Keiji Nakamura, Ruriko Nishida, Yasuhiro Gotoh, Masahiro Hayashi, Junzo Hisatsune, Motoyuki Sugai, Itoh Takehiko, Tetsuya Hayashi.
Abstract
In bacterial genome and metagenome sequencing, Illumina sequencers are most frequently used due to their high throughput capacity, and multiple library preparation kits have been developed for Illumina platforms. Here, we systematically analysed and compared the sequencing bias generated by currently available library preparation kits for Illumina sequencing. Our analyses revealed that a strong sequencing bias is introduced in low-GC regions by the Nextera XT kit. The level of bias introduced is dependent on the level of GC content; stronger bias is generated as the GC content decreases. Other analysed kits did not introduce this strong sequencing bias. The GC content-associated sequencing bias introduced by Nextera XT was more remarkable in metagenome sequencing of a mock bacterial community and seriously affected estimation of the relative abundance of low-GC species. The results of our analyses highlight the importance of selecting proper library preparation kits according to the purposes and targets of sequencing, particularly in metagenome sequencing, where a wide range of microbial species with various degrees of GC content is present. Our data also indicate that special attention should be paid to which library preparation kit was used when analysing and interpreting publicly available metagenomic data.
Keywords: Illumina sequencing; bacterial genome sequencing; library preparation kits; metagenome sequencing; sequencing bias
Year: 2019 PMID: 31364694 PMCID: PMC6796507 DOI: 10.1093/dnares/dsz017
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Library preparation kits analysed in this study
| Kits | Abbreviation | Fragmentation methods | PCR cycles | Input DNA (ng) |
|---|---|---|---|---|
| Nextera XT | XT | Tagmentation by transposome | 12 | 1 |
| Nextera DNA Flex | FL | Tagmentation by transposome | 12 | 1 |
| KAPA HyperPlus | KP | Enzymatic | 12 | 1 |
| NEBNext Ultra II | NN | Enzymatic | 12 | 1 |
| QIAseq FX | QS | Enzymatic | 12 | 1 |
| TruSeq nano | TS | Sonication | 8 | 200 |
| KAPA HyperPlus PCR-free workflow | KPF | Enzymatic | 0 | 1,000 |
| TruSeq DNA PCR-free | TSF | Sonication | 0 | 1,000 |
Figure 1. Quality comparison of E. coli and S. aureus genome assemblies obtained by library preparation kits. (A) Assembly statistics obtained by six library preparation kits were compared in E. coli and S. aureus. Two E. coli and two S. aureus genomes were analysed as model bacterial genomes to compare six library preparation kits. Illumina read sequences obtained from each library were assembled using Velvet and SPAdes, and the numbers of contigs and L50 values of each assembly are shown. In each sequence data set, assembly was repeated 10 times using Illumina reads randomly selected at 30× coverage. Error bars indicate standard deviations. The six kits used cover three fragmentation strategies (see the main text). XT, Nextera XT; FL, Nextera DNA Flex; KP, KAPA HyperPlus; NN, NEBNext Ultra II; QS, QIAseq FX; and TS, TruSeq nano. (B) Relative sequence coverage in relation to GC content was calculated in E. coli and S. aureus genomes obtained by three library preparation kits. Relative sequence coverage in the genome assemblies obtained by the XT, FL, and KP kits and GC content were calculated for every 200-bp window with no overlap. Only the first 120,000 bp regions of each genome are shown. (C) Relationships between GC content and sequence coverage in the E. coli and S. aureus genome assemblies obtained by six library preparation kits are shown. The relative abundance of 200-bp bins with a given GC content (defined by 0.5% interval) and the mean relative coverage of bins with a given GC content were calculated and are shown along with GC content by black lines or lines coloured according to the library preparation kits, respectively. Black horizontal lines (relative coverage = 1) represent unbiased coverage. The data for bins with extreme GC content (those representing <0.5% of all 200-bp bins) are not shown. Color figures are available at DNARES online.
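The windowed calculation described for panels (B) and (C) can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "relative coverage" means the mean read depth of a window divided by the genome-wide mean depth, and that per-base depth is already available as a list. The function names `window_stats` and `coverage_by_gc` are hypothetical.

```python
from collections import defaultdict

WINDOW = 200          # window size in bp, as stated in the caption
GC_STEP = 0.5         # width of each GC bin in percent, as stated in the caption
MIN_FRACTION = 0.005  # GC bins holding <0.5% of all windows are dropped


def window_stats(sequence, depth, window=WINDOW):
    """Return (gc_percent, relative_coverage) for each full, non-overlapping window.

    Assumption: relative coverage = window mean depth / genome-wide mean depth.
    """
    n_windows = len(sequence) // window
    used = n_windows * window  # trailing partial window is ignored
    if n_windows == 0:
        return []
    genome_mean = sum(depth[:used]) / used
    if genome_mean == 0:
        raise ValueError("mean depth is zero; relative coverage is undefined")
    stats = []
    for i in range(n_windows):
        start = i * window
        chunk = sequence[start:start + window].upper()
        gc = 100.0 * (chunk.count("G") + chunk.count("C")) / window
        mean_depth = sum(depth[start:start + window]) / window
        stats.append((gc, mean_depth / genome_mean))
    return stats


def coverage_by_gc(stats, step=GC_STEP, min_fraction=MIN_FRACTION):
    """Group windows by GC bin.

    Returns {bin_lower_edge: (relative_abundance, mean_relative_coverage)},
    omitting bins that hold fewer than min_fraction of all windows.
    """
    groups = defaultdict(list)
    for gc, rel_cov in stats:
        groups[int(gc // step)].append(rel_cov)  # integer index avoids float keys
    total = len(stats)
    result = {}
    for index, values in sorted(groups.items()):
        abundance = len(values) / total
        if abundance < min_fraction:
            continue  # "extreme GC content" bins are not shown
        result[index * step] = (abundance, sum(values) / len(values))
    return result
```

Under this reading, a kit with no GC-associated bias yields a mean relative coverage close to 1 in every GC bin, which corresponds to the black horizontal line in panel (C).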
Figure 2. Overall GC content-associated sequencing bias observed in 22 strains of non-S. aureus species in the genus Staphylococcus. Sequence reads were obtained from 22 strains of non-S. aureus species in the genus Staphylococcus using the XT and KP kits. The overall sequencing bias associated with GC content observed in the genome assemblies was quantified (see Materials and methods in the main text), and the relationships between the quantified overall sequencing bias and the mean GC content of each genome are shown. Solid lines indicate regression lines, and the 95% confidence intervals are indicated in grey.
Figure 3. Overall GC content-associated sequencing bias in the sequence data of 191 species obtained by the XT library preparation kit. Illumina sequencing data for 191 species (one strain from each species) produced using the XT kit from a project of NBRP of Japan were downloaded from the public database (DDBJ). The overall GC content-associated sequencing bias in each data set was quantified, and relationships between the quantified overall sequencing bias and the mean GC content of each genome are shown.
Figure 4. Metagenome sequencing of a mock bacterial community using six library preparation kits and the sequencing bias introduced by each kit. (A) Libraries of a mock bacterial community prepared by six library preparation kits were sequenced, and the relative genome abundance estimated in each data set obtained by six library preparation kits is shown. The mock community was composed of nine species with various levels of GC content. The relative abundances of each species were normalized by their genome sizes and the copy numbers of each species in the sample, which were determined by ddPCR. (B) Relationships between the GC content and sequence coverage in each genome in the mock community are shown. The mean relative coverage of each 200-bp bin with a given GC content in each genome was calculated in each data set and is shown according to GC content by coloured lines. The colours of the lines correspond to the species shown in panel (A). Black horizontal lines in each plot (relative coverage = 1) represent unbiased coverage. The relative coverage was normalized by the copy numbers in the sample determined by ddPCR. Data for bins with extreme GC content (those representing <0.5% of all 200-bp bins) are not shown. Color figures are available at DNARES online.
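One plausible reading of the normalization in panel (A) is sketched below in Python. The caption does not give the formula, so this is an assumption: sequencing depth per species is taken as mapped bases divided by genome size, and the observed share of total depth is divided by the share expected from the ddPCR copy numbers. The function name `normalized_abundance` and its argument names are hypothetical.

```python
def normalized_abundance(mapped_bases, genome_sizes, ddpcr_copies):
    """Observed/expected genome abundance ratio per species.

    Assumed formula (not confirmed by the caption):
      depth_s    = mapped_bases_s / genome_size_s
      observed_s = depth_s / sum(depth)
      expected_s = copies_s / sum(copies)
      ratio_s    = observed_s / expected_s   (1.0 means no bias)
    """
    # Normalizing by genome size turns mapped bases into per-genome depth.
    depth = {s: mapped_bases[s] / genome_sizes[s] for s in mapped_bases}
    total_depth = sum(depth.values())
    total_copies = sum(ddpcr_copies[s] for s in depth)
    return {
        s: (depth[s] / total_depth) / (ddpcr_copies[s] / total_copies)
        for s in depth
    }
```

With this definition, a species whose ratio falls well below 1 is under-represented relative to its ddPCR-determined copy number, which is the pattern the abstract reports for low-GC species with the Nextera XT kit.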