
A comprehensive investigation of metagenome assembly by linked-read sequencing.

Lu Zhang¹, Xiaodong Fang², Herui Liao³,⁴, Zhenmiao Zhang⁵, Xin Zhou⁶, Lijuan Han³, Yang Chen⁷, Qinwei Qiu⁷, Shuai Cheng Li⁸.

Abstract

BACKGROUND: The human microbiota are complex systems with important roles in our physiological activities and diseases. Sequencing the microbial genomes in the microbiota can help us interpret their activities. The vast majority of the microbes in the microbiota cannot be isolated for individual sequencing. Current metagenomics practice uses short-read sequencing to sequence a mixture of microbial genomes simultaneously. However, short reads leave ambiguities during genome assembly, leading to unsatisfactory microbial genome completeness and contig continuity. Linked-read sequencing can remove some of these ambiguities by attaching the same barcode to the reads from a long DNA fragment (10-100 kb), thus improving metagenome assembly. However, it is not clear how the choices of several linked-read sequencing parameters affect assembly quality.
RESULTS: We first examined the effects of read depth (C) on metagenome assembly from linked-reads in simulated data and a mock community. The results showed that C positively correlated with the length of assembled sequences but had little effect on their qualities. The latter observation was corroborated by tests using real data from the human gut microbiome, where C demonstrated minor impact on the sequence quality as well as on the proportion of bins annotated as draft genomes. On the other hand, metagenome assembly quality was susceptible to read depth per fragment (CR) and DNA fragment physical depth (CF). For the same C, deeper CR resulted in more draft genomes while deeper CF improved the quality of the draft genomes. We also found that average fragment length (μFL) had marginal effect on assemblies, while fragments per partition (NF/P) impacted the off-target reads involved in local assembly, namely, lower NF/P values would lead to better assemblies by reducing the ambiguities of the off-target reads. In general, the use of linked-reads improved the assembly for contig N50 when compared to Illumina short-reads, but not when compared to PacBio CCS (circular consensus sequencing) long-reads.
CONCLUSIONS: We investigated the influence of linked-read sequencing parameters on metagenome assembly comprehensively. While the quality of genome assembly from linked-reads cannot rival that from PacBio CCS long-reads, the case for using linked-read sequencing remains persuasive due to its low cost and high base-quality. Our study revealed that the probable best practice in using linked-reads for metagenome assembly was to merge the linked-reads from multiple libraries, where each had sufficient CR but a smaller amount of input DNA.

Keywords: Linked-reads; Metagenome assembly; PacBio CCS long-reads; Parameter space; Short-reads

Year: 2020 PMID: 33176883 PMCID: PMC7659138 DOI: 10.1186/s40168-020-00929-3

Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650

Background

The human microbiota are complex systems that contribute to a large part of human physiological activities and diseases. Knowing the genomic sequences of the microbiota allows us to study their functions. However, microbial genome sequences are difficult to obtain. While a few microbes can survive isolation and be cultured in vitro for sequencing, the rest remain “microbial dark matter”. Alternatively, there have been attempts to reconstruct the microbial genomes computationally from a mixture of short-reads sequenced from them. However, such metagenome assembly is hampered by repetitive sequences of both intra- and inter-species origin, horizontal gene transfers, and mobilization events [1], and is further complicated by the uneven abundance of microbes in the sample. Current algorithms such as IDBA-UD [2], MEGAHIT [3], and MetaSPAdes [4] make use of read depth and fragment insert size constraints to unravel the repetitive sequences and estimate microbial abundance. However, their reliability is affected by the low continuity of short-read assembly. Long-read sequencing has been used to mitigate these problems, e.g., by Nicholls et al. [5] and Sevim et al. [6]. In particular, Moss et al. [7] optimized the long-read library preparation protocol of nanopore sequencing and produced more complete bacterial genomes. However, long-read sequencing remains costly in practical applications (the “Discussion” section). Alternative sequencing platforms that provide long-range sequence information for metagenomics are available in the form of Illumina TruSeq Synthetic Long Reads (SLR) and linked-reads. SLR distributes long DNA fragments into 384-well plates, where they are amplified, pooled, and sequenced at sufficient depth (~ 50X per fragment), thus allowing long fragments to be assembled individually [8, 9].
Linked-reads are short-reads where reads from the same fragment are marked with the same barcode. The 10x linked-read microfluidic system assigns long DNA fragments into around 1 million partitions, where each fragment is sequenced with a shallow depth (0.1X–0.4X). A method for linked-read metagenome assembly, Athena-meta [10], bridges the gaps between contigs by local assembly on co-barcoded reads and outperformed the methods for short-reads and SLR in assembling human gut and environmental microbiomes. There are four key parameters in linked-read sequencing that may impact metagenome assembly [11] (Fig. 1): (i) CR, average depth of short-reads per fragment; (ii) CF, average physical depth of the genome by long DNA fragments; (iii) NF/P, number of fragments per partition; (iv) fragment length distribution, which is specified using two parameters, namely, μFL, the average unweighted DNA fragment length, and WμFL, the length-weighted average of DNA fragment length. Several of these parameters are interdependent. For example, a greater amount of input DNA increases both CF and NF/P and decreases CR; and the absolute values of CR and CF are set by how much total read coverage (C) is generated because CR × CF = C. In a previous study, we investigated the effects of these parameters on human diploid assembly [11].
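
As a minimal sketch of this interdependence, assuming only that total read depth C is the product of read depth per fragment (CR) and fragment physical depth (CF), the snippet below derives the CF implied by a chosen C and CR. The function name is illustrative and not part of any published tool; the example values (C = 10X with CR = 0.36X or 0.064X) are the ones quoted for the simulated data in the Results.

```python
def fragment_physical_depth(total_depth, depth_per_fragment):
    """Return CF implied by total read depth C and per-fragment depth CR,
    using the relation C = CR * CF."""
    if depth_per_fragment <= 0:
        raise ValueError("depth_per_fragment must be positive")
    return total_depth / depth_per_fragment


# Two libraries with the same total depth but different per-fragment depth:
# a shallower CR must be compensated by a deeper CF.
for c_r in (0.36, 0.064):
    c_f = fragment_physical_depth(10.0, c_r)
    print(f"C = 10X, CR = {c_r}X -> CF = {c_f:.2f}X")
```

At a fixed C, lowering CR from 0.36X to 0.064X raises the required CF from roughly 28X to roughly 156X, which is the tradeoff examined later in the Results.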

Fig. 1

Parameters of linked-read sequencing to be investigated

The present study evaluates these parameters with respect to their impact on metagenome assembly. We used three sets of linked-reads: one from simulation, one from a mock community, and one from a real human gut microbiome sample. The simulated data consist of twenty datasets (Table S1) generated by an improved LRTK-SIM [11], extended for this study to handle microbial samples with uneven abundance (the “Methods” section). The mock community (ATCC MSA-1003) is a pool of 20 strains with staggered abundance, while the human gut microbiome is from a stool sample of a healthy Chinese individual. Because there is no ground truth for evaluating the human gut microbiome, we annotated contig bins as draft genomes and assigned them to the corresponding taxonomic classification (the “Methods” section). Our results show that deeper C resulted in more assembled sequences and enabled better genomic coverage, but it was irrelevant to the assembly quality. C was not a dominating factor for contig continuity, which could be influenced more by genome characteristics. We further found that CR affected the number of draft genomes and that CF was associated with assembly quality. μFL had a marginal effect on assemblies, and lower NF/P values led to better assemblies by reducing the ambiguities of off-target reads. Compared to Illumina short-reads, 10x linked-reads significantly improved the metagenome assembly in both contig continuity and genome completeness.

Results

Three sets of linked-reads are used. The first is simulated from the MBARC-26 [6] community (Table S1 and S2), and the twenty simulated datasets are annotated by the parameter varied (CF, CR, μFL, or NF/P), with a superscript giving the actual value of that parameter (Table S1). The second and third sets are sequenced from a mock community of 20 strains (one lane of reads from Illumina XTen, 108.7 GB, Table S3) and a human gut microbiome (two lanes of reads from Illumina XTen, 208.97 GB; the “Methods” section and Supplementary Note), followed by read subsampling to match the expected parameter values. The microbial complexity in the human gut microbiome was evaluated by aligning linked-reads to the reference sequences from the Human Microbiome Project [12] (Supplementary Note). To obtain datasets of different CR and CF, we subsampled short-reads (MSC_R) and long DNA fragments (MSC_F) of the mock community (the “Methods” section), where the numeric subscript represents the reciprocal of sequenced lanes; for example, MSC_R4/MSC_F4 means a quarter lane of reads was subsampled. Since the composition of the human gut microbiome is unknown, SC_R and SC_F (where the numeric subscript represents the reciprocal of sequenced lanes) were generated by subsampling short-reads and barcodes instead. To avoid confusion, we used MSC and SC to denote the total one lane and two lanes of linked-reads from the mock community and human gut microbiome, respectively. According to microbial relative abundance, the microbes were classified into low- (L), medium- (M), and high-abundance (H) in the simulated data (Table S2), and into low- (L), medium- (M), high- (H), and ultrahigh-abundance (UH) in the mock community (Table S3). The contigs from the simulation and mock community were evaluated using two reference-based metrics (total aligned length and genomic coverage) and two measures for contig continuity (contig NG50 and NGA50).
For human gut microbiome data, we annotated the contig bins as draft genomes and classified them into high-, medium-, and low-quality [13] (the “Methods” section). The number and quality of annotated draft genomes and contig N50 were used to evaluate the assemblies.
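
For readers unfamiliar with the continuity metrics, the following is a minimal sketch of the standard N50 and NG50 definitions; it is not the Quast implementation used in this study. N50 is the contig length at which the length-sorted contigs first cover half of the total assembly, whereas NG50 uses half of the reference genome size instead. NGA50 applies the same rule to aligned blocks (contigs broken at misassemblies) rather than to whole contigs. Returning 0 when the target is never reached is a convention of this sketch.

```python
def nx50(lengths, target_total):
    """Length of the contig at which the cumulative sum of the
    length-sorted contigs first reaches half of target_total."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= target_total:
            return length
    return 0  # the contigs never cover half of the target


def n50(contig_lengths):
    """N50: half of the total assembly length is the target."""
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    """NG50: half of the reference genome size is the target."""
    return nx50(contig_lengths, genome_size)


contigs = [80, 70, 50, 40, 30, 20]   # toy contig lengths
print(n50(contigs))                  # target is half of 290
print(ng50(contigs, 400))            # target is half of 400
print(ng50(contigs, 1000))           # assembly too small for the genome
```

The last call shows why an incomplete assembly of a low-abundance microbe can have a zero NG50 even when individual contigs are long.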

The influence of total read depth C

C has little effect on both total aligned length and genomic coverage for L and H microbes in the simulated data. For M microbes, their abundance correlates positively with total aligned length and genomic coverage, indicating that a low abundance could reduce assembly completeness even when C is high (Fig. 2a, b, e, and f).

Fig. 2

Trends of total assembly length, genomic coverage, contig NG50, and NGA50 by varying the two depth parameters, CR and CF (one in a–d, the other in e–h), μFL (i–l), and NF/P (m–p) in simulated data. Microbes in cyan, blue, and red represent L, M, and H species, respectively

Similarly, we fail to observe any clear trend between NG50 (or NGA50) and C. Two microbes with the deepest C, NC_014212 and NC_017095 (with the highest abundance), were assembled into fragmented contigs (Fig. 2c and g), suggesting that C was not a dominating factor for contig continuity. The same is seen in the two configurations that have deeper C (C = 120X) than the others. They achieved the largest total aligned length and genomic coverage, but their contig NG50 and NGA50 fluctuated and were not always the best. For example, NC_019904 and NC_002737, which have the lowest abundance among M microbes, yielded their largest total aligned lengths in these two configurations (NC_019904, 1.29 Mb; NC_002737, 1.42 Mb; Fig. 2a; and NC_019904, 1.77 Mb; NC_002737, 1.49 Mb; Fig. 2e). The configuration in Fig. 2e–h assembled fragmented contigs for both microbes on NG50 (NC_019904 < 500 bp, NC_002737 = 64.14 kb; Fig. 2g) and NGA50 (NC_019904 < 500 bp, NC_002737 = 62.47 kb; Fig. 2h). Although the configuration in Fig. 2a–d produced a better NG50 on NC_002737 (NG50, 2.04 Mb, Fig. 2c), misassemblies were dispersed in its contigs (NGA50, 109.43 kb; Fig. 2d). The results for the mock community are consistent with those from the simulated data. The total aligned length and genomic coverage were fairly stable for L, H, and UH microbes regardless of the value of C (Fig. 3a, b, e, and f). For M microbes, the contigs from MSC_R/MSC_F subsamples shallower than a quarter lane covered the reference genomes poorly due to insufficient read depth. A quarter lane of reads (MSC_R4/MSC_F4) appeared to suffice, achieving nearly full genomic coverage for all M microbes except ATCC_33323, which required a further quarter lane of reads. No consistent trend could be observed for NG50 and NGA50; a quarter lane of reads was necessary to generate contigs with non-zero NGA50 for M microbes.

Fig. 3

Trends of total assembly length, genomic coverage, contig NG50, and NGA50 by subsampling the two depth parameters, CR and CF (one in a–d, the other in e–h), for the mock community. Microbes in cyan, blue, red, and black represent L, M, H, and UH species, respectively

In the results with human gut microbiome, deeper C extends the assembly length but has no impact on the assembly quality. After binning contigs and classifying the bins into draft genomes (the “Methods” section), SC produced the largest number of bins (148) and the longest assembly length (399.41 Mb). These statistics decreased progressively as reads were subsampled (Table 1). The proportions of bins annotated as draft genomes were reduced by increasing C (SC_R8 77.78%, SC_R4 69.39%, SC_R2 63.75%, SC_R1 65.38%, SC 54.05%; SC_F8 68.75%, SC_F4 60.94%, SC_F2 59.55%, SC_F1 58.16%, SC 54.05%). C negatively correlates with bin average contamination (SC 14.40%, SC_R2 10.46%, SC_R4 9.08%, SC_R8 8.94%; Table S4 and Figure S1). We annotated the draft genomes as genus or species (> 60% confidence) based on their k-mer similarities with known microbial genomes (the “Methods” section). Most of the taxonomic classifications were observed in at least two parameter configurations, although some were unique to only one (Figure S2). Considering the qualities of annotated draft genomes, C demonstrated a positive correlation with the number of medium- and low-quality bins (Table 1); SC_R4 has the most high-quality bins and the largest average bin completeness (73.3%) compared to the other configurations (Table 1 and Table S4). The N50s of high-quality bins are significantly greater than those of medium- (p value = 0.01) and low-quality (p value = 5.3E−9) bins, suggesting that bin quality (determined by completeness and contamination) is highly correlated with contig continuity (Fig. 4a–c). Interestingly, high-quality bins required read coverage of at least 50X (per-configuration values: 85.81X, 132.71X, 111.11X, 96.43X, 64.67X, 75.66X, 151.55X, 63.42X, and 53.08X), suggesting that the low-abundance microbes were not assembled into high-quality genomes. Nevertheless, the contigs with extremely high depth may come from repetitive sequences and reduce the qualities of bins they belong to (Fig. 4d–f; C (high) = 81.4X; C (medium) = 140.1X; C (low) = 1636.5X).

Table 1

Summary of the assemblies for subsampled linked-reads from human gut microbiome and Illumina short-reads

Configurations	No. of bins	Total length (Mb)	High (%)	Medium (%)	Low (%)	Others (%)
SC_all	148	399.41	9 (6.08)	23 (15.54)	48 (32.43)	68 (45.95)
SC_R1	104	290.49	10 (9.62)	30 (28.85)	28 (26.92)	36 (34.62)
SC_R2	80	225.73	11 (13.75)	15 (18.75)	25 (31.25)	29 (36.25)
SC_R4	49	159.40	15 (30.61)	9 (18.37)	10 (20.41)	15 (30.61)
SC_R8	36	115.96	6 (16.67)	16 (44.44)	6 (16.67)	8 (22.22)
SC_F1	98	305.24	14 (14.29)	20 (20.41)	23 (23.47)	41 (41.84)
SC_F2	89	244.55	7 (7.87)	16 (17.98)	30 (33.71)	36 (40.45)
SC_F4	64	188.90	7 (10.94)	13 (20.31)	19 (29.69)	25 (39.06)
SC_F8	48	152.65	9 (18.75)	10 (20.83)	14 (29.17)	15 (31.25)
ILLU	53	145.50	0 (0)	16 (30.19)	16 (30.19)	21 (39.62)

ILLU: assembly from Illumina short-reads
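
The proportion of bins annotated as draft genomes quoted in the text can be recomputed from the bin counts in Table 1. The sketch below assumes that this proportion is the share of bins classified as high-, medium-, or low-quality (that is, everything except "Others"); the counts are copied from the table, and the variable names are illustrative.

```python
# (configuration, bins, high, medium, low, others), copied from Table 1
TABLE_1 = [
    ("SC_all", 148, 9, 23, 48, 68),
    ("SC_R1", 104, 10, 30, 28, 36),
    ("SC_R2", 80, 11, 15, 25, 29),
    ("SC_R4", 49, 15, 9, 10, 15),
    ("SC_R8", 36, 6, 16, 6, 8),
    ("SC_F1", 98, 14, 20, 23, 41),
    ("SC_F2", 89, 7, 16, 30, 36),
    ("SC_F4", 64, 7, 13, 19, 25),
    ("SC_F8", 48, 9, 10, 14, 15),
    ("ILLU", 53, 0, 16, 16, 21),
]

draft_percent = {}
for name, bins, high, medium, low, others in TABLE_1:
    # Sanity check: the four classes must add up to the number of bins.
    assert high + medium + low + others == bins, name
    draft_percent[name] = 100.0 * (high + medium + low) / bins

for name, pct in draft_percent.items():
    print(f"{name}: {pct:.2f}% of bins annotated as draft genomes")
```

The recomputed values match the percentages listed in the text, for example 77.78% for SC_R8 and 54.05% for the full dataset.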

Fig. 4

Contig N50 and read depth for high-, medium-, and low-quality bins

The tradeoffs between CR and CF

There are tradeoffs between CR and CF in maintaining the same C. Because the product of PCR amplification per partition can generate around 500 Mb of short-reads, loading DNA with greater density (deeper CF) results in more fragments per partition and fewer reads sequenced for each fragment (shallower CR). For M and H species in the simulated data, we found that increasing CR is more effective than increasing CF when C is around 10X, and they are comparably effective when C is beyond 30X (Fig. 2a, b, e, and f). As a rule, deep CR is more pressing for reconstructing DNA fragments if C is low. For example, of the two configurations with C = 10X, the one with CR = 0.36X was significantly better than the one with CR = 0.064X in total aligned length (2.17 Mb vs. < 500 bp) and genomic coverage (62.93% vs. < 1%) for NC_012982. The former also generated more continuous contigs than the latter for the five H species NC_018068, NC_014364, NC_017033, NC_021184, and NC_019792 (Fig. 2c, d, g, and h). In the mock community, MSC_R and MSC_F produced comparable assemblies when C was kept constant. In human gut microbiome, SC_F generated more assembled sequences than SC_R (SC_F1:SC_R1 = 305.24 Mb:290.49 Mb; SC_F2:SC_R2 = 244.55 Mb:225.73 Mb; SC_F4:SC_R4 = 188.90 Mb:159.40 Mb; SC_F8:SC_R8 = 152.65 Mb:115.96 Mb, Table 1), but had higher average bin contamination (SC_F1:SC_R1 = 14.04%:12.10%; SC_F2:SC_R2 = 14.80%:10.46%; SC_F4:SC_R4 = 12.66%:9.08%; SC_F8:SC_R8 = 12.17%:8.95%) and worse contig N50 (SC_F1:SC_R1 = 137.66 kb:168.67 kb; SC_F2:SC_R2 = 127.58 kb:151.49 kb; SC_F4:SC_R4 = 136.40 kb:181.46 kb; SC_F8:SC_R8 = 115.29 kb:118.0 kb). These observations suggest that deeper CR would result in more assembled sequences, while deeper CF would help in improving assembly quality.

DNA fragment length and metagenome assembly

Long DNA fragment information is critical for linked-read assembly, as it can help span the contig breaks that are due to genome variations and repetitive sequences. On the other hand, exceedingly long fragments may reduce barcode specificity in disentangling short tandem repeats. In practice, it is difficult to extract very long DNA fragments from a metagenomic sample; even with the gentlest DNA extractions, the mean fragment length (μFL) is usually at most 10 to 20 kb. Our simulated data with μFL from 5 to 100 kb showed that the assembly was not sensitive to μFL. In some special cases, extremely long DNA fragments could improve the assemblies of M microbes with high repeat rates. For example, the configuration with μFL = 100 kb improved the contig NG50 (3.17 Mb, Fig. 2k) and NGA50 (2.71 Mb, Fig. 2l) of NC_018014, which had the highest repeat rate (18.3%).

Barcode specificity is important in microbial deconvolution

For human genome sequencing, each partition contains ten fragments (NF/P = 10) on average [14]. NF/P is expected to be larger (NF/P = 40) for metagenomic sequencing due to the limited fragment size (WμFL = 11.15 kb) and relatively small microbial genome size (Table S5). Large NF/P also makes it harder to recognize which fragment a short-read belongs to. The assembly with the smallest NF/P in simulation (NF/P = 10) had much better NG50 and NGA50 for most of the H and M microbes (14 out of 18; the remaining 4 microbes were comparable, Fig. 2o and p). Small NF/P also failed to assemble L microbes (Fig. 2m and n).

Assembly on Illumina short-reads and PacBio CCS long-reads

Illumina short-read sequencing is a mainstream technology for metagenomic sequencing, but its quality for metagenome assembly is unsatisfactory due to the lack of long-range connectivity. We downloaded the short-read data of the mock community from the Sequence Read Archive [15] (the “Methods” section) and performed an assembly. The assembly on linked-reads (total aligned length 52.04 Mb; genomic coverage 77.20%) is much better than that on short-reads (total aligned length 38.13 Mb; genomic coverage 56.69%; for NG50 and NGA50, see Figure S3). For human gut microbiome, the assembly from 8.8 Gb of short-reads showed a comparable number of bins (53) and total assembly length (145.50 Mb vs. 159.40 Mb for SC_R4, Table 1). However, the short-read assembly generated no high-quality bins because of its known difficulty in recovering rRNAs and tRNAs [16, 17] (Table 1). The contig N50 of SC_F8, the worst among the linked-read assemblies, was still 4.49 times that of Illumina short-reads (115.29 kb vs. 25.69 kb). The average bin contamination rate of 17.39% for the assembly from short-reads was also much worse than that of linked-reads (Table S6). In the mock community, we further compared linked-reads to PacBio CCS reads, which are both extremely long (N50 = 9.08 kb) and highly accurate (> 99% base accuracy). The total aligned length and genomic coverage were comparable between CCS reads (54.04 Mb and 78.68%) and MSC (52.02 Mb and 77.18%), but CCS reads improved the contig continuity substantially (Figure S4).

Comparison to human genome parameter statistics

10x linked-read sequencing was originally developed for human genome assembly, so we compared the parameter distributions between human genome and human gut microbiome. Because no reference genome was available for human gut microbiome, we collected the sequences of all the non-redundant high-quality bins from the SC_R and SC_F datasets as “pseudo” reference genomes and reconstructed 15,994,284 long fragments (> 2 kb). For human gut microbiome, CR was comparable to that of the human genome (NA24385; CR 0.30X vs. 0.32X, Table S5), whereas CF was 6.26 times larger (CF 595.85X vs. 95.20X, Table S5); the DNA fragments were also much shorter (μFL 7.91 kb vs. 28.06 kb; WμFL 11.15 kb vs. 44.53 kb, Table S5, Figure S5 and S6).

Discussion

Human microbiota provide rich information for understanding microbial activities that impact human health and disease. Projects such as HMP (Human Microbiome Project) [12] and MetaHIT (Metagenomics of the Human Intestinal Tract) [18] were set up to collect microbiomes from diverse sites of the human body and to understand their compositions and functions. De novo metagenome assembly on short-reads is commonly used to assemble microbial genomes from a mixture of microbes without culturing them. Although it has been widely applied to assemble thousands of bacterial genomes [19, 20], four difficulties remain: (1) assembly of low-abundance microbes; (2) assembly of repetitive sequences such as 16S and 23S rRNA; (3) assembly of regions with genetic variation; (4) strain-level assembly based on haplotype phasing. Besides metaSPAdes, which was used in the current study, IDBA-UD [2] and MEGAHIT [3] were also tested and achieved results comparable to metaSPAdes. All three produced much worse assemblies than linked-reads (Table S6). Long-read sequencing has the potential to assemble more complete genomes and is expected to dominate the field in the future. However, linked-reads are still worth considering as a transitional technology. First, both PacBio and Oxford Nanopore are several times more costly than 10x linked-reads (especially for library preparation). Second, the high base error rate of long-reads weakens haplotype phasing and strain-level assembly. Third, clinical samples benefit from the small amount of input DNA required by linked-read sequencing. A previous study also observed that some high-quality bins generated by linked-reads were missed in the long-read assembly [7]. In this study, we comprehensively investigated the effects of the four linked-read sequencing parameters on metagenome assembly; these parameters can be fine-tuned in either library preparation or short-read sequencing.
Read depth C and microbial abundance are the two most important factors determining genome coverage and the number of bins annotated as draft genomes. Low-abundance microbes were almost impossible to assemble with any of the technologies; the assemblies of medium-abundance microbes were substantially improved by deep C, and those of high-abundance microbes were fairly stable. According to our observation, C should be chosen from 120X to 400X to optimize the assembly quality. There is a tradeoff between CR and CF, where deep CF can generate more high-quality bins and CR controls total assembly length. Large μFL enables DNA fragments to span distant contigs, but it is unnecessary to produce extremely long fragments for microbial genomes. The repetitive sequences spread across microbial genomes are usually short (e.g., 16S: ~ 1.5 kb, 23S: ~ 2.9 kb) and could be resolved by assembling the co-barcoded reads with small NF/P. Athena-meta includes four steps: (1) generate “seed” contigs using short-reads without barcodes; (2) link contigs into a scaffold graph using aligned paired-end reads; (3) perform local assembly by recruiting co-barcoded reads that span both “seed” contigs; (4) pool and assemble the locally assembled sequences and “seed” contigs. We can link our observations to the corresponding strategies in Athena-meta. C is critical to construct “seed” contigs, as high-quality seed contigs are the prerequisite for local assembly using co-barcoded reads. CR and CF impact the reconstruction of long DNA fragments and the probability of two distant contigs being spanned by the same fragment, respectively. Small NF/P can reduce off-target reads and make local assembly more efficient. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient CR but a smaller amount of input DNA.

Methods

Simulate linked-reads for microbes with uneven abundance

LRTK-SIM [11] was initially built for human diploid assembly by simulating 10x linked-reads. In this study, we extended it to allow genomes with uneven depth to reflect different microbial abundance (Figure S7). We downloaded the reference genomes (denoted as M_i) of 23 bacterial and 3 archaeal strains from MBARC-26 [21] and categorized them into L (Molarity < 10^−15), M (10^−15 < Molarity < 10^−14), and H (Molarity > 10^−14) (Table S1). The molarity was normalized to sum to 1 as microbial relative abundance (A_i), and CF for microbe i (CF_i) was calculated as CF_i = CF × A_i × 26 (CF was predefined). The total fragment length for microbe i (M_i) was CF_i × L_i, where L_i was the genome size of M_i. The estimated input nucleotides were calculated as the sum of CF_i × L_i over all 26 microbes. We simulated a wide range of CF (from 28X to 333X), CR (from 0.064X to 0.77X), μFL (from 5 to 100 kb), and NF/P (from 10 to 160) to investigate their impact on metagenome assembly (Table S1).
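
The abundance scaling can be illustrated with a small sketch. It is not the LRTK-SIM code: it assumes the fragment physical depth of each microbe is the predefined depth multiplied by its normalized abundance and by the number of genomes (26 in this study), so that a uniformly abundant community reproduces the predefined depth for every microbe. The microbe names, molarities, and genome sizes below are hypothetical toy values.

```python
def per_microbe_physical_depth(molarity, c_f):
    """Scale a predefined fragment physical depth by relative abundance.

    molarity: dict mapping microbe name -> molarity (any consistent unit).
    c_f: predefined community-wide fragment physical depth.
    """
    total = sum(molarity.values())
    n_genomes = len(molarity)  # 26 for MBARC-26
    return {name: c_f * (m / total) * n_genomes for name, m in molarity.items()}


def total_fragment_length(depth, genome_size):
    """Total length of fragments to simulate per microbe: depth x genome size."""
    return {name: depth[name] * genome_size[name] for name in depth}


# Hypothetical two-microbe community.
molarity = {"microbe_a": 1e-14, "microbe_b": 3e-14}
genome_size = {"microbe_a": 2_000_000, "microbe_b": 4_000_000}

depth = per_microbe_physical_depth(molarity, c_f=100.0)
lengths = total_fragment_length(depth, genome_size)
print(depth)
print(lengths)
print(sum(lengths.values()))  # summed fragment length over all microbes
```

With this scaling the mean per-microbe depth equals the predefined depth, while rarer microbes receive proportionally fewer fragments.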

DNA extraction, library preparation, and sequencing

For the mock community, DNA from the ATCC 20 Strain Staggered Mix Genomic Material (ATCC MSA-1003) was extracted without size selection. For the human gut microbiome from a stool sample, we extracted the DNA using the Qiagen QIAamp Stool Mini Kit and removed DNA fragments below 5 kb. After that, the molecular weight of the isolated DNA was assayed by pulsed-field electrophoresis. For 10x Chromium library preparation, 1 ng of isolated high molecular weight DNA was denatured according to the manufacturer's recommendations, added to the reaction master mix, and mixed with gel beads and emulsification oil to generate droplets within a Chromium Genome chip. The remainder of the library preparation followed the manufacturer's protocol (Chromium Genome v1, PN-120229). The two libraries were each sequenced on Illumina XTen with 2 × 150 bp paired-end reads. The DNA of the human gut microbiome was also prepared for standard Illumina XTen short-read sequencing.

DNA long fragment reconstruction and linked-read subsampling

Long Ranger v2.2.1 [22] was used to correct barcode base errors, calculate PCR duplication rate, and perform barcode-aware linked-read alignment. BWA-MEM v0.7.17 [23] was adopted to align short-reads and linked-reads without barcodes. Long DNA fragments were reconstructed according to the mapping coordinates of co-barcoded short-reads. The linked-reads were sorted by barcode first and then by their mapping coordinates. Long DNA fragments were reconstructed by greedy extension and terminated if the nearest co-barcoded read was > 50 kb away. Each fragment must include at least two co-barcoded read pairs and have a minimum length of 2 kb.
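
The greedy reconstruction described above can be sketched as follows. This is an illustration under stated assumptions rather than the in-house script: read pairs are given as (barcode, reference, start, end) tuples, fragments are built separately for each barcode and reference sequence, and the 50 kb distance is measured from the current fragment end to the start of the next co-barcoded read pair. The thresholds are those given in the text.

```python
from collections import defaultdict

MAX_GAP = 50_000      # stop extending if the next co-barcoded read is > 50 kb away
MIN_READ_PAIRS = 2    # each fragment needs at least two co-barcoded read pairs
MIN_LENGTH = 2_000    # and a minimum length of 2 kb


def reconstruct_fragments(read_pairs):
    """Greedily merge co-barcoded read pairs into long fragments.

    read_pairs: iterable of (barcode, reference, start, end) tuples.
    Returns a list of (barcode, reference, start, end, n_pairs) tuples.
    """
    by_barcode = defaultdict(list)
    for barcode, reference, start, end in read_pairs:
        by_barcode[(barcode, reference)].append((start, end))

    fragments = []

    def emit(key, start, end, n_pairs):
        # Keep only fragments that pass both filters.
        if n_pairs >= MIN_READ_PAIRS and end - start >= MIN_LENGTH:
            fragments.append((key[0], key[1], start, end, n_pairs))

    for key in sorted(by_barcode):           # sort by barcode, then reference
        spans = sorted(by_barcode[key])      # then by mapping coordinate
        frag_start, frag_end = spans[0]
        n_pairs = 1
        for start, end in spans[1:]:
            if start - frag_end > MAX_GAP:   # too far away: close the fragment
                emit(key, frag_start, frag_end, n_pairs)
                frag_start, frag_end, n_pairs = start, end, 1
            else:                            # close enough: extend greedily
                frag_end = max(frag_end, end)
                n_pairs += 1
        emit(key, frag_start, frag_end, n_pairs)
    return fragments


reads = [
    ("A", "ref1", 0, 150), ("A", "ref1", 1000, 1150), ("A", "ref1", 3000, 3150),
    ("A", "ref1", 100000, 100150),               # isolated read pair, dropped
    ("B", "ref1", 0, 150), ("B", "ref1", 500, 650),  # shorter than 2 kb, dropped
]
print(reconstruct_fragments(reads))
```

Grouping by barcode before sorting by coordinate keeps reads from different partitions apart, which is what makes the distance cutoff meaningful when several fragments share a barcode.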

Metagenome assembly

For linked-read assembly, the linked-reads without barcodes were first assembled into seed contigs by metaSPAdes v3.11.1 [4] with default parameters and aligned to contigs by BWA-MEM v0.7.17. Athena-meta v1.3 was applied for local assembly by collecting co-barcoded reads shared by two “seed” contigs in scaffold graph (Figure S8). For mock community, the Illumina short-reads (SRR8359173) and PacBio CCS reads (SRR9202034 and SRR9328980) were assembled by metaSPAdes v3.12.0 and Canu v2.0 [24], respectively. The command lines were included in the Supplementary Note.

Assembly evaluation

We implemented a pipeline (Figure S9) to compare different metagenome assemblies by integrating off-the-shelf software and in-house scripts. First, MaxBin v2.2.4 [25] grouped contigs (longer than 1 kb) into bins, and their completeness and contamination were assessed by CheckM v1.0.12 [26]. Quast v5.0.0 [27] calculated basic statistics such as contig N50, NG50, NGA50, total aligned length, and genomic coverage; Aragorn v1.2.38 [28] and Barrnap (https://github.com/tseemann/barrnap) were used to infer tRNA and rRNA (5S, 16S, and 23S), respectively; Kraken v0.10.6 [29] annotated the taxonomic classification of bins based on its built-in database MiniKrakenDB. The bin abundance was calculated as abundance(bin) = size(bin) × dp(bin) / (len(read) × sum(read)). For each bin, size(bin) is its total nucleotides, dp(bin) denotes its read depth, len(read) is the short-read length, and sum(read) is the total number of aligned short-reads. Bins were recognized as draft genomes if they were classified as high-quality (completeness > 90%, contamination < 5%, presence of the 5S, 16S, and 23S rRNAs, and at least 18 tRNAs), medium-quality (completeness ≥ 50% and contamination < 10%), or low-quality (completeness < 50% and contamination < 10%). The command lines were included in the Supplementary Note.
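
The draft-genome criteria can be written as a short decision function. This is a sketch of the thresholds stated above, not the in-house script; the function name and argument names are illustrative, and the inputs are assumed to come from CheckM (completeness and contamination in percent), Barrnap (rRNA presence), and Aragorn (tRNA count).

```python
def classify_bin(completeness, contamination, has_5s, has_16s, has_23s, n_trna):
    """Return 'high', 'medium', 'low', or 'other' for a contig bin."""
    if (completeness > 90 and contamination < 5
            and has_5s and has_16s and has_23s and n_trna >= 18):
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    if completeness < 50 and contamination < 10:
        return "low"
    return "other"  # not annotated as a draft genome


# A near-complete, clean bin that lacks a 16S rRNA is only medium-quality.
print(classify_bin(95.0, 3.0, True, True, True, 20))
print(classify_bin(95.0, 3.0, True, False, True, 20))
print(classify_bin(40.0, 8.0, False, False, False, 5))
print(classify_bin(80.0, 12.0, True, True, True, 20))
```

The second call illustrates why an assembly that rarely recovers rRNAs and tRNAs yields few or no high-quality bins, regardless of completeness.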

Conclusion

In this study, we comprehensively investigated four parameters of linked-read sequencing on metagenome assembly and compared the results with Illumina short-reads and PacBio CCS reads. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient CR but a smaller amount of input DNA.

Additional file 1: Table S1. Parameter configurations of the simulated data sets. Table S2. Summary of the microbes in MBARC-26. Microbes were classified as High- (H, Molarity > 10^−14), Medium- (M, 10^−15 < Molarity < 10^−14) and Low- (Lsim, Molarity < 10^−15) abundance based on their molarities. Table S3. Summary of 20 microbes in ATCC MSA-1003. Microbes were classified as UltraHigh- (UH, percentage = 18%), High- (H, percentage = 1.8%), Medium- (M, percentage = 0.18%) and Low- (L, percentage = 0.02%) abundance according to their mixture amount. Table S5. The key parameters of 10x linked-read sequencing for human gut metagenome and human genome. Table S6. The performance of metaSPAdes, MEGAHIT and IDBA-UD on short-read sequencing from human gut microbiome.
Additional file 2: Table S4. Annotations of assemblies for the subsampled linked-reads from human gut microbiome.
Additional file 3: Table S7. A summary of 65,535 microbes in human microbiome project covered by 10x linked-reads from human gut microbiome. Table S8. A summary of 1,285 microbes in human microbiome project covered by 10x linked-reads of human gut microbiome with genomic coverage > 90% and sequencing depth > 20X.
Additional file 4: Figure S1. Distributions of bin completeness and contamination of the SC_F and SC_R subsamples of human gut microbiome data. Figure S2. Upset plots for the shared genus (A: SC, C: SC) and species (B: SC, D: SC) of different subsampling datasets. Figure S3. Comparison of the contig NG50 and NGA50 between Illumina short-reads (Illumina) and 10x linked-reads (MSC) from the mock community. Figure S4. Comparison of the contig NG50 and NGA50 between PacBio CCS reads (CCS) and 10x linked-reads (MSC) from the mock community. Figure S5. Parameter distributions of linked-read sequencing from human gut microbiome. PDF: probability density function; CDF: cumulative density function. Figure S6. Parameter distributions of linked-read sequencing from human genome (NA24385). PDF: probability density function; CDF: cumulative density function. Figure S7. Workflow of LRTK-SIM to simulate linked-reads for microbial genomes with uneven depth. Figure S8. Workflow of linked-reads metagenome assembly on simulated 10x linked-reads. Figure S9. Workflow for evaluating and comparing different metagenome assemblies. Figure S10. The distributions of genomic coverage and read depth for the microbes in human microbiome project according to the alignment of the linked-reads from human gut microbiome. CDF: cumulative density function. Supplementary Note: 1. Complexity and statistics for linked-reads from human gut microbiome. 2. Command lines adopted for the analysis.
