Asier Zaragoza-Solas, Jose M Haro-Moreno, Francisco Rodriguez-Valera, Mario López-Pérez.
Abstract
The recovery of DNA from viromes is a major obstacle in the use of long-read sequencing to study their genomes. For this reason, the use of cellular metagenomes (>0.2-μm size range) emerges as an interesting complementary tool, since they contain large amounts of naturally amplified viral genomes from prelytic replication. We applied second-generation (Illumina NextSeq; short reads) and third-generation (PacBio Sequel II; long reads) sequencing to compare the diversity and features of the viral community in a marine sample obtained from offshore waters of the western Mediterranean. We found that a major share of the expected marine viral diversity was recovered directly by the raw PacBio circular consensus sequencing (CCS) reads. More than 30,000 sequences were detected only in this data set, with no homologues in the long- and short-read assemblies, and ca. 26,000 had no homologues in the large data set of the Global Ocean Virome 2 (GOV2), highlighting the information gap created by assembly bias. At the level of complete viral genomes, the performance of both approaches was similar. However, the hybrid long- and short-read assembly provided the longest average sequence length and improved host assignment. Although no novel major clades of viruses were found, long reads recovered more intraclade genomic diversity, which gave a richer assessment of the real diversity and allowed the discovery of novel genes with biotechnological potential (e.g., endolysin genes).
IMPORTANCE We explored the vast genetic diversity of environmental viruses by combining cellular metagenome (as opposed to virome) sequencing with high-fidelity long reads (in this case, PacBio CCS). This approach recovered a representative sample of the viral population, and it performed better (more phage contigs, larger average contig size) than Illumina sequencing applied to the same sample. With this approach, many of the biases of assembly are avoided, because the CCS reads (typically around 5 kb) recover complete genes and even operons, resulting in better discovery of viral gene diversity based on viral marker proteins. Thus, biotechnologically promising genes, such as endolysin genes, can be searched for very efficiently. In addition, hybrid assembly produces more complete and longer contigs, which is particularly important for studying little-known viral groups such as the nucleocytoplasmic large DNA viruses (NCLDV).
Keywords: PacBio CCS long reads; bacteriophage; long-read sequencing; metagenome; viral diversity; virome
Year: 2022 PMID: 35695508 PMCID: PMC9238414 DOI: 10.1128/msystems.00192-22
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
Summary statistics of viral sequence recovery for the short-read assembly (SRa), long-read assembly (LRa), and raw-read (LR) data sets
| Statistic | Illumina assembly (SRa) | PacBio assembly (LRa) | PacBio CCS15 reads (LR) |
|---|---|---|---|
| Starting sequences | 149,018 | 19,982 | 1,535,891 |
| Putative phages (VIBRANT) | 10,979 | 947 | 50,296 |
| 95% identity clustering | 10,979 | 947 | 42,156 |
| Unique sequences | 5,886 | 36 | 30,203 |
| Nucleotides sequenced (Gb) | 23.4 | 31.0 | 7.6 |
| Unique sequences/Gb sequenced | 251.53 | 1.16 | 3,974 |
| Unique sequences (versus GOV2) | 4,196 | 35 | 26,766 |
| No. complete (high quality) | 9 (53) | 15 (114) | 0 (27) |
| Min–max sequence length (bp) | 1,000–188,349 | 1,353–428,169 | 1,011–17,836 |
| Avg sequence length (bp) | 4,906 | 32,260 | 5,261 |
| Min–max GC content (%) | 19.40–65.25 | 19.56–69.93 | 14.25–86.03 |
| Avg GC content (%) | 35.45 | 36.9 | 38.13 |
| Total proteins | 80,487 | 41,599 | 330,157 |
| Unique terminase ( | 30 | 2 | 393 |
| Avg proteins/sequence | 7.33 | 43.92 | 7.83 |
| Avg protein length (aa) | 190.29 | 223.42 | 177.9 |
Unique sequences: sequences not present in the other data sets (BLASTN, 95% identity; coverage of at least 70% of the shorter sequence).
Unique sequences (versus GOV2): sequences not present in the other data sets or the Global Ocean Virome 2.0 (BLASTN, 95% identity; coverage of at least 70% of the shorter sequence).
No. complete (high quality): VIBRANT defines a high-quality sequence as one that likely contains the majority of a virus’s complete genome (~70% completeness).
Unique terminase: values shown here represent protein numbers after dereplication (CD-HIT, 95% identity).
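The uniqueness criterion in the table footnotes (BLASTN, 95% identity, coverage of at least 70% of the shorter sequence) can be illustrated with a small filter. This is a minimal sketch, not the authors' pipeline: it assumes BLASTN tabular output with the columns `qseqid sseqid pident length qlen slen`, it evaluates each hit on its own rather than merging multiple alignments between the same pair, and it approximates coverage as alignment length divided by the shorter sequence length. The function names are hypothetical.

```python
from typing import Iterable, Set

MIN_IDENTITY = 95.0  # percent identity threshold from the table footnotes
MIN_COVERAGE = 0.70  # fraction of the shorter sequence that must be aligned


def is_shared_hit(pident: float, aln_len: int, qlen: int, slen: int) -> bool:
    """Return True if one BLASTN hit meets both thresholds."""
    # Assumption: coverage is alignment length over the shorter sequence.
    coverage = aln_len / min(qlen, slen)
    return pident >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def unique_sequences(all_ids: Iterable[str], blast_lines: Iterable[str]) -> Set[str]:
    """Return the IDs that have no qualifying hit in the other data sets.

    Each line is assumed to be tab-separated with the columns
    qseqid, sseqid, pident, length, qlen, slen (an assumed layout).
    """
    shared = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, length, qlen, slen = line.split("\t")[:6]
        if is_shared_hit(float(pident), int(length), int(qlen), int(slen)):
            shared.add(qseqid)
    return set(all_ids) - shared
```

For example, a 5,000-bp read aligned over 4,000 bp at 97.5% identity to a 30,000-bp contig would count as shared (80% of the shorter sequence), whereas the same read aligned over only 2,000 bp would remain unique.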
FIG 1 (A) Taxonomic affiliations of viral contigs expressed in percentages, separated into those found in that data set (non-unique) and those unique to that data set (unique). The number in parentheses below each bar is the number of contigs in that category. (B) Distribution of assembled viral contigs that infect Alphaproteobacteria by contig length and GC content. Circles represent short-read assemblies (Illumina), while diamonds represent hybrid assemblies (PacBio + Illumina). Shapes are colored according to their host. (C) Distribution of assembled viral contigs that infect Cyanobacteria by contig length and GC content. Orange circles represent short-read assemblies (Illumina), while green diamonds represent hybrid assemblies (PacBio + Illumina). (D) Distribution of viral contigs by contig length and GC content. Circles represent short-read assemblies (Illumina), while green diamonds represent hybrid assemblies (PacBio + Illumina). Shapes are colored according to their host.
FIG 2 (A) Relative abundance of viral sequences measured by their recruitment values in metagenomes from Tara Oceans expeditions for cyanophages, pelagiphages, and other phages. The x axis shows the number of Tara stations where the contig accumulated over the coverage thresholds, while the y axis shows the combined recruitment value (in RPKG). Circles represent contigs derived from assembly (green for hybrid assembly, orange for Illumina assembly), while blue diamonds represent raw PacBio reads. (B) Relative abundance of viral sequences in viromes (x axis) and metagenomes (y axis) obtained from the same sample at 15, 45, and 60 m, measured in ln RPKG. Circles represent contigs derived from assembly (green for hybrid assembly, orange for Illumina assembly), while blue diamonds represent raw PacBio CCS15 reads. SRF, surface; DCM, deep chlorophyll maximum.
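The caption reports recruitment in RPKG without spelling out the formula. RPKG is commonly defined as reads recruited per kilobase of reference sequence per gigabase of metagenome, and the sketch below assumes that definition. The function names are hypothetical, and the identity and coverage thresholds used to decide which reads count as recruited are not given here, so they are left out.

```python
import math
from typing import Optional


def rpkg(recruited_reads: int, seq_length_bp: int, metagenome_bp: int) -> float:
    """Reads recruited per kilobase of sequence per gigabase of metagenome.

    Assumed definition: reads / (sequence length in kb) / (metagenome size in Gb).
    """
    if seq_length_bp <= 0 or metagenome_bp <= 0:
        raise ValueError("sequence length and metagenome size must be positive")
    seq_kb = seq_length_bp / 1_000
    metagenome_gb = metagenome_bp / 1_000_000_000
    return recruited_reads / seq_kb / metagenome_gb


def ln_rpkg(recruited_reads: int, seq_length_bp: int, metagenome_bp: int) -> Optional[float]:
    """Natural log of RPKG, as used on the axes of panel B.

    Returns None when no reads were recruited, because ln(0) is undefined.
    """
    value = rpkg(recruited_reads, seq_length_bp, metagenome_bp)
    return math.log(value) if value > 0 else None
```

Under this definition, 500 reads recruited to a 5-kb sequence from a 2-Gb metagenome correspond to 500 / 5 / 2 = 50 RPKG. Normalizing by both sequence length and metagenome size is what allows short raw reads and long assembled contigs to be compared on the same axes.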
FIG 3 (A) Phylogenetic trees based on the terminase large subunit (TerL) and thymidylate synthase (PhyX). Branches are colored according to the assigned host, while the color of the outer circle indicates the data set the contig was obtained from (orange for Illumina assembly, green for PacBio assembly, blue for PacBio CCS15 reads). (B) Venn diagrams showing shared and unique sequences among the three data sets and GOV2 for the terminase large subunit (TerL) and thymidylate synthase (PhyX). The number inside each intersection leaf indicates the number of proteins shared by those data sets. In the unique section for each data set, the number in parentheses is the percentage of unique proteins in that data set compared to the total.