
The Pacific Ocean virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology.

Bonnie L Hurwitz, Matthew B Sullivan.

Abstract

Bacteria and their viruses (phage) are fundamental drivers of many ecosystem processes including global biogeochemistry and horizontal gene transfer. While databases and resources for studying function in uncultured bacterial communities are relatively advanced, far fewer exist for their viral counterparts. The issue is largely technical in that the majority (often 90%) of viral sequences are functionally 'unknown', making viruses a virtually untapped resource of functional and physiological information. Here, we provide a community resource that organizes this unknown sequence space into 27 K high confidence protein clusters using 32 viral metagenomes from four biogeographic regions in the Pacific Ocean that vary by season, depth, and proximity to land, and include some of the first deep pelagic ocean viral metagenomes. These protein clusters more than double currently available viral protein clusters, including those from environmental datasets. Further, a protein cluster guided analysis of functional diversity revealed that richness decreased (i) from deep to surface waters, (ii) from winter to summer, and (iii) with distance from shore in surface waters only. These data provide a framework to draw on for future metadata-enabled functional inquiries of the vast viral unknown.

Year:  2013        PMID: 23468974      PMCID: PMC3585363          DOI: 10.1371/journal.pone.0057355

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Bacteria are fundamental to life on Earth and drive energy and nutrient cycles in natural systems [1]. In the surface oceans, viruses, thought to be predominantly phages, are present at approximately 10^7 particles per milliliter of seawater, often outnumbering their bacterial hosts fifteen to one [2]–[4]. These abundant phages play an integral role in the lifecycle, development and evolution of their diverse hosts (reviewed in [5]). This is particularly well documented in ocean cyanophages where natural abundances are high [6]–[9], and the eco-evolutionary virus-host dynamics involve manipulation of critical aspects of ocean life (e.g., photosynthesis, phosphate, nitrogen; reviewed in [10]). Despite this importance, our understanding of indigenous viral and microbial communities is compromised by both conceptual and technological challenges. Perhaps chief among them is that <1% of microbes in an environment can be routinely grown in the laboratory [11], [12]. To study this ‘unseen majority’ [13], microbial ecologists developed culture-independent techniques, first in the form of gene markers (e.g., small subunit ribosomal DNA, [14]) to quantify biodiversity and subsequently through metagenomics to study metabolic function and gene ecology [12], [15]–[17]. To this end, resources such as the Global Ocean Sampling (GOS) dataset [18] have been developed, and subsequently organized using sequence-based protein clusters (PCs) [19] to (i) group related proteins based on sequence similarity, (ii) map metadata associated with the samples to the PCs, and (iii) aid future inquiries into genetic novelty by mapping to pre-defined PCs. The GOS dataset represents 7.7 million reads from 41 surface ocean microbial (0.1–0.8 µm size fraction) samples spanning 8,000 km from coastal Atlantic waters off the eastern United States to the Sargasso Sea, Gulf of Mexico, Caribbean Sea and a few sites in the South Pacific Ocean. 
This broad sampling has enabled GOS to greatly advance microbial ecology (849 citations as of 6 Sept 2012; Google Scholar) both through data mining studies and use of GOS to contextualize environmental research findings. While metagenomics was first applied to uncultured viral communities [20], two years in advance of microbes [15]–[17], there remain no equivalent GOS-scale viral datasets (Table 1). Such a resource is missing because of limiting DNA yields from environmental viruses and the fact that viral metagenomic analytics are challenging. On the latter, viral metagenomics is more difficult than microbial metagenomics as (i) viral genome representation in public databases is more sparse (as of November 2012 there are 115 marine phage genomes in Genbank, Table S1), (ii) analytical tools paralleling those for microbes (e.g., MG-RAST [21] and IMG/M [22]) have not been available (but see VIROME [23] and MetaVir [24]) and (iii) the ‘unknown’ problem is even greater in viral metagenomes as often 90% of sequences lack similarity to anything in pre-existing protein databases (see Table 1 versus 65% for microbes [17]). Worse, the viral metagenomic datasets themselves often suffer either from being too small (e.g., LASL-based Sanger-sequenced metagenomes) or from technical artifacts that are now known to render the metagenomes non-quantitative. To emphasize the latter, we highlight a commonly used marine viral dataset (363 citations as of 6 Sept 2012; Google Scholar) that includes 1.77 million metagenome sequences from 4 major ocean regions [25]. 
While this dataset was generated using the best available technologies and is the gold-standard for contextualizing viral ecological findings, it is now known to suffer from two major technical issues: (i) pooled samples (4 metagenomes from 184 viral assemblages collected over a decade from 68 sites) inhibit tracking of variability over space and time, and (ii) MDA-amplified material is now known to lead to non-quantitative metagenomes [26]. This latter point is particularly problematic for viruses as MDA biases are both stochastic [26] and systematic [27], [28] with the latter related to nucleic acid type and structure (i.e., a feature which varies across viral types). For these reasons, there are currently no marine viral metagenomic datasets appropriate for quantitative viral ecology.
Table 1

Summary of published marine, non-coral viral metagenomic datasets.

| site | Scripps Pier and Mission Bay, San Diego CA | Mission Bay, San Diego, CA | Jericho Pier and Strait of Georgia, British Columbia | BBC, GOM, ARC, SAR | Chesapeake Bay | Line Islands, Skan Bay | Tampa Bay, induced virome | Eutrophic Monterey Bay, CA 200 m depth | East Pacific Rise diffuse-flow viral community | POV |
| # metagenomes | 1 | 1 | 2 | 4 | 1 | 13 | 1 | 1 | 1 | 32 |
| Total # reads | 1 934 | 1 156 | 277 | 1 768 297 | 5 641 | 2 258 386 | 294 068 | 881 | 176 | 6 020 088 |
| avg. len. (bp) | >600 | >600 | >600 | 102 | 695 | <105 | 102 | >600 | 488 | 310 |
| sequencing platform | Sanger | Sanger | Sanger | 454 GS20 | Sanger | 454 GS20 | 454 GS20 | Sanger | Sanger | 454 Titanium |
| viral fraction (µm) | <0.16 | 0.2 | 0.22 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | N/A | 0.2 |
| concentration | TFF | TFF | TFF+ultracent. | TFF | TFF | TFF | TFF | TFF | TFF+ultracent. | FeCl3-ppt |
| purification | CsCl | CsCl | RNAse | CsCl | N/A | CsCl | CsCl | CsCl | N/A | CsCl+DNase |
| DNA amplification | LASLs | LASLs | cDNA-based LASLs | MDA | LASLs | MDA | MDA | N/A | LASLs | LA |
| % unknown | 65 | 75 | 63–81 | 91 | 70 | 90 | 93 | 74 | 75 | 89 |
| reference | [20] | [73] | [74] | [25] | [41] | [42] | [75] | [30] | [43] | This publication |

Abbreviations are as follows: TFF – Tangential Flow Filtration, ultracent. – ultracentrifugation, FeCl3-ppt – iron-chloride precipitation, CsCl – cesium chloride density gradient, LASLs – linker amplified shotgun libraries, MDA – multiple displacement amplification, LA – linker amplification.

Here we introduce a large-scale, quantitative Pacific Ocean Virome (POV) dataset and ∼456 K associated PCs that organize the ‘known’ and ‘unknown’ sequence space for future comparative viral metagenomic study. The 6 million read dataset is derived from 32 temporally- and spatially-resolved viral assemblages, and represents the largest viral metagenomic sampling of the Pacific Ocean to date, including the first large-scale viral metagenomes from the deep pelagic ocean (but see [29], [30] and Table 1). The POV dataset represents a systematically collected, processed, and documented quantitative marine viral metagenomic resource [31] as follows. All thirty-two POV communities were concentrated using a new method that captures nearly all particles [32], purified using DNase digestion and CsCl buoyant density gradients to minimize contamination by non-viral DNA [33], and DNA extracted and linker-amplified to minimize quantitative and cloning biases in the resulting metagenomes [34]. DNA was then sequenced by Roche 454 Titanium technology. The metagenomes and the associated PCs provide a much-needed community resource to test hypotheses about environmental viruses, as GOS has done for microbial ecology. For these reasons, POV will likely become a foundational dataset for future comparative studies of virus genes and communities at the global ocean scale such as those derived from the recent Tara Oceans [35] and Malaspina [36] expeditions.

Results and Discussion

The Pacific Ocean Samples

The 32 POV source waters varied by depth, proximity to land, and season and were derived from four regions in the Pacific Ocean (Figure 1): Scripps Pier in San Diego, California (SIO), Monterey Bay, California (MBARI), near Vancouver Island in British Columbia (LineP), and the Great Barrier Reef in Australia (GBR). Samples, metadata, and metagenomic descriptive statistics are summarized in Table 2.
Figure 1

Sampling site map for the POV dataset.

Thirty-two viral metagenomes represent discretely sampled and processed datasets that vary over time and space in the pelagic Pacific Ocean. (A) Overview of sampling sites, (B) GBR – Great Barrier Reef, Australia, near Dunk and Fitzroy Islands. (C) LineP- oceanographic transect off Vancouver Island, British Columbia (D) MBARI- Line67 oceanographic transect off of Monterey Bay, California (E) SIO- Scripps Pier, San Diego, CA. Images were created using Ocean Data View.

Table 2

Description of POV samples and associated metadata.

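| sample | # reads | Mbps | Avg. read length | Stdev read length | photic zone | region | depth (m) | site location | month | season | Oceanic features |
| L.Win.O.1000 m | 147 537 | 40.5 | 274 | 80 | aphotic | LineP | 1000 | open ocean | Feb. | winter | OMZ |
| L.Win.O.2000 m | 125 896 | 33.4 | 265 | 86 | aphotic | LineP | 2000 | open ocean | Feb. | winter | |
| L.Spr.C.500 m | 136 876 | 44.5 | 325 | 83 | aphotic | LineP | 500 | coastal | Jun. | summer | |
| L.Spr.C.1000 m | 97 126 | 31.5 | 324 | 124 | aphotic | LineP | 1000 | coastal | Jun. | summer | OMZ |
| L.Spr.C.1300 m | 98 478 | 24.2 | 245 | 95 | aphotic | LineP | 1300 | coastal | Jun. | summer | |
| L.Spr.I.500 m | 58 108 | 20.5 | 353 | 101 | aphotic | LineP | 500 | intermediate | Jun. | summer | |
| L.Spr.I.1000 m | 122 565 | 45.9 | 374 | 88 | aphotic | LineP | 1000 | intermediate | Jun. | summer | OMZ |
| L.Spr.I.2000 m | 49 914 | 13.2 | 264 | 97 | aphotic | LineP | 2000 | intermediate | Jun. | summer | |
| L.Sum.O.500 m | 42 118 | 13.6 | 322 | 122 | aphotic | LineP | 500 | open ocean | Aug. | fall | |
| L.Sum.O.1000 m | 70 596 | 19.9 | 282 | 131 | aphotic | LineP | 1000 | open ocean | Aug. | fall | OMZ |
| L.Sum.O.2000 m | 68 516 | 18.6 | 271 | 122 | aphotic | LineP | 2000 | open ocean | Aug. | fall | |
| L.Spr.O.1000 m | 101 179 | 28.4 | 280 | 115 | aphotic | LineP | 1000 | open ocean | Jun. | summer | OMZ |
| L.Spr.O.2000 m | 55 332 | 14.8 | 267 | 110 | aphotic | LineP | 2000 | open ocean | Jun. | summer | |
| L.Win.O.500 m | 167 616 | 48.0 | 286 | 84 | aphotic | LineP | 500 | open ocean | Feb. | winter | |
| M.Fall.O.1000 m | 225 833 | 66.3 | 293 | 105 | aphotic | MBARI | 1000 | open ocean | Oct. | fall | |
| M.Fall.O.4300 m | 144 588 | 40.5 | 279 | 95 | aphotic | MBARI | 4300 | open ocean | Oct. | fall | |
| GD.Spr.C.8 m | 116 855 | 29.0 | 247 | 72 | photic | GBR | 8 | coastal | Oct. | spring | |
| GF.Spr.C.9 m | 82 739 | 20.9 | 252 | 73 | photic | GBR | 9 | coastal | Oct. | spring | |
| L.Sum.O.10 m | 165 256 | 48.4 | 292 | 102 | photic | LineP | 10 | open ocean | Aug. | fall | |
| L.Spr.C.10 m | 107 244 | 25.7 | 240 | 86 | photic | LineP | 10 | coastal | Jun. | summer | |
| L.Spr.I.10 m | 92 415 | 22.6 | 244 | 92 | photic | LineP | 10 | intermediate | Jun. | summer | |
| L.Spr.O.10 m | 75 036 | 19.5 | 259 | 106 | photic | LineP | 10 | open ocean | Jun. | summer | |
| L.Win.O.10 m | 192 685 | 59.7 | 310 | 74 | photic | LineP | 10 | open ocean | Feb. | winter | |
| M.Fall.C.10 m | 303 519 | 105.2 | 346 | 118 | photic | MBARI | 10 | coastal | Oct. | fall | Upwelling |
| M.Fall.I.10 m | 321 754 | 92.9 | 288 | 109 | photic | MBARI | 10 | intermediate | Oct. | fall | Upwelling |
| M.Fall.I.42 m | 31 528 | 10.9 | 346 | 114 | photic | MBARI | 42 | intermediate | Oct. | fall | Upwelling; DCM |
| M.Fall.O.10 m | 203 238 | 52.4 | 257 | 95 | photic | MBARI | 10 | open ocean | Oct. | fall | |
| M.Fall.O.105 m | 156 509 | 44 | 281 | 101 | photic | MBARI | 105 | open ocean | Oct. | fall | DCM |
| SFC.Spr.C.5 m | 487 339 | 191.2 | 392 | 107 | photic | SIO | 5 | coastal | Apr. | spring | |
| SFD.Spr.C.5 m | 645 463 | 218.7 | 338 | 119 | photic | SIO | 5 | coastal | Apr. | spring | |
| SFS.Spr.C.5 m | 504 826 | 173.1 | 342 | 130 | photic | SIO | 5 | coastal | Apr. | spring | |
| STC.Spr.C.5 m | 821 404 | 246.3 | 299 | 103 | photic | SIO | 5 | coastal | Apr. | spring | |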

Oceanic features include: OMZ = oxygen minimum zone, DCM = deep chlorophyll maximum, Upwelling = within a current system with upwelling.

The defining ecological features of each dataset are as follows. Four SIO metagenomes were derived from a single coastal, surface water sample from Scripps Pier (San Diego, CA, April 2009, spring), the site of the first viral metagenome [20], that were differentially concentrated and purified (4 treatments, >2.5 M sequences; [33]). Seven MBARI metagenomes (∼1.4 M sequences) represent viruses concentrated from various depths along a long-standing oceanographic transect, Line67 [33], [37], [38], which spans coastal, upwelling and open ocean waters off Monterey Bay, California collected in fall, October 2009. Nineteen LineP metagenomes (∼2.0 M sequences) represent viruses concentrated from various depths along another long-standing oceanographic transect, LineP [39], which spans coastal-to-open-ocean waters, including the second largest ocean oxygen minimum zone [40], off British Columbia collected in February (winter), June (spring) and August (summer) 2009. Finally, two GBR metagenomes (∼0.2 M sequences) represent viral concentrates from the dry season near Dunk (Tully River impacted) and Fitzroy (less impacted) Islands at the Great Barrier Reef in Australia collected in October (spring) 2009. 
Together, these metagenomes represent a diversity of pelagic ocean features including oceanic region (SIO, GBR, MBARI, LineP), proximity to land (coastal to open ocean), season (spring, summer, fall, and winter), depth (5 m to 4300 m), primary productivity and oxygen concentration (variability in other physicochemical characteristics is not considered here).
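The approximate per-region sequence totals quoted above can be compared against Table 2 with a short tally. The sketch below is illustrative only: the read counts are transcribed from Table 2, while the list layout and the function name `reads_by_region` are assumptions made for this example.

```python
# (sample, region, number of reads), transcribed from Table 2.
SAMPLES = [
    ("L.Win.O.1000 m", "LineP", 147_537), ("L.Win.O.2000 m", "LineP", 125_896),
    ("L.Spr.C.500 m", "LineP", 136_876), ("L.Spr.C.1000 m", "LineP", 97_126),
    ("L.Spr.C.1300 m", "LineP", 98_478), ("L.Spr.I.500 m", "LineP", 58_108),
    ("L.Spr.I.1000 m", "LineP", 122_565), ("L.Spr.I.2000 m", "LineP", 49_914),
    ("L.Sum.O.500 m", "LineP", 42_118), ("L.Sum.O.1000 m", "LineP", 70_596),
    ("L.Sum.O.2000 m", "LineP", 68_516), ("L.Spr.O.1000 m", "LineP", 101_179),
    ("L.Spr.O.2000 m", "LineP", 55_332), ("L.Win.O.500 m", "LineP", 167_616),
    ("M.Fall.O.1000 m", "MBARI", 225_833), ("M.Fall.O.4300 m", "MBARI", 144_588),
    ("GD.Spr.C.8 m", "GBR", 116_855), ("GF.Spr.C.9 m", "GBR", 82_739),
    ("L.Sum.O.10 m", "LineP", 165_256), ("L.Spr.C.10 m", "LineP", 107_244),
    ("L.Spr.I.10 m", "LineP", 92_415), ("L.Spr.O.10 m", "LineP", 75_036),
    ("L.Win.O.10 m", "LineP", 192_685), ("M.Fall.C.10 m", "MBARI", 303_519),
    ("M.Fall.I.10 m", "MBARI", 321_754), ("M.Fall.I.42 m", "MBARI", 31_528),
    ("M.Fall.O.10 m", "MBARI", 203_238), ("M.Fall.O.105 m", "MBARI", 156_509),
    ("SFC.Spr.C.5 m", "SIO", 487_339), ("SFD.Spr.C.5 m", "SIO", 645_463),
    ("SFS.Spr.C.5 m", "SIO", 504_826), ("STC.Spr.C.5 m", "SIO", 821_404),
]


def reads_by_region(samples):
    """Sum read counts and count metagenomes for each region."""
    totals = {}
    for _sample, region, reads in samples:
        count, total = totals.get(region, (0, 0))
        totals[region] = (count + 1, total + reads)
    return totals


totals = reads_by_region(SAMPLES)
for region, (count, total) in sorted(totals.items()):
    print(f"{region}: {count} metagenomes, {total:,} reads ({total / 1e6:.2f} M)")
grand_total = sum(total for _count, total in totals.values())
print(f"all regions: {grand_total:,} reads")
```

The grand total agrees with the POV read count listed in Table 1, which is a useful check that the per-sample values were transcribed correctly.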

Taxonomic Composition of POV Metagenomes

Long-standing questions in marine viral ecology are centered on understanding the extent to which viral assemblages change spatially, temporally and under different environmental conditions in the ocean [2]. Yet, given the paucity of known viruses in biological databases, comparative examination of viral assemblages from diverse environments in the sea is stymied. As commonly observed in marine viral metagenomic studies [20], [25], [41]–[43], the majority (87% photic zone, 91% aphotic zone) of the reads could not be classified based on sequence similarity to known taxa (see Materials and Methods, Figure 2A). Moreover, we found a smaller fraction of reads that matched known viruses in the aphotic zone (3.3%) than the photic zone (8.3%), likely due to more sampling in the surface oceans (Figure 2A and Table 1).
Figure 2

The POV dataset and its place in the viral protein universe.

(A) Summary superkingdom taxonomy statistics for quantitative Pacific Ocean viral metagenomes from 16 photic and 16 aphotic zone samples. Reads were taxonomically assigned based on matches to proteins in SIMAP and curated as described in the methods. (B) Venn diagram representing medium- to large membership PCs documents the relative contributions of the POV, GOS microbial, and SIMAP datasets to the ‘viral protein universe’.

To examine the taxonomic composition of the POV metagenomes in greater detail, we classified the reads from each sample based on their top match at the family level (Figure S1). Overall, we found that metagenomes from samples in the photic zone had a larger proportion of reads that matched Myoviridae (an average of 4.2%±1.9%) than in the aphotic zone (an average of 1.6%±0.8%). Several samples in the deep ocean, however, were enriched for Myoviridae, including L.Spr.I.2000 m (3.3%) and L.Spr.O.2000 m (3.0%), which closely matched their photic counterparts L.Spr.I.10 m (3.1%) and L.Spr.O.10 m (4.1%). Also notable was the large fraction of reads matching Myoviridae at the deep chlorophyll maximum (DCM) in the open ocean in Monterey Bay (9.6% for M.Fall.O.105 m), which is roughly five times the fraction seen in the surface ocean from the same time point and station (1.9% for M.Fall.O.10 m). We also found a large fraction of sequences that matched Podoviridae in the DCM sample in the open ocean in Monterey Bay (4.2% for M.Fall.O.105 m) and in the surface samples from the Great Barrier Reef (3.8% for GF.Spr.C.9 m and 5.0% for GD.Spr.C.8 m) as compared to the 0.8%±0.5% on average in other samples. Thus, Podoviridae may play an important, presently unknown role in reef ecosystems and the DCM. Finally, we compared and contrasted known viruses at the genus and species level in the combined photic and aphotic samples (Figure S2A and B respectively). 
At the genus level, we found a higher fraction of T4- and T7-like viruses in the photic zone (6.9% total) than the aphotic zone (2.6% total). At the species level, we found a higher fraction of Synechococcus and Prochlorococcus phages in the photic zone (4.6% total) than the aphotic zone (1.1% total).
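The top-match classification used in this section can be illustrated with a small sketch. It is not the pipeline used for POV (which matched reads to SIMAP proteins and curated the results as described in the methods); the `(read_id, taxon, score)` hit format, the function name `classify_reads`, and the toy data are all assumptions for illustration.

```python
from collections import Counter


def classify_reads(read_ids, hits):
    """Return the percentage of reads assigned to each taxon by top hit.

    read_ids: list of every read identifier in the sample.
    hits: iterable of (read_id, taxon, score) tuples (hypothetical format).
    Reads without any hit are reported as "unknown".
    """
    best = {}  # read_id -> (taxon, score) of the highest-scoring hit so far
    for read_id, taxon, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (taxon, score)
    counts = Counter(best[r][0] if r in best else "unknown" for r in read_ids)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Toy sample: five reads, two of which have database matches.
reads = ["r1", "r2", "r3", "r4", "r5"]
hits = [
    ("r1", "Myoviridae", 120.0),
    ("r1", "Podoviridae", 80.0),  # lower score, ignored in favor of Myoviridae
    ("r2", "Podoviridae", 95.0),
]
percent = classify_reads(reads, hits)
print(percent)
```

Counting unmatched reads explicitly keeps the "unknown" fraction in the denominator, which matters here because most POV reads have no match.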

The Protein Cluster as a Means to Organize Unknown Sequence Space

While this great ‘unknown’ problem is exacerbated in viral metagenomes, it has also plagued microbial metagenomic studies to the extent that previous analyses of the GOS dataset organized this sequence space, including unknowns, using protein clustering (sensu Yooseph et al., 2007 and 2008 [19], [44]; details in Materials and Methods). Here, as per Yooseph’s approach, we individually assembled each POV metagenome and identified open reading frames (ORFs) on both the contigs and individual reads, yielding ∼4.1 M non-redundant ORFs. These POV ORFs were clustered with ORFs from GOS core clusters (3,625,128 ORFs [19]) of both microbial and viral origin, as well as genes from SIMAP phage genomes (33,857 ORFs [45]), for a total of ∼7.8 M ORFs. Given that database representation of viral sequences is sparse at best (e.g., GOS represents mostly microbial-fraction not viral core clusters) and the POV samples represent predominantly unexplored ocean regions, it is not surprising that most (78%) POV ORFs fail to cluster with known PCs (Table 3). Self-clustering the unmapped POV ORFs further organized this unknown sequence space (i.e., another 55% of POV ORFs were clustered), such that only 23% of POV ORFs remained as singletons. These singletons could represent artifacts or, more likely, members of the “rare biosphere” [46] that are under-sampled in this dataset due to their rarity.
Table 3

POV ORF recruitment.

| photic zone | # ORFs | clustered with GOS/SIMAP phage | self-clustered | singletons | total clustered |
| photic | 2 783 784 | 31%±4% | 47%±9% | 22%±7% | 78%±7% |
| aphotic | 1 323 811 | 13%±6% | 62%±12% | 25%±10% | 75%±10% |
| all | 4 107 595 | 22%±10% | 55%±13% | 23%±9% | 77%±9% |

The fraction of POV ORFs that non-redundantly recruited to existing (GOS/known phages) PCs versus those that recruited to PCs derived from the POV dataset (self clustered). POV ORFs are non-redundant by metagenome. The standard deviation values refer to differences in the fraction of ORFS clustered in the POV samples.
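The three recruitment categories reported in Table 3 can be sketched as simple bookkeeping over cluster assignments. This is a minimal illustration, assuming each ORF already carries a cluster identifier after both clustering passes; the function name `recruitment_fractions`, the identifiers, and the toy data are hypothetical and do not come from the POV pipeline.

```python
from collections import Counter


def recruitment_fractions(orf_to_cluster, known_clusters):
    """Split ORFs into 'known', 'self_clustered' and 'singleton' fractions.

    orf_to_cluster: dict mapping ORF id -> cluster id (hypothetical input).
    known_clusters: set of cluster ids for pre-existing PCs (e.g. GOS or phage).
    """
    sizes = Counter(orf_to_cluster.values())
    tally = Counter()
    for cluster in orf_to_cluster.values():
        if cluster in known_clusters:
            tally["known"] += 1           # recruited to an existing PC
        elif sizes[cluster] >= 2:
            tally["self_clustered"] += 1  # new PC with two or more members
        else:
            tally["singleton"] += 1       # ORF left on its own
    n = len(orf_to_cluster)
    return {key: tally[key] / n for key in ("known", "self_clustered", "singleton")}


# Toy example with ten ORFs.
assignments = {
    "orf1": "gos_A", "orf2": "gos_A",
    "orf3": "pov_1", "orf4": "pov_1", "orf5": "pov_1",
    "orf6": "pov_2", "orf7": "pov_2",
    "orf8": "pov_3", "orf9": "pov_4", "orf10": "pov_5",
}
fractions = recruitment_fractions(assignments, known_clusters={"gos_A"})
print(fractions)
```

Note that the percentages in Table 3 are means across samples with standard deviations, so pooled fractions computed this way over the whole dataset would not necessarily reproduce them.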

In total, we identified 456,420 PCs that contained two or more non-redundant members (12,226+1,557+442,637 PCs derived from GOS+POV, Phage+POV and POV only, respectively). Of these, 27,646 PCs contained 20 or more members (counts with varying levels of cluster membership are summarized in Table 4). For comparison, GOS, the first large-scale marine microbial-fraction metagenomic sequencing effort, predicted 6.1 M proteins from assemblies derived from 7.7 M Sanger sequences, and identified ∼39 k of these ‘20+ member’ PCs from surface ocean waters [19]. These POV data represent the first marine viral-fraction metagenomes analyzed using PC techniques (but see also [33], [47]), and they more than double the known viral PCs (Figure 2B). Specifically, the POV dataset ‘identified’ 12,302 GOS 20+ member PCs (these existed in GOS and a subset have been previously identified as viral [43], [48]), while adding 15,344 20+ member PCs represented only by POV sequences. These ∼28 k viral clusters are likely also abundant in nature as they represent only ∼6% of the total number of PCs (the ecological ‘binning’ unit), but recruit ∼68% of the POV ORFs (the ecological ‘count’ equivalent).
Table 4

Distribution and count by PC size.

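| PC size | # POV PCs | # POV reads | # POV ORFs | # GOS ORFs | # SIMAP phage ORFs |
| 2–4 | 314 144 | 798 190 | 797 110 | N/A | 1 080 |
| 5–9 | 80 302 | 514 717 | 514 453 | N/A | 264 |
| 10–19 | 34 330 | 452 112 | 452 012 | N/A | 100 |
| 20–49 | 18 332 | 542 090 | 390 671 | 151 282 | 137 |
| 50–99 | 4 522 | 305 752 | 148 579 | 157 106 | 67 |
| 100–199 | 1 956 | 270 531 | 90 302 | 180 160 | 69 |
| 200–499 | 1 360 | 430 580 | 147 385 | 283 101 | 94 |
| 500–999 | 715 | 514 752 | 155 714 | 358 942 | 96 |
| 1000–1999 | 524 | 738 986 | 187 580 | 551 246 | 160 |
| 2000+ | 234 | 881 930 | 396 720 | 485 096 | 114 |
| total | 456 420 | 5 449 640 | 3 280 526 | 2 166 933 | 2 181 |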

Distribution of PC sizes (based upon the number of ORFs a PC contains) and the number of PCs, POV reads mapping to clusters and ORFs that belonged to PCs of that size. All POV data are new to this study.
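The ∼6% and ∼68% shares quoted for the 20+ member clusters can be recomputed from Table 4. The sketch below is a rough check rather than the authors' calculation: it takes the member counts from the column labelled "# POV reads", which appears to be the column that reproduces the ∼68% figure, and the helper names are assumptions for this example.

```python
# Rows transcribed from Table 4:
# (PC size bin, number of PCs, value in the "# POV reads" column).
TABLE4 = [
    ("2-4", 314_144, 798_190),
    ("5-9", 80_302, 514_717),
    ("10-19", 34_330, 452_112),
    ("20-49", 18_332, 542_090),
    ("50-99", 4_522, 305_752),
    ("100-199", 1_956, 270_531),
    ("200-499", 1_360, 430_580),
    ("500-999", 715, 514_752),
    ("1000-1999", 524, 738_986),
    ("2000+", 234, 881_930),
]


def lower_bound(size_bin):
    """Smallest cluster size covered by a label such as '20-49' or '2000+'."""
    return int(size_bin.rstrip("+").split("-")[0])


def shares_at_or_above(rows, min_size):
    """Fraction of PCs, and of their members, in bins starting at min_size."""
    total_pcs = sum(pcs for _bin, pcs, _members in rows)
    total_members = sum(members for _bin, _pcs, members in rows)
    large = [(pcs, members) for size_bin, pcs, members in rows
             if lower_bound(size_bin) >= min_size]
    pc_share = sum(pcs for pcs, _members in large) / total_pcs
    member_share = sum(members for _pcs, members in large) / total_members
    return pc_share, member_share


pc_share, member_share = shares_at_or_above(TABLE4, min_size=20)
print(f"20+ member PCs: {pc_share:.1%} of clusters, {member_share:.1%} of members")

# Totals as stated in the text.
text_total_pcs = 12_226 + 1_557 + 442_637  # GOS+POV, Phage+POV, POV only
text_large_pcs = 12_302 + 15_344           # shared with GOS + POV only
print(text_total_pcs, text_large_pcs)
```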

Protein Clusters as a Viral Community Functional Richness Metric

Because viruses lack gene markers (e.g., small subunit ribosomal DNA) and most of the POV reads cannot be identified in reference databases, measuring viral community diversity is problematic. To measure functional richness in POV samples irrespective of annotation, we detected genetic links between viral communities using protein clustering and illustrated patterns in richness between the samples using a rarefaction analysis as previously defined [33], [47].
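For readers unfamiliar with rarefaction, the idea can be sketched with the textbook expected-richness formula for sampling without replacement. This is an assumption-laden illustration, not the procedure of [33], [47]: the function name `expected_richness` and the toy hit counts are hypothetical, and the POV analysis may have used repeated random subsampling instead.

```python
from math import comb


def expected_richness(cluster_counts, n):
    """Expected number of distinct PCs seen in a random subsample of n hits.

    cluster_counts: number of hits to each protein cluster in one sample.
    n: subsample size, drawn without replacement from all hits.
    """
    total = sum(cluster_counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of hits")
    denom = comb(total, n)
    # A cluster with c hits is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / denom for c in cluster_counts)


# Two toy samples with 100 hits and 4 clusters each.
even = [25, 25, 25, 25]   # hits spread evenly across clusters
skewed = [97, 1, 1, 1]    # one dominant cluster plus three rare ones

for depth in (10, 50, 100):
    print(depth,
          round(expected_richness(even, depth), 2),
          round(expected_richness(skewed, depth), 2))
```

Evaluating both samples at the same subsample size is what allows richness to be compared between metagenomes with very different sequencing effort.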

Seasonal viral functional richness measurements at LineP

To examine ocean viral functional richness across the depth continuum (10 m to 2000 m) and season (spring, summer, and winter), 11 metagenomes from a single LineP open ocean site (station P26) were analyzed by protein cluster/rarefaction analysis [33], [47]. Rarefaction analysis showed that photic samples were less functionally rich than aphotic samples from the same season (Figure 3A). When comparing samples from different seasons in the same photic zone, rarefaction analysis showed that winter was the most functionally rich, followed by spring and summer (Figure 3A). All aphotic samples clearly separated by season, whereas photic samples showed similar levels of functional richness in spring and summer but increased richness in winter. Overall, rarefaction patterns indicate that season and photic zone are important drivers of viral community functional richness at LineP.
Figure 3

Viral community functional richness based on season and proximity to shore.

Rarefaction analysis of hits to protein clusters from: (A) 11 POV metagenomes from a single LineP open ocean site (station P26) (B) 11 POV metagenomes from LineP stations P4, P12, and P26 from a single research cruise (June 2009). To be conservative, only protein clusters with >20 members were used in these analyses.

Viral functional richness measurements in June at LineP from coastal to open ocean

To examine viral functional richness across the depth continuum (10 m to 2000 m) and with proximity to shore (coastal vs open ocean), 11 metagenomes from the LineP transect stations P4, P12, and P26 from a single research cruise (June 2009) were analyzed by protein cluster/rarefaction analysis. Broadly, we found that aphotic samples from coastal to open ocean showed the same overall functional richness. Photic samples, however, were more functionally rich at the coast (station P4) as compared to oligotrophic intermediate and open ocean water samples (stations P12 and P26) that are similar in terms of their overall environmental chemistry. These data suggest that coastal photic waters are more functionally rich than photic open ocean waters, but the same pattern is not evident in the aphotic ocean. Given that functional richness differed by photic zone in the smaller subset of LineP samples examined above, we analyzed the complete POV dataset by photic zone using a protein cluster/rarefaction analysis (Figure 4A and B). In the photic rarefaction analysis (Figure 4A), we found that deeply sequenced SIO samples were the most functionally rich. Though the rarefaction analysis should normalize the samples in terms of sequencing effort, the limitation we placed on our analysis to include only 20+ member clusters may have included rare SIO clusters that are highly represented due to the exceptional sequencing effort for this single sample. The samples with the next highest functional richness came from MBARI samples in the same current system as a local upwelling, followed by samples from fall/winter and spring/summer. We noted several exceptions to these general trends. First, the LineP spring coastal sample, L.Spr.C.10 m, grouped more closely to fall/winter samples, likely due to the higher functional richness noted previously in photic coastal samples. 
Secondly, the MBARI fall open ocean sample taken from waters in a deep chlorophyll maximum (DCM), M.Fall.O.105 m, grouped more closely to fall/winter samples, which could be due to increased functional richness in the DCM. Yet, we could not confirm this given the low sequencing effort and limited trend information in the rarefaction curve from the other DCM sample, M.Fall.I.42 m. In the aphotic rarefaction analysis (Figure 4B), winter/fall/spring samples were the most functionally rich, followed by summer.
Figure 4

Viral community functional richness in the Pacific Ocean.

Rarefaction analysis of hits to protein clusters from all POV metagenomes in (A) photic zone samples and (B) aphotic zone samples. To be conservative, only protein clusters with >20 members were used in these analyses.

In summary, functional richness decreased (i) with distance from shore in surface waters only, (ii) from winter to summer, and (iii) from deep to surface waters. These data mirrored patterns seen in marine and brackish bacteria where samples decreased in diversity (i) along a river to ocean gradient [49], (ii) from winter to summer [50], and (iii) from deep to surface [49]. Our data provide a powerful new look at all of these features in a single analysis for viral metagenomes.

Limitations of the POV Dataset

While taking great strides forward in providing a large-scale quantitative viral metagenomic dataset, POV is also not without biases and limitations. First, the dataset excludes ssDNA phages as these viral concentrates were purified using both DNase and cesium chloride banding (CsCl 1.35–1.5 g/ml), where ssDNA phages are in the lighter fractions [51], [52]. While most detectable viruses (DNA-stained counts suggested >99%) were in these fractions, it is now clear that ssDNA phages are often not detectable by such staining procedures [52]. Second, the dataset does not include RNA viruses as nucleic acid extractions were optimized for DNA and the linker amplification enzymes are specific for DNA. There are now methods which allow simultaneous purification of both RNA and DNA viruses [53], and already a road-map for constructing RNA viral metagenomes from the oceans [54]. However, protocols to isolate, purify, and amplify RNA for viral metagenomes have not been rigorously evaluated as was done for dsDNA viruses [31]–[34]. Third, the dataset excludes larger viruses as the source waters for viral community concentration were 0.22 µm pre-filtered to remove bacteria. Fourth, the average read length of ∼300 bp, while an improvement upon the other large-scale viral metagenomic datasets available, will undoubtedly be improved upon as sequencing technologies advance. Finally, one sample is anomalously enriched for “bacteria” in the dataset. Specifically, while samples averaged 4.4%±3.7% bacterial hits per sample across the 32 sample dataset, a single sample showed elevated bacterial hits (24.1% for L.Spr.C.1000 m; Table 5; Figure S1). In the L.Spr.C.1000 m sample, 13.7% of reads matched bacteria from the family Rhodobacteraceae and 7.0% matched Pseudoalteromonadaceae representing the majority of bacterial hits (20.7% in these families as compared to 24.1% total). 
Further, the majority of reads (∼20% of total reads) from these two bacterial families mapped to just three genomes: Sulfitobacter sp. EE-36, Pseudoalteromonas sp. SM9913, Sulfitobacter sp. NAS-14.1, with 9, 7, and 4% of the hits, respectively. The locations of these “bacterial” reads in the reference genomes showed that the hits matched throughout the genomes indicating that these “bacterial” hits were not simply integrated phage DNA in the bacterial genomes, and must instead be either bacterial contamination that survived the DNase and CsCl purification steps or gene transfer agents (GTA, [55]–[57]). Given the extensive purification (cesium-chloride banding and DNase treatment) of the viral concentrate, we favor the hypothesis that this sample contains a higher GTA content than the rest of the samples. Regardless of the source of the microbial DNA, the L.Spr.C.1000 m sample should be used with caution, particularly for analyses related to auxiliary metabolic genes (AMGs, sensu [58]) or other genetic data known to co-occur in bacteria.
Table 5

Percentage of POV reads assigned to superkingdoms.

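| sample | viruses | bacteria | eukaryota | archaea | unknown |
| L.Win.O.1000 m | 2.8% | 2.9% | 0.7% | 0.0% | 93.6% |
| L.Win.O.2000 m | 2.5% | 2.7% | 1.0% | 0.0% | 93.8% |
| L.Spr.C.500 m | 5.0% | 4.0% | 0.9% | 0.0% | 90.0% |
| L.Spr.C.1000 m | 2.9% | 24.1% | 0.8% | 0.0% | 72.2% |
| L.Spr.C.1300 m | 2.7% | 2.5% | 0.4% | 0.0% | 94.4% |
| L.Spr.I.500 m | 5.2% | 5.6% | 1.3% | 0.0% | 87.9% |
| L.Spr.I.1000 m | 4.0% | 3.8% | 0.9% | 0.0% | 91.3% |
| L.Spr.I.2000 m | 5.9% | 3.2% | 0.7% | 0.0% | 90.2% |
| L.Sum.O.500 m | 3.5% | 4.3% | 2.0% | 0.0% | 90.2% |
| L.Sum.O.1000 m | 2.0% | 2.9% | 7.6% | 0.0% | 87.5% |
| L.Sum.O.2000 m | 3.1% | 4.1% | 2.1% | 0.0% | 90.6% |
| L.Spr.O.1000 m | 3.2% | 2.9% | 0.7% | 0.0% | 93.2% |
| L.Spr.O.2000 m | 5.0% | 3.7% | 2.0% | 0.0% | 89.3% |
| L.Win.O.500 m | 2.9% | 2.7% | 0.7% | 0.0% | 93.7% |
| M.Fall.O.1000 m | 2.7% | 3.4% | 0.5% | 0.0% | 93.3% |
| M.Fall.O.4300 m | 2.6% | 2.9% | 0.6% | 0.0% | 93.9% |
| GD.Spr.C.8 m | 12.4% | 4.6% | 0.4% | 0.0% | 82.6% |
| GF.Spr.C.9 m | 9.7% | 4.5% | 0.4% | 0.0% | 85.3% |
| L.Sum.O.10 m | 4.5% | 4.0% | 0.9% | 0.0% | 90.6% |
| L.Spr.C.10 m | 5.6% | 2.7% | 0.6% | 0.0% | 91.1% |
| L.Spr.I.10 m | 4.7% | 2.9% | 1.2% | 0.0% | 91.2% |
| L.Spr.O.10 m | 5.6% | 5.1% | 1.7% | 0.0% | 87.6% |
| L.Win.O.10 m | 6.0% | 3.3% | 0.8% | 0.0% | 89.9% |
| M.Fall.C.10 m | 8.0% | 4.1% | 0.7% | 0.0% | 87.2% |
| M.Fall.I.10 m | 7.4% | 3.4% | 0.6% | 0.0% | 88.6% |
| M.Fall.I.42 m | 5.3% | 6.2% | 2.8% | 0.0% | 85.8% |
| M.Fall.O.10 m | 5.5% | 3.5% | 0.5% | 0.0% | 90.6% |
| M.Fall.O.105 m | 15.3% | 3.1% | 0.4% | 0.0% | 81.2% |
| SFC.Spr.C.5 m | 10.7% | 4.7% | 0.9% | 0.0% | 83.6% |
| SFD.Spr.C.5 m | 9.1% | 3.6% | 0.7% | 0.0% | 86.5% |
| SFS.Spr.C.5 m | 9.0% | 3.8% | 0.9% | 0.0% | 86.2% |
| STC.Spr.C.5 m | 7.3% | 4.1% | 0.7% | 0.0% | 87.9% |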

Percentage of POV reads that were taxonomically assigned based on matches to proteins in SIMAP and curated as described in the methods by sample.
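
The bacterial-hit summary quoted in the text (4.4%±3.7% across 32 samples, with L.Spr.C.1000 m standing out) can be checked directly from the bacteria column of Table 5. The following is a minimal sketch, not the authors' code: the function name `flag_outliers` and the z-score cutoff of 3 are assumptions made for illustration, and the sample standard deviation is assumed since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Bacterial read percentages per sample, transcribed from Table 5.
bacteria_pct = {
    "L.Win.O.1000 m": 2.9, "L.Win.O.2000 m": 2.7, "L.Spr.C.500 m": 4.0,
    "L.Spr.C.1000 m": 24.1, "L.Spr.C.1300 m": 2.5, "L.Spr.I.500 m": 5.6,
    "L.Spr.I.1000 m": 3.8, "L.Spr.I.2000 m": 3.2, "L.Sum.O.500 m": 4.3,
    "L.Sum.O.1000 m": 2.9, "L.Sum.O.2000 m": 4.1, "L.Spr.O.1000 m": 2.9,
    "L.Spr.O.2000 m": 3.7, "L.Win.O.500 m": 2.7, "M.Fall.O.1000 m": 3.4,
    "M.Fall.O.4300 m": 2.9, "GD.Spr.C.8 m": 4.6, "GF.Spr.C.9 m": 4.5,
    "L.Sum.O.10 m": 4.0, "L.Spr.C.10 m": 2.7, "L.Spr.I.10 m": 2.9,
    "L.Spr.O.10 m": 5.1, "L.Win.O.10 m": 3.3, "M.Fall.C.10 m": 4.1,
    "M.Fall.I.10 m": 3.4, "M.Fall.I.42 m": 6.2, "M.Fall.O.10 m": 3.5,
    "M.Fall.O.105 m": 3.1, "SFC.Spr.C.5 m": 4.7, "SFD.Spr.C.5 m": 3.6,
    "SFS.Spr.C.5 m": 3.8, "STC.Spr.C.5 m": 4.1,
}


def flag_outliers(values_by_sample, z_cutoff=3.0):
    """Return (mean, sample SD, samples whose z-score exceeds z_cutoff).

    The cutoff is a hypothetical choice for illustration only.
    """
    values = list(values_by_sample.values())
    mu = mean(values)
    sd = stdev(values)
    flagged = [s for s, v in values_by_sample.items() if (v - mu) / sd > z_cutoff]
    return mu, sd, flagged


mu, sd, flagged = flag_outliers(bacteria_pct)
print(f"mean = {mu:.1f}%, SD = {sd:.1f}%, flagged = {flagged}")
```

Under these assumptions the only sample flagged is L.Spr.C.1000 m, in line with the caution given in the text.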

Conclusions

Over the last two decades, viruses have emerged as abundant, diverse and biogeochemically important members of nearly every ecosystem. In spite of this importance, mapping the ocean virome has been stifled by technical challenges, limited sampling opportunities, and a lack of database and analytical resources. The quantitative dataset and PC organization documented here provide an invaluable mapping resource for future comparative viral metagenomic research. Looking forward, marine viral ecology stands at a tipping point wherein the “unseen” majority [13] can now be spatiotemporally documented using large-scale metagenomic sequencing at a reduced cost. These ever-growing datasets, in combination with emerging information on novel phyla of microbial hosts [59], [60], transformative experimental methods (e.g., single viral genomics [61], microfluidic digital PCR [62], viral tagging [63], phageFISH [64]), and k-mer-based annotation techniques to rapidly assign function [65], offer new windows into viral diversity across spatial and temporal scales that can be inter-connected with paired microbial datasets to link viruses and their hosts. Although the datasets and analyses are formidable, the curated PC dataset provided here should ease future adventures into the ‘unknown’ and lead to a better understanding of the dynamic microbial and viral processes that drive the biogeochemistry that fuels the planet.

Materials and Methods

Sample Collection, DNA Isolation, Linker Amplification, and Purification

Each sample, from ∼20–50 L of seawater, was pre-filtered using a 150 µm GF/A filter followed by a 0.22 µm, 142 mm diameter Express Plus filter. All filtrates were concentrated by FeCl3-precipitation and purified by DNase+CsCl. Comparison samples from SIO also had additional treatments as follows: (i) Tangential Flow Filtration (TFF) and DNase+CsCl, (ii) FeCl3-precipitation and DNase only, and (iii) FeCl3-precipitation and DNase+Sucrose as previously described [33]. DNA was extracted from the concentrated and purified viral particles using Wizard PCR DNA Purification Resin and Minicolumns as previously described [66]. The resulting DNA samples were randomly sheared and amplified using linker amplification (LA) as described previously [34]. Linker-amplified VLP DNA samples were sequenced using GS FLX Titanium sequencing chemistry on a 454 Genome Sequencer (http://www.454.com). Sequences were quality filtered using a custom pipeline written in Perl and bash shell and executed on a high performance computer running PBSPro to distribute jobs (screenpipe.tar). Briefly, sequences were removed that (i) had an “N” anywhere in the sequence, or (ii) deviated by more than two standard deviations from the mean length or mean read quality score, using protocols proposed by Huse et al. [67]. Artificial duplicates were removed from the pyrosequencing runs using the program cdhit-454 version 4.5.5 with default parameters [68]. All sequences were deposited to CAMERA (http://camera.calit2.net) under the following project accessions: CAM_P_0000914 and CAM_P_0000915.
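
The read-filtering rules described above (drop reads containing an “N”, then drop reads whose length or quality lies more than two standard deviations from the mean) can be sketched as follows. This is an illustrative Python sketch, not the authors' Perl pipeline: the function name `quality_filter` and the read representation are hypothetical, and it assumes that the means and standard deviations are computed over the N-free reads and that quality is summarized as the per-read mean score.

```python
from statistics import fmean, pstdev


def quality_filter(reads, n_sd=2.0):
    """Filter reads given as (read_id, sequence, per-base quality list).

    Assumptions (not stated in the source): statistics are computed after
    removing reads with ambiguous bases, and the population SD is used.
    """
    # Rule (i): discard any read with an ambiguous base call.
    clean = [r for r in reads if "N" not in r[1].upper()]
    if not clean:
        return []

    lengths = [len(seq) for _, seq, _ in clean]
    mean_quals = [fmean(quals) for _, _, quals in clean]
    len_mu, len_sd = fmean(lengths), pstdev(lengths)
    q_mu, q_sd = fmean(mean_quals), pstdev(mean_quals)

    # Rule (ii): discard reads beyond n_sd standard deviations in length or quality.
    kept = []
    for read, length, mq in zip(clean, lengths, mean_quals):
        if abs(length - len_mu) > n_sd * len_sd:
            continue
        if abs(mq - q_mu) > n_sd * q_sd:
            continue
        kept.append(read)
    return kept


# Toy data: eight typical reads, one with an N, one short, one low quality.
toy_reads = [(f"r{i}", "A" * 300, [30] * 300) for i in range(8)]
toy_reads.append(("r8", "A" * 150 + "N" + "A" * 149, [30] * 300))
toy_reads.append(("r9", "A" * 60, [30] * 60))
toy_reads.append(("r10", "A" * 300, [5] * 300))

print([r[0] for r in quality_filter(toy_reads)])
```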

Assembly and ORF Finding

Metagenomic assembly and ORF calling were conducted using a custom pipeline written in Perl and bash shell and executed on a high performance computer running GridEngine to distribute jobs (ivelvet2_orfpipeline.tar). First, we removed singletons by finding reads that had a 20-mer frequency equal to zero using the vmatch package version 2.1.5 (kmer size = 20; http://www.vmatch.de/), because by definition they cannot contribute to overlap in an assembly [69]. Second, to skew the assembly towards assembling dominant species first and less dominant members in subsequent rounds of assembly, velvet version 1.0.15 (hash length = 29, -long) [70] was used to iteratively assemble sequences based on their k-mer frequency in 2+, 4+, 6+, 10+ bins. Third, the contigs from each frequency bin were merged to remove exact duplicates using cdhit version 4.5.5, requiring 100% identity across the smallest contig [68]. Finally, the non-redundant, maximally assembled contig dataset, along with the individual reads, was used as input for ORF prediction using the metagenomic mode in Prodigal [71]. All ORFs that were non-redundant and >60 amino acids in length were retained for further analysis, similar to GOS [19].
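
The singleton-removal step can be illustrated with a small k-mer counting sketch. It is not the vmatch-based procedure used in the pipeline; it assumes that a read with "a 20-mer frequency equal to zero" is one that shares no k-mer with any other read, and for brevity it ignores reverse complements. The function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def drop_singletons(reads, k=20):
    """Keep only reads that share at least one k-mer with another read.

    Simplifications (assumptions): forward strand only, exact k-mer matches.
    Reads shorter than k have no k-mers and are therefore dropped.
    """
    read_kmers = [kmers(r, k) for r in reads]
    # For each k-mer, count how many distinct reads contain it.
    support = Counter(km for ks in read_kmers for km in ks)
    return [
        read
        for read, ks in zip(reads, read_kmers)
        if any(support[km] > 1 for km in ks)
    ]


# Toy example with k = 5: the first two reads overlap, the third does not.
toy = ["ACGTACGTAA", "GTACGTAACC", "TTTTTTTTTT"]
print(drop_singletons(toy, k=5))
```

Reads removed at this stage cannot overlap any other read at the chosen k, which is why excluding them before assembly loses no assembly information.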

Protein Clustering

ORFs were clustered based on sequence similarity in a two-step process using cd-hit version 4.5.5 [68], run through a custom pipeline written in Perl and bash shell and executed on a high performance computer using GridEngine to distribute jobs (protuniversepipeline.tar). First, ORFs were mapped to known PCs from the Global Ocean Sampling dataset (GOS; [19]) and to known phage protein sequences using cd-hit-2d (‘-g 1 -n 4 -d 0 -T 24 -M 45000’; 60% identity and 80% coverage). The proteins included in this initial clustering were GOS core cluster proteins (http://camera.calit2.net) and 33,857 proteins (NCBI) from known phage genomes downloaded on July 7, 2011, which were mapped to the associated SIMAP proteins for additional annotation information. Second, ORFs that did not match known GOS clusters or phage genes from SIMAP were self-clustered using cd-hit as above. All ORFs, PCs and annotation are available as a public resource on the CAMERA website (http://camera.calit2.net) under the project accession: CAM_P_0000915.
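
The two-step logic (map to known clusters first, then self-cluster the remainder) can be sketched in Python as below. This is a conceptual illustration only. The similarity test `is_match` is a toy stand-in built on `difflib` and is not cd-hit's algorithm; the function names, the coverage definition (fraction of the query spanned by matching blocks) and the example sequences are all assumptions. The second step follows the greedy, longest-first scheme that cd-hit is known for.

```python
from difflib import SequenceMatcher


def is_match(query, rep, min_identity=0.60, min_coverage=0.80):
    """Toy similarity test standing in for cd-hit's alignment (assumption)."""
    matcher = SequenceMatcher(None, query, rep, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return False
    matches = sum(b.size for b in blocks)
    # Region of the query spanned from the first to the last matching block.
    span = blocks[-1].a + blocks[-1].size - blocks[0].a
    return matches / span >= min_identity and span / len(query) >= min_coverage


def two_step_cluster(orfs, known_reps):
    """Step 1: assign ORFs to known clusters. Step 2: self-cluster the rest."""
    mapped, unmatched = {}, []
    for orf_id, seq in orfs.items():
        hit = next(
            (cid for cid, rep in known_reps.items() if is_match(seq, rep)), None
        )
        if hit is None:
            unmatched.append(orf_id)
        else:
            mapped[orf_id] = hit

    # Greedy incremental clustering: longest sequences become representatives.
    novel = []  # each cluster is a list whose first member is the representative
    for orf_id in sorted(unmatched, key=lambda i: len(orfs[i]), reverse=True):
        for cluster in novel:
            if is_match(orfs[orf_id], orfs[cluster[0]]):
                cluster.append(orf_id)
                break
        else:
            novel.append([orf_id])
    return mapped, novel


# Hypothetical sequences for illustration only.
known = {"PC_known_1": "MKTAYIAKQRQISFVKSHFSRQ"}
toy_orfs = {
    "orf1": "MKTAYIAKQRQISFVKSHFSRQ",
    "orf2": "GGGGGWWWWWPPPPPHHHHH",
    "orf3": "GGGGGWWWWWAPPPPHHHHH",
    "orf4": "CCCCCDDDDDEEEEEFFFFF",
}
print(two_step_cluster(toy_orfs, known))
```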

Taxonomic Classification

BLASTX [72] was used to assign taxonomy to ORFs and sequence reads by comparison to the Similarity Matrix of Proteins (SIMAP, [45], June 25th, 2011 release) using an analysis pipeline written in Perl and bash shell and executed on a high performance computer using PBSPro to distribute jobs (blastpipeline_simap.tar). Taxonomy was assigned based on the species taxonomy ID listed in SIMAP and at the superfamily, family and genus levels using the NCBI taxonomy hierarchy for that species ID. Data curation consisted of re-assigning hits to “uncultured” organisms to their next top match, as well as examining missing family and genus level data to create a curated subset of the NCBI taxonomy records for the most abundant viruses [33].
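
The curation rule for “uncultured” hits can be sketched as a small selection function. This is an illustration rather than the authors' pipeline: the function name `assign_taxon`, the hit representation, and the fallback to the top hit when every hit is “uncultured” are assumptions, and the organism names are hypothetical placeholders.

```python
def assign_taxon(hits, skip_label="uncultured"):
    """Pick the best-scoring hit whose organism name is not 'uncultured'.

    hits: list of (organism_name, bit_score) tuples in any order.
    Falls back to the top hit if all hits are 'uncultured' (an assumption),
    and returns None when there are no hits.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    for name, _score in ranked:
        if skip_label not in name.lower():
            return name
    return ranked[0][0] if ranked else None


# Hypothetical hits for one read.
example_hits = [
    ("uncultured marine virus", 250.0),
    ("hypothetical phage A", 240.0),
    ("hypothetical bacterium B", 100.0),
]
print(assign_taxon(example_hits))
```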

Rarefaction Analysis

All high quality metagenomic reads in the POV dataset (Table 2) were compared to ORFs in the 20+ member protein clusters using BLASTX (E value <0.001). We then generated hit counts to the protein clusters and used the data for further rarefaction analysis using the rarefaction calculator (http://www.biology.ualberta.ca/jbrzusto/rarefact.php). All protocols are available at http://eebweb.arizona.edu/Faculty/mbsulli/protocols.htm, and scripts and associated documentation are archived at http://code.google.com/p/tmpl/.

Supporting Information

Family taxonomic profile across POV samples by photic zone. Note that these data represent only those metagenomic reads that had a significant hit to the SIMAP database. (TIFF)

(A) Genus and (B) species taxonomic profile for all POV samples combined by photic zone. Note that these data represent only those metagenomic reads that had a significant hit to the SIMAP database. (TIFF)

Marine Phage Genomes in NCBI Genbank as of November 2012. (DOCX)
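
For readers who want to reproduce the kind of curve described in the Rarefaction Analysis subsection, the sketch below computes expected protein-cluster richness from per-cluster hit counts using the standard analytic rarefaction formula (Hurlbert's formulation). The online calculator's exact method is not described in the text, so this is an assumed equivalent rather than the same implementation; the function names and toy counts are hypothetical, and the exact-integer arithmetic is practical only for modest read totals.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of clusters seen in a random subsample of n hits.

    counts: hits per protein cluster; uses E[S_n] = sum(1 - C(N - N_i, n) / C(N, n)).
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of hits")
    denom = comb(total, n)
    return sum(1 - comb(total - c, n) / denom for c in counts)


def rarefaction_curve(counts, step):
    """Return (subsample size, expected richness) pairs up to the full depth."""
    total = sum(counts)
    depths = list(range(0, total + 1, step))
    if depths[-1] != total:
        depths.append(total)
    return [(n, expected_richness(counts, n)) for n in depths]


# Toy hit counts for three clusters.
for depth, richness in rarefaction_curve([5, 3, 2], step=2):
    print(depth, round(richness, 3))
```

A curve that flattens as depth approaches the full dataset suggests that most clusters have been sampled; a curve still rising indicates undersampling.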
  67 in total

1.  Cultivated single-stranded DNA phages that infect marine Bacteroidetes prove difficult to detect with DNA-binding stains.

Authors:  Karin Holmfeldt; Duško Odić; Matthew B Sullivan; Mathias Middelboe; Lasse Riemann
Journal:  Appl Environ Microbiol       Date:  2011-12-02       Impact factor: 4.792

Review 2.  Metagenomics: application of genomics to uncultured microorganisms.

Authors:  Jo Handelsman
Journal:  Microbiol Mol Biol Rev       Date:  2004-12       Impact factor: 11.056

3.  Functional metagenomic profiling of nine biomes.

Authors:  Elizabeth A Dinsdale; Robert A Edwards; Dana Hall; Florent Angly; Mya Breitbart; Jennifer M Brulc; Mike Furlan; Christelle Desnues; Matthew Haynes; Linlin Li; Lauren McDaniel; Mary Ann Moran; Karen E Nelson; Christina Nilsson; Robert Olson; John Paul; Beltran Rodriguez Brito; Yijun Ruan; Brandon K Swan; Rick Stevens; David L Valentine; Rebecca Vega Thurber; Linda Wegley; Bryan A White; Forest Rohwer
Journal:  Nature       Date:  2008-03-12       Impact factor: 49.962

4.  Lysogenic virus-host interactions predominate at deep-sea diffuse-flow hydrothermal vents.

Authors:  Shannon J Williamson; S Craig Cary; Kurt E Williamson; Rebekah R Helton; Shellie R Bench; Danielle Winget; K Eric Wommack
Journal:  ISME J       Date:  2008-08-21       Impact factor: 10.302

Review 5.  The microbial engines that drive Earth's biogeochemical cycles.

Authors:  Paul G Falkowski; Tom Fenchel; Edward F Delong
Journal:  Science       Date:  2008-05-23       Impact factor: 47.728

6.  Metagenomic characterization of Chesapeake Bay virioplankton.

Authors:  Shellie R Bench; Thomas E Hanson; Kurt E Williamson; Dhritiman Ghosh; Mark Radosovich; Kui Wang; K Eric Wommack
Journal:  Appl Environ Microbiol       Date:  2007-10-05       Impact factor: 4.792

7.  Comparative metagenomics of microbial traits within oceanic viral communities.

Authors:  Itai Sharon; Natalia Battchikova; Eva-Mari Aro; Carmela Giglione; Thierry Meinnel; Fabian Glaser; Ron Y Pinter; Mya Breitbart; Forest Rohwer; Oded Béjà
Journal:  ISME J       Date:  2011-02-10       Impact factor: 10.302

8.  SIMAP: the similarity matrix of proteins.

Authors:  Thomas Rattei; Roland Arnold; Patrick Tischler; Dominik Lindner; Volker Stümpflen; H Werner Mewes
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

9.  The marine viromes of four oceanic regions.

Authors:  Florent E Angly; Ben Felts; Mya Breitbart; Peter Salamon; Robert A Edwards; Craig Carlson; Amy M Chan; Matthew Haynes; Scott Kelley; Hong Liu; Joseph M Mahaffy; Jennifer E Mueller; Jim Nulton; Robert Olson; Rachel Parsons; Steve Rayhawk; Curtis A Suttle; Forest Rohwer
Journal:  PLoS Biol       Date:  2006-11       Impact factor: 8.029

10.  Contrasting life strategies of viruses that infect photo- and heterotrophic bacteria, as revealed by viral tagging.

Authors:  Li Deng; Ann Gregory; Suzan Yilmaz; Bonnie T Poulos; Philip Hugenholtz; Matthew B Sullivan
Journal:  MBio       Date:  2012-10-30       Impact factor: 7.867

