Literature DB >> 23468974

The Pacific Ocean virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology.

Abstract

Bacteria and their viruses (phage) are fundamental drivers of many ecosystem processes including global biogeochemistry and horizontal gene transfer. While databases and resources for studying function in uncultured bacterial communities are relatively advanced, many fewer exist for their viral counterparts. The issue is largely technical in that the majority (often 90%) of viral sequences are functionally 'unknown' making viruses a virtually untapped resource of functional and physiological information. Here, we provide a community resource that organizes this unknown sequence space into 27 K high confidence protein clusters using 32 viral metagenomes from four biogeographic regions in the Pacific Ocean that vary by season, depth, and proximity to land, and include some of the first deep pelagic ocean viral metagenomes. These protein clusters more than double currently available viral protein clusters, including those from environmental datasets. Further, a protein cluster guided analysis of functional diversity revealed that richness decreased (i) from deep to surface waters, (ii) from winter to summer, (iii) and with distance from shore in surface waters only. These data provide a framework from which to draw on for future metadata-enabled functional inquiries of the vast viral unknown.

Entities: CellLine Chemical Disease Species

Mesh：

Year: 2013 PMID： 23468974 PMCID： PMC3585363 DOI： 10.1371/journal.pone.0057355

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Bacteria are fundamental to life on Earth and drive energy and nutrient cycles in natural systems [1]. In the surface oceans, viruses, thought to be predominantly phages, and are present at approximately 107 particles per milliliter of seawater often outnumbering their bacterial hosts fifteen to one [2]–[4]. These abundant phages play an integral role in the lifecycle, development and evolution of their diverse hosts (reviewed in [5]). This is particularly well documented in ocean cyanophages where natural abundances are high [6]–[9], and the eco-evolutionary virus-host dynamics involve manipulation of critical aspects of ocean life (e.g., photosynthesis, phosphate, nitrogen; reviewed in [10]). Despite this importance, our understanding of indigenous viral and microbial communities is compromised by both conceptual and technological challenges. Perhaps chief among them is that <1% of microbes in an environment can be routinely grown in the laboratory [11], [12]. To study this ‘unseen majority’ [13], microbial ecologists developed culture-independent techniques, first in the form of gene markers (e.g., small subunit ribosomal DNA, [14]) to quantify biodiversity and subsequently through metagenomics to study metabolic function and gene ecology [12], [15]–[17]. To this end, resources such as the Global Ocean Sampling (GOS) dataset [18] have been developed, and subsequently organized using sequence-based protein clusters (PCs) [19] to (i) group related proteins based on sequence similarity, (ii) map metadata associated with the samples to the PCs, and (iii) aid future inquires in genetic novelty by mapping to pre-defined PCs. The GOS dataset represents 7.7 million reads from 41 surface ocean microbial (0.1–0.8 µm size fraction) samples ranging 8,000 km along coastal Atlantic Ocean eastern United States waters to the Sargasso Sea, Gulf of Mexico, Caribbean Sea and a few sites in the South Pacific Ocean. This broad sampling has enabled GOS to greatly advance microbial ecology (849 citations as of 6 Sept 2012; Google Scholar) both through data mining studies and use of GOS to contextualize environmental research findings. While metagenomics was first applied to uncultured viral communities [20], two years in advance of microbes [15]–[17], there remains no such equivalent GOS-scale datasets (Table 1). Such a resource is missing because of limiting DNA yields from environmental viruses and the fact that viral metagenomic analytics are challenging. On the latter, viral metagenomics is more difficult than microbial metagenomics as (i) viral genome representation in public databases is more sparse (as of November 2012 there are 115 marine phage genomes in Genbank, Table S1), (ii) analytical tools paralleling those for microbes (e.g., MG-RAST [21] and IMG/M [22]) have not been available (but see VIROME [23] and MetaVir [24]) and (iii) the ‘unknown’ problem is even greater in viral metagenomes as often 90% of sequences lack similarity to anything in pre-existing protein databases (see Table 1 versus 65% for microbes [17]). Worse, the viral metagenomic datasets themselves often suffer either from being too small (e.g., LASL-based Sanger-sequenced metagenomes) or contain technical artifacts that are now known to render the metagenomes non-quantitative. To emphasize the latter, we highlight a commonly used marine viral dataset (363 citations as of 6 Sept 2012; Google Scholar) that includes 1.77 million metagenome sequences from 4 major ocean regions [25]. While this dataset was generated using the best available technologies and is the gold-standard for contextualizing viral ecological findings, it is now known to suffer from two major technical issues: (i) pooled samples (4 metagenomes from 184 viral assemblages collected over a decade from 68 sites) inhibit tracking of variability over space and time, and (ii) MDA-amplified material is now known to lead to non-quantitative metagenomes [26]. This latter point is particularly problematic for viruses as MDA biases are both stochastic [26] and systematic [27], [28] with the latter related to nucleic acid type and structure (i.e., a feature which varies across viral types). For these reasons, there are currently no marine viral metagenomic datasets appropriate for quantitative viral ecology.

Table 1

Summary of published marine, non-coral viral metagenomic datasets.

site	Scripps Pier and Mission Bay,San Diego CA	Mission Bay, San Diego, CA	Jericho Pier andStrait of Georgia,British Columbia	BBC, GOM, ARC, SAR	Chesapeake Bay	Line Islands, Skan Bay	Tampa Bay, induced virome	Eutrophic MontereyBay, CA 200 m depth	East Pacific Risediffuse-flow viralcommunity	POV
# metagenomes	1	1	2	4	1	13	1	1	1	32
Total # reads	1 934	1 156	277	1 768 297	5 641	2 258 386	294 068	881	176	6 020 088
avg. len. (bp)	>600	>600	>600	102	695	<105	102	>600	488	310
sequencing platform	Sanger	Sanger	Sanger	454 GS20	Sanger	454 GS20	454 GS20	Sanger	Sanger	454 Titanium
viral fraction (µm)	<0.16	0.2	0.22	0.2	0.2	0.2	0.2	0.2	N/A	0.2
concentration	TFF	TFF	TFF+ultracent.	TFF	TFF	TFF	TFF	TFF	TFF+ultracent.	FeCl₃-ppt
purification	CsCl	CsCl	RNAse	CsCl	N/A	CsCl	CsCl	CsCl	N/A	CsCl+DNase
DNA amplification	LASLs	LASLs	cDNA-based LASLs	MDA	LASLs	MDA	MDA	N/A	LASLs	LA
% unknown	65	75	63–81	91	70	90	93	74	75	89
reference	[20]	[73]	[74]	[25]	[41]	[42]	[75]	[30]	[43]	This publication

Abbreviations are as follows: TFF – Tangential Flow Filtration, ultracent. – ultracentrifugation, FeCl3-ppt – iron-chloride precipitation, CsCl – cesium chloride density gradient, LASLs – linker amplified shotgun libraries, MDA – multiple displacement amplification, LA – linker amplification. Here we introduce a large-scale, quantitative Pacific Ocean Virome (POV) dataset and ∼456 K associated PCs that organize the ‘known’ and ‘unknown’ sequence space for future comparative viral metagenomic study. The 6 million read dataset is derived from 32 temporally- and spatially-resolved viral assemblages, and represents the largest viral metagenomic sampling of the Pacific Ocean to date, including the first large-scale viral metagenomes from the deep pelagic ocean (but see [29], [30] and Table 1). The POV dataset represents a systematically collected, processed, and documented quantitative marine viral metagenomic resource [31] as follows. All thirty-two POV communities were concentrated using a new method that captures nearly all particles [32], purified using DNase digestion and CsCl buoyant density gradients to minimize contamination by non-viral DNA [33], and DNA extracted and linker-amplified to minimize quantitative and cloning biases in the resulting metagenomes [34]. DNA was then sequenced by Roche 454 Titanium technology. The metagenomes and the associated PCs provide a much-needed community resource to test hypotheses about environmental viruses, as GOS has done for microbial ecology. For these reasons, POV will likely become a foundational dataset for future comparative studies of virus genes and communities at the global ocean scale such as those derived from the recent Tara Oceans [35] and Malaspina [36] expeditions.

Results and Discussion

The Pacific Ocean Samples

The 32 POV source waters varied by depth, proximity to land, and season and were derived from four regions in the Pacific Ocean (Figure 1): Scripps Pier in San Diego, California (SIO), Monterey Bay, California (MBARI), near Vancouver Island in British Columbia (LineP), and the Great Barrier Reef in Australia (GBR). Samples, metadata, and metagenomic descriptive statistics are summarized in Table 2.

Figure 1

Sampling site map for the POV dataset.

Table 2

Description of POV samples and associated metadata.

sample	# reads	Mbps	Avg. read length	Stdev read length	photic zone	region	depth (m)	site location	month	season	Oceanic features
L.Win.O.1000 m	147 537	40.5	274	80	aphotic	LineP	1000	open ocean	Feb.	winter	OMZ
L.Win.O.2000 m	125 896	33.4	265	86	aphotic	LineP	2000	open ocean	Feb.	winter
L.Spr.C.500 m	136 876	44.5	325	83	aphotic	LineP	500	coastal	Jun.	summer
L.Spr.C.1000 m	97 126	31.5	324	124	aphotic	LineP	1000	coastal	Jun.	summer	OMZ
L.Spr.C.1300 m	98 478	24.2	245	95	aphotic	LineP	1300	coastal	Jun.	summer
L.Spr.I.500 m	58 108	20.5	353	101	aphotic	LineP	500	intermediate	Jun.	summer
L.Spr.I.1000 m	122 565	45.9	374	88	aphotic	LineP	1000	intermediate	Jun.	summer	OMZ
L.Spr.I.2000 m	49 914	13.2	264	97	aphotic	LineP	2000	intermediate	Jun.	summer
L.Sum.O.500 m	42 118	13.6	322	122	aphotic	LineP	500	open ocean	Aug.	fall
L.Sum.O.1000 m	70 596	19.9	282	131	aphotic	LineP	1000	open ocean	Aug.	fall	OMZ
L.Sum.O.2000 m	68 516	18.6	271	122	aphotic	LineP	2000	open ocean	Aug.	fall
L.Spr.O.1000 m	101 179	28.4	280	115	aphotic	LineP	1000	open ocean	Jun.	summer	OMZ
L.Spr.O.2000 m	55 332	14.8	267	110	aphotic	LineP	2000	open ocean	Jun.	summer
L.Win.O.500 m	167 616	48.0	286	84	aphotic	LineP	500	open ocean	Feb.	winter
M.Fall.O.1000 m	225 833	66.3	293	105	aphotic	MBARI	1000	open ocean	Oct.	fall
M.Fall.O.4300 m	144 588	40.5	279	95	aphotic	MBARI	4300	open ocean	Oct.	fall
GD.Spr.C.8 m	116 855	29.0	247	72	photic	GBR	8	coastal	Oct.	spring
GF.Spr.C.9 m	82 739	20.9	252	73	photic	GBR	9	coastal	Oct.	spring
L.Sum.O.10 m	165 256	48.4	292	102	photic	LineP	10	open ocean	Aug.	fall
L.Spr.C.10 m	107 244	25.7	240	86	photic	LineP	10	coastal	Jun.	summer
L.Spr.I.10 m	92 415	22.6	244	92	photic	LineP	10	intermediate	Jun.	summer
L.Spr.O.10 m	75 036	19.5	259	106	photic	LineP	10	open ocean	Jun.	summer
L.Win.O.10 m	192 685	59.7	310	74	photic	LineP	10	open ocean	Feb.	winter
M.Fall.C.10 m	303 519	105.2	346	118	photic	MBARI	10	coastal	Oct.	fall	Upwelling
M.Fall.I.10 m	321 754	92.9	288	109	photic	MBARI	10	intermediate	Oct.	fall	Upwelling
M.Fall.I.42 m	31 528	10.9	346	114	photic	MBARI	42	intermediate	Oct.	fall	Upwelling; DCM
M.Fall.O.10 m	203 238	52.4	257	95	photic	MBARI	10	open ocean	Oct.	fall
M.Fall.O.105 m	156 509	44	281	101	photic	MBARI	105	open ocean	Oct.	fall	DCM
SFC.Spr.C.5 m	487 339	191.2	392	107	photic	SIO	5	coastal	Apr.	spring
SFD.Spr.C.5 m	645 463	218.7	338	119	photic	SIO	5	coastal	Apr.	spring
SFS.Spr.C.5 m	504 826	173.1	342	130	photic	SIO	5	coastal	Apr.	spring
STC.Spr.C.5 m	821 404	246.3	299	103	photic	SIO	5	coastal	Apr.	spring

Oceanic features include: OMZ = oxygen minimum zone, DCM = deep chlorophyll maximum, Upwelling = within a current system with upwelling.

Sampling site map for the POV dataset.

Thirty-two viral metagenomes represent discretely sampled and processed datasets that vary over time and space in the pelagic Pacific Ocean. (A) Overview of sampling sites, (B) GBR – Great Barrier Reef, Australia, near Dunk and Fitzroy Islands. (C) LineP- oceanographic transect off Vancouver Island, British Columbia (D) MBARI- Line67 oceanographic transect off of Monterey Bay, California (E) SIO- Scripps Pier, San Diego, CA. Images were created using Ocean Data View. Oceanic features include: OMZ = oxygen minimum zone, DCM = deep chlorophyll maximum, Upwelling = within a current system with upwelling. The defining ecological features of each dataset are as follows. Four SIO metagenomes were derived from a single coastal, surface water sample from Scripps Pier (San Diego, CA, April 2009, spring), the site of the first viral metagenome [20], that were differentially concentrated and purified (4 treatments, >2.5 M sequences; [33]). Seven MBARI metagenomes (∼1.4 M sequences) represent viruses concentrated from various depths along a long-standing oceanographic transect, Line67 [33], [37], [38], which spans coastal, upwelling and open ocean waters off Monterey Bay, California collected in fall, October 2009. Nineteen LineP metagenomes (∼2.0 M sequences) represent viruses concentrated from various depths along another long-standing oceanographic transect, LineP [39], which spans coastal-to-open-ocean waters, including the second largest ocean oxygen minimum zone [40], off British Columbia collected in February (winter), June (spring) and August (summer) 2009. Finally, two GBR metagenomes (∼0.2 M sequences) represent viral concentrates from the dry season near Dunk (Tully River impacted) and Fitzroy (less impacted) Islands at the Great Barrier Reef in Australia collected in October (spring) 2009. Together, these metagenomes represent a diversity of pelagic ocean features including oceanic region (SIO, GBR, MBARI, LineP), proximity to land (coastal to open ocean), season (spring, summer, fall, and winter), depth (10 m to 4300 m), primary productivity and oxygen concentration (variability in other physiochemical characteristics are not considered here).

Taxonomic Composition of POV Metagenomes

Long-standing questions in marine viral ecology are centered on understanding the extent to which viral assemblages change spatially, temporally and under different environmental conditions in the ocean [2]. Yet, given the paucity of known viruses in biological databases comparatively examining viral assemblages from diverse environments in the sea is stymied. As commonly observed in marine viral metagenomic studies [20], [25], [41]–[43], the majority (87% photic zone, 91% aphotic zone) of the reads could not be classified based on sequence similarity to known taxa (see Materials and Methods, Figure 2A). Moreover, we found a smaller fraction of reads that matched known viruses in the aphotic zone (3.3%) than the photic zone (8.3%) likely due to more sampling in the surface oceans (Figure 2A and Table 1).

Figure 2

The POV dataset and its place in the viral protein universe.

The POV dataset and its place in the viral protein universe.

(A) Summary superkingdom taxonomy statistics for quantitative Pacific Ocean viral metagenomes from 16 photic and 16 aphotic zone samples. Reads were taxonomically assigned based on matches to proteins in SIMAP and curated as described in the methods. (B) Venn diagram representing medium- to large membership PCs documents the relative contributions of the POV, GOS microbial, and SIMAP datasets to the ‘viral protein universe’. To examine the taxonomic composition of the POV metagenomes in greater detail, we classified the reads from each sample based on their top match at the family level (Figure S1). Overall, we found that metagenomes from samples in the photic zone had a larger proportion of reads that matched Myoviridae (an average of 4.2%±1.9%) than in the aphotic zone (an average of 1.6%±0.8%). Several samples in the deep ocean however, were enriched for Myoviridae including L.Spr.I.2000 m (3.3%) and L.Spr.O.2000 m (3.0%) that closely matched their photic counterparts L.Spr.I.10 m (3.1%) and L.Spr.O.10 m (4.1%). Also notable were the large fraction of reads matching Myoviridae at the deep chlorophyll maximum (DCM) in the open ocean in Monterey Bay, (9.6% for M.Fall.O.105 m) which is more than four times the fraction seen in the surface ocean from the same time point and station (1.9% for M.Fall.O.10 m). We also found a large fraction of sequences that matched Podoviridae in the DCM sample in the open ocean in Monterey Bay (4.2% for M.Fall.O.105 m) and in the surface samples from the Great Barrier Reef (3.8% for GF.Spr.C.9 m and 5.0% for GD.Spr.C.8 m) as compared to the 0.8%±0.5% on average in other samples. Thus, Podoviridae may play an important role in reef ecosystems and the DCM not presently unknown. Finally, we compared and contrasted known viruses at the genus and species level in the combined photic and aphotic samples (Figure S2A and B respectively). At the genus level, we found a higher fraction of T4- and T7-like viruses in the photic zone (6.9% total) than the aphotic zone (2.6% total). At the species level, we found a higher faction of synechococcus and prochlorococcus phages in the photic zone (4.6% total) than the aphotic zone (1.1% total).

The Protein Cluster as a Means to Organize Unknown Sequence Space

While this great ‘unknown’ problem is exacerbated in viral metagenomes, it has also plagued microbial metagenomic studies to the extent that previous analyses of the GOS dataset organized this sequence space, including unknowns, using protein clustering (sensu Yooseph et al., 2007 and 2008 [19], [44]; details in Materials and Methods). Here, as per Yooseph’s approach, we individually assembled each POV metagenome and identified open reading frames (ORFs) on both the contigs and individual reads, yielding ∼4.1 M non-redundant ORFs. These POV ORFs were clustered with ORFs from GOS core clusters (3,625,128 ORFs, [19] of both microbial and viral origin, as well as genes from SIMAP phage genomes (33,857 ORFs, [45] – in total ∼7.8 M ORFs. Given that database representation of viral sequences is sparse at best (e.g., GOS represents mostly microbial-fraction not viral core clusters) and the POV samples represent predominantly unexplored ocean regions, it is not surprising that most (78%) POV ORFs fail to cluster with known PCs (Table 3). Self-clustering the unmapped POV ORFs further organized this unknown sequence space (i.e., another 55% of POV ORFs were clustered), such that only 23% of POV ORFs remained as singletons. These singletons could either represent artifact or more likely are members of the “rare biosphere” [46] under-sampled in this data set due to their rarity.

Table 3

POV ORF recruitment.

photic zone	# ORFs	clustered with GOS/SIMAP phage	self-clustered	singletons	total clustered
photic	2 783 784	31%±4%	47%±9%	22%±7%	78%±7%
aphotic	1 323 811	13%±6%	62%±12%	25%±10%	75%±10%
all	4 107 595	22%±10%	55%±13%	23%±9%	77%±9%

The fraction of POV ORFs that non-redundantly recruited to existing (GOS/known phages) PCs versus those that recruited to PCs derived from the POV dataset (self clustered). POV ORFs are non-redundant by metagenome. The standard deviation values refer to differences in the fraction of ORFS clustered in the POV samples. In total, we identified 456,420 PCs that contained two or more non-redundant members (12,226+1,557+442,637 PCs derived from GOS+POV, Phage+POV and POV only, respectively). Of these, 27,646 PCs contained 20 or more members (counts with varying levels of cluster membership are summarized in Table 4). For comparison, GOS, the first large-scale marine microbial-fraction metagenomic sequencing effort predicted 6.1 M proteins from assemblies derived from 7.7 M Sanger sequences, and identified ∼39 k of these ‘20+ member’ PCs from surface ocean waters [19]. These POV data represent the first marine viral-fraction metagenomes analyzed using PC techniques (but see also [33], [47]), and they more than double the known viral PCs (Figure 2B). Specifically, the POV dataset ‘identified’ 12,302 GOS 20+ member PCs (these existed in GOS and a subset have been previously identified as viral [43], [48]), while adding 15,344 20+ member PCs represented only by POV sequences. These ∼28 k viral clusters are likely also abundant in nature as they represent only ∼6% of the total number of PCs (the ecological ‘binning’ unit), but recruit ∼68% of the POV ORFs (the ecological ‘count’ equivalent).

Table 4

Distribution and count by PC size.

PC size	# POV PCs	# POV reads	# POV ORFs	# GOS ORFs	# SIMAPphage ORFs
2–4	314 144	798 190	797 110	N/A	1 080
5–9	80 302	514 717	514 453	N/A	264
10–19	34 330	452 112	452 012	N/A	100
20–49	18 332	542 090	390 671	151 282	137
50–99	4 522	305 752	148 579	157 106	67
100–199	1 956	270 531	90 302	180 160	69
200–499	1 360	430 580	147 385	283 101	94
500–999	715	514 752	155 714	358 942	96
1000–1999	524	738 986	187 580	551 246	160
2000+	234	881 930	396 720	485 096	114
total	456 420	5 449 640	3 280 526	2 166 933	2 181

Distribution of PC sizes (based upon the number of ORFs a PC contains) and the number of PCs, POV reads mapping to clusters and ORFs that belonged to PCs of that size. All POV data are new to this study.

Protein Clusters as a Viral Community Functional Richness Metric

Because viruses lack gene markers (e.g., small subunit ribosomal DNA) and most of the POV reads cannot be identified in reference databases measuring viral community diversity is problematic. To measure functional richness in POV samples irrespective of annotation, we detected genetic links between viral communities using protein clustering and illustrated patterns in richness between the samples using a rarefaction analysis as previously defined [33], [47].

Seasonal viral functional richness measurements at LineP

To examine ocean viral functional richness across the depth continuum (10 m to 2000 m) and season (spring, summer, and winter), 11 metagenomes from a single LineP open ocean site (station P26) were analyzed by protein cluster/rarefaction analysis [33], [47]. Rarefaction analysis showed that photic samples were less functionally rich than aphotic samples from the same season (Figure 3A). When comparing samples from different seasons in the same photic zone rarefaction analysis showed that winter was the most functionally rich, followed by spring, and summer (Figure 3A). All aphotic samples clearly separated by season, whereas photic samples showed similar levels of functional richness in spring and summer but increased richness in winter. Overall, rarefaction patterns indicate that season and photic zone are important drivers of viral community functional richness at LineP.

Figure 3

Viral community functional richness based on season and proximity to shore.

Rarefaction analysis of hits to protein clusters from: (A) 11 POV metagenomes from a single LineP open ocean site (station P26) (B) 11 POV metagenomes from LineP stations P4, P12, and P26 from a single research cruise (June 2009). To be conservative, only protein clusters with >20 members were used in these analyses.

Viral community functional richness based on season and proximity to shore.

Viral functional richness measurements in June at LineP from coastal to open ocean

To examine viral functional richness across the depth continuum (10 m to 2000 m) and with proximity to shore (coastal vs open ocean), 11 metagenomes from the LineP transect stations P4, P12, and P26 from a single research cruise (June 2009) were analyzed by protein cluster/rarefaction analysis. Broadly, we found that aphotic samples from coastal to open ocean showed the same overall functional richness. Photic samples, however, were more functionally rich at the coast (station P4) as compared to oligotrophic intermediate and open ocean water samples (stations P12 and P26) that are similar in terms of their overall environmental chemistry. These data suggest that coastal photic waters are more functionally rich than photic open ocean waters, but the same pattern is not evident in the aphotic ocean. Given that functional richness differed by photic zone in the smaller subset of LineP samples examined above, we analyzed the complete POV dataset by photic zone using a protein cluster/rarefaction analysis (Figure 4A and B). In the photic rarefaction analysis (Figure 4A), we found that deeply sequenced SIO samples were the most functionally rich. Though the rarefaction analysis should normalize the samples in terms of sequencing effort, the limitation we placed on our analysis to include only 20+ member clusters may have included rare SIO clusters that are highly represented due to the exceptional sequencing effort for this single sample. The samples with the next highest functional richness came from MBARI samples in the same current system as a local upwelling, followed by samples from fall/winter and spring/summer. We noted several exceptions to these general trends. First, the LineP spring coastal sample, L.Spr.C.10 m, grouped more closely to fall/winter samples, likely due to the higher functional richness noted previously in photic coastal samples. Secondly, the MBARI fall open ocean sample taken from waters in a deep chlorophyll maximum (DCM), M.Fall.O.105 m, grouped more closely to fall/winter samples, which could be due to increased functional richness in the DCM. Yet, we could not confirm this given the low sequencing effort and limited trend information in the rarefaction curve from other DCM sample, M.Fall.I.42 m. In the aphotic rarefaction analysis (Figure 4B), winter/fall/spring samples were the most functionally rich followed by summer.

Figure 4

Viral community functional richness in the Pacific Ocean.

Viral community functional richness in the Pacific Ocean.

Rarefaction analysis of hits to protein clusters from all POV metagenomes in (A) photic zone samples and (B) aphotic zone samples. To be conservative, only protein clusters with >20 members were used in these analyses. In summary functional richness decreased (i) with distance from shore in surface waters only, (ii) from winter to summer, and (iii) from deep to surface waters. These data mirrored patterns seen in marine and brackish bacteria where samples decreased in diversity (i) along a river to ocean gradient [49], (ii) from winter to summer [50], (iii) and from deep to surface [49]. Our data provide a powerful new look at all of these features in a single analysis for viral metagenomes.

Limitations of the POV Dataset

While taking great strides forward in providing a large-scale quantitative viral metagenomic dataset, POV is also not without biases and limitations. First, the dataset excludes ssDNA phages as these viral concentrates were purified using both DNase and cesium chloride banding (CsCl 1.35–1.5 g/ml), where ssDNA phages are in the lighter fractions [51], [52]. While most detectable viruses (DNA-stained counts suggested >99%) were in these fractions, it is now clear that ssDNA phages are often not detectable by such staining procedures [52]. Second, the dataset does not include RNA viruses as nucleic acid extractions were optimized for DNA and the linker amplification enzymes are specific for DNA. There are now methods which allow simultaneous purification of both RNA and DNA viruses [53], and already a road-map for constructing RNA viral metagenomes from the oceans [54]. However, protocols to isolate, purify, and amplify RNA for viral metagenomes have not been rigorously evaluated as was done for dsDNA viruses [31]–[34]. Third, the dataset excludes larger viruses as the source waters for viral community concentration were 0.22 µm pre-filtered to remove bacteria. Fourth, the average read length of ∼300 bp, while an improvement upon the other large-scale viral metagenomic datasets available, will undoubtedly be improved upon as sequencing technologies advance. Finally, one sample is anomalously enriched for “bacteria” in the dataset. Specifically, while samples averaged 4.4%±3.7% bacterial hits per sample across the 32 sample dataset, a single sample showed elevated bacterial hits (24.1% for L.Spr.C.1000 m; Table 5; Figure S1). In the L.Spr.C.1000 m sample, 13.7% of reads matched bacteria from the family Rhodobacteraceae and 7.0% matched Pseudoalteromonadaceae representing the majority of bacterial hits (20.7% in these families as compared to 24.1% total). Further, the majority of reads (∼20% of total reads) from these two bacterial families mapped to just three genomes: Sulfitobacter sp. EE-36, Pseudoalteromonas sp. SM9913, Sulfitobacter sp. NAS-14.1, with 9, 7, and 4% of the hits, respectively. The locations of these “bacterial” reads in the reference genomes showed that the hits matched throughout the genomes indicating that these “bacterial” hits were not simply integrated phage DNA in the bacterial genomes, and must instead be either bacterial contamination that survived the DNase and CsCl purification steps or gene transfer agents (GTA, [55]–[57]). Given the extensive purification (cesium-chloride banding and DNase treatment) of the viral concentrate, we favor the hypothesis that this sample contains a higher GTA content than the rest of the samples. Regardless of the source of the microbial DNA, the L.Spr.C.1000 m sample should be used with caution, particularly for analyses related to auxiliary metabolic genes (AMGs, sensu [58]) or other genetic data known to co-occur in bacteria.

Table 5

Percentage of POV reads assigned to superkingdoms.

sample	viruses	bacteria	eukaryota	archaea	unknown
L.Win.O.1000 m	2.8%	2.9%	0.7%	0.0%	93.6%
L.Win.O.2000 m	2.5%	2.7%	1.0%	0.0%	93.8%
L.Spr.C.500 m	5.0%	4.0%	0.9%	0.0%	90.0%
L.Spr.C.1000 m	2.9%	24.1%	0.8%	0.0%	72.2%
L.Spr.C.1300 m	2.7%	2.5%	0.4%	0.0%	94.4%
L.Spr.I.500 m	5.2%	5.6%	1.3%	0.0%	87.9%
L.Spr.I.1000 m	4.0%	3.8%	0.9%	0.0%	91.3%
L.Spr.I.2000 m	5.9%	3.2%	0.7%	0.0%	90.2%
L.Sum.O.500 m	3.5%	4.3%	2.0%	0.0%	90.2%
L.Sum.O.1000 m	2.0%	2.9%	7.6%	0.0%	87.5%
L.Sum.O.2000 m	3.1%	4.1%	2.1%	0.0%	90.6%
L.Spr.O.1000 m	3.2%	2.9%	0.7%	0.0%	93.2%
L.Spr.O.2000 m	5.0%	3.7%	2.0%	0.0%	89.3%
L.Win.O.500 m	2.9%	2.7%	0.7%	0.0%	93.7%
M.Fall.O.1000 m	2.7%	3.4%	0.5%	0.0%	93.3%
M.Fall.O.4300 m	2.6%	2.9%	0.6%	0.0%	93.9%
GD.Spr.C.8 m	12.4%	4.6%	0.4%	0.0%	82.6%
GF.Spr.C.9 m	9.7%	4.5%	0.4%	0.0%	85.3%
L.Sum.O.10 m	4.5%	4.0%	0.9%	0.0%	90.6%
L.Spr.C.10 m	5.6%	2.7%	0.6%	0.0%	91.1%
L.Spr.I.10 m	4.7%	2.9%	1.2%	0.0%	91.2%
L.Spr.O.10 m	5.6%	5.1%	1.7%	0.0%	87.6%
L.Win.O.10 m	6.0%	3.3%	0.8%	0.0%	89.9%
M.Fall.C.10 m	8.0%	4.1%	0.7%	0.0%	87.2%
M.Fall.I.10 m	7.4%	3.4%	0.6%	0.0%	88.6%
M.Fall.I.42 m	5.3%	6.2%	2.8%	0.0%	85.8%
M.Fall.O.10 m	5.5%	3.5%	0.5%	0.0%	90.6%
M.Fall.O.105 m	15.3%	3.1%	0.4%	0.0%	81.2%
SFC.Spr.C.5 m	10.7%	4.7%	0.9%	0.0%	83.6%
SFD.Spr.C.5 m	9.1%	3.6%	0.7%	0.0%	86.5%
SFS.Spr.C.5 m	9.0%	3.8%	0.9%	0.0%	86.2%
STC.Spr.C.5 m	7.3%	4.1%	0.7%	0.0%	87.9%

Percentage of POV reads that were taxonomically assigned based on matches to proteins in SIMAP and curated as described in the methods by sample.

Conclusions

Over the last two decades, viruses have emerged as abundant, diverse and biogeochemically important members of nearly any ecosystem. In spite of this importance, mapping the ocean virome has been stifled by technical challenges, limited sampling opportunities, and lack of database and analytical resources. The quantitative dataset and PC organization documented here provide an invaluable mapping resource for future comparative viral metagenomic research. Looking forward, marine viral ecology stands at a tipping point wherein the “unseen” majority [13] can now be spatiotemporally documented using large-scale metagenomic sequencing at a reduced cost. These ever-growing datasets, in combination with emerging information on novel phyla of microbial hosts [59], [60] and transformative experimental methods (e.g., single viral genomics [61], microfluidic digital PCR [62], viral tagging [63], phageFISH [64]) and k-mer-based annotation techniques to rapidly assign function [65] offer new windows into viral diversity across spatial and temporal scales that can be inter-connected with paired microbial datasets to link viruses and their hosts. Although the datasets and analyses are formidable, the curated PC dataset provided here should ease future adventures into the ‘unknown’ and lead to a better understanding of the dynamic microbial and viral processes that drive the biogeochemistry that fuels the planet.

Materials and Methods

Sample Collection, DNA Isolation, Linker Amplification, and Purification

Each sample, from ∼20–50 L of seawater, was pre-filtered using a 150 µm GF/A filter followed by a 0.22 µm, 142 mm diameter Express Plus filter. All filtrates were concentrated by FeCl3-precipitation and purified by DNase+CsCl. Comparison samples from SIO also had additional treatments as follows: (i) Tangential Flow Filtration (TFF) and DNase+CsCl, (ii) FeCl3-precipitation and DNase only, and (iii) FeCl3-precipitation and DNase+Sucrose as previously described [33]. DNA was extracted from the concentrated and purified viral particles using Wizard PCR DNA Purification Resin and Minicolumns as previously described [66]. The resulting DNA samples were randomly sheared and amplified using linker amplification (LA) as described previously [34]. Linker-amplified VLP DNA samples were sequenced using GS FLX Titanium sequencing chemistry on a 454 Genome Sequencer (http://www.454.com). Sequences were quality filtered using a custom pipeline written in Perl and bash shell and executed on a high performance computer running PBSPro to distribute jobs (screenpipe.tar). Briefly sequences were removed that (i) had an “N” anywhere in the sequence, and (ii) deviated from two standard deviations from the mean length or read quality score using protocols proposed by Huse et al. [67]. Artificial duplicates were removed from the pyrosequencing runs using the program cdhit-454 version 4.5.5 with default parameters [68]. All sequences were deposited to CAMERA (http://camera.calit2.net) under the following project accessions: CAM_P_0000914 and CAM_P_0000915.

Assembly and ORF Finding

Metagenomic assembly and ORF calling was conducted using a custom pipeline written in Perl and bash shell and executed on a high performance computer running GridEngine to distribute jobs (ivelvet2_orfpipeline.tar). First, we removed singletons by finding reads that had a 20-mer frequency equal to zero using the vmatch package version 2.1.5 (kmer size = 20; http://www.vmatch.de/), because by definition they cannot contribute to overlap in an assembly [69]. Second, to skew the assembly towards assembling dominant species first and less dominant members in subsequent rounds of assembly, velvet version 1.0.15 (hash length = 29, -long) [70] was used to iteratively assemble sequences based on their k-mer frequency in 2+, 4+, 6+, 10+ bins. Third, the contigs from each frequency bin were merged to remove exact duplicates using cdhit version 4.5.5 and requiring a percent identity of 100% across the smallest contig [68]. Finally, the non-redundant maximally assembled contig dataset was used as input for ORF prediction using the metagenomic mode in Prodigal [71] along with the individual reads. All ORFs that were non-redundant and >60 amino acids in length were retained for further analysis similar to GOS [19].

Protein Clustering

ORFs were clustered based on sequence similarity in a two-step process using cd-hit version 4.5.5 [68] using a custom pipeline written in Perl and bash shell and executed on a high performance computer using GridEngine to distribute jobs (protuniversepipeline.tar). First, ORFs were mapped to known PCs from the global ocean survey (GOS; [19]) and phage known protein sequences using cd-hit-2d (‘-g 1 -n 4 -d 0 -T 24 -M 45000’; 60% percent identity and 80% coverage). The proteins included in this initial clustering included GOS core cluster proteins (http://camera.calit2.net) and 33,857 proteins (NCBI) from known phage genomes downloaded on July 7, 2011 that were mapped to the associated SIMAP proteins for additional annotation information. Second, ORFs that did not match to known GOS clusters or phage genes from SIMAP were self-clustered using cd-hit as above. All ORFs, PCs and annotation are available as a public resource on the CAMERA website (http://camera.calit2.net) under the project accession: CAM_P_0000915.

Taxonomic Classification

BLASTX [72] was used to assign taxonomy to ORFs and sequence reads by comparison to the Similarity Matrix of Proteins (SIMAP, [45], June 25th, 2011 release) using an analysis pipeline written in Perl and bash shell and executed on a high performance computer using PBSPro to distribute jobs (blastpipeline_simap.tar). Taxonomy was assigned based on the species taxonomy ID listed in SIMAP and at the superfamily, family and genus levels using the NCBI taxonomy hierarchy for that species ID. Data curation consisted of re-assigning hits to “uncultured” organisms to their next top match, as well as examining missing family and genus level data to create a curated a subset of the NCBI taxonomy records for the most abundant viruses [33].

Rarefaction Analysis

All high quality metagenomic reads in the POV dataset (Table 2) were compared to ORFs in the 20+ member protein clusters using BLASTX (E value <0.001). We then generated hit counts to the protein clusters and used the data for further rarefaction analysis using the rarefaction calculator: (http://www.biology.ualberta.ca/jbrzusto/rarefact.php). All protocols are available at http://eebweb.arizona.edu/Faculty/mbsulli/protocols.htm, and scripts and associated documentation are archived at http://code.google.com/p/tmpl/. Family taxonomic profile across POV samples by photic zone. Note that these data represent only those metagenomic reads that had a significant hit to the SIMAP database. (TIFF) Click here for additional data file. (A) Genus and (B) species taxonomic profile for all POV samples combined by photic zone. Note that these data represent only those metagenomic reads that had a significant hit to the SIMAP database. (TIFF) Click here for additional data file. Marine Phage Genomes in NCBI Genbank as of November 2012. (DOCX) Click here for additional data file.

67 in total

1. Cultivated single-stranded DNA phages that infect marine Bacteroidetes prove difficult to detect with DNA-binding stains.

Authors: Karin Holmfeldt; Duško Odić; Matthew B Sullivan; Mathias Middelboe; Lasse Riemann
Journal: Appl Environ Microbiol Date: 2011-12-02 Impact factor: 4.792

Review 2. Metagenomics: application of genomics to uncultured microorganisms.

Authors: Jo Handelsman
Journal: Microbiol Mol Biol Rev Date: 2004-12 Impact factor: 11.056

3. Functional metagenomic profiling of nine biomes.

Authors: Elizabeth A Dinsdale; Robert A Edwards; Dana Hall; Florent Angly; Mya Breitbart; Jennifer M Brulc; Mike Furlan; Christelle Desnues; Matthew Haynes; Linlin Li; Lauren McDaniel; Mary Ann Moran; Karen E Nelson; Christina Nilsson; Robert Olson; John Paul; Beltran Rodriguez Brito; Yijun Ruan; Brandon K Swan; Rick Stevens; David L Valentine; Rebecca Vega Thurber; Linda Wegley; Bryan A White; Forest Rohwer
Journal: Nature Date: 2008-03-12 Impact factor: 49.962

4. Lysogenic virus-host interactions predominate at deep-sea diffuse-flow hydrothermal vents.

Authors: Shannon J Williamson; S Craig Cary; Kurt E Williamson; Rebekah R Helton; Shellie R Bench; Danielle Winget; K Eric Wommack
Journal: ISME J Date: 2008-08-21 Impact factor: 10.302

Review 5. The microbial engines that drive Earth's biogeochemical cycles.

Authors: Paul G Falkowski; Tom Fenchel; Edward F Delong
Journal: Science Date: 2008-05-23 Impact factor: 47.728

6. Metagenomic characterization of Chesapeake Bay virioplankton.

Authors: Shellie R Bench; Thomas E Hanson; Kurt E Williamson; Dhritiman Ghosh; Mark Radosovich; Kui Wang; K Eric Wommack
Journal: Appl Environ Microbiol Date: 2007-10-05 Impact factor: 4.792

7. Comparative metagenomics of microbial traits within oceanic viral communities.

Authors: Itai Sharon; Natalia Battchikova; Eva-Mari Aro; Carmela Giglione; Thierry Meinnel; Fabian Glaser; Ron Y Pinter; Mya Breitbart; Forest Rohwer; Oded Béjà
Journal: ISME J Date: 2011-02-10 Impact factor: 10.302

8. SIMAP: the similarity matrix of proteins.

Authors: Thomas Rattei; Roland Arnold; Patrick Tischler; Dominik Lindner; Volker Stümpflen; H Werner Mewes
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. The marine viromes of four oceanic regions.

Authors: Florent E Angly; Ben Felts; Mya Breitbart; Peter Salamon; Robert A Edwards; Craig Carlson; Amy M Chan; Matthew Haynes; Scott Kelley; Hong Liu; Joseph M Mahaffy; Jennifer E Mueller; Jim Nulton; Robert Olson; Rachel Parsons; Steve Rayhawk; Curtis A Suttle; Forest Rohwer
Journal: PLoS Biol Date: 2006-11 Impact factor: 8.029

10. Contrasting life strategies of viruses that infect photo- and heterotrophic bacteria, as revealed by viral tagging.

Authors: Li Deng; Ann Gregory; Suzan Yilmaz; Bonnie T Poulos; Philip Hugenholtz; Matthew B Sullivan
Journal: MBio Date: 2012-10-30 Impact factor: 7.867

149 in total

1. VIRALpro: a tool to identify viral capsid and tail sequences.

Authors: Clovis Galiez; Christophe N Magnan; Francois Coste; Pierre Baldi
Journal: Bioinformatics Date: 2016-01-05 Impact factor: 6.937

2. Illuminating structural proteins in viral "dark matter" with metaproteomics.

Authors: Jennifer R Brum; J Cesar Ignacio-Espinoza; Eun-Hae Kim; Gareth Trubl; Robert M Jones; Simon Roux; Nathan C VerBerkmoes; Virginia I Rich; Matthew B Sullivan
Journal: Proc Natl Acad Sci U S A Date: 2016-02-16 Impact factor: 11.205

3. Niche-dependent genetic diversity in Antarctic metaviromes.

Authors: Olivier Zablocki; Lonnie van Zyl; Evelien M Adriaenssens; Enrico Rubagotti; Marla Tuffin; Stephen C Cary; Don Cowan
Journal: Bacteriophage Date: 2014-12-16

4. Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans.

Authors: Ilnam Kang; Hyun-Myung Oh; Dongmin Kang; Jang-Cheon Cho
Journal: Proc Natl Acad Sci U S A Date: 2013-06-24 Impact factor: 11.205

5. Insight into the unknown marine virus majority.

Authors: Alexander I Culley
Journal: Proc Natl Acad Sci U S A Date: 2013-07-10 Impact factor: 11.205

6. New approaches indicate constant viral diversity despite shifts in assemblage structure in an Australian hypersaline lake.

Authors: Joanne B Emerson; Brian C Thomas; Karen Andrade; Karla B Heidelberg; Jillian F Banfield
Journal: Appl Environ Microbiol Date: 2013-08-30 Impact factor: 4.792