Melissa Beth Duhaime, Renzo Kottmann, Dawn Field, Frank Oliver Glöckner.
Abstract
In any sequencing project, the possible depth of comparative analysis is determined largely by the amount and quality of the accompanying contextual data. The structure, content, and storage of these contextual data should be standardized to ensure consistent coverage of all sequenced entities and to facilitate comparisons. The Genomic Standards Consortium (GSC) has developed the "Minimum Information about Genome/Metagenome Sequences (MIGS/MIMS)" checklist for the description of genomes. Here we annotate all 30 publicly available marine bacteriophage sequences to the MIGS standard. These annotations build on existing International Nucleotide Sequence Database Collaboration (INSDC) records and confirm, as expected, that current submissions lack most MIGS fields. MIGS fields were manually curated from the literature and placed in XML format as specified by the Genomic Contextual Data Markup Language (GCDML). These "machine-readable" reports were then analyzed to highlight patterns describing this collection of genomes. Completed reports are provided in GCDML. This work represents one step towards the annotation of our complete collection of genome sequences and shows the utility of capturing richer metadata along with raw sequences.
Keywords: contextual data; genome standards; marine phages; markup language
Year: 2011 PMID: 21677864 PMCID: PMC3111985 DOI: 10.4056/sigs.621069
Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277
Phages from a marine habitat, as reported in the literature, and their corresponding INSDC accession numbers.
| Phage | INSDC accession | Interpolated data | Geo-referencing (missing elements) |
|---|---|---|---|
| Cyanophage PSS2 | GQ334450 | Yes | Complete |
| Flavobacterium phage 11b | AJ842011 | No - insufficient data | |
| | EU399241 | Yes | Complete |
| | AY772740 | Yes | |
| Phage phiJL001 | AY576273 | Yes | |
| | AF155037 | No - insufficient data | |
| | AY939843 | Yes | Complete |
| | AY939844 | Yes | Complete |
| | AY940168 | Yes | Complete |
| | AF189021 | No - insufficient data | |
| | FJ867910 | No - insufficient data | |
| | FJ867912 | No - insufficient data | |
| | FJ867913 | No - insufficient data | |
| | FJ867914 | No - insufficient data | |
| | FJ591093 | No - insufficient data | |
| | FJ591094 | No - insufficient data | |
| | AF338467 | No - insufficient data | |
| | AJ630128 | No - insufficient data | |
| | FM207411 | No - insufficient data | |
| | DQ149023 | No - too close to coast | x, y, t |
| | EF372997 | Yes | |
| | AY505112 | No - insufficient data | |
| | DQ029335 | No - insufficient data | |
| | AY510084 | No - insufficient data | |
| | AY328852 | No - too close to coast | |
| | AY328853 | No - too close to coast | |
| | AY095314 | No - insufficient data | |
| | AY133112 | No - insufficient data | |
| | AY283928 | No - insufficient data | |
| | AF125163 | No - insufficient data | |
Phages finally determined not to be from marine habitats are noted in superscript and alternatively described according to EnvO-Lite (v1.4). Genomes for which interpolated data could be determined, and any missing elements required for geo-referencing, are listed (note: x, y, z and t are required for precise metadata interpolation).
1) This can be as minimal as a “fuzzy” habitat descriptor (rather than precise x, y), requires a depth (or 'surface sample' description), and does not require a date (as yearly averages can be taken). However, if the sample site is too close to the shore, data interpolation is not possible.
2) isolated from sea ice (aquatic habitat)
3) isolated from aquacultured shrimp (organism-associated habitat)
4) isolated from human (organism-associated habitat)
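The interpolation rule in footnote 1 can be read as a small decision procedure. The following is a minimal sketch of that reading, not the procedure used by megx.net or GCDML: the `SampleSite` class, its field names, and the `near_shore` flag are hypothetical, and the interpretation of x, y as horizontal coordinates, z as depth, and t as sampling date is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleSite:
    """Hypothetical container for geo-referencing fields (not a GCDML structure)."""
    habitat: Optional[str] = None  # "fuzzy" habitat descriptor
    x: Optional[float] = None      # assumed: first horizontal coordinate
    y: Optional[float] = None      # assumed: second horizontal coordinate
    z: Optional[float] = None      # assumed: depth (0.0 for a surface sample)
    t: Optional[str] = None        # assumed: sampling date
    near_shore: bool = False       # True if the site is too close to the coast


def missing_elements(site: SampleSite) -> List[str]:
    """List which of x, y, z, t are absent; all four are needed for precise interpolation."""
    return [name for name in ("x", "y", "z", "t") if getattr(site, name) is None]


def can_interpolate(site: SampleSite) -> bool:
    """Minimal rule from footnote 1: a location (precise or fuzzy) plus a depth,
    and the site must not be too close to shore. A date is optional because
    yearly averages can be used instead."""
    has_location = (site.x is not None and site.y is not None) or site.habitat is not None
    has_depth = site.z is not None
    return has_location and has_depth and not site.near_shore
```

For instance, `SampleSite(habitat="open ocean", z=0.0)` passes `can_interpolate` even though `missing_elements` reports x, y, and t as absent, while any site flagged `near_shore=True` fails regardless of how complete its coordinates are.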
Figure 1. Model of flow of contextual data into biological knowledge. (a) Screenshot of interpolated data for Cyanophage PSS2 from the megx.net website; (b) screenshot of the Cyanophage PSS2 GenBank file, the only INSDC report to store x, y, z, t data; (c) section of a GCD report showing GCDML structure, highlighting the storage of cruise information and interpolated data from megx.net GIS tools.
Figure 2. Screenshot of a GCDML report revealing the GCDML schema using the Eclipse plug-in, oXygen. Note that the (a) cruise data and (b) interpolated environmental parameters retrieved from megx.net for this genome can be added through the flexible GCDML ‘extensions.’
Figure 3. Comparison of compliance with viral components of the MIGS checklist between data available in INSDC reports and data in MIGS/GCD reports that have been supplemented with extensive manual curation. List modified from [9].
Figure 4. (a) The 26 'marine' phage genomes (plus 'aquatic' Flavobacterium phage 11b) able to be mapped based on data in their GCDML reports. The map is modified from that available from megx.net. See [23] for the exact webserver query. For more information about the mapserver technology used by megx.net, see [24]; (b) sample sites of marine phages clustered by interpolated environmental data; (c) distribution of three of the interpolated environmental parameters (nitrate, phosphate, and oxygen saturation) demonstrating the Cyanophage PSS2 outlier.
Figure 5. Overview of marine phage isolation, sequencing year, and genome properties stored in GCDML reports. (a) Trends of isolation and sequencing of the sequenced ‘marine’ phages over the last two decades. (b) Box and whisker plots showing range and distribution of genome sizes for all versus marine phages and %G+C content for marine phages. The box shows the interquartile range (middle 50% of the data); the thick black line demarcates the median; the dotted line extends to the minimum and maximum values; outliers are shown by empty circles. Data for genome sizes of “All Phages” were retrieved from NCBI.
Figure 6. Overview of phage taxonomic data. (a) The taxonomic distribution of all sequenced phages versus all sequenced marine phages and (b) the hosts of all sequenced marine phages. All information describing marine phages and their hosts is accessible via GCDML reports.
Figure 7. Network of 'genome pairs' and interactions between sequenced marine phages and sequenced hosts. Solid lines link phages (empty circles) to the host strain (solid circles) they infect; dashed lines connect phages to the host species (but not necessarily strain) they infect. Phages with no sequenced host are grouped by host Class (or Subclass for Cyanobacteria). Phage taxonomy is reflected by the color of the empty phage circle. The number of phages infecting a sequenced host is reflected by the size of the solid host circles.