Martin G Reese, Barry Moore, Colin Batchelor, Fidel Salas, Fiona Cunningham, Gabor T Marth, Lincoln Stein, Paul Flicek, Mark Yandell, Karen Eilbeck.
Abstract
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Year: 2010 PMID: 20796305 PMCID: PMC2945790 DOI: 10.1186/gb-2010-11-8-r88
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Table 1. A summary of the tag-value pairs and their requirement in GVF
| Tag | Value | Necessity | Description |
|---|---|---|---|
| ID | String | Mandatory | While the GFF3 specification considers the ID tag to be optional, GVF requires it. As in GFF3 this ID must be unique within the file and is not required to have meaning outside of the file |
| Variant_seq | String | Optional | All sequences found in this individual (or group of individuals) at a variant location are given with the Variant_seq tag. If the sequence is longer than 50 nucleotides, the sequence may be abbreviated as '~'. In the case where the variant represents a deletion of sequence relative to the reference, the Variant_seq is given as '-' |
| Reference_seq | String | Optional | The reference sequence corresponding to the start and end coordinates of this feature |
| Variant_reads | Integer | Optional | The number of reads supporting each variant at this location |
| Total_reads | Integer | Optional | The total number of reads covering a variant |
| Genotype | String | Optional | The genotype of this variant, either heterozygous, homozygous, or hemizygous |
| Variant_freq | Real number between 0 and 1 | Optional | A real number describing the frequency of the variant in a population. The details of the source of the frequency should be described in an attribute-method pragma as discussed above. The order of the values given must be in the same order that the corresponding sequences occur in the Variant_seq tag |
| Variant_effect | [ | Optional | The effect of a variant on sequence features that overlap it. It is a four part, space delimited tag, The |
| Variant_copy_number | Integer | Optional | For regions on the variant genome that exist in multiple copies, this tag represents the copy number of the region as an integer value |
| Reference_copy_number | Integer | Optional | For regions on the reference genome that exist in multiple copies, this tag represents the copy number of the region as an integer in the form: |
| Nomenclature | String | Optional | A tag to capture the given nomenclature of the variant, as described by an authority such as the Human Genome Variation Society |
For Dbxrefs, the format of each type of ID varies from database to database. An authoritative list of databases, their DBTAGs, and the URL transformation rules that can be used to fetch the objects given their IDs can be found at this location [45]. Further details can be found here [46]. In addition, a Dbxref can be given as a stable Uniform Resource Identifier (URI).
Figure 1. The top-level terms in the Sequence Ontology used in variant annotation. There are 1,792 terms in SO, most of which (1,312) are sequence features. There are 100 terms in the ontology that are kinds of sequence variant, of which the two top level terms are shown, and three sub-types, shown with dashed lines, that demonstrate the detail of these terms. The parts of SO that are used to annotate sequence variation files are sequence alteration to categorize the change (five subtypes shown with dashed lines), sequence feature to annotate the genomic features that the alteration intersects, and sequence variant to annotate the kind of sequence variant with regards to the reference sequence.
Figure 2. An example of a GVF annotation, showing three hypothetical sequence alterations: an SNV, a deletion and a duplication. Lines beginning with '##' specify file-wide pragmas that apply to all or a large portion of the file. For presentation in the manuscript, entries are wrapped over multiple lines and separated by empty lines, but all data for a given pragma or feature should be contained on one line in a GVF file. A description of the tag-value pairs is given in Table 1.
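As a minimal sketch of how such a feature line might be read, the Python below splits one tab-delimited line into the nine GFF3 columns and parses the attribute column into tag-value pairs. The example line, the helper name `parse_gvf_feature`, and the assumption that multi-valued tags are comma-separated (as in GFF3) are illustrative and are not taken from the 10Gen files.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gvf_feature(line):
    """Parse one GVF feature line into a dict (illustrative helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair.strip():
            continue  # tolerate a trailing ';'
        tag, _, value = pair.partition("=")
        # Assumption: multiple values are comma-separated, as in GFF3.
        attributes[tag.strip()] = [unquote(v.strip()) for v in value.split(",")]
    if "ID" not in attributes:
        # GVF makes the ID tag mandatory, unlike GFF3.
        raise ValueError("GVF requires an ID attribute on every feature")
    record["attributes"] = attributes
    return record


# Hypothetical SNV line; all values are made up for illustration.
example = "\t".join([
    "chr16", "samtools", "SNV", "49291141", "49291141", ".", "+", ".",
    "ID=ID_1;Variant_seq=A,G;Reference_seq=G;Genotype=heterozygous",
])
snv = parse_gvf_feature(example)
print(snv["type"], snv["start"], snv["attributes"]["Variant_seq"])
```

Every attribute value is kept as a list so that per-sequence tags such as Variant_freq can later be paired positionally with the entries of Variant_seq.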
The pragmas defined by GVF, in addition to those already defined by GFF3 (gff-version, sequence-region, feature-ontology, attribute-ontology, source-ontology, species, genome-build)
| Pragma | Allowed tags | Description |
|---|---|---|
| file-version | Comment | This allows the specification of the version of a specific file. What exactly the version means is left undefined, but the tag is provided for the case when an individual's variants are described in GVF and then, at a later date, changes to the data or the software require an update to the file. An increment of the file-version could signify such a change. Any numeric version of file-version is allowed |
| file-date | Comment | The file-date pragma is included as a method to describe the date when the file was created. The ISO 8601 standard for dates in the form YYYY-MM-DD is required for the value |
| individual-id | Dbxref, Gender, Population, Comment | This pragma provides details about the individual whose variants are described in the file |
| ##individual-id Dbxref = Coriell:NA18507;Gender = male;Ethnicity = Yoruba; Comment = Yoruba from Ibadan | ||
| source-method | Seqid, Source, Type, Dbxref, Comment | This pragma provides details about the algorithms or methodologies used to generate data for a given source in the file. This is used, for example, to document how a particular type of variant was called. A typical use would be to provide a Dbxref link to a journal article describing software used for calling the variant data with the given source tag |
| ##source-method Seqid = chr1;Source = MAQ;Type = SNV;Dbxref = PMID:18714091;Comment = MAQ SNV calls; | ||
| attribute-method | Seqid, Source, Type, Attribute, Dbxref, Comment | This pragma provides details about algorithms or methodologies for a given attribute tag in the file. This is used to document how a particular type of attribute value (that is, Genotype, Variant_effect) was calculated |
| ##attribute-method Source = SOLiD;Type = SNV;Attribute = Genotype;Comment = Genotype is reported here as determined in the original study | ||
| technology-platform | Seqid, Source, Type, Read_length, Read_type, Read_pair_span, Platform_class, Platform_name, Average_coverage, Comment, Dbxref | This pragma provides details about the technologies (that is, sequencing or DNA microarray) used to generate the primary data |
| ##technology-platform Seqid = chr1;Source = AFFY_SNP_6;Type = SNV;Dbxref = URI: | ||
| data-source | Seqid, Source, Type, Dbxref, Data_type, Comment | This pragma provides details about the source data for the variants contained in this file. This could be links to the actual sequence reads in a trace archive, or links to a variant file in another format that has been converted to GVF |
| ##data-source Source = MAQ;Type = SNV;Dbxref = SRA:SRA008175;Data_type = DNA sequence;Comment = NCBI Short Read Archive | ||
| phenotype-description | Ontology, Term, Comment | A description of the phenotype of the individual. This pragma can contain either ontology constrained terms, or a free text description of the individual's phenotype or both. |
| ##phenotype-description Ontology = | ||
| ploidy | Ontology, Term, Comment | This pragma defines the ploidy for a given genome. This pragma can contain either ontology constrained terms, or a free text description of the individual's ploidy. It is suggested that ontology constrained terms use a subtype of the term PATO:0001374, which includes haploid, diploid, polyploid, triploid etc |
| ##ploidy chr22 1 49691432 diploid | ||
| ##ploidy chrY 1 57772954 haploid | ||
The pragmas defined by GVF may refer to the entire file or may limit their scope by use of tag-value pairs. For example, if a pragma only applies to SNVs that were called by Gigabayes on chromosome 13, then the tags: Seqid = chr13;Source = Gigabayes;Type = SNV would indicate the scope. The Dbxref tag within a GVF pragma takes values of the form 'DBTAG:ID' and provides a reference for the information given by the pragma whether that be the location of sequence files or a link to a paper describing a method. Tags beginning with uppercase letters are reserved for future use within the GFF/GVF specification, but applications are free to provide additional tags beginning with lower case letters.
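The scoping rule can be sketched in Python as follows. The helper names `parse_pragma` and `pragma_applies` are hypothetical, and the sketch assumes that a scope tag absent from the pragma places no restriction on the corresponding column. Whitespace around '=' is tolerated because the examples in the table are typeset with spaces.

```python
# Scope tags and the feature columns they constrain.
SCOPE_TAGS = {"Seqid": "seqid", "Source": "source", "Type": "type"}


def parse_pragma(line):
    """Split '##name Tag=value;Tag=value' into (name, {tag: value})."""
    if not line.startswith("##"):
        raise ValueError("pragma lines begin with '##'")
    name, _, body = line[2:].strip().partition(" ")
    tags = {}
    for pair in body.split(";"):
        if "=" not in pair:
            continue  # skip empty chunks such as a trailing ';'
        tag, _, value = pair.partition("=")
        tags[tag.strip()] = value.strip()
    return name, tags


def pragma_applies(tags, feature):
    """True if the feature falls within the scope given by the pragma's tags.

    Assumption: a missing scope tag means 'no restriction'.
    """
    return all(
        tags[tag] == feature[column]
        for tag, column in SCOPE_TAGS.items()
        if tag in tags
    )


name, tags = parse_pragma(
    "##source-method Seqid = chr13;Source = Gigabayes;Type = SNV;"
    "Comment = hypothetical example"
)
in_scope = {"seqid": "chr13", "source": "Gigabayes", "type": "SNV"}
out_of_scope = {"seqid": "chr1", "source": "Gigabayes", "type": "SNV"}
print(name, pragma_applies(tags, in_scope), pragma_applies(tags, out_of_scope))
```

A pragma with no scope tags yields an empty dictionary from `parse_pragma`, so under this assumption `pragma_applies` treats it as applying to every feature in the file.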
A reference GVF dataset for public use
| Filename | Individual | Ethnicity | Platform | Reference |
|---|---|---|---|---|
| 10Gen_NA19240_SNV | NA19240 | African | Life SOLiD | [ |
| 10Gen_NA18507_ILMN_SNV | NA18507 | African | Illumina | [ |
| 10Gen_NA18507_SOLiD_SNV | NA18507 | African | Life SOLiD | [ |
| 10Gen_Chinese_SNV | Chinese | Asian | Illumina | [ |
| 10Gen_Korean_SNV | Korean | Asian | Illumina | [ |
| 10Gen_Venter_SNV | Venter | Caucasian | Sanger | [ |
| 10Gen_Watson_SNV | Watson | Caucasian | Roche 454 | [ |
| 10Gen_NA07022_SNV | NA07022 | Caucasian | CGenomics | [ |
| 10Gen_NA12878_SNV | NA12878 | Caucasian | ABI SOLiD | [ |
| 10Gen_Quake_SNV | Quake | Caucasian | Helicos | [ |