Sveinung Gundersen, Matúš Kalaš, Osman Abul, Arnoldo Frigessi, Eivind Hovig, Geir Kjetil Sandve.
Abstract
BACKGROUND: With the recent advances and availability of various high-throughput sequencing technologies, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible representations of genomic features through formats that are easy to parse. A host of alternative formats are currently available and in use, complicating analysis and tool development. The issue of whether and how the multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been systematically treated.
Year: 2011 PMID: 22208806 PMCID: PMC3315820 DOI: 10.1186/1471-2105-12-494
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustration of the geometric properties of the fifteen track types. The base line is a genome, or a sequence, on which the tracks are defined. Vertical lines represent positions, while horizontal lines represent the lengths of the track elements. Gaps are thus illustrated by any empty areas between the elements. Values are represented by the height of the vertical lines. Interconnections are represented by arrows, the thickness of which corresponds to the weight of the edge.
Figure 2. Four-dimensional matrix mapping the relations of the fifteen track types. Each dimension represents the exclusion (0) or inclusion (1) of one of the four core informational properties: gaps, lengths, values and interconnections. The track type abbreviations in the top-left box are: Genome Partition (GP), Points (P) and Segments (S); in the bottom-left box: Function (F), Step Function (SF), Valued Points (VP) and Valued Segments (VS); in the top-right box: Linked Base Pairs (LBP), Linked Genome Partition (LGP), Linked Points (LP) and Linked Segments (LS); and in the bottom-right box: Linked Function (LF), Linked Step Function (LSF), Linked Valued Points (LVP) and Linked Valued Segments (LVS). The track types with white background (with gaps) are the sparse track types, while the ones with grey background (without gaps) are the dense track types. See Figure 1 for a geometric illustration of the track types.
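The four-dimensional matrix can be sketched as a lookup from the four properties to a track type name. This is only an illustration: the function name `track_type` is hypothetical, and the assignment of gaps and lengths to each type follows the usual reading of the type names (for example, points have gaps but no lengths), which is an assumption here. The label "Base Pairs" exists only to build "Linked Base Pairs"; the combination with no property at all is not a track type.

```python
from itertools import product

# (gaps, lengths, values) -> base name. The (0, 0, 0) entry is the bare
# sequence of base pairs; it only becomes a track type when linked.
BASE_NAMES = {
    (1, 0, 0): "Points",
    (1, 1, 0): "Segments",
    (0, 1, 0): "Genome Partition",
    (1, 0, 1): "Valued Points",
    (1, 1, 1): "Valued Segments",
    (0, 1, 1): "Step Function",
    (0, 0, 1): "Function",
    (0, 0, 0): "Base Pairs",
}


def track_type(gaps, lengths, values, interconnections):
    """Return the track type name for a property combination, or None."""
    key = (int(bool(gaps)), int(bool(lengths)), int(bool(values)))
    if not interconnections:
        # With none of the four properties there is nothing to represent.
        return None if key == (0, 0, 0) else BASE_NAMES[key]
    return "Linked " + BASE_NAMES[key]


# Enumerate all 16 combinations; 15 of them name a track type.
ALL_TYPES = [
    name
    for combo in product((0, 1), repeat=4)
    if (name := track_type(*combo)) is not None
]
```

Enumerating the sixteen combinations this way gives the fifteen track types, because only the all-zero combination is dropped.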
Relation between analyses and track types
| Points | Segments | Function | Valued Points | Valued Segments | |
|---|---|---|---|---|---|
| Different frequencies? | Located inside? | Higher values at locations? | Located in highly valued segments? | ||
| Overlap? | Higher values inside? | ||||
| Correlated? | |||||
| Nearby values similar? | Categories differentially located in targets? | ||||
Examples of analyses for different combinations of track types (using only five of the fifteen defined track types). Note that many of these analyses are valid for several (though not all) combinations, and are assigned to what we consider the most typical combination for the analysis. All these analyses are carefully described significance tests [10], available online at the Genomic HyperBrowser [28].
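To make one table entry concrete, the observed quantity behind a "Located inside?" question for a points track and a segments track could be computed as below. This is a minimal sketch and not the significance test of [10]: it only counts points, and it assumes 0-based positions and sorted, non-overlapping segments with exclusive ends. The function name `count_points_inside` is hypothetical.

```python
from bisect import bisect_right


def count_points_inside(points, segments):
    """Count the points that fall inside any segment.

    points: iterable of 0-based positions.
    segments: list of sorted, non-overlapping (start, end) pairs,
              with the end position exclusive.
    """
    starts = [start for start, _ in segments]
    inside = 0
    for pos in points:
        # Index of the last segment that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < segments[i][1]:
            inside += 1
    return inside


segments = [(10, 20), (30, 40)]
points = [5, 10, 19, 20, 35]
print(count_points_inside(points, segments))  # prints 3
```

A significance test would compare such a count against what is expected under a null model, which this sketch does not attempt.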
Figure 3. Overview of three common tabular formats. A) Generic Feature Format (GFF). The example file is a reduced version of the main example of the GFF version 3 specification [2]. B) Browser Extensible Data format (BED). The example file is fetched from the specification of the format at UCSC [4]. C) Wiggle Track Format (WIG) [8]. The example files show the two subformats variableStep and fixedStep. The track elements in the variableStep file cover single base pairs (span = 1, as default) and contain sparse data. For the fixedStep file, the step attribute is equal to the span attribute. The fixedStep file thus contains dense data. Figure 4 shows GTrack conversions of these example files.
The track types supported by existing tabular, binary and XML formats
| Format | Data | P | S | VP | VS | GP | SF | F | L | Strand | #Cols | Value type | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GFF3/GTF | [ | General | Tab. | ✓1 | ✓ | ✓1 | ✓ | 2 | ✓ | 9 | Float3 | |||
| BED/bigBed | [ | General | Tab./Bin. | ✓1 | ✓ | ✓1 | ✓ | 2 | ✓ | 3-12 | Int(0-1000)/string4 | |||
| BED15 | [ | Microarray | Tab. | ✓1 | ✓ | 2 | ✓ | 15 | List of floats5 | |||||
| bedGraph | [ | General | Tab. | ✓1 | ✓ | 4 | Float | |||||||
| WIG/bigWig (fixedStep) | [ | General | Tab./Bin. | ✓ | ✓ | ✓ | ✓ | 1 | Float | |||||
| WIG/bigWig (variableStep) | [ | General | Tab./Bin. | ✓ | ✓ | 2 | Float | |||||||
| CNT | [ | Copy number | Tab. | ✓ | 4 | Float | ||||||||
| Personal Genome SNP | [ | Variation | Tab. | ✓1 | ✓ | 7 | String6 | |||||||
| VCF | [ | Variation | Tab. | ✓ | ✓ | ≥ 8 | String6,3 | |||||||
| GVF | [ | General/Variation | Tab. | ✓1 | ✓ | ✓1 | ✓ | 2 | ✓ | 9 | Float3 | |||
| PSL | [ | Alignment | Tab. | ✓ | ✓ | ✓ | 21 | Int7 | ||||||
| SAM/BAM | [ | Alignment | Tab./Bin. | ✓ | ✓ | ✓ | 11 | Int/string8 | ||||||
| BioHDF | [ | Alignment | Bin. | ✓ | ✓ | ✓ | 11 | Int/string8 | ||||||
| MAF | [ | Multiple Alignment | Tab. | ✓ | ✓ | 9 | ✓ | 2-7 | Float/string8 | |||||
| FASTA | [ | Sequence | Text | ✓ | N/A | Char | ||||||||
| DAS XML | [ | General | XML | ✓1 | ✓ | ✓1 | ✓ | 2 | ✓ | N/A | Float | |||
| BioXSD 1.0 | [ | General | XML | ✓10 | ✓10 | ✓10 | ✓10 | ✓11 | ✓ | N/A | Float12 | |||
| USeq | [ | General | Bin. | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | Int/float/string | ||||
| Genomedata | [ | General | Bin. | ✓ | ✓ | ✓ | ✓ | N/A | Int/float/char |
The track type abbreviations are as follows: Points (P), Segments (S), Valued Points (VP), Valued Segments (VS), Genome Partition (GP), Step Function (SF), and Function (F). L refers to any of the linked track types. The table also denotes whether the format supports specification of strand, the number of columns of the tabular formats, and the type of the dominant value, if any.
1 Points are specified using both start and end values. There is no way of specifying that a file contains only points.
2 Only a special case of linked segments is supported, namely part-of relationships, such as an exon being a part of a gene.
3 The chosen value type refers to what may be considered the main score column of the format. The format also includes a configurable column containing values that may be extracted by specialized parsers.
4 We limit the bigBed format to the standard BED columns for simplicity, as the bigBed format is highly customizable through the use of AutoSQL configurations.
5 The float values represent a set of gene expression values from microarray experiments.
6 The values represent the possible alleles at a SNP position. Also, the allele frequencies and quality scores are reported and could be used as values.
7 E.g. the number of bases that match/do not match.
8 E.g. the mapping quality or the aligned sequence itself.
9 Links to alignments in other genomes.
10 There is no way of specifying that a record contains only points or only segments.
11 No weights are supported in BioXSD 1.0.
12 Numerical values are always signed double precision floats (8 bytes). A limited set of other value types is also allowed (e.g. sequence variation and alignments).
Overview of the reserved columns in the GTrack format and their associations to track type
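| Core property: | | | Gaps | Lengths1 | Values | | | Interconnections |
|---|---|---|---|---|---|---|---|---|
| Reserved column: | genome (N) | seqid (N) | start (C) | end (C) | value (C) | strand (N) | id (N)2 | edges (C) |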
| Points (P) | ? | ! | ✓ | . | . | ? | ? | . |
| Segments (S) | ? | ! | ✓ | ✓ | . | ? | ? | . |
| Genome Partition (GP) | ? | ! | . | ✓ | . | ? | ? | . |
| Valued Points (VP) | ? | ! | ✓ | . | ✓ | ? | ? | . |
| Valued Segments (VS) | ? | ! | ✓ | ✓ | ✓ | ? | ? | . |
| Step Function (SF) | ? | ! | . | ✓ | ✓ | ? | ? | . |
| Function (F) | ? | ! | . | . | ✓ | ? | ? | . |
| Linked Points (LP) | ? | ! | ✓ | . | . | ? | ✓ | ✓ |
| Linked Segments (LS) | ? | ! | ✓ | ✓ | . | ? | ✓ | ✓ |
| Linked Genome Partition (LGP) | ? | ! | . | ✓ | . | ? | ✓ | ✓ |
| Linked Valued Points (LVP) | ? | ! | ✓ | . | ✓ | ? | ✓ | ✓ |
| Linked Valued Segments (LVS) | ? | ! | ✓ | ✓ | ✓ | ? | ✓ | ✓ |
| Linked Step Function (LSF) | ? | ! | . | ✓ | ✓ | ? | ✓ | ✓ |
| Linked Function (LF) | ? | ! | . | . | ✓ | ? | ✓ | ✓ |
| Linked Base Pairs (LBP) | ? | ! | . | . | . | ? | ✓ | ✓ |
C Core reserved column (defines track type)
N Non-core reserved column (reserved, but does not define track type)
✓ Column is mandatory
? Column is optional
. Column is not allowed
! Property must be present, either as a column or in a bounding region specification
1 The length is the difference between the end and the start position, or, if the start column is not present, the difference between the current end position and the previous end position.
2 The non-core reserved column id is required when the edges column is present.
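The length rule in footnote 1 can be illustrated with a short sketch. It assumes the default coordinate convention (end position exclusive), and for files without a start column it measures the first element from an assumed bounding-region start. The function name `element_lengths` and the parameter `region_start` are hypothetical.

```python
def element_lengths(ends, starts=None, region_start=0):
    """Lengths of track elements, following footnote 1.

    With a start column:    length = end - start.
    Without a start column: length = current end - previous end, where the
    first element is measured from region_start (an assumed bounding-region
    start position).
    """
    if starts is not None:
        if len(starts) != len(ends):
            raise ValueError("start and end columns differ in length")
        return [end - start for start, end in zip(starts, ends)]

    lengths = []
    previous_end = region_start
    for end in ends:
        lengths.append(end - previous_end)
        previous_end = end
    return lengths


# Sparse case (for example segments): both columns are present.
print(element_lengths([20, 45], starts=[10, 40]))  # prints [10, 5]

# Dense case (for example genome partition): only the end column is present.
print(element_lengths([10, 25, 30]))  # prints [10, 15, 5]
```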
Figure 4. GTrack example files. A) GTrack version of the GFF file in Figure 3A. GTrack conversions of GFF vary according to the set of attributes present in the GFF file. The column selected as the main value may also be changed. B1 and B2) Two possible GTrack conversions of the BED file in Figure 3B. In the direct variant (B1) only a "track type" header line and a column specification line are added. The exon positioning will in this case not be understood by a general GTrack parser. The linked variant (B2) expands the exons into subsegments that link to their parent gene segment. C1 and C2) GTrack conversions of the WIG files in Figure 3C. The variableStep file has sparse track elements covering single base pairs, with associated values. The track is thus of type valued points. The fixedStep file contains dense data, with the same values for a series of consecutive base pairs. The track is thus of type step function. Note that in the last example, the end values are used for positioning. D) Example GTrack file of type linked genome partition. Here two graphs are defined, one directed and one undirected. To change the active graph, the edges column in the column specification line needs to be changed, in addition to the "undirected edges" header line. The example GTrack files are available at [20]. BioXSD 1.1 versions of the examples are available as follows: A [21], B1 & B2 [22], C1 [23], C2 [24], and D [25].
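The fixedStep case (C2) can be sketched as follows, assuming that the `start` attribute of a WIG fixedStep declaration is 1-based and that the output uses 0-based, end-exclusive coordinates. The function name `fixed_step_to_step_function` is hypothetical, and the sketch covers a single fixedStep block only; it is not the conversion tool from the GTrack website.

```python
def fixed_step_to_step_function(start, step, span, values):
    """Convert one WIG fixedStep block to step-function elements.

    start:  1-based start position from the fixedStep declaration line.
    step, span: as declared; the data are dense only when they are equal.
    values: the data values, one per step.

    Returns (region_start, [(end, value), ...]) in 0-based, end-exclusive
    coordinates, so that each element is positioned by its end value only.
    """
    if step != span:
        raise ValueError("dense conversion requires step == span")
    region_start = start - 1  # 1-based inclusive -> 0-based
    elements = [
        (region_start + (i + 1) * span, value)
        for i, value in enumerate(values)
    ]
    return region_start, elements


# A block starting at 1-based position 101, with two 5 bp steps.
print(fixed_step_to_step_function(101, 5, 5, [1.0, 2.5]))
# prints (100, [(105, 1.0), (110, 2.5)])
```

Because the elements are contiguous, no start column is needed: each element begins where the previous one ends, and the first begins at the region start.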
Overview of the header variables of the GTrack format
| Header variable | Description | Default value |
|---|---|---|
| GTrack version | Version of the GTrack specification used | 1.0 |
| Track type | Track type of the GTrack file | segments |
| Value type | The kind of content accepted in the value column | number |
| Value dimension | The dimension of the content in the value column | scalar |
| Undirected edges | Whether all edges are undirected | false |
| Edge weights | Whether the edges have weights | false |
| Edge weight type | The kind of content accepted as edge weights | number |
| Edge weight dimension | The dimension of the edge weights | scalar |
| Uninterrupted data lines | Whether it is guaranteed that the data lines are not interrupted by bounding region specification lines or comments | false |
| Sorted elements | Whether it is guaranteed that all bounding regions and track elements come in sorted order | false |
| No overlapping elements | Whether it is guaranteed that no two track elements overlap | false |
| Circular elements | Whether any track elements or bounding regions cross the coordinate borders of a circular sequence | false |
| 1-indexed | Whether the coordinates start at 1 (0 if false) | false |
| End inclusive | Whether the coordinate specified in the end column is included in the interval | false |
| *Value column | The name of the column to be used as the 'value' column | value |
| *Edges column | The name of the column to be used as the 'edges' column | edges |
| *Fixed length | Fixed length of all track elements | 1 |
| *Fixed gap size | Fixed-size gaps between all neighboring track elements | 0 |
| *Fixed-size data lines | Whether each data line has an exact size in terms of number of characters | false |
| *Data line size | The size of each data line in terms of number of characters | 1 |
| *GTrack subtype | The name of the subtype of the GTrack format specification used for the file | (empty string) |
| *Subtype version | The version of the GTrack subtype | 1.0 |
| *Subtype URL | URL to a GTrack file used as a specification/model for the GTrack subtype | (empty string) |
| *Subtype adherence | Regulates the way a GTrack file may override the subtype specification | free |
All header variables not specified in a GTrack file retain their default values.
* Defined in the extended part of the GTrack specification. See the GTrack specification (Additional file 1) for more details.
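The fallback to default values could be handled by a parser roughly as follows. This is a sketch under stated assumptions, not the reference parser: it assumes that header lines have the form `##name: value`, that variable names are case-insensitive, and that lines starting with `###` are column specification lines to be skipped. Only the core header variables from the table are included, and values are kept as strings. The function name `read_headers` is hypothetical.

```python
# Core header variables and their defaults, copied from the table above.
DEFAULTS = {
    "gtrack version": "1.0",
    "track type": "segments",
    "value type": "number",
    "value dimension": "scalar",
    "undirected edges": "false",
    "edge weights": "false",
    "edge weight type": "number",
    "edge weight dimension": "scalar",
    "uninterrupted data lines": "false",
    "sorted elements": "false",
    "no overlapping elements": "false",
    "circular elements": "false",
    "1-indexed": "false",
    "end inclusive": "false",
}


def read_headers(lines):
    """Return the header variables of a file, filling in defaults.

    Assumed syntax: '##name: value' for header lines; lines starting with
    '###' (column specification) and all other lines are ignored.
    """
    headers = dict(DEFAULTS)
    for line in lines:
        if line.startswith("##") and not line.startswith("###"):
            name, separator, value = line[2:].partition(":")
            if separator:
                headers[name.strip().lower()] = value.strip()
    return headers


example = [
    "##Track type: valued points",
    "###seqid\tstart\tvalue",
    "chr1\t10\t0.5",
]
headers = read_headers(example)
print(headers["track type"])  # prints valued points
print(headers["1-indexed"])   # prints false
```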
Figure 5. GTrack subtype example. A) An ad hoc GTrack subtype specification based on the example GTrack file in Figure 4A, which is a conversion from the GFF file in Figure 3A. This and other GTrack subtypes are available from the GTrack website [20]. B) A minimal GTrack header, parsable by fully compliant GTrack parsers. Note that the "Expand GTrack headers" tool, available from the GTrack website [20], can be used to expand headers of GTrack files using subtypes, in order for such files to be used in simpler parsers that do not support the subtype functionality.
The allowed content of a BioXSD FeatureRecord
| | Notes | May further contain |
|---|---|---|
| Name | 1 | |
| Ontology concepts | 1 | |
| Synonyms | ||
| Textual note | ||
| References | to database entries, databases, ontology concepts, other feature types | type of relationship with the referenced object2 |
| More specific type of feature | name and/or concepts, synonyms, database entries | |
| More generic class of feature types | name and/or concepts | |
| Position | strand, certainty | |
| Scores ( | double-precision signed floats (8 bytes), or any well-formatted strings* | unit, index, type of score2, note, position, provenance metadata |
| Evidence | references to databases, tools, and citations; scores, verdict, reliability, provenance metadata | |
| Name | ||
| Note | ||
| Alignments | alignment- and aligned sequence-specific scores, gaps, frameshifts, directions, note, provenance metadata | |
| Sequence variation | variants, canonical variant, scores, position | |
| Frame | ||
| CDS phase | ||
| References | to ontology concepts, database entries, other feature occurrences ( | type of relationship with the referenced object2; scores of the relationship ( |
1 At least one of these two is mandatory.
2 By any ontology concept, referred to by a concept URI, identifier, or term; or by a custom term if no ontology concept is available.
3 Points are bases/residues or insertions between them.
4 For example if annotating the position of a regulatory element of a coding sequence, or relations between genes or protein domains.
5 Positions can form multi-segment subsequences, multi-point tuples, and can be combined within feature occurrences according to users' needs. The positions are always 1-based. The feature occurrence may apply to the whole sequence (being a non-positioned sequence property).
* Added in BioXSD version 1.1.
Overview of the webtools available from the GTrack website [20]
| GTrack supporting tools | Description |
|---|---|
| Show GTrack specification | Displays an HTML version of the GTrack specification |
| Validate GTrack file | Checks whether a GTrack file complies with the specification |
| Convert tabular file to GTrack | Converts any tabular file to GTrack |
| Convert file to/from GTrack | Converts to and from common tabular formats (GFF, BED, WIG, bedGraph) |
| Expand GTrack headers | Expands partially completed GTrack headers based on data contents |
| Standardize GTrack file | Converts a GTrack file to track type "linked valued segments" using the default indexing scheme |
| Sort GTrack file | Sorts a GTrack file (including bounding regions) |
| Complement GTrack columns | Complements the columns of a GTrack file based on another GTrack file |
All tools are implemented as part of the Genomic HyperBrowser [10,28] and available under the GPL license, version 3 [34].