Vasileios Lapatas, Michalis Stefanidakis, Rafael C Jimenez, Allegra Via, Maria Victoria Schneider.
Abstract
Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of experimental findings. These activities are often perceived as a role that bioinformaticians and computer scientists have to take on with little or no input from the experimental biologist. On the contrary, biological researchers, being the producers and often the end users of such data, have a big role in enabling biological data integration. The quality and usefulness of data integration depend on the existence and adoption of standards, shared formats, and mechanisms that are suitable for biological researchers to submit and annotate the data, so that the data can be easily searched, conveniently linked and consequently used for further biological analysis and discovery. Here, we provide background on what data integration is from a computational science point of view, how it has been applied to biological research, which key aspects contributed to its success, and future directions.
Keywords: Bioinformatics; Data driven; Data integration; Open sciences; Standards
Year: 2015 PMID: 26336651 PMCID: PMC4557916 DOI: 10.1186/s40709-015-0032-5
Source DB: PubMed Journal: J Biol Res (Thessalon) ISSN: 1790-045X Impact factor: 1.889
Terminology
| Term | Definition |
|---|---|
| Schema | A structured and “queryable” way of storing data |
| Database | A single schema or a collection of schemata |
| Sources | A number of databases that contain data. Data that reside in each source can duplicate and/or complement data from other sources |
| Data Integration | The process of combining data that reside in different sources, to provide users with a unified view of such data |
| Data Standards | Agreements on representation, format, and definition for common data |
| Data Formats | A structured way to represent data and metadata in a file |
| Data Warehousing | Model for integrating data where the data from different sources reside on a central repository (aka data warehouse) |
| Federated Databases | Model for integrating data where the data reside on the original sources and users are provided with a unified view of the data based on mapping mechanisms of the information |
| Linked Data | The network of interlinked data that is available on the web. It is used to automatically share semantically rich information and represents the biggest attempt to convert significant amounts of human knowledge across all fields in a computer readable format |
| Ontology | A structured way of describing data, often presented in a computer-readable format. In bioinformatics, ontologies are sets of unambiguous, universally agreed terms used to describe biological phenomena and “entities”, their properties and their relationships |
| Controlled Vocabulary | A collection of terms for describing a certain domain of interest |
| Unique Identifier | A unique representation for a biological entity (molecule, organism, ontology term, etc.). Usually an alphanumeric string that is used to refer to this entity and distinguishes it from others (much like ID or passport number in humans). |
| Metadata | Data describing data, i.e., additional information (e.g., a comment, explanation, attributes, etc.) for a specific biological entity or process. As an example, in the context of an ontology, this is used to specify significant properties of the ontology |
| Annotation | The process of attaching relevant information (metadata) to a raw biological entity |
| Automatic Annotation | Annotation performed by computer software, often by transferring information from one source to another. This is a way of producing large amounts of metadata |
| Manual Annotation | Annotation performed by a person, as opposed to automatic annotation |
| GUI | Graphical User Interface. The way a user interacts with a computer through graphical icons and visual indicators such as buttons and forms. In this paper, the term GUI refers to interfaces that allow biologists to search, read and edit integrated biological data |
| API | Application Programming Interface. A set of tools and protocols that a power user can use to gain automated access to functionality and/or data developed or gathered by another individual or organisation |
| UX | User eXperience. The process of improving user satisfaction by focusing on the usability of a given product |
| Visualisation Tools | Applications that help biologists view data in a more human-friendly way, such as 3D or graph representations (e.g., Cytoscape for visualising complex networks) |
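The definitions of "Sources", "Unique Identifier" and "Data Integration" above can be illustrated with a minimal sketch. It is not taken from the paper: the two sources, their identifiers (`ID1`, `ID2`, `ID3`), the field names and the `integrate` function are all hypothetical, and the "first value wins" rule for duplicated fields is only one of several possible conflict policies.

```python
from collections import defaultdict

# Toy records from two hypothetical sources, keyed by a shared unique identifier.
# All identifiers and values are invented for illustration.
source_a = {
    "ID1": {"name": "protein alpha"},
    "ID2": {"name": "protein beta"},
}
source_b = {
    "ID2": {"name": "protein beta", "organism": "example organism"},
    "ID3": {"organism": "another organism"},
}


def integrate(*sources):
    """Merge records that share an identifier into one unified view."""
    unified = defaultdict(dict)
    for source in sources:
        for identifier, record in source.items():
            for field, value in record.items():
                # Duplicated fields keep the first value seen;
                # complementary fields are added to the record.
                unified[identifier].setdefault(field, value)
    return dict(unified)


unified_view = integrate(source_a, source_b)
```

Here `ID2` appears in both sources: its `name` is duplicated and its `organism` is complementary, so the unified record carries both fields. Copying the merged records into one place, as this sketch does, is closer to the data warehousing model than to a federated one, where the records would stay in their original sources.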
Fig. 1 Data integration methodologies. This figure illustrates six major types of data integration methodologies in biology
Fig. 2 Current state. This figure illustrates a simplified view of the current state of biological data and tools
List of data standards initiatives
| Acronym | Name | Goal | URL | PMID |
|---|---|---|---|---|
| OBO | The Open Biological and Biomedical Ontologies | Establish a set of principles for ontology development to create a suite of orthogonal interoperable reference ontologies in the biomedical domain | | 17989687 |
| CDISC | Clinical Data Interchange Standards Consortium | Establish standards to support the acquisition, exchange, submission and archive of clinical research data and metadata | | 23833735 |
| HUPO-PSI | Human Proteome Organisation-Proteomics Standards Initiative | Define community standards for data representation in proteomics to facilitate data comparison, exchange and verification | | 16901219 |
| GA4GH | Global Alliance for Genomics and Health | Create interoperable approaches to catalyze projects that will help unlock the great potential of genomic data | | 24896853 |
| COMBINE | Computational Modeling in Biology | Coordinate the development of the various community standards and formats for computational models | | 25759811 |
| MSI | Metabolomics Standards Initiative | Define community-agreed reporting standards, which provide a clear description of the biological system studied and all components of metabolomics studies | | 17687353 |
| RDA | Research Data Alliance | Build the social and technical bridges that enable open sharing of data across multiple scientific disciplines | | |
Fig. 3 Ideal state. This figure illustrates a simplified view of an ideal state of biological data and tools
Most commonly used data formats in bioinformatics
| Data format class | General data-interchange formats | Nucleotide sequence data | Protein sequence data | Structural data | Sequence alignment | Other data types (PPI, etc) |
|---|---|---|---|---|---|---|
| Table | CSV, TSV | BED; GFF | GFF, Uniprot-GFF | PSF(D), MMCIF(D) | SAM(D) | |
| FASTA-like | | FASTA; FASTQ | FASTA, PIR | | SAM(M) | Wig |
| GenBank-like | | GenBank; EMBL | Uniprot-TEXT | PDB, PSF(M), MMCIF(D) | CLUSTAL, MSF, PHYLIP(D) | |
| Tag-structured | HTML; XML; JSON | SBOL-XML | Uniprot-XML; Uniprot-RDF/XML | | | PSI MI-XML; PSI-PAR |
D = data; M = metadata. Formats appearing in more than one class are a mixture of classes
Fig. 4 Selected parts of a FASTQ file. In this format, declaration lines start with one of two characters (“@” and “+”), which mark different data types (the raw sequence and the sequence quality values, respectively)
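The FASTQ layout described in this caption can be sketched as a small reader. This is an illustrative sketch, not code from the paper: it assumes the common four-line layout (an "@" declaration line, the sequence, a "+" declaration line, the quality string) with no line wrapping, and the names `FastqRecord` and `read_fastq` as well as the two example reads are hypothetical. Records are split by position rather than by leading character, because a quality string may itself begin with "@".

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from unwrapped, four-line-per-record FASTQ text."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for start in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[start:start + 4]
        # "@" declares the raw sequence, "+" declares the quality values.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


# Two invented reads; the second quality string starts with "@" on purpose.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n@#!I\n")
records = list(read_fastq(example))
```

For real data, an established parser is usually a better choice than a hand-written one, since it also handles variants of the format that this sketch deliberately ignores.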
Fig. 5 Selected parts of the GenBank entry DQ408531. The complete entry can be found at http://www.ncbi.nlm.nih.gov/nuccore/DQ408531
Fig. 6 Selected parts of the Uniprot entry P01308 in XML format. The complete entry can be found at http://www.uniprot.org/uniprot/P01308.xml
Fig. 7 Selected parts of a SAM file
Common visualisation tools in the area of “Interaction Network Visualisation”
| Name of resource | What it does | URL |
|---|---|---|
| BicOverlapper | Visualisation of biclusters combined with profile plots and heat maps | |
| BiGGEsTS | Heat map-based bicluster visualisation | |
| Brain Explorer | Visualisation of 3D transcription data in the central nervous system | |
| Data Matrix Viewer | Simple profile plot visualisation; supports Gaggle | |
| EXPANDER | Heat maps, scatter plots and profile plots of cluster averages | |
| GENESIS | Analysis suite; offers several interactive visualisations | |
| geWorkbench | Modular suite; heat maps, dendrograms, profile and scatter plots | |
| Hierarchical Clustering Explorer | Linked heat map, profile and scatter plots; systematic exploration | |
| Java TreeView | Linked heat maps, karyoscopes, sequence alignments, scatter plots | |
| Mayday | Modular suite; many linked visualisations; enhanced heat map | |
| MultiExperiment Viewer | Analysis suite; heat maps, dendrograms, profile and scatter plots | |
| PointCloudXplore | Visualisation of 3D transcription data in Drosophila embryos | |
| TimeSearcher | Exploration and analysis of time series; advanced profile plots | |
| R/BioConductor Geneplotter | Karyoscope-style plots and other visualisations | |
| GenePattern | Modular analysis platform; several visualisation modules available | |
| Cytoscape | Open source software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data | |