Claudia Manzoni1,2, Demis A Kia2, Jana Vandrovcova2, John Hardy2, Nicholas W Wood2, Patrick A Lewis1,2, Raffaele Ferrari2.
Abstract
Advances in the technologies and informatics used to generate and process large biological data sets (omics data) are promoting a critical shift in the study of biomedical sciences. While genomics, transcriptomics and proteomics, coupled with bioinformatics and biostatistics, are gaining momentum, they are still, for the most part, assessed individually with distinct approaches generating monothematic rather than integrated knowledge. As other areas of biomedical sciences, including metabolomics, epigenomics and pharmacogenomics, are moving towards the omics scale, we are witnessing the rise of inter-disciplinary data integration strategies to support a better understanding of biological systems and eventually the development of successful precision medicine. This review cuts across the boundaries between genomics, transcriptomics and proteomics, summarizing how omics data are generated, analysed and shared, and provides an overview of the current strengths and weaknesses of this global approach. This work is aimed at students and researchers seeking knowledge outside of their field of expertise and fosters a leap from the reductionist to the global-integrative analytical approach in research.
Year: 2018 PMID: 27881428 PMCID: PMC6018996 DOI: 10.1093/bib/bbw114
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Overview of the progressive advance in the methods to study genes, transcripts and proteins in the informatics sciences. The arrow represents the development, over time, of the many disciplines now involved in biomedical science accompanied by the fundamental advances in informatics and community resources. The broad roots of the omics revolution are represented by the wider start of the arrow before the year ‘1950’, when the foundations for a paradigm shift in science (from single observations to systems dynamics) were laid.
General critical considerations on applying bioinformatics to the biomedical sciences. Problems that can be addressed by individual researchers or research groups or that should be addressed by a large community effort have been flagged with * or °, respectively.
| *Online tools are used with little to no criticism | Using inappropriate tools for a particular analysis; using default settings that may not be tailored to the research purpose; accepting an output without much criticism, leading to mis/over-interpretation of results | For informaticians: make the description of the tool as simple as possible. For end users: understand the principles underlying a tool before using it |
| *Analysis can be run with different, though equally valid, algorithms and statistical methods | The wealth of tools available feeds the temptation to pick the one that either has the friendliest user interface or gives the most interesting result. Results obtained using different tools may differ | As with technical replicates in a wet laboratory, a good bioinformatics analysis must give consistent results even with different methods. Repetition of the analysis with different tools supports consistency and reproducibility of findings |
| *Analysis may require the subjective selection of parameters | The same tool run with different parameters will likely generate different results | Perform sensitivity analysis using alternative parameters |
| °Databases are on-going projects | Databases are constantly updated. Analytical tools that rely on databases may become out of date if their libraries are not updated periodically. Published bioinformatics analyses become out of date because of advances in the databases/reference sets | Use software and online tools with recent/frequent updates. Bioinformatics analyses are complete only to the extent of the completeness of the reference database used. Always document the software version and code used for a particular analysis. Code maintainers should keep archival copies of old software and code versions (in case replications are necessary) |
| *Statistical methods were originally designed for ‘small’ scale data | If statistical methods are tailored for small-scale data, eventually the | Be cautious in the statistical approaches used, and ask guidance from experts |
| °Analytical tools | Some of the resources are accessible only after a fee is paid. This largely restricts their use to niche groups or research groups with funds for bioinformatics analysis | Free omics data access and usage is fundamental for reducing the fragmentation of research and stimulating the improvement of data integration, analysis and interpretation. Foster open data policies with the support of governments and funding agencies |
| *Hypothesis-driven analyses | Results might be biased by the initial hypothesis. Some outcomes might be inflated because of excessive targeting through the research tools being used (primers or probes, particular protein interactions, tissue-specific data) | Consider whether the experiment or analysis needs to be hypothesis driven or can be hypothesis free; use the right techniques/tools and analysis to address the research question (microarray versus NGS, association versus rare variants analysis, tissue-specific versus all tissues, eQTL versus epigenomics, etc.) |
| *Experimental design | Wrong experimental design (without considering power calculation, adequate controls, tissue types, single cells versus tissue homogenates, etc) may lead to biased or underpowered results | Like any experiment, the analysis should be planned within a properly developed pipeline that takes into account data source, sample size, controls, techniques to generate data, analyses to apply to data |
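The recommendation above to perform a sensitivity analysis with alternative parameters can be illustrated with a minimal Python sketch. It assumes a hypothetical table of per-gene p-values (`P_VALUES`) and a set of candidate significance thresholds, neither of which comes from the review; it simply reruns the same selection step under each threshold and reports how much the result sets overlap, using the Jaccard index as one possible measure of consistency.

```python
from itertools import combinations

# Hypothetical per-gene p-values (assumption for illustration only);
# in practice these would come from the analysis tool being evaluated.
P_VALUES = {
    "geneA": 0.0004,
    "geneB": 0.009,
    "geneC": 0.03,
    "geneD": 0.048,
    "geneE": 0.2,
}


def significant(p_values, alpha):
    """Return the set of features whose p-value is below the threshold alpha."""
    return {name for name, p in p_values.items() if p < alpha}


def jaccard(a, b):
    """Jaccard index of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sensitivity(p_values, alphas):
    """Rerun the selection under each threshold and compare every pair of results."""
    results = {alpha: significant(p_values, alpha) for alpha in alphas}
    overlaps = {
        (a1, a2): jaccard(results[a1], results[a2])
        for a1, a2 in combinations(alphas, 2)
    }
    return results, overlaps


results, overlaps = sensitivity(P_VALUES, [0.05, 0.01, 0.001])
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A low overlap between settings suggests that conclusions depend heavily on the chosen parameter and should be reported with that caveat. The same comparison could be applied to the outputs of different tools rather than different thresholds.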
Figure 2. Overview of the types of variants in the genome, their potential consequences and the methods/techniques to untangle them.
General critical considerations on applying bioinformatics to genomics. Problems that can be addressed by individual researchers or research groups or that should be addressed by a large community effort have been flagged with * or °, respectively.
| *Genome build | Analysing data using inconsistent genome builds can lead to spurious results | Use the correct genome build when mapping |
| *Multiple databases | Difficult to select among the many databases that exist | Investigate their limitations, including lack of corrections or updates to annotations |
| °Large-scale databases | With large-scale data there may be a decrease of phenotype data quality | Consider case ascertainment methods and length of follow-up of controls |
| °Variant effect prediction | Effect prediction tools are not infallible | Verify segregation, absence from controls and |
| °Linear genome reference | A linear genome reference is not representative across individuals and populations | GRCh38 addresses this issue by providing alternative sequence representations for regions where a consensus sequence is difficult to determine |
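The genome build consideration in the table above lends itself to a simple safeguard: refuse to combine data sets whose coordinates refer to different builds, because the same chromosome and position can denote different loci in each. The Python sketch below is one way to express that check. The `VariantSet` structure, the `merge_variant_sets` helper and the example coordinates are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


class BuildMismatchError(ValueError):
    """Raised when variant sets annotated to different genome builds are combined."""


@dataclass(frozen=True)
class VariantSet:
    name: str
    genome_build: str    # e.g. "GRCh37" or "GRCh38"
    variants: frozenset  # (chromosome, position, ref, alt) tuples


def merge_variant_sets(variant_sets):
    """Merge variant sets only if they all share the same genome build."""
    builds = {vs.genome_build for vs in variant_sets}
    if len(builds) > 1:
        raise BuildMismatchError(f"Inconsistent genome builds: {sorted(builds)}")
    return frozenset().union(*(vs.variants for vs in variant_sets))


# Hypothetical example data (coordinates are made up).
cohort_a = VariantSet(
    "cohort_a", "GRCh38",
    frozenset({("1", 1000, "A", "G"), ("2", 2000, "C", "T")}),
)
cohort_b = VariantSet(
    "cohort_b", "GRCh38",
    frozenset({("2", 2000, "C", "T"), ("3", 3000, "G", "A")}),
)
legacy = VariantSet("legacy", "GRCh37", frozenset({("1", 1000, "A", "G")}))

merged = merge_variant_sets([cohort_a, cohort_b])
print(len(merged))

try:
    merge_variant_sets([cohort_a, legacy])
except BuildMismatchError as err:
    print(err)
```

In a real pipeline, a data set on an older build would need its coordinates converted to the target build before merging; the check here only makes the mismatch visible instead of letting it pass silently.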
Figure 3. Summary of various features associated with either RNA-microarrays or RNA-sequencing data generation and analysis.
General critical considerations on applying bioinformatics to transcriptomics. Problems that can be addressed by individual researchers or research groups or that should be addressed by a large community effort have been flagged with * or °, respectively.
| °The transcriptome is cell-specific | Use of RNA data from cells/tissues not specific for the aims of a study may lead to misleading results. Many RNA data sets are based on tissue homogenates | Use RNA data obtained from source material relevant to the planned study. Be aware of the possibility of contamination from different cell types in data originating from homogenates. Establish a worldwide project for a bank of well-defined human cell lines representing all tissues and define their transcriptome at different times of the cell cycle to generate a ‘reference-transcriptome’ |
| °The transcriptome is dynamic | The generalization of RNA data can lead to misleading interpretations | Be aware that data might reflect a particular cellular phase, or metabolism influenced by micro-environmental stimuli |
| °e/sQTLs depend on temporospatial variables | The generalization of e/sQTL results can lead to misleading interpretations | e/sQTLs depend on temporal (cell cycle/age) and spatial (cells/tissue/micro-environment) variables: consider these as covariates during data analysis and/or interpretation |
Figure 4. Summary of protein structural features and methods to generate and analyse proteomics data. The crystal structure of the haeme cavity of the haemoglobin of Pseudoalteromonas haloplanktis (4UUR [75]) was downloaded from PDB and visualized by RasMol (http://www.openrasmol.org/Copyright.html#Copying).
General critical considerations on applying bioinformatics to proteomics. Problems that can be addressed by individual researchers or research groups or that should be addressed by a large community effort have been flagged with * or °, respectively.
| *Protein sequences undergo revision | Changes in the gene sequence and experimental protein sequencing confirmation will result in updates to the protein sequence in protein databases. Different bioinformatics tools are updated to different versions of the protein sequence databases | Always refer to the most recent protein sequence and, if old data are used, disclose the version of the protein structure of reference |
| °The same protein is classified through different protein IDs | Different databases classify the same protein under different IDs. This may result in mismatches between protein IDs across repositories as well as between protein and corresponding gene IDs. This causes misrepresentations or loss of relevant information | Revise the bioinformatics tools in use to allow for a comprehensive and straightforward conversion of protein IDs |
| °Proteins are annotated to different extents | The information collected in PPI databases is derived from small-scale hypothesis-driven experiments. Therefore, there is an intrinsic bias in that less studied proteins are less reported or missing in databases (ascertainment bias) | Consider that if data for a specific protein are unavailable, this may be because the target has not been studied or annotated yet |
| *The proteome is dynamic | Proteomic studies based on MS are normally hypothesis free but difficult to interpret, as the proteome is highly dynamic | Be aware that data might reflect a particular cellular phase, or metabolism influenced by micro-environmental stimuli |
| *Atlases reporting protein expression across tissues should be used carefully | Antibodies are used in immunohistochemistry to detect protein expression across different tissues. For some proteins, antibodies are not available or reliable | Consider the atlas as an indication; rely on the data only when antibodies and protocols with longer track records or those with multiple literature citations are used |
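The table above notes that mismatches between protein IDs across repositories can cause loss of relevant information. One way to make such loss visible is to report the IDs that fail to convert rather than dropping them. The Python sketch below assumes a small, hand-written mapping from UniProt accessions to gene symbols (`UNIPROT_TO_GENE`), standing in for a mapping table that would normally be obtained from a database ID-mapping service; the function name and the unmapped example ID `X00000` are hypothetical.

```python
# Illustrative mapping table (assumption): a real analysis would load a
# complete, versioned mapping from the relevant database instead.
UNIPROT_TO_GENE = {
    "P04637": "TP53",
    "P37840": "SNCA",
}


def convert_protein_ids(protein_ids, mapping):
    """Convert protein IDs to gene symbols, keeping track of IDs that fail to map."""
    mapped, unmapped = {}, []
    for protein_id in protein_ids:
        # Isoform-specific accessions carry a suffix such as "-2";
        # fall back to the canonical accession for the lookup.
        accession = protein_id.split("-")[0]
        gene = mapping.get(accession)
        if gene is None:
            unmapped.append(protein_id)
        else:
            mapped[protein_id] = gene
    return mapped, unmapped


# "X00000" is a made-up ID used to show how unmapped entries are reported.
mapped, unmapped = convert_protein_ids(
    ["P04637", "P37840-2", "X00000"], UNIPROT_TO_GENE
)
print(mapped)
print(unmapped)
```

Recording the unmapped list alongside the database version used for the mapping makes it easier to judge how much information was lost in the conversion.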
Figure 5. Scheme of a typical functional enrichment analysis. A sample and reference set are compared to highlight the most frequent (i.e. enriched) features within the sample set.
General critical considerations on applying bioinformatics to functional annotation analyses. Problems that can be addressed by individual researchers or research groups or that should be addressed by a large community effort have been flagged with * or °, respectively.
| *Enrichment portals run with different algorithms and statistical methods | The software package chosen for the analysis (library, algorithm and statistics) will influence the final result | At the moment, there is no gold standard method for enrichment. Use a minimum of three different portals to replicate and validate functional annotations |
| *Enrichment for GO terms may give generic results | GO terms are related through family trees: general terms are umbrella terms located at the top of the tree, and more specific terms are found moving gradually down towards the roots. General terms are overrepresented among the results of functional enrichment | The many very general (top of the tree) GO terms may be disregarded in favour of the more specific terms (roots), as they are less likely to provide useful biological meaning |
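The comparison of a sample set against a reference set described for functional enrichment analysis is often framed as an over-representation test. The Python sketch below uses a one-sided hypergeometric test as one common choice; it is not presented as the method used by any particular portal, and the counts in the example are hypothetical. It assumes the sample set is a subset of the reference set.

```python
from math import comb


def hypergeom_enrichment_p(sample_hits, sample_size, reference_hits, reference_size):
    """One-sided hypergeometric p-value for over-representation.

    Probability of observing at least `sample_hits` annotated features when
    drawing `sample_size` features without replacement from a reference of
    `reference_size` features, of which `reference_hits` carry the annotation.
    """
    max_hits = min(sample_size, reference_hits)
    total = comb(reference_size, sample_size)
    tail = sum(
        comb(reference_hits, k)
        * comb(reference_size - reference_hits, sample_size - k)
        for k in range(sample_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 20 reference genes, 5 of them annotated with a term;
# a sample of 5 genes contains 3 annotated ones.
p_value = hypergeom_enrichment_p(
    sample_hits=3, sample_size=5, reference_hits=5, reference_size=20
)
print(p_value)
```

In practice the test is repeated for every annotation term, so the resulting p-values need a multiple-testing correction, and the choice of reference set strongly affects the outcome. This is one reason why results can differ between portals.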
Figure 6. Overview of a global approach for the study of health and disease. Ideally, for individual samples, comprehensive metadata (0) should be recorded. To date, (1), (2) and (3) are being studied mainly as compartmentalized fields. A strategy to start integrating these fields currently relies on functional annotation analyses (4) that provide a valuable platform to start shedding light on disease or risk pathways (5). The influence of other elements such as epigenomics, pharmacogenomics, metabolomics and environmental factors on traits is important for a better and more comprehensive understanding of their pathobiology. The assessment and integration of all such data will allow for the true development of successful personalized medicine (6). Color codes: green = addressed and in progress; orange = in progress; red = not yet addressed; yellow = ideal but not yet fully implemented. The gradually darker shades of green and increased font sizes indicate the expected gradual increase in the translational power of global data integration.