Wynand Alkema, Jos Boekhorst, Michiel Wels, Sacha A F T van Hijum.
Abstract
In the production of fermented foods, microbes play an important role. Optimization of fermentation processes or starter culture production has traditionally been a trial-and-error approach inspired by expert knowledge of the fermentation process. Current developments in high-throughput 'omics' technologies allow the development of more rational approaches to improve fermentation processes from both the food functionality and the food safety perspective. Here, the authors thematically review typical bioinformatics techniques and approaches to improve various aspects of the microbial production of fermented food products and food safety.
Keywords: bioinformatics; food; genomics; microorganisms; predictive models
Year: 2015 PMID: 26082168 PMCID: PMC4793891 DOI: 10.1093/bib/bbv034
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Data and bioinformatics applied in food application areas. The food application areas (right panel) are central to this figure. Different data sets can be obtained from organisms (data reservoir); their abbreviations are given in parentheses. Middle panel: for each main application area, one of the many important methods is shown, together with other relevant methods and data sources (see Table 1 for an explanation). Interpretation example: for safety assessment, genomes (G), literature (L) and phenotypes (H) are used with the gene function annotation (2.3), orthology (2.4), comparative genomics (2.5) and predicting phenotypes (4) techniques (see Table 1).
Table 1. Glossary of food bioinformatics concepts and techniques, their explanation and their application
| Term | Description and examples of tools |
|---|---|
| 1. Big data/grid/cloud/ | With the increasing volume and heterogeneity of data sets (often referred to as “Big Data”), high performance computing is needed for analysis of the data. Many bioinformatics methods have been adapted to run on clusters of multiple computers (grid computing) and on large remotely located servers (cloud computing) [ |
| 1.1 Data mining | Statistical and machine learning techniques to determine trends in typically large data sets. Unsupervised techniques (sample grouping is not explicitly used in the analysis) include: principal component analysis (PCA) and clustering algorithms (e.g. K-means, hierarchical). Supervised techniques (sample grouping is taken into account) include: ANOVA, Mann–Whitney U test, partial least squares analysis (PLS), machine learning (e.g. by support vector machines (SVM) [ |
| 1.2 Virtual machines (VM) | A large computer file (disk image) that consists of an operating system (e.g. Linux), software tools and data. The image can be run on a physical computer using virtual machine software that emulates a computer; in other words, a VM is a computer within a computer. The advantages of VMs are that they are portable (they can be run on many different types of computer hardware), easy to back up and more straightforward to maintain. Examples of the use of VMs are the generic bioinformatics tools in the NEBC Bio-Linux distribution [
| 1.3 Databases | Databases are organized collections of biological data. Bioinformatics is only successful if databases with high-quality data are available, together with structured vocabularies that describe the content of the data sets. An updated overview of relevant biological databases can be found here: |
| 2. Genome sequencing | Determining the complete genome sequence of a microbial strain of interest. Next-generation sequencing (NGS) techniques allow for high-throughput and high-quality sequencing results. In particular, the combination of different techniques (e.g. Illumina and Pacific Biosciences, or PacBio) results in high-quality (circular) genomes [
| 2.1 Sequencing data (FASTQ) | Sequencing data are represented in FASTQ format. These files provide, next to the raw sequence data, additional information regarding the quality of the reads. In this manner, quality control and trimming can be applied. |
| 2.2 Assembly | Raw sequence reads of different NGS technologies can be assembled into contigs, long stretches of DNA sequence representing part of the genome. Most of the assembly methods are based on alignment of sequence reads with each other ( |
| 2.2.1 Scaffolding | Organizing the contigs from the assembly (2.2) into larger, gapped DNA sequences. Some NGS techniques (e.g. Illumina) allow the synthesis of paired-end (PE) or mate-pair (MP) libraries: libraries with a fixed insert size that are sequenced at both ends. As the reads span a larger DNA fragment, the matched read pairs can be used to order contigs, even if the sequence in between the contigs has not been assembled. In general, most assembly tools allow for scaffolding, but dedicated tools also exist, such as SSPACE [
| 2.2.2 Gap closure strategies | After scaffolding, genome sequences will most often contain gaps. Common strategies to fill these gaps are generating new sequencing data using, for example, PacBio’s long reads [ |
| 2.3 Gene function annotation | Gene function is typically inferred from similarity in amino acid sequence. Gene functions can be predicted by comparing sequences to databases containing genes with known functions with tools like RAST [ |
| 2.4 Orthology | Genes in different organisms are orthologous when they were the same gene in the last common ancestor. Reconstructing the evolutionary history of genes allows the prediction of functional equivalence (i.e. orthologous genes are likely to have similar functions). Tools are OrthoMCL [ |
| 2.5 Comparative genomics | All analyses in which genome sequences or genome content of multiple organisms are compared. |
| 2.6 Metabolic modelling | Prediction of growth, and recruitment of metabolic pathways, of microbes by using the genome sequence as an inventory of all possible metabolic reactions. Genome-scale metabolic models can be constructed using automated [ |
| 3. Microbiome analysis | The collection of all microbes present in a particular niche is termed a microbiome. Analysis of microbiomes can be done using different next-generation sequencing-based techniques (see below). |
| 3.1 16S rRNA sequencing | 16S amplicon sequencing is the generation of sequence reads from conserved regions of the 16S rRNA gene. Amplicon sequencing (e.g. by Illumina) is used to identify the bacterial (and sometimes archaeal) component of microbial communities. Examples of software to infer community composition from sequencing data are QIIME [
| 3.1.1 Functional prediction | 16S sequences derived from a particular ecological niche indicate the taxa present and their relative abundance. From these data, the presence of gene functions in those taxa can be predicted using, e.g., PICRUSt [
| 3.2 Shotgun metagenomics and metatranscriptomics | Random fragments of the DNA or (enriched) mRNA of a given microbiome are sequenced with next-generation sequencing [ |
| 3.2.1 Assembly | Using the sequence overlap, the DNA/RNA-derived sequences can be assembled into larger contigs, see [ |
| 3.2.2 Annotation | Similar to the genome of a single bacterium, the sequences of a metagenome can be functionally and taxonomically annotated by comparing (assembled) sequences or predicted gene products against one or more reference databases containing sequences with known functions and known taxonomic origin. Gene context, such as operon structure, is, however, largely missing in shotgun metagenomics reads/contigs. A few tools are: PhymmBL [
| 3.3 Strain typing and tracking | Pinpointing the presence of a particular microbe (strain) in a biological sample. Using MLST markers [ |
| 4. Predicting phenotypes | Gene–trait matching: machine learning or statistical methods are used to predict the phenotype of a bacterial strain based on the presence/absence of particular genes [
| 5. Metabolomics | The simultaneous measurement of multiple metabolites in biological samples [
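
As an illustration of glossary entry 2.1, the following minimal Python sketch shows how the per-base quality information in a FASTQ file can be used for simple quality trimming. It is not taken from the reviewed tools. The function names (`read_fastq`, `phred_scores`, `trim_3prime`), the Phred+33 quality encoding and the quality threshold of 20 are all assumptions made for this example, and the parser only handles the common layout of four lines per record.

```python
from io import StringIO
from typing import Iterator, List, TextIO, Tuple

# Assumption: qualities are Phred+33 encoded (Sanger / Illumina 1.8+ style).
PHRED_OFFSET = 33


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual: str) -> List[int]:
    """Convert a quality string into integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in qual]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q (hypothetical cutoff)."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical single-read input: 'I' encodes Q40 and '#' encodes Q2.
demo = StringIO("@read1\nACGTACGT\n+\nIIIIII##\n")
for read_id, seq, qual in read_fastq(demo):
    print(read_id, *trim_3prime(seq, qual))  # prints: read1 ACGTAC IIIIII
```

Dedicated quality-control and trimming tools typically apply more elaborate strategies, such as sliding-window averages and adapter removal; the sketch only shows where the quality information sits in the format and how it can drive a trimming decision.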