Patrick König, Sebastian Beier, Martin Basterrechea, Danuta Schüler, Daniel Arend, Martin Mascher, Nils Stein, Uwe Scholz, Matthias Lange.
Abstract
Genebanks harbor a large treasure trove of untapped plant genetic diversity. A growing world population and a changing climate require an increase in the production and development of stress-resistant plant cultivars while decreasing the acreage. These requirements for improved plant cultivars can be supported by the broader exploitation of plant genetic resources (PGR) as inputs for genomics-assisted breeding. To support this process we have developed BRIDGE, a data warehouse and exploratory data analysis tool for genebank genomics of barley (Hordeum vulgare L.). Using efficient technologies for data storage, data transfer and web development, we facilitate access to digital genebank resources of barley by prioritizing the interactive and visual analysis of integrated genotypic and phenotypic data. The underlying data resulted from a barley genebank genomics study cataloging sequence and morphological data of 22,626 barley accessions, mainly from the German Federal ex situ genebank. BRIDGE consists of interactively coupled modules to visualize integrated, curated and quality-checked data, such as variation data, results of dimensionality reduction and genome-wide association studies (GWAS), phenotyping results, passport data as well as the geographic distribution of germplasm samples. The core component is a manager for custom collections of germplasm. A search module to find and select germplasm by passport and phenotypic attributes is included, as well as modules to export genotypic data in gzip-compressed variant call format (VCF) files and phenotypic data in MIAPPE-compliant ISA-Tab files. BRIDGE is accessible at the following URL: https://bridge.ipk-gatersleben.de.
Keywords: barley; data visualization; data warehouse; genebank genomics; genotyping; phenotyping; plant genetic resources; visual analytics
Year: 2020 PMID: 32595658 PMCID: PMC7300248 DOI: 10.3389/fpls.2020.00701
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
FIGURE 1. Screenshot of the layout and BRIDGE user interface. Users navigate to different parts of the page by selecting the menu items on the left, which are clustered into thematic groups and separated by horizontal lines to give a visual structure. Changing from one menu item to the next does not change the appearance of the web portal: the layout remains intact and the applied settings carry over to the entire page. This behavior follows the single-page application (SPA) paradigm. As an example, “Genetic Diversity” has been selected from the navigation menu to present the colorization of subsets in both the PCA plot and the distribution of accessions on a linked world map. Parameters of the visualization can be changed with the buttons, sliders and select boxes throughout the page. Detailed information regarding the selected subsets can be found underneath the plots.
FIGURE 2. The basic architecture of BRIDGE, showing the flow of data between the Data Warehouse, Web Server and Web Frontend with all available data domains. The Data Warehouse consists of several SQL tables in the in-house ORACLE RDBMS. The communication between the Data Warehouse and Web Server is based on the JDBC interface using SQL as query language. The Web Server provides a REST API for the Web Frontend to deliver the data points for all data visualization modules and to respond to search queries initiated by a user via the Web Frontend. Furthermore, the Web Server provides APIs for the export of genetic variant data as VCF files and the export of phenotypic data in MIAPPE-compliant ISA-Tab archives. The communication between the Web Server and Web Frontend is based on the HTTP protocol.
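The exported genotypic data arrive as gzip-compressed VCF files, which can be inspected with standard tooling. The following is a minimal sketch, not part of BRIDGE itself, that reads such a file with the Python standard library and reports the sample names and the number of variant records. The function name, the accession labels `acc_A` and `acc_B`, and the two demo records are hypothetical; the only assumption about the file is the standard VCF layout, in which sample columns follow nine fixed columns.

```python
import gzip
import io


def read_vcf_samples_and_count(handle):
    """Return (sample_names, n_variants) from a text-mode VCF handle."""
    samples = []
    n_variants = 0
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed VCF columns.
            samples = line.rstrip("\n").split("\t")[9:]
            continue
        if line.strip():
            n_variants += 1  # every remaining non-empty line is one variant
    return samples, n_variants


# Hypothetical two-sample, two-variant VCF, compressed in memory so the
# example runs without a downloaded file.
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tacc_A\tacc_B\n"
    "chr1H\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t1/1\n"
    "chr1H\t250\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/0\n"
)
compressed = gzip.compress(demo_vcf.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    samples, n_variants = read_vcf_samples_and_count(fh)

print(samples, n_variants)
```

For a file saved to disk, `gzip.open(path, "rt")` can be passed a file path instead of the in-memory buffer.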
FIGURE 3. Screenshots illustrating the most important search and visualization features, which can be used as an entry point for exploratory data analysis. (A) Data Filters: In the “Search Germplasm” panel, the user can search for germplasm by filtering passport attributes, phenotypic traits and SNP markers. (B) Genomic Diversity Visualization: In the integrated SNP browser, users can inspect and subset the SNP matrices visually for different collections of germplasm. (C) Combined Interactive Visualizations: This tool enables the correlation of results of dimensionality reduction algorithms like PCA or t-SNE on the SNP data with countries of origin and phenotypic traits of the germplasm. (D) Interactive World Map: This tool allows the user to create lasso selections of geo-localized germplasm or to highlight user-defined germplasm collections with their specific tagging colors. (E) Manhattan Plots: This tool provides interactive plots of GWAS analysis results where each SNP data point is linked to the SNP browser. The user can click on a SNP data point and is then automatically guided to the corresponding genomic location in the SNP browser. (F) PCA Scatterplot Matrix: This visualization tool allows visual inspection of the first four principal components while highlighting user-defined germplasm collections with their specific tagging colors. It also allows users to save a custom lasso selection of data points as a “named collection” of germplasm.
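A lasso selection, as used in the world map and the PCA plots, amounts to testing which data points fall inside a user-drawn polygon. The source does not describe how BRIDGE implements this test; the sketch below uses the common ray-casting (even-odd) rule as an illustrative stand-in. The function names, accession identifiers and coordinates are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lasso_select(points, polygon):
    """Return the identifiers of all points that lie inside the lasso polygon."""
    return [pid for pid, (x, y) in points.items() if point_in_polygon(x, y, polygon)]


# Hypothetical lasso (a square) and three hypothetical accessions.
lasso = [(0, 0), (10, 0), (10, 10), (0, 10)]
points = {"acc_1": (5, 5), "acc_2": (12, 3), "acc_3": (1, 9)}

selected = lasso_select(points, lasso)
print(selected)
```

The same routine works for PCA coordinates or for longitude and latitude pairs, as long as the points and the polygon share one coordinate system.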
FIGURE 4. The concept of “named collections” in combination with the visual analytics concept of “interactive brushing and linking”: A new germplasm collection can be created by (A) defining filters on passport records (under “Search Germplasm”), (B) a lasso selection in the world map (under “Geographic Origins”), or (C) a lasso selection in a PCA plot (under “Genetic Diversity”). The save dialog (D) allows the user to store the current collection with a custom name and tagging color. Furthermore, the save dialog automatically detects intersections of the currently selected germplasm samples with already existing germplasm collections and provides a function to add, subtract or intersect the sample lists. After saving, the new collection is available in the “Saved germplasm collections” dialog that is located in the menu “Collections” (E) and can be reused for application-wide visualization in the SNP browser (F), the PCA plots for exploration of the genetic diversity space (G) and the world map (H). In addition, the SNP data for that collection can be exported to a VCF file (I), the phenotypic and passport data, including pictures of the selected accessions in the collection, can be downloaded as a MIAPPE-compliant ISA-Tab archive (J), and finally the germplasm for the collection can be ordered online from the IPK genebank information system (K).
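The add, subtract and intersect functions of the save dialog correspond to the basic set operations on lists of sample identifiers. The sketch below illustrates this under stated assumptions: the function name, the mode labels and the accession identifiers are hypothetical, and "subtract" is assumed to mean removing the existing collection's samples from the current selection, which the source does not specify.

```python
def combine_collections(current, existing, mode):
    """Combine the current selection with an existing collection.

    mode is one of "add" (union), "subtract" (current minus existing)
    or "intersect" (samples present in both).
    """
    current, existing = set(current), set(existing)
    if mode == "add":
        return current | existing
    if mode == "subtract":
        return current - existing
    if mode == "intersect":
        return current & existing
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical accession identifiers.
current_selection = ["acc_1", "acc_2", "acc_3"]
saved_collection = ["acc_2", "acc_3", "acc_4"]

overlap = combine_collections(current_selection, saved_collection, "intersect")
merged = combine_collections(current_selection, saved_collection, "add")
remaining = combine_collections(current_selection, saved_collection, "subtract")
print(sorted(overlap), sorted(merged), sorted(remaining))
```

Detecting whether a new selection overlaps an existing collection reduces to checking that the "intersect" result is non-empty.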
Numbers of germplasm samples with specific available data attributes.
| Available data | Number of data records |
| Germplasm samples with passport data | 22,626 |
| Genotyped germplasm samples (available in VCF file) | 22,621 |
| Germplasm samples with at least one observed phenotype | 9,527 |
| Germplasm samples with spike photographs | 6,162 |
| Germplasm samples with geographical (GPS) coordinates | 2,862 |
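To put these counts in proportion, each can be expressed as a share of the 22,626 accessions with passport data. The short sketch below does this arithmetic; the dictionary keys are shortened labels chosen here for illustration, and the counts are taken from the table above.

```python
# Counts from the table above; keys are shortened, illustrative labels.
records = {
    "passport data": 22626,
    "genotyped": 22621,
    "at least one phenotype": 9527,
    "spike photographs": 6162,
    "GPS coordinates": 2862,
}

total = records["passport data"]

# Share of all accessions (in percent) for which each data type is available.
coverage = {name: 100.0 * count / total for name, count in records.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct:.2f}%")
```

Roughly two in five accessions have at least one observed phenotype, and about one in eight has GPS coordinates, which bounds how much of the panel can appear on the world map.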
Available passport data attributes with corresponding MCPD codes (Multi-Crop Passport Descriptor Codes) (Alercia et al., 2015) and data completeness.
| Passport attribute | MCPD code | Data completeness |
| Country of origin | ORIGCTY | 88.64% |
| Subtaxon/Subspecies | SUBTAXA | 99.99% |
| Biological status of accession | SAMPSTAT | 99.57% |
| Donor institute | DONORCODE | N/A |
| Location of collecting site | COLLSITE | 29.27% |
| Full botanical name | N/A | 95.15% |
| Annual growth habit (barley specific) | N/A | 90.21% |
Observed phenotypic traits of barley spikes and number of accessions per trait value in the study panel.
| Phenotypic trait | Possible values | # Accessions |
| Row-type convarity | convar. | 5,572 |
| | convar. | 3,030 |
| | convar. | 411 |
| | convar. | 293 |
| | convar. | 207 |
| Spike density | Lax | 7,804 |
| | Middle | 1,483 |
| | Dense | 234 |
| Spike branching | Unbranched | 9,367 |
| | Branched | 153 |
| Spike brittleness | No | 9,517 |
| | Yes | 4 |
| Grain hull / Cover of caryopses | Covered grain | 8,630 |
| | Naked grain | 882 |
| Color of naked seeds | Yellow | 726 |
| | Between black and brown | 57 |
| | Black | 50 |
| | Green | 32 |
| | Purple | 8 |
| | Other | 9 |
| Length of rachilla hairs | Long | 6,019 |
| | Short | 3,474 |
| Awns roughness | Rough | 8,926 |
| | Smooth | 421 |
| Presence of awns central spikelet | Awnless or very short awns (tips) | 71 |
| | Awns short (up to spike length) | 450 |
| | Awns long (1.5–3 times spike length) | 8,850 |
| | Sessile hoods (sessile or on short) | 27 |
| | Elevated hoods (Hood over 1 cm long awns shaped stems) | 19 |
| | Hoods with end awn | 19 |
| | Elevated hoods with end awn | 34 |
| Presence of awns lateral spikelets | Awnless or very short awns (tips) | 203 |
| | Awns short (up to spike length) | 419 |
| | Awns long (1.5–3 times spike length) | 5,276 |
| | Sessile hoods (sessile or on short) | 19 |
| | Elevated hoods (Hood over 1 cm long awns shaped stems) | 17 |
| | Hoods with end awn | 15 |
| | Elevated hoods with end awn | 20 |
| Presence of awns on glumes | Awnless or very short awns (tips) | 8,725 |
| | Awns short (up to spike length) | 785 |
| | Awns long (1.5–3 times spike length) | 10 |
| | Sessile hoods (sessile or on short) | 0 |
| | Elevated hoods (Hood over 1 cm long awns shaped stems) | 0 |
| | Hoods with end awn | 0 |
| | Elevated hoods with end awn | 1 |
| Glume width | All glumes are narrow (<1 mm) | 5,959 |
| | All or some glumes are broad (1–2 mm) | 56 |
| Color of glumes | Yellow | 5,614 |
| | Gray | 122 |
| | Black | 115 |
| | Brown | 92 |
| | Purple | 71 |
| | Green | 1 |
| | Other | 4 |
FIGURE 5. The combination of exploratory data analysis and continuous data curation can form a cycle that leads to continuous quality improvements. (A) The heterogeneous primary data from genotyping and phenotyping experiments combined with data of further analysis steps like PCA or GWAS (“Analysis results data”) are collected and serve as initial data sources. (B) The input data is subjected to a first curation step and then fed into the data warehouse. The data warehouse allows programmatic access to the data and export of the data in standardized formats. (C) The iterative process of exploratory data analysis (EDA) makes it possible to derive new scientific hypotheses without prior assumptions. Furthermore, the result of an EDA iteration can reveal data inconsistencies and thus be a starting point for a subsequent data curation step. Ideally, these processes form a continuous cycle, which can lead to continuous quality improvement of primary research data and derived data. Using an entry point such as the germplasm search, a “named collection” of germplasm to be visualized or explored is created. Users can then search for outliers, clusters or other visual keys in a genetic diversity plot (such as PCA). When a sample or a set of samples of specific interest is identified and highlighted by a lasso selection, it is possible to save this subset as a new collection and look for further details, e.g., via the SNP profiles in the SNP browser. This detailed look into the data could be enough to find something unusual that might be worth further experimentation or lead to new scientific findings.
Comparison between BRIDGE, Germinate3, and T3/Barley.
| Resource name | Resource type | Custom collections | Interactive brushing and linking | Integrated visualizations | Interactive plots | Data export |
| Germinate3 | Data warehouse | (✓) | × | ✓ | (✓) | CSV, Flapjack |
| T3/Barley | Data warehouse | × | × | ✓ | × | CSV, Flapjack |
| BRIDGE | Visual analytics webtool | ✓ | ✓ | ✓ | ✓ | ISA-Tab, VCF, CSV |