
AraPheno: a public database for Arabidopsis thaliana phenotypes.

Ümit Seren¹, Dominik Grimm², Joffrey Fitz³, Detlef Weigel³, Magnus Nordborg¹, Karsten Borgwardt², Arthur Korte⁴.

Abstract

Natural genetic variation makes it possible to discover evolutionary changes that have been maintained in a population because they are advantageous. To understand genotype-phenotype relationships and to investigate trait architecture, both high-resolution genotypic and phenotypic data are necessary. Arabidopsis thaliana is a prime model for these purposes. This herb naturally occurs across much of the Eurasian continent and North America. Thus, it is exposed to a wide range of environmental factors and has been subject to natural selection under distinct conditions. Full genome sequencing data for more than 1000 different natural inbred lines are available, and this has encouraged the distributed generation of many types of phenotypic data. To leverage these data for meta-analyses, AraPheno (https://arapheno.1001genomes.org) provides a central repository of population-scale phenotypes for A. thaliana inbred lines. AraPheno includes various features to easily access, download and visualize the phenotypic data. This will facilitate comparative analysis of the many different types of phenotypic data, which is the basis for further enhancing our understanding of the genotype-phenotype map.

Year: 2016 PMID: 27924043 PMCID: PMC5210660 DOI: 10.1093/nar/gkw986

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Arabidopsis thaliana is a prime model system in plant biology (1). This is true for developmental biology, where many basic mechanisms have been analyzed, as well as for population genetics (2). Because the species is selfing, individuals sampled from nature are generally inbred lines, homozygous throughout their genome. This allows efficient collection of many different phenotypes from genetically identical plants, an enormous advantage when studying complex trait variation, in particular the interaction between genotype and environment (3). Furthermore, the lines are direct products of local adaptation. Over the past years, many studies have aimed to identify causative genetic variation for a plethora of different phenotypes. The rationale is to link genetic variation that is present in the population with observed phenotypic differences. Here, genome-wide association studies (GWAS), which were pioneered in human genetics well over a decade ago (4), constitute a preferred tool for such analyses. In GWAS, the effect of each genomic marker on the respective phenotype is assessed and a P-value quantifies the strength of evidence for the association. To obtain meaningful results, a high marker density is required. In Arabidopsis thaliana, GWAS have been routinely performed using 214 000 markers generated with hybridization technology (5). Nowadays, full genome information for over 1000 different natural inbred lines is available (6, http://1001genomes.org/tools/). This resource is an exceptional genomic data set both in terms of quantity and quality. It allows the re-analysis of existing phenotypic data and has the potential to greatly improve the analysis. To fully exploit the advantages of A. thaliana for connecting genotypes to phenotypes, we are creating a central repository for phenotype data, which complements the central repository for genotype information. As A. thaliana is used for basic research worldwide, the existing collection of phenotypic data is extensive, but due to the fragmented nature of the data, comparative analyses are difficult. To summarize, A. thaliana provides one of the best and most extensive collections of population-scale phenotype and genotype data that exist, making the species a perfect tool for statistical method development in GWAS (7–10). These models are nowadays heavily used even outside the A. thaliana community (e.g. in human genetics). It is highly likely that the collection of phenotypic data in A. thaliana will grow massively in the future. AraPheno provides a central phenotype repository for these data.
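The per-marker testing idea behind GWAS described above can be illustrated with a minimal sketch. It is not the method used by any of the cited tools: it assumes biallelic markers coded 0/1 for homozygous inbred lines, uses a simple permutation test on the difference in mean phenotype, and ignores population structure, which real analyses typically correct for (for example with mixed models). All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def mean_difference(genotypes, phenotypes):
    """Absolute difference in mean phenotype between the two allele groups."""
    ref = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    alt = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    if not ref or not alt:
        return 0.0  # monomorphic marker: no contrast to test
    return abs(mean(alt) - mean(ref))


def permutation_p_value(genotypes, phenotypes, n_permutations=2000, seed=0):
    """Permutation P-value for one marker (assumption: lines are exchangeable)."""
    rng = random.Random(seed)
    observed = mean_difference(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_difference(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def scan_markers(markers, phenotypes):
    """Test every marker against the same phenotype vector."""
    return {name: permutation_p_value(g, phenotypes) for name, g in markers.items()}


# Toy data (made up): 12 inbred lines, one phenotype, two markers.
toy_phenotypes = [20, 21, 22, 23, 24, 25, 40, 41, 42, 43, 44, 45]
toy_markers = {
    "marker_a": [0] * 6 + [1] * 6,  # separates low from high values
    "marker_b": [0, 1] * 6,         # unrelated to the phenotype
}
toy_results = scan_markers(toy_markers, toy_phenotypes)
```

In this toy example `marker_a` receives a much smaller P-value than `marker_b`, which mirrors how a scan ranks markers by evidence of association.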

DATABASE CONTENT AND USAGE

AraPheno is a database for phenotypes of A. thaliana. Phenotypes are grouped together into studies. Initially, we added six published studies with a total of 260 phenotypes to the database. The detailed data statistics are summarized in Table 1. We plan to add more phenotypes in the future. At the moment AraPheno contains only published phenotypes, but the framework of the database will also allow the moderated integration of unpublished phenotypes (see Conclusions and Future Directions). The primary purpose of AraPheno is to provide detailed information about the studies and phenotypes stored in the database. All published phenotypes will be linked to trait ontologies (https://bioportal.bioontology.org/ontologies/PTO?p=summary), which provide a controlled vocabulary to describe phenotypic traits in plants and enable functional grouping of different phenotypes.

Table 1.

AraPheno data content and statistics as of 15. Aug. 2016

Data content	Data statistics
General statistics
Studies	6
Phenotypes	260
Accessions	7425
Phenotyped accessions	1425
Observational units^a	4064
Phenotype values	52 741
Top 10 Trait-Ontology terms
days to flowering trait (TO:0000344)	33
bacterial disease resistance (TO:0000315)	20
seed weight (TO:0000181)	8
boron concentration (TO:0006043)	7
cadmium concentration (TO:0006059)	7
calcium concentration (TO:0006047)	7
cobalt concentration (TO:0006050)	7
copper concentration (TO:0006052)	7
iron concentration (TO:0006049)	7
lithium concentration (TO:0006042)	7

^a Observational unit describes the number of physically distinct plants that have been used, even if they are genetically identical.

Users can either display a list of studies or a list of phenotypes in table form. The primary purpose of the database is to serve as a central repository for all A. thaliana phenotypes, with the potential to store and host thousands of phenotypes. Therefore, we provide a full-text search functionality to search for specific terms. Further, AraPheno provides access to a detailed FAQ, tutorials and guided tours that should help new users to navigate the site. Users can obtain detailed information about a specific study or phenotype. In addition, a variety of information (Figure 1A) and interactive visualizations, such as the geographic distribution of samples (Figure 1B) and phenotype histograms, are provided. In particular, the Explorer widget matches phenotypic values of the samples to their geographic locations and allows users to visually uncover geographic patterns (see Figure 1C).

Figure 1.

Screenshot of the detailed view for a phenotype of interest (https://arapheno.1001genomes.org/phenotype/43/). (A) General information such as ‘Scoring’ or various ontology terms is displayed in text form. (B) The geographic distribution of the samples that were scored is displayed as a GeoChart. (C) A powerful Explorer widget relates the phenotype value of each sample to its geographic location, thus bringing out potential geographic patterns.

All displayed information can be downloaded by the user in various data formats (CSV, PLINK and JSON for single phenotypes or PLINK and ISA-TAB for complete studies), and can also be accessed programmatically via a REST API. The Representational State Transfer (REST) architecture allows fast and scalable access to the data. In addition to common data formats, AraPheno will also support ISA-TAB (www.isa-tools.org, www.isacommons.org, www.miappe.org), which has been developed as a standard format for capturing and communicating the metadata that are required for the interpretation of experiments (11). This not only allows direct analysis of the phenotypic data stored in AraPheno with different GWAS tools (e.g. 12, Grimm et al. (2012), arXiv preprint arXiv:1212.4788); the stored metadata also ensure that the phenotypes are understandable and, in principle, reproducible: an essential step in creating a comprehensive genotype–phenotype map. AraPheno will provide Digital Object Identifiers (DOIs) for existing phenotypes and studies, as well as for user-submitted data. This will make individual phenotypes citable and encourage the community to upload existing phenotypes, even if the phenotypes have not yet been published. The persistent DOIs will be assigned by DataCite (https://www.datacite.org), a non-profit organization that provides persistent identifiers (DOIs) for research data. In addition to phenotypes and studies, AraPheno also stores a comprehensive list of all available A. thaliana accessions that have been collected in the wild. This information is connected to the phenotypes and, as mentioned above, allows users to look at geographic patterns. Furthermore, it gives the user an accession-centric entry point into the database, from which a list of all phenotypes that have been scored for a certain accession can be retrieved (Figure 2).
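To make the relationship between the CSV and PLINK download formats concrete, the following sketch converts phenotype values from CSV text into a PLINK-style phenotype file (whitespace-separated `FID IID <phenotype>` lines). The CSV column names `accession_id` and `phenotype_value` are assumptions for illustration and may not match the files AraPheno actually serves. Using the accession ID as both family and individual ID is a common convention for inbred lines rather than something stated in the text, and `-9` is PLINK's default missing-phenotype code.

```python
import csv
import io


def csv_to_plink_pheno(csv_text, id_column="accession_id",
                       value_column="phenotype_value", missing_code="-9"):
    """Convert CSV phenotype values to PLINK phenotype text.

    Column names are hypothetical defaults; adjust them to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = ["FID IID " + value_column]
    for row in reader:
        accession = row[id_column].strip()
        value = (row[value_column] or "").strip()
        # Empty cells become the missing-phenotype code.
        lines.append(f"{accession} {accession} {value or missing_code}")
    return "\n".join(lines) + "\n"


# Made-up example input: two accessions, one with a missing value.
sample_csv = "accession_id,phenotype_value\n6909,51.5\n6040,\n"
plink_text = csv_to_plink_pheno(sample_csv)
```

The resulting text can be written to disk and passed to tools that read PLINK phenotype files, provided the IDs match those used in the genotype data.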

Figure 2.

Screenshot of the detailed view for a specific accession (https://arapheno.1001genomes.org/accession/6909/). (A) General information such as ‘Country’ or ‘Collector’ is displayed in text form. (B) A map shows the geographic origin of the accession. (C) Various aggregated statistics about the ontologies for the (D) list of phenotypes that the accession was scored in.

For any organism, different phenotypes can be correlated. This phenotypic correlation can be due to shared genetic or shared environmental effects, and many methods try to take advantage of phenotypic correlation to map underlying genetic components (7,13–14). AraPheno provides an easy-to-use correlation wizard that allows the user to calculate correlations between a set of phenotypes in real time and to visualize the results interactively (see Figure 3). This tool helps users choose interesting phenotype combinations for downstream analyses.
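The kind of calculation a phenotype correlation tool performs can be sketched as follows. This is an illustrative standard-library implementation, not AraPheno's own code: it assumes each phenotype is a mapping from accession ID to a numeric value, restricts the comparison to accessions scored for both phenotypes, and computes Pearson and Spearman coefficients. All names and values are hypothetical.

```python
from math import sqrt


def overlap(pheno_a, pheno_b):
    """Paired values for accessions that were scored in both phenotypes."""
    shared = sorted(set(pheno_a) & set(pheno_b))
    return [pheno_a[k] for k in shared], [pheno_b[k] for k in shared]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up phenotype values keyed by accession ID.
pheno_a = {6909: 1.0, 6040: 2.0, 7000: 3.0, 8000: 4.0}
pheno_b = {6909: 1.0, 6040: 4.0, 7000: 9.0, 9999: 5.0}
xs, ys = overlap(pheno_a, pheno_b)
r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
```

Only three accessions overlap in this example, which is why the size of the sample overlap matters when interpreting a correlation between two phenotypes.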

Figure 3.

Screenshot of the Phenotype-Correlation results (https://arapheno.1001genomes.org/correlation/6,29,30,31,49,102,99,53,86,39/). (A) The user can specify the correlation method (Pearson and Spearman are supported). (B) The Phenotype–Phenotype Correlation Plot displays pairwise correlation values for the selected phenotypes. When the user moves the mouse over a cell, (C) the Phenotype–Phenotype Scatter Plot plots the corresponding phenotypic values against each other and (D) the Phenotype Sample Overlap Diagram shows the overlap between the two selected phenotypes.

Users who want to submit their own or new studies to AraPheno can easily do so using either a user-friendly form on the AraPheno website or the REST API. Accepted submission formats for new data are PLINK and ISA-TAB. Demo files and a detailed description of the process are available in the FAQ section. Once the study is submitted, it will go through a manual curation step. Here, we will check the submission to make sure that relevant information, such as trait ontology terms, proper scoring information and meaningful phenotype names, is provided. If any information is missing, we will notify the user to request a revision of the submitted data. Once this is completed, the study and phenotypes will be automatically made public and receive an associated DOI. AraPheno is hosted under the 1001genomes organization (http://1001genomes.org), and its framework will be made available as open source (see Implementation) and under the Arabidopsis Information Portal (https://www.araport.org/).
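Submitters may want to pre-check their data for the items that curation looks at. The sketch below is purely illustrative and does not reflect AraPheno's actual curation procedure, which is manual: the dictionary keys (`name`, `scoring`, `to_term`) and the minimum name length are assumptions, and the ontology pattern is only inferred from identifiers such as TO:0000344 in Table 1.

```python
import re

# Pattern inferred from Table 1 identifiers such as "TO:0000344" (assumption).
TO_TERM = re.compile(r"^TO:\d{7}$")


def find_submission_problems(phenotype):
    """Return a list of human-readable problems for one phenotype record.

    `phenotype` is a plain dict with hypothetical keys:
    "name", "scoring" and "to_term".
    """
    problems = []
    name = (phenotype.get("name") or "").strip()
    if len(name) < 3:
        problems.append("phenotype name is missing or too short")
    if not (phenotype.get("scoring") or "").strip():
        problems.append("scoring information is missing")
    term = (phenotype.get("to_term") or "").strip()
    if not TO_TERM.match(term):
        problems.append("trait ontology term is missing or malformed")
    return problems


# Made-up records: one complete, one with all three problems.
complete_record = {
    "name": "Days to flowering",
    "scoring": "Days from sowing until the first flower opens",
    "to_term": "TO:0000344",
}
incomplete_record = {"name": "x", "scoring": "", "to_term": "0000344"}
```

An empty list from `find_submission_problems(complete_record)` means the record passes these basic checks; it does not replace the manual review described above.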

IMPLEMENTATION

AraPheno has been implemented using the Django web framework (https://www.djangoproject.com/), a popular open-source web-application framework based on Python, and, for the REST endpoints, Django REST (http://www.django-rest-framework.org), an open-source REST framework based on Django. The REST endpoints are documented with django-rest-swagger (https://github.com/marcgibbons/django-rest-swagger), an open-source Swagger implementation for Django REST. The data are stored in PostgreSQL (https://www.postgresql.org), an open-source, high-performance database. The interactive charts were developed using the Google Charts library (https://developers.google.com/chart), a free library developed by Google, and D3.js (https://d3js.org), a popular open-source JavaScript library for manipulating documents based on data. For the correlation analysis, as well as the computation of phenotype statistics (i.e. Shapiro-Wilk score), we use scientific Python libraries such as NumPy (http://www.numpy.org), an open-source library for scientific computing for Python, and SciPy (https://www.scipy.org), an open-source library for mathematics, science and engineering for Python, as well as Pandas (http://pandas.pydata.org), an open-source library providing high-performance, easy-to-use data structures and data analysis tools for Python. AraPheno is deployed using Docker (https://www.docker.com), a popular open-source software containerization platform. Docker enables us to provide a reproducible deployment of the entire AraPheno system without having to manage dependencies manually. To make it easier for others to start their own instance of AraPheno, we provide a docker-compose.yml file for both the development version (using sqlite3 instead of PostgreSQL) and the production version of AraPheno. The code for AraPheno is open source and hosted on GitHub (https://github.com/1001genomes/AraPheno), where users can also report issues with the database.

CONCLUSIONS AND FUTURE DIRECTIONS

AraPheno is the first comprehensive database to store phenotypic information for the model plant A. thaliana. As A. thaliana is a naturally occurring inbred plant, this information can be easily linked to existing genotype data and can be reused for many downstream analyses. Storing phenotypes together with sample information in a single database is not only a useful resource for the community but also enables researchers to look at the data from different angles and dissect the information in different ways. At the moment AraPheno contains only phenotypes for A. thaliana inbred lines, but the design of the database will enable the integration of mutant phenotypes in the future as well. AraPheno supports searches for specific phenotypes, trait ontology terms or accessions. Interactive visualizations empower the user to uncover interesting patterns in the data and carry out correlation analyses across the data. Persistent DOIs support unique referencing of phenotypes and attendant analyses, ensuring their citability. The submission–curation workflow will make sure that the data available in AraPheno are of high quality and useful to others. So far we have integrated more than 250 publicly available phenotypes from six independent studies, but this number will almost certainly increase markedly over the next years, based on the many still unpublished studies that have already been presented at conferences. The goal is to create a community resource of all phenotypic data in A. thaliana. We will continue (after obtaining permission from the authors) to upload published phenotypes to the database, as well as enable authors to upload their own phenotypes. This upload will be moderated to ensure that the essential meta-information (e.g. growth conditions, stock numbers and germplasm) needed to interpret the phenotypic data is present. An automatic submission of published studies and phenotypes from easyGWAS (https://easygwas.ethz.ch/) and GWA-Portal (https://gwas.gmi.oeaw.ac.at/) is planned to prevent fragmentation of phenotypic data. This directly links to ongoing efforts to create a central catalogue of GWAS results in A. thaliana. The easy availability of genotype and phenotype data will enable a plethora of downstream analyses with different GWAS tools, as well as the development and testing of novel statistical methods. The latter is not limited to GWAS, but will also be of high interest for genomic prediction models.

14 in total

Review 1. Genome-wide association studies for common diseases and complex traits.

Authors: Joel N Hirschhorn; Mark J Daly
Journal: Nat Rev Genet Date: 2005-02 Impact factor: 53.242

Review 2. Natural genetic variation in Arabidopsis: tools, traits and prospects for evolutionary ecology.

Authors: Chikako Shindo; Giorgina Bernasconi; Christian S Hardtke
Journal: Ann Bot Date: 2007-01-26 Impact factor: 4.357

3. Towards recommendations for metadata and data handling in plant phenotyping.

Authors: Paweł Krajewski; Dijun Chen; Hanna Ćwiek; Aalt D J van Dijk; Fabio Fiorani; Paul Kersey; Christian Klukas; Matthias Lange; Augustyn Markiewicz; Jan Peter Nap; Jan van Oeveren; Cyril Pommier; Uwe Scholz; Marco van Schriek; Björn Usadel; Stephan Weise
Journal: J Exp Bot Date: 2015-06-04 Impact factor: 6.992

4. A Lasso multi-marker mixed model for association mapping with population structure correction.

Authors: Barbara Rakitsch; Christoph Lippert; Oliver Stegle; Karsten Borgwardt
Journal: Bioinformatics Date: 2012-11-22 Impact factor: 6.937

5. GWAPP: a web application for genome-wide association mapping in Arabidopsis.

Authors: Ümit Seren; Bjarni J Vilhjálmsson; Matthew W Horton; Dazhe Meng; Petar Forai; Yu S Huang; Quan Long; Vincent Segura; Magnus Nordborg
Journal: Plant Cell Date: 2012-12-31 Impact factor: 11.277

6. Genome-wide detection of intervals of genetic heterogeneity associated with complex traits.

Authors: Felipe Llinares-López; Dominik G Grimm; Dean A Bodenham; Udo Gieraths; Mahito Sugiyama; Beth Rowan; Karsten Borgwardt
Journal: Bioinformatics Date: 2015-06-15 Impact factor: 6.937

Review 7. The development of Arabidopsis as a model plant.

Authors: Maarten Koornneef; David Meinke
Journal: Plant J Date: 2010-03 Impact factor: 6.417

8. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations.

Authors: Vincent Segura; Bjarni J Vilhjálmsson; Alexander Platt; Arthur Korte; Ümit Seren; Quan Long; Magnus Nordborg
Journal: Nat Genet Date: 2012-06-17 Impact factor: 38.330

9. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines.

Authors: Susanna Atwell; Yu S Huang; Bjarni J Vilhjálmsson; Glenda Willems; Matthew Horton; Yan Li; Dazhe Meng; Alexander Platt; Aaron M Tarone; Tina T Hu; Rong Jiang; N Wayan Muliyati; Xu Zhang; Muhammad Ali Amer; Ivan Baxter; Benjamin Brachi; Joanne Chory; Caroline Dean; Marilyne Debieu; Juliette de Meaux; Joseph R Ecker; Nathalie Faure; Joel M Kniskern; Jonathan D G Jones; Todd Michael; Adnane Nemri; Fabrice Roux; David E Salt; Chunlao Tang; Marco Todesco; M Brian Traw; Detlef Weigel; Paul Marjoram; Justin O Borevitz; Joy Bergelson; Magnus Nordborg
Journal: Nature Date: 2010-03-24 Impact factor: 49.962

10. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana.

Authors:
Journal: Cell Date: 2016-06-09 Impact factor: 41.582
