Maize Genomes to Fields: 2014 and 2015 field season genotype, phenotype, environment, and inbred ear image datasets.

Naser AlKhalifah¹,², Darwin A Campbell¹, Celeste M Falcon², Jack M Gardiner¹,³, Nathan D Miller², Maria Cinta Romay⁴, Ramona Walls⁵, Renee Walton¹, Cheng-Ting Yeh¹, Martin Bohn⁶, Jessica Bubert⁶, Edward S Buckler⁴,⁷, Ignacio Ciampitti⁸, Sherry Flint-Garcia⁷,³, Michael A Gore⁴, Christopher Graham⁹, Candice Hirsch¹⁰, James B Holland⁷,¹¹, David Hooker¹², Shawn Kaeppler², Joseph Knoll⁷, Nick Lauter¹,⁷, Elizabeth C Lee¹³, Aaron Lorenz¹⁴,¹⁰, Jonathan P Lynch¹⁵, Stephen P Moose⁶, Seth C Murray¹⁶, Rebecca Nelson⁴, Torbert Rocheford¹⁷, Oscar Rodriguez¹⁴, James C Schnable¹⁴, Brian Scully⁷,¹⁸, Margaret Smith⁴, Nathan Springer¹⁰, Peter Thomison¹⁹, Mitchell Tuinstra¹⁷, Randall J Wisser²⁰, Wenwei Xu²¹, David Ertl²², Patrick S Schnable²³, Natalia De Leon²⁴, Edgar P Spalding²⁵, Jode Edwards²⁶,²⁷, Carolyn J Lawrence-Dill²⁸.

Abstract

OBJECTIVES: Crop improvement relies on analysis of phenotypic, genotypic, and environmental data. Given large, well-integrated, multi-year datasets, diverse queries can be made: Which lines perform best in hot, dry environments? Which alleles of specific genes are required for optimal performance in each environment? Such datasets can also be leveraged to predict cultivar performance, even in uncharacterized environments. The maize Genomes to Fields (G2F) Initiative is a multi-institutional organization of scientists working to generate and analyze such datasets from existing, publicly available inbred lines and hybrids. G2F's genotype by environment project has released 2014 and 2015 datasets to the public, with 2016 and 2017 collected and soon to be made available.

DATA DESCRIPTION: Datasets include DNA sequences; traditional phenotype descriptions, as well as detailed ear, cob, and kernel phenotypes quantified by image analysis; weather station measurements; and soil characterizations by site. Data are released as comma-separated value spreadsheets accompanied by extensive README text descriptions. For genotypic and phenotypic data, both raw data and a version with outliers removed are reported. For weather data, two versions are reported: a full dataset calibrated against nearby National Weather Service sites and a second calibrated set with outliers and apparent artifacts removed.

Keywords: Breeding; Environment; Genome; Genotype; Hybrid; Inbred; Maize; Phenotype; Prediction; Soil

Year: 2018 PMID: 29986751 PMCID: PMC6038255 DOI: 10.1186/s13104-018-3508-1

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Objective

G2F is a multi-institutional, collaborative initiative to develop tools that efficiently predict performance of diverse maize (Zea mays ssp. mays) varieties across multiple growing conditions. G2F projects aim to collect, share, and analyze multi-year, large-scale genomic, phenotypic, and environmental datasets. The project builds on existing maize genome sequence resources by developing approaches to understand the functions of genes and specific alleles based on their expression in typical field conditions. There are many dimensions to the goal of understanding genotype-by-environment (G × E) interactions, including which genes impact which traits and trait components, how genes interact among themselves, the relevance of specific genes under different growing conditions, and how genes influence plant growth during various stages of development. G2F projects foster integration of diverse research disciplines, including (but not limited to) genetics, genomics, plant physiology, agronomy, climatology, and crop modeling as well as analytical perspectives and tools derived from computational sciences, statistics, and engineering. Under the umbrella of G2F are enterprises such as the G × E project that began in 2014. The G × E project aims to document and measure genotypes, phenotypes, and environmental data in standard formats across more than twenty distributed field locations in North America annually. The resulting dataset is unique as it represents, to our knowledge, the most extensive publicly available dataset of its kind, reporting a consistent set of traits across common sets of fully genotyped germplasm not only across many locations, but also with relevant information reported down to the level of specific plots. Making these datasets publicly available enables researchers from many different disciplines to tackle the daunting analyses necessary to make useful predictions of crop performance. 
Novel data analysis approaches and tools are expected to result from the curated and organized data described here.

Data description

Online forms were developed for logging field site coordinates, field management metadata, and other site-specific information. Datasets include the following.

DNA sequences of inbreds (with and without imputation), including those inbreds used to produce featured hybrids. The process for creating files and metadata pertaining to the genotyping-by-sequencing (GBS) process [1] is described in the accompanying text descriptions. Data are most readily analyzed using TASSEL software [2]. Raw sequence reads are accessible via the Sequence Read Archive [3].

Phenotype measurements for inbreds and hybrids. A handbook of instructions for making traditional phenotype measurements (reviewed in [4]) is available via the G2F website [5]. Traditional traits include stand count, stalk lodging, root lodging, days to anthesis, days to silking, ear height, plant height, plot weight, grain moisture, and test weight. Datatypes reported as both raw files and files with outliers removed are described in README files. Additionally, a large set of ear, cob, and kernel measurements was made with a non-traditional machine vision platform to quantify the components of yield [6]. Dimensions are reported in millimeters, and shape descriptors are reported as principal components of contour data points. Cob color is reported as RGB (red/green/blue) pixel values. Kernel row number, counted manually, is reported as an integer.

Environmental data, collected by WatchDog 2700 weather stations (Spectrum Technologies) at 30-min intervals from planting through harvest. Collected information includes wind speed, direction, and gust; air temperature, dewpoint, and relative humidity; rainfall; and solar radiation. Data are reported as a calibrated set (based on calibration derived from nearby National Weather Service stations) and a “clean” set (based on removing obvious artifacts from the calibrated dataset).

Soil characterizations by site (first taken in 2015), including plow depth, pH, buffered pH, organic matter, phosphorus levels (in parts per million), and potassium levels (in parts per million).

Data collected in year n are released to project members in spring of the following year (n + 1), and released to the public the subsequent year (n + 2). The 2014 and 2015 datasets are publicly available via the NCBI SRA [7] and CyVerse/iPlant [8] with files and access links shown in Table 1.
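
Because the weather stations log at 30-min intervals, a common first step before joining weather to plot-level phenotypes is to collapse the records to daily summaries. The following is a minimal sketch of that step using only the Python standard library. The column names (`timestamp`, `temperature_c`, `rainfall_mm`), the timestamp format, and the treatment of blank cells as missing are assumptions made for illustration; the actual layout is documented in the weather data description files.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical 30-minute records. Real column names and timestamp formats
# are given in the weather README files and will differ from these.
SAMPLE = """timestamp,temperature_c,rainfall_mm
2015-06-01 00:00,15.0,0.0
2015-06-01 00:30,14.5,0.2
2015-06-01 01:00,,0.0
2015-06-02 00:00,18.0,1.0
2015-06-02 00:30,20.0,0.5
"""


def daily_summary(handle):
    """Return {date: (mean temperature or None, total rainfall)}."""
    temps = defaultdict(list)
    rain = defaultdict(float)
    for row in csv.DictReader(handle):
        day = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M").date()
        # Assumption: a blank cell means the reading is missing, so skip it.
        if row["temperature_c"] != "":
            temps[day].append(float(row["temperature_c"]))
        if row["rainfall_mm"] != "":
            rain[day] += float(row["rainfall_mm"])
    days = sorted(set(temps) | set(rain))
    return {
        d: (sum(temps[d]) / len(temps[d]) if temps[d] else None, rain[d])
        for d in days
    }


summary = daily_summary(io.StringIO(SAMPLE))
for day, (mean_temp, total_rain) in summary.items():
    print(day, mean_temp, total_rain)
```

Temperature is averaged and rainfall is summed because the first is a state and the second is an accumulation; other variables (for example wind gust) would likely need a maximum instead.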

Table 1

Overview of data files and data sets

Label	Name of data file/data set	File types (extension)	Data repository and identifier
DNA Sequences of Inbreds	GBS sequencing Maize G2F (G × E) inbreds	Sequence reads	NCBI SRA PRJNA385022 [3] (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA385022)
2014 Field Season Phenotypic and Genotypic Data	_readme.txt	.txt	CyVerse [9] (10.7946/P2V888)
	/a._2014_hybrid_phenotypic_data	directory
	_g2f_2014_hybrid_data_description.txt	.txt
	g2f_2014_hybrid_no_outliers.csv	.csv
	g2f_2014_hybrid_raw.csv	.csv
	/b._2014_gbs_data	directory
	_g2f_2014_gbs_data_description.txt	.txt
	g2f_2014_gbs_data.csv	.csv
	g2f_2014_zeagbsv27.imp.h5	.h5
	g2f_2014_zeagbsv27.imp.h5.gz	.gz
	g2f_2014_zeagbsv27.raw.h5	.h5
	g2f_2014_zeagbsv27.raw.h5.gz	.gz
	g2f_2014_zeagbsv27impv5hmp.txt.gz	.gz
	g2f_2014_zeagbsv27v5hmp.txt.gz	.gz
	/c._2014_weather_data	directory
	_g2f_2014_weather_data_description.txt	.txt
	g2f_2014_weather_calibrated.csv	.csv
	g2f_2014_weather_clean.csv	.csv
	/d._2014_inbred_phenotypic_data	directory
	_g2f_2014_inbred_data_description.txt	.txt
	g2f_2014_inbred_no_outliers.csv	.csv
	g2f_2014_inbred_raw.csv	.csv
	/z._2014_supplemental_info	directory
	g2f_2014_field_characteristics.csv	.csv
2015 Field Season Phenotypic and Genotypic Data	_readme.txt	.txt	CyVerse [10] (10.7946/P24S31)
	/a._2015_hybrid_phenotypic_data	directory
	_g2f_2015_hybrid_data_description.txt	.txt
	g2f_2015_hybrid_no_outliers.csv	.csv
	g2f_2015_hybrid_raw.csv	.csv
	/b._2015_gbs_data	directory
	_g2f_2014_gbs_data_description.txt	.txt
	/c._2015_weather_data	directory
	_g2f_2015_weather_data_description.txt	.txt
	g2f_2015_weather_calibrated.csv	.csv
	g2f_2015_weather_clean.csv	.csv
	/d._2015_inbred_phenotypic_data	directory
	_g2f_2015_inbred_data_description.txt	.txt
	g2f_2015_inbred_raw.csv	.csv
	/e._2015_soils	directory
	_g2f_2015_soil_data.txt	.txt
	g2f_2015_soil_data.csv	.csv
	/z._2015_supplemental_info	directory
	_g2f_2015_supplemental_information.txt	.txt
	g2f_2015_cooperator_list.csv	.csv
	g2f_2015_field_irrigation.csv	.csv
	g2f_2015_field_metadata.csv	.csv
2014 and 2015 Inbred Ear Imaging	_readme.txt	.txt	CyVerse [11] (10.7946/P2C34P)
	2014_2015_compiledData.tar.gz	.tar.gz
	2014_gxe_compiledDataAndFileNames.csv	.csv
	2014_gxe_compiledDataAndFileNames_Raw.csv	.csv
	2015_gxe_compiledDataAndFileNames.csv	.csv
	2015_gxe_compiledDataAndFileNames_Raw.csv	.csv
	CEK_Data_Files.tar.gz	.tar.gz
	/cob	directory
	_cob.txt	.txt
	cob.tar.gz	.tar.gz
	cob_01of05.tar.gz	.tar.gz
	cob_02of05.tar.gz	.tar.gz
	cob_03of05.tar.gz	.tar.gz
	cob_04of05.tar.gz	.tar.gz
	cob_05of05.tar.gz	.tar.gz
	/ear	directory
	_ear.txt	.txt
	ear.tar.gz	.tar.gz
	ear_01of08.tar.gz	.tar.gz
	ear_02of08.tar.gz	.tar.gz
	ear_03of08.tar.gz	.tar.gz
	ear_04of08.tar.gz	.tar.gz
	ear_05of08.tar.gz	.tar.gz
	ear_06of08.tar.gz	.tar.gz
	ear_07of08.tar.gz	.tar.gz
	ear_08of08.tar.gz	.tar.gz
	/kernel	directory
	_kernel.txt	.txt
	kernel.tar.gz	.tar.gz
	kernel_01of05.tar.gz	.tar.gz
	kernel_02of05.tar.gz	.tar.gz
	kernel_03of05.tar.gz	.tar.gz
	kernel_04of05.tar.gz	.tar.gz
	kernel_05of05.tar.gz	.tar.gz
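
Table 1 lists paired raw and outlier-removed files for the phenotypic data (for example `g2f_2014_hybrid_raw.csv` and `g2f_2014_hybrid_no_outliers.csv`). One way to see what the outlier step changed is to compare the two files cell by cell. The sketch below does this with the Python standard library. It assumes that both files share a unique plot key and that an outlier is removed by blanking its cell rather than by dropping the row; neither assumption is confirmed here, and the column names (`plot_id`, `plant_height_cm`, `grain_moisture_pct`) are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for a raw file and its no-outliers counterpart.
RAW = """plot_id,plant_height_cm,grain_moisture_pct
P1,250,18.2
P2,900,17.9
P3,240,
"""
NO_OUTLIERS = """plot_id,plant_height_cm,grain_moisture_pct
P1,250,18.2
P2,,17.9
P3,240,
"""


def removed_values(raw_handle, cleaned_handle, key="plot_id"):
    """List (key, column, raw value) for cells present in raw but blank in cleaned."""
    cleaned = {row[key]: row for row in csv.DictReader(cleaned_handle)}
    removed = []
    for row in csv.DictReader(raw_handle):
        match = cleaned.get(row[key])
        if match is None:
            # Row absent from the cleaned file; not handled in this sketch.
            continue
        for column, value in row.items():
            if column != key and value != "" and match.get(column, "") == "":
                removed.append((row[key], column, value))
    return removed


removed = removed_values(io.StringIO(RAW), io.StringIO(NO_OUTLIERS))
print(removed)
```

Cells that are already blank in the raw file (such as the grain moisture for `P3` above) are not reported, since they were missing before any outlier removal.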

As technologies develop and the number of researchers involved in the project grows, it is anticipated that increasingly diverse datatypes will be documented. An example of the use of these data has been reported [12]. In that study, phenotypic plasticity was found to be disproportionately controlled by regulatory regions. Because these datasets support lines of inquiry limited only by the questions researchers pose, the potential scope of application for these data is broad. The dataset is also anticipated to impact the field simply by being the first public dataset of its scale that has been collected using standardized protocols and reported in standardized formats, thus defining standards for data collection, formatting, and access.

Limitations

Missing data occur in most datasets. For genotypic and phenotypic datasets, missing values are left blank rather than recorded as zero or ‘null’, because zero is a valid measured value for some traits and some software accepts only numeric values (not strings). The exception is traits extracted from inbred ear, cob, and kernel image data, for which missing values are marked ‘NA’. In some instances, data were retained as reported rather than edited for consistency. These decisions were made to minimize misinterpretation that could lead to incorrect documentation or measurements. For weather data, raw files reported by sensors are not provided because machine data were calibrated based on information from nearby weather stations to ensure accuracy (e.g., if the wind vane was set improperly, a calibration correction was required). Field locations are not always identical from year to year, primarily due to crop rotation management practices. Each field’s GPS coordinates are reported annually to enable data aggregation in keeping with specific research objectives. Germplasm used and reported are specific to the project and are held by researchers involved in the project. They do not derive directly from national public genebanks. Seed access is granted directly by cooperating researchers, subject to seed availability.
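
The two missing-value conventions described above (blank cells in the genotypic and phenotypic files, ‘NA’ in the image-derived trait files) matter in practice because a blank must not be read as zero. The sketch below shows one way to handle both with the Python standard library. The column names and sample rows are hypothetical, and the two conventions are mixed in a single sample only to keep the example short; in the released files they occur in different datasets.

```python
import csv
import io

# Tokens treated as missing: blank (genotype/phenotype files) and 'NA'
# (image-derived ear, cob, and kernel traits).
MISSING_TOKENS = {"", "NA"}

# Hypothetical rows; real column names are given in the README files.
SAMPLE = """plot_id,stand_count,root_lodging
P1,62,0
P2,,3
P3,NA,0
"""


def parse_value(text):
    """Return None for a missing cell, otherwise the numeric value."""
    text = text.strip()
    if text in MISSING_TOKENS:
        return None
    return float(text)


rows = [
    {name: (value if name == "plot_id" else parse_value(value))
     for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(SAMPLE))
]

for row in rows:
    print(row)
```

Here a measured zero (no root-lodged plants in `P1`) stays `0.0`, while the unrecorded stand counts for `P2` and `P3` become `None` and can be excluded from means or model fits.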

References

1. TASSEL: software for association mapping of complex traits in diverse samples.

Authors: Peter J Bradbury; Zhiwu Zhang; Dallas E Kroon; Terry M Casstevens; Yogesh Ramdoss; Edward S Buckler
Journal: Bioinformatics Date: 2007-06-22 Impact factor: 6.937

Review 2. The Quest for Understanding Phenotypic Variation via Integrated Approaches in the Field Environment.

Authors: Duke Pauli; Scott C Chapman; Rebecca Bart; Christopher N Topp; Carolyn J Lawrence-Dill; Jesse Poland; Michael A Gore
Journal: Plant Physiol Date: 2016-08-01 Impact factor: 8.340

3. A robust, high-throughput method for computing maize ear, cob, and kernel attributes automatically from images.

Authors: Nathan D Miller; Nicholas J Haase; Jonghyun Lee; Shawn M Kaeppler; Natalia de Leon; Edgar P Spalding
Journal: Plant J Date: 2016-11-19 Impact factor: 6.417

4. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species.

Authors: Robert J Elshire; Jeffrey C Glaubitz; Qi Sun; Jesse A Poland; Ken Kawamoto; Edward S Buckler; Sharon E Mitchell
Journal: PLoS One Date: 2011-05-04 Impact factor: 3.240

5. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

6. The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences.

Authors: Nirav Merchant; Eric Lyons; Stephen Goff; Matthew Vaughn; Doreen Ware; David Micklos; Parker Antin
Journal: PLoS Biol Date: 2016-01-11 Impact factor: 8.029

7. The effect of artificial selection on phenotypic plasticity in maize.

Authors: Joseph L Gage; Diego Jarquin; Cinta Romay; Aaron Lorenz; Edward S Buckler; Shawn Kaeppler; Naser Alkhalifah; Martin Bohn; Darwin A Campbell; Jode Edwards; David Ertl; Sherry Flint-Garcia; Jack Gardiner; Byron Good; Candice N Hirsch; Jim Holland; David C Hooker; Joseph Knoll; Judith Kolkman; Greg Kruger; Nick Lauter; Carolyn J Lawrence-Dill; Elizabeth Lee; Jonathan Lynch; Seth C Murray; Rebecca Nelson; Jane Petzoldt; Torbert Rocheford; James Schnable; Patrick S Schnable; Brian Scully; Margaret Smith; Nathan M Springer; Srikant Srinivasan; Renee Walton; Teclemariam Weldekidan; Randall J Wisser; Wenwei Xu; Jianming Yu; Natalia de Leon
Journal: Nat Commun Date: 2017-11-07 Impact factor: 14.919
