Jill L Wegrzyn, Taylor Falk, Emily Grau, Sean Buehler, Risharde Ramnath, Nic Herndon.
Abstract
Sequencing technologies and bioinformatic approaches are now available to resolve the challenges associated with complex and heterozygous genomes. Increased access to less expensive and more effective instrumentation will contribute to a wealth of high-quality plant genomes in the next few years. In the meantime, more than 370 tree species are associated with public projects in primary repositories that are interrogating expression profiles, identifying variants, or analyzing targeted capture without a high-quality reference genome. Genomic data from these projects generate sequences that represent intermediate assemblies for transcriptomes and genomes. These data contribute to forest tree biology, but the associated sequence remains trapped in supplemental files that are poorly integrated in plant community databases and comparative genomic platforms. Successful implementation of life science cyberinfrastructure is improving data standards, ontologies, analytic workflows, and integrated database platforms for both model and non-model plant species. Unique to forest trees with large populations that are long-lived, outcrossing, and genetically diverse, the phenotypic and environmental metrics associated with georeferenced populations are just as important as the genomic data sampled for each individual. To address questions related to forest health and productivity, cyberinfrastructure must keep pace with the magnitude of genomic and phenomic sampling of larger populations. This review examines the current landscape of cyberinfrastructure, with an emphasis on best practices and resources to align community data with the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines.
Keywords: FAIR; cyberinfrastructure; phenomics; plant ontologies; population genetics; tree databases; tree genomics; workflows
Year: 2019 PMID: 31892954 PMCID: PMC6935593 DOI: 10.1111/eva.12860
Source DB: PubMed Journal: Evol Appl ISSN: 1752-4571 Impact factor: 5.183
Figure 1. Growth in the number of published reference plant genomes in comparison with those of tree species sequenced since 2002. By 2018, there were 148 plant reference genomes (shown in brown) with only 52 tree species (green). The first forest tree species was sequenced in 2006 (Populus trichocarpa). The highlighted genus names denote the year the first reference was generated for a species in that genus.
Figure 2. (a) NCBI project data depicted for 52 species (10 orders) associated with 6,116 BioProject studies. (b) NCBI BioProject data depicted for 972 projects representing 373 unique tree species across 16 orders. In both panels, BioProject data were organized into whole-genome shotgun (whole genome or resequencing), Transcriptome (RNA-Seq, sRNA), Epigenome (bisulfite), GBS (genotyping-by-sequencing, RAD-Seq, ddRAD-Seq, RAPTURE, and similar), and exome (targeted capture).
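The grouping described in the Figure 2 caption can be sketched as a simple keyword lookup. This is only an illustration of the five categories named in the caption: the function name `categorize`, the keyword lists, and the idea of matching on free-text method labels are assumptions, not the procedure the authors used to classify BioProject records.

```python
from collections import Counter

# Keyword lists derived from the groupings named in the Figure 2 caption.
# Matching on free-text labels is an assumption made for illustration only.
CATEGORY_KEYWORDS = {
    "Whole-genome shotgun": ["whole genome", "whole-genome", "resequencing"],
    "Transcriptome": ["rna-seq", "srna"],
    "Epigenome": ["bisulfite"],
    "GBS": ["genotyping-by-sequencing", "gbs", "rad-seq", "rapture"],
    "Exome": ["targeted capture", "exome"],
}


def categorize(method_label: str) -> str:
    """Return the first category whose keyword occurs in the label."""
    # Normalize case and the Unicode hyphen (U+2010) often found in extracted text.
    text = method_label.lower().replace("\u2010", "-")
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Uncategorized"


# Hypothetical method labels, not taken from any real BioProject record.
labels = ["RNA-Seq", "ddRAD-Seq", "bisulfite sequencing",
          "targeted capture", "resequencing", "sRNA"]
counts = Counter(categorize(label) for label in labels)
print(counts)
```

Because the first matching category wins, a label that mentions two methods is assigned by dictionary order; a real classifier would need an explicit rule for such cases.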
Database resources for tree species
| Database | Start date | Scope | Data sharing | URL | Ontologies | FAIR sharing | Analytics |
|---|---|---|---|---|---|---|---|
| TreeGenes | 1995 | Forest Trees | Tripal/Elastic Search | | SO, GO, TO, PO, CO, PATO, CHEBI | X | X |
| Gramene | 2001 | Plantae | BioMart/Expression Atlas | | SO, GO, PO, EFO | X | X |
| Genome Database for Rosaceae | 2004 | | Tripal/Elastic Search | | SO, GO, TO, PO, CO, PATO | X | X |
| TropGeneDB | 2004 | Tropical Trees | N/A | | SO, GO | X | |
| AspenDB | 2004 | | N/A | | SO, GO | X | |
| PopGenIE | 2009 | | PlantGenIE | | SO, GO | X | |
| PLAZA | 2009 | Plantae | N/A | | SO, GO | X | X |
| Rubber Tree Genome | 2010 | | N/A | | SO, GO | | |
| Citrus Genome Database | 2011 | | Tripal/Elastic Search | | SO, GO, TO, PO, CO, PATO | X | X |
| CsiDB | 2011 | | N/A | | SO, GO | X | |
| EucGenIE | 2011 | | PlantGenIE | | SO, GO | X | |
| Eucalyptus camaldulensis Genome Database | 2011 | | N/A | | SO, GO | | |
| Jatropha Genome Database | 2011 | | N/A | | SO, GO | | |
| Phytozome | 2012 | Viridiplantae | InterMine | | SO, GO | X | |
| ConGenIE | 2013 | Conifers | PlantGenIE | | SO, GO | X | |
| Pear Genome Project | 2013 | | N/A | peargenome.njau.edu.cn | SO, GO | | |
| TropiTree | 2014 | Tropical Trees | N/A | | SO, GO | | |
| PGDBj | 2014 | Plantae | N/A | | SO, GO | X | X |
| Hardwood Genomics Web | 2015 | Hardwood Forest Trees | Tripal/Elastic Search | | SO, GO, TO, PO, CO, PATO | X | X |
| PGP Repository | 2015 | Plantae | e!DAL | | | X | |
| Quercus Portal | 2015 | | N/A | | SO, GO | | |
| Ash Tree Genomes | 2016 | | N/A | | SO, GO | | |
| GDA: genome database for angiosperms | 2016 | Angiosperms | N/A | | SO, GO | | |
| Jatropha Curcas Database | 2016 | | N/A | | SO, GO | | |
| Valley Oak Genome Project | 2016 | | N/A | | SO, GO | | |
| EUCANEXT | 2017 | | N/A | | SO, GO | X | |
| Rubber Genome and Transcriptome DB | 2017 | | N/A | | SO, GO | X | |
| Citrus Greening Database | 2017 | | N/A | | SO, GO | X | X |
| Cacao Genome Database | 2018 | | Tripal | | SO, GO, TO, PO, CO, PATO | X | X |
| CorkOakDB | 2018 | | Tripal | | SO, GO | X | |
Figure 3. Plant and tree-specific secondary and community databases from 2002 to present.
Reference ontologies/vocabularies for plants
| Ontology Name | Scope | Unique Terms |
|---|---|---|
| Crop Ontology (CO) | Traits (species‐specific) | 6298 |
| Trait Ontology (TO) | Trait | 1554 |
| Plant Ontology (PO) | Anatomy and development | 1991 |
| Gene Ontology (GO) | Gene product | 49993 |
| Phenotypic Qualities Ontology (PATO) | Trait qualities | 2730 |
| Chemical Entities of Biological Interest (CHEBI) | Chemistry | 132780 |
| Plant Experimental Conditions Ontology (PECO) | Environment | 563 |
| Sequence Ontology (SO) | Genetic | 2473 |
| Protein Ontology (PRO) | Protein products | 216442 |
| Plant Phenology Ontology (PPO) | Phenology | 254 |
| Flora Phenotype Ontology (FLOPO) | Morphology and trait | 24199 |
Ontologies widely adopted in plant genetic databases.
Workflow languages to support bioinformatic analysis
| Workflow Language/Workbench | Year | Web‐based/Command Line | Syntax |
|---|---|---|---|
| Apache Taverna | 2004 | Both | Explicit |
| Pegasus | 2005 | Command Line | Explicit |
| Ruffus | 2010 | Command Line | Explicit |
| Galaxy | 2010 | Both | Explicit |
| Snakemake | 2012 | Command Line | Implicit |
| bpipe | 2012 | Command Line | Explicit |
| Agave | 2012 | Both | Explicit |
| BigDataScript | 2015 | Command Line | Implicit |
| Sci:Luigi | 2016 | Command Line | Implicit |
| Common Workflow Language | 2016 | Both | Explicit |
| Nextflow | 2017 | Both | Implicit |
| Toil | 2017 | Command Line | Explicit |
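The Syntax column above labels each system as explicit or implicit without defining the terms. One common reading, assumed here, is that an explicit workflow lists its steps and their order directly, while an implicit workflow declares rules keyed on file-name patterns and lets the engine work backwards from a requested target. The toy Python sketch below contrasts the two ideas. All function names, rule names, and file names are hypothetical, and it does not reproduce the syntax of Snakemake, Nextflow, or any other system in the table.

```python
import re
from typing import Callable, List, Set, Tuple


# Explicit style: the author writes the steps down in execution order.
def run_explicit(steps: List[Tuple[str, Callable[[str], str]]],
                 start: str) -> Tuple[List[str], str]:
    """Run named steps one after another; return the run order and final path."""
    order, current = [], start
    for name, func in steps:
        current = func(current)
        order.append(name)
    return order, current


# Implicit style: each rule maps an output pattern to an input template, and
# the planner resolves dependencies backwards from the requested target.
Rule = Tuple[str, str, str]  # (rule name, output regex, input template)


def plan_implicit(target: str, rules: List[Rule], existing: Set[str]) -> List[str]:
    """Return the rule names, in run order, needed to produce `target`."""
    if target in existing:
        return []  # nothing to do, the file is already present
    for name, out_pattern, in_template in rules:
        match = re.fullmatch(out_pattern, target)
        if match:
            needed = match.expand(in_template)
            return plan_implicit(needed, rules, existing) + [name]
    raise ValueError(f"no rule produces {target!r}")


# Hypothetical two-step pipeline: trim reads, then align them.
explicit_steps = [
    ("trim", lambda p: p.replace(".fastq", ".trimmed.fastq")),
    ("align", lambda p: p.replace(".trimmed.fastq", ".bam")),
]
implicit_rules = [
    ("align", r"(?P<sample>\w+)\.bam", r"\g<sample>.trimmed.fastq"),
    ("trim", r"(?P<sample>\w+)\.trimmed\.fastq", r"\g<sample>.fastq"),
]

explicit_order, explicit_output = run_explicit(explicit_steps, "pine01.fastq")
implicit_order = plan_implicit("pine01.bam", implicit_rules, {"pine01.fastq"})
print(explicit_order, explicit_output, implicit_order)
```

In the implicit version the rules are listed in no particular order; the run order falls out of the requested target, so asking for a different file reuses the same rules without editing the pipeline.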
Figure 4. Schematic of recommended cyberinfrastructure to support and integrate non-model tree genomics, phenomics, and environmental data. Community databases housed within existing frameworks that utilize content management systems will ease the management of user accounts, data exchange, and content updates. Guided submission workflows will integrate community-curated ontologies, such as GO, SO, PO, TO, CO, and PATO. Regular imports from primary and secondary sources, as well as multi-institutional projects, will provide the basis for data that can be further curated. Registered users will have direct access to custom workflows with data housed in the database and raw data that can be transferred from primary databases to the local application server.
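One way to picture the guided submission step in Figure 4 is a validator that checks a georeferenced sample record and its ontology annotations before the record is accepted. The sketch below is a minimal illustration under stated assumptions: the `Submission` fields, the `validate` function, the identifier pattern, and the example term IDs are all hypothetical and do not describe any particular database's submission pipeline. Only the list of accepted ontology prefixes comes from the caption.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Ontology prefixes named in the Figure 4 caption.
ACCEPTED_PREFIXES = {"GO", "SO", "PO", "TO", "CO", "PATO"}

# Assumed identifier shape: PREFIX, then ":" or "_", then a local ID.
TERM_PATTERN = re.compile(r"^([A-Z]+)[:_](\S+)$")


@dataclass
class Submission:
    """Hypothetical record for one georeferenced sample."""
    sample_id: str
    latitude: float
    longitude: float
    ontology_terms: List[str] = field(default_factory=list)


def validate(sub: Submission) -> List[str]:
    """Return a list of human-readable problems; an empty list means acceptable."""
    problems = []
    if not -90 <= sub.latitude <= 90:
        problems.append("latitude out of range")
    if not -180 <= sub.longitude <= 180:
        problems.append("longitude out of range")
    if not sub.ontology_terms:
        problems.append("no ontology terms supplied")
    for term in sub.ontology_terms:
        match = TERM_PATTERN.match(term)
        if match is None or match.group(1) not in ACCEPTED_PREFIXES:
            problems.append(f"unrecognized ontology term: {term}")
    return problems


# Placeholder records; the term IDs only illustrate the expected format.
good = Submission("tree-001", 41.8, -72.25, ["PO:0000001", "PATO:0000001"])
bad = Submission("tree-002", 95.0, 10.0, ["XYZ:1"])
print(validate(good))
print(validate(bad))
```

The check only inspects the prefix and shape of each identifier. Confirming that a term actually exists would require a lookup against the ontology itself, which is outside this sketch.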