Anthony M Bolger, Hendrik Poorter, Kathryn Dumschott, Marie E Bolger, Daniel Arend, Sonia Osorio, Heidrun Gundlach, Klaus F X Mayer, Matthias Lange, Uwe Scholz, Björn Usadel.
Abstract
Recent advances in genomics technologies have greatly accelerated the progress in both fundamental plant science and applied breeding research. Concurrently, high-throughput plant phenotyping is becoming widely adopted in the plant community, promising to alleviate the phenotypic bottleneck. While these technological breakthroughs are significantly accelerating quantitative trait locus (QTL) and causal gene identification, challenges to enabling even more sophisticated analyses remain. In particular, care needs to be taken to standardize, describe and conduct experiments robustly while relying on plant physiology expertise. In this article, we review the state of the art regarding genome assembly and the future potential of pangenomics in plant research. We also describe the necessity of standardizing and describing phenotypic studies using the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) standard to enable the reuse and integration of phenotypic data. In addition, we show how deep phenotypic data might yield novel trait-trait correlations and review how to link phenotypic data to genomic data. Finally, we provide perspectives on the golden future of machine learning and its potential in linking phenotypes to genomic features.
Keywords: phenotyping; plant bioinformatics; plant genome annotation; plant genomes
Year: 2019 PMID: 30500991 PMCID: PMC6849790 DOI: 10.1111/tpj.14179
Source DB: PubMed Journal: Plant J ISSN: 0960-7412 Impact factor: 6.417
Glossary table
| Term | Definition |
|---|---|
| Best linear unbiased predictions (BLUP) | A method used to estimate the ‘random’ effects of a mixed model. For a plant researcher this is of relevance when genotypes are considered a ‘random’ effect (reviewed in Piepho |
| Chromosomal pseudomolecules | The largest sequences assembled and ordered by genome sequencing projects, each representing a single chromosome in the genome. These are not necessarily complete, i.e. they might contain stretches of ‘N’s. |
| Contigs | Assembled sequences that contain no unknown (‘N’) bases. |
| Copy‐number variation (CNV) | An InDel that increases or decreases the number of copies of a specific DNA sequence. |
| De Bruijn graph method | A method of genome assembly particularly suited to datasets from short‐read sequencing platforms, due to its scalability to large numbers of reads. |
| De novo assembly | The method of assembling a genome from scratch when there is no reference sequence available. |
| Genome‐wide association studies (GWAS) | An observational study that tests a genome‐wide set of variants (e.g. markers/polymorphisms) to determine whether any variant is associated with a particular trait. Usually requires many genotypes and relies on natural populations and/or panels with diverse cultivars as opposed to biparental populations. |
| Insertions/deletions (InDel) | A genomic variant in which one or more bases have been added and/or removed, resulting in a shorter or longer sequence than originally present. |
| Machine learning | The process of training computers to autonomously extract important information from a data set and identify patterns. Important subfields for a plant researcher include: (i) classification (e.g. is a plant diseased or healthy given an image?); (ii) regression (e.g. predict plant biomass from several images); (iii) clustering (e.g. are there subtypes of plants in the experiment based on the measurements?). |
| Minimum Information About a Plant Phenotyping Experiment (MIAPPE) | Presents guidelines and a checklist for describing plant phenotyping experiments so that they are understandable and reproducible. |
| Ontology | An ontology extends a controlled vocabulary (i.e. a fixed list of terms to be used) by relating its terms to each other. In the simplest case it describes one term as always implying another. For example, monocot, dicot and plant could form a controlled vocabulary; adding the relationships monocot IS_A plant and dicot IS_A plant would begin to turn that vocabulary into an ontology. |
| Overlap‐layout consensus (OLC) method | A method of genome assembly particularly suited to datasets from long‐read sequencing platforms, originally developed for Sanger sequencing data. |
| Polish | A post‐assembly quality improvement procedure that aims to identify and correct small scale errors. |
| Quantitative trait locus (QTL) | A region of DNA containing one or more genes that are associated with the expression of a quantitative phenotypic trait. |
| Reduced representation libraries (RRL) | A protocol to create a sequencing library that aims to contain sequences only from selected subsets of the source genome. |
| Restriction site associated DNA sequencing (RAD‐seq) | A protocol using restriction enzymes to target specific sequences from a genome for inclusion in a sequencing library. |
| Second‐generation sequencing/next‐generation sequencing | High‐throughput sequencing platforms, usually based on sequencing by synthesis, that can sequence millions of DNA strands in parallel but, compared with Sanger sequencing, have a higher error rate and limited read length (e.g. 50–600 bases, depending on the specific instrument used). Some platforms offer a paired‐end mode, whereby both ends of a DNA fragment are sequenced. |
| Single nucleotide polymorphism (SNP) | A genomic variant consisting of a single nucleotide substituted for an alternative nucleotide. |
| Third‐generation sequencing | Single‐molecule sequencing platforms that can create multi‐kilobase reads, but which have much higher error rates than Sanger or second‐generation sequencing platforms. |
| Variable importance prediction | A formalized method to predict the importance of variables in PLS type analyses. |
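
The ontology entry in the glossary can be illustrated with a minimal sketch. It assumes the three-term vocabulary and the two IS_A relationships given there; the names `VOCABULARY`, `IS_A`, `ancestors` and `implies` are hypothetical and introduced only for illustration, not taken from any ontology library.

```python
# Controlled vocabulary: a fixed list of allowed terms (from the glossary example).
VOCABULARY = {"plant", "monocot", "dicot"}

# IS_A relationships turn the flat vocabulary into a (very small) ontology.
# Hypothetical structure: each term maps to the set of its direct parents.
IS_A = {"monocot": {"plant"}, "dicot": {"plant"}}


def ancestors(term, is_a=IS_A):
    """Return every term implied by `term` through IS_A links, transitively."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implies(term, other, is_a=IS_A):
    """True if annotating something with `term` also implies `other`."""
    return term == other or other in ancestors(term, is_a)


if __name__ == "__main__":
    print(implies("monocot", "plant"))  # True: monocot IS_A plant
    print(implies("plant", "monocot"))  # False: the relation is directional
```

A controlled vocabulary alone could only confirm that "monocot" is a permitted term; the IS_A links are what allow a query for "plant" to also retrieve records annotated as "monocot" or "dicot".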
Figure 1. Preparatory analyses for genomics and phenomics data for new genomes.
Figure 2. Approaches to genome sequencing.
Currently, when approaching genome sequencing, the method used depends on the read lengths available: (a) When more short reads are available, they are first assembled into contigs, which are then scaffolded, guided by the long reads. When more long reads are available, two assembly options exist. Either (b) short reads are used to first correct the long reads, which are then assembled or (c) the long reads are first assembled after which the short reads are used to ‘polish’ the assembly. As these approaches lose information at each step, a method (d) that could combine long and short reads in a single step (theoretically leading to an improved genome assembly) would be optimal.
Figure 3. Combining genomic and phenomic data.
The GWAS image was taken from Voiniciuc et al. (2016).