
The case for resequencing studies of Arabidopsis thaliana accessions: mining the dark matter of natural genetic variation.

Abstract

Ultra-high-throughput sequencing (UHTS) techniques are evolving rapidly and may soon become an affordable and routine tool for sequencing plant DNA, even in smaller plant biology labs. Here we review recent insights into intraspecific genome variation gained from UHTS, which offers a glimpse of the rather unexpected levels of structural variability among Arabidopsis thaliana accessions. The challenges that will need to be addressed to efficiently assemble and exploit this information are also discussed.


Year: 2010 PMID: 21283599 PMCID: PMC3026625 DOI: 10.3410/B2-85

Source DB: PubMed Journal: F1000 Biol Rep ISSN: 1757-594X

Introduction and context

The introduction of ‘next-generation’ sequencing technologies has had a tremendous impact on all areas of biology dealing with genomic information, from population genetics to comparative genomics, including the plant sciences [1,2]. This impact is expected to grow exponentially as technical advances continuously increase the length and quality of sequenced DNA fragments while decreasing the cost of the process. The throughput of current ultra-high-throughput sequencing (UHTS) equipment has already reached several gigabase pairs per run, which means that complex multicellular organisms with an average-sized genome, such as the model plant Arabidopsis thaliana (Arabidopsis), can be sequenced at reasonable coverage and cost. To reliably assemble whole genome sequences from the short reads generated by UHTS, a well-defined, high-quality reference genome remains invaluable. Such a reference exists for Arabidopsis [3], which was therefore one of the immediate candidate organisms for resequencing projects. Indeed, a few studies have already focused on comparative analysis of divergent Arabidopsis genomes to characterize the extent of intraspecific genome variation with respect to the Columbia-0 (Col-0) reference accession [4,5].

Major recent advances

Sufficient funding permitting, these initial studies are just the prelude to the ambitious ‘1001 Genomes Project’, which aims to resequence divergent Arabidopsis accessions by the hundreds [6]. In a pilot study [4], Ossowski et al. resequenced the genome of Col-0, revealing more than 2000 homozygous single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) that represent potential errors in the original annotation or spontaneous mutations (see below). Moreover, they analyzed the genome of two divergent strains and found that each of them carries more than 800,000 SNPs and 80,000 short indels. Longer indels were difficult to assess from the short reads by simple re-mapping to the reference genome. However, a de novo assembly approach allowed the authors to identify some larger structural variants by starting from reads that flanked regions characterized by low coverage. Using an alternative, combinatorial approach, Santuari et al. [5] evaluated the genome-wide abundance of large-scale deletions in four Arabidopsis strains sequenced at moderate coverage. The authors demonstrated that the intersection of signal intensities from tiling array hybridizations with UHTS read coverage accurately detects larger deletions. Hundreds of major deletions were observed, which frequently affect gene function. Among them, transposable elements were found to be overrepresented, suggesting that the majority of genomic rearrangements identified result from the activity of mobile elements. Such activity was recently also observed in real time [7]. Individual deletions were frequently observed in two or more of the accessions examined, suggesting that variation in gene content partly reflects a common history of deletion events. In summary, the characterization of only a limited number of divergent Arabidopsis genomes has already identified an unexpected degree of structural diversity that significantly affects gene content and function [4,5,8] (Figure 1). 
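The intersection approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Santuari et al.: the per-window read depths and array signals, the cutoff values, and the function names (`low_runs`, `intersect`, `candidate_deletions`) are all hypothetical, and any normalization or statistical testing is omitted.

```python
def low_runs(values, threshold, min_len):
    """Return half-open (start, end) index runs where values stay below threshold."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if v < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(values) - start >= min_len:
        runs.append((start, len(values)))
    return runs


def intersect(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            out.append((s, e))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def candidate_deletions(read_depth, array_signal,
                        depth_cutoff=2, signal_cutoff=0.5, min_len=3):
    """Windows that look deleted in BOTH data types (cutoffs are assumptions)."""
    by_reads = low_runs(read_depth, depth_cutoff, min_len)
    by_array = low_runs(array_signal, signal_cutoff, min_len)
    return intersect(by_reads, by_array)


# Toy per-window values (invented for illustration only).
depth = [20, 18, 0, 0, 1, 0, 19, 0, 0, 22]
signal = [1.0, 0.9, 0.8, 0.1, 0.2, 0.1, 0.1, 0.9, 1.0, 1.1]
print(candidate_deletions(depth, signal))  # [(3, 6)]
```

Requiring agreement between two independent measurements is the design point: a window with low read depth alone may simply be hard to map, and a window with low array signal alone may reflect poor hybridization, so only the overlap is reported.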
Although the majority of those polymorphisms supposedly originated in the wild, recent studies have highlighted the dynamics of the evolutionary process over merely a few generations, as exemplified in another pioneering publication by Ossowski et al. [9]. In this paper, the authors used next-generation sequencing to evaluate the rate and accumulation of spontaneous mutations across several generations. They analyzed five Col-0 lines that had been maintained for 30 generations and found a mutation rate of 7 × 10⁻⁹ per site per generation. Thus, Arabidopsis geneticists are confronted with the finding that mutant lines and reference backgrounds might be more disparate than suspected. In practice, this should not pose a major problem as most mutations observed are essentially neutral. However, another study analyzing structural genome variation over several inbred generations subjected to stress treatment found major, stress-induced structural variation that could significantly affect phenotype [10]. Although that study was purely array-based and thus not conclusive, its findings, if confirmed by UHTS, might force us to rethink our notions of genome stability.
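To give a feel for what the quoted per-site rate implies, the following back-of-the-envelope sketch multiplies it out. The rate and the 30 generations come from the text; the haploid genome size of roughly 120 million base pairs is an assumption introduced here for illustration, and the result should be read as an order-of-magnitude estimate only.

```python
MUTATION_RATE_PER_SITE = 7e-9   # per site per generation (from the text)
GENERATIONS = 30                # generations the lines were maintained (from the text)
GENOME_SIZE_BP = 120_000_000    # ASSUMPTION: approximate haploid genome size


def expected_mutations(rate, sites, generations):
    """Expected number of new mutations, assuming a constant, uniform rate."""
    return rate * sites * generations


per_generation = expected_mutations(MUTATION_RATE_PER_SITE, GENOME_SIZE_BP, 1)
after_30 = expected_mutations(MUTATION_RATE_PER_SITE, GENOME_SIZE_BP, GENERATIONS)

print(f"per generation:       {per_generation:.2f}")  # about 0.84
print(f"after 30 generations: {after_30:.1f}")        # about 25.2
```

Under these assumptions a line accumulates on the order of one new mutation per haploid genome copy per generation, which is why independently propagated stocks of the same accession can drift apart measurably.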

Figure 1.

Example of a major common deletion affecting multiple genes

A region on Arabidopsis chromosome 4 of approximately 68 kilobase pairs (kb) contains a series of receptor-like protein kinase-related proteins in Col-0, but shows a clear lack of read coverage in the Eilenberg-0 (Eil-0) strain (top), as depicted by the grey bars, and a significant decrease in coverage for the other strains in the diagram. A closer look reveals that a large number of the reads mapped to this region have very low mapping quality and are thus probably mis-mapped and misleading. Overall, this picture indicates a series of large-scale deletions comprising at least 15 genes that are present in the Col-0 strain, represented by the brx-2 mutant obtained from a Col-0 background, but missing in all four other strains. Lc-0, Loch Ness-0; Sav-0, Slavice-0; Tsu-1, Tsushima-1.


Future directions

As this perspective is written, it is already likely to be superseded by ongoing technological progress, which should soon permit routine identification of structural genome variants. For instance, using paired-end sequencing it is already possible to follow the movements of transposons, which are highlighted as mis-mappings of the two ends of a sequenced fragment on the reference genome. Beyond hardware improvements, the development of bioinformatics tools will also be a major driver. Already several tools have been developed to detect inter- and intra-chromosomal rearrangements [11,12]. The read length provided by the current versions of most UHTS platforms coupled with different insert size libraries will ultimately make it feasible to move from analysis that is focused on read mapping onto a reference genome towards a reference-free, de novo assembly approach. This is critical to overcome the intrinsic limitations dictated by relying on a single reference sequence. The collection of high-quality scaffolds and contigs from divergent Arabidopsis accessions could be used to define genomic regions that may have accumulated a degree of divergence that would prevent their accurate elucidation by classical mapping approaches (Figure 2). In particular, in the case of duplicated or partially conserved regions, the assembly itself is limited by the highly repetitive content of these sequences. The integration of data from the reference mapping with assembled scaffolds will help to reconstruct specific regions that would otherwise be hard to decipher using either mapping or assembly alone. De novo assembly algorithms that take into account the read mapping position on the reference sequence are already being developed, such as LOCAS [13] or the recent Columbus algorithm implemented in Velvet [14,15]. Notably, these tools can also assemble genomes from low coverage data, further decreasing costs. 
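The paired-end signal mentioned above can be illustrated with a small Python sketch: a read pair is flagged when its two ends map to different chromosomes, or when their distance on the reference departs strongly from the library insert size. The `ReadPair` record, the `classify_pair` function, the category labels, and the three-standard-deviation cutoff are all hypothetical choices made for this example; read orientation and mapping quality, which real tools such as those cited in [11,12] would also consider, are ignored.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two ends of one sequenced fragment."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def classify_pair(pair, insert_mean, insert_sd, n_sd=3.0):
    """Label a pair by comparing its mapped span with the library insert size.

    n_sd is an assumed tolerance (in standard deviations), not a published value.
    """
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    distance = abs(pair.pos2 - pair.pos1)
    if distance > insert_mean + n_sd * insert_sd:
        return "too-far"
    if distance < insert_mean - n_sd * insert_sd:
        return "too-close"
    return "concordant"


# Toy pairs for a library with a 300 bp mean insert size (invented values).
pairs = [
    ReadPair("Chr4", 1000, "Chr4", 1310),
    ReadPair("Chr4", 1000, "Chr4", 11000),
    ReadPair("Chr4", 1000, "Chr1", 500),
]
for p in pairs:
    print(p, classify_pair(p, insert_mean=300, insert_sd=30))
```

Clusters of pairs that share the same non-concordant label at one locus, rather than isolated pairs, would be the evidence of interest, since single discordant pairs are often mapping artifacts.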
Another important challenge in analyzing these data is how to make them accessible and useful to researchers. This will probably be driven by increasingly advanced and intuitive genome browsers, such as the recent version of Ensembl Plants or GenomeMapper [16].

Figure 2.

Ambiguity in resolving loci that include duplicated genes and deletions


The AOP3 gene, involved in glucosinolate biosynthesis and thus pathogen defense, appears to be duplicated in an exact copy in the Eilenberg-0 (Eil-0) accession, as indicated by homozygous single nucleotide polymorphisms depicted with colored vertical lines. Breakpoints are genomic regions where both ends of a fragment sequenced with the paired-end library are mapped on the reference genome at a distance that is significantly different from the insert size of the library, suggesting structural rearrangements. Here, the reads supporting breakpoints are shown in different colors, with each color representing the chromosome where the other end of the fragment is located. Following these reads, it is possible to reconstruct where the second copy of AOP3 is located, which appears to be 10 kilobase pairs downstream of its original locus, immediately downstream of the AOP2 locus in Col-0. Lc-0, Loch Ness-0.

Beyond the utility of UHTS in everyday lab approaches, such as mutant mapping [17], one might ask: why should we generate these data? Clearly, one of the greatest promises lies in their integration into genome-wide association studies [18], which would enable us to move from assessing the qualitative and quantitative effects of a single locus towards identifying and evaluating the systemic effects of multiple genes involved in a trait of interest [18-20]. The genome sequences of specific parental accessions would also greatly accelerate standard quantitative genetics approaches, such as quantitative trait locus analysis of recombinant inbred lines. Maybe most importantly, sequencing hundreds of strains, as proposed by the ‘1001 Genomes Project’, will not only indicate which genes are divergent, missing, or not functional with respect to the Col-0 reference sequence in a given accession, but will also lead to the discovery of genes that are present in the worldwide Arabidopsis population (but absent from Col-0).
Without this ‘dark matter’ of the Arabidopsis genome, the full gene complement of the species cannot be defined, nor can a complete understanding of the ecological-evolutionary and developmental history of this plant be attained.

19 in total

1. Using the Velvet de novo assembler for short-read sequencing technologies.

Authors: Daniel R Zerbino
Journal: Curr Protoc Bioinformatics Date: 2010-09

2. Genome-wide survey of Arabidopsis natural variation in downy mildew resistance using combined association and linkage mapping.

Authors: Adnane Nemri; Susanna Atwell; Aaron M Tarone; Yu S Huang; Keyan Zhao; David J Studholme; Magnus Nordborg; Jonathan D G Jones
Journal: Proc Natl Acad Sci U S A Date: 2010-05-17 Impact factor: 11.205

3. Sequencing of natural strains of Arabidopsis thaliana with short reads.

Authors: Stephan Ossowski; Korbinian Schneeberger; Richard M Clark; Christa Lanz; Norman Warthmann; Detlef Weigel
Journal: Genome Res Date: 2008-09-25 Impact factor: 9.043

4. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors:
Journal: Nature Date: 2000-12-14 Impact factor: 49.962

5. The 1001 genomes project for Arabidopsis thaliana.

Authors: Detlef Weigel; Richard Mott
Journal: Genome Biol Date: 2009-05-27 Impact factor: 13.583

6. SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data.

Authors: Bruno Zeitouni; Valentina Boeva; Isabelle Janoueix-Lerosey; Sophie Loeillet; Patricia Legoix-né; Alain Nicolas; Olivier Delattre; Emmanuel Barillot
Journal: Bioinformatics Date: 2010-08-01 Impact factor: 6.937

7. Next is now: new technologies for sequencing of genomes, transcriptomes, and beyond. (Review)

Authors: Ryan Lister; Brian D Gregory; Joseph R Ecker
Journal: Curr Opin Plant Biol Date: 2009-01-20 Impact factor: 7.834

8. Copy number variation shapes genome diversity in Arabidopsis over immediate family generational scales.

Authors: Seth DeBolt
Journal: Genome Biol Evol Date: 2010-07-12 Impact factor: 3.416

9. Natural allelic variation underlying a major fitness trade-off in Arabidopsis thaliana.

Authors: Marco Todesco; Sureshkumar Balasubramanian; Tina T Hu; M Brian Traw; Matthew Horton; Petra Epple; Christine Kuhns; Sridevi Sureshkumar; Christopher Schwartz; Christa Lanz; Roosa A E Laitinen; Yu Huang; Joanne Chory; Volker Lipka; Justin O Borevitz; Jeffery L Dangl; Joy Bergelson; Magnus Nordborg; Detlef Weigel
Journal: Nature Date: 2010-06-03 Impact factor: 49.962

10. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays.

Authors: Luca Santuari; Sylvain Pradervand; Amelia-Maria Amiguet-Vercher; Jerôme Thomas; Eavan Dorcey; Keith Harshman; Ioannis Xenarios; Thomas E Juenger; Christian S Hardtke
Journal: Genome Biol Date: 2010-01-12 Impact factor: 13.583
