Boas Pucker, Daniela Holtgräwe, Thomas Rosleff Sörensen, Ralf Stracke, Prisca Viehöver, Bernd Weisshaar.
Abstract
Arabidopsis thaliana is the most important model organism for fundamental plant biology. The genome diversity of different accessions of this species has been intensively studied, for example in the 1001 genome project, which led to the identification of many single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels). In addition, presence/absence variation (PAV), copy number variation (CNV) and mobile genetic elements contribute to genomic differences between A. thaliana accessions. To address larger genome rearrangements between the A. thaliana reference accession Columbia-0 (Col-0) and another accession of about average distance to Col-0, we created a de novo next generation sequencing (NGS)-based assembly of the accession Niederzenz-1 (Nd-1). The result was evaluated with respect to assembly strategy and synteny to Col-0. We provide a high-quality genome sequence of the A. thaliana accession Nd-1 (LXSY01000000). The assembly has an N50 of 0.590 Mbp and covers 99% of the Col-0 reference sequence. Scaffolds from the de novo assembly were positioned on the basis of sequence similarity to the reference. Errors in this automatic scaffold anchoring were corrected manually by analyzing reciprocal best BLAST hits (RBHs) of genes. Comparison of the final Nd-1 assembly to the reference revealed duplications and deletions (PAV). We identified 826 insertions and 746 deletions in Nd-1. Randomly selected PAV candidates were experimentally validated. Our Nd-1 de novo assembly allowed reliable identification of larger genic and intergenic variants, which is difficult or error-prone with short read mapping approaches alone. While overall sequence similarity and synteny are very high, we detected both short and larger (affecting more than 100 bp) differences between Col-0 and Nd-1 based on bi-directional comparisons.
The de novo assembly provided here, together with additional assemblies that will certainly be published in the future, will make it possible to describe the pan-genome of A. thaliana.
Year: 2016 PMID: 27711162 PMCID: PMC5053417 DOI: 10.1371/journal.pone.0164321
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Assembly statistics.
Metrics of the Nd-1 genome sequence assembly before and after application of SSPACE, GapFiller and subsequent RBH-based manual improvement.
| parameter | CLC assembly | scaffolded | gaps filled | polished |
|---|---|---|---|---|
| number of scaffolds | 10,057 | 5,201 | 5,201 | 5,197 |
| total number of bases | 113,939,710 | 117,144,260 | 117,816,107 | 116,846,015 |
| average scaffold length | 11,329 bp | 22,523 bp | 22,652 bp | 22,483 bp |
| minimal scaffold length | 500 bp | 500 bp | 500 bp | 500 bp |
| maximal scaffold length | 445,914 bp | 3,176,818 bp | 3,190,961 bp | 2,967,516 bp |
| GC content | 35.98% | 35.98% | 35.95% | 35.95% |
| N25 | 102,863 bp | 1,299,823 bp | 1,304,062 bp | 1,211,412 bp |
| N50 | 52,252 bp | 709,626 bp | 713,021 bp | 589,639 bp |
| N75 | 22,586 bp | 214,378 bp | 215,617 bp | 174,007 bp |
| N90 | 7,163 bp | 42,960 bp | 43,285 bp | 40,994 bp |
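The N25, N50, N75 and N90 rows follow the usual definition of Nx: the length of the shortest scaffold in the smallest set of longest scaffolds that together contain at least x% of all assembled bases. The following is a minimal sketch of that definition, not the tool used to produce the table; the function name `nx` and the toy scaffold lengths are illustrative assumptions.

```python
def nx(lengths, percent):
    """Return the Nx value of an assembly.

    lengths: scaffold lengths in bp (hypothetical input, any order)
    percent: integer x in Nx, e.g. 50 for N50
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards and stop as soon as the
    # accumulated length reaches x% of the total assembly size.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Toy example (not data from the Nd-1 assembly).
toy_lengths = [10, 20, 30, 40]
toy_n50 = nx(toy_lengths, 50)
```

With these toy lengths the two longest scaffolds (40 and 30) are needed to reach half of the 100 bases, so the N50 is the shorter of the two. A lower Nx after polishing, as in the last column of the table, is what one would expect when mis-joined scaffolds are split during manual correction.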
Fig 1. Mapping of Nd-1 scaffolds to the Col-0 reference sequence.
Schematic chromosomes are shown in grey with centromere positions in purple. Below each chromosome, red bars indicate the frequency of scaffolds. Above each chromosome, black bars show the abundance of the 180 bp centromeric repeat that has been shown to be a major component of A. thaliana centromeric DNA [62]. Data were calculated for a window size of 50 kbp.
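Window-based frequencies like those described for Fig 1 can be obtained by binning feature positions into fixed 50 kbp intervals. The sketch below assumes that each feature (a scaffold or a repeat copy) is represented by a single 0-based start coordinate on a chromosome; this representation and the name `window_counts` are assumptions made for illustration, not details given in the caption.

```python
from collections import Counter

WINDOW_SIZE = 50_000  # 50 kbp, the window size stated in the caption


def window_counts(positions, window=WINDOW_SIZE):
    """Count features per window.

    positions: 0-based start coordinates on one chromosome (hypothetical input)
    Returns a dict mapping window index to the number of features whose
    start falls into that window; empty windows are omitted.
    """
    counts = Counter(pos // window for pos in positions)
    return dict(sorted(counts.items()))


# Toy coordinates, not taken from the Nd-1 data.
toy_counts = window_counts([0, 49_999, 50_000, 120_000])
```

Window index `i` covers positions `i * 50_000` up to, but not including, `(i + 1) * 50_000`.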
Fig 2. Reciprocal best hits (RBH) synteny of Nd-1 and Col-0.
All five pseudochromosomes of the two genome sequences were ordered by their number to provide the x (Col-0) and y (Nd-1) axes of the diagram. Positions of each RBH pair in the two genome assemblies were plotted, resulting in a bisecting line formed from black dots representing perfectly matching RBH pairs. RBH gene pair positions deviating from a fully syntenic position, i.e. the outliers, are represented by green dots for RBH pairs with ambiguous best hits in RBH pair identification, and by red dots for RBH pairs with deviating (non-syntenic) gene positions. Since two red dots overlap each other, only three locations are visible. Positions of the centromeres (CEN1 to CEN5) are indicated by purple lines. Ends of pseudochromosomes (telomeres) are indicated by short black lines at the bisecting line (forming crosses) and on both axes. Formally, the unmapped fraction of Nd-1 contigs is appended after pseudochromosome 5, but this sequence of about 134 kbp in length is not visible due to the limited resolution of the figure.
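The RBH pairs plotted in Fig 2 rest on a simple rule: gene a from one genome and gene b from the other form a pair only if b is the best hit of a and a is the best hit of b. The sketch below illustrates that rule on precomputed hit lists of the form (query, subject, score). The input format, the function names and the toy gene identifiers are assumptions for illustration; the sketch does not reproduce the BLAST settings or the handling of ambiguous best hits used in the study.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples (hypothetical format).
    On equal scores the first hit seen is kept; a real analysis would need
    to flag such ties as ambiguous.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy hit lists with made-up gene names and scores.
col_vs_nd = [("colA", "ndA", 200), ("colA", "ndB", 150), ("colB", "ndB", 180)]
nd_vs_col = [("ndA", "colA", 190), ("ndB", "colA", 160), ("ndB", "colB", 170)]
toy_pairs = reciprocal_best_hits(col_vs_nd, nd_vs_col)
```

Plotting the genomic position of `a` against that of `b` for every pair then gives a diagram of the kind described in the caption, with syntenic pairs falling on the diagonal.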
Summary of the sizes of large insertions, deletions and HDRs.
The data were compiled from reciprocal read mapping of Nd-1 reads to the Col-0 genome sequence and vice versa. However, the table presents the results regarding PAV from the view of Nd-1; an insertion in Nd-1 is at the same time a deletion in Col-0, and a deletion in Nd-1 is at the same time an insertion in Col-0.
| Variant length [bp] | ZCRs (Col-0 reads) | Insertions in Nd-1 | ZCRs (Nd-1 reads) | Deletions in Nd-1 |
|---|---|---|---|---|
| | 3,331 (480,416 bp) | 244 (34,606 bp) | 4,021 (569,529 bp) | 227 (31,974 bp) |
| | 2,817 (794,403 bp) | 220 (60,698 bp) | 2,644 (738,991 bp) | 207 (58,734 bp) |
| | 2,112 (1,196,028 bp) | 140 (79,879 bp) | 1,461 (808,370 bp) | 121 (67,725 bp) |
| | 1,141 (1,281,170 bp) | 106 (118,182 bp) | 775 (862,766 bp) | 99 (110,558 bp) |
| | 631 (1,416,843 bp) | 42 (92,912 bp) | 380 (834,816 bp) | 41 (91,758 bp) |
| | 411 (1,857,365 bp) | 57 (264,498 bp) | 211 (946,860 bp) | 42 (195,585 bp) |
| | 119 (1,029,274 bp) | 15 (116,191 bp) | 57 (469,079 bp) | 8 (56,713 bp) |
| | 25 (410,562 bp) | 2 (26,505 bp) | 4 (61,067 bp) | 1 (13,487 bp) |
| | 3 (103,639 bp) | - | 5 (206,506 bp) | - |
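The per-size-class counts in the insertion and deletion columns can be checked against the totals given in the abstract (826 insertions and 746 deletions in Nd-1). The sketch below parses the table cells as printed, in the form "count (total bp)", and sums them; the helper name `parse_cell` and the regular expression are illustrative choices, and a "-" cell is treated as zero.

```python
import re

# A cell looks like "244 (34,606 bp)": a count followed by a summed length.
CELL_PATTERN = re.compile(r"([\d,]+) \(([\d,]+) bp\)")


def parse_cell(cell):
    """Return (count, total_bp) for one table cell; '-' counts as (0, 0)."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        return (0, 0)
    return tuple(int(group.replace(",", "")) for group in match.groups())


def column_totals(cells):
    """Sum counts and base pairs over all cells of one column."""
    parsed = [parse_cell(cell) for cell in cells]
    return (sum(count for count, _ in parsed), sum(bp for _, bp in parsed))


# Cells copied from the "Insertions in Nd-1" and "Deletions in Nd-1" columns.
insertion_cells = [
    "244 (34,606 bp)", "220 (60,698 bp)", "140 (79,879 bp)",
    "106 (118,182 bp)", "42 (92,912 bp)", "57 (264,498 bp)",
    "15 (116,191 bp)", "2 (26,505 bp)", "-",
]
deletion_cells = [
    "227 (31,974 bp)", "207 (58,734 bp)", "121 (67,725 bp)",
    "99 (110,558 bp)", "41 (91,758 bp)", "42 (195,585 bp)",
    "8 (56,713 bp)", "1 (13,487 bp)", "-",
]

insertion_count, insertion_bp = column_totals(insertion_cells)
deletion_count, deletion_bp = column_totals(deletion_cells)
print(insertion_count, deletion_count)  # prints "826 746"
```

The column sums reproduce the 826 insertions and 746 deletions reported in the abstract, which also confirms how the table values map onto its column headers.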