David B Neale, Jill L Wegrzyn, Kristian A Stevens, Aleksey V Zimin, Daniela Puiu, Marc W Crepeau, Charis Cardeno, Maxim Koriabine, Ann E Holtz-Morris, John D Liechty, Pedro J Martínez-García, Hans A Vasquez-Gross, Brian Y Lin, Jacob J Zieve, William M Dougherty, Sara Fuentes-Soriano, Le-Shin Wu, Don Gilbert, Guillaume Marçais, Michael Roberts, Carson Holt, Mark Yandell, John M Davis, Katherine E Smith, Jeffrey F D Dean, W Walter Lorenz, Ross W Whetten, Ronald Sederoff, Nicholas Wheeler, Patrick E McGuire, Doreen Main, Carol A Loopstra, Keithanne Mockaitis, Pieter J deJong, James A Yorke, Steven L Salzberg, Charles H Langley.
Abstract
BACKGROUND: The size and complexity of conifer genomes have, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination.
Year: 2014 PMID: 24647006 PMCID: PMC4053751 DOI: 10.1186/gb-2014-15-3-r59
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. (A) The sources of haploid and diploid genomic DNA. The reproductive cycle of a conifer showing the unique sources of haploid and diploid genomic DNA sequenced. Both the ova pronucleus and the megagametophyte are derived by mitotic divisions from a single one of the four haploid meiotic segregant megaspores. The tissue from a single megagametophyte formed the basis for all of our shorter insert paired end Illumina libraries (Table 1). To construct longer insert libraries (Illumina mate pair and Fosmid DiTag) requiring greater amounts of starting DNA, needles from the parental genotype (20-1010) were used. (B) Sequencing and assembly schematic. An overlap layout consensus assembly, made possible by MaSuRCA’s critical reduction phase, was followed by additional scaffolding, incorporating transcript assemblies, to improve contiguity and completeness [9,11].
Characteristics of the loblolly pine v1.01 draft assembly
| Number of chromosomes | 12 |
| G + C% | 38.2% |
| Sequence in contigs >64 bp | 20,148,103,497 bp |
| Total span of scaffolds | 23,180,477,227 bp |
| Contig N50 | 8,206 bp |
| Scaffold N50 | 66,920 bp |
| Haploid paired end libraries 200-600 bp | 11 libraries |
| 7.5 billion x 2 reads (GA2x + HiSeq + MiSeq) | |
| 1.4 trillion bp total read length | |
| 63x sequence coverage | |
| 150 million maximal super-reads | |
| 52 billion total bp | |
| 2.4x sequence coverage | |
| Diploid mate pair libraries 1,000-5,500 bp | 48 libraries |
| 863 million x 2 reads (GA2x) | |
| 273 billion bp total read length | |
| 270 million x 2 reads after filtering | |
| 37x physical coverage | |
| DiTag libraries 35-40 Kbp | 9 libraries |
| 46 million x 2 reads (GA2x) | |
| 4.5 million reads x 2 after filtering | |
| 7.5x physical coverage | |
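The coverage figures in the table can be cross-checked with simple arithmetic: sequence coverage is total sequenced bases divided by genome size, and physical coverage is the number of read pairs times the insert length divided by genome size. The sketch below is only an illustration of that relationship, not the authors' calculation. The function names are hypothetical, the inputs are the rounded values from the table, and the 37,500 bp DiTag insert length is an assumed midpoint of the stated 35-40 Kbp range.

```python
def implied_genome_size(total_bp: float, coverage: float) -> float:
    """Genome size implied by the relation coverage = total_bp / genome_size."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_bp / coverage


def physical_coverage(n_pairs: float, mean_insert_bp: float,
                      genome_size_bp: float) -> float:
    """Physical coverage: total span of the inserts divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_pairs * mean_insert_bp / genome_size_bp


# Rounded values taken from the table above.
size_from_reads = implied_genome_size(1.4e12, 63)       # paired-end reads
size_from_super_reads = implied_genome_size(52e9, 2.4)  # maximal super-reads

# ASSUMPTION: 37,500 bp is the midpoint of the 35-40 Kbp DiTag insert range.
ditag_cov = physical_coverage(4.5e6, 37_500, size_from_reads)

print(f"Genome size implied by reads:       {size_from_reads / 1e9:.1f} Gbp")
print(f"Genome size implied by super-reads: {size_from_super_reads / 1e9:.1f} Gbp")
print(f"DiTag physical coverage (assumed insert): {ditag_cov:.1f}x")
```

Both read sets imply a genome of roughly 22 Gbp, which sits between the contig total and the scaffold span reported in the table, and the DiTag estimate lands close to the tabulated 7.5x.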
Comparison of gene metrics among sequenced plant genomes
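| Metric | Pinus taeda | Picea abies | Arabidopsis thaliana | Populus trichocarpa | Vitis vinifera | Amborella trichopoda |
| Genome size (Mbp) | 20,148 | 12,019a | 135 | 423 | 487 | 706 |
| Chromosomes | 12 | 12 | 5 | 19 | 19 | 13 |
| G + C% | 38.2 | 37.9 | 35.0 | 33.3 | 36.2 | 35.5 |
| TE content (%) | 79 | 70 | 15.3 | 42 | 41.4 | N/A |
| Number of genesb | 50,172 | 58,587c | 27,160 | 36,393 | 25,663 | 25,347 |
| Average CDS length (bp) | 965 | 723 | 1102 | 1143 | 1095 | 969 |
| Average intron length (bp) | 2,741 | 1,020 | 182 | 366 | 933 | 1,538 |
| Maximum intron length (bp) | 318,524 | 68,269 | 10,234 | 4,698 | 38,166 | 175,748 |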
a Estimated genome size is 19.6 Gbp.
b Number of full-length genes >150 bp in length and validated through current annotations.
c High and medium confidence genes from the Congenie project [8].
Figure 2. Unique gene families and Gene Ontology term assignments. (A) Identification of orthologous groups of genes for 14 species split into five categories: conifers (Picea abies, Picea sitchensis, and Pinus taeda), monocots (Oryza sativa and Zea mays), dicots (Arabidopsis thaliana, Glycine max, Populus trichocarpa, Ricinus communis, Theobroma cacao, and Vitis vinifera), early land plants (Selaginella moellendorffii and Physcomitrella patens), and a basal angiosperm (Amborella trichopoda). Here, we depict the number of clusters in common between the biological categories in the intersections. The total number of sequences for each species is provided under the name (total number of sequences/total number of clustered sequences). (B) Gene ontology molecular function term assignments by family for all species (red), conifers (green), and Pinus taeda exclusively (blue).
Figure 3. Interspersed and tandem repetitive content. (A) Overview of repetitive content in the Pinus taeda genome for similarity (blue) and de novo (yellow) approaches. Introns are evaluated with similarity methods against PIER 2.0 [32]. (B) Overview of microsatellite content across species with exclusion of mononucleotide repeats. Orange, green, and purple points represent angiosperm, gymnosperm, and lycophyte species, respectively. Each point displays both the density (point size) and length (y-axis) of di-, tri-, tetra-, penta-, hexa-, hepta-, and octanucleotide tandem repeats (x-axis). The Overall category is an accumulation of the previous seven categories.
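As a rough illustration of the repeat classes in panel (B), the sketch below scans a DNA string for perfect tandem repeats with motif lengths of 2 to 8 bp and skips mononucleotide runs, mirroring the categories in the caption. It is a minimal regex-based sketch, not the tool or thresholds used for the figure. The function name `find_microsatellites` and the `min_copies=4` cutoff are assumptions made for illustration only.

```python
import re
from collections import Counter

# Motif lengths 2..8 correspond to the di- through octanucleotide classes.
MOTIF_NAMES = {2: "di", 3: "tri", 4: "tetra", 5: "penta",
               6: "hexa", 7: "hepta", 8: "octa"}


def find_microsatellites(seq, min_copies=4):
    """Return (start, motif, copies) for perfect tandem repeats.

    ASSUMPTION: min_copies=4 is an illustrative threshold, not taken
    from the source.
    """
    seq = seq.upper()
    hits = []
    for k in range(2, 9):
        # A k-mer followed by at least (min_copies - 1) more copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (for example
            # "AA" or "ATAT"); this also excludes mononucleotide runs.
            if any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k)):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return hits


example = "GGG" + "AT" * 6 + "CCC" + "A" * 10 + "ACG" * 5 + "T"
hits = find_microsatellites(example)
# Tally hits by motif-length class, as on the x-axis of panel (B).
by_class = Counter(MOTIF_NAMES[len(motif)] for _, motif, _ in hits)
print(hits)
print(dict(by_class))
```

Density, as encoded by point size in the figure, could then be approximated by dividing the per-class counts by the length of the scanned sequence.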
Figure 4. Identification of a TNL candidate gene for Fr1. (A) Genome survey of rust resistance in segregating progeny of Fr1/fr1 Pinus taeda among clonally propagated half-siblings (upper) and full-siblings (lower). Bins with the highest LOD scores contained SNP 2_5345_01 (*). (B) The translated gene model (G) on genome scaffold jcf7180063178873 is interrupted by three introns, with sizes given in bp. Also shown are the previously available EST (E) containing SNP 2_5345_01 (*), the fully assembled transcript Evg1_1A_all_VO_L_3760_240252 from RNAseq (R), and the domain structure of the protein model (P).
Figure 5. lp3 sequences from GenBank (AAB07493, AAB96829) were aligned to the same scaffold in v1.01 and supported by two distinct MAKER-derived gene models.