Literature DB >> 23698863

Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data.

Inanc Birol¹, Anthony Raymond, Shaun D Jackman, Stephen Pleasance, Robin Coope, Greg A Taylor, Macaire Man Saint Yuen, Christopher I Keeling, Dana Brand, Benjamin P Vandervalk, Heather Kirk, Pawan Pandoh, Richard A Moore, Yongjun Zhao, Andrew J Mungall, Barry Jaquish, Alvin Yanchuk, Carol Ritland, Brian Boyle, Jean Bousquet, Kermit Ritland, John Mackay, Jörg Bohlmann, Steven J M Jones.

Abstract

UNLABELLED: White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome though pushes the boundaries of the current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8 giga base pairs draft genome in 4.9 million scaffolds, with a scaffold N50 of 20,356 bp. We demonstrate how recent improvements in the sequencing technology, especially increasing read lengths and paired end reads from longer fragments have a major impact on the assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. AVAILABILITY: The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000 PID: PRJNA83435). http://www.ncbi.nlm.nih.gov/bioproject/83435.

Entities: Chemical Disease Species

Mesh：

Year: 2013 PMID： 23698863 PMCID： PMC3673215 DOI： 10.1093/bioinformatics/btt178

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The assembly of short reads to develop genomic resources for non-model species remains an active area of development (Schatz ). The feasibility of the approach and its scalability to large genomes was demonstrated by the ABySS publication (Simpson ) using human genome sequencing data and was later used to assemble the panda genome with the SOAPdenovo tool (Li ). The technology provides high quality results, as demonstrated for bacteria (Bankevich ; Ladner ; Ribeiro ), and has been successfully applied numerous times on more complex genomes (Chan ; Chu ; Diguistini , 2011; Godel ; Swart ). Estimated at 20 giga base pairs (Gb) (Murray, 1998), sequencing and assembly of the genome of this gymnosperm species of the pine (Pinaceae) family present unique challenges. On the data generation end, those challenges include representation biases in whole-genome shotgun sequencing data, and difficulties in building reduced representation resources to scale down the magnitude of the problem. On the bioinformatics end, assembling massive sequencing datasets is extremely demanding on computing cycles, memory usage, storage requirements, and for parallel programming implementations on communication traffic. We addressed the data representation challenges by preparing and sequencing multiple whole-genome shotgun libraries on the HiSeq 2000 and MiSeq sequencers from Illumina (San Diego, CA, USA). Compared with localized sequencing protocols, such as building and sequencing fosmid libraries, or the recent approach of isolating ∼10 kb DNA strands to generate indexed sequencing fragments in high throughput (Moleculo, San Diego, CA, USA), a shotgun only sequencing approach rapidly provides sequence data effectively covering the target genome at a cost that can be an order of magnitude less. The difference in cost is especially substantial when sequencing a large genome. In this work, we demonstrate that shotgun sequence assembly at this scale remains viable and produces valuable results. To assemble the spruce genome, we used the ABySS algorithm (Simpson ), which captures a representation of read-to-read overlaps by a distributed de Bruijn graph and uses parallel computations to build the target genome. The modular nature of the tool allowed us to execute a large number of tests to tune the message passing interface for a successful execution, train the assembly parameters for an optimal assembly and quantify the utility of long reads for large genome assemblies. To the best of our knowledge, the ABySS algorithm is unique in its ability to enable genome assemblies of this scale using whole-genome shotgun sequencing data.

2 METHODS

2.1 Sample collection

Apical shoot tissues were collected in April 2006 from a single white spruce (Picea glauca, genotype PG29) tree at the Kalamalka Research Station of the British Columbia Ministry of Forests and Ranges, Vernon, British Columbia, Canada. Genomic DNA was extracted from 60 gm tissue by BioS&T (http://www.biost.com/, Montreal, QC, Canada) using an organelle exclusion method yielding 300 µg of high quality purified nuclear DNA.

2.2 Library preparation and sequencing

DNA quality was assessed by spectrophotometry and gel electrophoresis before library construction. DNA was sheared for 45 s using an E210 sonicator (Covaris) and then analysed on 8% PAGE gels. The 200–300 bp (for libraries with 250 bp insert size) or 450–550 bp (for libraries with 500 bp insert size) DNA size fractions were excised and eluted from the gel slices overnight at 4°C in 300 µl of elution buffer {5:1 [vol/vol] LoTE buffer [3 mM Tris–HCl (pH 7.5), 0.2 mM EDTA]/7.5 M ammonium acetate} and was purified using a Spin-X Filter Tube (Fisher Scientific) and ethanol precipitation. Genome libraries were prepared using a modified paired-end tag (PET) protocol supplied by Illumina Inc. This involved DNA end repair and formation of 3′ adenosine overhangs using the Klenow fragment of DNA polymerase I (3′–5′ exonuclease minus) and ligation to Illumina PE adapters (with 5′ overhangs). Adapter-ligated products were purified on QIAquick spin columns (Qiagen) and amplified using Phusion DNA polymerase (NEB) and 10 PCR cycles with the PE primer 1.0 and 2.0 (Illumina). PCR products of the desired size range were purified from adapter ligation artifacts using 8% PAGE gels. DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay (Agilent) and Nanodrop 7500 spectrophotometer (Nanodrop). DNA was subsequently diluted to 8 nM. The final concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit fluorometer (Invitrogen). The mate pair (MPET, a.k.a. jumping) libraries were constructed using 4 µg of genomic DNA with the Illumina Nextera Mate-Pair library construction protocol and reagent (FC-132-1001). The genomic DNA sample was simultaneously fragmented and tagged with a biotin containing mate pair junction adapter, which left a short sequence gap in the tagmented DNA. The gap was filled by a strand displacement reaction using a polymerase to ensure that all fragments were flush and ready for circularization. After an AMPure Bead cleanup, size selection was done on a 0.6% agarose gel to excise 6–9 kb and 9–13 kb fractions, which were purified using a Zymoclean Large Fragment DNA Recovery Kit. The fragments were circularized by ligation, followed by a digestion to remove any linear molecules and left circularized DNA for shearing. The sheared DNA fragments that contain the biotinylated junction adapter (mate pair fragments) were purified by means of binding to streptavidin magnetic beads, and the unwanted unbiotinylated molecules were washed away. The DNA fragments were then end repaired and A-tailed following the protocol and ligated to indexed TruSeq adapters. The final library was enriched by a 10-cycle PCR and purified by AMPure bead clean-up. Library quality and size were assessed by Agilent DNA 1000 series II assay and KAPA Library Quantification protocol. The two fractions were pooled for sequencing paired end 100 bp using Illumina HiSeq2000. The construction of the 12 kb mate pair libraries was achieved by a hybrid 454/Illumina procedure. Briefly, 50 µg of genomic DNA was fragmented for 20 cycles at speed code 12 using a Hydroshear (Digilab, Marlborough, MA) equipped with a large assembly module. The fragmented DNA was loaded on a 1% agarose gel, and fragments from 12 to 18 kb were extracted. Biotinylated circularization adapters from the GS Titanium Paired-end Adaptor set (454 Life Sciences/Roche, Branford, CT) were added to ends of the gel-extracted fragments. Homologous recombination of the ends was performed with Cre recombinase (New England Biolabs, Ipswich, MA), and linear molecules remaining in solution were removed with Plasmid Safe (Epicentre, Madison, WI). Circular molecules were fragmented using GS Rapid Library Nebulizers (454 Life Sciences/Roche, Branford, CT), and fragment end-repair followed by A-tailing was performed with the GS Rapid Libray preparation kit (454 Life Sciences/Roche, Branford, CT). TruSeq Adaptors (Illumina, San Diego, CA) were ligated to the repaired/A-tailed ends. Biotinylated fragments were enriched using Streptavidin-coupled Dynabeads (Life Technologies, Grand Island, NY) and amplified by PCR using Illumina specific primers. Random bacterial artificial chromosome (BAC) sequencing was performed using DNA from the same genotype on 454 GS-FLX Titanium with 6 kb paired-end libraries at the PlateForme d’Analyses Génomiques of the Institute for Systems and Integrative Biology (Université Laval, Quebec City, QC). A single paired-end library was prepared on a pool of 15 BACs (equimolar concentrations) as described earlier in the text with the following modifications: 15 µg of DNA was fragmented using a Hydroshear with a standard assembly for 20 cycles at speed code 18, 6–10 kb fragments were extracted from the gel and GS-FLX library adaptors were ligated to the repaired/A-tailed fragments. GS-FLX sequencing using the titanium chemistry was performed according to manufacturer’s instructions (454 Life Sciences/Roche, Branford, CT). Sanger sequencing method was used to obtain targeted BAC sequencing data as previously described (Hamberger ; Keeling ).

2.3 MiSeq modification

In sequencing the spruce genome, we generated longer read lengths by modifying the MiSeq platform. The MiSeq uses a clamshell style cartridge (Supplementary Fig. S1A) to hold reagent tubes in an array that is accessed by the MiSeq’s sippers. Most of the reagents are used for read length independent steps such as denaturation and cluster generation, but three reagents, the Scan, Cleavage and Incorporation mixes, are consumed at each cycle. Although the MiSeq allows any read length to be specified in the control software, the reagent cartridge cannot be replaced during the run without stopping it. Increasing the read length therefore requires increasing the quantity of the length-dependent mixes in the cartridge. This led to the solution of combining the length-dependent reagents of two kits into one. A tool was designed that opens the snap-hook latches holding the cartridge together (Supplementary Figs S1B and S2), giving access to the reagent tubes, yet allowing the cartridge to be put back together without damage to its components (Supplementary Fig. S1C). At 40 ml, the stock length-dependent reagent containers allow for a maximum of ∼650 cycles in total. To maximize the potential of the combined kit approach, a new reagent tray with 70 ml wells was designed and placed in a modified clamshell base.

2.4 Read merging

Reads from the HiSeq 2000 PET 250 bp libraries and the MiSeq PET 500 bp libraries were merged using abyss mergepairs (Supplementary Fig. S3). This utility performed a pair-wise Smith Waterman overlapped alignment (Smith and Waterman, 1981) between reads pairs, and selected the best quality base where alignments returned mismatching bases. An arbitrary base was selected when qualities were identical. In cases of read-to-read alignment ambiguity, read pairs were not merged. Many of the MiSeq reads had significantly reduced qualities near the ends, and the second reads typically had more bases with reduced quality (Supplementary Fig. S4). Therefore, the first and second reads were initially trimmed to appropriate lengths, and further quality trimmed based on a quality score threshold of 15. In all datasets, the reads typically merged into long and high quality reads. Supplementary Box S1 shows the command lines and options used for read merging.

2.5 Assembly process

There were four sets of reads used in the assembly: (i) Merged reads, including MiSeq PET 500 bp libraries and HiSeq PET 250 bp, and (ii) HiSeq PET 500 reads were used in the initial sequence assembly. The HiSeq PET 500 and (iii) unmerged HiSeq PET 250 reads were used for paired linking information to generate contigs. Finally, (iv) the long fragment MPET libraries were used to bridge over segments of repetitive regions to form scaffolds (Supplementary Fig. S5). All stages of the assembly were run using the ABySS wrapper, abyss-pe, with the exception of generating the FM-index (see Supplementary Material for details), and the parameters used are outlined in Supplementary Box S2. Assembly execution times are given in Supplementary Table S1.

2.6 Read alignments

To generate contigs and scaffolds, we first needed to align the reads to the previous unitig and contig stages of the assembly, respectively. Owing to the size of the spruce genome, the fragmented nature of the assembly stages, and the size of the resulting fasta files (∼40 GB), we note that general purpose read alignment tools were not suitable for the task. For the same reasons, our standard method of generating an FM-index (Ferragina and Manzini, 2000) within the ABySS-map tool was too memory intensive, requiring >500 GB of RAM. We solved this problem by using bwte (Ferragina ), a tool for generating a Burrows Wheeler Transformation (Burrows and Wheeler, 1994), and converted its results to an FM-index with abyss-index (Supplementary Fig. S6). This method allowed us to index the unitigs and contigs using a maximum of ∼60 GB of RAM on a single machine. For the 500 bp PET libraries sequenced on the HiSeq 2000, we aligned each lane of data in parallel, merged alignments into a single file and inferred fragment size distributions using tools within the ABySS package. The read alignments to different unitigs were converted to distance estimates for the contig assembly stage. The alignments and distance estimates for each MPET Library were done using the standard wrapper for ABySS.

3 RESULTS

Prior experience indicates that sampling a large genome with multiple libraries and fragment lengths can mitigate potential sampling biases and capture a more even representation of the underlying genome (DiGuistini ; Earl ; Keeling ; Li ). One novel feature in our sequencing approach was to complement the high coverage data from the HiSeq 2000 sequencers with longer reads from the MiSeq, at low coverage, to support the assembly process. Using an early access 2 × 150 bp kit on the HiSeq 2000, we generated PET reads from two libraries with 250 bp nominal fragment lengths and 18 libraries with 500 bp nominal fragment lengths to a total of 64-fold raw coverage. Using the MiSeq, we generated 2 × 300 bp PET reads from four libraries with 500 bp nominal fragment lengths and 2 × 500 bp PET reads from one library with 500 bp nominal fragment length, contributing a further 4-fold raw coverage (Table 1). We also generated large fragment libraries of 6, 8 and 12 kb nominal fragment lengths, to provide linkage information across repeat structures. The first two of these libraries were prepared using the MPET protocol of Illumina, and the third was a pool of seven libraries prepared using a modified 454 paired end sequencing protocol (454 Life Sciences/Roche, Branford, CT). All long fragment libraries were sequenced at 2 × 100 bp, and the resulting sequences were used only for their linkage information during the scaffold stage of the assembly.

Table 1.

Sequencing data

Library protocol	Read length (bp)	Sequencing platform	Nominal fragment length (bp)	# Libraries	# Reads (M)	Fold coverage
PET	150	HiSeq 2000	250	2	1520	11.4
PET	150	HiSeq 2000	500	18	7000	52.5
PET	300	MiSeq	500	4	170	2.6
PET	500	MiSeq	500	1	46	1.2
MPET	100	HiSeq 2000	6000	1	816	N/A
MPET	100	HiSeq 2000	8000	1	769	N/A
454	100	HiSeq 2000	12 000	7	34	N/A

Note: Short fragment libraries prepared by the Illumina PET protocol were used for their sequence content. Long fragment libraries prepared by the Illumina MPET, and a modified 454 protocols were used for linkage information during scaffolding. As long fragment libraries do not contribute to the sequence content of the assembly, their contribution to genome coverage is marked as N/A (not applicable).

Sequencing data Note: Short fragment libraries prepared by the Illumina PET protocol were used for their sequence content. Long fragment libraries prepared by the Illumina MPET, and a modified 454 protocols were used for linkage information during scaffolding. As long fragment libraries do not contribute to the sequence content of the assembly, their contribution to genome coverage is marked as N/A (not applicable). The first release of the sequencing chemistry on the MiSeq allowed for 150 read cycles, which increased to 250 cycles in a subsequent release. To obtain longer read lengths, we modified the sequencing instrument as detailed in the Supplementary Material. Longer reads obtained from this platform were instrumental in improving the contiguity of the genome assembly as described later in the text. To build the white spruce genome, we used the ABySS de novo assembly tool (Simpson ), which has been rigorously evaluated (Earl ; Li, 2012; Rahman and Pachter, 2013). The process broadly has three stages: unitig, contig and scaffold building. Performed on a computer cluster of dual Intel Xeon® 6-core processors, each addressing 48 GB of memory, the three stages of the spruce genome assembly took approximately two, four and four days using 1560, 288 and 36 CPU cores, respectively. The first assembly stage constructs a distributed and scalable de Bruijn graph to represent read-to-read overlaps for unitig assembly. The second stage involves read-to-unitig alignments and path traversals on the unitig adjacency graph from the previous stage for contig construction, resolving short repeats along those paths, when possible. Similarly, the third stage involves read-to-contig alignments and path traversal on the contig-adjacency graph for scaffold construction, denoting unresolved sequence content with Ns. The first assembly stage can further be conceptualized in two parts, with roughly equal run times. The first part provides preliminary unitigs and a graph representing their adjacency, which is defined as (k-1) bp overlaps, where k is the primary assembly parameter denoting the minimum read-to-read overlap length that is considered specific enough for assembly. The second part eliminates short redundant unitigs, when their neighbors on both sides share more than a certain length of sequence (default: 10 bp), and enables further graph simplification. We based the preliminary assessment of assembly quality on the first part of the first stage, and optimized our assembly parameters, using the pre-unitig contiguity statistics (Fig. 1).

Fig. 1.

Assembly optimization. The de Bruijn graph stage (pre-unitig) of the assembly was used to optimize the overlap parameter, k-mer length, and the effect of inclusion of longer reads was assessed. The contiguity metrics N50 (solid curves, left y-axis) and N20 (dotted curves, right y-axis) are shown for assemblies that use the short reads only (blue) and short and long reads (black). The contiguity of the two datasets peaked for different k-mer lengths, with dataset of short and long reads having a maximum N50 and a maximum N20 for the same k = 109 bp. For short reads only, optimization with respect to N20 resulted in a slightly lower k-mer length (98 bp) compared with optimization with respect to N50 (101 bp), both of which are lower than the optimum k-mer length for the full dataset. Longer k-mers were desirable, as they help disambiguate longer repeat motifs The relatively short run time of this part of the assembly process allowed us to execute a large number of tests. Here, we also take this opportunity to demonstrate the utility of longer reads, as they pertain to the optimal k-mer length. As a larger k would resolve longer sequence ambiguities, it is desirable to choose it as large as possible, yet not too large; otherwise, we lose the sensitivity to detect valid read-to-read overlaps. The choice of this parameter is determined by several factors, including read lengths; fold coverage; genome complexity; genome size; and experimental noise. Among these, genome complexity and size are determined by the choice of the species to study, and the experimental noise is determined by the sample collection methods and the choice of the sequencing platform. Longer reads and higher fold coverage would enable one to use a longer k-mer length. Figure 1 shows results of our parameter search for an optimum k-mer length using the short HiSeq 2000 reads only and combined short HiSeq 2000 and long MiSeq reads. The contiguity statistics NX (describing X% reconstruction in assembled sequence lengths NX or longer) are typically locally concave-down functions of the k-mer length in a neighborhood where the total sequence reconstruction is close to the genome length. Controlled for the misassemblies, contig or scaffold N50 lengths are widely used quality metrics for genome assemblies. We demonstrate that the optimum pre-unitig N50 occurred at k = 109 bp when using both short and long reads, and at k = 101 bp when using just the short reads. Thus, we observe that incorporating the modest low coverage long read data from the MiSeq allowed us to use a more stringent overlap parameter for the assembly process, and resulted in improved assembly statistics. Optimum k using short and long read data yielded N50 = 1335 bp, whereas the optimum k using short read data only yielded N50 = 1236 bp. The difference between assembly contiguity numbers (7.4%) became more pronounced when the assembly process proceeded to use 158 million and 186 million pre-unitigs in these two cases, respectively, to construct unitigs (9.7%), contigs (9.8%) and scaffolds (17.8%). We also note that the optimum k values in both datasets were longer than the sequencing length of the MPET libraries. We propagated the optimum k = 109 bp assembly with short and long reads through the assembly pipeline. The full assembly of our data yielded 191 347 (171 971) scaffolds over 20 356 bp (22 967 bp) in length, representing >50% of the 20.8 Gb reconstructed (20 Gb estimated) spruce genome. The assembly statistics of the white spruce genome are presented in Table 2 in comparison with the whole-genome shotgun sequence assemblies of three barley cultivars (Mayer ). The recent barley genome assemblies represent results from a whole-genome shotgun sequencing and assembly project using similar data and offer a context for plant genomics.

Table 2.

Assembly Statistics

Species	White spruce			Barley
	Unitig	Contig	Scaffold	Morex	Bowman	Barke
				Contig	Contig	Contig
Number ≥ 500 bp	12.0 M	6.7 M	4.9 M	715 k	729 k	823 k
Number ≥ N50	2.2 M	1.2 M	191 k	121 k	124 k	170 k
Number ≥ NG50	3.0 M	1.0 M	172 k	N/A	N/A	N/A
N80 (bp)	824	2041	2041	1054	1131	998
N50 (bp)	1928	4996	20 356	2793	2994	2330
NG50 (bp)	1548	5351	22 967	N/A	N/A	N/A
N20 (bp)	4070	10 791	80 133	6537	6742	5077
Max (bp)	61 182	99 924	1 047 232	36 062	37 442	38 285
Reconstruction (Gb)	17.4	21.2	20.8	1.3	1.4	1.4

Note: Number, contiguity and reconstruction statistics at the three assembly stages of the white spruce genome in comparison with the contig statistics of whole-genome shotgun assemblies of three barley cultivars (Mayer ). The NG50 calculations assume a predicted genome size of 20 Gb for white spruce. As barley assemblies reconstruct less than half the estimated genome size of 5.1 Gb, the NG50 calculations are not applicable and are marked as N/A.

Assembly Statistics Note: Number, contiguity and reconstruction statistics at the three assembly stages of the white spruce genome in comparison with the contig statistics of whole-genome shotgun assemblies of three barley cultivars (Mayer ). The NG50 calculations assume a predicted genome size of 20 Gb for white spruce. As barley assemblies reconstruct less than half the estimated genome size of 5.1 Gb, the NG50 calculations are not applicable and are marked as N/A. As a means to assess the quality of the assembled spruce genome, the sequences of several BACs from the same genotype were aligned against the assembly using BLAST (Altschul ). Six previously sequenced targeted BACs containing known terpene synthase and cytochrome P450 genes (see Supplementary Material) were then compared with MUMmer dot plots (Kurtz ). Twenty-six scaffolds longer than 1000 bp aligned with >95% similarity to reconstruct 62% of the sequence of these six BACs. Even though we did not use any protein, expressed sequence tag or RNA-seq data to improve the scaffolds, the whole-genome shotgun approach produced a good quality assembly as measured by its representation of genic regions. First, we searched our assembly for a list of 248 ultra-conserved core eukaryotic genes (CEGs) as selected by Parra, Bradnam and Korf (Parra ). Using their Core Eukaryotic Genes Mapping Approach, we identified 95 (38%) of the CEGs as complete sequences and found 184 (74%) of them to have at least a partial representation in our assembly. Similarly, and on a broader scale, we used a set of 13 036 full-length cDNA clones from a closely related species, Sitka spruce (Picea sitchensis) (Ralph ), and measured their representation in the white spruce assembly. We identified 11 108 (83%) of them in our scaffolds, with 5895 (45%) presented in a single scaffold over 90% of their length. We verified that the exon orders of the Sitka spruce transcripts we investigated were conserved in the white spruce assembly. From those 5895 genes found in the white spruce, we estimated the mean and the median gene lengths to be 5151 bp and 804 bp, respectively. These lengths represent genomic distances much less than our typical scaffold lengths.

4 CONCLUSION

The choice between a whole-genome shotgun sequencing approach and sequencing reduced representation libraries was extensively discussed during the Human Genome Project (Lander ; Venter ), and the former became the dominant technology as the sequencing throughput rapidly increased, rendering library techniques to prepare data for the latter approach relatively expensive. A decade later, researchers studying conifer genomes are trying to answer the same question. In our study, we demonstrate that modern whole-genome shotgun sequencing and assembly methods can provide competitive draft genome assemblies at the multi-Gb scale for downstream biological studies in a cost-effective way, even if it is far from producing chromosome level contiguous sequence. We note that a rigorous assessment of the reported assembly is not a trivial undertaking and will need to be performed, as the assembly evolves from its draft stage toward a more established reference. For example, de novo assembly evaluation tools such as CGAL (Rahman and Pachter, 2013), FRCbam (Vezzi ) and ALE (Clark ) would either not scale to the size of the problem or require substantial time and computational resources. Still, compared with previous targeted gene and genome subsampling studies, the assembly introduced in this article already gives the community considerably greater power to identify and study gymnosperm genes, to assist forest management strategies and to understand the environmental biological interactions that involve spruce trees at a basic level.

28 in total

1. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.

Authors: Heng Li
Journal: Bioinformatics Date: 2012-05-07 Impact factor: 6.937

2. Assemblathon 1: a competitive assessment of de novo short read assembly methods.

Authors: Dent Earl; Keith Bradnam; John St John; Aaron Darling; Dawei Lin; Joseph Fass; Hung On Ken Yu; Vince Buffalo; Daniel R Zerbino; Mark Diekhans; Ngan Nguyen; Pramila Nuwantha Ariyaratne; Wing-Kin Sung; Zemin Ning; Matthias Haimel; Jared T Simpson; Nuno A Fonseca; İnanç Birol; T Roderick Docking; Isaac Y Ho; Daniel S Rokhsar; Rayan Chikhi; Dominique Lavenier; Guillaume Chapuis; Delphine Naquin; Nicolas Maillet; Michael C Schatz; David R Kelley; Adam M Phillippy; Sergey Koren; Shiaw-Pyng Yang; Wei Wu; Wen-Chi Chou; Anuj Srivastava; Timothy I Shaw; J Graham Ruby; Peter Skewes-Cox; Miguel Betegon; Michelle T Dimon; Victor Solovyev; Igor Seledtsov; Petr Kosarev; Denis Vorobyev; Ricardo Ramirez-Gonzalez; Richard Leggett; Dan MacLean; Fangfang Xia; Ruibang Luo; Zhenyu Li; Yinlong Xie; Binghang Liu; Sante Gnerre; Iain MacCallum; Dariusz Przybylski; Filipe J Ribeiro; Shuangye Yin; Ted Sharpe; Giles Hall; Paul J Kersey; Richard Durbin; Shaun D Jackman; Jarrod A Chapman; Xiaoqiu Huang; Joseph L DeRisi; Mario Caccamo; Yingrui Li; David B Jaffe; Richard E Green; David Haussler; Ian Korf; Benedict Paten
Journal: Genome Res Date: 2011-09-16 Impact factor: 9.043

3. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies.

Authors: Scott C Clark; Rob Egan; Peter I Frazier; Zhong Wang
Journal: Bioinformatics Date: 2013-01-09 Impact factor: 6.937

4. The sequence and de novo assembly of the giant panda genome.

Authors: Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; Quanfei Huang; Qingle Cai; Bo Li; Yinqi Bai; Zhihe Zhang; Yaping Zhang; Wen Wang; Jun Li; Fuwen Wei; Heng Li; Min Jian; Jianwen Li; Zhaolei Zhang; Rasmus Nielsen; Dawei Li; Wanjun Gu; Zhentao Yang; Zhaoling Xuan; Oliver A Ryder; Frederick Chi-Ching Leung; Yan Zhou; Jianjun Cao; Xiao Sun; Yonggui Fu; Xiaodong Fang; Xiaosen Guo; Bo Wang; Rong Hou; Fujun Shen; Bo Mu; Peixiang Ni; Runmao Lin; Wubin Qian; Guodong Wang; Chang Yu; Wenhui Nie; Jinhuan Wang; Zhigang Wu; Huiqing Liang; Jiumeng Min; Qi Wu; Shifeng Cheng; Jue Ruan; Mingwei Wang; Zhongbin Shi; Ming Wen; Binghang Liu; Xiaoli Ren; Huisong Zheng; Dong Dong; Kathleen Cook; Gao Shan; Hao Zhang; Carolin Kosiol; Xueying Xie; Zuhong Lu; Hancheng Zheng; Yingrui Li; Cynthia C Steiner; Tommy Tsan-Yuk Lam; Siyuan Lin; Qinghui Zhang; Guoqing Li; Jing Tian; Timing Gong; Hongde Liu; Dejin Zhang; Lin Fang; Chen Ye; Juanbin Zhang; Wenbo Hu; Anlong Xu; Yuanyuan Ren; Guojie Zhang; Michael W Bruford; Qibin Li; Lijia Ma; Yiran Guo; Na An; Yujie Hu; Yang Zheng; Yongyong Shi; Zhiqiang Li; Qing Liu; Yanling Chen; Jing Zhao; Ning Qu; Shancen Zhao; Feng Tian; Xiaoling Wang; Haiyin Wang; Lizhi Xu; Xiao Liu; Tomas Vinar; Yajun Wang; Tak-Wah Lam; Siu-Ming Yiu; Shiping Liu; Hemin Zhang; Desheng Li; Yan Huang; Xia Wang; Guohua Yang; Zhi Jiang; Junyi Wang; Nan Qin; Li Li; Jingxiang Li; Lars Bolund; Karsten Kristiansen; Gane Ka-Shu Wong; Maynard Olson; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal: Nature Date: 2009-12-13 Impact factor: 49.962

5. A physical, genetic and functional sequence assembly of the barley genome.

Authors: Klaus F X Mayer; Robbie Waugh; John W S Brown; Alan Schulman; Peter Langridge; Matthias Platzer; Geoffrey B Fincher; Gary J Muehlbauer; Kazuhiro Sato; Timothy J Close; Roger P Wise; Nils Stein
Journal: Nature Date: 2012-10-17 Impact factor: 49.962

6. Finished bacterial genomes from shotgun sequence data.

Authors: Filipe J Ribeiro; Dariusz Przybylski; Shuangye Yin; Ted Sharpe; Sante Gnerre; Amr Abouelleil; Aaron M Berlin; Anna Montmayeur; Terrance P Shea; Bruce J Walker; Sarah K Young; Carsten Russ; Chad Nusbaum; Iain MacCallum; David B Jaffe
Journal: Genome Res Date: 2012-07-24 Impact factor: 9.043

7. The genome of the heartworm, Dirofilaria immitis, reveals drug and vaccine targets.

Authors: Christelle Godel; Sujai Kumar; Georgios Koutsovoulos; Philipp Ludin; Daniel Nilsson; Francesco Comandatore; Nicola Wrobel; Marian Thompson; Christoph D Schmid; Susumu Goto; Frédéric Bringaud; Adrian Wolstenholme; Claudio Bandi; Christian Epe; Ronald Kaminsky; Mark Blaxter; Pascal Mäser
Journal: FASEB J Date: 2012-08-13 Impact factor: 5.191

8. CGAL: computing genome assembly likelihoods.

Authors: Atif Rahman; Lior Pachter
Journal: Genome Biol Date: 2013-01-29 Impact factor: 13.583

9. A conifer genomics resource of 200,000 spruce (Picea spp.) ESTs and 6,464 high-quality, sequence-finished full-length cDNAs for Sitka spruce (Picea sitchensis).

Authors: Steven G Ralph; Hye Jung E Chun; Natalia Kolosova; Dawn Cooper; Claire Oddy; Carol E Ritland; Robert Kirkpatrick; Richard Moore; Sarah Barber; Robert A Holt; Steven J M Jones; Marco A Marra; Carl J Douglas; Kermit Ritland; Jörg Bohlmann
Journal: BMC Genomics Date: 2008-10-14 Impact factor: 3.969

10. Draft genome of the mountain pine beetle, Dendroctonus ponderosae Hopkins, a major forest pest.

Authors: Christopher I Keeling; Macaire M S Yuen; Nancy Y Liao; T Roderick Docking; Simon K Chan; Greg A Taylor; Diana L Palmquist; Shaun D Jackman; Anh Nguyen; Maria Li; Hannah Henderson; Jasmine K Janes; Yongjun Zhao; Pawan Pandoh; Richard Moore; Felix A H Sperling; Dezene P W Huber; Inanc Birol; Steven J M Jones; Joerg Bohlmann
Journal: Genome Biol Date: 2013-03-27 Impact factor: 13.583

123 in total

1. Genomics: Sprucing up forest tree genomics.

Authors: Magdalena Skipper
Journal: Nat Rev Genet Date: 2013-06-04 Impact factor: 53.242

2. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods.

Authors: B Ratcliffe; O G El-Dien; J Klápště; I Porth; C Chen; B Jaquish; Y A El-Kassaby
Journal: Heredity (Edinb) Date: 2015-07-01 Impact factor: 3.821

3. Functional diversification of Arabidopsis SEC1-related SM proteins in cytokinetic and secretory membrane fusion.

Authors: Matthias Karnahl; Misoon Park; Cornelia Krause; Ulrike Hiller; Ulrike Mayer; York-Dieter Stierhof; Gerd Jürgens
Journal: Proc Natl Acad Sci U S A Date: 2018-05-29 Impact factor: 11.205

4. Stepwise large genome assembly approach: a case of Siberian larch (Larix sibirica Ledeb).

Authors: Dmitry A Kuzmin; Sergey I Feranchuk; Vadim V Sharov; Alexander N Cybin; Stepan V Makolov; Yuliya A Putintseva; Natalya V Oreshkova; Konstantin V Krutovsky
Journal: BMC Bioinformatics Date: 2019-02-05 Impact factor: 3.169

5. MADS goes genomic in conifers: towards determining the ancestral set of MADS-box genes in seed plants.

Authors: Lydia Gramzow; Lisa Weilandt; Günter Theißen
Journal: Ann Bot Date: 2014-05-22 Impact factor: 4.357

6. The gymnosperm cytochrome P450 CYP750B1 catalyzes stereospecific monoterpene hydroxylation of (+)-sabinene in thujone biosynthesis in western redcedar.

Authors: Andreas Gesell; Markus Blaukopf; Lina Madilao; Macaire M S Yuen; Stephen G Withers; Jim Mattsson; John H Russell; Jörg Bohlmann
Journal: Plant Physiol Date: 2015-03-31 Impact factor: 8.340

Review 7. Insights into conifer giga-genomes.

Authors: Amanda R De La Torre; Inanc Birol; Jean Bousquet; Pär K Ingvarsson; Stefan Jansson; Steven J M Jones; Christopher I Keeling; John MacKay; Ove Nilsson; Kermit Ritland; Nathaniel Street; Alvin Yanchuk; Philipp Zerbe; Jörg Bohlmann
Journal: Plant Physiol Date: 2014-10-27 Impact factor: 8.340

8. The redox-sensitive chloroplast trehalose-6-phosphate phosphatase AtTPPD regulates salt stress tolerance.

Authors: Julia Krasensky; Caroline Broyart; Fernando A Rabanal; Claudia Jonak
Journal: Antioxid Redox Signal Date: 2014-06-26 Impact factor: 8.401

9. Flavan-3-ols in Norway spruce: biosynthesis, accumulation, and function in response to attack by the bark beetle-associated fungus Ceratocystis polonica.

Authors: Almuth Hammerbacher; Christian Paetz; Louwrance P Wright; Thilo C Fischer; Joerg Bohlmann; Andrew J Davis; Trevor M Fenning; Jonathan Gershenzon; Axel Schmidt
Journal: Plant Physiol Date: 2014-02-18 Impact factor: 8.340

10. Isolation and expression profiles of gibberellin metabolism genes in developing male and female cones of Pinus tabuliformis.

Authors: Shihui Niu; Lu Yuan; Yuncheng Zhang; Xiaoyang Chen; Wei Li
Journal: Funct Integr Genomics Date: 2014-07-31 Impact factor: 3.410