Literature DB >> 34839581

Population differentiation and epidemic tracking of Bursaphelenchus xylophilus in China based on chromosome-level assembly and whole-genome sequencing data.

Xiaolei Ding^1,2,3, Yunfei Guo^3,4, Jianren Ye^1,2, Xiaoqin Wu^1,2, Sixi Lin^1,2, Fengmao Chen^1,2, Lihua Zhu^1,2, Lin Huang^1,2, Xiaofeng Song⁵, Yi Zhang^1,2, Ling Dai^1,2, Xiaotong Xi^1,2, Jinsi Huang^1,2, Kai Wang^3,4,6,7, Ben Fan^1,2, De-Wei Li⁸.

Abstract

BACKGROUND: Bursaphelenchus xylophilus, the pinewood nematode, kills millions of pine trees worldwide every year, and causes enormous economic and ecological losses. Despite extensive research on population variation, there is little understanding of the population-wide variation spectrum in China.
RESULTS: We sequenced an inbred B. xylophilus strain using Pacbio+Illumina+Bionano+Hi-C and generated a chromosome-level assembly (AH1) with six chromosomes of 77.1 Mb (chromosome N50: 12 Mb). The AH1 assembly shows very high continuity and completeness, and contains novel genes with potentially important functions compared with previous assemblies. Subsequently, we sequenced 181 strains from China and the USA and found ~7.8 million single nucleotide polymorphisms (SNPs). Analysis shows that the B. xylophilus population in China can be divided into geographically bounded subpopulations with severe cross-infection and potential migrations. In addition, distribution of B. xylophilus is dominated by temperature zones while geographically associated SNPs are mainly located on adaptation related GPCR gene families, suggesting the nematode has been evolving to adapt to different temperatures. A machine-learning based epidemic tracking method has been established to predict their geographical origins, which can be applied to any other species.
CONCLUSION: Our study provides the community with the first high-quality chromosome-level assembly which includes a comprehensive catalogue of genetic variations. It provides insights into population structure and effective tracking method for this invasive species, which facilitates future studies to address a variety of applied, genomic and evolutionary questions in B. xylophilus as well as related species.

Entities: Chemical

Keywords: Bursaphelenchus xylophilus; SNPs; chromosome-level assembly; epidemic tracking; population structure

Mesh：

Year: 2021 PMID： 34839581 PMCID： PMC9300093 DOI： 10.1002/ps.6738

Source DB: PubMed Journal: Pest Manag Sci ISSN： 1526-498X Impact factor: 4.462

INTRODUCTION

The pine wood nematode (PWN), Bursaphelenchus xylophilus (Steiner & Buhrer) Nickle, is an invasive species and the causative agent of pine wilt disease originating from North America. It does not harm the local pine trees in North America but poses a great threat to all pine forests in Asia and Europe. The PWN was first found in Japan in the early twentieth century. In the 1980s, it spread to other Asian countries, such as China and Korea. In 1999 it was reported in Portugal and continued to infect new forest areas in the Portuguese center region. , Now this nematode is also present in the Madeira Islands and Spain. In China, the infected area has reached 666 counties, 18 provinces and 1 114 000 ha according to State Forestry and Grassland Administration of China. Obviously, this invasive species has spread into very large areas of China within the span of 38 years. It is clear that the widespread and severe cross‐infection of B. xylophilus in China was mostly caused by human mediated activities, such as wood product exchange and infrastructure construction. However, it is still unclear how this parasitic nematode could quickly spread and adapt to various environments. Plenty of studies advocate that the best time to control biological invasions is in their early stages, in terms of cost and effectiveness. Therefore, utilizing quarantine measures in conjunction with identifying geographical pathways should be the first step in preventing the spread of this nematode. The explicit descriptions of B. xylophilus' origin and population structure are essential for epidemic source tracking, monitoring and prevention. Most relevant studies regarding the geographical origins or genetic variations of B. xylophilus are based on the different enzyme sites within DNA fragments, like the inter‐simple sequence repeat (ISSR), amplified fragment length polymorphism (AFLP) and restriction fragment length polymorphism (RFLP) methods. , , Apart from valuable information generated from those studies, struggling to properly identify nucleotide polymorphisms may impede researchers from attaining more promising results in areas with severe cross‐infection, like China. Consequently, we have consistently failed to draw concrete conclusions about the possible population structures or origins of B. xylophilus in most invaded areas. , , It is necessary to try other genetic markers to elucidate the population structures in China to prevent the spread of B. xylophilus. Alternatively, single nucleotide polymorphisms (SNPs) have been proved to be an efficient tool in population genetics due to their abundance and diversity in many species. , , There is pervasive evidence that genome‐wide association studies could be a powerful approach to reveal population structures and other complex traits in mammals, plants and microbes. , , Meanwhile, it is undoubted that an accurate and complete genome assembly should be the fundamental prerequisite in all aspects of molecular researches, including SNP genotyping. Since the first announcement of short‐read sequencing based genome assembly, SNPs markers began to be employed by researchers to identify possible origins and pathogenic traits of B. xylophilus. , , However, few studies have characterized the genetic variations in B. xylophilus with adequate sample sizes. This lack of genetic information hinders the field of B. xylophilus research from advancing the understanding of population structures and invasion routes from a standpoint of genetics and genomics. In addition, short‐read based assembly inherently cannot characterize complex regions and sequences with extreme GC contents, and inevitably leads to reduced assembly contiguity. , Technically, third‐generation, long‐read sequencing is gaining popularity and is proven to be superior in de novo assembly. It would be beneficial for all researchers to have access to a more continuous and accurate genome assembly with the help of new sequencing platforms. We therefore performed single‐molecule real‐time (SMRT) long‐read sequencing and BioNano physical mapping of the B. xylophilus genome coupled with genome re‐sequencing of 181 B. xylophilus strains covering almost all infected cities in China. The ultimate goals of this study were to generate a high‐quality reference assembly to decipher the genomic variation landscape and adaptation associated genes of B. xylophilus all over China. This assembly will elucidate current B. xylophilus populations and subpopulations, inevitably boosting the current research and monitoring system on this invasive species.

MATERIALS AND METHODS

Sampling

The strain AMA3 is an inbred line maintained by Lihua Zhu in Jianren Ye's laboratory for 20 generations and is used for subsequent SMRT PacBio sequencing and Bionano optical map generation. The other 180 nematode strains used for genome re‐sequencing and SNP genotyping were collected from infected trees located in 16 provinces by local forestry departments, with the approval of the State Forestry and Grassland Administration of China. All 180 strains were extracted from chopped trees using the Baermann funnel method. Each strain consists of ~40 hand‐picked individuals under a microscope from one single tree and cultured separately on the fungus Botrytis cinerea Pers. at 25 °C for 7–10 days to obtain sufficient sample sizes. The nematodes were recovered again and stored in water suspension using a 1.5‐mL centrifuge tube for downstream short‐read sequencing. The sole USA strain used in this study was intercepted and isolated from imported wood product and has been maintained in our laboratory since 2014. Detailed sampling information can be found in Table S8.

Generation of sequence data

The high‐molecular‐weight DNA was extracted from inbred line AMA3 for SMRT PacBio sequencing using Qiagen DNeasy Blood & Tissue Kit (Wuhan, China). DNA quality was assessed by Nanodrop, Qubit (Thermo Fisher) and visualized in agarose gel. A 20‐kb insert SMRT cell library was generated and the PacBio RS II platform was used to produce nine SMRT cells of raw data (Nextomics, Wuhan, China). For the BioNano optical map, DNA extracted from the AMA3 inbreed line went through optical scanning, with nicking enzyme NT.BssSI following protocols recommended by the manufacturer (BGI, Shenzhen, China). The Hi‐C library was constructed and sequenced via the Illumina Novaseq/MGI‐2000 platform (Nextomics). The short‐read DNA‐Seq data for genome polishing and RNA‐Seq data for transcriptome analyses were generated using the same AMA3 inbred line. For genome re‐sequencing, the genomic DNA was isolated from the B. xylophilus strains using the CTAB method and sequenced on an Illumina Hiseq 4000 (150 bp paired‐end reads).

Hybrid genome assembly and quality assessment

De novo genome assembly was constructed by minimap (v1cd6ae3) and miniasm (v17d5bd1) using PacBio SMRT long reads. Parameters, optimized based on genome continuity and completeness, were assessed by BUSCO (v2.0). Errors in long reads were then corrected by five rounds of iterative Racon (v0f6d4aa) polishing with long reads. To further correct insertion and deletion errors from long‐read sequencing, three rounds of short‐read polishing were carried out. Both long‐read and short‐read polishing were iterated until no significant improvement was observed with respect to genome completeness and number of errors corrected. The polished assembly was aligned against the National Center for Biotechnology Informatio (NCBI) representative microbe genomes database using BLAST to remove any contigs with microbe contamination (qcov> = 80%, identity> = 85%; ftp://ftp.ncbi.nlm.nih.gov/blast/db/). For quality assessment, PacBio reads were aligned back to the finished assemblies using minimap2 (v2.14‐r883) with preset parameters for PacBio data. Bionano optical data were first filtered with static label SNR 3.2, and the molecular quality report was done with ‘minlen 150 ‐T 1e‐7’. The de novo assembly was carried out with default arguments for medium genome size, and hybrid scaffolding was performed with ‘initial alignment P value 1e‐10, merging P value 1e‐11, final alignment P value 1e‐10’ along with ‘Resolve Conflicts’ option using IrysView (v2.5.1). To anchor hybrid scaffolds onto the chromosome, genomic DNA was extracted for the Hi‐C library from AMA3. Then, we constructed the Hi‐C library and obtained sequencing data via the Illumina Novaseq/MGI‐2000 platform. In total, 370 million paired‐end reads were generated from the libraries. Quality controlling of Hi‐C raw data was performed using Hi‐C‐Pro as former research. First, low‐quality sequences (quality scores <20), adaptor sequences and sequences shorter than 30 bp were filtered out using fastp, and then the clean paired‐end reads were mapped to the draft assembled sequence using bowtie2 (v2.3.2) (−end‐to‐end, –very‐sensitive ‐L 30) to get the unique mapped paired‐end reads. Valid interaction paired reads were identified and retained by HiC‐Pro (v2.8.1) from unique mapped paired‐end reads for further analysis. Invalid read pairs, including dangling‐end, self‐cycle, re‐ligation and dumped products, were filtered by HiC‐Pro (v2.8.1). The scaffolds were further clustered, ordered and oriented onto chromosomes by LACHESIS (https://github.com/shendurelab/, LACHESIS), CLUSTER_MIN_RE_SITES = 100, CLUSTER_MAX_LINK_DENSITY = 2.5, CLUSTER NONINFORMATIVE RATIO = 1.4, ORDER MIN N RES IN TRUNK = 60 and ORDER MIN N RES IN SHREDS = 60. Finally, placement and orientation errors exhibiting obvious discrete chromatin interaction patterns were manually adjusted. QUAST , (5.0.0.dev0) was used to align AH1 (Anhui 1) onto JP (Japan) contigs to determine genome coverage and consensus accuracy (−‐large –circos ‐m 300 ‐i 65 ‐x 1000 ‐t 4). BUSCO (v3.0.2) was run in ‘geno’ mode with the Nematoda lineage database (OrthoDB v9) to identify common orthologs in the assemblies to assess completeness. Sequence similarity analysis between AH1 (query) and JP (reference) derived genes was done by BLAST (−task megablast ‐outfmt ‘7 qacc sacc qstart qend evalue length pident qcovs’ ‐perc_identity 85 ‐max_target_seqs 50). Genome comparison was visualized by dot plot with D‐genies (http://dgenies.toulouse.inra.fr/).

Genome annotation

The genome annotation was carried out with a MAKER2 tool kit following the recommended protocol (http://gmod.org/wiki/MAKER_Tutorial). The repeat sequences were predicted by RepeatModeler and RepeatMasker (http://www.repeatmasker.org/). We chose Augustus (http://augustus.gobics.de/), SNAP, FGENE and other built‐in packages for ab inito gene prediction and included the protein, mRNA sequences obtained from early assembly together with previously generated RNA‐Seq as external evidence (https://parasite.wormbase.org/ftp.html). MAKER2 was then used to integrate the ab inito and existing evidence to revise the gene model. Subsequently, Gene Ontology, Swissprot and Interpro databases were used to conduct functional annotation.

Transcriptome analysis

Genome alignment and transcript assembly process were done by following the protocol described by Cole Trapnell et al. RNA‐Seq data were aligned to the genome using TopHat (v2.1.1) allowing two mismatches, and transcriptome assembly and gene expression analysis were generated by Cufflinks and Cuffdiff (v2.2.2.1). Putative exon skipping events were identified by GESS (v1.0) with default parameters. Total RNA was extracted using Trizol reagent following the product descriptions (https://www.thermofisher.com/order/catalog/product/15596018?SID=srch-srp-15596018). RNA qualities were assessed by Nanodrop 2000 (http://www.nanodrop.com/Productnd2000overview.aspx) and visualized in agarose gel. Upstream and downstream primers for novel transcripts and exon‐skipping events were designed based on the genomic coordinates. PCR was carried out by the following procedure: 94 °C for 2 min, followed by 37 cycles (94 °C for 30 s, 58 °C for 30s and 72 °C for 30 s) and finally 72 °C for 10 min. The PCR products were examined by 1% agarose gel electrophoresis.

Genome variation analysis

The re‐sequencing DNA‐Seq was aligned to the genome using BWA‐MEM (v0.7.15‐r1140) and SAMtools (v1.6). , Duplicates were removed by Picard (http://broadinstitute.github.io/picard/). Putative SNPs were called by freebayes (v1.1.0) with ‘‐u ‐C 5 ‐e 50 –standard‐filters –min‐coverage 10’. Then, VCFtools (v0.1.15) and custom Perl (v5.14.2) scripts were applied to gather general SNP genotyping information among different strains. The R package SNPRelate (v1.14.0) was employed to process minor allele frequency MAF, missing rate and linkage disequilibrium (LD) filtering ; the remaining SNPs were used to perform principal component analysis (PCA) and hierarchical clustering analysis. Treemix (v1.13) was used to estimate population splitting events among different provinces with at least more than two strains to avoid bias using ‘‐noss ‐bootstrap’, and the migration events were inferred by ‘‐m’ with increased numbers. Duite was used to compute the D‐statistics based on the ABBA‐BABA test (https://github.com/millanek/Dsuite). Plink (v1.90b6.1) and R package qqman (v0.1.4) were then used to conduct geographically associated SNP identification and Manhattan plot with minimum threshold 1.6e‐9 (0.05/(SNP numbers*4)). The annotation of SNPs was performed using ANNOVAR (v2018Apr16) and enriched gene functions were analyzed by agriGO (http://systemsbiology.cau.edu.cn/agriGOv2/). The temperature zones were divided based on the average temperature data from 1981 to 2010 obtained from the Open Source Geospatial Foundation (https://www.osgeo.cn/). Environmental factors were downloaded from WorldClime (https://www.worldclim.org/) and association studies were conducted by SPSS using Pearson correlation and Duncan's test.

Experimental validation of genomic variations

Two SNP sites of all 181 samples were validated by the pyro‐sequencing technique using PyroMark Q96 ID. The two pairs of amplification primers and one pair of sequencing primers were designed by PyroMark Assay Design Software 2.0, and one pair of the amplification primers was assigned with biotin label (Table S11). PCR reactions were performed in a total volume of 50 μL containing 2 μL of genomic DNA, 25 μL of mix, 2 μL of forward primer, 2 μL of reverse primer and 19 μL of double‐distilled H2O (ddH2O). PCR amplifications were performed with an initial denaturation for 3 min at 98 °C, followed by 45 cycles of denaturation for 10 s at 98 °C, primer annealing for 30 s at 58 °C and extension for 30 s at 72 °C, followed by a final extension for 5 min at 72 °C. Amplicons were analyzed by gel electrophoresis on a 2% agarose gel and visualized by ultraviolet transillumination. After gel electrophoresis, 10 μL of the residual biotinylated PCR product was immobilized onto 3 μL of streptavidin sepharose beads (GE Healthcare, USA) in 40 μL of binding buffer mixed with 27 μL of ddH2O under 15 min of shaking at 1400 rpm. Single‐stranded DNA was prepared with a PyroMark Q96 Vacuum Workstation (Qiagen, USA). PCR products were washed with 100 mL of 70% ethanol, denatured with 90 mL of denaturation buffer and re‐washed with 110 mL of wash buffer for 5 s each. The beads were released into a 96‐well plate prefilled with sequencing primer in 40 μL of annealing buffer per well. For primer annealing, the 96‐well plate was heated at 80 °C for 2 min and thereafter cooled to room temperature for use. Pyrosequencing assays were performed for all the study samples on a PyroMark Q96 ID system using PyroMark Gold Q96 Reagents (Qiagen). PyroMark Q96 ID was implemented with AQ (allele quantification) mode, and assays were created according to the manufacturer's instructions.

Geographic location prediction model

We determined the latitude and longitude of each sample based on the city or county where the sample was collected, up to two decimal points. One‐unit difference in the second decimal place of latitude or longitude is roughly 1.1 km in distance, therefore two decimal points are sufficiently accurate for our purposes. We then ran a random parameter search in scitkit‐learn (v0.19.1) for XGBoost (v0.7). XGBoost is a gradient boosting method that attempts to build an ensemble of decision trees for prediction. It was chosen because it has been ranked as the top performing model in many Kaggle (kaggle.com) data challenges, and it also gave us the best results in our internal comparison (data not shown) of deep learning networks and automated machine learning framework like h2o (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Input to the model was counts of alternative alleles of the SNPs used in our principal component analysis. All loci were biallelic and therefore possible values were 0, 1 and 2 at each locus. The best set of hyper parameters was chosen from 500 rounds of randomized search, ranked by mean squared error of 5‐fold cross‐validation. Similarly, the predictions were generated by 5‐fold cross validation, i.e. data were randomly split into four partitions. For each partition, a new model was trained based on the other four partitions, and predictions were made on the untrained partition. Randomization seeds for random search and cross‐validation prediction were different to avoid bias caused by random search. The latitude and longitude were modeled separately. Due to the small sample size, we subsequently used bootstrapping to estimate the distribution of errors, i.e. the haversine distance between true and predicted locations. All samples were chosen at random, with replacement, and then were subject to random search and 5‐fold cross‐validation prediction. Bootstrapping can also be used to estimate the confidence intervals of a single‐point predictions.

RESULTS

High‐quality long‐read based assembly of the B. xylophilus genome

We leveraged SMRT DNA sequencing technology and sequenced genomic DNA from an inbred nematode strain (AH1) at 123X coverage (Table S1). In total, we obtained about 1 million subreads after quality filtering. These subreads have a mean length of 10.1 kb and a N50 length of 13.8 kb, where sequences are longer than the N50 composite half of the total sequence lengths. We also generated 46X short reads DNA‐Seq on the Illumina un‐stranded 100 bp paired‐end platform (Table S1) for genome polishing. Minimap and miniasm were used to perform de novo genome assembly on the long reads. Then five rounds of Racon polishing using long reads and four rounds of short reads polishing were performed to correct sequencing errors. In total, 5395 single nucleotide variation and 102 719 inserstion, deletion (INDELs) are corrected by short reads (Table S2). The polished assembly was further compared against the NCBI representative microbe genomes database using BLAST to remove contigs with possible microbe contaminations. As a result, 14 contigs (27 Mb) were discarded (Table S3) while 129 contigs were kept in the AH1 draft assembly with N50 at 5.8 Mb. Subsequently, the Bionano Irys‐chip (537X) and Hi‐C were used to scaffold those assembled contigs and six chromosomes were obtained after hybrid scaffolding with N50 at 12 Mb, size totaling 77.1 Mb (Figs 1(a) and S1).

Figure 1

Summary of genome assembly and annotation results between AH1 and Japan (JP). (a) De novo BioNano assembly and chromosome contact matrix of AH1 assembly. Scaffold number is on top left corner. BioNano scaffolding assembly is in green, while Pacbio based contigs are in blue. (b) Dot‐plot of AH1 assembly against JP. (c) Gene length distribution between the two assemblies. (d) Gene ontology enrichment of Bursaphelenchus xylophilus novel genes on AH1 (coverage <50%, identity >85%). The AH1 assembly is considerably more contiguous than the existing assembly (JP) assembled from short reads based on N50 and sequence length (Table 1). QUAST , was used to compare AH1 against JP and it was found that 79.14% of JP can be covered by AH1 and the consensus accuracy of the aligned bases reaches 99.04% on average. Mapping by minimap2 showed that 79.6% and 77.3% subreads can be aligned to AH1 and JP contigs, respectively, and there are 1.37 alignments per aligned read on AH1 and 3.16 on JP. BUSCO analysis showed that 78.4% (770/982) of near‐universal single‐copy orthologs in Nematoda can be found in AH1 compared with 74.0% (727/982) in JP assembly (Table S4). Over 76.4% of the assembly showed high similarities with JP assembly by genome dot plot (Fig. 1(b)). The above assessments unanimously suggest that the AH1 assembly is more complete, contiguous and contains fewer mis‐assemblies than the JP assembly.

Table 1

Comparison between the AH1 and Japan (JP) de novo assemblies on Bursaphelenchus xylophilus

Strain	Ka4C1* (JP)	AMA3 (AH1)
Accession	GCA_000231135.1	N/A
Sequencing platform	Illumina	PacBio SMRT
Scaffolding platform	N/A	BioNano Irys Chip Hi‐C
PacBio long‐read mapping rate	77.26%	81.16%
Size	74 561 461	77 273 647
Number of contigs ^a	10 432	129
Longest contigs ^a	200 348	11 860 753
Contig N50 ^a	18 150	5 825 809
Median contig length ^a	2459	33 909
Number of CHR	N/A	6
Longest CHR	N/A	12 450 865
CHR N50	N/A	12 382 933
Consensus accuracy ^b	N/A	99.16%
Complete BUSCOs	727 (74.03%)	770 (78.41%)
Alignments per long read	3.16	1.37
Exons	80 655	79 734
Genes	12 568	12 197
Gene density (genes/Mb)	150.33	158.19
Average gene size	1451.13	1453.79
Average exon size	225.12	222.39

Contigs downloaded from NCBI; scaffolds downloaded from WormBase.

Measured against JP.

Comparison between the AH1 and Japan (JP) de novo assemblies on Bursaphelenchus xylophilus BioNano Irys Chip Hi‐C Contigs downloaded from NCBI; scaffolds downloaded from WormBase. Measured against JP. The RNA‐Seq data of the AMA3 inbred line as well as mRNA and protein sequences of B. xylophilus deposited in WormBase were included to perform genome annotation using MAKER2 (v2.31). , The AH1 genome contains 12 197 genes and 79 734 exons. The average lengths of genes and exons are ~1453 bp and ~222 bp, respectively. At the same time, JP assembly contains 12 568 genes and 80 655 exons. The average lengths of genes and exons are ~1451 and ~225 bp, respectively. The overall view of gene length distribution also suggested that the AH1 assembly increases the gene length, especially for genes <5 kb (Fig. 1(c)). Functional analysis showed that ~8826 genes (70%) in the AH1 assembly could be assigned with putative functions. The novel gene candidates showing less than 85% identity and 50% coverage compared with JP are mainly enriched in functions, such as single organism process, signal transduction, catalytic activity and oxidoreductase activity (Fig. 1(d)).

Functional annotation of the AH1 genome reveals novel adaptation‐important gene families

We further characterized several key proteins associated with the pathogenicity and host‐adaptation of B. xylophilus. Glycoside hydrolase (GH) cellulase families are known to play important roles in breaking down host cell walls during nematode infection. Here, we found 19 GH families in the AH1 assembly while six of them can be detected in the JP assembly (Table S5). Among the new GH families, GH30 was missed by JP due to the use of a now obsolete protein database a decade ago. The remaining families including GH31, GH37 (bx1.10725‐RA), GH18 and GH29, to name a few, are encoded by novel genes identified only in the AH1 assembly (Table S5). Additionally, we found that cytochrome P450 (CYP), glutathione S‐transferase, GH, and pectate lyase genes are multicopy genes with the help of chromosome level assembly.

Transcriptome characterization finds novel transcripts in the AH1 genome

To further evaluate the accuracy and quality of the AH1 assembly on a transcript level, we aligned the same RNA‐Seq data used in genome annotation to both AH1 and JP. The mapping rate of the RNA‐Seq data on AH1 is 88%, which is 11% higher than that of the JP assembly (77%). Among the unmapped reads (23%) from the JP assembly, half can be mapped back to the AH1 assembly. The AH1‐derived transcriptome contains 13 368 genes and 15 425 transcripts with ~30 Mb, while the JP transcriptome contains 17 658 genes and 19 481 transcripts with ~27 Mb. The overall transcript lengths are significantly longer compared with transcripts from the JP assembly (Fig. 2(a)). Out of the 19 481 transcripts in JP, 85% can also be found in AH1 with >85% coverage while 549 (no hits in JP) transcripts are considered as novel transcripts (Fig. 2(b)). Then 10 randomly selected novel transcripts were amplified via PCR to prove the validity of transcriptome assembly (Fig. 2(c) and Table S6). Furthermore, 93 exon‐skipping events are identified in AH1, and six of those chosen for validation were supported by PCR amplification (Fig. 2(d) and Table S7). The transcriptomic analysis again indicated that the AH1 assembly could be more complete and contiguous than the JP assembly on both the gene and transcript level. Functional annotation of all newly identified transcripts including GH, G‐protein coupled receptor (GPCR) and cytochromeP450 (CYP450) suggests that they are mainly involved in metabolic process, catalytic activity and membrane formation.

Figure 2

Evaluations of AH1 and Japan (JP) assembly based on the same RNA‐Seq data. (a) Transcript length distributions generated from RNA‐Seq data using AH1 and JP assemblies. (b) RNA‐Seq based sequence similarity analysis between AH1 (query) and JP (reference) transcripts using BLAST. x axis indicates query sequence coverage rate. (c) PCR validation of 10 novel transcripts identified in the AH1 assembly only. Lane M, DNA 1000 marker; lanes 1–10 indicate 10 novel transcripts, see Table S6 for details. (d) PCR validation of six exon‐skipping events identified in the AH1 assembly. Lane M, DNA 1000 marker; lanes 1–6 indicate six exon‐skipping events, see Table S7 for details.

Genotyping analysis shows geographically bounded subpopulations of B. xylophilus in China

The genomes of 180 strains from 16 provinces in China together with one alien strain from the USA were sequenced at average 102X coverage (Tables S1 and S8). The average mapping rate on AH1 is 92% (Table S9). Overall, there are 7 807 717 unique SNPs among all 181 nematode strains, 7 395 877 are biallelic SNPs and the rest of the SNPs (5.3%) have more than one alternate allele. The allele frequencies mostly range from 0 to 0.2 (Fig. 3(a)). Up to 24% of the SNPs are located within the exonic regions, 27% in intronic, and 49% in intergenic and flanking regions (Fig. 3(b)). Moreover, 27% of the exonic SNPs are predicted as nonsynonymous SNPs, which might lead to amino acid changes. The median SNP counts, missing SNPs (SNPs absent in one strain but can be found in other strains), homozygous, transitions and transversions are 288 883, 422 622, 137 704, 149 574, and 138 541, respectively (Fig. 3(c) and Table S10). Meanwhile, six genotypes (A to G, C to G, C to T, G to A, G to C, T to C) showed significantly higher counts than other genotypes (Fig. 3(d) and Table S10). Last but not least, 4.8% of the SNPs are found in novel gene candidates according to functional annotation. The enriched gene functions affected by nonsynonymous SNPs are signaling, regulation of biological process and response to stimulus.

Figure 3

Overview of all SNPs found in 181 Bursaphelenchus xylophilus strains. (a) Distribution of minor allele frequencies. (b) Locations of all SNPs in genes. (c) Homozygosity, missing SNP, SNP count, transition and transversion distributions of the SNPs found in 181 strains. (d) Box plots of SNP genotypes in 181 strains. (e) SNP counts distribution among 181 strains. Strains with the highest SNPs counts come from the USA (3582068), followed by one strain from Sichuan (SC) and 18 strains from Guangdong (GD) province (Fig. 3(e)). Similar rankings can be found if the strains are sorted based on homozygous numbers, transitions and transversions. (Tables 2 and S10). Obviously, most of the strains from SC, GD and the USA exhibit different genome variations compared with other strains. Since B. xylophilus is a native species in North America but a foreign species in China, the strains from China may have lost some genetic diversity due to multiple cross‐infection among different areas.

Table 2

General statistics of identified Bursaphelenchus xylophilus SNPs from different localities

Localities ^a	Max. SNPs	Min. SNPs	Mean # of SNPs	Mean # of homozygotes	Mean # of missing SNPs	Mean # of private SNPs	Mean # of transitions	Mean # of transversions
All	3 582 068	3207	455 754	266 880	542 777	20 852	254 579	205 283
AH	366 651	60 374	235 453	132 461	514 392	168	123 758	114 527
CQ	721 468	80 982	310 675	99 959	254 007	285	169 054	146 460
FJ	482 391	45 248	186 645	101 044	538 885	353	99 889	89 740
GD	1 635 962	78 245	1 079 572	748 667	558 571	2748	619 309	465 212
GX	1 057 731	67 137	562 434	40 250	526 319	512	313 534	252 513
GZ	173 487	53 962	105 157	53 743	575 461	97	58 272	47 602
HB	457 945	123 881	266 438	137 711	474 623	418	140 753	130 031
HEN	426 875	264 140	363 367	215 066	315 153	383	190 737	180 544
HN	1 104 349	59 855	435 275	274 241	553 499	440	240 784	197 461
JS	342 897	29 015	166 309	117 667	691 097	211	86 225	81 611
JX	374 490	61 856	175 007	112 946	583 442	112	93 274	83 121
SC	3 101 663	71 990	558 622	385 098	684 350	189 559	319 074	247 130
SD	443 330	50 149	224 630	77 558	625 420	293	116 049	113 535
SX	393 143	43 939	277 210	148 247	355 590	249	148 906	131 058
USA	3 582 068	3 582 068	3 582 068	1 095 820	249 350	1 290 757	2 106 722	1 527 444
YN	292 635	292 635	292 635	213 359	363 485	32	155 039	140 586
ZJ	1 224 018	71 775	417 282	190 182	545 389	3140	231 354	189 767

The acronyms for different provinces are used here to represent the B. xylophilus sampling areas, e.g, AH for Anhui, CQ for Chongqing, FJ for Fujian, GD for Guangdong, GX for Guangxi, GZ for Guizhou, HB for Hubei, HEN for Henan, HN for Hunan, JS for Jiangsu, SC for Sichuan, SD for Shandong, SX for Shaanxi, USA for the United States, YN for Yunnan, ZJ for Zhejiang.

General statistics of identified Bursaphelenchus xylophilus SNPs from different localities The acronyms for different provinces are used here to represent the B. xylophilus sampling areas, e.g, AH for Anhui, CQ for Chongqing, FJ for Fujian, GD for Guangdong, GX for Guangxi, GZ for Guizhou, HB for Hubei, HEN for Henan, HN for Hunan, JS for Jiangsu, SC for Sichuan, SD for Shandong, SX for Shaanxi, USA for the United States, YN for Yunnan, ZJ for Zhejiang. The SNPs were filtered by LD, MAF and missing SNP rate before downstream analyses (Methods) as described in the standard Genome‐wide Association Study (GWAS). To illustrate the population substructures that may exist among all sequenced samples, 3137 candidate SNPs were used as the input data to perform PCA and population splitting event detection. The PCA results indicate that the 181 strains can be divided into four major populations, which are in accordance with their sampling locations differed on latitudes (Figs 4(a),(b) and S2). The Pop1 contains 23 strains collected from GD province and another five strains from the USA (one), Zhejiang (ZJ) (two) and Hunan (HN) (two), so Pop1 mostly consists of GD strains. It is worth noticing that the only alien strain (USA) is placed in Pop1, which suggests the B. xylophilus in Guangdong province might originate from the USA and further spread within GD province. The Pop2 includes eight strains from SC and two strains from Fujian (FJ) and Yunnan, respectively. The Pop3 includes nine strains from ZJ (five), SC (two), GD (one) and Guangxi (one). The Pop4 is mixed with 134 strains from 15 provinces (Fig. 4(b),(c)). It is clear that the B. xylophilus population in China had undergone severe cross‐infection, but significant genetic variations can still be found to correlate with geography. In addition, the number of synonymous and nonsynonymous substitutions also varied among different populations (Fig. S3). It is common sense that latitude is one of the main factors affecting temperature while province‐based population splitting analysis provided insights into potential correlation between nematode populations and temperature zones (divided based on the average annual temperature data from 1981 to 2010). Most strains from JS, GZ, HB, HEN, AH and SD (Pop4) share similar SNP allele frequencies and they all belong to the 12 °C to 16 °C zone. Strains from CQ, FJ, SC, SX, JX, ZJ and HN (Pop2,3,4) form another group within the 16 to 20 °C zone. Strains from GD (Pop1) are independently assigned within 20–24 °C(Fig. 4(a),(d)). Detailed statistical analysis also revealed that the population clustering result is highly correlated with temperature and precipitation (Tables S11 and S12). Since we observed a small portion of geographically distant strains classified in the same cluster, we performed introgression analysis and found three spread routes in China: 1, GD to HN; 2, GD to SC; 3, HEN to ZJ (Fig. 4(a),(e)). The ABBA‐BABA test result also supports the above introgressions among different strains (Fig. S4). This indicates that GD could be one spread center in southern China but no migration routes can be found in other areas due to severe cross‐infections.

Figure 4

Population structures of Bursaphelenchus xylophilus in China. (a) Geographical locations (green dots) of B. xylophilus strains collected in China. Different temperature zones are highlighted with gradient colors and the arrows represent possible migration events. (b) PCA results of 181 strains based on 3137 SNP markers. (c) Hierarchical clustering results of all 181 strains. (d) Population splitting tree of 181 strains based on sampling provinces. Samples from the same temperature zones are shown in circles of different colors. (e) Introgression analysis revealed possible B. xylophilus migration routes in China.

SNPs highly associated with geographic origins

Previous analysis illustrated the current population structures of B. xylophilus in China. It would be beneficial to identify the geographically associated SNPs and their functions. The genome‐wide association results show that 230 (Pop1), 60 (Pop2), 7 (Pop3) and 239 (Pop4) SNPs are significantly associated with their subpopulations after post filtering, respectively (Fig. 5(a)). Moreover, four of them are found on the novel genes from AH1 assembly. In addition, most of the SNPs are located within the intron and intergenic regions (66%) while more than half of the exonic SNPs are identified as nonsynonymous SNPs, which will lead to the change of protein sequences. The highly associated SNPs (nonsynonymous) are found on 7TM GPCR families, which may contribute to the notable correlation between temperature and B. xylophilus subpopulation structure (Fig. 5(a)). , Other proteins, like the GH family, may also have important roles in regulating the adaption and growth of B. xylophilus. GO enrichment analysis also suggested those identified SNPs were involved in protein binding (HSP involved) and G‐protein coupled receptor activity (Fig. 5(b)). To validate the accuracy of the SNP genotyping results based on the new AH1 assembly, we employed pyro‐sequencing to quantify two geographically associated SNPs for all 181 strains (Table S13). The SNP (A to G) at CHR05 shares the same alternate allele G among Pop1 strains while other populations possess reference allele A. The other SNP (C to T) at CHR03 shares the reference allele C among Pop4 while other populations show alternate allele T (Figs 5(c),(d) and S5). The pyro‐sequencing results indicate that genotypes in all 181 samples are 100% identical with the next‐generation sequencing calling results.

Figure 5

Genome‐wide identification of geographical associated SNPs and experimental validations. (a) Manhattan plot for SNPs highly associated with different population structures. Blue lines indicate genome‐wide suggestive line = −log10(1e‐05); red lines indicate genome‐wide significance lines = −log10(1.6e‐09); green boxes indicate SNPs on GPCR genes. (b) Hierarchical tree graphs of over‐represented GO terms for genes affected by geographical associated SNPs. Boxes in the graphs represent GO terms and significant terms (adjusted P value ≤0.05) are highlighted. (c) Screenshots for pyro‐sequencing validation of candidate SNPs (Contig007:3426052) in Pop1. (d) Screenshots for pyro‐sequencing validation of candidate SNPs (Contig001:790035) in Pop4.

Epidemic tracking by machine learning method

In practice, it would be useful to track the epidemic source for unknown strains in order to set up specific quarantine measures. However, the aforementioned model cannot directly predict locations. Since location information will be valuable for effective local control when newly infected areas are detected, we used machine learning to predict the longitude and latitude of the strains in China using SNPs associated with population structure (Methods). The real‐world modeling performance was estimated via 240 iterations of bootstrapping. Haversine distance (distance on sphere) was calculated by comparing true longitude and latitude with predicted values. The distribution of true vs predicted distances (Fig. 6) was skewed with a long tail. Ninety percent predictions are within 727.3 km of true locations, with a mean distance of 293.0 km and a median of 143.3 km. Looking at the worst 5% predictions, i.e. predictions with the largest distances, we found that 10 samples account for 39.7% of all samples seen in the worst 5% predictions. Detailed examination revealed that the 10 samples had distinct genotypes compared to samples collected in nearby locations (Fig. S6). This might be a result of artificial migration of distant populations or biased representation of the underlying region. Nonetheless, it would be necessary to collect more samples from the aforementioned locations to validate our hypothesis. Predictions were also made for the original data without bootstrapping. The distance between true and predicted coordinates ranged from 28.5 to 1137.5 km, with a mean of 434.7 km and a median of 365.6 km.

Figure 6

Distribution of distances of true vs predicted coordinates in 240 iterations of bootstrapping.

DISCUSSION

In this study, we generated the first long‐read based, chromosome‐level genome assembly for B. xylophilus. The AH1 assembly is much more continuous than the previous JP assembly. The AH1 assembly has the longest chromosome at 13 Mb, which is 60 times longer than the JP assembly (0.2 Mb), and the median contig lengths are 14 times longer than JP. In addition, the functional annotation and RNA‐Seq data analysis proved that more novel genes and isoforms could be identified with the new assembly, which should help researchers to capture the complete landscape of the B. xylophilus transcriptome. The newly discovered GH, GPCR and CYP families and differentially expressed genes with potentially adaptation‐critical functions will provide new insights for future pathogenicity‐related studies. The new assembly also generates a subset of AH1‐specific SNPs, which serves as the key tool to elucidate current B. xylophilus population structures in China. The first report on B. xylophilus genome variations found ~3 million SNPs based on JP assembly and concluded that the high level of genetic variations can be explained by a founder effect. In our analysis, the number of SNPs is doubled to over 7 million using AH1 from 181 B. xylophilus strains. Obviously, the increasing sample sizes provide us with more genetic variation resources for downstream subpopulation detection. Our SNP profiling result indicates notable SNP variations among B. xylophilus strains in China and four different subpopulations could be identified in accordance with their latitudes changes. In addition, we found that GD strains may have been introduced from the USA, which led to distinct genome variation patterns compared with other strains (Fig. 4(b),(c)). The numbers of SNPs in each strain (Fig. 3(e)) also agree with the hypothesis that strains from GD might represent closer descendants of foreign ancestors. B. xylophilus is known as the local species in the USA and is most likely to be brought into GD via human‐mediated activities. , Similar research using SNP markers (located on 10 effector transcripts) to assess the geographical origins grouped strains from China, the USA and Japan into one independent cluster. With limited sample sizes, the author did not give further classification within each country but proved that each geographic cluster might carry unique SNPs. Apart from the prediction results, we have nongenetic evidence to support part of the clustering results. In 2015, some infected pine wood was intercepted by the Yunnan (YN) government and the dealer confessed that it was illegally purchased from Sichuan (SC). This pine wood (YN01) was, and the clustering results perfectly support the fact that YN01 is clustered together with SC strains. The same situation can be found in SX and JX provinces as mentioned before. Introgression analysis also captures long‐distance migration events from GD to other provinces. Consequently, the current nematode populations in China are mixed with each other by severe cross‐infection while only small portions of introgressions can be traced in the southern part of China (GD). On the other hand, population splitting analysis reveals that the genetic variations among different strains are possibly induced by temperature and precipitation while those two environmental factors normally interact with each other. Meanwhile, relevant studies usually indicate that temperature is important for the adaptation and survival of B. xylophilus. , , , This identification may suggest that the nematodes could evolve to adapt to different environments to expand their populations. The hypothesis is further confirmed by nonsynonymous SNPs (geographically associated) found on signaling‐related genes like GPCR families. , , Above all, our study suggests that the nematode could evolve to adapt to different environments and spread to more areas with the help of the global warming effect. Coincidently, B. xylophilus was first found and killed many pine trees in Liaoning province (average temperature below 10 °C) recently while Liaoning was previously considered as an unsuitable areas for B. xylophilus to survive. Additionally, our population classification results reveal finer population structure in China, adding more insights to previously underpowered work based on other molecular markers. Other similar studies using internal transcribed spacer PCR (ITS‐PCR), mitochondrial DNA, tandem repeat and the microsatellite marker method proposed that the B. xylophilus strains in China were grouped together as one major clade. , , , In addition, Mallez et al. , reported that a strong genetic variation was found in native strains (USA), low level or no sign of polymorphism in invaded area. All these findings indicate that the native B. xylophilus strains shows more obvious genetic variations than invasive strains. We believe that the abundant genome‐wide SNPs markers provide us with higher sensitivity than previous markers, which would in turn generate more insight results other than ITS, microsatellite and mitochondrial markers. However, we do not observe distinct genetic patterns in JS, and it shares similar SNP variations with many other provinces. The first observation of B. xylophilus was in JS in 1982. After that, this nematode spread rapidly to nearby areas from JS. It is quite obvious that infrastructure construction and other human activities greatly accelerate the spread speed and cross‐infection of this invasive nematode based on the annual statistics of B. xylophilus invaded counties issued by the State Forestry and Grassland Administration of China (http://english.forestry.gov.cn/). Thus, multiple introductions of B. xylophilus may homogenise genetic diversities in JS and other areas regardless of the population divisions. The major concern in controlling pine wilt disease is to block the transmission route of B. xylophilus. Consequently, we present the first method in B. xylophilus to track the epidemic origin based on genomic information. Bootstrapping analysis showed that the model can predict with about 700 km of error for 90% of samples in our study. The results are largely in line with the studies on human population where 90% of individuals could be placed within 840 km of their origin, even though a much larger sample size (n = 3192) was used. We noticed that the model tends to make more predictions near the center of all sampling locations, possibly attributable to the small sample size and the use of mean squared error for optimization. Even though the model still needs improvement for practical application in forest pest control, the accuracy will increase over time as more samples are added for training. It could still be used as a preliminary approach to track the potential spread route of B. xylophilus, as well as in conjunction with other measures to help the forestry administration to make pertinent prevention measures. It can slso be applied to any other species with sufficient re‐sequencing data to conduct such analysis.

CONCLUSION

In summary, generation of the new AH1 assembly greatly improves the genome blueprint with which genetic variations, and gene and transcript mapping in B. xylophilus can be studied. Although it is necessary to include more strains to get clustering result on a finer scale, we provide the first comprehensive catalog for SNPs of broad allele frequencies and origins in B. xylophilus and refined population structures using 181 strains from China and the USA. Above all, we need to be aware that B. xylophilus may be able to survive in areas with low temperatures beyond our early expectation and reinforcement of quarantine measures should be taken in those areas.

CONFLICT OF INTEREST

The authors declare no competing financial interests. Figure S1. Detailed hybrid scaffolding result of AH1 assembly based on Bionano optical map. Figure S2. Population splitting analysis of 181 Bursaphelenchus xylophlius strains. The color scheme is in accordance with Figure 4b. Figure S3. Number of synonymous and nonsynonymous substitutions among different populations. Figure S4. Introgression analysis results using Dsuite. The highlighted areas showed the same introgression events as identified by treemix. Figure S5. Pro‐sequencing validation result of two selected SNPs for 181 strains. (a) SNP at Contig001:790035 (C to T) and (b) SNP at Contig007:3426052 (A to G). Figure S6. SNPs among the 10 most frequently seen strains accounting for the top 5% of predictions with largest distances (error). The 10 strains are in blue, as shown on the left, followed by nearby strains in green. The labels on the right show sample name, sum of allele counts, and longitude and latitude of sampling location. While it is unclear what exactly caused abnormally large errors, one explanation is that some strains are far away from their closest neighbors, e.g. SC17, and some strains show drastically different genotypes compared with samples literally a few kilometers away, e.g. SC10. More samples, ideally uniformly distributed, will aid prediction accuracy and help decide whether some strains are recently introduced or not. Click here for additional data file. Table S1. Summary of sequencing data for AH1 Table S2. Genome polishing using short‐reads DNA‐Seq data Table S3. Bacteria contamination identified from polished assembly Table S4. BUSCO assessment results Table S5. GH family annotation status among two different assemblies Table S6. Primers design for novel transcripts Table S7. Primers design for exon‐skipping events Table S8. List of all 181 Bursaphelenchus xylophilus strains collected from different areas in China and the USA Table S9. The resequencing DNA‐Seq map rate of 181 Bursaphelenchus xylophilus strains Table S10. The detailed information of all SNPs identified from 181 Bursaphelenchus xylophilus strains Table S11. Average climatic characteristics among different populations Table S12. Pearson correlation coefficients of the principal components and climatic factors Table S13. Primers design for pyro‐sequencing validation Click here for additional data file.

57 in total

1. Cloning and characterization of a venom allergen-like protein gene cluster from the pinewood nematode Bursaphelenchus xylophilus.

Authors: Shifeng Lin; Heng Jian; Haijuan Zhao; Dan Yang; Qian Liu
Journal: Exp Parasitol Date: 2011-02 Impact factor: 2.011

2. Pathology of the Pine Wilt Disease Caused by Bursaphelenchus xylophilus.

Authors: Y Mamiya
Journal: Annu Rev Phytopathol Date: 1983-09 Impact factor: 13.078

3. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

4. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Authors: Cole Trapnell; Adam Roberts; Loyal Goff; Geo Pertea; Daehwan Kim; David R Kelley; Harold Pimentel; Steven L Salzberg; John L Rinn; Lior Pachter
Journal: Nat Protoc Date: 2012-03-01 Impact factor: 13.491

5. Development and Validation of a High-Density SNP Genotyping Array for African Oil Palm.

Authors: Qi Bin Kwong; Chee Keng Teh; Ai Ling Ong; Huey Ying Heng; Heng Leng Lee; Mohaimi Mohamed; Joel Zi-Bin Low; Sukganah Apparow; Fook Tim Chew; Sean Mayes; Harikrishna Kulaveerasingam; Martti Tammi; David Ross Appleton
Journal: Mol Plant Date: 2016-04-22 Impact factor: 13.164

6. Bayesian inference and the parametric bootstrap.

Authors: Bradley Efron
Journal: Ann Appl Stat Date: 2012-10-01 Impact factor: 2.083

7. Genomic insights into the origin of parasitism in the emerging plant pathogen Bursaphelenchus xylophilus.

Authors: Taisei Kikuchi; James A Cotton; Jonathan J Dalzell; Koichi Hasegawa; Natsumi Kanzaki; Paul McVeigh; Takuma Takanashi; Isheng J Tsai; Samuel A Assefa; Peter J A Cock; Thomas Dan Otto; Martin Hunt; Adam J Reid; Alejandro Sanchez-Flores; Kazuko Tsuchihara; Toshiro Yokoi; Mattias C Larsson; Johji Miwa; Aaron G Maule; Norio Sahashi; John T Jones; Matthew Berriman
Journal: PLoS Pathog Date: 2011-09-01 Impact factor: 6.823