Literature DB >> 35584001

Investigating plant disease outbreaks with long-read metagenomics: sensitive detection and highly resolved phylogenetic reconstruction applied to Xylella fastidiosa.

Marcela A Johnson^1,2, Haijie Liu¹, Elizabeth Bush¹, Parul Sharma^1,2, Shu Yang¹, Reza Mazloom³, Lenwood S Heath³, Mizuho Nita^1,4, Song Li¹, Boris A Vinatzer¹.

Abstract

Early disease detection is a prerequisite for enacting effective interventions for disease control. Strains of the bacterial plant pathogen Xylella fastidiosa have recurrently spread to new crops in new countries causing devastating outbreaks. So far, investigation of outbreak strains and highly resolved phylogenetic reconstruction have required whole-genome sequencing of pure bacterial cultures, which are challenging to obtain due to the fastidious nature of X. fastidiosa. Here, we show that culture-independent metagenomic sequencing, using the Oxford Nanopore Technologies MinION long-read sequencer, can sensitively and specifically detect the causative agent of Pierce's disease of grapevine, X. fastidiosa subspecies fastidiosa. Using a DNA sample from a grapevine in Virginia, USA, it was possible to obtain a metagenome-assembled genome (MAG) of sufficient quality for phylogenetic reconstruction with SNP resolution. The analysis placed the MAG in a clade with isolates from Georgia, USA, suggesting introduction of X. fastidiosa subspecies fastidiosa to Virginia from the south-eastern USA. This proof of concept study, thus, revealed that metagenomic sequencing can replace culture-dependent genome sequencing for reconstructing transmission routes of bacterial plant pathogens.

Entities: Chemical

Keywords: metagenomics; nanopore sequencing; pandemic prevention; plant biosecurity; plant disease diagnostics

Mesh：

Year: 2022 PMID： 35584001 PMCID： PMC9465077 DOI： 10.1099/mgen.0.000822

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

Metagenomic sequencing files have been submitted to the National Center for Biotechnology Information (NCBI) SRA (Sequence Read Archive) under BioProject PRJNA763353, with BioSample accession numbers SAMN21438496, SAMN21438497, SAMN21438498, SAMN21438499, SAMN21438500 and SAMN21438501. The metagenome-assembled genome was submitted to the NCBI Assembly database and assigned accession number GCA_020792795.1. Disease outbreak investigation in humans, animals and plants has tremendously benefitted from the development of new sequencing technology. For example, food-borne bacterial disease outbreaks are now routinely investigated using phylogenetics based on whole-genome sequences obtained from pure cultures. However, time is critical when an outbreak occurs so that control measures can be enacted quickly. Therefore, metagenomic sequencing, which does not require culturing and can be performed on DNA directly extracted from a host sample, would be desirable. Here, we show that it is possible to assemble an entire genome sequence of a bacterial plant pathogen, , directly from metagenomic sequences obtained with the Oxford Nanopore Technologies MinION sequencer. Since grows slowly in culture, this represents significant time saving. The obtained genome sequence quality was better than many public genome sequences available for this pathogen. Thus, it was possible to reconstruct a highly resolved phylogenetic tree and infer the likely geographical origin of the investigated strain. In conclusion, this proof of concept study represents a significant step forward in replacing culture-dependent genome-based approaches with culture-independent metagenome-based approaches when investigating pathogen transmission chains.

Introduction

Pathogens are a constant threat to humans, animals and plants. In regard to plant diseases, international trade of plant materials (plant globalization) plays a major role in the dissemination of plant pathogens to new areas, where they may come in contact with new hosts and lead to emergence of new diseases. Fast, specific and sensitive detection and identification of pathogens, and tracing their transmission routes, thus, is needed to quickly contain emerging disease outbreaks to prevent them from turning into epidemics and pandemics [1]. is a prime example of a pathogen that has spread out of its geographical areas of origin in the Americas to countries around the world through movement of plant material, which has led to the emergence of disease outbreaks on several crops in many countries and, in some cases, significant economic damage [2, 3]. For example, based on phylogenetic reconstruction and circumstantial evidence, it is likely that a single strain of subspecies was introduced to California at the end of the nineteenth century leading to the emergence of Pierce’s disease of grapevine (Vitis vinifera) that has since spread throughout California, other USA wine growing regions, and countries in Europe and Asia [4, 5]. Citrus variegated chlorosis caused by ' subspecies pauca' emerged in Brazil in the 1980s, becoming one of the main citrus diseases in that country [6]. More recently, quick olive decline caused by another strain of ' subspecies pauca' has been introduced to Italy, threatening olive production in the Mediterranean basin [7]. is easily disseminated without being detected, because it can be latently present in asymptomatic shrubs and trees [8]. Once established in a new area, it can then spread quickly over long distances by insect transmission [9]. Moreover, the pathogen takes several days to up to 2 weeks to grow in culture, making culture-dependent identification impractical [5]. When symptoms do develop, they range from reduced yield and fruit size, wilting and dieback to decline and death. Unfortunately, there is no treatment for diseases caused by X. fastidiosa, and the main recourse for outbreak management is to reduce the insect vector population, destroy infected plants to prevent further spread of the pathogen, and/or replace infected plants with relatively more tolerant cultivars [10, 11]. Methods of detection for include microscopy, immunological assays and molecular methods [12], none of which require culturing. Also, identification of to the subspecies level by quantitative PCR (qPCR) and to the strain level using multilocus sequence typing (MLST) can be performed using DNA directly extracted from plants [13, 14]. However, to reconstruct transmission chains, whole-genome sequencing is necessary to identify a sufficient number of SNPs for phylogenetic analysis. This has only been possible from pure cultures [5]. Currently, more than 80 assembled genomes representing all subspecies are available in the National Center for Biotechnology Information (NCBI) Assembly database with genome sizes ranging from 2.4 to 2.7 Mb, a G+C content of 51–52 mol%, number of contigs ranging from 1 to over 400, ambiguous bases from 0 to over 300, and completeness generally over 99 %. There are approximately another 50 unassembled genomes in the NCBI Sequence Read Archive (SRA) database. Because of the slow growth of in culture, a culture-independent method would be desirable to more quickly investigate outbreaks. This is true not only for outbreaks caused by , but also for any other plant pathogen, in particular, for those that cannot be grown in pure culture at all, such as phytoplasmas [15]. Recently, culture-independent metagenomic sequencing (sequencing all the micro-organisms present in a sample simultaneously) has been explored for fast detection and identification of viral, fungal and bacterial plant pathogens [16], including [17, 18]. The Oxford Nanopore Technologies (ONT) MinION sequencer has attracted particular attention, because it requires minimal laboratory infrastructure for library preparation and sequencing, is of relatively low cost, and provides results almost in real-time [19]. Moreover, the long reads generated by the ONT MinION facilitate assembly of single-contig metagenome-assembled genomes (MAGs) as was shown, for example, for human stool samples [20]. We recently demonstrated the feasibility of metagenomic sequencing with the MinION for detection and identification of two tomato pathogens, pv. tomato (Pto) and [21]. Also, in this case, the long reads generated by MinION made it possible to obtain MAGs, which allowed identification to near strain-level resolution. However, it was not possible to reach sufficient resolution to identify transmission chains. This was probably due to the error rate of nanopore sequencing at the time, which is in a large part due to the challenge in basecalling, the process of translating the changes in electrical signal measured at each individual nanopore to a sequence of nucleotides [22]. This is in line with published MAGs of human pathogens obtained by nanopore sequencing, which have been used, for example, to characterize antibiotic resistance [23], but nanopore sequencing has only very recently been shown to provide SNP resolution suitable for phylogenetic reconstruction for outbreak investigations [24]. With respect to , identification has been limited to the subspecies level when employing the MinION for metagenomic sequencing [17]. Because of its high impact and the difficulty of growing it in culture, here, we used for a proof of concept study. We show how metagenomic sequencing using MinION in combination with improved basecalling, the genome-based identification platform LINbase [25] and phylogenetic reconstruction can go beyond sensitive and specific detection and allow phylogenetic reconstruction at SNP resolution. We describe how this approach could be integrated into a worldwide platform for easy and quick sharing of genomic and metagenomic sequencing results for early plant disease outbreak detection for fast and effective intervention to help prevent plant disease epidemics and pandemics.

Methods

Petiole collection and DNA extraction

Five symptomatic grapevine (V. vinifera) samples and one healthy negative control sample (see Table 1 for metadata) were collected in late summer or early fall (August or September). Each symptomatic sample consisted of at least six leaves (including petioles) with marginal leaf necrosis harvested from different portions of the canopy of a single vine. The negative control sample (sample 11) was collected from a geographical area where Pierce’s disease had not been reported (Winchester, VA, USA). Samples were placed in a plastic bag, sealed and transported to the laboratory in a cooler containing ice, then shipped overnight to Virginia Tech’s Plant Disease Clinic (Blacksburg, VA, USA). Six petioles were randomly selected, bundled and cut into pieces (~4 mm length), and 100 mg was transferred to a 2 ml screw-cap tube, which contained garnet matrix A (~180 mg) and a 6.35 mm ceramic sphere (MP BioMedicals). Tubes were frozen in liquid nitrogen and homogenized at a speed of 5 m s−1 for 60 s in a FastPrep-24 (MP BioMedicals). The Isolate II plant DNA kit (Bioline USA) was used to extract DNA according to the manufacturer's instructions, using the supplied CTAB lysis buffer. DNA was stored at −80 °C until use.

Table 1.

Sample metadata, qPCR results and summary of sequencing results

Characteristic	11	368	669	228	255	382
Symptoms	No	Yes	Yes	Yes	Yes	Yes
Grape cultivar	Chardonnay	Petit Verdot	Cabernet Franc	Viognier	Vidal Blanc	Cabernet Franc
Location of collection	Winchester, VA	Middleburg, VA	Smithfield, VA	Charlottesville, VA	Fairlawn, VA	Faber, VA
Year of collection	2020	2013	2020	2013	2013	2013
DNA concentration (ng μl⁻¹)	13.9	22.7	24.7	22.8	23.9	28.7
DNA quality (260/280)	1.56	1.74	1.74	1.72	1.69	1.73
qPCR C _t [grape (RS1)]	20.68	17.3	19.7	18.86	19.06	16.99
qPCR C _t [Xf (rimM)]	Undetermined	17.94	20.4	34.03	35.87	18.73
Samples run per flow cell	2	2	1*	3	3	3
Total number of reads	489 136	55 695	1 271 428	619 578	563 451	232 722
Total number of reads (ha1)	518 818	59 693	1 305 206	735 768	649 927	271 524
Xf reads (wimp)	0	640	26 643	17	8	2245
Xf percentage (wimp)	0	1.208	2.327	0.003	0.002	1.024
Xf reads (wimp ha1)	0	801	29 878	11	13	2906
Xf percentage (wimp ha1)	0	1.342	2.289	0.001	0.002	1.070
Total length of reads (Mb)	399.41	96.29	3962.02	1129.76	1247.70	520.69
Mean read length (bp)	840	1728	3116	1823	2214	2237
Longest read length (bp)	30 160	43 841	264 283	364 246	502 142	160 166

Xf, X. fastidiosa.

*After high-accuracy basecalling.

Sample metadata, qPCR results and summary of sequencing results Characteristic 11 368 669 228 255 382 Symptoms No Yes Yes Yes Yes Yes Grape cultivar Chardonnay Petit Verdot Cabernet Franc Viognier Vidal Blanc Cabernet Franc Location of collection Winchester, VA Middleburg, VA Smithfield, VA Charlottesville, VA Fairlawn, VA Faber, VA Year of collection 2020 2013 2020 2013 2013 2013 DNA concentration (ng μl−1) 13.9 22.7 24.7 22.8 23.9 28.7 DNA quality (260/280) 1.56 1.74 1.74 1.72 1.69 1.73 qPCR C t [grape (RS1)] 20.68 17.3 19.7 18.86 19.06 16.99 qPCR C t [Xf (rimM)] Undetermined 17.94 20.4 34.03 35.87 18.73 Samples run per flow cell 2 2 1* 3 3 3 Total number of reads 489 136 55 695 1 271 428 619 578 563 451 232 722 Total number of reads (ha1) 518 818 59 693 1 305 206 735 768 649 927 271 524 Xf reads (wimp) 0 640 26 643 17 8 2245 Xf percentage (wimp) 0 1.208 2.327 0.003 0.002 1.024 Xf reads (wimp ha1) 0 801 29 878 11 13 2906 Xf percentage (wimp ha1) 0 1.342 2.289 0.001 0.002 1.070 Total length of reads (Mb) 399.41 96.29 3962.02 1129.76 1247.70 520.69 Mean read length (bp) 840 1728 3116 1823 2214 2237 Longest read length (bp) 30 160 43 841 264 283 364 246 502 142 160 166 Xf, X. fastidiosa. *After high-accuracy basecalling.

qPCR

was detected using a multiplex qPCR assay that targeted an endogenous grape gene, resveratrol synthase I mRNA (RS), and the 16S rRNA processing protein rimM gene. The use of an endogenous plant gene in a qPCR detection assay serves as an internal control in the amplification reaction, preventing false negatives that may occur due to operator error or inhibitors present in DNA extractions. The RS primers and TaqMan probe [26] and the rimM primers and TaqMan probe [27] were constructed with the JOE and FAM reporters, respectively, with Black Hole Quencher-1 (Integrated DNA Technologies) for use with the StepOnePlus (Life Technologies) real-time PCR system. Final reaction components consisted of 1× SensiFAST Probe Hi-ROX reaction (Bioline), 2 µl template DNA, 0.9 µM for each primer, 0.25 µM for each probe, and nuclease-free water to achieve a final volume of 20 µl. Final cycling conditions for the multiplex qPCR were: 5 min at 95 °C, followed by 40 cycles of 10 s at 95 °C and 30 s at 60 °C. Cycle threshold (C t) values were applied automatically by the StepOne software v. 2.1 in all qPCR assays. Sensitivity results are reported in Fig. S1 (available with the online version of this article).

Nanopore sequencing

Sequencing library preparation and sequencing

DNA quantity and quality were checked using the NanoDrop ONE (Thermo Scientific). Sequencing libraries were prepared following ONT protocols. First, DNA fragments were repaired using FFPE repair mix (New England Biolabs), the repaired samples were barcoded using the Native barcoding kit (ONT), the barcoded samples were mixed and ligated with the adapter, and the sequencing library was prepared using the ligation sequencing kit (ONT). The products from each of the steps were monitored by using NanoDrop ONE. The library was loaded on the MinION flow cell by using the flow cell priming kit (ONT) and was run using the MinKNOW software (version 4.2.8) for 72 h. Sample 669 was run on a single flow cell to maximize the number of reads. Samples 368 and 11 were run on a flow cell together with samples from other projects. Samples 228, 255 and 382 were run together on one flow cell in order to determine whether enough reads could be obtained when combining two to three samples on the same flow cell.

Basecalling

FAST5 files were used as input to the ONT Guppy CPU (version 5.0.7) basecalling tool with the fast-basecalling setting in MinKNOW software (version 21.02.1). Samples were re-basecalled before attempting assembly using the high-accuracy basecalling setting in ONT Guppy GPU (version 4.5.2) on the Virginia Tech high-performance computing cluster (ARC; Advanced Research Computing). All samples were basecalled with a minimum quality score of 7 and stored in fastq format files. All fastq files were deposited in the NCBI SRA with accession numbers SAMN21438496, SAMN21438497, SAMN21438498, SAMN21438499, SAMN21438500 and SAMN21438501.

Metagenomic analysis of unassembled reads

Basecalled fastq files were used as input to the cloud computing-based wimp (What’s In My Pot?) (v 2021.03.05) analysis tool (ONT) [28], through the platform epi2me for species-level identification of in all the samples. wimp uses Centrifuge [29] and NCBI’s RefSeq database to assign taxonomy to long reads rapidly. For the rarefaction analysis, reads identified as by wimp were randomly subsampled using the command fasta-subsample --seedfrom the MEME Suite (version 5.3.3) [30]. In parallel, fastq files were used as input to sourmash (version 3.5.0) [31]. Sourmash is a command-line tool, with multiple utilities, that uses hash sketches to rapidly compute sequence similarity between a query and a database of reference genomes. One of the utilities is sourmash gather, which uses a greedy algorithm to partition metagenomes into individual genomes to calculate the containment and average abundance values. Publicly available reads of 54 genomes in the NCBI SRA database were assembled in-house using the de novo assembler SPAdes [32]. These assemblies together with 80 NCBI genome assemblies were used to build a custom reference database after checking their quality with the checkM tool [33]. First, sketches of the custom reference database were created with the sourmash command sourmash compute --scaled 1000 k 31. Sketches of each nanopore fastq read file were created with the sourmash commands sourmash compute --scaled 1000 k 31 --track-abundance to obtain the average abundance values. Then, sourmash gather was used with the following command: sourmash gather -k 31 and --threshold-bp 15000 -oto obtain the values p_match, p_query and avg_abundance. The utility sourmash gather gives two values in relation to containment, p_match and p_query with p_match being the fraction and p_query being the fraction .

Metagenome assembly

After high-accuracy basecalling, reads were used as input to the de novo assembler Flye (version 2.8.3) [34] with the following command: flye --nano-raw --out-dir --genome-size 2.7 m --meta -i 5. Other polishing parameters were explored, but there were no improvements in the assemblies beyond five iterations.

Metagenomic analysis of assembled reads

The longest scaffold (consisting of two contigs) from the high-accuracy basecalled 669 assembly (named MAG 669) was used as input to the webserver LINbase [25] to identify the closest related strain and the corresponding LIN group with the function ‘Identify using a genome sequence.’ MAG 669 was aligned with the subsp. Temecula 1 genome using the ANIm function in JSpeciesWS [35] using default parameters. A total of 67 publicly available subsp. genomes along with MAG 669 were used for the reconstruction of the phylogenetic trees. All 68 genome sequences were annotated using the prokka gene annotation tool (version 1.14.6) [36] with default settings. The annotated files were used as input for Roary (version 3.13.0) [37] to obtain a core-gene alignment. The following parameters were used: -e to obtain a multiFASTA core-gene alignment file as output, -n for fast core-gene alignment with the option MAFFT and -p for multithreading. A total of 1138 genes were used to construct the core genome. The core-gene alignment file was used as input for iq-tree (version 2.0.3) [38] using automated model-selection to obtain the final maximum-likelihood-based phylogenetic tree. The final phylogenetic tree was visualized using the ggtree package in R [39]. The same set of 68 genomes were used for Split-K-mer analysis (ska; version 1.0) [40]. All sequences were first converted into split k-mer sketches using the ska fasta subcommand with default k-mer length of 15 (sequences of 15 bp flanking a middle base i.e. split-base, on both ends). The resulting split k-mer sketches were merged into a single file using the ska merge subcommand. A reference-free alignment was performed on the merged file using the ska align subcommand. As for the core-genome tree, the alignment file was used as input for iq-tree to obtain the final maximum-likelihood-based phylogenetic tree, which was visualized using the ggtree package in R [39]. SNPs between selected genomes were counted with Snippy [41].

Results

Analysis flowchart

To determine detection sensitivity and specificity of metagenomic sequencing with the ONT MinION system and test its ability to provide sequencing data sufficient for phylogenetic reconstruction, we developed the workflow depicted in Fig. 1. In short, five symptomatic grapevine samples and one healthy control sample (from a geographical area with no reported Pierce’s disease) were collected in Virginia, USA (Table 1). DNA was extracted from grapevine petioles, which are known to carry a high concentration of DNA. DNA was analysed for quantity and quality, and the presence of was determined by qPCR (Table 1). One symptomatic sample (with ID 669) was sequenced on an individual MinION flow cell to obtain as many reads as possible, while the other symptomatic samples were combined in groups of two or three to determine whether enough reads could be obtained for detection while saving on flow cells. The negative control sample (sample 11) was sequenced on the same flow cell as an unrelated healthy plant sample from another project to avoid any possible cross-contamination from infected samples. After sequencing and basecalling, the ONT tool wimp [28] and the third-party tool sourmash [31, 42] were used to determine the presence and abundance of sequences belonging to and its subspecies. Results were compared with respect to sensitivity of detection with those obtained by qPCR. Finally, reads were assembled with the aim of obtaining MAGs to perform phylogenetic reconstruction for outbreak investigation.

Fig. 1.

Workflow for culture-independent detection and identification of . (a) DNA is extracted from grapevine petioles and used for DNA sequencing library preparation and, as a control, for qPCR. (b) The prepared DNA is then sequenced with the ONT MinION system and basecalled. Basecalled reads are either used directly for detection and identification of at the species and subspecies rank, or assembled for phylogenetic reconstruction for outbreak investigation. Created with BioRender.com.

Detection sensitivity and specificity

The DNA concentration of the samples was consistent, varying only slightly between 13.9 ng μl−1 for the healthy sample and 28.7 ng μl−1 for sample 382, which is in the range of the minimum concentration of 25 ng μl−1 recommended by ONT (Table 1). Also, nucleic acid quality varied little with the 260/280 ratio (nucleic acid/protein) ranging from 1.56 to 1.74, which is slightly lower than the 1.8–2.0 recommended by ONT. Values for C t (the PCR cycle after which a PCR product becomes detectable) obtained from the qPCR tests ranged from 17 to 21 for the grape gene resveratrol synthase 1 (RS1) used as an internal control, showing that the extracted DNA for all samples was amplifiable. C t values for the gene rimM, encoding a 16S rRNA processing protein, ranged from 18 to 36 (except for sample 11, the negative control, for which no rimM product was detected), revealing a wide range of pathogen abundance in the samples with sample 669 containing the highest pathogen abundance (C t of 17.94) and sample 255 containing the lowest pathogen abundance (C t of 35.87, at the limit of detection by qPCR). The total number of reads obtained by MinION sequencing varied greatly between samples, ranging from 55 000 (sample 368) to almost 1.3 million (sample 669). Since the difference in the number of samples run per flow cell ranged only between one and three, and DNA concentration and quality were similar between samples, this was probably due to inconsistencies in sequencing library preparation (i.e. users making small changes inadvertently) and flow cell quality. Mean read length was more consistent between samples, ranging from 840 bp for the negative control, i.e. sample 11, to 3116 bp for sample 669. The longest overall read was obtained for sample 255, with a length of over 500 000 bp. Because of the differences in read number and length, the total length of the obtained metagenomes also varied greatly, from 96 Mb for sample 368 to almost 4.2 Gb) for sample 669. To answer the most relevant question of how many of the obtained sequences were of , we first used the tool wimp [28], which is provided by ONT and uses NCBI RefSeq as a reference database. The number of reads assigned to ranged from 8 in sample 255 (0.002 % of total reads) to 26 643 in sample 669 (2.327 % of total reads) as shown in Table 1 (see Table S1 for full results). Therefore, metagenomic sequencing detected even in the sample with the lowest pathogen abundance. However, since samples 228, 255 and 382 were run on the same flow cell, we cannot exclude the possibility that the small number of reads in samples 228 and 255 were due to cross-contamination from sample 382. Not a single read was misidentified as among the 489 000 total reads in the negative control sample 11, revealing a high specificity of metagenomic sequencing using MinION. However, if a pathogen that is closely related to were present in the same sample, such as pv. viticola, it is possible that some reads of this pathogen could be misidentified as . Since the detection limit of metagenomic sequencing depends on the number of reads that are obtained, we wanted to determine the minimum number of reads that are necessary for consistent detection of . Thus, we performed a rarefaction analysis on samples 669, 382 and 368 by randomly subsampling reads ten times at different total read numbers and considered the number of reads sufficient for detection when all 10 subsamples contained at least five reads identified by wimp as (Fig. 2, Table S2). The minimum read number ranged from 500 for sample 669 to approximately 1500 for samples 368 and 382. Since samples 228 and 255 only contained 17 and 8 . reads, respectively, out of approximately 500 000 total reads (Table 1), no rarefaction analysis was necessary to determine that at least 500 000 reads were necessary to detect in these samples. Therefore, the number of reads necessary to detect ranges from approximately 500 reads for heavily infected samples to approximately 500 000 reads for very mildly infected samples.

Fig. 2.

Detection sensitivity of metagenomic sequencing with the ONT MinION based on wimp [28] analysis for samples 368, 382 and 669. The x-axis shows the total number of reads that were randomly subsampled. The y-axis reports the corresponding mean number of (Xf) reads based on ten subsamples taken for each subsample size. The error bars show the standard deviation.

Resolution of identification using unassembled reads

So far, we have looked at simply detecting the presence of species . However, identification to subspecies level is important for making decisions about disease intervention to control an outbreak. Fig. 3(a) shows the results obtained by wimp for reads assigned at the subspecies rank as a percentage out of all reads identified as . Unfortunately, wimp only assigned a small number of reads to subspecies, while most reads were assigned only at the species rank to overall. Moreover, some reads in all samples were assigned to ' subspecies morus', a mulberry pathogen, which is not known to include grape as a host. Therefore, we decided to use a second taxonomic classification tool, sourmash [31], in combination with a custom reference database consisting of 134 publicly available genomes, which belong to all published subspecies (Table S2). sourmash does not analyse reads individually, but the sourmash command gather identifies the overlap between a k-mer sketch computed for the combined metagenomic sequences in a sample and k-mer sketches computed for each genome in the reference library (see Methods for details). Based on the average abundance of the set of overlapping k-mers in the metagenome, sourmash infers how many copies of each genome are present in the metagenome. Fig. 3(b) shows the results by subspecies as a percentage of subspecies genome copies out of all genome copies. For sample 669, the majority of sequences were assigned to subspecies fastidiosa (71.7%), strongly suggesting that this subspecies was the causative agent, as expected for a grapevine with symptoms of Pierce’s disease. For sample 382, sourmash only identified subspecies . For sample 368, instead, sourmash only identified subspecies . While this subspecies is not known to cause Pierce’s disease, we have determined its presence by MLST on symptomatic grapevines in Virginia in the past (data available on request). The very small number of reads in samples 228 and 255 made subspecies assignment in these samples impossible using sourmash.

Fig. 3.

Identification of at the subspecies rank using (a) wimp [28] and (b) sourmash [31, 42]. The x-axes represents the sample IDs (described in Table 1). The y-axes shows the relative abundance of subspecies as a percentage of all sequences assigned to at the species rank. Note that sample 11 was the negative control with no reads identified by either wimp or sourmash, and sourmash did not identify in samples 228 and 255. Xf, ; Xff, X. fastidiosa subsp. ; Xfm, X. fastidiosa subsp. ; Xfp, 'X. fastidiosa. subsp. pauca'; Xfs, 'X. fastidiosa subsp. sandyi'. Since sourmash gather provides abundance values for each individual genome in the reference database (Table S4), it also allowed us to infer which reference genome had the highest similarity to the genomes present in the metagenomes. For sample 669, this was genome 16M3, suggesting that the subsp. strain that caused Pierce’s disease in sample 669 (isolated in Virginia in 2020) is closely related to 16M3 (isolated in Georgia in 2016). However, similarity alone is not enough to make any conclusions about phylogenetic relationship and epidemiological linkage between strains; in other words, this result was not sufficient to conclude that the subsp. strain present in sample 669 was derived from a strain introduced into Virginia from Georgia.

Assembly of a complete genome from metagenomic reads

Since using the unassembled reads did not allow resolution of phylogenetic relationships between the sequences in our samples and public genomes, we performed assembly with the goal of obtaining MAGs that could then be precisely identified. We had originally basecalled our samples with the fast, but low accuracy, basecalling option of ONT’s Guppy basecalling tool [43]. Using Guppy’s high-accuracy setting instead slightly increased the number of reads identified as (Table 1). We then assembled the high accuracy reads de novo using the tool Flye [44]. For samples 368 and 382, the longest contig length ranged from 90 670 to 493 966, revealing that we did not have enough reads for assembly of complete MAGs. However, for sample 669, the longest contig had a length of 2 523 846 bp, approximately as long as the best publicly available subsp. genome, Temecula1, with a length of 2 521 148 bp. To confirm that this contig presented a complete MAG, we compared it with the Temecula 1 genome using the ANIm function in JSpeciesWS [35] and found the two genomes to have an average nucleotide identity (ANI) of 99.78 % and to align over 99.03 % of their length (Fig. 4a). We then determined the assembly quality of the 669 MAG compared to the Temecula 1 reference genome (Table 2). It can be seen that the 669 MAG is of only slightly lower quality with a completeness percentage of 96.61 % and 100 ambiguous bases (for a comprehensive comparison of the 669 MAG quality with genome assemblies obtained from pure cultures, see Table S3).

Fig. 4.

Table 2.

Comparison of the sequence quality of MAG 669 with the subsp. reference genome

Temecula 1 and the genome of the closely related subsp. isolate XF51 CCPM1 were both obtained from pure cultures.

Characteristic	MAG 669	Temecula 1	XF51 CCPM1
Length (Mb)	2.52	2.52	2.8
Contigs	2	2	115
Scaffolds	1	2	115
N50 (contigs) (bp)	2 522 580	2 519 802	68 803
N50 (scaffolds) (bp)	2 523 846	2 519 802	68 803
Completeness	96.61	99.64	97.76
Contamination	0.12	0	6.51
Ambiguous bases	100	0	0

Comparison of the MAG 669 sequence with other subsp. genomes. (a) Alignment of the 669 MAG with the genome of subsp. Temecula 1 using ANIm implemented in JSpeciesWS [35] (99.78 % ANI, 99.03 % nucleotides aligned) and visualized using Ribbon [50]; (b) Result page of LINbase [25] when using MAG 699 as the query with the ‘identify using a genome sequence’ function. The most similar genome to the query (the ‘best match’) is reported together with its LIN. The ANI value between the query and the best match is shown above the LIN. Below the best match, the taxa to which the query belongs are listed as well. Comparison of the sequence quality of MAG 669 with the subsp. reference genome Temecula 1 and the genome of the closely related subsp. isolate XF51 CCPM1 were both obtained from pure cultures. Characteristic MAG 669 Temecula 1 XF51 CCPM1 Length (Mb) 2.52 2.52 2.8 Contigs 2 2 115 Scaffolds 1 2 115 N50 (contigs) (bp) 2 522 580 2 519 802 68 803 N50 (scaffolds) (bp) 2 523 846 2 519 802 68 803 Completeness 96.61 99.64 97.76 Contamination 0.12 0 6.51 Ambiguous bases 100 0 0 With an assembled genome, we were able to take advantage of the genome-based classification platform LINbase [25], which assigns genomes to genera, species and within-species classes based on ANI. LINbase assigned the 669 MAG to subsp. fastidiosa, because its ANI to the most similar subsp. genome in LINbase, genome subsp. XF51 CCPM1 (see Table 2 for genome characteristics and quality), was over 99.899%, which is significantly higher than the ANI that distinguishes subsp. genomes from each other (98.5%) (Fig. 4b). subsp. XF51 CCPM1 was isolated in Georgia (USA), as was strain 16M3, which had been identified as the genome with the highest relative abundance by sourmash. This increased our suspicion that the Virginia strain was epidemiologically linked to Georgia, but was still insufficient evidence to conclude so.

Using a MAG for phylogenetic reconstruction at SNP resolution

Finally, the high quality 669 MAG made it possible to perform a phylogenetic analysis together with all other publicly available subsp. genomes, including genomes of three Georgia isolates, to further investigate the possibility that the subsp. strain in sample 669 was linked to Georgia. To do so, we reconstructed a core-genome tree based on the alignment of 1138 genes present one time in all included genomes, using the tool Roary [37], and an alignment-free whole-genome SNP tree based on a split k-mer analysis using the tool ska [40] (Fig. 5a, b). In both phylogenetic trees, MAG 669 grouped in a clade with the three Georgia strains with 100 % bootstrap support, revealing that MAG 669 shares a most recent common ancestor with strains from the south-eastern USA, but not strains from California. Also, the relatively small number of SNPs that distinguish MAG 669 from the Georgia strains XF51 and 16M3 (116 and 148, respectively) and the much higher number of SNPs that distinguish MAG 669 from the Californian Temecula 1 strain (796 SNPs) is in line with this result (Tables S4 and S5). Since the Georgia strains were isolated between 2009 and 2016, and sample 669 was collected in 2020, we hypothesize that they are epidemiologically linked and that the ancestor of subsp. in sample 669 may have been introduced to Virginia from the south-eastern USA by interstate movement of plant material or northward movement of infected subsp. sharpshooter insect vectors.

Fig. 5.

Phylogenetic trees of MAG 669, all high-quality publicly available genomes of subsp. isolates from the USA, a selection of international subsp. isolates and ' subsp. sandyi' Ann 1 as an outgroup. (a) Core-genome maximum-likelihood tree using 1138 core genes; (b) whole-genome maximum-likelihood SNP tree based on split k-mer analysis. Genomes are coloured according to isolation location. Percentage bootstrap support values >96 % are shown.

Discussion

In this work, we used the causative agent of Pierce’s disease of grapevine, subsp. , for a proof of concept study to investigate the potential of metagenomic sequencing with the ONT MinION system for plant pathogen detection and phylogenetic reconstruction for outbreak investigation. We found that this approach is comparable to the detection sensitivity of qPCR and can be used in place of whole-genome sequencing of cultured isolates in whole-genome phylogenetic tree reconstruction. This approach is similar to recent reports using metagenomics for the investigation of food-borne outbreaks caused by using Illumina sequencing [37] or nanopore sequencing [38]. Metagenomic sequencing of plant samples with the MinION sequencer or Illumina sequencing platforms had been used previously to detect viral, fungal and bacterial plant pathogens [16], including [17, 45, 46]. However, when the MinION was used previously to detect , the obtained sequencing reads were simply aligned to reference genomes to find the genome to which the highest number of reads mapped without further phylogenetic analysis [17]; identification was limited to subspecies, or sequencing was performed only after PCR of MLST loci making whole-genome-based phylogenetic analysis impossible [45]. More recently, the Illumina iSeq and MiSeq were used for detection and identification of as well [18]. Both Illumina platforms detected with a similar sensitivity as qPCR and identified sequences with subspecies resolution, even in mixed infections. However, phylogenetic reconstruction was not possible. Even the MLST loci, usually used for strain typing [47], were only partially represented in the obtained metagenomic sequences, and the short reads limited the possibility of assembling MAGs. Similar to the Illumina results [40], we observed that metagenomic sequencing with the MinION was as sensitive as qPCR in samples for which we obtained approximately 500 000 reads. Therefore, when using an entire flow cell, with which more than one million reads can be obtained, the detection sensitivity may be even higher than qPCR. Importantly, not a single read was misidentified as in our negative control sample, demonstrating the high specificity of long-read metagenomic sequencing (although we should note that based on our results there is a very low abundance of other bacteria in the grapevine petiole, reducing the risk of misidentifying other bacteria as ). Therefore, metagenomic sequencing with the MinION is an excellent approach to determine whether is present in a plant sample when it is suspected as the causative agent of an emerging disease outbreak on a new crop in an area where it was not previously present. However, because of the shorter protocols and lower cost of qPCR [estimated at $4 (£3.25) per qPCR reaction; £1=$1.23] compared to the cost of metagenomic sequencing [almost $1000 (£812.50) when using one sample per flow cell], qPCR will remain the detection method of choice when a high volume of samples needs to be processed in an area where is already known to be endemic. Sequencing multiple samples on one MinION flow cell may be an option in some situations as well, balancing cost with read number. However, the highly variable number of reads we obtained per flow cell makes it difficult to decide the number of samples to combine in a single run. One challenge when using unassembled metagenomic reads for identification is that it is difficult to conclude which subspecies is present. Here, as well as in other studies [46], metagenomic reads mapped to more than one subspecies for all samples when using wimp and for sample 669 when using sourmash. This may be due to the high rate of recombination in , which can lead to horizontal gene transfer between strains of different subspecies [48]. Therefore, reads of some genes in a metagenome may align to genomes of subspecies that are different from the subspecies present in the sequenced sample. Obtaining enough reads of sufficient length to assemble a MAG, as we did in this study for sample 669 using an entire flow cell, thus, is important for precise identification. In fact, once the 669 MAG was assembled, its identification as subsp. was highly reliable using the LINbase platform, since its ANI to the most similar subsp. genome (99.89%) was much higher than the minimal pair-wise ANI found between subsp. genomes (98.5%). Besides identifying the 669 MAG at the subspecies rank, LINbase also grouped the 669 MAG with genome subsp. XF51 CCPM1 from a Georgia isolate at an ANI threshold of 99.899 %. If additional high-quality genomes of Georgia subsp. isolates (or isolates from other states in the south-eastern USA) had been included in LINbase, it would have been possible to assign the 669 MAG to an ANI cluster (which is called a ‘LINgroup’ in LINbase) [25], corresponding to isolates from the south-eastern USA. This would have strongly suggested that the Virginia isolate was epidemiologically linked to the Georgia isolates even without phylogenetic analysis. However, the other public genome sequences of Georgia isolates were of insufficient quality to be added to LINbase (they were fragmented into more than 500 contigs, making accurate ANI computations impossible). Since genome quality is of a lower concern for phylogenetic reconstruction (sequencing errors increase the ANI of a MAG to other bacterial genomes but because sequencing errors are not shared with other genomes they do not change the phylogenetic position in a tree), it was possible to include these genomes in the alignment-based core-genome tree and in the split k-mer-based SNP tree. Both trees allowed us to assign the 669 MAG with high confidence (100 % bootstrap support) to the same clade as the three Georgia isolates, suggesting that subsp. was not introduced to Virginia on grapevine nursery stock purchased from California but from a sharpshooter insect vector moving north from the south-eastern USA or via transport of subsp. infected plant material from the south-eastern USA to Virginia. While this finding needs to be confirmed with additional subsp. samples, such information is important for disease control and prevention, since it informs growers of the high risk of future introductions of subsp. to Virginia from southern locations. More broadly, this result indicates that metagenomic sequencing with the MinION can be used to reconstruct pathogen transmission chains with accuracy similar to using genome sequences of cultured isolates, as has been the state of the art [5]. To perform such outbreak investigations as part of routine diagnostics by plant disease clinics and plant protection agencies, two steps are required: (i) bioinformatic analysis of metagenomic sequencing needs to become accessible to non-bioinformaticians; and (ii) all plant pathogen MAGs obtained by metagenomic sequencing need to be uploaded to a worldwide database of plant pathogen genomes and MAGs. If these objectives were reached, each time that a disease outbreak with symptoms indicative of a possibly emerging disease occurred, metagenomic sequencing could be performed and MAGs could be assembled and uploaded to the worldwide database where they could be immediately compared with all other genomes and MAGs in the database. If two or more MAGs were highly similar to each other, an alert could be triggered and it might be early enough to stop the outbreak from turning into an epidemic or a pandemic. To make this vision a reality in epidemic and pandemic plant disease prevention, we are currently developing user-friendly pipelines for metagenomic sequence analysis by non-bioinformaticians. Additionally, we are improving and expanding the LINbase web server [25] to transform it into the genomeRxiv platform, which will make sharing sequencing results as easy as submitting a manuscript to a preprint server such as bioRxiv [49]. In conclusion, here, we have demonstrated that culture-independent metagenomic sequencing permits high-quality genome assembly of bacterial plant pathogens suitable for highly resolved phylogeny-based outbreak investigation. If combined with user-friendly sequence analysis tools and a platform for quick and easy sharing of sequence information, metagenomics could be adopted by plant disease clinics in routine diagnostics. This could lead to early detection of emerging plant disease outbreaks and provide the basis for effective interventions to prevent outbreaks from turning into epidemics and pandemics. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

38 in total

1. Phytoplasmas: bacteria that manipulate plants and insects.

Authors: Saskia A Hogenhout; Kenro Oshima; El-Desouky Ammar; Shigeyuki Kakizawa; Heather N Kingdom; Shigetou Namba
Journal: Mol Plant Pathol Date: 2008-07 Impact factor: 5.663

Review 2. Xylella fastidiosa: Insights into an Emerging Plant Pathogen.

Authors: Anne Sicard; Adam R Zeilinger; Mathieu Vanhove; Tyler E Schartel; Dylan J Beal; Matthew P Daugherty; Rodrigo P P Almeida
Journal: Annu Rev Phytopathol Date: 2018-06-11 Impact factor: 13.078

3. Update of the Xylella spp. host plant database.

Authors:
Journal: EFSA J Date: 2018-09-10

4. Population genomic analysis of a bacterial plant pathogen: novel insight into the origin of Pierce's disease of grapevine in the U.S.

Authors: Leonard Nunney; Xiaoli Yuan; Robin Bromley; John Hartung; Mauricio Montero-Astúa; Lisela Moreira; Beatriz Ortiz; Richard Stouthamer
Journal: PLoS One Date: 2010-11-16 Impact factor: 3.240