A comparative study of metatranscriptomic assessment methods to characterize Microcystis blooms.

Helena L Pound¹, Eric R Gann¹, Steven W Wilhelm¹.

Abstract

Harmful algal blooms are increasing in duration and severity globally, resulting in increased research interest. The use of genetic sequencing technologies has provided a wealth of opportunity to advance knowledge, but also poses a risk to that knowledge if handled incorrectly. The vast numbers of sequence processing tools and protocols provide a method to test nearly every hypothesis, but each method has inherent strengths and weaknesses. Here, we tested six methods to classify and quantify metatranscriptomic activity from a harmful algal bloom dominated by Microcystis spp. Three online tools were evaluated (Kaiju, MG-RAST, and GhostKOALA) in addition to three local tools that included a command line BLASTx approach, recruitment of reads to individual Microcystis genomes, and recruitment to a combined Microcystis composite genome generated from sequenced isolates with complete, closed genomes. Based on the analysis of each tool presented in this study, two recommendations are made that are dependent on the hypothesis to be tested. For researchers only interested in the function and physiology of Microcystis spp., read recruitments to the composite genome, referred to as "Frankenstein's Microcystis", provided the highest total estimates of transcript expression. However, for researchers interested in the entire bloom microbiome, the online GhostKOALA annotation tool, followed by subsequent read recruitments, provided functional and taxonomic characterization, in addition to transcript expression estimates. This study highlights the critical need for careful evaluation of methods before data analysis.

Keywords: Kaiju; MG-RAST; Microcystis; RNA sequencing; bioinformatics; ghostKOALA; harmful algal bloom; metatranscriptome

Year: 2021 PMID: 35528780 PMCID: PMC9075346 DOI: 10.1002/lom3.10465

Source DB: PubMed Journal: Limnol Oceanogr Methods ISSN: 1541-5856 Impact factor: 3.162

The ecological and economic ramifications of harmful algal blooms draw significant attention from researchers and the public. Toxic algae cause major challenges for recreational and commercial fisheries as well as municipal water supplies (Bullerjahn et al. 2016). In fact, blooms of the cyanobacteria genus Microcystis have caused problems for water treatment facilities around the world, leading to water shortages that included significant recent crises in both the United States and China (Qin et al. 2010; Steffen et al. 2017). As blooms continue to increase in duration and severity, there is an associated increase in research to understand how blooms function (Harke et al. 2016; Tang et al. 2018). For many years, researchers have investigated the environmental conditions that stimulate blooms and attempted to mimic them in the laboratory (Orr and Jones 1998; Kaebernick et al. 2000; Sandrini et al. 2014; Steffen et al. 2014). Studies have relied on water quality metrics such as pH, toxin estimates, nutrient concentrations, chlorophyll a estimates, and basic microscopy until recent advances in DNA/RNA sequencing (Krüger and Eloff 1978; Reynolds et al. 1981; Seitzinger 1991; Paerl et al. 2016). The democratization of new sequencing technologies has opened novel opportunities in harmful algal bloom research. The ability to explore the genetic potential or activity of a single organism or entire community ([meta]genomics or [meta]transcriptomics) within the natural background allows scientists to investigate both novel and historical ecology and physiology (Steffen et al. 2014; Hennon and Dyhrman 2020). As sequencing technologies continue to advance, so does the volume of data they generate, and all at decreasing costs (Shakya et al. 2019). As an example, recent metatranscriptomic studies have readily reached 20 million reads per sample, as compared to the 1 million reads per sample only a few years before (Steffen et al. 2015; Stough et al. 2017). 
While sample collection, processing, and sequencing must be carefully planned and executed, a parallel challenge in sequencing lies in the analysis of the data generated. The opportunity for error in analysis can range from inaccurate estimations of gene expression to misidentifying taxonomy. Every researcher is faced with the challenge of deciding which informatics method(s) to use to test their hypotheses, while also balancing needs for efficiency and accuracy. For example, online analysis tools may be more approachable to computationally limited labs, as these web‐based tools can handle large datasets using remote servers and rarely possess a steep learning curve to implement (Kanehisa et al. 2016; Keegan et al. 2016; Menzel et al. 2016). Meanwhile, in‐house coding methods can incorporate personally curated databases and allow for multiple tools to be selected and combined into a highly specialized and project specific workflow (Moniruzzaman et al. 2017; Stough et al. 2017; Pound et al. 2020). Tools also vary in the type of input data required, as some require just the libraries of trimmed reads, while others require the libraries to be assembled into contigs. Completely analyzing sequencing data may involve some combination of these or other techniques. All have their strengths and weaknesses that must be carefully considered, as these idiosyncrasies can easily influence the data and conclusions made. The use of molecular tools to study Microcystis bloom events has exploded in recent years as early work demonstrated how polymerase chain reaction (PCR)/quantitative PCR/sequencing can be used to distinguish toxic from nontoxic populations and do so in a quantitative manner (Kurmayer and Kutzenberger 2003; Rinta‐Kanto et al. 2005). Water treatment facilities, government agencies tasked with monitoring, and private commercial groups have all engaged molecular biological approaches to study Microcystis in recent years (Chorus and Welker 2021). 
Currently, most (meta)transcriptomic‐based Microcystis research characterizes activity in the environment by recruiting reads back to a single genome. Many studies use the reference genome Microcystis aeruginosa NIES‐843 (Harke and Gobler 2013; Steffen et al. 2017; Tang et al. 2018) since it was the first to be sequenced, but other strains have also been used (Davenport et al. 2019; Morimoto et al. 2019). However, it has been demonstrated that there is a great diversity of genetic potential encoded by different Microcystis sp. strains, and in the rest of the microbiome, within a single bloom (Meyer et al. 2017; Cook et al. 2020; Pound and Wilhelm 2020). This study sought to compare multiple techniques and tools in the analyses of transcriptomic sequencing data generated from natural harmful algal bloom communities in fresh waters. Metatranscriptomic libraries were generated from a 2019 Microcystis spp.‐dominated bloom in Lake Erie, USA. A variety of techniques were used to characterize the functional activity and taxonomic composition, using both trimmed reads and assembled contigs. Techniques were evaluated for three primary metrics. The first was the number of reads recruited, either to all Microcystis genes or to specific Microcystis marker genes important to central metabolism (resolved by phylogeny). It is assumed that methods capable of detecting and estimating the natural diversity in an environmental sample will recruit more reads than methods that are unable to resolve the fine scale genetic variation present in the environment. The second was the ability for a tool to classify sequence data by function and taxonomy, independent of any recruitments. Some tools classify the reads themselves, while others classify assembled contigs, which then require reads to be recruited back to them to generate relative quantitative results. 
The third was to determine how well each method correlated with one another, which provides support that the variation between samples is reflective of the ecology of this bloom and not the methods themselves. Based on our analyses, we make two recommendations that are dependent on the hypotheses being tested. The following study details how we came to those conclusions and discusses the various strengths and weaknesses of each approach examined.

Methods

Sampling and sequencing

Samples were collected from the surface of a 2019 Microcystis spp.‐dominated bloom in Lake Erie, USA, on 21 July 2019, and incubated in 1.0 L acid‐washed polycarbonate bottles for 48 h. Whole water was then passed through a 0.2 μm Sterivex filter (Millipore) and flash frozen. RNA was extracted using an acid‐phenol‐based extraction protocol with an added DNase treatment to remove any residual DNA (Pound and Wilhelm 2020). RNA quality was checked using a NanoDrop ND‐1000 spectrophotometer (Thermo Fisher Scientific) and RNA was quantified using a Qubit RNA HS assay (Invitrogen). Extracted RNA was processed using an Illumina® Stranded Total RNA Prep, Ligation with Ribo‐Zero Plus, and 50 million 100‐bp paired‐end reads were then generated on the Illumina NovaSeq platform at Hudson Alpha Discovery Life Sciences (Huntsville, Alabama). Sequence processing and assembly are described in detail in an online protocol (Pound and Wilhelm 2021). Briefly, sequences were quality controlled and trimmed in CLC Genomics Workbench version 20.0.4 (Qiagen). Residual rRNA reads were removed in silico using SortMeRNA version 4 (Kopylova et al. 2012). The reads classified as nonribosomal by SortMeRNA from all samples were jointly assembled in MegaHit version 1.2.9 (Li et al. 2015). This was done to reduce redundancy among identical sequences in various samples, as done previously (Pound et al. 2020). Trimmed nonribosomal reads have been uploaded to and are available on MG‐RAST (Keegan et al. 2016) under project name “LE2019MT” and raw reads are available on the NCBI SRA database under BioProject number PRJNA737197.

Background on approaches

We explored three widely used online platforms for sequencing analysis: Kaiju (Menzel et al. 2016), MG‐RAST (Keegan et al. 2016), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology and Links Annotation (GhostKOALA) (Kanehisa et al. 2016). GhostKOALA and MG‐RAST both provide functional and taxonomic classifications, while Kaiju provides only a taxonomic classification (Table 1). Kaiju and MG‐RAST accept trimmed sequencing reads as input, while GhostKOALA requires coding sequences to be identified from assembled contigs. We also explored in‐house approaches, including variations of a customizable BLASTx database approach (Pound et al. 2020; Pound and Wilhelm 2020), read recruitments to coding sequences of individual Microcystis genomes (Harke and Gobler 2013; Davenport et al. 2019; Krausfeldt et al. 2019), and recruitments to a Microcystis composite genome created for this analysis. The latter is referred to as “Frankenstein's Microcystis” in reference to Mary Shelley's fictional monster comprised of parts from many individuals (Shelley 1818). The BLASTx approach taxonomically classifies a gene of interest within assembled contigs, while the recruitments to the individual and composite genomes use trimmed reads to functionally classify reads against the coding sequences of a single organism (Table 1).

Table 1

Comparison of methods analyzed in this study. Y indicates a requirement or function provided, while N indicates a lack of requirement or output provided.

	BLASTx	Kaiju	MG‐RAST	GhostKOALA	Individual genomes	Frankenstein genome
Community taxonomy	Y	Y	Y	Y	N	N
Gene function	Y	N	Y	Y	Y	Y
User‐defined database	Y	N	N	N	Y	Y
Input	Contigs	Reads	Reads	Contigs	Reads	Reads
Additional read recruitment	Y	N	N	Y	N	N
Online platform	N	Y	Y	Y	N	N
Example references	Moniruzzaman et al. (2017), Pound et al. (2020)	Chen et al. (2019)	Zhang et al. (2016), Krausfeldt et al. (2019a)	Xie et al. (2016), Li et al. (2018), Cook et al. (2020)	Harke and Gobler (2015), Steffen et al. (2017)	This study

BLASTx approach

Identification and transcript quantification for genes of interest were performed using a previously described BLASTx method (Pound and Wilhelm 2019, 2020). In short, protein databases were established for the following genes of interest using the KEGG orthology K number: DNA‐directed RNA polymerase, beta subunit (rpoB, K03043), ribulose bisphosphate carboxylase large chain (rbcL, K01601), ribose‐phosphate pyrophosphokinase 1 (prpS, K00948), bifunctional purine biosynthesis protein (purH, K00602), and orotidine 5′‐phosphate decarboxylase (pyrF, K01591). All reference sequences were downloaded from the UniProt KB database version 2020_03 (see Data Availability). The assembled contigs from all samples were then queried using command‐line BLASTx against each gene‐specific database, retaining sequences with a minimum contig length of 300 bp and an e‐value < 1 × 10⁻³⁰ (Camacho et al. 2009). Contigs with BLASTx hits were considered “candidates” and only the aligned portion of the gene sequence was used for recruitments (Pound and Wilhelm 2019). Trimmed reads from each sample were recruited to each trimmed candidate contig in CLC Genomics Workbench version 20.0.4 (Qiagen) using a 90% length fraction and 90% identity (Pound and Wilhelm 2020). Any reads that could be recruited to more than one contig with equal identity were randomly assigned, to mimic MG‐RAST and Kaiju data handling approaches and to ensure all reads of interest were quantified. The reference trees were established with maximum likelihood phylogenies based on reference proteins from isolated organisms using PhyML version 3.0 (Guindon et al. 2010). To determine taxonomy, candidate sequences were placed on reference phylogenetic trees using the pplacer algorithm (Matsen et al. 2010).
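
The candidate-selection step described above (contig length of at least 300 bp and e-value below 1 × 10⁻³⁰) could be sketched as follows. This is an illustration only, not the authors' code: it assumes the BLASTx results were written in the default 12-column tabular format (`-outfmt 6`, where the e-value is the 11th column), and the `contig_lengths` dictionary is a hypothetical input that the user would build from the assembly.

```python
import csv
import io

MIN_CONTIG_LEN = 300   # minimum contig length in bp, as stated in the text
MAX_EVALUE = 1e-30     # e-value cutoff, as stated in the text


def candidate_contigs(blast_tsv, contig_lengths):
    """Return the set of contig IDs with at least one hit passing both filters.

    blast_tsv: text in BLAST tabular format (assumed: default 12 columns).
    contig_lengths: dict mapping contig ID -> length in bp (hypothetical input).
    """
    keep = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        contig_id = row[0]
        evalue = float(row[10])  # 11th column of default outfmt 6
        long_enough = contig_lengths.get(contig_id, 0) >= MIN_CONTIG_LEN
        if long_enough and evalue < MAX_EVALUE:
            keep.add(contig_id)
    return keep
```

Only contigs that pass both thresholds would move on to trimming, read recruitment, and phylogenetic placement.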

Kaiju

Trimmed reads were uploaded to the online Kaiju platform and default parameters were used to taxonomically classify reads (Menzel et al. 2016). Reads were classified as Microcystis spp. by summing all the reads that were classified as the Microcystis genus.

MG‐RAST

Trimmed reads were uploaded to the online MG‐RAST platform (Keegan et al. 2016). All parameters were default except for no dereplication and the percent identity threshold (increased to > 90%). Reads were classified as Microcystis spp. by summing all the reads classified as the Microcystis genus based on RefSeq annotations. The number of reads classified as five genes important in central metabolism (rpoB, rbcL, prpS, purH, and pyrF) annotated by the KEGG orthology database within MG‐RAST were also collected.
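
For the read-based classifiers (Kaiju and MG‐RAST), the genus-level summation amounts to counting reads per genus and dividing by the library size. The sketch below assumes the classifier output has already been reduced to `(read_id, genus)` pairs; the actual export formats of the two platforms differ and are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def genus_read_counts(classifications):
    """Count reads per genus from (read_id, genus) pairs.

    A genus of None or "" is counted as "unclassified".
    """
    counts = Counter()
    for _read_id, genus in classifications:
        counts[genus if genus else "unclassified"] += 1
    return counts


def fraction_of_library(counts, genus="Microcystis"):
    """Fraction of all reads in the library assigned to the given genus."""
    total = sum(counts.values())
    return counts[genus] / total if total else 0.0
```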

GhostKOALA

The nucleotide and translated amino acid sequences of coding sequences were predicted within the assembled contigs using MetaGeneMark version 3.25 (Besemer and Borodovsky 1999; Zhu et al. 2010). Amino acid sequences were functionally and taxonomically classified via KEGG orthology with GhostKOALA, using the prokaryote + eukaryote + virus database (Kanehisa et al. 2016; Pound et al. 2021). Predicted coding sequences were assigned KEGG orthology numbers (K numbers) and trimmed reads were recruited to the coding sequences using a 90% similarity fraction over a 90% length fraction in CLC Genomics Workbench. Reads that could be recruited to more than one contig with equal identity were randomly assigned. The number of reads recruited to the five genes important in central metabolism mentioned above was also pulled from the read recruitments.

Recruitment to individual genomes

Coding sequences from complete, closed Microcystis spp. genomic assemblies were used in this study. The coding sequences from the following genomes were used: M. aeruginosa FD4 (accession: NZ_CP046973.1), Microcystis sp. MC19 (accession: NZ_CP020664.1), Microcystis viridis NIES‐102 (accession: NZ_AP019314.1), M. aeruginosa NIES‐298 (accession: NZ_CP046058.1), M. aeruginosa NIES‐843 (accession: NC_010296.1), M. aeruginosa NIES‐2481 (accession: NZ_CP012375.1), M. aeruginosa NIES‐2549 (accession: NZ_CP011304.1), Microcystis panniformis FACHB‐1757 (accession: CP011339.1), and M. aeruginosa PCC7806SL (accession: NZ_CP020771.1). Reads were recruited using a 90% similarity fraction over a 90% length fraction to the coding sequences. Any reads that could be recruited to more than one coding sequence with equal identity were randomly assigned. The number of reads recruited to the five markers of central metabolism mentioned above was also pulled from the read recruitments to the individual genomes.
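
The recruitment rule used throughout (90% length fraction, 90% similarity, random assignment among equally good matches) can be illustrated with a small sketch. This is not the CLC Genomics Workbench implementation, whose internals are not described in the text; the alignment tuples and the function name are assumptions made for illustration.

```python
import random


def assign_read(alignments, read_length, min_len_frac=0.9, min_identity=0.9, rng=random):
    """Pick one reference for a read, or None if no alignment passes the thresholds.

    alignments: list of (reference_id, aligned_length, identity) tuples,
    where identity is a fraction between 0 and 1 (hypothetical input format).
    """
    passing = [
        (ref, ident)
        for ref, aligned, ident in alignments
        if aligned / read_length >= min_len_frac and ident >= min_identity
    ]
    if not passing:
        return None
    best = max(ident for _ref, ident in passing)
    tied = [ref for ref, ident in passing if ident == best]
    # Ties at the best identity are broken at random, as described in the text.
    return rng.choice(tied)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible between runs.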

Frankenstein's Microcystis: Construction and recruitment

Coding sequences from all complete, closed Microcystis genomes (see above) were combined into a single FASTA file and clustered at different nucleotide identities using CD‐HIT‐EST (Fu et al. 2012; Pound et al. 2021). Coding sequences from the individual genomes were clustered at the nucleotide identities of 0.95, 0.90, 0.85, and 0.80 with the word size set to 10, 8, 6, and 5, respectively, to establish composite genomes of various stringencies. Trimmed reads were recruited using a 90% similarity fraction over a 90% length fraction to each composite genome in CLC Genomics Workbench. Any reads that could be recruited to more than one coding sequence with equal identity were randomly assigned. After comparing the various stringencies (Supplemental Fig. S1), the 0.95 identity cluster showed the greatest number of coding sequences, while reducing redundancy, and was used for all subsequent analyses. This synthetic library is referred to as “Frankenstein's Microcystis” and is available on GitHub (see Data Availability). The number of reads recruited to the five genes important in central metabolism mentioned above was also pulled from the read recruitments to the composite genome.
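
As a rough sketch of the clustering step, the four identity and word-size pairs listed above map onto the `-c` (identity threshold) and `-n` (word size) options of `cd-hit-est`. The file names below are hypothetical, and any other options the authors may have set (memory, threads, and so on) are not stated in the text and are left at their defaults. The function only builds the command lines; it does not run them.

```python
# (identity, word size) pairs as listed in the text
STRINGENCIES = [(0.95, 10), (0.90, 8), (0.85, 6), (0.80, 5)]


def cd_hit_est_commands(input_fasta, output_prefix):
    """Build one cd-hit-est argument list per clustering stringency."""
    commands = []
    for identity, word_size in STRINGENCIES:
        # Output name is a hypothetical convention, e.g. prefix_95.fasta
        output = f"{output_prefix}_{int(round(identity * 100))}.fasta"
        commands.append([
            "cd-hit-est",
            "-i", input_fasta,         # combined coding sequences (FASTA)
            "-o", output,              # representative sequences per cluster
            "-c", f"{identity:.2f}",   # nucleotide identity threshold
            "-n", str(word_size),      # word size matched to the threshold
        ])
    return commands
```

Each list could then be passed to `subprocess.run(cmd, check=True)` on a machine where CD‐HIT is installed.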

Methods correlation

The total number of reads recruited to, or classified as, Microcystis spp. per sample, and those recruited to, or classified as, specific Microcystis spp. genes important in central metabolism per sample were normalized by the total number of trimmed reads in each sample library. Pearson correlation coefficients were calculated with Benjamini‐Hochberg corrections for multiple comparisons (Benjamini and Hochberg 1995). All statistical analyses were carried out in RStudio. The Pearson's r coefficients and corrected p‐values are reported in Supplemental File S1.
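
The analysis itself was run in R; the following Python sketch only illustrates the three operations named above (library-size normalization, Pearson's r, and the Benjamini‐Hochberg adjustment) and is not the authors' script. All function names are hypothetical, and the p‐values for each correlation are assumed to be supplied from elsewhere.

```python
import math


def normalize(counts, library_sizes):
    """Divide each sample's read count by its total trimmed reads."""
    return [c / n for c, n in zip(counts, library_sizes)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this layout, each method contributes one normalized value per sample, and `pearson_r` is applied to every pair of methods before the resulting p‐values are adjusted together.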

Assessment

We used a metatranscriptomic dataset generated from a Microcystis spp.‐dominated harmful algal bloom in Lake Erie in August 2019 to compare the methods/tools described above. RNA was extracted from samples collected from bottle incubations (where nutrients were being manipulated) as well as in situ samples and then sequenced. Environmental variables were intentionally disregarded, in order to evaluate method performance independent of abiotic variables. A total of ~ 923 million processed reads were generated across twenty sample metatranscriptomes (~ 46 million reads per library, see Supplemental File S1). These reads assembled into 2,335,243 contigs that varied in length from 157 to 66,630 nucleotides. Total Microcystis reads are reported per sample, while reads recruited to or classified as individual genes are reported as the average across all samples.

Total reads recruited to Microcystis spp.

One of the primary metrics used to rate the performance of the methods tested was the number of total reads recruited to, or classified as, Microcystis spp. All but one of our methods (the BLASTx approach) provided an estimate of total Microcystis spp. reads, but not all methods performed equally (Fig. 1). The Kaiju method recruited the fewest reads, as total reads classified as Microcystis spp. ranged from 4.08 × 10⁶ to 1.18 × 10⁷, which was between 11.5% and 22.6% of the total reads per sample library (Supplemental File S1). Total Microcystis spp. read estimates from MG‐RAST (13.8–28.9% of library), individual Microcystis spp. genome recruitment (17.0–37.7% of library), and Frankenstein's Microcystis genome recruitment (21.4–43.8% of library) surpassed Kaiju's estimates but did not provide the largest estimates. GhostKOALA recruited the most reads, as total reads recruited to Microcystis spp. annotated genes ranged from 8.38 × 10⁶ to 2.36 × 10⁷, which was between 24.2% and 45.2% of the total available reads per sample library.

Fig. 1

The sum of all reads in all samples that were recruited to or classified as Microcystis in each method.

Reads recruited to Microcystis‐specific marker genes

It was also important that the methods tested be evaluated for the number of reads recruited to, or classified as, individual genes, as this is important for future studies concerned with the dynamics of individual genes, as opposed to total shifts in the genome. All but one of our methods (the Kaiju approach) provided read estimates of individual Microcystis‐specific genes, and most methods performed relatively equally (Supplemental Fig. S2). MG‐RAST had the fewest reads classified to any of the genes we tested. On average, 1.77 × 10⁴ reads per sample were classified as rpoB, 2.39 × 10⁴ reads per sample were classified as rbcL, 6.57 × 10³ reads per sample were classified as prpS, 4.30 × 10³ reads per sample were classified as purH, and 3.72 × 10² reads per sample were classified as pyrF (Supplemental File S1). The other methods evaluated were remarkably similar to each other, including the GhostKOALA annotated gene recruitment, individual Microcystis spp. genome recruitment, and Frankenstein's Microcystis genome recruitment, although the BLASTx method outperformed the others in four of the five genes. On average, the BLASTx method recruited 2.86 × 10⁴ reads per sample to rpoB, 3.64 × 10⁴ reads per sample to rbcL, 8.95 × 10³ reads per sample to prpS, 4.20 × 10³ reads per sample to purH, and 6.22 × 10² reads per sample to pyrF (Supplemental File S1).

Methods evaluation

The BLASTx method involves the use of in‐house curated databases that each contain a single marker gene from many species; it thus does not allow for a summary of all pathways as the other methods do. However, this approach provides the user full control over the database used and allows for greater confidence in taxonomic predictions based on the subsequent placement of predicted coding sequences on phylogenetic trees (Matsen et al. 2010). These curated databases can be updated by the user at any point and can be accessed locally. This approach characterizes the assembled contigs, which allows for a more accurate estimate of true community diversity, as opposed to recruiting reads to a reference genome. The number of reads recruited for each gene was strongly correlated (r > 0.99) with the number of reads recruited by any of the other methods (Supplemental File S1). The primary weakness of this method is the lack of efficiency. This method is restricted to individual gene investigations, which may work well for specific questions, or small genomes (e.g., RNA viruses; Moniruzzaman et al. 2017) but is impractical for the complete characterization of the approximately 3600 genes in a Microcystis genome.

While the Kaiju method was the easiest to execute, by our measures it was one of the poorest performers. The program did not provide an efficient way to summarize functional data, as each read was individually characterized to a specific organism, few of which were annotated in the same manner. Therefore, this tool was primarily used to classify taxonomic data, by summing all reads assigned to the Microcystis genus. Kaiju does excel in characterizing the entire community, as opposed to recruiting reads to a single genome. However, it classified fewer reads to Microcystis than all other methods. It is unclear if the “missing” reads were not annotated at all or mistakenly annotated as a different organism. It is worth noting that this tool and its databases have not been updated since 2017. This method was the only method tested that was not as strongly correlated to the other methods, with correlation coefficients ranging from 0.788 to 0.831, depending on the method compared, but none of the differences were considered significant (Supplemental File S1). This suggests that Kaiju may not be consistent in the way it analyzes samples, reducing how well the read estimates correlate to other methods, regardless of the magnitude of reads recruited.

MG‐RAST was like Kaiju in that it provided an easy online interface that accepted trimmed reads and functionally characterized the entire community, not just Microcystis spp. We note that MG‐RAST outperformed Kaiju, but none of the other methods tested in this study. This method did provide the opportunity to characterize individual genes, including our five marker genes. The number of reads recruited to each gene and total Microcystis spp. was strongly correlated (r > 0.98) with the number of reads recruited by any of the other methods (Supplemental File S1). The decrease in total, and gene‐specific, reads recruited is suspected to arise from many probable Microcystis spp. reads being incorrectly annotated as Cyanothece spp., a genus that was not regularly detected in our other methods that were capable of characterizing the entire community, such as the Kaiju or BLASTx methods.

The GhostKOALA method provided both community taxonomy and gene functional characterization, all from an online database. Therefore, this method can easily characterize all active species present in any environmental system, not just Microcystis spp. in a Microcystis‐dominated bloom (Pound et al. 2021). The annotation is based on assembled data and therefore required the recruitment of trimmed reads to genes of interest to make the results quantitative. In our hands this method identified more total reads as Microcystis spp. than any other method. The number of reads recruited for each gene and total Microcystis spp. was strongly correlated (r > 0.98) with the number of reads recruited by any of the other methods (Supplemental File S1). The only notable disadvantage to the GhostKOALA methodology is that annotations are limited to KEGG orthologies, so some transcripts within samples may not be annotated.

To assess whether there was a difference in the number of reads that recruited to an individual Microcystis genome, we recruited reads to coding sequences from all nine closed, complete genomes in NCBI including six M. aeruginosa strains, one M. panniformis strain, one M. viridis strain, and one Microcystis sp. strain (Supplemental File S1). This method does not rely on assembled contigs, and all genomes were easily downloaded from the NCBI database, which is updated regularly to include new genomes and annotations. Each of the five specific genes analyzed in this study showed the highest recruitments in a different genome strain, although the variation between each strain was minimal (1–2%) (Supplemental Fig. S2). The number of reads recruited for each gene and all coding sequences in Microcystis spp. was strongly correlated (r > 0.98) with the number of reads recruited by any of the other methods (Supplemental File S1). The primary disadvantage of this method is the lack of taxonomic or functional data on the rest of the microbial community within a sample.

While the Frankenstein's genome method is primarily a read recruitment method, it has an additional step as a composite genome must first be generated. However, the approach allows the user to customize and update “Frankenstein's Microcystis” as other genomes are completed and published. There was a total of 44,950 coding sequences across the nine strains (average = 4994). Clustering all the coding sequences at decreasing nucleotide identity reduced the total number of clusters from 13,600 to 8920, when clustered at a nucleotide identity of 0.95 or 0.80, respectively. When all coding sequences were clustered, regardless of nucleotide identity, 20.7–43.4% of the total library reads were recruited (Supplemental Fig. S1). The 95% identity composite genome performed well in both total reads recruited and reads recruited to the individual genes. The number of reads recruited for each gene and total Microcystis was strongly correlated (r > 0.98) with the number of reads recruited by any of the other methods (Supplemental File S1). As with the individual genomes though, this method only allows for the characterization of the Microcystis community, not the rest of the microbiome.

Discussion

While the preparation of (meta)transcriptomic samples before sequencing can shape the overall outcome of an analysis for microbial communities (Gann et al. 2021), our analyses have indicated that the way sequences are evaluated can have equally large impacts on the conclusions reached. As the volume of sequencing data generated grows and more researchers turn to molecular tools to address environmental questions, researchers may be tempted to use publicly available, automated pipelines or easily downloaded single‐strain genomes. Each of these methods has strengths and weaknesses, and it falls upon the researcher to establish best practices. This study provides guidelines for the growing community of Microcystis spp. researchers and can also serve as an advisory to those interested in other organisms or communities. One of the primary concerns for any methodology for metatranscriptomic sequencing analysis is the ability to characterize, efficiently and accurately, the true diversity present in a sample, regardless of whether it is from a lab culture or an environmental sample. For many years, the common practice has been to quantify activity by recruiting reads back to a single reference genome (Harke and Gobler 2013; Steffen et al. 2017; Davenport et al. 2019). Here, we have tested a composite of available complete, closed genomes, Frankenstein's Microcystis, where coding sequences from different isolates of the same genus were clustered together to reduce redundancy: this provides a database containing both the common core genes associated with the Microcystis genus as well as the unique coding potentials associated with different isolates. We note that this approach can be extended to any microorganism of interest if multiple, well‐curated genomes are available for clustering. 
However, it is important to note that even a composite genome is limited to the diversity of sequenced, well‐annotated isolates of microorganisms, and still may not represent a natural community's true diversity. Moving forward, researchers will need to update Frankenstein's Microcystis with newly sequenced isolates as they become available. While the BLASTx method we adopted from Pound et al. (2020) and Moniruzzaman et al. (2017) was also capable of more fully characterizing community diversity, the restriction to individual gene databases makes it cumbersome and less ideal for broader hypotheses. A key detail to consider a priori with sequencing data is the resolution to which a researcher may want to characterize the community. As mentioned above, recruitments to Frankenstein's Microcystis provide an efficient and comprehensive way to analyze the Microcystis spp. community. However, it is well known that the rest of the microbiome is likely important to how harmful algal blooms function (Cook et al. 2020). While the BLASTx method mentioned above can characterize the entire microbial community, marker‐gene database inefficiencies are still present. Many tools such as Kaiju, or even 16S rRNA gene sequencing, can provide information on what organisms are present, but few tools also provide information on the functional genes present. Many other factors should also be considered when choosing a method, including how frequently the tool is updated. Even though GhostKOALA performed well in our analysis, the KEGG orthology database will require regular updates to stay relevant. The only way to have full control over a database is to curate it manually, although we recognize that this can be inefficient and can lead to biases from individual laboratory groups based on annotations. Computational power and coding expertise should also be considered, as some of the methods discussed are extremely user friendly while others require some knowledge of command line coding. Many of the online tools use remote servers, while many of the command line tools would likely require local computational power.

Recommendations

For researchers sequencing Microcystis spp.‐dominated harmful algal blooms, we make two suggestions based on our analyses and the resolution of a proposed study. For those wishing to focus solely on Microcystis spp. function and activity, we recommend using a regularly updated Frankenstein's Microcystis composite genome, as it does not require transcriptomes to be assembled into contigs and can provide detailed data on every potential gene currently sequenced in the Microcystis genus. It is also important to note that the combined genome approach used to establish Frankenstein's Microcystis could easily be applied to other organisms and study systems. However, for researchers wishing to focus on microbiome ecology and interactions between species, we recommend the GhostKOALA approach, which provides both functional and taxonomic characterization of the entire community, although it does require additional read recruitments to estimate transcriptional activity. Regardless of the study system, we stress the critical importance of taking great care in selecting an appropriate method when processing sequence data.

Supplemental Fig. S1. Closed circles indicate the total number of coding sequences in each of the individual genomes and combined genomes of various stringencies. Open circles indicate the number of clusters that contained only a single coding sequence in the combined genomes, indicating a lack of redundancy in those genes.

Supplemental Fig. S2. The sum of all reads in all samples that were recruited to or classified as a specific Microcystis gene in each method.

Supplemental File S1. Reads recruited to or classified as Microcystis or specific genes in Microcystis (sheet 1) and Pearson's correlation coefficients between methods (sheets 2 and 3).
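Supplemental File S1 reports Pearson's correlation coefficients between methods. As a minimal sketch of that kind of comparison, the example below computes Pearson's r between per-gene read counts from pairs of methods. The method names and counts are hypothetical and are not data from this study.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical read counts for the same four genes under three methods.
method_a = [10, 20, 30, 40]
method_b = [20, 40, 60, 80]   # twice as many reads, same relative pattern
method_c = [40, 10, 30, 20]   # different relative pattern

r_ab = pearson(method_a, method_b)
r_ac = pearson(method_a, method_c)
print(f"A vs B: r = {r_ab:.3f}")
print(f"A vs C: r = {r_ac:.3f}")
```

Note that methods A and B correlate perfectly even though B recovers twice as many reads: the coefficient captures agreement in the relative pattern across genes, not in total read counts, so it is best read alongside the totals rather than in place of them.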

43 in total

1. Ecophysiological Examination of the Lake Erie Microcystis Bloom in 2014: Linkages between Biology and the Water Supply Shutdown of Toledo, OH.

Authors: Morgan M Steffen; Timothy W Davis; R Michael L McKay; George S Bullerjahn; Lauren E Krausfeldt; Joshua M A Stough; Michelle L Neitzey; Naomi E Gilbert; Gregory L Boyer; Thomas H Johengen; Duane C Gossiaux; Ashley M Burtner; Danna Palladino; Mark D Rowe; Gregory J Dick; Kevin A Meyer; Shawn Levy; Braden E Boone; Richard P Stumpf; Timothy T Wynne; Paul V Zimba; Danielle Gutierrez; Steven W Wilhelm
Journal: Environ Sci Technol Date: 2017-06-08 Impact factor: 9.028

2. Genetic diversity of inorganic carbon uptake systems causes variation in CO2 response of the cyanobacterium Microcystis.

Authors: Giovanni Sandrini; Hans C P Matthijs; Jolanda M H Verspagen; Gerard Muyzer; Jef Huisman
Journal: ISME J Date: 2013-10-17 Impact factor: 10.302

3. Molecular prediction of lytic vs lysogenic states for Microcystis phage: Metatranscriptomic evidence of lysogeny during large bloom events.

Authors: Joshua M A Stough; Xiangming Tang; Lauren E Krausfeldt; Morgan M Steffen; Guang Gao; Gregory L Boyer; Steven W Wilhelm
Journal: PLoS One Date: 2017-09-05 Impact factor: 3.240

4. Metatranscriptomic Analyses of Diel Metabolic Functions During a Microcystis Bloom in Western Lake Erie (United States).

Authors: Emily J Davenport; Michelle J Neudeck; Paul G Matson; George S Bullerjahn; Timothy W Davis; Steven W Wilhelm; Maddie K Denney; Lauren E Krausfeldt; Joshua M A Stough; Kevin A Meyer; Gregory J Dick; Thomas H Johengen; Erika Lindquist; Susannah G Tringe; Robert Michael L McKay
Journal: Front Microbiol Date: 2019-09-10 Impact factor: 5.640

5. The global Microcystis interactome.

Authors: Katherine V Cook; Chuang Li; Haiyuan Cai; Lee R Krumholz; K David Hambright; Hans W Paerl; Morgan M Steffen; Alan E Wilson; Michele A Burford; Hans-Peter Grossart; David P Hamilton; Helong Jiang; Assaf Sukenik; Delphine Latour; Elisabeth I Meyer; Judit Padisák; Boqiang Qin; Richard M Zamor; Guangwei Zhu
Journal: Limnol Oceanogr Date: 2019-11-19 Impact factor: 4.745

6. Metatranscriptome Library Preparation Influences Analyses of Viral Community Activity During a Brown Tide Bloom.

Authors: Eric R Gann; Yoonja Kang; Sonya T Dyhrman; Christopher J Gobler; Steven W Wilhelm
Journal: Front Microbiol Date: 2021-05-31 Impact factor: 5.640

7. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

8. Global transcriptional responses of the toxic cyanobacterium, Microcystis aeruginosa, to nitrogen stress, phosphorus stress, and growth on organic matter.

Authors: Matthew J Harke; Christopher J Gobler
Journal: PLoS One Date: 2013-07-23 Impact factor: 3.240

9. Fast and sensitive taxonomic classification for metagenomics with Kaiju.

Authors: Peter Menzel; Kim Lee Ng; Anders Krogh
Journal: Nat Commun Date: 2016-04-13 Impact factor: 14.919

10. Genome sequences of lower Great Lakes Microcystis sp. reveal strain-specific genes that are present and expressed in western Lake Erie blooms.

Authors: Kevin Anthony Meyer; Timothy W Davis; Susan B Watson; Vincent J Denef; Michelle A Berry; Gregory J Dick
Journal: PLoS One Date: 2017-10-11 Impact factor: 3.240