A practical guide to amplicon and metagenomic analysis of microbiome data.

Yong-Xin Liu^1,2,3, Yuan Qin^4,5,6,7, Tong Chen^8, Meiping Lu^9, Xubo Qian^9, Xiaoxuan Guo^4,5,6, Yang Bai^10,11,12,13.

Abstract

Advances in high-throughput sequencing (HTS) have fostered rapid developments in the field of microbiome research, and massive microbiome datasets are now being generated. However, the diversity of software tools and the complexity of analysis pipelines make it difficult to access this field. Here, we systematically summarize the advantages and limitations of microbiome methods. Then, we recommend specific pipelines for amplicon and metagenomic analyses, and describe commonly-used software and databases, to help researchers select the appropriate tools. Furthermore, we introduce statistical and visualization methods suitable for microbiome analysis, including alpha- and beta-diversity, taxonomic composition, difference comparisons, correlation, networks, machine learning, evolution, source tracing, and common visualization styles to help researchers make informed choices. Finally, a step-by-step reproducible analysis guide is introduced. We hope this review will allow researchers to carry out data analysis more effectively and to quickly select the appropriate tools in order to efficiently mine the biological significance behind the data.

Keywords: high-throughput sequencing; marker genes; metagenome; pipeline; reproducible analysis; visualization

Year: 2020 PMID: 32394199 PMCID: PMC8106563 DOI: 10.1007/s13238-020-00724-8

Source DB: PubMed Journal: Protein Cell ISSN: 1674-800X Impact factor: 14.870

Introduction

The microbiome refers to an entire microhabitat, including its microorganisms, their genomes, and the surrounding environment (Marchesi and Ravel, 2015). With the development of high-throughput sequencing (HTS) technology and data analysis methods, the roles of the microbiome in humans (Gao et al., 2018; Yang and Yu, 2018; Zhang et al., 2018a), animals (Liu et al., 2020), plants (Liu et al., 2019a; Wang et al., 2020a), and the environment (Mahnert et al., 2019; Zheng et al., 2019) have gradually become clearer in recent years. These findings have completely changed our understanding of the microbiome. Several countries have launched successful international microbiome projects, such as the NIH Human Microbiome Project (HMP) (Turnbaugh et al., 2007), the Metagenomics of the Human Intestinal Tract (MetaHIT) (Li et al., 2014), the integrative HMP (iHMP) (Proctor et al., 2019), and the Chinese Academy of Sciences Initiative of Microbiome (CAS-CMI) (Shi et al., 2019b). These projects have made remarkable achievements, which have pushed microbiome research into a golden era. The framework for amplicon and metagenomic analysis was established in the last decade (Caporaso et al., 2010; Qin et al., 2010). However, microbiome analysis methods and standards have been evolving rapidly over the past few years (Knight et al., 2018). For example, there was a proposal to replace operational taxonomic units (OTUs) with amplicon sequence variants (ASVs) in marker gene-based amplicon data analysis (Callahan et al., 2016). The next-generation microbiome analysis pipeline QIIME 2, a reproducible, interactive, efficient, community-supported platform, was recently published (Bolyen et al., 2019). In addition, new methods have recently been proposed for taxonomic classification (Ye et al., 2019), machine learning (Galkin et al., 2018), and multi-omics integrated analysis (Pedersen et al., 2018).
The development of HTS and analysis methods has provided new insights into the structures and functions of the microbiome (Jiang et al., 2019; Ning and Tong, 2019). However, these new developments have made it challenging for researchers, especially those without a bioinformatics background, to choose suitable software and pipelines. In this review, we discuss the widely used software packages for microbiome analyses, summarize their advantages and limitations, and provide sample code and suggestions for selecting and using these tools.

HTS methods of microbiome analysis

The first step in microbiome research is to understand the advantages and limitations of specific HTS methods. These methods are primarily used for three types of analysis: microbe-, DNA-, and mRNA-level analyses (Fig. 1A). The appropriate method(s) should be selected based on sample types and research goals.

Figure 1

Advantages and limitations of HTS methods used in microbiome research. (A) Introduction to HTS methods for different levels of analysis. At the molecular level, microbiome studies are divided into three types: microbe, DNA, and mRNA. The corresponding research techniques include culturome, amplicon, metagenome, metavirome, and metatranscriptome analyses. (B) The advantages and limitations of various HTS methods for microbiome analysis.

Culturome is a high-throughput method for culturing and identifying microbes at the microbe level (Fig. 1A). The microbial isolates are obtained as follows. First, the samples are crushed, empirically diluted in liquid medium, and distributed in 96-well microtiter plates or Petri dishes. Second, the plates are cultured for 20 days at room temperature. Third, the microbes in each well are subjected to amplicon sequencing, and wells with pure, non-redundant colonies are selected as candidates. Fourth, the candidates are purified and subjected to 16S rDNA full-length Sanger sequencing. Finally, the newly characterized pure isolates are preserved (Zhang et al., 2019). Culturome is the most effective method for obtaining bacterial stocks, but it is expensive and labor intensive (Fig. 1B). This method has been used for microbiome analysis in humans (Goodman et al., 2011; Zou et al., 2019), mice (Liu et al., 2020), marine sediment (Mu et al., 2018), Arabidopsis thaliana (Bai et al., 2015), and rice (Zhang et al., 2019). These studies not only expanded the catalog of taxonomic and functional databases for metagenomic analyses, but also provided bacterial stocks for experimental verification. For further information, please see Lagier et al. (2018) and Liu et al. (2019a).
DNA is easy to extract, preserve, and sequence, which has allowed researchers to develop various HTS methods (Fig. 1A). The most commonly used HTS methods in microbiome research are amplicon and metagenomic sequencing (Fig. 1B).
Amplicon sequencing, the most widely used HTS method for microbiome analysis, can be applied to almost all sample types. The major marker genes used in amplicon sequencing include 16S ribosomal DNA (rDNA) for prokaryotes and 18S rDNA and internal transcribed spacers (ITS) for eukaryotes. 16S rDNA amplicon sequencing is the most commonly used method, but there is currently a confusing array of available primers. A good method for selecting primers is to evaluate their specificity and overall coverage using real samples or electronic PCR based on the SILVA database (Klindworth et al., 2012), taking into account host factors including the presence of chloroplasts, mitochondria, ribosomes, and other potential sources of non-specific amplification. Alternatively, researchers can refer to the primers used in published studies similar to their own, which saves time in method optimization and facilitates the comparison of results among studies. Two-step PCR is typically used for amplification and to add barcodes and adaptors to each sample during library preparation (de Muinck et al., 2017). Sample sequencing is often performed on the Illumina MiSeq, HiSeq 2500, or NovaSeq 6000 platform in paired-end 250 bases (PE250) mode, which generates 50,000–100,000 reads per sample. Amplicon sequencing can be applied to low-biomass specimens or samples contaminated by host DNA. However, this technique can only reach genus-level resolution. Moreover, it is sensitive to the specific primers and number of PCR cycles chosen, which may lead to some false-positive or false-negative results in downstream analyses (Fig. 1B).
Metagenomic sequencing provides more information than amplicon sequencing, but it is more expensive. For ‘pure’ samples such as human feces, the accepted amount of sequencing data for each sample ranges from 6 to 9 gigabytes (GB) in a metagenomic project. The corresponding price for library construction and sequencing ranges from $100 to $300.
For samples containing complex microbiota or contaminated with host-derived DNA, the required sequencing output ranges from 30 to 300 GB per sample (Xu et al., 2018). In brief, 16S rDNA amplicon sequencing can be used to study bacterial and/or archaeal composition. Metagenomic sequencing is advisable for further analysis if higher taxonomic resolution and functional information are required (Arumugam et al., 2011; Smits et al., 2017). Of course, metagenomic sequencing can be used directly in studies with smaller sample sizes, assuming sufficient project funding is available (Carrión et al., 2019; Fresia et al., 2019).
Metatranscriptomic sequencing can profile mRNAs in a microbial community, quantify gene expression levels, and provide a snapshot for functional exploration of a microbial community in situ (Turner et al., 2013; Salazar et al., 2019). It is worth noting that host RNA and other rRNAs should be removed in order to obtain transcriptional information of microbiota (Fig. 1B).
Since viruses have either DNA or RNA as their genetic material, metavirome research technically involves a combination of metagenome and metatranscriptome analyses (Fig. 1A and 1B). Due to the low biomass of viruses in a sample, virus enrichment (Metsky et al., 2019) or the removal of host DNA (Charalampous et al., 2019) is an essential step for obtaining sufficient quantities of viral DNA or RNA for analysis (Fig. 1B).
The selection of sequencing methods depends on the scientific questions and sample types. The integration of different methods is advisable, as multi-omics provides insights into both the taxonomy and function of the microbiome. In practice, most researchers select only one or two HTS methods for analysis due to time and cost limitations. Although amplicon sequencing can provide only the taxonomic composition of microbiota, it is cost effective ($20–50 per sample) and can be applied to large-scale research.
In addition, the amount of data generated from amplicon sequencing is relatively small, and the analysis is quick and easy to perform. For example, data analysis of 100 amplicon samples could be completed within a day using an ordinary laptop computer. Thus, amplicon sequencing is often used in pioneering research. In contrast to amplicon sequencing, metagenomic sequencing not only extends taxonomic resolution to the species- or strain-level but also provides potential functional information. Metagenomic sequencing also makes it possible to assemble microbial genomes from short reads. However, it does not perform well for low-biomass samples or those severely contaminated by the host genome (Fig. 1B).

Analysis pipelines

“Analysis pipeline” refers to a particular program or script that combines several or even dozens of software programs organically in a certain order to complete a complex analysis task. As of January 23, 2020, the words “amplicon” and “metagenome” were mentioned more than 200,000 and 40,000 times in Google Scholar, respectively. Due to their wide usage, we will discuss the current best-practice pipelines for amplicon and metagenomic analysis. Researchers should get acquainted with the Shell environment and R language, which we discussed in our previous review (Liu et al., 2019b).

Amplicon analysis

The first stage of amplicon analysis is to convert raw reads (typically in fastq format) into a feature table (Fig. 2A). The raw reads are usually generated on Illumina platforms in paired-end 250 bases (PE250) mode. Other platforms, including Ion Torrent, PacBio, and Nanopore, are not discussed in this review and may not be suitable for the analysis pipelines discussed below. First, raw amplicon paired-end reads are grouped based on their barcode sequences (demultiplexing). Then the paired reads are merged to obtain amplicon sequences, and barcodes and primers are removed. A quality-control step is normally needed to remove low-quality amplicon sequences. All of these steps can be completed using USEARCH (Edgar, 2010) or QIIME (Caporaso et al., 2010). Alternatively, clean amplicon data supplied by sequencing service providers can be used for the next analysis step (Fig. 2A).
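The demultiplexing and primer-removal steps described above can be illustrated with a minimal Python sketch. It is not how USEARCH or QIIME implement these steps: the barcode map, the primer sequence, and the function names are hypothetical, matching is exact only, and paired-end merging is omitted.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map and forward primer, invented for illustration.
BARCODES = {"ACGT": "sampleA", "TGCA": "sampleB"}
PRIMER = "GTGCCAGC"


def demultiplex(reads, barcodes=BARCODES, primer=PRIMER):
    """Group reads by their leading barcode, then strip barcode and primer.

    Reads with an unknown barcode, or without the primer directly after the
    barcode, are discarded (exact matching only, for simplicity).
    """
    by_sample = defaultdict(list)
    barcode_len = len(next(iter(barcodes)))  # assumes equal-length barcodes
    for read in reads:
        barcode, rest = read[:barcode_len], read[barcode_len:]
        sample = barcodes.get(barcode)
        if sample is None or not rest.startswith(primer):
            continue
        by_sample[sample].append(rest[len(primer):])
    return dict(by_sample)


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


reads = [
    "ACGT" + "GTGCCAGC" + "TTAGGC",
    "TGCA" + "GTGCCAGC" + "CCATGA",
    "ACGT" + "GTGCCAGC" + "TTAGGA",
    "GGGG" + "GTGCCAGC" + "TTAGGC",  # unknown barcode, dropped
]
groups = demultiplex(reads)
print(groups)
print(mean_quality("IIII"))  # 'I' encodes Phred 40 under Phred+33
```

A quality-control step could then discard reads whose mean quality falls below a chosen cutoff; production tools use more careful criteria, such as expected-error filtering.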

Figure 2

Workflow of commonly used methods for amplicon (A) and metagenomic (B) sequencing. Blue, orange, and green blocks represent input, intermediate, and output files, respectively. The text next to the arrow represents the method, with frequently used software shown in parentheses. Taxonomic and functional tables are collectively referred to as feature tables. Please see Table 1 for more information about the software listed in this figure

Table 1

Introduction to software for amplicon and metagenomic analysis

Name	Link	Description and advantages	Reference
QIIME	http://qiime.org	The most highly cited and comprehensive amplicon analysis pipeline, providing hundreds of scripts for analyzing various data types and visualizations	(Caporaso et al., 2010)
QIIME 2	https://qiime2.org https://github.com/YongxinLiu/QIIME2ChineseManual	This next-generation amplicon pipeline provides integrated command lines and GUI, and supports reproducible analysis and big data. Provides interactive visualization and Chinese tutorial documents and videos	(Bolyen et al., 2019)
USEARCH	http://www.drive5.com/usearch https://github.com/YongxinLiu/UsearchChineseManual	Alignment tool with more than 200 subcommands for amplicon analysis. Small (1 Mb), cross-platform, and fast, with a free 32-bit version; the 64-bit version is commercial ($1485)	(Edgar, 2010)
VSEARCH	https://github.com/torognes/vsearch	A free USEARCH-like software tool. We recommend using it alone or in addition to USEARCH. Available as a plugin in QIIME 2	(Rognes et al., 2016)
Trimmomatic	http://www.usadellab.org/cms/index.php?page=trimmomatic	Java based software for quality control of metagenomic raw reads	(Bolger et al., 2014)
Bowtie 2	http://bowtie-bio.sourceforge.net/bowtie2	Rapid alignment tool used to remove host contamination or for quantification	(Langmead and Salzberg, 2012)
MetaPhlAn2	https://bitbucket.org/biobakery/metaphlan2	Taxonomic profiling tool with a marker gene database from more than 10,000 species. The output is relative abundance of strains	(Truong et al., 2015)
Kraken 2	https://ccb.jhu.edu/software/kraken2	A taxonomic classification tool that uses exact k-mer matches to the NCBI database. Provides high accuracy and rapid classification, and outputs read counts for each species	(Wood et al., 2019)
HUMAnN2	https://bitbucket.org/biobakery/humann2	Based on the UniRef protein database, calculates gene family abundance, pathway coverage, and pathway abundance from metagenomic or metatranscriptomic data. Provides species’ contributions to a specific function	(Franzosa et al., 2018)
MEGAN	https://github.com/husonlab/megan-ce http://www-ab.informatik.uni-tuebingen.de/software/megan6	A GUI, cross-platform software for taxonomic and functional analysis of metagenomic data. Supports many types of visualizations with metadata, including scatter plot, word clouds, Voronoi tree maps, clustering, and networks	(Huson et al., 2016)
MEGAHIT	https://github.com/voutcn/megahit	Ultra-fast and memory-efficient metagenomic assembler	(Li et al., 2015)
metaSPAdes	http://cab.spbu.ru/software/spades	High-quality metagenomic assembler, but time-consuming and memory-intensive	(Nurk et al., 2017)
MetaQUAST	http://quast.sourceforge.net/metaquast	Evaluates the quality of metagenomic assemblies, including N50 and misassemblies, and outputs PDF and interactive HTML reports	(Mikheenko et al., 2016)
MetaGeneMark	http://exon.gatech.edu/GeneMark/	Gene prediction in bacteria, archaea, metagenomes, and metatranscriptomes. Supports Linux and macOS. Provides a webserver for online analysis	(Zhu et al., 2010)
Prokka	http://www.vicbioinformatics.com/software.prokka.shtml	Provides rapid prokaryotic genome annotation, calls metaProdigal (Hyatt et al., 2012) for metagenomic gene prediction. Outputs nucleotide sequences, protein sequences, and annotation files of genes	(Seemann, 2014)
CD-HIT	http://weizhongli-lab.org/cd-hit	Used to construct non-redundant gene catalogs	(Fu et al., 2012)
Salmon	https://combine-lab.github.io/salmon	Provides ultra-fast quantification of read counts of genes using a k-mer-based method	(Patro et al., 2017)
metaWRAP	https://github.com/bxlab/metaWRAP	Binning pipeline that includes 140 tools and supports conda installation; default binning by MetaBAT, MaxBin, and CONCOCT. Provides refinement, quantification, taxonomic classification, and visualization of bins	(Uritskiy et al., 2018)
DAS Tool	https://github.com/cmks/DAS_Tool	Binning pipeline that integrates five binning software packages and performs refinement	(Sieber et al., 2018)

Picking representative sequences as proxies for species is a key step in amplicon analysis. Two major approaches for representative sequence selection are clustering into OTUs and denoising into ASVs. The UPARSE algorithm clusters sequences with 97% similarity into OTUs (Edgar, 2013). However, this method may fail to detect subtle differences among species or strains. DADA2 is a recently developed denoising algorithm that outputs ASVs as more exact representative sequences (Callahan et al., 2016). Denoising is available through denoise-paired/single by DADA2 and denoise-16S by Deblur in QIIME 2 (Bolyen et al., 2019), and through -unoise3 in USEARCH (Edgar and Flyvbjerg, 2015). Finally, a feature table (OTU/ASV table) can be obtained by quantifying the frequency of the feature sequences in each sample. Simultaneously, the feature sequences can be assigned taxonomy, typically at the kingdom, phylum, class, order, family, genus, and species levels, providing a dimensionality reduction perspective on the microbiota.
In general, 16S rDNA amplicon sequencing can only be used to obtain information about taxonomic composition. However, many software packages have been developed to predict potential functional information. The principle behind this prediction is to link the 16S rDNA sequences or taxonomy information with functional descriptions in the literature. PICRUSt (Langille et al., 2013), which is based on the OTU table of the Greengenes database (McDonald et al., 2011), can be used to predict the metagenomic functional composition (Zheng et al., 2019) of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa and Goto, 2000). The newly developed PICRUSt2 software package (https://github.com/picrust/picrust2) can directly predict metagenomic functions based on an arbitrary OTU/ASV table. The R package Tax4Fun (Asshauer et al., 2015) can predict KEGG functional capabilities of microbiota based on the SILVA database (Quast et al., 2013).
The functional annotation of prokaryotic taxa (FAPROTAX) pipeline performs functional annotation based on published metabolic and ecological functions such as nitrate respiration, iron respiration, plant pathogen, and animal parasites or symbionts, making it useful for environmental (Louca et al., 2016), agricultural (Zhang et al., 2019), and animal (Ross et al., 2018) microbiome research. BugBase is an extended database of Greengenes used to predict phenotypes such as oxygen tolerance, Gram staining, and pathogenic potential (Ward et al., 2017); this database is mainly used in medical research (Mahnert et al., 2019).
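As a rough illustration of the quantification step described above (counting feature sequences per sample and summarizing them by taxonomy), the following Python sketch builds a small feature table. It performs neither clustering nor denoising; the sequences, sample names, and genus labels are invented, and the function names are hypothetical.

```python
from collections import Counter


def build_feature_table(sample_reads):
    """Count how often each distinct sequence occurs in each sample.

    Returns (features, table), where table[feature][sample] is a read count.
    """
    counts = {sample: Counter(seqs) for sample, seqs in sample_reads.items()}
    features = sorted({seq for c in counts.values() for seq in c})
    table = {
        feature: {sample: counts[sample][feature] for sample in sample_reads}
        for feature in features
    }
    return features, table


def collapse_by_taxon(table, taxonomy):
    """Sum the counts of features that share a taxon label (e.g. a genus)."""
    collapsed = {}
    for feature, row in table.items():
        taxon = taxonomy.get(feature, "Unassigned")
        target = collapsed.setdefault(taxon, {sample: 0 for sample in row})
        for sample, n in row.items():
            target[sample] += n
    return collapsed


# Toy input: already-cleaned amplicon sequences per sample (invented data).
sample_reads = {
    "sampleA": ["TTAGGC", "TTAGGC", "TTAGGA"],
    "sampleB": ["CCATGA", "TTAGGC"],
}
# Hypothetical genus assignments for each feature sequence.
taxonomy = {"TTAGGC": "GenusX", "TTAGGA": "GenusX", "CCATGA": "GenusY"}

features, table = build_feature_table(sample_reads)
genus_table = collapse_by_taxon(table, taxonomy)
print(table)
print(genus_table)
```

Collapsing three features into two genera is a small-scale version of the dimensionality reduction that taxonomic assignment provides.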

Metagenomic analysis

Compared to amplicon sequencing, shotgun metagenomic sequencing can provide functional gene profiles directly and reach a much higher resolution of taxonomic annotation. However, because of the large amount of data and the fact that most software is only available for Linux systems, substantial computing resources are needed to perform the analysis. To facilitate software installation and maintenance, we recommend using the package manager Conda with the BioConda channel (Grüning et al., 2018) to deploy metagenomic analysis pipelines. Since metagenomic analysis is computationally intensive, it is better to run multiple tasks/samples in parallel, which can be managed with software such as GNU Parallel (Tange, 2018). The Illumina HiSeqX/NovaSeq system often produces PE150 reads for metagenomic sequencing, whereas reads generated by BGI-Seq500 are in PE100 mode.
The first crucial step in metagenomic analysis is quality control and the removal of host contamination from raw reads, which can be performed using the KneadData pipeline (https://bitbucket.org/biobakery/kneaddata) or a combination of Trimmomatic (Bolger et al., 2014) and Bowtie 2 (Langmead and Salzberg, 2012). Trimmomatic is a flexible quality-control software package for Illumina sequencing data that can be used to trim low-quality sequences, library primers, and adapters. Reads mapped to host genomes using Bowtie 2 are treated as contaminating reads and filtered out. KneadData is an integrated pipeline that includes Trimmomatic, Bowtie 2, and related scripts; it can be used for quality control, to remove host-derived reads, and to output clean reads (Fig. 2B).
The main step in metagenomic analysis is to convert clean data into taxonomic and functional tables using reads-based and/or assembly-based methods. The reads-based methods align clean reads to curated databases and output feature tables (Fig. 2B).
MetaPhlAn2 is a commonly used taxonomic profiling tool that aligns metagenome reads to a pre-defined marker-gene database to perform taxonomic classification (Truong et al., 2015). Kraken 2 performs exact k-mer matching to sequences within the NCBI non-redundant database and uses lowest common ancestor (LCA) algorithms to perform taxonomic classification (Wood et al., 2019). For a benchmarking review of 20 taxonomic classification tools, please see Ye et al. (2019). HUMAnN2 (Franzosa et al., 2018), the widely used functional profiling software, can also be used to explore within- and between-sample contributional diversity (species’ contributions to a specific function). MEGAN (Huson et al., 2016) is a cross-platform graphical user interface (GUI) software that performs taxonomic and functional analyses (Table 1). In addition, various metagenomic gene catalogs are available, including catalogs curated from the human gut (Li et al., 2014; Pasolli et al., 2019; Tierney et al., 2019), the mouse gut (Xiao et al., 2015), the chicken gut (Huang et al., 2018), the cow rumen (Stewart et al., 2018; Stewart et al., 2019), the ocean (Salazar et al., 2019), and the citrus rhizosphere (Xu et al., 2018). These customized databases can be used for taxonomic and functional annotation in the appropriate field of study, allowing efficient, precise, rapid analysis.
Assembly-based methods assemble clean reads into contigs using tools such as MEGAHIT or metaSPAdes (Fig. 2B).
MEGAHIT is used to assemble large, complex metagenome datasets quickly using little computer memory (Li et al., 2015), while metaSPAdes can generate longer contigs but requires more computational resources (Nurk et al., 2017). Genes present in assembled contigs are then identified using MetaGeneMark (Zhu et al., 2010) or Prokka (Seemann, 2014). Redundant genes from separately assembled contigs must be removed using tools such as CD-HIT (Fu et al., 2012). Finally, a gene abundance table can be generated using alignment-based tools such as Bowtie 2 or alignment-free methods such as Salmon (Patro et al., 2017). Millions of genes are normally present in a metagenomic dataset. These genes must be combined into functional annotations, such as KEGG Orthology (KO), modules, and pathways, representing a form of dimensionality reduction (Kanehisa et al., 2016).
In addition, metagenomic data can be used to mine gene clusters or to assemble draft microbial genomes. The antiSMASH database is used to identify, annotate, and visualize gene clusters involved in secondary metabolite biosynthesis (Blin et al., 2018). Binning is a method that can be used to recover partial or complete bacterial genomes from metagenomic data. Available binning tools include CONCOCT (Alneberg et al., 2014), MaxBin 2 (Wu et al., 2015), and MetaBAT2 (Kang et al., 2015). Binning tools cluster contigs into different bins (draft genomes) based on tetranucleotide frequency and contig abundance. Reassembly is performed to obtain better bins. We recommend using a binning pipeline such as metaWRAP (Uritskiy et al., 2018) or DAS Tool (Sieber et al., 2018), which integrate several binning software packages to obtain refined binning results and more complete genomes with less contamination. These pipelines also supply useful scripts for evaluation and visualization. For a more comprehensive review on metagenomic experiments and analysis, we recommend Quince et al. (2017).
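To make the tetranucleotide-frequency signal used in binning more concrete, the sketch below computes a 256-element composition vector per contig and compares contigs by Euclidean distance. This is a simplified assumption-laden illustration, not the algorithm of CONCOCT, MaxBin 2, or MetaBAT2: it ignores contig abundance, does not merge reverse-complement k-mers, and uses invented toy contigs and hypothetical function names.

```python
import math
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequency(contig):
    """Return the 256-element vector of tetranucleotide frequencies."""
    counts = dict.fromkeys(KMERS, 0)
    contig = contig.upper()
    for i in range(len(contig) - 3):
        kmer = contig[i:i + 4]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy contigs (invented): the first two share a composition, the third does not.
contig_1 = "ACGTACGTACGTACGT"
contig_2 = "ACGTACGTACGT"
contig_3 = "GGGGCCCCGGGGCCCC"

tnf_1 = tetranucleotide_frequency(contig_1)
tnf_2 = tetranucleotide_frequency(contig_2)
tnf_3 = tetranucleotide_frequency(contig_3)

print(euclidean(tnf_1, tnf_2))  # small: similar composition
print(euclidean(tnf_1, tnf_3))  # larger: different composition
```

Contigs with a small distance in this space are candidates for the same bin; real binners combine this composition signal with coverage across samples before clustering.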

Statistical analysis and visualization

The most important output files from amplicon and metagenomic analysis pipelines are taxonomic and functional tables (Figs. 2 and 3). The scientific questions that researchers can answer using these techniques include the following: Which microbes are present in the microbiota? Do different experimental groups show significant differences in alpha- and beta-diversity? Which species, genes, or functional pathways are biomarkers of each group? To answer these questions, methods are needed for both overall and detailed statistical analysis and visualization. Overall visualization can be used to explore differences in alpha/beta-diversity and taxonomic composition in a feature table. Detailed analysis could involve identifying biomarkers via comparison, correlation analysis, network analysis, and machine learning (Fig. 3). We will discuss these methods below and provide examples and references to facilitate such studies (Fig. 3 and Table 2).

Figure 3

Table 2

Introduction to various analysis and visualization methods

Method	Scientific question	Visualization	Description and example reference
Alpha diversity	Within-sample diversity	Boxplot	Distribution (Edwards et al., 2015) or significant difference (Zhang et al., 2019) of alpha diversity among groups (Fig. 3A)
		Rarefaction curve	Sample diversity changes with sequencing depth or evaluation of sequencing saturation (Beckers et al., 2017)
		Venn diagram	Common or unique taxa (Ren et al., 2019)
Beta diversity	Distance among samples or groups	Unconstrained PCoA scatter plot	Major differences of samples showing group differences (Fig. 3B) or gradient changes with time (Zhang et al., 2018b)
		Constrained PCoA scatter plot	Major differences among groups (Zgadzaj et al., 2016; Huang et al., 2019)
		Dendrogram	Hierarchical clustering of samples (Chen et al., 2019)
Taxonomic composition	Relative abundance of features	Stacked bar plot	Taxonomic composition of each sample (Beckers et al., 2017) or group (Jin et al., 2017) (Fig. 3C)
		Flow or alluvial diagram	Relative abundance (RA) of taxonomic changes among seasons (Smits et al., 2017) or time-series (Zhang et al., 2018b)
		Sankey diagram	A variety of Venn diagrams showing changes in RA and common or unique features among groups (Smits et al., 2017)
Difference comparison	Significantly different biomarkers between groups	Volcano plot	A variety of scatter plots showing P-value, RA, fold change, and number of differences (Shi et al., 2019a)
		Manhattan plot	A variety of scatter plots showing P-values, taxonomy, and highlighting significantly different biomarkers (Zgadzaj et al., 2016) (Fig. 3D)
		Extended error bar plot	Bar plot of RA combined with difference and confidence intervals (Parks et al., 2014)
Correlation analysis	Correlation between features and sample metadata	Scatter plot with linear fitting	Shows changes in features with time (Metcalf et al., 2016) or relationships with other numeric metadata (Fig. 3E)
		Corrplot	Correlation coefficient or distance triangular matrix visualized by color and/or shape (Zhang et al., 2018b)
		Heatmap	RA of features that change with time (Subramanian et al., 2014)
Network analysis	Global view correlation of features	Colored based on taxonomy or modules	Finding correlation patterns of features based on taxonomy (Fig. 3F) and/or modules (Jiao et al., 2016)
		Colors highlight important features	Highlighting important features and showing their positions and connections (Wang et al., 2018b)
Machine learning	Classification groups or regression analysis for numeric metadata prediction	Heatmap	Colored block showing classification results (Fig. 3G) (Wilck et al., 2017) or feature patterns in a time series (Subramanian et al., 2014).
		Bar plot	Feature importance, RA (Zhang et al., 2019), and increase in mean squared error (Subramanian et al., 2014).
Treemap	Phylogenetic tree or taxonomy hierarchy	Phylogenetic tree or cladogram	Phylogenetic tree (Fig. 3H) shows relationship of OTUs or species (Levy et al., 2018). Taxonomic cladogram highlighting interesting biomarkers (Segata et al., 2011).
		Circular tree map	Shows features in a hierarchy color bubble (Carrión et al., 2019)

Figure 3. Overview of statistical and visualization methods for feature tables. Downstream analysis of microbiome feature tables, including alpha/beta-diversity (A/B), taxonomic composition (C), difference comparison (D), correlation analysis (E), network analysis (F), classification of machine learning (G), and phylogenetic tree (H). Please see Table 2 for more details.

Alpha diversity evaluates the diversity within a sample, including richness and evenness measurements. Several software packages can be used to calculate alpha diversity, including QIIME, the R package vegan (Oksanen et al., 2007), and USEARCH. The alpha diversity values of samples in each group can be visually compared using boxplots (Fig. 3A). The differences in alpha diversity among or between groups can be statistically evaluated using Analysis of Variance (ANOVA), the Mann-Whitney U test, or the Kruskal-Wallis test. It is important to note that P-values should be adjusted for multiple testing if each group is compared more than twice. Other visualization methods for alpha diversity indices are described in Table 2.
Beta diversity evaluates differences in the microbiome among samples and is normally combined with dimensionality reduction methods such as principal coordinate analysis (PCoA), non-metric multidimensional scaling (NMDS), or constrained principal coordinate analysis (CPCoA) to obtain visual representations. These analyses can be implemented in the R vegan package and visualized in scatter plots (Fig. 3B and Table 2). The statistical differences between these beta-diversity indices can be computed using permutational multivariate analysis of variance (PERMANOVA) with the adonis() function in vegan (Oksanen et al., 2007).
Taxonomic composition describes the microbiota that are present in a microbial community, which is often visualized using a stacked bar plot (Fig. 3C and Table 2).
For simplicity, the microbiota is often shown at the phylum or genus level in the plot.
Difference comparison is used to identify features (such as species, genes, or pathways) with significantly different abundances between groups using Welch’s t-test, the Mann-Whitney U test, the Kruskal-Wallis test, or tools such as ALDEx2, edgeR (Robinson et al., 2010), STAMP (Parks et al., 2014), or LEfSe (Segata et al., 2011). The results of difference comparison can be visualized using a volcano plot, Manhattan plot (Fig. 3D), or extended error bar plot (Table 2). It is important to note that this type of analysis is prone to producing false positives, because an increase in the relative abundance of some features necessarily decreases the relative abundance of others. Several methods have been developed to obtain taxonomic absolute abundance in samples, such as the integration of HTS and flow cytometric enumeration (Vandeputte et al., 2017), and the integration of HTS with spike-in plasmid and quantitative PCR (Tkacz et al., 2018; Guo et al., 2020; Wang et al., 2020b).
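The diversity measures and the relative-abundance caveat discussed above can be sketched in a few lines of Python. This is an illustrative sketch with invented counts, not the implementation used by QIIME, vegan, or USEARCH; it assumes the Shannon index with natural logarithms for alpha diversity and Bray-Curtis dissimilarity for beta diversity, and all function names are hypothetical.

```python
import math


def richness(counts):
    """Number of features observed in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index (natural log), reflecting richness and evenness."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))


def relative_abundance(counts):
    """Convert counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Invented counts for four features in two samples.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
print(richness(even), richness(skewed))  # same richness in both samples
print(shannon(even), shannon(skewed))    # the even sample scores higher
print(bray_curtis(even, skewed))

# Compositional effect: feature 2 is unchanged in absolute terms, yet its
# relative abundance halves because feature 1 tripled.
before = relative_abundance([100, 100])
after = relative_abundance([300, 100])
print(before[1], after[1])
```

The last two lines show why relative-abundance comparisons can flag a feature as decreased even though its absolute abundance did not change, which is the motivation for the absolute quantification methods cited above.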

Table 3

Useful websites or tools for reproducible analysis

Resource	Links	Description
GSA	http://gsa.big.ac.cn	HTS data deposition and sharing. Fast data transfer, interfaces in both Chinese and English, automated submission, technical support via email or QQ group, and widely recognized by international journals
Qiita	https://qiita.ucsd.edu	Platform for amplicon data deposition, analysis, and cross-study comparisons
MGnify	https://www.ebi.ac.uk/metagenomics	Webserver for amplicon and metagenomic data deposition, sharing, analysis, and cross-study comparisons
gcMeta	https://gcmeta.wdcm.org	Webserver for amplicon and metagenomic data analysis, deposition, and sharing
R Markdown	https://rmarkdown.rstudio.com	Uses a productive notebook interface to weave together narrative text and code to produce an elegantly formatted report in HTML or PDF format. Is becoming increasingly popular in microbiome research
R Graph Gallery	https://www.r-graph-gallery.com	R code for 42 chart types
GitHub	https://github.com	Online code-saving and sharing platforms with version control systems. Supports searching

Correlation analysis is used to reveal the associations between taxa and sample metadata (Fig. 3E). For example, it is used to identify associations between taxa and environmental factors, such as pH, longitude and latitude, and clinical indices, or to identify key environmental factors that affect microbiota and dynamic taxa in a time series (Edwards et al., 2018).

Network analysis explores the co-occurrence of features from a holistic perspective (Fig. 3F). The properties of a correlation network might represent potential interactions between co-occurring taxa or functional pathways. Correlation coefficients and their P-values can be computed using the cor.test() function in R or more robust tools that are suitable for compositional data, such as the SparCC (sparse correlations for compositional data) package (Kurtz et al., 2015). Networks can also be visualized and analyzed using the R library igraph (Csardi and Nepusz, 2006), Cytoscape (Saito et al., 2012), or Gephi (Bastian et al., 2009). There are several good examples of network analysis, such as studies exploring the distribution of phyla or modules (Fan et al., 2019) or showing trends at different time points (Wang et al., 2019).

Machine learning is a branch of artificial intelligence that learns from data, identifies patterns, and makes decisions (Fig. 3G). In microbiome research, machine learning is used for taxonomic classification, beta-diversity analysis, binning, and compositional analysis of particular features. Commonly used machine learning methods include random forest (Vangay et al., 2019; Qian et al., 2020), Adaboost (Wilck et al., 2017), and deep learning (Galkin et al., 2018) to classify groups by selecting biomarkers, or regression analysis to show experimental condition-dependent changes in biomarker abundance (Table 2).
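As a small illustration of the correlation step described above, the following Python sketch computes a Spearman rank correlation between the relative abundance of one taxon and an environmental factor. It is a minimal standard-library example under stated assumptions: the pH and abundance values are hypothetical, and no P-value or compositional correction is computed, so it does not replace cor.test() or SparCC.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: soil pH and relative abundance of one taxon in six samples.
ph = [5.1, 5.8, 6.4, 6.9, 7.5, 8.0]
abundance = [0.02, 0.05, 0.04, 0.10, 0.15, 0.21]

rho = spearman(ph, abundance)
print(round(rho, 3))
```

Working on ranks makes the coefficient robust to the skewed distributions typical of abundance data, which is one reason rank-based correlations are often preferred in this setting.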
Treemap is widely used for phylogenetic tree construction and for taxonomic annotation and visualization of the microbiome (Fig. 3H). Representative amplicon sequences are readily used for phylogenetic analysis. We recommend using IQ-TREE (Nguyen et al., 2014) to quickly build high-confidence phylogenetic trees from big data, and iTOL (Letunic and Bork, 2019) for online visualization. Annotation files for the tree can easily be generated using the R script table2itol (https://github.com/mgoeker/table2itol). In addition, we recommend using GraPhlAn (Asnicar et al., 2015) to visualize the phylogenetic tree or hierarchical taxonomy in an attractive cladogram.

Researchers may also be interested in examining microbial origin to address issues such as the origin of gut microbiota and river pollution, as well as for forensic testing. FEAST (Shenhav et al., 2019) and SourceTracker (Knights et al., 2011) were designed to unravel the origins of microbial communities. If researchers would like to focus on the regulatory relationship between genetic information from the host and microorganisms (Wang et al., 2018a), genome-wide association analysis (GWAS) might be a good choice (Wang et al., 2016).

Reproducible analysis

Reproducible analysis requires that researchers submit their data and code along with their publications instead of merely describing their methods. Reproducibility is critical for microbiome analysis because results cannot be reproduced without raw data, detailed sample metadata, and analysis code. If readers can run the code, they will better understand what has been done in the analyses. We recommend that researchers share their sequencing data, metadata, analysis code, and detailed statistical reports using the following steps:

Upload and share raw data and metadata in a data center

Amplicon or metagenomic sequencing generates a large volume of raw data. Normally, raw data must be uploaded to data centers such as NCBI, EBI, and DDBJ during publication. In recent years, several repositories have also been established in China to provide data storage and sharing services. For example, the Genome Sequence Archive (GSA) established by the Beijing Institute of Genomics, Chinese Academy of Sciences (Wang et al., 2017; Members, 2019) offers several advantages (Table 3). We recommend that researchers upload raw data to one of these repositories, which not only provides a backup but also meets the requirements for publication. Several journals, such as Microbiome, require raw data to be deposited in a repository before the manuscript is submitted.

Share pipeline scripts with other researchers

Pipeline scripts can help reviewers or readers evaluate the reproducibility of experimental results. We provide sample pipeline scripts for amplicon and metagenome analyses at https://github.com/YongxinLiu/Liu2020ProteinCell. The running environment and software versions used in the analysis should also be provided to help ensure reproducibility. If Conda is used to deploy software, the command “conda env export -n environment_name > environment.yaml” generates a file listing the software used and its versions for reproducible usage. For users who are not familiar with the command line, webservers such as Qiita (Gonzalez et al., 2018), MGnify (Mitchell et al., 2020), and gcMeta (Shi et al., 2019b) can be used to perform analysis. However, webservers are less flexible than the command line mode because they provide fewer adjustable steps and parameters.
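For analysis steps written in Python, the running environment can also be recorded from within the script itself. The sketch below is a minimal illustration and not part of the authors' pipeline: the function name describe_environment and the package names in the example are assumptions, and it relies only on the standard-library modules platform and importlib.metadata.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect the Python version, platform, and versions of named packages."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report["packages"][name] = None
    return report


# Hypothetical list of packages used by an analysis script.
report = describe_environment(["numpy", "pandas"])
print(json.dumps(report, indent=2))
```

Saving this JSON next to the results gives readers a record of the versions that produced them, complementing the environment file exported by Conda.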

Provide detailed statistical and visualization reports

Tools such as Excel, GraphPad, and SigmaPlot are often used for statistical analysis and visualization of feature tables, but they are commercial software, and results produced with them are difficult to reproduce quickly. We recommend using tools such as R Markdown or Python notebooks to track all analysis code and parameters, and storing them in a version control system such as GitHub (Table 3). These tools are free, open-source, cross-platform, and easy to use. We recommend that researchers record all scripts and results of statistical analysis and visualization in R Markdown files. An R Markdown document is a fully reproducible report that includes code, tables, and figures in HTML/PDF format. This working mode would greatly improve the efficiency of microbiome analysis and make the analysis process transparent and easier to understand. R visualization code can be adapted from the R Graph Gallery (Table 3). The input files (feature tables + metadata), analysis notebook (*.Rmd), and output results (figures, tables, and HTML reports) of the analysis can be uploaded to GitHub, which allows peers to repeat the analyses or reuse the analysis code. ImageGP (http://www.ehbio.com/ImageGP) provides more than 20 statistical and visualization methods, making it a good choice for researchers without a background in R.

Notes and perspectives

It is worth noting that experimental operations have a far greater impact on the results of a study than the pipeline chosen for analysis (Sinha et al., 2017). It is advisable to record detailed experimental processes as metadata, including sampling method, time, location, operators, DNA extraction kit, batch, primers, and barcodes. The metadata can be used for downstream analyses and help researchers to determine whether these operational differences contribute to false-positive results (Costea et al., 2017). Some specific experimental steps could be used to provide a unique perspective on microbiome analysis. For example, the development and use of methods to remove the host DNA can effectively increase the proportion of the microbiome in plant endophytes (Carrión et al., 2019) and human respiratory infection samples (Charalampous et al., 2019). A large amount of relic DNA in soil can be physically removed with propidium monoazide (Carini et al., 2016). In addition, when using samples with low microbial biomass, researchers must be particularly careful to avoid false-positive results due to contamination (de Goffau et al., 2019). For these situations, DNA-free water should be used as a negative control. In human microbiome studies, the major differences in microbiome composition among individuals are due to factors such as diet, lifestyle, and drug use, such that the heritability is less than 2% (Rothschild et al., 2018). For recommendations about information that should be collected, please refer to minimum information about a marker gene sequence (MIMARKS) and minimum information about a metagenome sequence (MIMS) (Field et al., 2008; Yilmaz et al., 2011), minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea (Bowers et al., 2017), and minimum information about an uncultivated virus genome (MIUViG) (Roux et al., 2019).
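The experimental metadata fields listed above can be checked for completeness before analysis. The following Python sketch is a hypothetical illustration: the column names and sample values are assumptions chosen to mirror the fields named in the text, not a format prescribed by MIMARKS or any repository.

```python
import csv
import io

# Hypothetical column names mirroring the fields named in the text.
REQUIRED_FIELDS = [
    "sample_id", "sampling_method", "sampling_time", "location",
    "operator", "dna_extraction_kit", "batch", "primers", "barcode",
]


def find_missing_metadata(tsv_text):
    """Return {sample_id: [missing fields]} for a tab-separated metadata table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = {}
    for row in reader:
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            problems[row.get("sample_id") or "<no id>"] = missing
    return problems


# Two made-up samples; the second one lacks a batch entry.
rows = [
    REQUIRED_FIELDS,
    ["S1", "soil core", "2020-05-01", "Beijing", "op1", "kitA", "b1", "799F-1193R", "ACGTAC"],
    ["S2", "soil core", "2020-05-01", "Beijing", "op2", "kitA", "", "799F-1193R", "TGCATG"],
]
metadata_tsv = "\n".join("\t".join(row) for row in rows)

problems = find_missing_metadata(metadata_tsv)
print(problems)
```

Running such a check when samples are registered makes it easier to test later whether batch, kit, or operator differences explain an unexpected result.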
In the early stage of microbiome research, data-driven studies provided the basic components and conceptual framework of the microbiome. However, with the development of experimental tools, more hypothesis-driven studies are needed to dissect the causal relationships between the microbiome and host phenotypes. Shotgun metagenomic sequencing can provide insights into microbial community structure at the strain level, but it is difficult to recover high-quality genomes (Bishara et al., 2018). Single-cell genome sequencing shows very promising applications in microbiome research (Xu and Zhao, 2018). Based on flow cytometry and single-cell sequencing, MetaSort can recover high-quality genomes from sorted sub-metagenomes (Ji et al., 2017). Recently developed third-generation sequencing techniques have been used for metagenome analysis, including Pacific Biosciences (PacBio) single molecule real time sequencing and the Oxford Nanopore Technologies sequencing platform (Bertrand et al., 2019; Stewart et al., 2019; Moss et al., 2020). With improvements in sequencing data quality and decreasing costs, these techniques will lead to a technological revolution in the field of microbiome sequencing and bring microbiome research into a new era.

Conclusion

In this review, we discussed methods for analyzing amplicon and metagenomic data at all stages, from the selection of sequencing methods, analysis software/pipelines, statistical analysis and visualization to the implementation of reproducible analysis. Other methods such as metatranscriptome, metaproteome, and metabolome analysis may provide a better perspective on the dynamics of the microbiome, but these methods have not been widely accepted due to their high cost and the complex experimental and analysis methods required. With the further development of these technologies in the future, a more comprehensive view of the microbiome could be obtained.

129 in total

1. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies.

Authors: Anna Klindworth; Elmar Pruesse; Timmy Schweer; Jörg Peplies; Christian Quast; Matthias Horn; Frank Oliver Glöckner
Journal: Nucleic Acids Res Date: 2012-08-28 Impact factor: 16.971

2. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data.

Authors: Kathrin P Aßhauer; Bernd Wemheuer; Rolf Daniel; Peter Meinicke
Journal: Bioinformatics Date: 2015-05-07 Impact factor: 6.937

3. MetaSort untangles metagenome assembly by reducing microbial community complexity.

Authors: Peifeng Ji; Yanming Zhang; Jinfeng Wang; Fangqing Zhao
Journal: Nat Commun Date: 2017-01-23 Impact factor: 14.919

4. Salt-responsive gut commensal modulates T_H17 axis and disease.

Authors: Nicola Wilck; Mariana G Matus; Sean M Kearney; Scott W Olesen; Kristoffer Forslund; Hendrik Bartolomaeus; Stefanie Haase; Anja Mähler; András Balogh; Lajos Markó; Olga Vvedenskaya; Friedrich H Kleiner; Dmitry Tsvetkov; Lars Klug; Paul I Costea; Shinichi Sunagawa; Lisa Maier; Natalia Rakova; Valentin Schatz; Patrick Neubert; Christian Frätzer; Alexander Krannich; Maik Gollasch; Diana A Grohme; Beatriz F Côrte-Real; Roman G Gerlach; Marijana Basic; Athanasios Typas; Chuan Wu; Jens M Titze; Jonathan Jantsch; Michael Boschmann; Ralf Dechend; Markus Kleinewietfeld; Stefan Kempa; Peer Bork; Ralf A Linker; Eric J Alm; Dominik N Müller
Journal: Nature Date: 2017-11-15 Impact factor: 49.962

5. The Fast Track for Microbiome Research.

Authors: Kang Ning; Yigang Tong
Journal: Genomics Proteomics Bioinformatics Date: 2019-04-26 Impact factor: 7.691

6. MGnify: the microbiome analysis resource in 2020.

Authors: Alex L Mitchell; Alexandre Almeida; Martin Beracochea; Miguel Boland; Josephine Burgin; Guy Cochrane; Michael R Crusoe; Varsha Kale; Simon C Potter; Lorna J Richardson; Ekaterina Sakharova; Maxim Scheremetjew; Anton Korobeynikov; Alex Shlemov; Olga Kunyavskaya; Alla Lapidus; Robert D Finn
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

7. Improved metagenomic analysis with Kraken 2.

Authors: Derrick E Wood; Jennifer Lu; Ben Langmead
Journal: Genome Biol Date: 2019-11-28 Impact factor: 17.906

8. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

Authors: Christian Quast; Elmar Pruesse; Pelin Yilmaz; Jan Gerken; Timmy Schweer; Pablo Yarza; Jörg Peplies; Frank Oliver Glöckner
Journal: Nucleic Acids Res Date: 2012-11-28 Impact factor: 16.971

9. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

10. Dysbiosis of maternal and neonatal microbiota associated with gestational diabetes mellitus.

Authors: Jinfeng Wang; Jiayong Zheng; Wenyu Shi; Nan Du; Xiaomin Xu; Yanming Zhang; Peifeng Ji; Fengyi Zhang; Zhen Jia; Yeping Wang; Zhi Zheng; Hongping Zhang; Fangqing Zhao
Journal: Gut Date: 2018-05-14 Impact factor: 23.059
