
Oral microbiome research - A Beginner's glossary.

Priya Nimish Deo¹, Revati Shailesh Deshmukh¹.

Abstract

The oral microbiome plays a key role in the etiology of oral diseases and is also linked to many diseases in other parts of the body. This makes the oral microbiome an area of interest for researchers globally. Meticulous planning of the research project is the first and most crucial step in conducting an oral microbiome study. Beginners in this field need to be familiar with the terminology used in oral microbiome research. The purpose of this article is to familiarize new researchers with the frequently used terms in the field of oral microbiome research.


Keywords: Microbiome; metagenomics; sequencing

Year: 2022 PMID: 35571306 PMCID: PMC9106258 DOI: 10.4103/jomfp.jomfp_455_21

Source DB: PubMed Journal: J Oral Maxillofac Pathol ISSN: 0973-029X

MICROBIOME

The term microbiome was coined by scientist Joshua Lederberg, a Nobel Prize laureate, to describe the ecological community of symbiotic, commensal and pathogenic micro-organisms.[1]

MICROBIOTA

Microbiota refers to the assembly of micro-organisms present in a defined environment. The term “microbiota” was first defined by Lederberg and McCray, who pointed out the significance of micro-organisms inhabiting the human body in health and disease states.[2]

METAGENOMICS

Metagenomics is the direct analysis of genomes obtained from different environments. The term metagenomics is often used interchangeably with 16S ribosomal RNA (rRNA) sequencing, but the two differ: 16S rRNA sequencing is a marker gene approach that does not target the whole genome, while metagenomics is a shotgun sequencing approach for the genomic analysis of the microbes from a particular environment. Metagenomics catalogs all micro-organisms, both culturable and nonculturable, from complex environmental samples.[3]

META-TRANSCRIPTOMICS

Meta-transcriptomics refers to the study of the genes that are expressed as a whole by a community.[4] It reveals information about transcriptionally active populations rather than just the genetic content of bacterial populations, which is what metagenomic analysis shows.[5]

METAPROTEOMICS

Meta-proteomics is an upcoming complementary approach for metagenomics and meta-transcriptomics. It is used to analyze the function of microbial communities. The term “metaproteomics” was defined in 2004 as “the large-scale characterization of the entire protein complement of environmental microbiota, at a given point in time.” It is a dynamic tool to study the presence and abundance of proteins in oral microbiome samples.[6]

METATAXONOMICS

Amplicon metataxonomics generally targets the 16S rRNA gene because its sequence is similar enough across microbiome taxa to be amplified by universal polymerase chain reaction (PCR) primers, yet distinct enough to be used for taxonomic classification of species.[7]

16S RIBOSOMAL RNA GENES

The 16S rRNA gene is present in all prokaryotic organisms.[8] It is around 1500 base pairs in length and contains nine hypervariable regions (V1-V9) that can be used for bacterial identification.[9]

ALPHA DIVERSITY

Alpha diversity is the diversity within a single sample, for example, a saliva sample. The three alpha diversity indices commonly used in research are the Chao 1 index, the Shannon-Wiener index and the Simpson index.[10]
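As an illustration only, the three indices named above can be sketched from a list of per-taxon read counts for one sample. The function names are hypothetical, the Simpson index is reported here in its 1 - sum(p^2) form, and Chao 1 uses the bias-corrected formula; published tools may use other variants.

```python
import math

def shannon_index(counts):
    """Shannon-Wiener index H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson diversity in the 1 - sum(p_i^2) form (an assumed convention)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts if c > 0)

def chao1_index(counts):
    """Bias-corrected Chao 1 richness estimate:
    S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of taxa seen exactly once and twice."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Hypothetical read counts for four taxa in one saliva sample.
saliva_counts = [10, 10, 10, 10]
print(shannon_index(saliva_counts), simpson_index(saliva_counts), chao1_index(saliva_counts))
```

A perfectly even sample of four taxa gives a Shannon index of ln(4) and a Simpson value of 0.75; rare taxa (singletons) raise the Chao 1 estimate above the observed richness.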

BETA DIVERSITY

Beta diversity describes differences in the microbiota between samples or groups. It is used to study whether the differences in microbiota composition between groups are significant. The two common indices to measure beta diversity are Bray-Curtis dissimilarity and UniFrac distance.[10]
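A minimal sketch of Bray-Curtis dissimilarity between two samples, assuming each sample is given as a dictionary of taxon name to read count. The function name and the example taxa and counts are hypothetical. UniFrac is not shown because it additionally requires a phylogenetic tree.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).

    Returns 0.0 for identical samples and 1.0 for samples sharing no taxa.
    """
    taxa = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1.0 - 2.0 * shared / total

# Hypothetical counts for two samples.
healthy = {"Streptococcus": 6, "Veillonella": 4}
diseased = {"Streptococcus": 2, "Veillonella": 8}
print(bray_curtis(healthy, diseased))
```

Testing whether a between-group difference is significant needs a further statistical step on the resulting distance matrix, which this sketch does not cover.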

RICHNESS AND EVENNESS OF SPECIES

Richness is a measure of the number of different kinds of micro-organisms in a particular community. Evenness compares the similarity (homogeneity) of the population size of each species.[11]
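The two ideas can be sketched numerically as follows. The source does not name a specific evenness index, so Pielou's evenness (Shannon index divided by the natural log of richness) is used here purely as an assumed example; the function names are hypothetical.

```python
import math

def richness(counts):
    """Number of taxa observed at least once in the sample."""
    return sum(1 for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J = H' / ln(S): 1.0 when all taxa are equally abundant,
    approaching 0 when one taxon dominates. Undefined for fewer than two taxa."""
    s = richness(counts)
    if s < 2:
        raise ValueError("evenness needs at least two observed taxa")
    total = sum(counts)
    shannon = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return shannon / math.log(s)

# Two hypothetical communities with the same richness but different evenness.
print(richness([25, 25, 25, 25]), pielou_evenness([25, 25, 25, 25]))
print(richness([97, 1, 1, 1]), pielou_evenness([97, 1, 1, 1]))
```

Both example communities have a richness of four, but the second is dominated by a single taxon and therefore scores much lower on evenness.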

PIPELINE

A pipeline is a defined sequence of processing steps used to convert raw data into meaningful data.[12]

DNA SEQUENCING

DNA sequencing is the process of determining the exact order of nucleotides (adenine, guanine, cytosine and thymine) in a DNA molecule.

NEXT-GENERATION SEQUENCING

Next-generation sequencing (NGS) is a comprehensive term used to describe:[13]

Template preparation of the genomic DNA for downstream analysis

Generation of millions or billions of short DNA sequences, called reads, in a massively parallel manner

Alignment of the reads to sequences from a known database

Assembly of the aligned sequences and discovery of new genetic variants.

Different NGS platforms are available for sequencing millions of DNA fragments. It is a high-throughput method. Individual fragments of DNA are mapped to reference databases and analyzed by bioinformatics.[14]

AMPLICON SEQUENCING

It is the ultra-deep sequencing of PCR amplification products for the analysis of genetic variation.[15]

DE NOVO SEQUENCING

De novo sequencing is the generation of the first genetic sequence for a micro-organism that has no prior sequence data.[12]

WHOLE GENOME SEQUENCING

It is an alternative approach to 16S rRNA sequencing. It uses random primers to sequence overlapping regions of a genome. The taxa are more accurately defined at the species level using whole-genome sequencing (WGS). WGS requires extensive data analysis.[16]

SHOTGUN SEQUENCING

Shotgun sequencing is a process in which a long DNA molecule is randomly broken into fragments, which are then sequenced. Each DNA fragment comes from a different position in the long DNA molecule.[17]

DNA AMPLICONS

DNA amplicons are sections or fragments of DNA that are the products of amplification. PCR is the most important method for amplicon generation. These amplification products are then sequenced and compared with known microbiome databases.[18] PCR amplification produces thousands to millions of amplicons of the target DNA. These amplicons are then sequenced using high-throughput sequencing, yielding nucleotide sequences called reads.[19]

READS

Shotgun and NGS procedures involve shredding the genomic DNA into smaller fragments, which are then sequenced. The raw sequenced fragments are known as reads.[20]

FRAGMENT READ

A fragment read is a read produced from a fragment library. Such reads are generated from a single end of a small DNA fragment, in the order of 100–500 base pairs depending on the sequencing platform.

Fragment paired-end reads: two reads, one produced from each end of a DNA fragment from a fragment library.

Mate-paired reads: two reads, one produced from each end of a large DNA fragment with a predefined size range.[21]

COVERAGE

Coverage is the number of times a nucleotide base in the target genome is covered by sequenced reads. For example, 30× coverage means that every base pair of the reference genome was covered by approximately 30 reads on average.[21]
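As a rough worked example, mean coverage is commonly estimated as (number of reads × read length) / genome length, and per-base depth can be counted from read start positions. The function names and the numbers below are illustrative assumptions, not values from the article.

```python
def mean_coverage(num_reads, read_length, genome_length):
    """Expected average coverage: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_length

def per_base_depth(genome_length, reads):
    """Count how many reads cover each position.

    `reads` is a list of (start, length) tuples using 0-based start positions.
    """
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth

# 2 million reads of 150 bp over a hypothetical 10 Mb genome.
print(mean_coverage(2_000_000, 150, 10_000_000))

# Two overlapping 5-bp reads on a toy 10-bp genome.
print(per_base_depth(10, [(0, 5), (3, 5)]))
```

The average hides local variation: in the toy genome, positions 3 and 4 are covered twice while the last two positions are not covered at all.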

DNA BARCODE

A DNA barcode is a DNA sequence used to identify a target molecule during DNA sequencing. DNA barcode libraries are classified into two groups: randomly generated libraries and rationally designed libraries. Randomly generated libraries are produced by physically assembling oligonucleotides in a pool. Rationally designed libraries are designed using computer modeling (in silico) and then manufactured.[22] The fragments of DNA sequence that make it possible to identify unknown species are called DNA barcodes, and the process is described as DNA barcoding.[23]

ADAPTORS/ADAPTOR SEQUENCES

They are short oligonucleotide sequences which are ligated at the ends of DNA fragments of interest. This is done to combine with primers for amplification. This is a part of library preparation.[24]

ADAPTOR TRIMMING

Adaptor trimming is an essential step in analyzing NGS data when reads are longer than the target DNA/RNA fragments. Short oligonucleotides called adapter sequences are ligated to the ends of DNA fragments of interest so that primers can be used to amplify them. When the sequencing read length is greater than that of the target DNA, the adapter sequence is read out, sometimes only partially, next to the unknown target DNA sequence. It is critical to identify and trim the adapter sequence to recover the target DNA sequence.[24]
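The idea can be sketched with exact string matching. This is a simplified illustration: the function name, the `min_overlap` parameter and the example adapter are assumptions, and dedicated trimming tools additionally tolerate sequencing errors in the adapter.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    Handles a full adapter inside the read and a partial adapter (a prefix of
    the adapter) at the very end of the read.
    """
    # Case 1: the complete adapter was read out after the target sequence.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ended partway through the adapter.
    max_len = min(len(adapter) - 1, len(read))
    for overlap in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter detected; return the read unchanged.
    return read

ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))  # full adapter present
print(trim_adapter("ACGTACGTAGATC", ADAPTER))             # partial adapter at the end
```

A very small `min_overlap` risks trimming genuine target bases that happen to match the start of the adapter, so the threshold is a trade-off.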

LIBRARY PREPARATION

The conventional NGS library preparation protocol consists of three basic steps:[25]

Fragmentation: This is the first step in library preparation. The DNA molecules are mechanically or enzymatically fragmented into small, uniform fragments of around 200–400 base pairs

Adaptor ligation: The sequencing adaptors are ligated (tied) to the fragments

Amplification: After PCR amplification, the DNA library goes through several quality control steps before being loaded into the NGS machine.

Good library preparation is of utmost importance for generating good sequence depth and coverage. Different methods are available to achieve this goal.[26]

RAREFACTION

Rarefaction is a method for adjusting for differences in library sizes across samples to make alpha diversity comparisons easier. Sanders proposed rarefaction in 1968. It entails selecting a number of reads equal to or less than the number of reads in the smallest sample, then discarding reads from larger samples at random until the number of reads remaining in each sample equals this threshold. Diversity metrics can then be calculated on these equal-sized subsamples to compare the ecosystems ‘fairly’, regardless of differences in sample size.[27]
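The subsampling step described above can be sketched as follows, assuming each sample is a dictionary of taxon name to read count and that the rarefaction depth is set to the size of the smallest sample. The function names, the fixed random seed and the example counts are illustrative assumptions.

```python
import random

def rarefy(counts, depth, seed=None):
    """Randomly keep `depth` reads from one sample, without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    # Expand counts into one entry per read, then subsample.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rarefied = {}
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied

def rarefy_all(samples, seed=0):
    """Rarefy every sample to the read count of the smallest sample."""
    depth = min(sum(counts.values()) for counts in samples.values())
    return {name: rarefy(counts, depth, seed) for name, counts in samples.items()}

# Hypothetical samples with 100 and 20 reads.
samples = {
    "plaque": {"taxon_a": 50, "taxon_b": 50},
    "saliva": {"taxon_a": 10, "taxon_b": 5, "taxon_c": 5},
}
for name, counts in rarefy_all(samples).items():
    print(name, sum(counts.values()), counts)
```

After rarefaction both samples contain 20 reads, so alpha diversity indices computed on them are no longer driven by sequencing depth alone, at the cost of discarding data from the larger sample.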

FASTQ

FASTQ is the most common output sequence data format from NGS platforms. It is a text-based format.

FASTA format: The FASTA format is a text-based format for storing DNA and amino acid sequences. A FASTA file starts with a single line that describes the sequence. The greater-than symbol (>) at the beginning of the line distinguishes description lines from sequence lines. It is recommended that lines be no longer than 80 characters. The name or a unique identifier for the sequence, as well as other information, is usually included in the description line. The structure of this header and the information it contains are not standardized; each sequence database uses its own FASTA header format.[28]
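A minimal sketch of reading FASTA-formatted text, following the description above: lines starting with the greater-than symbol are description lines, and the lines that follow hold the sequence, possibly wrapped over several lines. The function name and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated to undo line wrapping.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = ">seq1 hypothetical 16S fragment\nACGT\nTTGA\n>seq2\nGGCC\n"
for description, sequence in parse_fasta(example):
    print(description, sequence)
```

Because header layouts differ between databases, the sketch returns the description line unparsed rather than guessing at its fields.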

SEQUENCE ALIGNMENT

It is a process in which a short DNA sequence read, generally <250 bp, is aligned to a reference genome. The procedure assigns a Phred-scaled quality score to each aligned read, which indicates the confidence of the alignment. This step can also be used to calculate the proportion of mapped reads and the depth of sequencing for one or more loci of interest in the sequenced region. The data are stored in the standard BAM (binary alignment map) file format, which is the binary version of the SAM (sequence alignment map) format.[29]

DNA ASSEMBLY

DNA assembly is defined as the reconstruction of a genome from the large number of short overlapping fragments (reads) produced by a sequencing machine. The length of every read and the number of reads are determined by the type of sequencer.[30]

PHRED SCORE

The Phred score is a score assigned to each base of a raw sequence by the sequencing platform. The scores are determined by using predictors of possible errors.[31] The Phred score is useful for filtering and trimming of sequences.[32] Illumina reads are typically 25-250 nucleotide long sequences generated in the sequencing machine by a reversible-terminator cyclic reaction linked to base-specific colorimetric signals. Reads can be “single reads” or “paired reads”, in which case they represent both ends of the same nucleotide fragment (generally 200-1000 bp long). An internal Illumina software (CASAVA) converts these colorimetric signals into base calls in the FASTQ format. Each nucleotide is associated with an ASCII-encoded quality number corresponding to a PHRED score (Q), which is directly translated into the probability P that the corresponding base call is incorrect using the following equation:[33]

$P = 10^{-Q/10}$
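The standard relation between a Phred score Q and the error probability P is P = 10^(-Q/10), so Q = 20 corresponds to a 1 in 100 chance of an incorrect base call. The sketch below decodes a FASTQ quality string under the assumption of the Phred+33 ASCII offset; older data may use a different offset. The function names and the example quality string are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def decode_quality_string(quality, offset=33):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]

def mean_error_probability(quality, offset=33):
    """Average per-base error probability across a read."""
    scores = decode_quality_string(quality, offset)
    return sum(phred_to_error_probability(q) for q in scores) / len(scores)

quality_line = "IIII5555"  # hypothetical fourth line of a FASTQ record
print(decode_quality_string(quality_line))
print(mean_error_probability(quality_line))
```

Filtering and trimming steps typically discard bases or reads whose scores fall below a chosen threshold; the threshold itself is a study-specific choice.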

CHIMERA

Chimeras are hybrid products of multiple parent sequences that are misinterpreted as new organisms, inflating the appearance of diversity.[34] They are caused by incomplete template extension and appear to be recombinants of dissimilar sequences.[35] During PCR amplification, some amplified sequences can be produced from multiple parent sequences, resulting in chimeras. Chimeric sequences matter for alpha diversity estimates because they are technical artifacts rather than actual members of the community.[36]

OPERATIONAL TAXONOMIC UNITS

Operational taxonomic units (OTUs) are the common currency of marker gene or 16S rRNA gene studies.[37]

OTU table: Marker gene sequence reads are typically clustered based on sequence similarity, on the assumption that sequences with greater similarity represent more phylogenetically similar organisms. Clustering facilitates taxonomy-independent analyses and reduces the computational resources such analyses require. These clusters, also known as OTUs, are a common analytical unit in microbial ecology.[38]
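Similarity-based clustering can be illustrated with a toy greedy scheme: each read joins the first cluster whose representative (centroid) it matches above an identity threshold, otherwise it starts a new cluster. This is a sketch under strong assumptions: sequences are equal-length and already aligned, identity is the fraction of matching positions, and the default 0.97 threshold is a widely used convention rather than a value from the article. Function names are hypothetical, and real OTU-picking tools use more elaborate alignment and ordering.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length, aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def cluster_otus(sequences, threshold=0.97):
    """Greedy centroid clustering: returns a list of clusters (lists of reads).

    The first read placed in a cluster serves as that cluster's centroid.
    """
    centroids, otus = [], []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            # No existing centroid is similar enough; start a new OTU.
            centroids.append(seq)
            otus.append([seq])
    return otus

reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]  # toy 10-bp reads
for number, members in enumerate(cluster_otus(reads, threshold=0.9), start=1):
    print(f"OTU{number}: {len(members)} read(s)")
```

Counting the reads per cluster for each sample is what produces the rows and columns of an OTU table.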

ANNOTATION

Genome annotation entails attaching biologically relevant information to genome sequences by analyzing their structure and composition, as well as taking into account what is known from closely related species that can be used as a reference.[39] It is the process of identifying functional elements along a genome's sequence, thus giving it meaning. It is required because DNA sequencing generates sequences with unknown functions.[40]

BLAST

BLAST stands for Basic Local Alignment Search Tool. It is the most commonly used tool for calculating sequence similarity. Different variations of BLAST are available for different sequence comparisons, e.g., a DNA query against a DNA database or a protein query against a protein database.[41]

DENOISING

Denoising aims to filter noisy reads, reduce repetition, remove singletons and chimeric sequences, and correct errors in marginal sequences. It is a prerequisite step before OTU clustering.[42]

CLADOGRAM

A cladogram is defined broadly as ‘any branching diagram, graph or written statement that depicts the relationship between three or more taxa’.[43]

INTERACTIVE TREE OF LIFE

Interactive Tree Of Life (iTOL) is a web-based application for viewing, manipulating and annotating phylogenetic trees. iTOL was one of the first tools to allow trees to be annotated with various types of extra data.[44]

PHYLOGENY (PHYLOGENETIC TREE)

It is a graphical representation of hypothesized relationships based on genetic differences between sequences.[45] It is a diagram that depicts the relations between taxa (or sequences) and their presumed common ancestors (Nei and Kumar 2000; Felsenstein 2004; Hall 2011). The majority of phylogenetic trees today are based on molecular data, such as DNA or protein sequences. The goals of today's phylogenetic trees include understanding the relationships among the sequences without regard to the host species and inferring the functions of genes that have not been experimentally studied (Hall et al. 2009). There are four steps to constructing a phylogenetic tree: (Step 1) find and acquire a set of homologous DNA or protein sequences, (Step 2) align those sequences, (Step 3) estimate a tree from the aligned sequences and (Step 4) present that tree in such a way that the relevant information is clearly conveyed to others.[46]

CONCLUSION

A researcher comes across a whole set of new terminology while planning a microbiome study. It is important to use precise terms in research work, with a clear understanding of their meaning. This article will assist in relating the taxonomy and functionality of the oral microbiome. It is intended as a guide for beginners in oral microbiome research.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.
