Isabel Salado1, Alberto Fernández-Gil2, Carles Vilà1, Jennifer A Leonard1.
Abstract
Ecological and conservation genetic studies often use noninvasive sampling, especially with elusive or endangered species. Because microsatellites are generally short in length, they can be amplified from low quality samples such as feces. Microsatellites are highly polymorphic so few markers are enough for reliable individual identification, kinship determination, or population characterization. However, the genotyping process from feces is expensive and time consuming. Given next-generation sequencing (NGS) and recent software developments, automated microsatellite genotyping from NGS data may now be possible. These software packages infer the genotypes directly from sequence reads, increasing throughput. Here we evaluate the performance of four software packages to genotype microsatellite loci from Iberian wolf (Canis lupus) feces using NGS. We initially combined 46 markers in a single multiplex reaction for the first time, of which 19 were included in the final analyses. Megasat was the software that provided genotypes with the fewest errors. Coverage over 100X provided little additional information, but a relatively high number of PCR replicates was necessary to obtain a high quality genotype from highly unoptimized, multiplexed reactions (10 replicates for 18 of the 19 loci analyzed here). This could be reduced through optimization. The use of new bioinformatic tools and next-generation sequencing data to genotype these highly informative markers may increase throughput at a reasonable cost and with a smaller amount of laboratory work. Thus, high throughput sequencing approaches could facilitate the use of microsatellites with fecal DNA to address ecological and conservation questions.
Year: 2021 PMID: 34695152 PMCID: PMC8544849 DOI: 10.1371/journal.pone.0258906
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of program properties.
Description of programs compared in this study, including an outline of the workflow, default parameters, important features, and brief observations on the ease of use.
| Software | AmpliSAS | Megasat | MicNeSs | CHIIMP |
|---|---|---|---|---|
|
| · Web server | · Windows | · Linux | · Windows |
| · Mac OS | · Mac OS* | |||
| · Linux | ||||
| · Linux | ||||
| · Linux | ||||
|
| Yes | Yes | No | No |
|
| Perl | Perl & R | Python 2.7 | R |
|
| 1. | 1. | 1. | 1. |
| 2. | 2. | 2. | 2. | |
| 3. | 3. | 3. | 3. | |
|
| All identical reads added to the coverage of the unique variant (‘dominant’ sequence) and variant freq. calculated (if two highly sequenced variants and similar in sequence, ‘subdominant’ seq. considered). In clustering, variants are aligned to each other to find seq. errors, erroneous variants (artefacts) are identified and removed (filtering) and coverages added to the true ones, and a consensus sequence is created (allele assignment). Cluster only exact length/in frame can be user-defined. | Based on depth ratios of as many as four of the most common length variants among amplification products relative to the most common length variant (A1). Decision process considers the relative size (> or < A1) and difference in size of putative alleles relative to A1. If the sum of the two most common sequence length variants exceeds the min. read depth (default = 50), Megasat will score the genotypes. Decision variables are user-definable. Recommended to review genotype calls (depth vs size histogram plots). | Assigning a pair of asymmetric Gaussians from repeat number (that represent alleles, each characterized by four parameters: a mode, substitutions( | Sequences that pass filters (locus’ primer, repeat motif and length range) and exceed min. read depth will be genotyped. Only sequences accounting for at least a min. of the filtered reads are considered (5%). Potential stutters, artifacts or ambiguous sequences are excluded. After filters, if only one sequence remains, then sample labelled as homozygous; if two or more, heterozygous. Several quality control tables and graphs are generated for manual review. |
|
| · Substitution error rate (%) (clustering) = 1 (Illumina) | · No. mismatches (error tolerance to forward and reverse primers and flanking regions) = 2 | · No. substitutions = 1 | · Min. read depth = 500 |
| · Motif size = [2,5] | ||||
| · Indel error rate (%) (clustering) = 0.001 (Illumina) | ||||
| · Min. no. repeats = 4 | ||||
| · No. threads (multiprocessing) = 1 | ||||
| · Min. no. repeats = 3 | ||||
| · Min. frequency respect to the dominant seq. (%) (subdominant seq.) (clustering) = 10–25 (Illumina) | ||||
| · Min. read depth = 16 | ||||
| · Max. width of the distribution (upper limit for standard deviations) = 5 | ||||
| · Min. amplicon depth (no. reads per amplicon) (filtering) = 100 | ||||
| · Min. fraction retained of the total no. filtered reads (%) = 5 | ||||
| · Max. asymmetry of the distribution (ratio between right and left standard deviation) = 2.5 | ||||
| · Min. per-amplicon frequency (%) (filtering) = 3 | ||||
| · Min. chimera length (filtering) = 10 | ||||
| · Max. no. alleles (filtering) = 10 (2, our study) | ||||
|
| · Demultiplexing by sample | · Demultiplexing by sample | · Demultiplexing by sample and locus | · Demultiplexing by sample |
| · AmpliMERGE | · Format file conversion (fastq -> fasta) | |||
| · AmpliCLEAN | ||||
| · Adapter trimming (cutadapt) | ||||
|
| · Primer file (TXT, CSV) | · Sequence files (FASTA/FASTQ) | · Sequence files (FASTA) | · Sequence files (FASTA/FASTQ) |
| · Sample attributes (CSV) | ||||
| · Sequence files (FASTA/FASTQ (R1 & R2 merged)) | ||||
| · Primer file & locus attributes (CSV) | ||||
| · Locus attributes (CSV) | ||||
| · Known individuals (optional) (CSV) | ||||
| · Named alleles (optional)(CSV) | ||||
|
| · Clustered & filtered sequences (FASTA) | · Summary of genotypes (TXT) | · Summary of genotypes (CSV) | · Summary of genotypes (CSV) |
| · Processed files & samples (CSV) | ||||
| · Histograms (PNG) | ||||
| · Histograms (PDF) | · Allele sequences (FASTA) | |||
| · Summary of genotypes (XLS) | ||||
| · Alignments (FASTA) | ||||
| · Alignments (PNG) | ||||
| · Report (HTML) | ||||
|
|
|
|
|
|
|
| 15 | 16 | 17 | 18 |
|
| Yes, corresponding authors and Google forum available | Yes, corresponding authors | Not possible to contact corresponding authors | Yes, corresponding authors |
|
| · Not intuitive output format. | · GUI | · It does not work with Python 3 | · No need to know R language to run the program; an executable is available after installing required R packages. |
| · AmpliCHECK did not work | · Genotype is not given in the standardized format (length) | |||
| · Only possible to change minimum amplicon depth through the command line, not in the web server. | ||||
| · Last program version 0.3.1 (31-Jan-2020). Last documentation version (10-Jul-2019) | ||||
| · Last program version 1.0 (19-Apr-2017). Last documentation version (Dec-2015) | · Last program version 1.1 (07-Aug-2015). Last documentation version (11-Aug-2015) | |||
| · Last program version 1.0 (19-Nov-2018). Last documentation version (24-Jun-2018) |
†GUI: Graphical user interface;
‡ Most important parameters recommended by authors, most of them are user-definable;
§Pre-processing steps and input files used in this study, following the guidelines of authors;
¶ Alleles reported as (mode, substitutions);
*Operating system not tested in this study.
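The decision rule given for CHIIMP in the table above can be sketched roughly as follows. This is a minimal illustration, not CHIIMP's own code (CHIIMP is written in R): the function name `call_genotype` and its arguments are hypothetical, reads are assumed to have already passed the primer, motif and length filters, the exclusion of stutters and artifacts is left out, and keeping only the two most frequent sequences when more than two remain is an assumption.

```python
from collections import Counter


def call_genotype(reads, min_depth=500, min_fraction=0.05):
    """Call one sample at one locus from pre-filtered read sequences.

    Returns None when the locus cannot be genotyped, otherwise a tuple
    (alleles, zygosity). Defaults mirror the values listed in the table
    (min. read depth = 500, min. fraction of filtered reads = 5%).
    """
    depth = len(reads)
    if depth < min_depth:
        return None  # not enough reads to attempt a call

    # Keep only sequences that account for at least min_fraction of the reads.
    counts = Counter(reads)
    kept = [seq for seq, n in counts.most_common() if n / depth >= min_fraction]
    if not kept:
        return None  # no sequence is frequent enough

    # One remaining sequence -> homozygous; two or more -> heterozygous.
    zygosity = "homozygous" if len(kept) == 1 else "heterozygous"
    alleles = tuple(kept[:2])  # assumption: report the two most frequent
    return alleles, zygosity
```

For example, 300 reads of one sequence, 250 of a second and 10 of a third would be called heterozygous for the first two, because the third falls below the 5% threshold.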
Genotyping error rates among software packages with default settings.
Default minimum read depth: AmpliSAS, 100 reads; Megasat, 50 reads; CHIIMP, 500 reads; MicNeSs, 16 reads. Proportion genotyped: the number of genotypes estimated by each software (N_genotypes) divided by the total number of reference genotypes per locus across all replicates and samples. Genotyping success: the number of genotypes that coincide with the consensus (N_successful) divided by the total number of genotypes estimated by the software per locus across all replicates and samples (N_genotypes). Proportion ADO (allelic dropout): the number of heterozygote genotypes for which only one of the two alleles could be genotyped (N_ADO) divided by the number of heterozygote genotypes in the reference. Proportion FA (false allele): the number of genotypes including a false allele (N_FA) divided by the total number of reference genotypes. Values are shown as mean ± standard error per locus.
| Software | N_genotypes | N_successful | N_ADO | N_FA | Proportion genotyped | Genotyping success | Proportion ADO | Proportion FA |
|---|---|---|---|---|---|---|---|---|
| AmpliSAS | 12 ± 1 | 6 ± 1 | 0 | 6 ± 1 | 0.40 ± 0.04 | 0.47 ± 0.09 | 0 | 0.53 ± 0.09 |
| CHIIMP | 5 ± 1 | 4 ± 1 | 0 | 2 ± 1 | 0.16 ± 0.02 | 0.69 ± 0.10 | 0.01 ± 0.01 | 0.28 ± 0.11 |
| Megasat | 15 ± 2 | 12 ± 2 | 2 ± 1 | 2 ± 1 | 0.50 ± 0.05 | 0.74 ± 0.08 | 0.11 ± 0.04 | 0.15 ± 0.07 |
| MicNeSs | 27 ± 1 | 16 ± 2 | 3 ± 1 | 8 ± 2 | 0.91 ± 0.02 | 0.59 ± 0.06 | 0.19 ± 0.04 | 0.31 ± 0.07 |
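The four proportions defined in the caption can be computed for a single locus along these lines. This is a sketch under stated assumptions: the function name `error_rates`, the dictionary layout (keys are hypothetical `(sample, replicate)` pairs, values are pairs of allele lengths) and the rule used to separate allelic dropout from false alleles are illustrative choices, not the authors' scripts.

```python
def error_rates(calls, reference):
    """Per-locus genotyping metrics, following the caption's denominators.

    calls, reference: dicts mapping (sample, replicate) -> (allele1, allele2).
    A missing key or a None value in `calls` means no genotype was produced.
    `reference` is assumed to be non-empty.
    """
    n_genotypes = n_successful = n_ado = n_fa = 0
    for key, ref in reference.items():
        call = calls.get(key)
        if call is None:
            continue
        n_genotypes += 1
        call, ref = tuple(sorted(call)), tuple(sorted(ref))
        if call == ref:
            n_successful += 1
        elif ref[0] != ref[1] and call[0] == call[1] and call[0] in ref:
            # Heterozygous reference called as homozygous for one true allele.
            n_ado += 1
        else:
            # Any other mismatch contains an allele absent from the reference.
            n_fa += 1

    n_ref = len(reference)
    n_het = sum(1 for a, b in reference.values() if a != b)
    return {
        "proportion_genotyped": n_genotypes / n_ref,
        "genotyping_success": n_successful / n_genotypes if n_genotypes else 0.0,
        "proportion_ado": n_ado / n_het if n_het else 0.0,
        "proportion_fa": n_fa / n_ref,
    }
```

The values reported in the table would then correspond to the mean and standard error of these per-locus results across loci.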
Genotyping error rates among software packages using a minimum read depth of 16.
Proportion genotyped: the number of genotypes estimated by each software (N_genotypes) divided by the total number of reference genotypes per locus across all replicates and samples. Genotyping success: the number of genotypes that coincide with the reference (N_successful) divided by the total number of genotypes estimated by the software per locus across all replicates and samples (N_genotypes). Proportion ADO (allelic dropout): the number of heterozygote genotypes for which only one of the two alleles could be genotyped (N_ADO) divided by the number of heterozygote genotypes in the reference. Proportion FA (false allele): the number of genotypes including a false allele (N_FA) divided by the total number of reference genotypes. Values are shown as mean ± standard error per locus.
| Software | N_genotypes | N_successful | N_ADO | N_FA | Proportion genotyped | Genotyping success | Proportion ADO | Proportion FA |
|---|---|---|---|---|---|---|---|---|
| AmpliSAS | 25 ± 1 | 9 ± 2 | 0 | 16 ± 2 | 0.84 ± 0.04 | 0.36 ± 0.06 | 0 | 0.64 ± 0.06 |
| CHIIMP | 25 ± 1 | 11 ± 2 | 1 ± 0 | 12 ± 2 | 0.84 ± 0.03 | 0.46 ± 0.07 | 0.07 ± 0.02 | 0.49 ± 0.06 |
| Megasat | 22 ± 1 | 16 ± 2 | 3 ± 1 | 2 ± 1 | 0.74 ± 0.04 | 0.72 ± 0.06 | 0.22 ± 0.05 | 0.12 ± 0.04 |
| MicNeSs | 27 ± 1 | 16 ± 2 | 3 ± 1 | 8 ± 2 | 0.91 ± 0.02 | 0.59 ± 0.06 | 0.19 ± 0.04 | 0.31 ± 0.07 |
Fig 1. Multidimensional Scaling (MDS) of genotyping programs.
Axes show distances derived from the distance matrix, which was obtained by comparing the consensus genotypes of each program at a minimum read depth of 16 (S2 Table in S1 File). The closer two programs are, the more similar their results. Model statistics: two-dimensional ratio MDS using majorization; stress-1 value (normalized), 0.061; number of iterations, 17. "Reference" indicates the consensus genotypes used as the reference for comparison.
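The caption does not state how the distance between two programs was defined, so the sketch below assumes one simple possibility: the fraction of shared loci at which two sets of consensus genotypes differ. It also shows Kruskal's stress-1 in its common form (normalised by the squared configuration distances) for checking how well a two-dimensional configuration reproduces a dissimilarity matrix. The MDS software used in the study may normalise differently. The names `genotype_distance` and `stress1` are hypothetical.

```python
import math


def genotype_distance(a, b):
    """Fraction of shared loci with differing genotypes (assumed metric).

    a, b: dicts mapping locus name -> (allele1, allele2).
    """
    loci = [locus for locus in a if locus in b]
    differing = sum(
        1 for locus in loci
        if tuple(sorted(a[locus])) != tuple(sorted(b[locus]))
    )
    return differing / len(loci)


def stress1(dissim, coords):
    """Kruskal's stress-1 of a configuration against a dissimilarity matrix.

    dissim: nested dict, dissim[p][q] = dissimilarity between items p and q.
    coords: dict mapping item -> coordinate tuple in the MDS space.
    """
    names = list(coords)
    num = den = 0.0
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            d = math.dist(coords[p], coords[q])  # distance in the MDS plot
            num += (d - dissim[p][q]) ** 2
            den += d ** 2
    return math.sqrt(num / den)
```

A stress-1 value close to zero, such as the 0.061 reported here, indicates that the plotted distances reproduce the input dissimilarities closely.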
Fig 2. Relation between genotyping success and read depth per locus in Megasat.
Proportion of correct genotypes obtained with Megasat at varying sequencing coverage. With 150 reads, genotyping success was greater than 0.9 (horizontal line). Simulations were performed with 100 random draws of a given number of reads for each locus and PCR replicate. The average of the 100 random draws is shown; error bars indicate standard errors.
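The subsampling procedure described in the caption can be illustrated with the sketch below, which repeatedly draws a fixed number of reads and records how often a genotype caller recovers the reference genotype. Several things here are assumptions: reads are drawn without replacement, reads are represented by their amplicon lengths, and `toy_caller` is a deliberately simple hypothetical stand-in for Megasat, whose actual decision process is more elaborate.

```python
import random
from collections import Counter


def toy_caller(lengths, min_ratio=0.3):
    """Hypothetical caller: second allele accepted if deep enough vs. the first."""
    top = Counter(lengths).most_common(2)
    a1 = top[0][0]
    if len(top) == 2 and top[1][1] / top[0][1] >= min_ratio:
        return tuple(sorted((a1, top[1][0])))
    return (a1, a1)


def subsample_success(reads, reference, caller, depth, n_draws=100, seed=0):
    """Fraction of random draws of `depth` reads that yield the reference genotype."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(reads, depth)  # without replacement (assumption)
        if caller(draw) == reference:
            hits += 1
    return hits / n_draws
```

Running `subsample_success` over a range of `depth` values for each locus and PCR replicate, then averaging, would give a curve of the kind shown in the figure.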
Fig 3. Relation between genotyping success and number of replicates using Megasat.
Single locus (a): although the probability of genotyping success is higher for homozygotes (0.66; dashed line) than for heterozygotes (0.46; solid line), seven independent replicates are required to determine a homozygous genotype in a noninvasive sample and eight to determine a heterozygous one. Multiple loci (b, c): probability of obtaining the correct genotype for multiple homozygous (b, dashed lines) and heterozygous (c, solid lines) loci with different numbers of replicates. Considering a total of 19 loci, probabilities were calculated for obtaining a correct genotype for at least 16 (squares), 17 (circles), 18 (triangles) and 19 (diamonds) of the 19 loci. The horizontal line marks a probability of 0.95.
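The multi-locus probabilities in panels b and c can be approximated with a binomial tail, as sketched below. This assumes that loci are independent and share the same per-locus probability `p` of a correct genotype for a given number of replicates. The caption does not say how that per-locus probability was derived from the replicate count, so `p` is left as an input, and the function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n independent loci are genotyped correctly.

    p: per-locus probability of a correct genotype (assumed equal across loci).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Example: chance of getting at least 16, 17, 18 or all 19 of 19 loci right
# for an illustrative per-locus probability of 0.99.
for k in (16, 17, 18, 19):
    print(k, round(prob_at_least(k, 19, 0.99), 4))
```

Under these assumptions, requiring all 19 loci to be correct demands a much higher per-locus probability than requiring 16 of 19, which is consistent with the number of replicates needed rising as the target approaches 19 loci.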