Benjamin Flück, Laëtitia Mathon, Stéphanie Manel, Alice Valentini, Tony Dejean, Camille Albouy, David Mouillot, Wilfried Thuiller, Jérôme Murienne, Sébastien Brosse, Loïc Pellissier.
Abstract
High-throughput DNA sequencing is becoming an increasingly important tool to monitor and better understand biodiversity responses to environmental changes in a standardized and reproducible way. Environmental DNA (eDNA) from organisms can be captured in ecosystem samples and sequenced using metabarcoding, but processing large volumes of eDNA data and annotating sequences to recognized taxa remains computationally expensive. Speed and accuracy are two major bottlenecks in this critical step. Here, we evaluated the ability of convolutional neural networks (CNNs) to process short eDNA sequences and associate them with taxonomic labels. Using a unique eDNA data set collected in highly diverse Tropical South America, we compared the speed and accuracy of CNNs with that of a well-known bioinformatic pipeline (OBITools) in processing a small region (60 bp) of the 12S ribosomal DNA targeting freshwater fishes. We found that the taxonomic labels from the CNNs were comparable to those from OBITools, with high correlation levels for the composition of the regional fish fauna. The CNNs enabled the processing of raw fastq files at a rate of approximately 1 million sequences per minute, which was about 150 times faster than with OBITools. Given the good performance of CNNs in the highly diverse ecosystem considered here, the development of more elaborate CNNs promises fast deployment for future biodiversity inventories using eDNA.
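The abstract does not state how reads are presented to the CNN. A common choice for short DNA sequences is one-hot encoding, sketched below under that assumption. The function name `one_hot_encode`, the A/C/G/T column order, the zero-padding to 60 positions, and the all-zero row for ambiguous bases are illustrative choices, not details taken from the study.

```python
# Hypothetical sketch: one-hot encode a short DNA read for a CNN.
# The encoding scheme is an assumption; the abstract does not specify it.
BASES = "ACGT"


def one_hot_encode(seq, length=60):
    """Return a length x 4 matrix (rows = positions, columns = A, C, G, T).

    Reads longer than `length` are truncated, shorter reads are padded
    with all-zero rows, and ambiguous bases (e.g. N) become all-zero rows.
    """
    matrix = []
    for base in seq.upper()[:length]:
        row = [0.0, 0.0, 0.0, 0.0]
        if base in BASES:
            row[BASES.index(base)] = 1.0
        matrix.append(row)
    # Pad short reads so every input has the same shape.
    matrix.extend([[0.0] * 4 for _ in range(length - len(matrix))])
    return matrix


encoded = one_hot_encode("ACGTN", length=6)
print(encoded[0])  # row for 'A'
print(encoded[4])  # row for the ambiguous base 'N'
```

A fixed-shape matrix of this kind is what a one-dimensional convolutional layer would typically consume, with the four base channels treated as input features.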
Year: 2022 PMID: 35715444 PMCID: PMC9205931 DOI: 10.1038/s41598-022-13412-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Principal coordinate analysis (PCoA) of species composition dissimilarity between filters. (A) Ordination of filter species composition dissimilarity in the outputs of OBITools. (B) Ordination of filter species composition dissimilarity in the outputs of the CNN applied to raw reads. Dissimilarity matrices were built with Bray–Curtis distances on read abundance per species per filter. (C) Maps of the filter locations, coloured according to the position of the filters in the PCoA space for the outputs of OBITools. (D) Maps of the filter locations, coloured according to the position of the filters in the PCoA space for the outputs of the CNN applied to raw reads. The maps were created with QGIS 3.6.1.
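As an illustration of the dissimilarity used in Figure 1, the following minimal sketch computes the Bray–Curtis dissimilarity between two filters from their per-species read counts. The function name `bray_curtis` and the example counts are hypothetical, and returning 0 for two empty filters is a convention chosen here, not one stated in the source.

```python
# Hypothetical sketch: Bray-Curtis dissimilarity between two filters,
# each given as a list of read counts per species (same species order).
def bray_curtis(a, b):
    """Return sum(|a_i - b_i|) / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("both filters must cover the same species list")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # assumed convention for two empty filters
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Made-up read counts for three species on two filters.
filter_1 = [10, 0, 5]
filter_2 = [5, 5, 0]
print(bray_curtis(filter_1, filter_2))  # (5 + 5 + 5) / (15 + 10) = 0.6
```

Applying this function to every pair of filters gives the dissimilarity matrix on which a PCoA would then be run.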
Figure 2. Kendall Tau-b correlation coefficient between the outputs of the CNN and OBITools. The left side of the violin plots (blue) displays correlation values between OBITools and the CNN applied to raw reads. The right side of the violin plots (red) displays correlation values between OBITools and the CNN applied to clean reads. The x-axis represents the threshold of the minimum read number per species for the species to be considered present. Stars represent a significant difference between the two correlations. The analysis was made at three levels: PCR replicates (top), eDNA filters (middle), and rivers (bottom).
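The comparison in Figure 2 can be sketched as follows: read counts below a minimum threshold are treated as absences, and Kendall's Tau-b is then computed between the two methods' per-species counts. The helper names and the choice to zero out sub-threshold counts are assumptions made for illustration; the caption does not describe the exact procedure.

```python
# Hypothetical sketch: threshold read counts, then compute Kendall's Tau-b.
from itertools import combinations
from math import sqrt


def apply_threshold(counts, min_reads):
    """Set counts below `min_reads` to 0 (species considered absent)."""
    return [c if c >= min_reads else 0 for c in counts]


def kendall_tau_b(x, y):
    """Tau-b = (C - D) / sqrt((pairs untied in x) * (pairs untied in y))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    if untied_x == 0 or untied_y == 0:
        return None  # undefined when one variable is constant
    return (concordant - discordant) / sqrt(untied_x * untied_y)


# Made-up per-species read counts from two methods.
method_a = apply_threshold([0, 3, 12, 50], min_reads=10)
method_b = apply_threshold([1, 8, 20, 40], min_reads=10)
print(kendall_tau_b(method_a, method_b))
```

Tau-b is used rather than the simpler Tau-a because thresholding produces many tied zero counts, which Tau-b corrects for in its denominator.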
Figure 3. Kappa correlation coefficient between the outputs of the CNN and OBITools. The left side of the violin plots (blue) displays correlation values between OBITools and the CNN applied to raw reads. The right side of the violin plots (red) displays correlation values between OBITools and the CNN applied to clean reads. The x-axis represents the threshold of the minimum read number per species for the species to be considered present. Stars represent a significant difference between the two correlations.
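Assuming the Kappa coefficient in Figure 3 is Cohen's kappa computed on species presence/absence (the caption does not say so explicitly), a minimal sketch looks like this. The function name `cohen_kappa` and the example detections are hypothetical.

```python
# Hypothetical sketch: Cohen's kappa for two binary presence/absence vectors.
def cohen_kappa(a, b):
    """Return (p_o - p_e) / (1 - p_e) for two 0/1 lists of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of species detected by method A
    p_b = sum(b) / n  # share of species detected by method B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1:
        return None  # undefined when both methods give one constant label
    return (observed - expected) / (1 - expected)


# Made-up detections for four species (1 = present, 0 = absent).
method_a = [1, 1, 0, 0]
method_b = [1, 0, 0, 0]
print(cohen_kappa(method_a, method_b))  # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

Unlike a raw agreement rate, kappa discounts the agreement expected by chance, which matters when most species are absent from most samples.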
Figure 4. Species detections with the CNN approach, with OBITools, and in historical records in the combined Maroni and Oyapock rivers. (A) Overlap of species detections between the CNN applied to raw reads (blue), OBITools (yellow) and historical records (grey). (B) Number of species per family, detected with only one method (CNN applied to raw reads, OBITools or historical records). (C) Overlap of species detections between the CNN applied to clean reads (red), OBITools (yellow) and historical records (grey). (D) Number of species per family that were detected with only one method (CNN applied to clean reads, OBITools or historical records).