Clinical metagenomics is a powerful diagnostic tool, as it offers an open view into all DNA in a patient's sample. This allows the detection of pathogens that would slip through the cracks of classical specific assays. However, due to the unspecific nature of metagenomic sequencing, a huge amount of non-pathogen data is generated during sequencing itself, and the diagnosis only takes place at the data analysis stage, where the relevant sequences are filtered out. Typically, this is done by comparison to reference databases. While this approach has been optimized over the past years and works well to detect pathogens that are represented in the databases used, a common challenge in analysing a metagenomic patient sample arises when no pathogen sequences are found: how to determine whether truly no evidence of a pathogen is present in the data, or whether the pathogen's genome is simply absent from the database and the sequences in the dataset could therefore not be classified? Here, we present a novel approach to this problem of detecting novel pathogens in metagenomic datasets by classifying the (segments of) proteins encoded by the sequences in the datasets. We train a neural network on coding sequences labeled by taxonomic domain, and use this neural network to predict the taxonomic classification of sequences that cannot be classified by comparison to a reference database, thus facilitating the detection of potential novel pathogens.
Over the past one and a half decades, Next Generation Sequencing (NGS) has revolutionized genomics and adjacent fields of research. The ability to sequence massive amounts of DNA at ever-decreasing costs per base has led to an explosion of the genetic information available to researchers. For instance, since the introduction of the Roche 454, the first commercially successful NGS machine [1], in 2005, the number of bases in GenBank has grown from about 10^10 to almost 10^12, at a staggering average rate of 5 × 10^10 bases per month—the same number of bases every two months that it had previously taken 22 years to accumulate [2]. And this is just the analyzed tip of the iceberg: the Sequence Read Archive (SRA) currently holds over 4 × 10^16 bases of raw NGS data [3].

This massive amount of available data presents diverse challenges for data analysis. One common application of NGS is metagenomic sequencing, where all genetic material in a complex sample, such as a patient’s body fluid (clinical metagenomics) or a piece of arctic ice (environmental metagenomics), is sequenced. While in targeted approaches, such as sequencing a cultured bacterium, the composition of the sample is known a priori and that knowledge can be used to inform the analysis, in metagenomic sequencing the primary data analysis task is determining that composition. The common approach to this challenge is to use known reference sequences: very broadly speaking, for each read from an NGS sample the similarity to all sequences within a reference database is determined, and the read is assigned to a taxon based on this comparison. While this approach allows highly successful detection of organisms with already-sequenced relatives, as evidenced by the results of studies such as CAMI, where all entries used variations on the above-mentioned approach [4], it does not allow the detection of entirely novel organisms—especially if those are only represented at low levels in the sample. Data from such organisms disappears within the thousands to millions of “unclassified” reads that remain as a byproduct of any metagenomic NGS dataset analysis—hidden among technical artifacts and reads from known organisms that could not be clearly assigned to a taxon.

Although such “unclassified” reads are usually discarded within the analysis workflow, the hints towards novel bacteria and viruses contained therein would be a valuable resource if they could be identified and isolated for further, more detailed analysis. This would facilitate the detection and characterization of the huge number of organisms that have not yet been sequenced—for instance, [5] estimate that there are around 10^14 bacterial taxa, of which only around 10^6 have been described, and [6] estimate that there are over 3 × 10^5 undetected mammalian viruses.

It has previously been shown that machine learning can be a valuable tool for overcoming this challenge: [7] have successfully used Random Forests to predict the presence of sequences from human-pathogenic organisms in metagenomic datasets. While several other approaches have used machine learning to optimize parameters of existing tools [8, 9], to our knowledge the work by [7] represents the only attempt to detect microbial sequences using machine learning without relying on reference sequences.
Here, we aim to extend the understanding of such approaches’ usefulness by applying transformer neural networks to the classification of NGS reads as mammalian, bacterial or viral in origin.

The application of neural network models has been profoundly successful in the field of natural language processing [10, 11]. In particular, models based on the Transformer architecture [12] have led to significant breakthroughs in developing so-called language models that show state-of-the-art performance on a variety of natural language processing tasks [13–15]. One tremendous advantage is that such models can be trained using a self-supervised approach, i.e., there is no requirement for labeled data as in a supervised learning regime.

In recent work, language models based on various transformer networks, primarily the multi-layer bidirectional Transformer [14], have been trained on datasets containing large numbers of unlabeled protein sequences, e.g., UniProt [16]. [17] developed the protein generation language model (ProtGen), which creates proteins that exhibit near-native structure energies. [18] investigated the learned representations of a protein language model; their findings show that high-capacity networks reflect biological structure at multiple levels, including amino acids, proteins, and evolutionary homology. [19] compared the performance of the embeddings generated by several network architectures [14, 20–22] on multiple supervised learning tasks, e.g., classification into membrane and non-membrane proteins.

In this paper, we build upon the aforementioned work on the application of transformer networks to protein classification to demonstrate their applicability to the taxonomic classification of NGS reads. Since the overarching goal is the detection of entirely novel organisms in metagenomic datasets, in this initial work we focus on classification at the domain level, specifically into mammalian (i.e., host in the case of clinical metagenomic datasets), viral and bacterial reads. This will allow extraction and specific examination of reads representing hitherto undescribed viruses and bacteria from the reads that remain in the “unclassified” bin after traditional metagenomic data analysis and taxonomic classification has been performed with tools such as Kraken [23], RIEMS [24], PathoScope [25], PAIPline [26] or MetaMeta [27]. Since we are building upon large existing models that have been trained on protein sequences, we limit this proof of concept to the classification of reads that lie within a coding sequence (CDS). However, in order to correctly perform classification independent of a read’s offset within the CDS, we also automatically determine which of the six possible frames the read is in. This prerequisite step is in itself a novel application of machine learning to ORF detection, as current tools either (i) rely purely on the presence/absence of start/stop codons without further interpretation of the sequence (such as getorf [28] or OrfM [29]) or (ii) return all candidate sequences for each read without clearly resolving potentially contradictory hypotheses (such as FragGeneScan [30], CNN-MGP [31] or geneRFinder [32]). However, since this is a proof-of-concept work, we do not—in contrast to these existing tools—examine reads that lie outside of CDSs in this paper.
Methods and implementation
We developed a proof-of-concept for the classification of NGS reads into the taxonomic domains viruses, bacteria, and mammals. The classification is done by concatenating multiple data processing and sub-classification steps.

At first, the frame of a read within its CDS must be recognized in order to correctly translate the DNA sequence fragments into amino acid sequences. This is a non-trivial step because there are typically no start or stop codons in the fragments. We developed a classifier based on a language model to detect the correct frame of a read. Using this information, the read sequences can be translated into amino acid sequences. In a final step, the amino acid sequences are classified into taxonomic domains by another language model.

In this section, we describe this proof-of-concept pipeline in more detail. Then, we provide information about the design and training process of the individual classifiers used in the different steps of the pipeline. Finally, we describe how the training and test data sets were generated.
Classification pipeline
The pipeline was implemented as a Python script (see https://github.com/CBMI-HTW/TaxonomicClassification-NGS-NN). A general overview is shown in Fig 1.
Fig 1
Overview of the complete neural network classification pipeline.
The pipeline consists of four major blocks: (1) preprocessing of NGS reads, (2) frame classification of NGS reads, (3) frame correction and translation of NGS reads, (4) taxonomic classification of the amino acid sequences. The dotted arrow shows an optional loop of the frame classification used for checking the frame correction block, as shown in Fig 2.
Fig 2
Prediction results of the frame classification model on the test dataset.
Predictions of the most probable class on the frame test data [41] are shown as an error matrix. The classes are as follows: on-frame (0), offset by one base (1), offset by two bases (2), reverse-complementary (3), reverse-complementary and offset by one base (4), reverse-complementary and offset by two bases (5). (A) shows the initial classification results of the reads. (B) Re-evaluation of the reads after applying the frame correction, validating that the reads were correctly shifted to be on-frame, i.e., k = 0.
As input, the pipeline receives a file in FASTA format containing NGS reads. We expect the nucleotide sequences s to be 300 base pairs long each, i.e., s = (s_1, …, s_300). This is not a strict requirement, as shown by our experiments with different amino acid sequence lengths (S1 Appendix; S1 Fig), but since the classifiers were optimized on that length, it will lead to the best results. While for the prototype we assume that all reads lie within CDSs, we plan to add an automatic selection of such reads from raw NGS data in future work.

Each nucleotide sequence is translated into a protein sequence x = (x_1, …, x_100) using biopython [33]. Often, this is not the correct translation, because most reads are off-frame. Such wrong translations are detected in the next step and are then re-translated. However, if there are any stop codons in the initial translation, a full six-frame translation of the read is performed and, if a translation without stop codons is found, it is used instead.

The resulting amino acid sequence x from the initial translation is fed into the frame classification model, which returns a probability distribution over the classes p_frame(k ∣ x). There are six possible classes k ∈ {0, …, 5} which are predicted by the model for each sequence: on-frame (k = 0), offset by one base (k = 1), offset by two bases (k = 2), reverse-complementary (k = 3), reverse-complementary and offset by one base (k = 4), and reverse-complementary and offset by two bases (k = 5). Based on the classification result k̂, with k̂ = argmax_k p_frame(k ∣ x), the transformations (shifting, reverse-complementing) necessary to make each amino acid sequence on-frame are performed on the original DNA sequences s before they are again translated into amino acid sequences for further processing. Finally, each on-frame amino acid sequence is classified into one of the taxonomic domains t used in this prototype (virus, bacteria, or human/mammalian) using the second classification model p_tax(t ∣ x).

The output of the pipeline is the input FASTA file with a modified identifier line, i.e., information about the frame classification and the taxonomic classification is appended.
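To make the data flow concrete, the following is a minimal sketch of this translation and frame-correction logic, under two assumptions: `predict_frame` and `predict_taxon` are hypothetical stand-ins for the two neural classifiers, and the mapping from frame class k to the number of bases trimmed from the 5' end is our illustration, not taken from the published code.

```python
# Minimal sketch of the pipeline's per-read logic (assumptions noted above).
from Bio.Seq import Seq

def make_on_frame(dna: str, k: int) -> str:
    """Undo frame class k (0-5) so that the read starts on a codon boundary."""
    seq = Seq(dna)
    if k >= 3:                  # classes 3-5: read is reverse-complementary
        seq = seq.reverse_complement()
    seq = seq[k % 3:]           # classes 1/2 and 4/5: trim the assumed frame offset
    return str(seq[:len(seq) - len(seq) % 3])   # keep whole codons only

def classify_read(dna: str) -> str:
    protein = str(Seq(make_on_frame(dna, 0)).translate())
    if "*" in protein:          # stop codon: fall back to a six-frame translation
        for k in range(6):
            candidate = str(Seq(make_on_frame(dna, k)).translate())
            if "*" not in candidate:
                protein = candidate
                break
    k_hat = predict_frame(protein)                          # argmax of p_frame(k | x)
    on_frame = str(Seq(make_on_frame(dna, k_hat)).translate())
    return predict_taxon(on_frame)                          # p_tax(t | x)
```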
Model description, training, and fine-tuning
The classifiers are based on pre-trained (protein) language models. A language model assigns a probability distribution to a sequence of tokens of a language; in our context, the tokens are amino acid symbols. Such language models are trained self-supervised on large corpora of sequences/sentences of a language, e.g., on protein databases. In the training process, the language model must either continue a sequence (autoregressive models) or predict missing tokens that are masked in the input [14], i.e., it must assign high probabilities to the corresponding tokens of the training data. Therefore, no labeled data is needed.

After training, a neural network language model can be used as a feature extractor, i.e., the token sequence is transformed by the model into a sequence of feature vectors. This sequence of features can be used for other tasks, e.g., for the classification of a complete sequence. For this, the sequence of features is pooled into one feature vector of fixed size. With additional labeled data, a classification model can then be trained on pairs of feature vectors and labels. This is an example of transfer learning: the knowledge learned by modeling the language is used for another task, e.g., a classification. The parameters of the feature extractor can also be modified for the specific task. Typically, a classification head replaces the last layer(s) of the pre-trained language model such that the new neural network can be trained end-to-end on the classification data (sequence-label pairs), i.e., the parameters of the neural network are fine-tuned to improve the classification performance.

We used a pre-trained language model called ProtBert [19], a multi-layer bidirectional transformer neural network [14] with 30 layers and 420 million parameters. The language model was trained on the UniRef100 dataset [16]. As an important detail, we note that this dataset contains protein sequences in their correct reading frame. Such sequences do not contain any stop codons, which we take into account in our pipeline, as discussed above in Section “Classification pipeline”, and in our data generation, as described in Section “Data generation”.

The language model maps an amino acid sequence x = (x_1, …, x_N) to a sequence of feature vectors (h′_1, …, h′_N). In the case of the pre-trained language model ProtBert, each h′_i consists of 1024 elements. We reduce this sequence of feature vectors to a single feature vector h using different pooling strategies; we explored mean, max, and dot-product self-attention functions for pooling. For classification, we feed the feature vector h into a two-layer dense network (the classification head), projecting this representation into unnormalized log probabilities z = (z_1, …, z_c), with c being the number of classes of the task. The class probabilities are computed from z by a softmax operation. Note that the language models are trained on complete protein sequences, whereas our classification is done on (protein) fragments.

Both classification models p_frame and p_tax were trained with two different approaches: (pure) feature extraction and fine-tuning. In the first variant, the feature vectors generated by the transformer are fixed, i.e., the parameters that were trained with the language model objective are frozen and only the parameters of the classification head are adapted during training. This approach has the advantage of significantly reduced training time: since only the small dense network's parameters have to be updated, the batch size can be increased.
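As an illustration, the following is a minimal sketch (not the authors' published code) of such a classification head in PyTorch, assuming mean pooling of the 1024-dimensional ProtBert token features; the hidden-layer sizes correspond to the “CN feature number” values in Table 1, and the class count c is 6 for p_frame and 3 for p_tax.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pools per-token features into one vector h and maps h to class logits z."""
    def __init__(self, feat_dim: int = 1024, hidden: int = 512,
                 num_classes: int = 6, dropout: float = 0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),   # unnormalized log probabilities z
        )

    def forward(self, token_features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_features: (batch, seq_len, feat_dim); mask: (batch, seq_len), 1 = real token
        mask = mask.unsqueeze(-1).float()
        h = (token_features * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
        return self.net(h)   # softmax is applied downstream, e.g., inside the loss
```

With frozen transformer weights, only these head parameters are updated during training.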
However, updating all weights during the training process lets the pre-trained model adapt to the specific task. We explored this alternative using smaller batch sizes and different learning rates for the dense network and the pre-trained language model. Regardless of the training type, we used the LAMB optimizer [34] to update the model parameters, and we optimized the model hyperparameters using the ASHA algorithm [35]. We summarize the final hyperparameter settings of our experiments in Table 1.
Table 1
Hyperparameter settings used for the training process of p_frame and p_tax.

                        p_frame    p_tax
Epochs                  2          10
Batch size              128(8)     64(4)
LM learning rate        1e-5       25e-5
CN learning rate        0.0025     0.0005
CN feature number       512        256
CN dropout rate         0.0        0.25

LM is short for language model and CN for classification network.
The models were trained on a small cluster consisting of 8 Nvidia V100 GPUs. We realized distributed training through data parallelism, i.e., the same model is replicated across the nodes and each replica processes different batches. In Table 1, we report the global batch size with the number of GPUs in brackets where training was distributed; e.g., 256(4) means a global batch size of 256 across 4 GPUs, i.e., 64 per GPU.
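This data-parallel setup can be sketched as follows (a simplified illustration, not the authors' training script; we assume the LAMB implementation from the third-party torch_optimizer package, as the paper does not name one):

```python
# Each GPU holds a full model replica and sees its own shard of every batch;
# DistributedDataParallel averages the gradients across replicas.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torch_optimizer  # assumed source of the LAMB optimizer

def train(model, dataset, epochs: int, local_batch_size: int, lr: float):
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    model = DDP(model.cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset)        # disjoint data shard per GPU
    loader = DataLoader(dataset, batch_size=local_batch_size, sampler=sampler)
    optimizer = torch_optimizer.Lamb(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features.cuda(rank)), labels.cuda(rank))
            loss.backward()                      # gradients are all-reduced here
            optimizer.step()
```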
Data generation
To train the models described above with the most reliable data available, we used amino acid sequences from UniProt [16] for the taxonomic classification model and RefSeq [36] reference sequences for the frame classification model. We then tested the applicability of the trained models and of the whole pipeline by classifying reads from two metagenomic sequencing projects available in the SRA [37], after selecting only the reads matching the criteria of this pipeline (i.e., those that lie within a viral, bacterial or mammalian CDS). The steps for generating the input sequences x for the classifiers from the initial sequences s of the three raw data sets are described in detail in the following sections.
Training data for taxonomic classification
For training of the taxonomic classification, we downloaded the 2020-04 release of the fully manually annotated and curated UniProtKB/Swiss-Prot database for bacteria, viruses, and human as a representative for mammalian sequences [16, 38]. For each amino acid sequence x = (x_1, …, x_N(x)) we created, using a sliding window, all possible patches (x_l, …, x_(l+99)) of length 100 for all l ≤ N(x) − 100. Here, N(x) denotes the varying length of the initial sequence x; sequences with N(x) < 100 were discarded.

The initial data is strongly unbalanced with respect to its biological origin. In order to split the data into training, validation and test sets, we iteratively drew without replacement from all sequences x and placed all patches generated from one x into either the test, validation or training set. Further, we balanced the data by keeping all viral sequences and downsampling sequences of bacterial and human origin until all data sets contained the same number of patches for all three classes, with approximate ratios of 10% test, 10% validation and 80% training of the total size.

The final data sets contain approximately 1.8 × 10^7 patches and are deposited at zenodo [39].
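The patch generation and the sequence-level split can be sketched as follows (a minimal illustration with our own function names; the class balancing by downsampling is omitted):

```python
import random

def make_patches(protein: str, patch_len: int = 100) -> list:
    """All sliding-window patches of length 100; shorter sequences are discarded."""
    if len(protein) < patch_len:
        return []
    return [protein[l:l + patch_len] for l in range(len(protein) - patch_len + 1)]

def split_by_sequence(proteins, ratios=(0.1, 0.1, 0.8), seed=42):
    """Place all patches of one sequence into exactly one of the test,
    validation or training sets, so no sequence spans two splits."""
    rng = random.Random(seed)
    test, val, train = [], [], []
    for protein in proteins:
        bucket = rng.choices((test, val, train), weights=ratios)[0]
        bucket.extend(make_patches(protein))
    return test, val, train
```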
Training data for frame classification
For training of the frame classification, randomly selected viral and bacterial genomes and the human reference genome (GRCh38.p13) were downloaded from GenBank [40]. From these genomes, all annotated CDS DNA sequences s were extracted. Similar to the amino acid data, for each nucleic acid sequence s = (s_1, …, s_N(s)), all possible patches (s_l, …, s_(l+299)) of length 300 for all l ≤ N(s) − 300, as well as their reverse-complemented versions, were created using a sliding window and translated to amino acids using biopython [33]. In this way, we create the on-frame sequence as well as all possible off-frame configurations.

In order to prevent the model from relying on the presence of a stop codon for the classification of off-frame sequences, all sequences whose translation contained a stop codon were discarded. We split the data into three sets with approximate ratios of 10%, 10% and 80% of the total size by placing all patches generated from one initial sequence s into one of the three sets.

Due to the removal of sequences containing stop codons, which are only present in off-frame sequences, on-frame sequences were heavily over-represented in these data sets. We balanced the data sets by discarding sequences in over-represented frames until all frames were present at the same ratio; the final three data sets have the exact size ratios of 10% test, 10% validation and 80% training. The resulting data sets contain a total of 1.2 × 10^7 patches and are deposited at zenodo [41].
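The construction of the six labeled frame classes from one window can be sketched as follows (our own simplified illustration: here the offsets are produced by trimming a single window, whereas the pipeline obtains them naturally from the sliding window across the CDS):

```python
# Derive the six frame variants of a 300-base window and their class labels:
# 0-2 = forward with offset 0/1/2, 3-5 = reverse-complementary with offset 0/1/2.
# Variants whose translation contains a stop codon are dropped, so the model
# cannot rely on stop codons to recognize off-frame sequences.
from Bio.Seq import Seq

def frame_variants(window: str):
    examples = []
    for k in range(6):
        seq = Seq(window)
        if k >= 3:
            seq = seq.reverse_complement()
        seq = seq[k % 3:]
        seq = seq[:len(seq) - len(seq) % 3]   # whole codons only
        protein = str(seq.translate())
        if "*" not in protein:                # discard the stop-codon give-away
            examples.append((protein, k))
    return examples
```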
Application data
To test the applicability of the trained models to real data, we downloaded the raw NGS reads from two metagenomic SRA runs with a read length of 300: a human skin metagenome study (SRR7188139) and a swine feces metagenome (ERR3013343). Since the proof-of-concept pipeline only classifies reads that lie within a CDS, eligible reads were extracted by mapping. To that end, all RefSeq viral and bacterial genomes, the human reference genome (GRCh38.p13) and the Sus scrofa reference genome (GCF_000003025.6) were downloaded, annotated CDS sequences were extracted, and the raw reads were mapped to these using bowtie2 [42] with the --end-to-end parameter. Reads mapping exclusively to viral, bacterial or mammalian ORFs were selected for the application test.
Benchmarking of frame classification
The first step of the pipeline—determining the frame for translating a read’s sequence—is a task that is also tackled by other existing tools. It is therefore not immediately obvious whether the best performance is achieved by frame classification using ProtBert, as shown in Fig 1, or by one of these existing tools. In order to answer this question, we compared the ability of our classifier to predict the correct frame of a read to that of other tools.

We found six tools that tackle the task of determining the correct translation frame directly from short NGS reads: MGC [43], MetaGUN [44], MetaGene [45], Orphelia [46], FragGeneScan [30] and CNN-MGP [31]. The last of these, CNN-MGP, also uses a neural network to perform the classification. Unfortunately, out of these six tools, only two were usable for our comparison. Orphelia requires a java binary that was built against gcc version 3.4, which has been superseded by version 4.0 in late 2006; setting up a system with such old package versions was outside of the scope of this work. The websites referenced in the publications for MetaGUN and MetaGene are offline, and MGC does not mention any download website or include the binaries in the supplementary materials. We were not able to find any other resources, such as mirrors or git repositories, from which source code or binaries could be downloaded, making it impossible to run any of these tools. Accordingly, we only included FragGeneScan and CNN-MGP in our benchmarks.

In order to make the benchmark reproducible, we implemented it as a nextflow [47] pipeline (see https://gitlab.com/dabrowskiw/cdsfinderbenchmark). For the evaluation, we used the above-mentioned test dataset [41]. Since the documentation of CNN-MGP's output is not entirely clear on how the reported reverse reading frames are encoded, we manually tested all possible interpretations and chose the one yielding the best results for CNN-MGP. We also excluded reads for which CNN-MGP or FragGeneScan reported no reading frame from the calculation, since including these would have given our approach an unfair advantage: while in this proof-of-concept work we have only included reads from within CDSs and thus predict a frame for every read, CNN-MGP and FragGeneScan can be applied to real data that also includes reads from non-coding regions, and thus need to be able to predict that a read contains no valid frame.
Results
In this section, we report the results of the trained models in two different settings. First, we test the taxonomic and frame classification models separately on the test data of the corresponding training setups. Then, we use the full pipeline on real data from metagenomic sequencing studies.
Evaluation of both models
We evaluated the final classification models using the 10% test split of the generated data. As metrics, we report the ROC curve and the error matrix as a heatmap for both tasks.

On the test dataset, the frame classification model p_frame achieved an overall accuracy of 0.98 (S2 Fig in S1 Appendix). Since the dataset's classes are balanced, we expected a strong diagonal in the classification task's error matrix; this expectation was confirmed (see Fig 2A). After applying the frame correction, we used the classification model to verify that the reads were correctly shifted into frame zero (k = 0). We observed that almost all sequences had been moved accordingly (Fig 2B).
We measured an accuracy of 0.91 on the corresponding test data for the taxonomic classification model p_tax, as shown in the error matrix (Fig 3A). We calculated the per-class accuracies to inspect this result in more detail. For reads predicted as bacterial, 94% were indeed correctly classified. In contrast, of the sequences classified as viral, only 88% were actually of viral origin and 8% were mammalian. We observed a similar behavior for the reads classified as mammalian (92% mammalian, 6% viral). This indicates that the classifier has the most difficulty differentiating between these two classes. This observation is also reflected in Fig 3B, the classifier's ROC curve, where class 0 (viral) and class 2 (human) perform slightly worse than class 1 (bacterial). This is likely due to the presence of retroviral sequences in the human genome.
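For clarity, these per-class values are the precisions of the predicted classes, i.e., the diagonal of the error matrix divided by its column sums; a minimal sketch of the calculation (the matrix layout, with rows as true and columns as predicted classes, is our assumption):

```python
import numpy as np

def per_class_precision(conf: np.ndarray) -> np.ndarray:
    """conf[i, j] = number of class-i sequences predicted as class j;
    precision of predicted class j = conf[j, j] / column sum j."""
    return np.diag(conf) / conf.sum(axis=0)
```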
Fig 3
Prediction results of the taxonomic classification model on the test dataset.
Error matrix (A) and ROC curve (B) on the taxonomic test dataset [39] are shown. The classes are as follows: 0—viral, 1—bacterial, 2—mammalian.
Exemplary analyses
The exemplary analysis of data from real metagenomic sequencing studies presented a more challenging classification task, most likely due to noisier data. Firstly, training and testing were performed using error-free sequences derived from curated references, while real NGS reads contain sequencing errors. Secondly, since the filtering of reads belonging to a CDS was performed by mapping to all CDS sequences annotated in RefSeq, any automatic annotation that mistakenly classified a non-coding open reading frame as a CDS causes reads that do not encode a real protein sequence to remain in the dataset. Still, a taxonomic domain classification precision of 0.87 could be achieved on the swine feces metagenome dataset, or 0.90 if all reads that had a stop codon in every possible frame were discarded (since this should not happen in a correct read within a CDS, it can be seen as indicative of an error). These results are visualized in Fig 4A.
Fig 4
Prediction results of the complete classification pipeline on data from real metagenomic sequencing studies.
ROC curves for taxonomic classification in swine feces metagenome (A) and human skin metagenome (B) datasets. The classes are as follows: 0—viral, 1—bacterial, 2—mammalian.
Using SKESA [48] with default parameters on the reads classified as viral, it was possible to de novo assemble five out of the six CDSs of the porcine epidemic diarrhea virus present in the sample, each in a single contig. The only CDS that could not be assembled was the one encoding the envelope protein: since it is shorter than the read length of 300 bases used in the analysis, reads containing this CDS were discarded during dataset generation.

However, the classification of reads from the human skin metagenome only showed a precision of 0.62, due to a large number of bacterial reads being wrongly classified as viral. The resulting ROC curve is shown in Fig 4B. This disparate performance on different datasets warrants closer examination in the future. One possible explanation could be the presence of unannotated bacteriophage sequences in the bacterial reference genomes used for the mapping-based classification of the reads, which would lead to the observed discrepancy between the neural network's and the mapper's assessment of whether a sequence is bacterial or viral in origin.

In the classification of a read's frame, our approach using ProtBert (98.18% correctly classified frames) significantly outperformed both CNN-MGP (33.62%) and FragGeneScan (58.65%). However, it is important to note that due to the limitations described in the methods section, these results are only meaningful in the context of this specific proof-of-concept work. Especially given the current limitation to recognizing the frame of reads that lie wholly within CDSs, this step of the pipeline does not represent a production-ready alternative method for determining the correct frame of NGS reads in general.
Discussion
With this work, we have shown that a taxonomic classification at the domain level, based on short sections of the amino acid sequences of an organism's proteins, is possible using a transformer neural network without relying on a reference database. We have also demonstrated that it is possible to determine the frame of a short DNA sequence within an ORF using a transformer neural network, without knowledge of the reference sequence or the usual comparison of a six-frame translation against a protein sequence database.

This novel application of transformer neural networks to sequence classification will support the reference-free detection of hitherto unknown bacterial and viral proteins—and, by proxy, unknown organisms—in the “unclassified” read set typically left over after metagenomic NGS data analysis. It could also support the analysis of metaproteomic experiments, allowing an initial high-level classification of peptides and thus aiding protein sequence assemblers such as the SSAKE-based PASS [49] by presenting them with a reduced number of sequences with a lower likelihood of misleading overlaps.

Additionally, the ability to classify the frame of a short DNA sequence should be useful in diverse fields of study. For instance, it is very likely that this classification is possible due to an ability to recognize biologically functional amino acid sequences. In that case, the application of this classifier could allow the detection of recent frame-shift mutations in new genomes without requiring a high-quality reference sequence. It could also be integrated into gene detection algorithms to aid the ranking of ORFs based on their likelihood of encoding a functional protein, complementing approaches such as that described by [50].

In order to allow simple integration into analysis workflows using raw NGS reads, in the future we will add an initial classification step to determine which reads lie within an ORF and can thus be used for frame and taxonomic classification.
Supporting information
S1 Appendix. Classification with variable sequence length [51]. (PDF)
S2 Appendix. Frame classification with realistic NGS reads [41, 52, 53]. (PDF)