Yan Yan1, Bo-Wen Zhang2, Xu-Feng Li1, Zhenhan Liu1. 1. Department of Computer Science and Technology, School of Mechanical Electronic and Information Engineering, China University of Mining and Technology Beijing, Beijing, China. 2. Alibaba Group, Hangzhou, China.
Abstract
Biomedical question answering (QA) represents a growing concern in industry and academia due to the crucial impact of biomedical information. When matching and ranking candidate snippet answers within relevant literature, current QA systems typically rely on information retrieval (IR) techniques: specifically, query processing approaches and ranking models. However, these IR-based approaches are insufficient to consider both syntactic and semantic relatedness and thus cannot formulate accurate natural language answers. Recently, deep learning approaches have become well known for learning optimal semantic feature representations in natural language processing tasks. In this paper, we present a deep ranking recursive autoencoders (rankingRAE) architecture for ranking question-candidate snippet answer pairs (Q-S) to obtain, from the potentially relevant documents, the most relevant candidate answers for biomedical questions. In particular, we convert the task of ranking candidate answers into several simultaneous binary classification tasks that determine whether a question and a candidate answer are relevant. The compositional words and their randomly initialized vectors of concatenated Q-S pairs are fed into recursive autoencoders to learn the optimal semantic representations in an unsupervised way, and their semantic relatedness is classified through supervised learning. Unlike several existing methods that directly choose the top-K candidates with the highest probabilities, we take the influence of different ranking results into consideration. Consequently, we define a listwise "ranking error" for the loss function computation to penalize inappropriate answer rankings for each question and to eliminate their influence. The proposed architecture is evaluated on the six-year BioASQ 2013-2018 biomedical question answering benchmarks. Compared with classical IR models, other deep representation models, and some state-of-the-art systems for these tasks, the experimental results demonstrate the robustness and effectiveness of rankingRAE.
Due to the continuous growth of information produced in the biomedical domain, there is a substantially growing demand for biomedical QA from the general public, medical students, health care professionals and biomedical researchers [1]. Public demand for biomedical knowledge is on the rise, especially regarding prevention methods and disease symptoms; medical students seek relevant knowledge in papers or from practice, while researchers follow the results of previous studies. Moreover, biomedical QA is a significant component of several real-world medical applications [2].

In recent years, various methods have been proposed in the field of biomedical QA [3]. Typical current QA models or systems consist of three main phases: question processing, document processing, and answer processing [4, 5]. The question processing phase is usually responsible for converting questions expressed in natural language into queries suitable for a document search engine. The document processing phase then retrieves the most relevant documents with the generated queries and extracts candidate answer passages. Finally, in the answer processing phase, the candidate answers are matched against the expected answer type and ranked according to the matching scores [6-8].

There have been several investigations into improving the query processing phase. For example, Cao et al. [9], Wasim et al. [10] and Abacha et al. [11] employed question classification approaches, with semantic information obtained from UMLS resources. However, some researchers have noted that these medical QA approaches are limited in the types and formats of questions that they can process [12]. In contrast to the above studies focusing on query processing, several systems have been developed [13-15] with different document processing approaches. Standard IR engines [16], e.g., Google, biomedical query systems, e.g., PubMed, or their combination have been proposed to return relevant documents in response to a query. In addition, some researchers have managed to utilize semantic knowledge in document retrieval [17, 18]. However, the statistics indicate that passage extraction can benefit more from the incorporation of semantics than document retrieval does.

Consequently, beyond appropriate question analysis and document retrieval, effectively extracting and selecting relevant answers represents the bottleneck of the entire process. From our perspective, investigating how to select relevant snippets (explained in detail in Remark 1) directly from retrieved documents is key to overcoming the limitations on question type and to improving performance to a large extent.

There is not much research on returning relevant snippets for biomedical questions. A previous study by NCBI [19] suggested that the cosine similarities between questions and sentences in relevant documents represent the question-answer relationships, arguing that higher similarity indicates higher relevance. Two further studies ignore the differences between QA and IR. One, by BioASQ participants, builds a model with a granularity of several random words and produces a subdocument-level ranking through a document retrieval model [20].
Another, by BioNLP participants, utilizes encoder technology to measure the relationship between questions and answers [21]. Despite the improvements in performance, the above extraction strategies may break the completeness of the semantics, whether through the use of "sentence" or the definition of "granularity". In most cases, a relevant snippet may be a single sentence, multiple sequential sentences, or even half a sentence, and the studies of the NCBI and BioNLP participants are both inadequate in that regard. In our experience, the snippets with the most similar keywords or term distributions are often not the requested answers. For instance, if the relevant documents happen to contain the exact statement of the question, then the expected answer is obviously the following sentences, rather than the similar sentence.

In this paper, we suppose that there are latent semantic relations between a biomedical question and its relevant snippet answers, namely, Q-A relations. Thus, the problem of selecting relevant snippet answers can be converted into several classification tasks that decide whether a question and a candidate answer have the Q-A relation. First, all possible candidate snippets are extracted from the documents, and each candidate snippet is combined with the question to form a question-snippet (Q-S) pair. Then, an appropriate vector representation model is utilized to represent the semantics of the Q-S pairs. Convolutional neural networks and recurrent neural networks are both common vector representation models used to represent global semantics, but they may ignore the local semantics and the syntactic information. In contrast, recursive neural networks (abbreviated RNNs in this paper, as distinct from recurrent networks) preserve the local semantics to the utmost and take both syntactic and semantic information into account. As a result, recursive networks are chosen to learn the semantic representations. Unlike in conventional classification, a specific prediction of Q-A relations may yield various ranking results. Taking that fact into account, we modified the RNNs by defining the "ranking error" and integrating it into the loss function computation to correct the errors caused by ranking. With the semantic vectors of Q-S pairs and supervised learning, the probabilities of Q-A relations are computed and ranked to select relevant snippet answers.

We performed experimental evaluations on the BioASQ 2013-2018 benchmarks with the Medline corpus. The results show that our proposed approach outperforms several competitive baselines, including classical IR models, variants of the proposed model with replaced vector representations (e.g., CNNs and LSTM), and state-of-the-art BioASQ participants.

In summary, the main contributions are: 1) proposing a novel approach to solve the snippet retrieval problem in biomedical QA with a classification model; 2) redesigning the loss function of RNNs to orient it toward ranking; and 3) providing a better solution for BioASQ.

Remark 1
"Snippet" is not an unambiguous concept like "sentence" or "paragraph". Our working definition is "a small, sequential piece of an article which expresses an independent and complete semantic unit". The separators might be commas, periods, semicolons, or even the word "and". For instance, a snippet could be a single sentence, half a sentence such as "Most cases of CMT are caused by mutations in PMP22," or multiple sequential sentences such as "PMP22 is the common gene found mutated through a duplication in CMT1A. Other genes are MPZ and SH3TC2.". The extraction of snippets is therefore a major challenge in snippet retrieval.
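To make this definition concrete, the sketch below enumerates candidate snippets as all short runs of adjacent separator-delimited pieces. It is a minimal illustration only: the separator set and the maximum span length are our assumptions here, not the exact configuration of our system.

```python
import re

# Illustrative separators from Remark 1: commas, periods, semicolons, "and".
SEPARATORS = re.compile(r",|\.|;|\band\b")

def candidate_snippets(text, max_pieces=4):
    """Enumerate all sequences of up to `max_pieces` adjacent pieces,
    where a piece is the text between two separators."""
    pieces = [p.strip() for p in SEPARATORS.split(text) if p.strip()]
    for i in range(len(pieces)):
        for j in range(i + 1, min(i + 1 + max_pieces, len(pieces) + 1)):
            yield " ".join(pieces[i:j])

for s in candidate_snippets("Most cases of CMT are caused by mutations "
                            "in PMP22, and other genes are MPZ and SH3TC2."):
    print(s)
```

Note that even this toy enumeration produces spans that cross sentence boundaries, which is exactly why snippet extraction is harder than sentence splitting.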
Related work
In a review several years ago, a few studies were highlighted that were dedicated to biomedical question answering (QA) systems [17, 22]. To our knowledge, Cairns et al. were the first to emphasize the importance of establishing a biomedical domain-specific question answering system. Then, TREC, one of the authoritative forums in the field of information retrieval based on large test collections related to QA systems, started a genomics track. Further, EQueR-EVALDA [23], a French evaluation campaign for question answering (QA) systems, provided two tasks, one of which is a biomedical domain-specific task for medical questions. Recently, there has been a wide range of successes. In the sixth edition of the BioASQ challenge, 26 teams with more than 90 systems participated in total, and the best ones were able to outperform the strong baselines [24]. Like the participants of the BioASQ challenge, participants of the BioNLP challenge also demonstrate very good performance [25]. However, there are still some limitations of biomedical QA, such as the lack of annotated data, ambiguity in clinical text, and models' lack of comprehension of question/answer text [1, 26].

Apart from these tracks, organizations such as Google, MedQA [27], Onelook, and PubMed are also trying to construct question answering applications. With regard to the quality of answers and ease of use, Google performs very well, better than the other three [28]. All of them can return an acceptable response to the greater part of the definitional questions posed by physicians; due to certain restrictions, however, only definitional questions can be solved. Another research project focused on retrieving answers from biomedical literature by narrowing down the candidate answer space through question classification and assigning a higher rank to the correct answers [10]. This research still suffered from some troublesome problems [7, 29], such as requiring questions of a clear factoid or list type.

In 2013, the first BioASQ challenge was held. The organizers provided a large-scale question answering competition in which the systems are required to cope with all stages of a question answering task, including the retrieval of relevant articles and snippets as well as the provision of natural language answers [30, 31]. Two teams in this challenge, Choi et al. [32] and Papanikolaou et al. [33], proposed models of reference value. The third edition of the BioASQ challenge was held in 2015. Sarrouti and El Alaoui [34] proposed using stemmed words and UMLS concepts as features for the BM25 model, which achieved good performances, mainly because they took full advantage of UMLS concepts and sentence components in both the document retrieval phase and the snippet retrieval phase. Their paper also shows that using the linguistic resources of the sentence itself is as important as the choice of model; a recent article illustrates the same point [35]. The sixth edition of the BioASQ challenge was held in 2018 [24]. A team in this challenge took advantage of attention theory [36]. They used pointwise multiplication of the query term matrix and the document term matrix, like attention via dot product, for encoding, and used pretrained embeddings with one dense layer and a residual connection to generate context-sensitive term encodings.
Intuitively, this context-sensitive term encoding achieved the same effect as context encoding via a bidirectional RNN [37], and the former was faster. As a result, the system scored at or near the top for all tasks of this challenge [8]. The above models or systems still have defects: according to the evaluation, only the matching of relevant documents achieved successful results. When searching for relevant snippets, the results degraded considerably because the systems could not locate the accurate positions of the relevant snippets [17]. However, as mentioned in the introduction, relevant documents alone cannot meet the requirements, because the exact statements are difficult to locate manually within the candidate literature; relevant snippets can solve this issue. According to the overviews of the BioASQ competitions, most participants working on snippet retrieval adopted proposals similar to their article-search methods. The main differences were the methods used to split the documents. NCBI suggested using the sentences of relevant documents directly [19]. Another study by BioASQ participants defined a granularity of several words to split the documents [20]. Several researchers also regarded all possible snippets as different "short documents": indices of these candidates were built in preprocessing, and the same retrieval models were utilized to rank them. Apart from these retrieval approaches, the framework proposed by NCBI [38] directly computes the cosine similarities between the questions and the candidate sentences to measure their relatedness; the best-scoring sentences from the title or the abstract are then chosen as relevant snippets for a question.

From our perspective, these approaches rely excessively on information retrieval techniques in which the ranking is based on the distributions of query terms in the documents and the whole collection. These approaches have a severe weakness: they take no account of semantics. The cosine similarity represents the degree of resemblance rather than the Q-A relation. In the same way, the output scores of any classical IR model also represent the similarities of term distributions in the questions/queries, in the documents, or in the whole collection. The semantic meanings are not taken into account when deciding whether a pair has the Q-A relation, while the semantics are usually the decisive factor. For instance, for a biomedical question such as "How to treat infectious mononucleosis," a statement inside a candidate document is "What is the treatment for infectious mononucleosis? Chloroquine and steroids are worth attempting." Obviously, the expected relevant snippet is the latter sentence, "Chloroquine and steroids are worth attempting," rather than the former, "What is the treatment for infectious mononucleosis?" Consequently, including semantics is of great importance for locating the relevant snippets for biomedical questions.
Ranking with modified RNNs
As described above, the modified RNNs are used to generate a fixed-size vector representing a variable-length Q-S pair, in order to discover the semantic relations between the question and the candidate snippet. In this section, we introduce, in turn, the preprocessing work, the unsupervised RNNs that recursively combine word vectors, and the modified semi-supervised RNNs that both learn the semantic representations and solve the ranking problem. The architecture of the modified RNNs, which learn the semantic vector representations of Q-S pairs and classify whether the Q-S pairs have Q-A relations, is shown in Fig 1.
Fig 1
Illustration of the modified RNNs architecture to learn semantic vector representations for a Q-S pair.
Words are first pretrained into continuous vectors. Then, they are recursively combined into a fixed-length vector through the same autoencoders. The vectors at each node are used as features to predict the local semantic relations.
Preprocessing and pretraining
We first perform query formulation on the input questions and feed the generated queries into a search engine to retrieve relevant documents. Specifically, in the query formulation step, we use NLTK to build a part-of-speech parse tree for every input question and remove all non-noun-phrase parts, since removing stop words alone is not enough. A typical problem is that we cannot retrieve the documents that we need if we remove only stop words from questions. Through word frequency analysis, we find that most documents containing answers contain the nouns (or other noun forms) of the question, while other parts of speech do not appear regularly. If we do not delete those words, we may fail to retrieve all the documents we need, since search engines tend to favor documents matching more of the input terms. Our experiments found that keeping noun phrases works better than keeping nouns only. Then, all possible candidate snippets are extracted from the top-N documents to guarantee the recall of ideal snippet answers, and each snippet is combined with the question into a Q-S pair.

Moreover, the semantic vectors of words are required. Random continuous vectors are usually used, but here, a coarse learning process is applied to pretrain the word vectors with the word2vec tool on the Medline article collection. With pretraining, the number of recursive training iterations and the sensitivity to the corpus can be effectively decreased.
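The text above confirms that NLTK part-of-speech parsing is used to keep noun phrases; the sketch below shows one way this query formulation step could look. The specific chunk grammar and the decision to keep adjectives inside noun phrases are illustrative assumptions, not our exact pipeline.

```python
import nltk
# Requires the NLTK data packages:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Illustrative noun-phrase chunk grammar: optional determiner,
# any adjectives, then one or more nouns.
GRAMMAR = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def formulate_query(question):
    """Keep only the noun-phrase parts of a question as query terms."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    tree = GRAMMAR.parse(tagged)
    terms = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # Drop determiners; keep the content words of each noun phrase.
        terms.extend(word for word, tag in subtree.leaves()
                     if tag.startswith("NN") or tag.startswith("JJ"))
    return " ".join(terms)

print(formulate_query("What is the treatment for infectious mononucleosis?"))
# -> roughly "treatment infectious mononucleosis"
```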
Recursive autoencoders and variants
The goal of the autoencoder is to combine a sequence of word vectors into a single vector of fixed dimensions. At each step, it encodes two adjacent vectors that meet a certain standard into one vector. For example, given a sequence x = (x1, x2, x3, x4, x5) in which (x2, x3) meets the standard, the autoencoder encodes (x2, x3) as a vector y1, yielding the shorter sequence x = (x1, y1, x4, x5). We call y1 the parent node of (x2, x3). After a few steps, the whole sequence is encoded as a single vector, and the trace of the encoder forms a tree structure. Fig 2 shows an instance of recursive autoencoders (RAE) with a list of word vectors x = (x1, …, xn) and a binary tree structure. We chose a binary tree instead of a parse tree because a parse tree is built according to fixed standards and can only encode vectors in a fixed pattern, whereas with a binary tree the pattern is selected by the neural network itself, which can continually find more suitable pairs of vectors to encode together. The tree structure can also be described with several triplets p → (c1, c2), where p is the parent node and c1, c2 are the children, such as (y1 → (x3, x4), y2 → (x2, y1), y3 → (x1, y2)). With the same neural network, the parent representation p can be computed from the children c1, c2 with:

p = tanh(W_e [c_1; c_2] + b_e)  (1)
where the concatenation of the two children [c_1; c_2] is multiplied by a matrix of parameters W_e. After adding a bias term b_e, tanh is applied as the activation function. A reconstruction layer is usually designed to validate the combination process by reconstructing the children with:

[c_1'; c_2'] = W_d p + b_d  (2)

Then, through comparisons between the reconstructed and the original children vectors, the reconstruction error can be computed as their squared Euclidean distance, as shown in:

E_rec([c_1; c_2]) = ||[c_1; c_2] − [c_1'; c_2']||^2  (3)
Fig 2
Illustration of an application of a recursive autoencoder to a binary tree.
The white nodes are utilized to calculate the reconstruction errors.
Now that the vector representation for a parent node p of two children (c1, c2) can be computed and the dimensions are the same, the full tree is constructed with the triplets and recursive combinations; as such, the reconstruction error at each nonterminal node is available. However, during the recursive process, the child nodes can represent different numbers of words and, thus, carry different importance for the overall meaning reconstruction. We therefore adopt the strategy of [39] and redefine the reconstruction error as:

E_rec([c_1; c_2]; θ) = (n_1 / (n_1 + n_2)) ||c_1 − c_1'||^2 + (n_2 / (n_1 + n_2)) ||c_2 − c_2'||^2  (4)
where n1 and n2 represent the numbers of words covered by c1 and c2, and θ stands for the parameters. To minimize the reconstruction errors of all child vector pairs in a tree, the tree structure can be computed through:

RAE_θ(x) = argmin_{y ∈ A(x)} Σ_{s ∈ T(y)} E_rec([c_1; c_2]_s; θ)  (5)
where A(x) stands for the set of all possible trees that can be built from an input Q-S pair x and T(y) denotes the nonterminal nodes of tree y. According to [39], a greedy approximation can simplify the tree construction. At each step, the "potential" parent node and reconstruction error of each pair of neighboring vectors are calculated, and the pair with the lowest reconstruction error is replaced by its parent node. This process is repeated until the sequence is encoded as a single vector, at which point an encoding tree has been completely constructed. This approximation captures single-word information to a large extent and does not necessarily follow syntactic constraints; it even crosses the boundary between the question and the snippet, which may help to decide whether a question and a snippet are naturally connected, and it also copes with Q-S pairs whose parts are of unequal or excessive length.
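A sketch of this greedy approximation, reusing the `encode` and `reconstruction_error` helpers (and numpy import) from the previous sketch, could look as follows.

```python
def greedy_rae_tree(vectors):
    """Repeatedly merge the adjacent pair with the lowest reconstruction
    error until a single fixed-size vector remains; returns that vector."""
    nodes = [(v, 1) for v in vectors]            # (vector, words covered)
    while len(nodes) > 1:
        errors = [reconstruction_error(nodes[i][0], nodes[i + 1][0],
                                       nodes[i][1], nodes[i + 1][1])
                  for i in range(len(nodes) - 1)]
        i = int(np.argmin(errors))               # cheapest adjacent pair
        parent = (encode(nodes[i][0], nodes[i + 1][0]),
                  nodes[i][1] + nodes[i + 1][1])
        nodes[i:i + 2] = [parent]                # replace the pair by its parent
    return nodes[0][0]
```

Because the merge order is chosen by error rather than by syntax, a merge may span the question/snippet boundary, which is the behavior described above.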
Semi-supervised modified RNNs for ranking
With the unsupervised RAE, the semantic vectors of Q-S pairs are generated. We extend the approach into semi-supervised RNNs to predict the semantic relations and rank the potentially relevant snippets for a question. The distributed vector representation of each parent node in the tree built by the RAE can also be regarded as features of the Q-S pair, so we leverage the vector representations by adding a simple softmax layer on top of each parent node to predict class distributions. This is a multi-task learning structure, with encoding as the main task and classification as the branch task. The classification layer affects the encoding results, pushing the encoder to generate vectors that are more suitable for classification, and thereby improves accuracy:

d(p; θ) = softmax(W_label p + b_label)  (6)

Fig 3 shows a unit of the modified RNNs at a parent node. Let d = (d1, d2), with d1 + d2 = 1, represent the predicted distribution over having and not having the Q-A relation, and let t = (t1, t2) be the target label distribution for one entry. Since the outputs of the softmax layer are conditional probabilities d = p(k | [c1; c2]), the cross-entropy error can be computed with:

E_cE(p, t; θ) = −Σ_{k=1}^{2} t_k log d_k(p; θ)  (7)
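Continuing the numpy sketches above (reusing `rng` and `d`), the per-node prediction layer of Eqs (6)-(7) can be written as follows; `W_label` and `b_label` are assumed names for the classification parameters.

```python
W_label = rng.normal(scale=0.01, size=(2, d))    # softmax layer, Eq (6)
b_label = np.zeros(2)

def predict_distribution(p):
    """d = softmax(W_label p + b_label): (P(Q-A relation), P(no relation))."""
    z = W_label @ p + b_label
    e = np.exp(z - z.max())                      # stable softmax
    return e / e.sum()

def cross_entropy_error(p, target):
    """E_cE = -sum_k t_k log d_k for the target distribution t, Eq (7)."""
    dist = predict_distribution(p)
    return -np.sum(target * np.log(dist + 1e-12))
```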
Fig 3
Illustration of a unit in modified RNNs at a nonterminal node.
The red nodes show the ranking error.
The training error for each entry can thus be computed as the sum of the errors over the nodes of the tree T:

E(x, t; θ) = Σ_{s ∈ T(x)} E(s)  (8)
where the error at each nonterminal node is the weighted sum of the reconstruction and cross-entropy errors:

E(s) = α E_rec([c_1; c_2]_s; θ) + (1 − α) E_cE(p_s, t; θ)  (9)

As mentioned, the modified RNNs are in charge of not only classifying the Q-S pairs but also ranking the candidate snippet answers according to their relevance. However, we have found that the same classification result may lead to different ranking results due to the influence among samples, which cannot be measured with the cross-entropy error. For example, suppose a question q has a relevant answer s1 and an irrelevant answer s2, so the target label distributions of (q, s1) and (q, s2) are (1, 0) and (0, 1), respectively. Assume that there are two classifiers with predictions (0.51, 0.49), (0.52, 0.48) and (1, 0), (0.99, 0.01). In terms of classification, the two classifiers produce the same result: the candidate snippets s1 and s2 are both judged relevant. The cross-entropy errors of the latter classifier are much larger than those of the former. However, if the top-1 answer is requested, the latter makes the correct selection. In fact, ranking accuracy matters much more than classification accuracy in this case.

The above instance indicates that the training error of each entry is influenced by the estimated probabilities of the other entries corresponding to the same question. Hence, we define the "ranking error" to represent the training error associated with the ranking process.

Assume that there is a set of top-N candidate snippets C = {x^(1), x^(2), …, x^(N)} for a biomedical question and a set of representation vectors for the Q-S pairs P = {p^(1), p^(2), …, p^(N)}. Let D = {d^(1), d^(2), …, d^(N)} be the set of output distributions, where d^(i) = d(p^(i); θ). To avoid confusion, we assume that x^(1), x^(2), …, x^(m) are relevant and the rest are irrelevant. The set of target label distributions is therefore L = {t^(1), t^(2), …, t^(N)}, with t^(i) = (1, 0) for i ≤ m and t^(i) = (0, 1) for i > m. According to the values of d^(i)_1, the rank r of the candidate snippets C can be computed through r = rank(D) = rank(d(P; θ)). In addition, m equals the number of entries with t^(i) = (1, 0), that is, m = count(L).

Mean Average Precision (MAP) is a global evaluation metric for ranking results, so the ranking error is defined as the negative logarithm of the MAP score:

E_rank(C, L; θ) = −log MAP(r, L)  (10)

As a result, the loss function corresponding to a question, E′(C, L; θ), can be computed with Eq (11), while the final objective and its gradient are respectively shown in Eqs (12) and (13):

E′(C, L; θ) = Σ_{i=1}^{N} E(x^(i), t^(i); θ) + E_rank(C, L; θ)  (11)

J(θ) = (1/|Q|) Σ_{(C, L)} E′(C, L; θ) + (λ/2) ||θ||^2  (12)

∂J/∂θ = (1/|Q|) Σ_{(C, L)} ∂E′(C, L; θ)/∂θ + λθ  (13)

With proper learning through the modified RNNs, the probability of a Q-A relation within a Q-S pair can be estimated from the output distributions. The candidate snippet answers are then ranked according to the estimated probabilities of the corresponding Q-S pairs, and the top-ranked snippets are predicted to be relevant.
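The listwise ranking error of Eq (10) can be made concrete with the sketch below, which ranks candidates by their predicted relevance probability d^(i)_1 and penalizes the list by −log of its MAP score (for a single question, MAP reduces to average precision). Variable names are illustrative, and the example reproduces the two classifiers discussed above.

```python
import numpy as np

def average_precision(relevance_sorted):
    """AP of a ranked list of 0/1 relevance labels."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance_sorted, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)

def ranking_error(probs, labels):
    """-log AP for one question: probs[i] is the predicted probability that
    candidate i has the Q-A relation; labels[i] is its gold 0/1 relevance."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ap = average_precision([labels[i] for i in order])
    return -np.log(ap + 1e-12)

# The two classifiers from the example: identical classification decisions,
# but only the second ranks the relevant answer s1 first.
print(ranking_error([0.51, 0.52], [1, 0]))   # ~0.69: s1 ranked second
print(ranking_error([1.00, 0.99], [1, 0]))   # ~0.00: s1 ranked first
```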
Experiments
Experimental evaluation
We evaluate the performance of our method on the biomedical literature collection from PubMed/Medline [40] and the benchmark question datasets of the six-year BioASQ challenges [20, 24]. The literature collection contains over 20 million records, each comprising an article title and abstract. The benchmark datasets contain questions intended to reflect real-life information needs encountered during the work, research or diagnoses of biomedical professionals. Moreover, each question should be independent, i.e., it should not contain any pronouns referring to entities mentioned in other questions. The ground truth for each question and the supporting information are also provided by these experts. The questions are categorized into four classes [41, 42]: (1) yes/no questions, which require a "yes" or "no" answer; (2) factoid questions, which require a particular entity (e.g., a disease, drug, or gene); (3) list questions, which require a list of entities; and (4) summary questions, which can only be answered by a short text summarizing the most prominent relevant information, e.g., "What is the treatment for infectious mononucleosis?". There are 3 batches in BioASQ 2013 and 5 batches in each of BioASQ 2014-2018, each batch containing 100 questions. For snippet retrieval, participants are asked to submit at most 10 relevant snippets extracted from the literature.
Algorithm comparison
We compare the performance of the proposed method with several strong baselines. Specifically, in order to validate the effectiveness of its components, we implement baselines that replace the vector representation in our model with several sentence models, including convolutional neural networks (CNN) [43, 44], recurrent neural networks (RNN) [45, 46], and long short-term memory (LSTM) [47], as well as the original RAE, to validate the necessity of the ranking error definition. Some classical IR models provided by an open-source search engine are also chosen as baselines to verify the use of classification, including query likelihood (QL), the sequential dependence model (SDM) and BM25 [48]. Moreover, the participating systems developed by the challenge winners of the six-year BioASQ challenges are also used as baselines.

For all experiments, the sets of candidate documents are retrieved based on a unified index construction and IR model provided by an open-source search engine, Galago (http://www.lemurproject.org/galago.php), with default settings.
Results
Comparisons with variants of our approach
We present the performances of our proposed approach (code available at https://github.com/lixuf/RAE-Recursive-AutoEncoder-for-bioasq-taskB-phaseA-snippets-retrieve-) and of the variants that replace the vector representation model with self-implemented convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM), and the RAE without the ranking error (RAE). Evaluated with the official BioASQ metric of Mean Average Precision (MAP), the comparison results are reported in Table 1. The results show that our approach performs better than all of the variants across all datasets. Specifically, using the names of the representation models to stand for the baselines: on BioASQ 2013, our method outperforms CNN, RNN, LSTM, and RAE by 36.2%, 30.0%, 26.8% and 18.6% over the 3 batches; on BioASQ 2014, CNN, RNN, LSTM, and RAE are improved upon by 59.4%, 49.6%, 46.5% and 18.9% on average; and on BioASQ 2015, the average improvements are 34.0%, 35.6%, 28.4% and 19.7%, respectively.

The loss function of the RAE, the Euclidean distance between the input vectors and the decoded vectors, is illustrated in Fig 4. The goal of each iteration of the RAE is thus to obtain decoded vectors with the highest similarity to the input vectors, that is, a hidden state (the encoded vector) that, when fed to the decoder, recovers vectors highly similar to the inputs. We can think of the encoder as a compression tool that compresses the input vectors into a single vector. It is worth noting that the structures of the encoder and the decoder must be consistent while their data flows are opposite, which is why the similarity between the decoded vectors and the input vectors can serve as a training criterion. We found that the input vectors x1, x2 are encoded into the vector y after a series of computations, and that y is then decoded back into vectors highly similar to x1, x2. The vector y can be decoded back into the input vectors because it retains most of their features, much as a word embedding does. The RAE can therefore retain as much of the local semantics as possible, which is exactly its goal.
Table 1
The MAP performances of our approach compared with the variants and classical IR models on BioASQ.
Dataset        Batch     Our      CNN      RNN      LSTM     RAE      QL       SDM      BM25
BioASQ 2013    Batch 1   0.0822   0.0642   0.0675   0.0694   0.0736   0.0564   0.0583   0.0546
               Batch 2   0.0631   0.0450   0.0486   0.0497   0.0523   0.0354   0.0372   0.0360
               Batch 3   0.0795   0.0559   0.0568   0.0582   0.0637   0.0536   0.0548   0.0527
BioASQ 2014    Batch 1   0.0892   0.0524   0.0568   0.0571   0.0783   0.0586   0.0650   0.0524
               Batch 2   0.0656   0.0478   0.0493   0.0506   0.0612   0.0465   0.0478   0.0450
               Batch 3   0.0795   0.0465   0.0483   0.0498   0.0624   0.0542   0.0563   0.0517
               Batch 4   0.0743   0.0482   0.0490   0.0503   0.0617   0.0510   0.0536   0.0493
               Batch 5   0.0668   0.0482   0.0476   0.0485   0.0523   0.0518   0.0523   0.0529
BioASQ 2015    Batch 1   0.0724   0.0374   0.0397   0.0416   0.0539   0.0386   0.0429   0.0378
               Batch 2   0.0931   0.0589   0.0603   0.0641   0.0685   0.0594   0.0648   0.0592
               Batch 3   0.1048   0.0762   0.0824   0.0863   0.0932   0.0856   0.0895   0.0873
               Batch 4   0.1056   0.0945   0.0938   0.0976   0.0960   0.0895   0.0928   0.0864
               Batch 5   0.1412   0.1190   0.1052   0.1131   0.1203   0.1178   0.1201   0.1165
Fig 4
The input vectors x1, x2 are the children described in the section on recursive autoencoders and variants, and the encoded vector y is the parent described in that section.
From the statistics and the analysis above, the vector representation model in our proposed approach is clearly more suitable for this task than the other vector representation models. Among these variants, RAE is substantially better than the others. Moreover, LSTM is slightly better than CNN or RNN, except on some individual batches. From our perspective, the CNN aims to capture the full depth of the input sentences with a global pooling operation, which is appropriate for learning global semantics, while the RNN and LSTM generate the semantic vectors with sequential models, which are usually utilized to predict the next word in a sequence. However, for the Q-S pairs, we are more concerned with the relations than with the precise semantics, which is supported by the CNN results. Additionally, from the comparisons between CNN and RNN/LSTM, we find that sequentiality is beneficial to the judgment of Q-A relations to a certain degree but is still not the decisive factor. Moreover, the results of RAE demonstrate the significance of maintaining the local semantics for accurate judgments. Finally, the comparisons with RAE validate our novel design of taking the "ranking error" into consideration in the loss function computation.
Comparisons with classical IR models
As mentioned above, the entire process of answer matching and ranking can be regarded as snippet retrieval, so we also compare our approach with classical IR models, including the query likelihood model (QL), BM25 and the sequential dependence model (SDM). The exact MAP scores on BioASQ 2013-2015 are shown in Table 1, and Tables 2-7 show the MAP scores on BioASQ 2013-2018. The statistics in Table 1 indicate that our approach extensively outperforms QL, SDM and BM25 on all batches of the three-year datasets. Compared to QL, the average improvements on BioASQ 2013-2015 are respectively 54.6%, 43.2% and 32.3%; our approach exhibits a clear advantage over SDM, with average improvements of 49.6%, 36.5% and 26.1%; and the average improvements are even larger with respect to BM25, at 56.9%, 49.4% and 33.5%, respectively.
Table 2
Comparisons with BioASQ 2013 participants.
System     Batch 1   Batch 2   Batch 3
our        0.0822    0.0631    0.0795
Wishart    -         0.0360    -
BAS 100    0.0578    0.0337    0.0537
BAS 50     0.0512    0.0272    0.0527
Table 7
Comparisons with BioASQ 2018 participants.
System      Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
our         0.1189    0.1628    0.1950    0.1102    0.0895
MindLab     0.0004    0.2736    0.2217    0.1413    0.1006
ustb_prir   0.1209    0.1731    0.2021    0.1216    0.0967
testtext    0.1151    0.1463    0.2021    0.1213    0.0861
aueb        0.1684    0.3187    0.3320    0.2138    0.1147
Among these IR models, SDM performs better than QL and BM25. This improvement is mainly because SDM focuses on the sentence structure of queries and documents: possible phrases in sentences are considered through the exact-phrase and unordered-window features during retrieval, which is similar to the preservation of local semantics in our approach. QL and BM25 are mainly based on the term distributions in queries and documents, which lack any consideration of semantics. Therefore, our proposed approach is more suitable than these IR models for retrieving the relevant snippets for biomedical questions.

Unlike the preceding year, quite a few teams participated in BioASQ 2014 [20], and most of the submitted systems performed well. The performances of our approach and the challenge winners are shown in Table 3. The Wishart team utilized a strategy similar to its BioASQ 2013 one. The NCBI team's framework used the cosine similarity between questions and sentences to measure their relatedness. The HPI team relied on the HANA database and BioPortal to retrieve biomedical concepts and merged the concepts to retrieve the snippets.
Table 3
Comparisons with BioASQ 2014 participants.
System                  Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
our                     0.0892    0.0656    0.0795    0.0743    0.0668
Wishart                 0.0364    0.0379    0.0574    0.0503    0.0476
main system             0.0095    0.0062    -         -         -
Biomedical Text Ming    0.0296    -         0.0215    0.0240    0.0195
BAS 100                 0.0608    0.0319    0.0486    0.0549    0.0544
BAS 50                  0.0601    0.0313    0.0480    0.0539    0.0539
HPI-S1                  -         0.0482    0.0517    0.0300    -
In BioASQ 2015, semantic vectors were first applied among the participants (the ustb_prir team) [49] to look up synonyms of the keywords in queries and select effective terms for query expansion. The oaqa team [50] proposed a collective reranking model with supervised learning. The qaiiit team [51] performed snippet extraction based on the similarity between the queries and the top 10 sentences of the retrieved documents. The evaluation results are shown in Table 4.
Table 4
Comparisons with BioASQ 2015 participants.
System      Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
our         0.0724    0.0931    0.1048    0.1056    0.1412
ustb_prir   0.0797    0.0776    0.1840    0.2005    0.2410
qaiiit      0.0789    0.1159    -         0.1415    -
HPI         0.0971    0.0719    0.1269    0.1627    0.1341
testtext    0.0752    0.0817    0.1128    0.2070    -
oaqa        -         -         0.1969    0.2092    0.2196
fdu         -         -         0.1166    0.2480    0.2424
In BioASQ 2016, HPI-S1 [52] was based on the existing NLP functions of an in-memory database (IMDB), extended with a new process specific to QA. KNU-SG [53] proposed a system using a cluster-based language model. WS4A [54] proposed a novel approach consisting in the maximal exploitation of existing web services. The evaluation results are shown in Table 5.
Table 5
Comparisons with BioASQ 2016 participants.
System               Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
our                  0.1247    0.1392    0.1701    0.2300    0.2401
KNU-SG Team_Korea    0.1365    0.1590    0.1693    0.2305    0.2386
ustb_prir            0.0700    0.0884    0.1127    0.2298    0.2250
testtext             0.0641    0.0774    0.1069    0.1834    0.1694
fdu                  -         0.1870    0.2214    0.2365    0.2882
HPI                  0.1601    -         0.1696    -         0.2049
In BioASQ 2017, Brokos et al. [55] proposed a retrieval method that represents documents and questions as weighted centroids of word embeddings and reranks the retrieved documents with a relaxation of the Word Mover's Distance. USTB_PRIR [56] introduced different multimodal query processing strategies to enrich query terms and assign different weights to them. The evaluation results are shown in Table 6.
Table 6
Comparisons with BioASQ 2017 participants.
System               Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
our                  0.1402    0.1598    0.1524    0.1726    0.1847
KNU-SG Team_Korea    0.1393    0.1734    0.1411    0.1385    -
ustb_prir            0.1747    0.2598    0.2727    0.2423    0.2090
testtext             0.1585    0.2523    0.3500    0.2465    0.1843
fdu                  -         0.1711    0.3183    0.1436    0.1170
In BioASQ 2018, MindLab [26] proposed a model making use of semantic similarity patterns that were evaluated and measured by a convolutional neural network architecture. AUEB [8] used novel extensions to deep learning architectures. The evaluation results are shown in Table 7.

From the statistics in the tables, we can see that our approach improves on the best participating systems in BioASQ 2013-2015 by 52.4%, 36.1% and 18.0%, respectively. In BioASQ 2016-2018, our model performed close to the best competitors and even prevailed in some batches. The smaller improvements do not indicate a decline in robustness; the ultimate cause is the other systems' introduction of extra resources. For example, introducing a pretrained document retrieval model at the document retrieval stage not only retrieves more comprehensively but also greatly reduces the probability of retrieving useless documents. With the more primitive search tool provided by PubMed, the top 100 relevant documents contain an average of only 4.3 target documents, which can hardly be called a comprehensive search; at the same time, introducing so many useless documents brings a large error to the classification model. If the extra resource of a pretrained document retrieval model is introduced, the ratio of useful to useless documents can be brought to 1:2 or even better during the retrieval phase. Especially after BioASQ 2015, most of the systems based on extra resources incorporate a large amount of biomedical domain knowledge.

In addition, extra language resources such as UMLS can help a model better compute the relationship between the question and the passage through the connections between medical concepts or vocabulary [34]. In our analysis, word frequency is used to represent the degree of specialization of a term: Q-S pairs containing terms with a frequency below 15 are selected and looked up in UMLS; the selected Q-S pairs, encoded with our model, are fed into a classification model containing an attention layer, and the attention weights are output. After normalizing the attention weights, the degree of influence of each word in a Q-S pair on the final classification result can be obtained as a value between 0 and 1, with a larger value indicating a greater influence. We found that more than half of these highly specialized terms have a low impact on the final result, yet such specialized vocabulary, and the many professional concepts associated with it, are the key to a correct answer. In summary, the proposed modified RNNs represent a practical approach to retrieving relevant snippets for biomedical questions compared with the state-of-the-art [57] BioASQ participants.
Significance testing and experimental analysis
To report effect sizes and confidence intervals more informatively, we performed two-sided paired t-tests between our approach and each self-implemented baseline, including the variants and the IR models, over all 13 batches, following Sakai's guidelines for significance testing [58]. According to the two-sided paired t-test for the difference in means (with the unbiased estimate of the population variance V = 0.0008), our approach statistically significantly outperforms CNN (t(13) = 3.1946, p < 0.0077, ES = 0.8860, 95% CI [0.0079, 0.0418]). The exact results of the other comparisons are shown in Table 8. We can observe that all p-values are less than 0.01.
Table 8
Two sided paired t-test results on our approach with the baselines.
Baseline   d̄        t0       p(<)     ES       95% CI
CNN        0.0249   3.1946   0.0077   0.8860   [0.0079, 0.0418]
RNN        0.0240   3.1779   0.0080   0.8814   [0.0075, 0.0405]
LSTM       0.0216   3.1602   0.0082   0.8765   [0.0067, 0.0365]
RAE        0.0138   3.1337   0.0086   0.8691   [0.0042, 0.0235]
QL         0.0245   3.2535   0.0069   0.9024   [0.0081, 0.0410]
SDM        0.0217   3.2567   0.0069   0.9033   [0.0072, 0.0362]
BM25       0.0258   3.2427   0.0071   0.8994   [0.0085, 0.0431]
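For reference, a comparison like a row of Table 8 can be computed with a standard two-sided paired t-test, as in the sketch below. The per-batch MAP arrays here are hypothetical placeholders rather than our actual scores, and the effect size is computed as Cohen's d for paired samples, which we assume corresponds to the ES column.

```python
import numpy as np
from scipy import stats

# Hypothetical per-batch MAP scores for two systems over the same batches;
# substitute the actual 13 per-batch values to reproduce a row of Table 8.
ours     = np.array([0.082, 0.063, 0.080, 0.089, 0.066])
baseline = np.array([0.064, 0.045, 0.056, 0.052, 0.048])

t, p = stats.ttest_rel(ours, baseline)        # two-sided paired t-test
diff = ours - baseline
es = diff.mean() / diff.std(ddof=1)           # Cohen's d for paired samples
ci = stats.t.interval(0.95, len(diff) - 1,
                      loc=diff.mean(),
                      scale=stats.sem(diff))  # 95% CI of the mean difference
print(f"t = {t:.4f}, p = {p:.4f}, ES = {es:.4f}, 95% CI = {ci}")
```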
Through the above comparisons and the statistical significance testing, we can conclude that our approach outperforms the other vector representation models, the IR models and state-of-the-art BioASQ participants. Several factors lead to these improvements. First, our approach aims to discover the semantic relations by classifying the Q-S pairs, while the purpose of classical IR models and of some BioASQ participants is to measure the similarities of term distributions or surface semantics between the question and the candidate answers. Second, during the vector representation process, our approach retains as much of the local semantics as possible, which benefits the classification of Q-S pairs.

A typical model obtains results by measuring the similarities of term distributions [53]. Although that work proposed six aspects for measuring the similarity between questions and answers, such a design is difficult to make comprehensive, and it requires considerable expertise and experimentation to determine which aspects to use. Our model instead encodes the Q-S pair automatically rather than relying on term-distribution similarities: it needs less expertise and experimentation, lets the neural network select the required information automatically, and achieves better results than that model. Another paper [26] also uses term-distribution similarities, but proposes a similarity matrix generated from part of speech and similarity. The matrix form increases their computing speed, and they also have a document retrieval model that provides more accurate related documents than general search engines, which is one of our weaknesses. Another disadvantage of our model is that we do not take full advantage of part of speech, position and similarity, while they do. These two methods suggest directions for improvement: combining our method with theirs may achieve better results, but would require more computing power and manpower.
Conclusion
This paper studies the problem of answer matching and ranking issues for biomedical question answering with respect to a modified RNNs model. Our approach features the following novelties. (1) The proposed model successfully converts a snippet retrieval problem for biomedical questions into several classification tasks judging the semantic relations between biomedical questions and the candidate snippets. (2) The modified RNNs proposed a brand new definition—“ranking error”—in the loss function computation, which makes the conventional recursive neural networks more suitable for a ranking problem. (3) The proposed approach provides a simple but effective snippet retrieval proposal for the development of a biomedical question answering system. As relevant issues for future work, there are two directions. One direction is to extend our model to the semantic search of short text within the open domain. The other is to popularize the “ranking error” to make other classification models suitable for ranking.(RAR)Click here for additional data file.6 Jan 2020PONE-D-19-28201List-wise Learning to Rank Biomedical Question-Answer Pairs with Deep Ranking Recursive AutoencodersPLOS ONEDear Miss. Yan,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.The reviewers raised some important issues that need to be fully addressed, namely:1 - According to PLOS ONE guidelines: “if the manuscript’s primary purpose is the description of new software or a new software package, this software must be open source, deposited in an appropriate archive, and conform to the Open Source Definition.”2 - Give more details about the methods used so others can replicate the analysis, include error analysis and examples3 - Include in the analysis the BioASQ 16-18 datasets and systems.We would appreciate receiving your revised manuscript by Feb 20 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocolsPlease include the following items when submitting your revised manuscript:A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. 
The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.We look forward to receiving your revised manuscript.Kind regards,Francisco M CoutoAcademic EditorPLOS ONEJournal Requirements:When submitting your revision, we need you to address these additional requirements:1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf2. Thank you for stating the following in the Financial Disclosure section:"The author(s) received no specific funding for this work."We note that one or more of the authors are employed by a commercial company: Alibaba Group.a. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.Please also include the following statement within your amended Funding Statement.“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.b. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.c. Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. 
PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No
Reviewer #2: Yes
Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No
Reviewer #2: Yes
Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript provides an approach to snippet retrieval for biomedical question answering using recursive autoencoders (RAE). The authors evaluated their approach on BioASQ datasets and compared it to other deep learning architectures such as CNNs, RNNs and LSTMs, as well as to BioASQ participants. However, its conclusions are not well supported, since the comparisons are not made against the current state of the art. Still, I believe that if the appropriate comparisons are made (i.e. evaluate on the most recent BioASQ datasets and compare with the results obtained by the more recent systems) and the results are discussed in more detail, the idea is interesting and has potential.

Strengths:
- statistical comparison with other approaches
- evaluation on multiple BioASQ datasets
- detailed explanation of the proposed method with figures

Weaknesses:
- The results are not well supported. To really understand why your method outperforms all the others, I would need more than MAP scores, for example, questions that other methods got wrong and your method got right. Other aspects such as inference time would also be interesting.
- The discussion does not go in depth about what makes your method "more suitable than other vector representation models to a large extent" or why it "retains as much of the local semantics as possible".
- Why not show the results for BioASQ 16-18? You also claim in the abstract that your approach was evaluated on two other open-domain tasks, but this is never mentioned in the text.
- Furthermore, you also ignored other approaches to the snippet ranking task that have been proposed since the original tasks. For example, this paper: https://www.sciencedirect.com/science/article/pii/S1532046417300503 (Table 2) shows results for the BioASQ 2015 snippet retrieval task superior to yours on every batch.
- Since the code is not provided, it would be difficult to test on other datasets or reproduce the results.
- Please make clear the distinction between recurrent neural networks and recursive neural networks. You end up using RNN for both, while generally RNN is used for recurrent neural networks.
- There are occasional typos and language errors that should be revised, for example Line 250 "re relevant", while some expressions do not seem to be standard English: Line 147 "From our points".

Reviewer #2: This paper presents a model for biomedical question answering using a neural network architecture. The proposed model uses a recursive autoencoder aggregated over a binary tree representation of the sequence, and a recurrent neural net is trained with a ranking loss over candidate word spans. Results are reported on three years of BioASQ challenges, comparing against different encoding methods and competing systems at the challenge.

Strengths:
- The paper presents an extensive evaluation on the BioASQ challenge.
- The results outperform baseline systems significantly. This is presented through statistical testing and comparison with multiple baselines.
- Baselines range from simple IR systems to other system implementations in the challenge.

Weaknesses:
- There is no error analysis documenting the reasons why the proposed system performs better.
- It is unclear if the code will be publicly available.
- It is unclear how the system will perform outside the biomedical domain and on datasets other than the BioASQ task.
- The paper needs some proofreading for English.
- Why is a binary tree better than a parse tree for the RAE?

Overall I think this is good work, but it needs to better convince the reader of the reasons why the proposed model outperforms existing methods.

Reviewer #3: This paper addresses the problem of extracting and ranking the relevant snippets in the context of question answering. To achieve this goal, the authors propose to capture the semantic relation between the question and the snippet and introduce a ranking error. The paper is well written and the method is clearly described. The state of the art is well covered.
The authors may also mention work related to querying biomedical linked data, even if NLP is mainly involved in the question processing. The method is evaluated against standard QA sets from the BioASQ challenges, and significance testing is performed. However, the results could be better analysed, especially to identify whether the improvements are related to better ranking or to better snippet extraction. A discussion of the limits of the method would also be welcome. There is a regular typo throughout the article: many times, a space character is wrongly added before commas or periods (even in the affiliation part).

line 250: re -> are

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Andre Lamurias
Reviewer #2: No
Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

20 Feb 2020

The reviewer's comments have been answered item by item in the attachment, and the response has been uploaded as a separate file.

Submitted filename: Response to Reviewers.pdf

19 May 2020

PONE-D-19-28201R1
List-wise Learning to Rank Biomedical Question-Answer Pairs with Deep Ranking Recursive Autoencoders
PLOS ONE

Dear Ms. Yan,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. There are still important issues raised by the reviewers that need to be addressed, namely inconsistencies in the MAP scores, lack of examples/error analysis, an incomplete comparison on BioASQ 2015, and grammar errors.

We would appreciate receiving your revised manuscript by Jul 03 2020 11:59PM.
When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that, if applicable, you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as a separate file and labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as a separate file and labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. This file should be uploaded as a separate file and labeled 'Manuscript'.

Please note, while forming your response, that if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,
Francisco M Couto
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)
Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No
Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #4: Yes

**********
5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No
Reviewer #4: No

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors addressed some of my issues properly; namely, they made the code open source, explained the advantages of their method in more detail, provided results for more recent editions of BioASQ, and mentioned post-competition results of other groups. However, some issues were not properly addressed and others were introduced with this revision:

- Examples/error analysis: unlike the teams that participated in the competition, it is possible to analyse the test set and give examples where the system performed well and where it failed. This is helpful to improve future approaches, but it is still not present in this version.

- Proper comparison to other methods: although the authors now show the results of their method on more recent editions of BioASQ, they justify the lower performance of the proposed method on the recent editions this way: "The decreases of improvements do not indicate the decline of robustness. The ultimate causes are the introduction of extra resources. Especially after BioASQ 2015, most of the systems based on extra resources contain a large amount of domain knowledge in biomedicine." This statement is vague and unfounded: what do the authors mean by "extra resources"? The description provided in the manuscript of the other teams' systems mentions other resources used in the previous editions too: "Hana Database and BioPortal to retrieve biomedical concepts" and "look up the synonyms of the keywords in queries to select effective terms for query expansion". A more in-depth analysis is necessary to understand if the extra resources were really the cause of the improved performance of recent systems.

- Furthermore, although it is mentioned in the related work section, the results of the proposed method on BioASQ 2015 are not compared with these: https://www.sciencedirect.com/science/article/pii/S1532046417300503

- English should be revised, particularly the text that was added since the previous version.

- Tables 5, 6 and 7: it is misleading to have the values of the first row bolded. I would suggest highlighting the highest value of each column instead.

Reviewer #4: This paper presents an approach for the automated selection of relevant snippets for answering biomedical questions. The authors propose a deep learning approach based on ranking question-snippet pairs, introducing a "ranking error" loss function for this task. Different experiments are presented comparing the performance of the proposed approach to various baselines based on classic IR models (implemented by the authors) and to some existing systems participating in the BioASQ challenge.
Their system outperforms the IR baselines and achieves decent results compared with the BioASQ participants. I consider two main improvements necessary for this manuscript:

1) There are inconsistencies in the MAP scores of BioASQ participants reported in tables 2 to 7. A thorough check is needed to guarantee that the numbers are correct.

1.a) For example, the MAP scores in table 4 are significantly lower than the official results: http://participants-area.bioasq.org/results/3b/phaseA/. This table even includes a score in the second batch for the oaqa system, which didn't participate in that batch. In table 5, batch 5, the KNU team has the score of the ustb team and vice versa.

1.b) It is not clear which teams are selected to be presented in the tables. For example, in table 4 (batches 4 and 5) and table 5 (batches 2, 3, 4, and 5) the winning fdu team (with the highest MAP score) is not included in the table. The same with the aueb team in table 7.

1.c) In addition, even for teams included in the tables, the reported MAP score is not always the best score achieved by the team. For example, in table 6, batch 1, the score 0.1620 is reported, which corresponds to utsb_prir3, while utsb_prir2 achieves 0.1774.

2) There are various expression and orthographic issues in the manuscript that need to be checked. Some examples:

line 46: improvements of performances, -> improvements of performance,
line 88: So the extraction of documents is a great challenge -> So the extraction of snippets is a great challenge
line 102: Like the BioASQ challenge, participants of the BioNLP -> Similarly to participants of the BioASQ challenge, participants of the BioNLP...
line 122: Sarrouti etc. [34] proposed using temmed words -> Sarrouti and El Alaoui [34] proposed the use of stemmed words
line 153: From our perspectives, -> From our perspective,
line 182: on evey input questions and remove all non-noun phrase(NNP) part -> on every input question and remove all non-noun phrase (NNP) parts
line 186: oftenly. -> often.
line 187: since search engine tend to retrieve -> since search engines tend to retrieve
line 189: works best then leave nouns only -> works better than leaving nouns only
line 192: the semantic vectors of words are requested -> the semantic vectors of words are required
line 194: Medline Articles collection -> Medline article collection
line 206: standards which and we can -> (needs rephrasing)
line 208: more suitably and easier to encode vectors together -> (needs rephrasing)
line 251: distribution of with and without -> distribution with and without
line 309: "What -> ``What
line 313: Comparison Algorithms -> Algorithm Comparison (or Comparison of Algorithms)
line 350: after a series of compute -> (needs rephrasing)
line 352: The reason why the vector y can returned to the input vectors is the vector y has most -> The reason why the vector y can be returned to the input vectors is that the vector y has most
Fig 4. Input vector x1; x2 is the children mentioned in Recursive Autoencoders and Variants section, and encoded vector y is the parent mentioned in Recursive Autoencoders and Variants section. -> Fig 4. The input vectors x1; x2 are the children mentioned in Recursive Autoencoders and Variants section, and the encoded vector y is the parent mentioned in Recursive Autoencoders and Variants section.
line 377: other teams and our exact MAP scores -> (needs rephrasing)
line 456: The reason why they get a better result mainly because -> They get a better result mainly because

**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Andre Lamurias
Reviewer #4: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

6 Jul 2020

Thank you for your responsible attitude and sincere suggestions. All comments were carefully answered in the "response to reviewers". Thanks again.

Submitted filename: response to reviewers.pdf

12 Oct 2020

PONE-D-19-28201R2
List-wise Learning to Rank Biomedical Question-Answer Pairs with Deep Ranking Recursive Autoencoders
PLOS ONE

Dear Dr. Yan,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The manuscript was recommended for minor revision, so please make changes according to the suggested comments and do a final proofreading of the document. Optionally, if you find it relevant, you can update your results according to a recent corpus that also improved BioASQ IR tasks: https://ieeexplore.ieee.org/document/9184044

Please submit your revised manuscript by Nov 26 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,
Francisco M Couto
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #4: This paper presents an approach for the automated selection of relevant snippets for answering biomedical questions.
As in the original version of the manuscript, the authors present experiments comparing their deep learning approach, which introduces a "ranking error" loss function for ranking question-snippet pairs. Considerable improvement has been made on the two issues highlighted previously. However, there are still some minor points that need the attention and action of the authors.

1) The main inconsistencies in the MAP scores of BioASQ participants presented in tables 2-7 have been removed. However: a) there is an error in table 6, batch 3, as the results reported are for a different task (document retrieval) instead of snippet retrieval, as done for the other batches and tables. b) The performance of the top participating systems is still missing in some tables/batches (in particular: table 3, batch 2, HPI-S1, MAP 0.048; table 4, batch 1, HPI-S2, MAP 0.0971; table 4, batch 3, oaqa, MAP 0.1969; table 4, batch 4, fdu2, MAP 0.2480; table 4, batch 5, fdu2, MAP 0.2424; table 5, batch 1, HPI-S2, MAP 0.1601).

2) There are still some syntactic issues in English that should be carefully re-checked.

In addition, a minor formatting error exists in Table 1, where the value 0.1203 is bolded instead of 0.1412 in BioASQ 2015, batch 5.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #4: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

24 Oct 2020

We are very grateful for your comments and opinions, and apologize for our mistakes.

Submitted filename: Response to reviewers.pdf

27 Oct 2020

List-wise Learning to Rank Biomedical Question-Answer Pairs with Deep Ranking Recursive Autoencoders
PONE-D-19-28201R3

Dear Dr. Yan,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up to date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Francisco M Couto
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

29 Oct 2020

PONE-D-19-28201R3
List-wise Learning to Rank Biomedical Question-Answer Pairs with Deep Ranking Recursive Autoencoders

Dear Dr. Yan:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Mr. Francisco M Couto
Academic Editor
PLOS ONE