
ITEXT-BIO: Intelligent Term EXTraction for BIOmedical analysis.

Rodrique Kafando1,2, Rémy Decoupes1,2, Sarah Valentin2,3,4, Lucile Sautot2,5, Maguelonne Teisseire1,2, Mathieu Roche2,3.   

Abstract

Here, we introduce ITEXT-BIO, an intelligent process for biomedical domain terminology extraction from textual documents and subsequent analysis. The proposed methodology consists of two complementary approaches, including free and driven term extraction. The first is based on term extraction with statistical measures, while the second considers morphosyntactic variation rules to extract term variants from the corpus. The combination of two term extraction and analysis strategies is the keystone of ITEXT-BIO. These include combined intra-corpus strategies that enable term extraction and analysis either from a single corpus (intra), or from corpora (inter). We assessed the two approaches, the corpus or corpora to be analysed and the type of statistical measures used. Our experimental findings revealed that the proposed methodology could be used: (1) to efficiently extract representative, discriminant and new terms from a given corpus or corpora, and (2) to provide quantitative and qualitative analyses on these terms regarding the study domain.
© The Author(s) 2021.

Keywords:  Biomedical terminology; Intelligent analysis; Terminology extraction

Year:  2021        PMID: 34276970      PMCID: PMC8272612          DOI: 10.1007/s13755-021-00156-6

Source DB:  PubMed          Journal:  Health Inf Sci Syst        ISSN: 2047-2501


Introduction

The usefulness of terminology extraction from corpora is clearly acknowledged as it has generated a great deal of research and discussion. This well-established process is used in natural language processing and has led to the development of several tailored tools such as TBXTools [31], TermSuite [9], BioTex [22], etc. Based on [22], our proposal deals with domain-based terminology extraction from heterogeneous corpora, and how to efficiently generate a quantitative and qualitative analysis. To this end, we propose a generic methodology hinged on a combination of extraction and analysis strategies. Term extraction strategies are based on combinations of linguistic, statistical measures, and corpus segmentation approaches, while analysis strategies are based on combinations of extracted terms. Based on the combined strategies, ITEXT-BIO aims to extract: (1) representative terms, (2) discriminant or relevant terms, and (3) new relevant terms from a corpus or corpora. These strategies are specifically useful for dedicated tasks, such as corpus analysis, specific domain monitoring (e.g. epidemiology) or scientific research monitoring. This paper is organized as follows. In Section Related work, we briefly present the state-of-the-art related to terminology extraction. Section Dataset description details the dataset dedicated to scientific papers. Sections Methodology and Experiments respectively provide an overview of our proposal and the experiments. In Section Case study: epidemic intelligence, we illustrate the genericity of the proposal by presenting a case study of an implementation of the combined strategies for epidemiological intelligence analysis. We conclude in Section Conclusion by presenting some perspectives for future studies.

Related work

Domain terminology extraction is a major focus of interest and discussion in natural language processing (NLP) research. It has prompted several proposals of methodologies [20, 32, 34, 36] geared towards effective extraction of terms within a given corpus. Also known as automatic term extraction (ATE), this task is considered in various NLP applications, such as information retrieval [2, 4, 11, 37], topic modeling [15, 42], domain-based monitoring [1, 19, 27], keyword extraction [7], summarization [2], ontology acquisition and thesaurus construction. According to [23], term extraction techniques can be categorized under four approaches: linguistic, statistical, machine learning and hybrid. Overall, linguistic approaches take morphosyntactic part-of-speech (POS) rules into account to describe terms with common structures [5]. Statistical approaches use statistical measures such as term frequency [35, 43], or measures of co-occurrence between words and phrases, such as Chi-square [26]. Machine learning approaches use statistical measures and are mainly focused on term extraction [7, 8, 12], classification [41] and summarization [2]. They combine linguistic and statistical approaches to extract terms from textual data in order to build machine learning models. In [7], the authors highlighted that most of these tasks are tackled with unsupervised learning algorithms. Hybrid approaches include, for instance, the C_Value [34] and C/NC_Value [13] methods, which combine statistical measures and linguistic rules to extract multi-word and nested terms. In [6, 30], the authors combine rule-based methods and dictionaries to extract terms from Spanish biomedical texts and specialised Arabic texts, respectively. Studies such as [18, 21] related to these latter approaches have revealed the effectiveness and high performance of hybrid term extraction approaches. The proposed methodologies apply to several domains.
In [22], the authors proposed BioTex, a linguistic and statistical measure-based tool to extract terms related to the biomedical domain. The same approach was used in [1] to detect terms or signals for infectious disease monitoring on the web. In [28], a hybrid methodology was proposed to extract terminology from electronic health records. This hybrid approach was also adapted by [44] to extract concepts related to Chinese culture. Overall, related studies have focused on techniques and methods for term extraction, mainly from corpora. Based on existing methodologies, we oriented our study towards developing an efficient approach for term extraction from heterogeneous corpora, along with a set of combined strategies to analyze these terms in the biomedical domain. Our methodology combines and tailors linguistic and statistical criteria associated with structural information in texts in order to highlight relevant terms therein. The presented strategies also aim to overcome the time-consuming issues related to machine learning methods, which require manually annotated or partially annotated data.

Dataset description

Our study focused on the COVID-19 Open Research Dataset [40], which contains scientific papers on COVID-19 and related historical coronavirus research. Throughout this study, we refer to the dataset as COVID19-MOOD-data. The COVID19-MOOD-data dataset is divided into two main corpora, named Papers1 and Papers2. Papers1 contains the commercial use subset (includes PubMed Central content), while Papers2 contains the commercial use subset (includes PubMed Central content), the non-commercial use subset (includes PubMed Central content) and the custom license subset. Three data pre-processing operations are performed per corpus (Papers1, Papers2) in order to create three corpora according to the title, abstract and content. We named them PapersX-title, PapersX-abstract and PapersX-content, respectively. See Table 1 for further details and Table 2 for the acronym definitions.
Table 1

Statistics related to the COVID19-MOOD-data dataset

Corpus | NB_d(C) | NB_M(d) | std(c)
Papers1
Papers1-title | 9315 | 15 | ± 8
Papers1-abstract | 9315 | 180 | ± 94
Papers1-content | 9315 | 4639 | ± 359
Papers2
Papers2-title | 32322 | 13 | ± 10
Papers2-abstract | 32322 | 168 | ± 88
Papers2-content | 32322 | 4913 | ± 720
Table 2

Table legend

Abbreviation | Description
NB_d(C) | Number of documents in the corpus
NB_M(d) | Average number of words of a document in the corpus
std(c) | Corpus standard deviation
NN | Noun
NNNN | Matches singular and plural noun terms
JJ | Adjective
NP | Proper noun

Title represents the corpus that contains only paper titles; Abstract represents the corpus that contains only paper abstracts; Content represents the corpus that contains only paper contents.
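
The corpus statistics reported in Table 1 can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and a population standard deviation, neither of which is stated in the text; the `papers` records and the `corpus_stats` helper are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev


def corpus_stats(documents):
    """Return (NB_d(C), NB_M(d), std(c)) for a list of document strings.

    Assumption: words are obtained by splitting on whitespace and the
    spread is the population standard deviation of document lengths.
    """
    lengths = [len(doc.split()) for doc in documents]
    return len(documents), mean(lengths), pstdev(lengths)


# Hypothetical toy records; the field names are illustrative only.
papers = [
    {"title": "Zika virus outbreak",
     "abstract": "We study the Zika virus outbreak in detail"},
    {"title": "Influenza virus",
     "abstract": "Influenza virus infection and immune response"},
]

# One corpus per paper part, mirroring PapersX-title and PapersX-abstract.
title_corpus = [p["title"] for p in papers]
abstract_corpus = [p["abstract"] for p in papers]

print(corpus_stats(title_corpus))
print(corpus_stats(abstract_corpus))
```

Building one list per paper part keeps the Title, Abstract and Content corpora aligned document by document, so the same statistics function applies to each of them.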

Methodology

Here we outline two complementary term extraction and analysis approaches: the free term extraction approach and the driven term extraction approach. The first one is based on a combination of the type of corpus and the statistical measures, while the second is based on a combination of the type of corpus and the morphosyntactic variation rules.

The free term extraction approach

The free term extraction approach seeks to ensure that users can extract significant terms related to a specific domain from a given corpus. As mentioned in Section Related work, several tools have been proposed for term and concept extraction. We opted for the BioTex tool to support the free term extraction mode for several reasons: (1) BioTex was initially built for medical domain term extraction; (2) BioTex uses hybrid measures (linguistic and several statistical measures) for the term extraction process; (3) most existing tools (e.g. Maui-indexer, Topia Termextract, KEA, etc.) are designed for keyword extraction within single documents and only function for English-language documents, while BioTex is tailored for terminology extraction and supports sets of documents (corpora) and multi-language use.

Three essential parameters related to the BioTex tool are defined below: (1) a corpus: this is the data source from which terms are extracted; (2) a statistical measure: as mentioned above, the BioTex processing approach is based on linguistic and statistical measures. The linguistic parameter is defined by default, but the user must choose one of the available statistical measures in order to run the term extraction process; (3) the number of words to be extracted per term: expressed as n-grams, this concerns the length of the extracted terms and ranges from 1-grams to 4-grams in BioTex.

In addition to these parameters, there is the number of linguistic patterns (like NN NN, JJ NP NP, NN NP NP, etc.) that can be associated, but this is preset at 20 by default in BioTex. BioTex also includes patterns for verb terms, such as NN VBD NN NN, NP NN VBD NN NP, etc. Figure 1 outlines the overall three-step process for free term extraction. At the end of the BioTex process, extracted terms are classified in two sets: TermSet, which only contains single-word terms (SWT), and MultiTermSet, which contains multi-word terms (MWT).

Fig. 1

The Free and Driven process for term extraction using BioTex and FASTR

By using the Driven Extraction process (with FASTR), we can capture the entire term for a given incomplete one obtained during the first step (Free Extraction). The Driven Extraction step uses incomplete terms to capture the entire terms in the document. For example, if the terms “higher risk acute” or “higher risk area” are extracted in the Free Extraction step, an entire term such as “higher risk acute care area” will be obtained during the Driven Extraction process.

The driven term extraction approach

This extraction approach seeks to ensure that the terms extracted using BioTex can be used to improve the domain terminology. From a given term, the process aims to extract variations of this term that exist in the corpus. The overall processing under this approach is handled with FASTR [17], a rule-based linguistic tool that generates morphosyntactic variants of terms and enables extraction of variants of a given term in full-text documents. We denote noun patterns by NN, NNS, NNP and NNPS, verb patterns by VB, VBD, VBG, VBN, VBP and VBZ, adverb patterns by RB, RBR and RBS, and adjective patterns by JJ, JJR and JJS. For a given term, FASTR helps extract nearby or longer terms that contain the initial term. Figure 1 illustrates the two steps (4 and 5) of the driven term extraction approach. The driven process has the advantage of extracting relevant new terms that BioTex cannot extract from the corpus.
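
The idea of recovering a longer term from an incomplete one can be sketched in Python. This is not FASTR and does not apply its morphosyntactic rules: as a loudly simplified assumption, a candidate is accepted when it is longer than the seed and contains all seed words in the same order, possibly with other words inserted. The function names and the candidate list are hypothetical.

```python
def is_ordered_subsequence(seed_tokens, candidate_tokens):
    """True if every seed token occurs in the candidate, in the same order."""
    remaining = iter(candidate_tokens)
    # `tok in remaining` advances the iterator until tok is found,
    # so later seed tokens can only match later candidate tokens.
    return all(tok in remaining for tok in seed_tokens)


def longer_terms(seed, candidates):
    """Return candidates that are longer than `seed` and contain its words in order.

    Simplifying assumption: plain lowercase word matching stands in for the
    rule-based variant detection that a tool such as FASTR performs.
    """
    seed_tokens = seed.lower().split()
    matches = []
    for candidate in candidates:
        tokens = candidate.lower().split()
        if len(tokens) > len(seed_tokens) and is_ordered_subsequence(seed_tokens, tokens):
            matches.append(candidate)
    return matches


# Hypothetical candidate phrases found in a document.
candidates = ["higher risk acute care area", "higher risk", "area of higher risk"]

print(longer_terms("higher risk acute", candidates))
print(longer_terms("higher risk area", candidates))
```

With these toy candidates, both incomplete seeds lead to the same longer phrase, which mirrors the “higher risk acute care area” example given for the Driven Extraction step.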

Proposed combination for term extraction

Based on the elements given in Sects. The free term extraction approach and The driven term extraction approach, we propose a workflow in Fig. 2 for term extraction and analysis dedicated to scientific papers. We outline this workflow according to the type of corpus, measure, and approach:
Fig. 2

Proposed combination for term extraction

The type of corpus: as described in Section Dataset description, for a given paper, we considered three parts to build the corresponding corpora, i.e. the Title (T), Abstract (A) and Content (C).

The measures: BioTex integrates several statistical measures, each of which uses a specific strategy to compute the term score. In this case, we selected two measures, C_Value and F-TFIDF-C_M. C_Value indicates the importance of terms that appear most frequently in a document, based on the idea that the frequency of appearance of a term in the document reflects its importance in the document. Moreover, based on frequency criteria, C_Value favors multi-word term extraction by taking into account nested terms (e.g. virus) in multi-word terms (e.g. influenza virus) [13]. F-TFIDF-C_M is the harmonic mean of the C_Value and TF-IDF values, which ranks terms by weight according to their relevance in the document while taking the whole corpus into account [24]. C_Value and F-TFIDF-C_M are complementary, as the first favors relevant MWT extraction while the second gives weight to discriminant terms. For each measure, the aim is to organise the extracted terms into five sets: (1) terms corresponding to the Title corpus, Set(T); (2) terms corresponding to the Abstract corpus, Set(A); (3) terms corresponding to the Content corpus, Set(C); (4) terms that occur in both the Title and the Abstract corpus, Set(TA); and (5) terms that occur in both the Title and the Content corpus, Set(TC).

The approach: in the free extraction approach, terms are extracted using a given corpus and a specific statistical measure. In the driven process, term variations are extracted using a given corpus and a specific set of terms, which can be defined from the output of the free extraction approach.
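
The two measures can be illustrated with a short Python sketch. It follows the commonly published C-value definition (frequency discounted by the mean frequency of the longer candidates that contain the term) and assumes a length weight of log2(|a| + 1) so that single-word terms receive a non-zero score; the exact BioTex implementation may differ. The candidate frequencies are invented for illustration, and `harmonic_mean` only shows how two already computed scores would be combined.

```python
import math


def c_value(term, freq):
    """C-value of `term`, given `freq` mapping each candidate term to its frequency.

    Assumption: the length weight is log2(number of words + 1).
    """
    weight = math.log2(len(term.split()) + 1)
    # Longer candidates that contain `term` as a contiguous word sequence.
    containers = [t for t in freq if t != term and f" {term} " in f" {t} "]
    if not containers:
        return weight * freq[term]
    nested_mean = sum(freq[t] for t in containers) / len(containers)
    return weight * (freq[term] - nested_mean)


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0.0 if both are zero)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Hypothetical candidate frequencies in one corpus.
freq = {"virus": 10, "influenza virus": 4, "zika virus": 2}

print(c_value("virus", freq))            # nested in two longer terms
print(c_value("influenza virus", freq))  # not nested
print(harmonic_mean(1.0, 3.0))           # combining two toy scores
```

The nested term "virus" is discounted because part of its frequency is explained by the multi-word terms that contain it, which is how the measure favors multi-word terms.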

Experiments

To set the parameters, throughout our study we used C_Value and F-TFIDF-C_M as statistical measures, 50 different patterns or term extraction rules, and term lengths ranging from 1 to 4 words (1- to 4-grams). These parameters were applied to the corpora described in Section Dataset description. The choice of C_Value and F-TFIDF-C_M is based on the findings of previous studies [13, 24], which showed that both allow efficient SWT and MWT extraction. Before applying BioTex, some specific pre-processing was applied to the Papers1-content and Papers2-content corpora due to their size. Papers1-content was divided into 9 sub-corpora (8 corpora of 1000 documents each and 1 corpus of 1315 documents) and Papers2-content into 32 sub-corpora (31 corpora of 1000 documents each, and 1 corpus of 1332 documents). Each corpus was partitioned into smaller units to enhance scalability. The results obtained from the smaller units were then combined by computing the average ranked values. The final rank for a given term was thus equal to the average of its ranked values in all sub-corpora in which it was present. The final result gave a set of terms, listed in ascending order according to the ranking values. Table 3 shows an example of the MWT set obtained using BioTex. The Terms column contains the extracted terms, the in_umls column indicates whether or not the corresponding term is available in the Unified Medical Language System (UMLS) Metathesaurus [3], and the Rank column shows the significance of the term based on statistical measures in the whole list of terms for a given corpus. We used the UMLS Metathesaurus as a reference for the extracted terms because our study concerns biomedical terminology analysis. This comparison aimed to single out new terms, i.e. terms that were not yet listed in the Metathesaurus.
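
The partitioning and rank-averaging step can be sketched as follows in Python. The sketch assumes that the leftover documents are merged into the last chunk, which reproduces the 8 x 1000 + 1315 split described for Papers1-content; the function names and the toy scores are hypothetical.

```python
from collections import defaultdict


def split_corpus(documents, size=1000):
    """Split a list of documents into chunks of `size`, merging a short tail
    into the previous chunk (assumption made to match the split in the text)."""
    chunks = [documents[i:i + size] for i in range(0, len(documents), size)]
    if len(chunks) > 1 and len(chunks[-1]) < size:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks


def average_ranks(sub_corpus_scores):
    """Average each term's rank value over the sub-corpora in which it appears."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in sub_corpus_scores:
        for term, value in scores.items():
            totals[term] += value
            counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}


# 9315 placeholder documents, as in Papers1-content.
sizes = [len(chunk) for chunk in split_corpus(list(range(9315)))]
print(sizes)

# Hypothetical rank values from two sub-corpora.
merged = average_ranks([
    {"public health": 10.0, "zika virus": 4.0},
    {"public health": 20.0},
])
print(merged)
```

Averaging only over the sub-corpora in which a term appears means that a term absent from a chunk is not penalized with a zero score.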
Table 3

Example of BioTex output

Terms | in_umls | Rank
Public health | 1 | 1602.3971
Respiratory syndrome | 0 | 1481.9399
Infectious disease | 1 | 1198.2317
Virus infection | 1 | 1126.9083
Influenza virus | 1 | 1023.8858
Immune response | 1 | 1008.0362

We used BioTex, as outlined in Section The free term extraction approach, to extract terms from corpora in free mode. Several analyses are performed below on the obtained results. We conducted the experiments to address three main questions: (1) for each corpus, what are the most representative terms or domain concepts (terms that summarize the main content of the corpus) per statistical measure? (2) for each corpus, what are the most representative concepts for both measures? and (3) what are the discriminant and common concepts of the overall corpus? For each case, we determined whether or not the extracted terms exist in the UMLS Metathesaurus.
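
The in_umls flag can be illustrated with a small Python sketch. It assumes that a local set of known terms stands in for a Metathesaurus lookup; `known_terms`, `flag_terms` and the case-insensitive exact match are assumptions made for illustration, and the ranked terms are taken from Table 3.

```python
def flag_terms(ranked_terms, known_terms):
    """Attach a 0/1 flag telling whether each term is in the reference vocabulary.

    Assumption: membership is tested by case-insensitive exact match against
    a local set that stands in for the UMLS Metathesaurus.
    """
    known = {term.lower() for term in known_terms}
    return [(term, int(term.lower() in known), rank) for term, rank in ranked_terms]


# Ranked terms taken from Table 3.
ranked_terms = [
    ("Public health", 1602.3971),
    ("Respiratory syndrome", 1481.9399),
    ("Infectious disease", 1198.2317),
]

# Hypothetical stand-in for the reference vocabulary.
known_terms = {"public health", "infectious disease"}

flagged = flag_terms(ranked_terms, known_terms)
new_terms = [term for term, in_umls, _ in flagged if in_umls == 0]
print(flagged)
print(new_terms)
```

Terms flagged with 0 are the candidates for new terminology, i.e. terms that are not yet listed in the reference vocabulary.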

Corpus representative terms

In this section, we illustrate how representative terms can be extracted from different datasets. Based on the BioTex ranking measures, a term is more important than another one in a given corpus if it has a higher ranking than the other term. Figure 3 shows representative terms for the Title, Abstract and Content corpora with the corresponding statistical measures (see Tables 7 and 8 for more details).
Fig. 3

Representative terms from Papers1

Table 7

Best ranked terms extracted from Papers1 using F-TFIDF-C_M

F-TFIDF-C_M
Title Corpus (Terms | Rank) | Abstract Corpus (Terms | Rank) | Content Corpus (Terms | Rank)
Respiratory syncytial virus | 1.9880 | Public health | 1.9986 | Additional file | 1.9976
Middle east respiratory syndrome coronavirus | 1.9846 | Infectious diseases | 1.9979 | Infectious disease | 1.997
Systematic review | 1.9842 | Immune responses | 1.9976 | nk cells | 1.997
Open access | 1.9819 | Influenza virus | 1.9976 | Health care | 1.996
Zika virus | 1.9819 | t cells | 1.9975 | Endothelial cells | 1.9957
Gene expression | 1.9795 | Virus infection | 1.9974 | Frequency domain | 1.9957
Virology journal | 1.9788 | Respiratory tract | 1.9973 | Ebola virus | 1.9948
Human coronavirus | 1.976 | Viral infections | 1.9969 | Influenza infection | 1.9943
Case report | 1.9756 | RNA viruses | 1.9967 | Real-time rt-pcr | 1.9933
Syncytial virus | 1.9752 | Acute respiratory syndrome | 1.9961 | Incubation period | 1.99325
t cell | 1.9746 | 95percent ci | 1.996 | Health emergency | 1.9932
Infectious bronchitis | 1.9726 | Ebola virus | 1.9945 | Index patient | 1.9932
Sars coronavirus | 1.9723 | Influenza viruses | 1.9943 | Membrane rafts | 1.9931
BMC public health | 1.9701 | Avian influenza | 1.9939 | pcr products | 1.9929
t cells | 1.9689 | Respiratory tract infections | 1.9938 | 2c atpase | 1.9926
Acute respiratory infection | 1.9672 | Health care | 1.9925 | b cell | 1.9924
Mini review | 1.9636 | Hepatitis c | 1.9922 | Close contact | 1.9924
Respiratory viral infections | 1.9636 | Type I | 1.9918 | Final dataset | 1.9922
BMC public | 1.9625 | Cell line | 1.9914 | 3d8 scfv | 1.9921
Ebola virus disease | 1.9592 | Spike protein | 1.9909 | Pol ii | 1.992
Supplementary information | 1.9574 | Codon usage | 1.9908 | 3c pro | 1.992
Community-acquired pneumonia | 1.9543 | Pandemic influenza | 1.9907 | Influenza pandemic | 1.9919
Global health | 1.9543 | Endoplasmic reticulum | 1.9904 | Phylogenetic tree | 1.9918
Peer review | 1.9543 | Saudi Arabia | 1.9904 | Protein vi | 1.9917
Japanese encephalitis virus | 1.9512 | Innate immunity | 1.9903 | ag nps | 1.9916
Innate immunity | 1.9488 | Porcine epidemic | 1.9903 | Influenza b | 1.99125
Multiple sclerosis | 1.9488 | Global health | 1.9902 | ifn β-1α | 1.991
Human rhinovirus | 1.9466 | Vaccine development | 1.9901 | ill patients | 1.9908
Supplementary material | 1.9442 | Cell death | 1.9898 | Poly tail | 1.9908
Cell entry | 1.9417 | Infectious disease | 1.9896 | Host range | 1.9906
Coronavirus spike | 1.9417 | Peripheral blood | 1.9895 | Cyclin d3 | 1.9903
Human adenovirus | 1.9417 | Hong Kong | 1.9894 | Sequence accession | 1.9903
East respiratory syndrome coronavirus | 1.9414 | Immune cells | 1.9888 | Antiviral drugs | 1.9897
Mers coronavirus | 1.9388 | Cell cycle | 1.9886 | Subunit vaccines | 1.9897
West Africa | 1.9388 | Clinical trials | 1.9885 | Protein sequences | 1.9895
Molecular epidemiology | 1.9323 | Infection control | 1.9884 | Oil spill | 1.9895
National natural science | 1.931 | Mass spectrometry | 1.9883 | Swine flu | 1.9894
Natural science foundation | 1.931 | Genome sequence | 1.9881 | Membrane proteins | 1.9893
Rift valley fever | 1.931 | Clinical samples | 1.9877 | Contact tracing | 1.9891
National natural science foundation | 1.9307 | Acute respiratory infections | 1.9874 | sars 3a | 1.9889
Influenza infection | 1.9284 | Severe disease | 1.9868 | Critical care | 1.9888
Protein response | 1.9284 | Hepatitis b | 1.9864 | hk-2 cells | 1.9888
Science foundation | 1.9284 | Host response | 1.9864 | ap2 group | 1.9887
Supplementary materials | 1.9284 | Type II | 1.9864 | prp sc | 1.9887
Natural science | 1.9241 | Nucleic acids | 1.9862 | t-cell responses | 1.9887
Respiratory syndrome coronavirus infection | 1.9241 | Surveillance systems | 1.9859 | DNA vaccines | 1.9886
Influenza virus | 1.9212 | Influenza virus infection | 1.9852 | Reverse genetics | 1.9886
Obstructive pulmonary disease | 1.92 | Antiviral drugs | 1.9851 | Health system | 1.9884
Emerging microbes | 1.9193 | DNA vaccine | 1.9847 | b7-h1 | 1.9884
Original research | 1.9193 | Influenza infection | 1.9845 | hcv infection | 1.9883
Retrospective study | 1.9193 | Reference genes | 1.9842 | Lung cancer | 1.9879
Phylogenetic analysis | 1.9153 | Cell types | 1.984 | Nucleocapsid protein | 1.9879
Respiratory syndrome coronavirus | 1.9151 | b cell | 1.9835 | 3c protease | 1.9878
Clinical characteristics | 1.9138 | Vaccine candidates | 1.9835 | tgev infection | 1.9878
Mass spectrometry | 1.9138 | Host species | 1.9833 | cs dna | 1.9878
National natural | 1.9138 | Respiratory viral infections | 1.9832 | Risk perception | 1.9875
Rift valley | 1.9138 | Endothelial cells | 1.9829 | s1 protein | 1.9875
Science china | 1.9138 | Sequence data | 1.9829 | Ring vaccination | 1.9875
Valley fever | 1.9138 | DNA viruses | 1.9826 | Syrian hamster | 1.9873
Respiratory virus infections | 1.913 | Host innate | 1.9826 | Wild mice | 1.9873
Syndrome coronavirus | 1.9096 | Parainfluenza virus | 1.9824 | Yellow fever | 1.9873
Classical swine fever virus | 1.9087 | Tract infections | 1.9822 | Climate change | 1.9873
b cells | 1.9074 | South Korea | 1.9821 | Public health services | 1.9873
Host response | 1.9074 | Acute respiratory infection | 1.9817 | Index patients | 1.9872
Science foundation of china | 1.9074 | Reproduction number | 1.9816 | Small rna | 1.9872
Viral proteins | 1.9074 | Surveillance system | 1.9816 | IC activity | 1.9871
Virus disease | 1.9065 | Causative agent | 1.9813 | Ebola virus disease | 1.9868
Clinical infectious diseases | 1.9048 | Multiple sclerosis | 1.9811 | RNA chaperone | 1.9867
World health organization | 1.9048 | rsv infection | 1.9809 | Caco-2 cells | 1.9867
Antiviral agents | 1.9001 | Cellular proteins | 1.9808 | m2 channel | 1.9865
Cell culture | 1.9001 | West nile virus | 1.9806 | Overlapping genes | 1.9865
Pulmonary disease | 1.9001 | Respiratory diseases | 1.9805 | Nasal mucosa | 1.9865
Study protocol | 1.9001 | tgev infection | 1.9805 | Hepatitis e | 1.9865
Dengue virus | 1.8946 | e protein | 1.9802 | Genetic drift | 1.9865
Public health | 1.893 | Gene expression | 1.9801 | a7 gfp | 1.9865
RNA replication | 1.8915 | Structural proteins | 1.9799 | Tumor cells | 1.9864
Japanese encephalitis | 1.8902 | Acute respiratory tract | 1.9792 | Tanguticum nanoparticles | 1.9864
Syndrome coronavirus infection | 1.8864 | Hand hygiene | 1.9792 | cfu ml | 1.9864
Human respiratory syncytial virus | 1.8841 | Disease transmission | 1.9788 | Ward closure | 1.9861
Synonymous codon usage | 1.8824 | Human rhinovirus | 1.9785 | Case definitions | 1.9861
Clinical infectious | 1.8813 | Bacterial infections | 1.9781 | Richards model | 1.9861
Health organization | 1.8813 | Cancer cells | 1.9781 | Epimedium koreanum | 1.9861
Severe pneumonia | 1.8813 | DNA vaccines | 1.9777 | ms2 plp | 1.986
Dengue virus infection | 1.8772 | Type III | 1.9777 | Gene therapy | 1.9859
Clinical samples | 1.8768 | Viral pathogenesis | 1.9773 | Integrin b3 | 1.9859
Classical swine fever | 1.8744 | Zoonotic diseases | 1.9773 | Cardiovascular diseases | 1.9859
Human antibody | 1.869 | Early detection | 1.9765 | Fourth site | 1.9859
Lassa virus | 1.869 | Lung cancer | 1.9756 | Serial interval | 1.9858
Pilot study | 1.869 | Nile virus | 1.9756 | trm cells | 1.9858
Avian influenza viruses | 1.8667 | Human disease | 1.9751 | Electronic supplementary material | 1.9857
Human respiratory syncytial | 1.8667 | rnase l | 1.9751 | Emergency nurses | 1.9856
International health regulations | 1.8667 | Health systems | 1.9746 | Pet substrate | 1.9856
Hepatitis c virus infection | 1.8661 | Incubation period | 1.9746 | fcov type | 1.9856
Infectious bronchitis virus strain | 1.8661 | Rabies virus | 1.9746 | s1 text | 1.9856
Vaccine development | 1.8601 | Adaptive immunity | 1.9741 | Global health research | 1.9854
Protects hepatocytes from type I | 1.8564 | Multiplex pcr | 1.9741 | ace2 activity | 1.9853
Type I interferon signaling disrupts | 1.8564 | nk cells | 1.9741 | β 6 ko | 1.9853
Adaptive immunity | 1.8538 | Feline coronavirus | 1.9735 | Global health | 1.9852
Adenovirus type | 1.8538 | Human populations | 1.9735 | Ham tsp | 1.9851
Nonhuman primates | 1.8538 | Common cold | 1.9723 | Blood culture | 1.9849
Table 8

Best ranked terms extracted from Papers1 using C-Value

C-Value
Title Corpus (Terms | Rank) | Abstract Corpus (Terms | Rank) | Content Corpus (Terms | Rank)
Respiratory syndrome | 386.7309 | Public health | 1393.182 | t cells | 2063.1457
Virus infection | 366.1263 | Respiratory syndrome | 1095.2091 | Public health | 1644.7156
Porcine epidemic diarrhea virus | 329.7138 | Infectious diseases | 952.5625 | Amino acid | 1409.82415
Porcine epidemic diarrhea | 318.0 | Immune response | 908.1835 | Immune response | 1400.94835
Epidemic diarrhea virus | 306.0 | Immune responses | 841.6151 | Influenza virus | 1185.8689
East respiratory syndrome | 284.0 | Influenza virus | 841.6151 | Immune responses | 1056.536
Middle east | 261.5188 | t cells | 803.576 | t cell | 1056.37753
Epidemic diarrhea | 256.7639 | Virus infection | 760.7811 | Gene expression | 1050.6716
Diarrhea virus | 245.6692 | Respiratory tract | 727.4978 | Viral replication | 1021.5083
Infectious diseases | 245.6692 | Viral infection | 668.8542 | Infected cells | 939.72426
Respiratory syndrome coronavirus | 240.0 | Viral replication | 665.6843 | Cell lines | 897.4057
Influenza a | 225.0647 | Viral infections | 640.3249 | Viral infection | 888.6884
Public health | 209.2151 | East respiratory syndrome | 638.0 | Virus infection | 872.68035
Syndrome coronavirus | 191.7805 | Respiratory syndrome coronavirus | 636.0 | Amino acids | 866.816
Porcine epidemic | 190.1955 | Middle east | 630.8151 | mg ml | 824.4975
Influenza virus | 182.2707 | Gene expression | 627.6452 | Infectious diseases | 822.27855
Respiratory tract | 180.6857 | Infectious disease | 613.3805 | Present study | 812.45177
Middle east respiratory syndrome | 174.1446 | RNA viruses | 603.8707 | Respiratory tract | 812.13477
Middle east respiratory | 170.0 | Present study | 575.3414 | Epithelial cells | 759.03855
Respiratory syncytial virus | 166.0 | Respiratory viruses | 551.567 | Previous studies | 732.41119
Infectious bronchitis | 160.0812 | Acute respiratory syndrome | 516.0 | Room temperature | 714.3426
Infectious disease | 156.9113 | t cell | 513.5279 | Cell culture | 673.60907
Infectious bronchitis virus | 156.0 | Syndrome coronavirus | 511.9429 | Additional file | 657.75946
East respiratory | 136.3068 | Porcine epidemic diarrhea | 506.0 | Viral infections | 635.72848
Syncytial virus | 134.7218 | 95percent ci | 502.4331 | Immune system | 617.97689
Avian influenza | 131.5519 | Viral rna | 499.2632 | Respiratory syndrome | 617.3429
Respiratory viruses | 131.5519 | Amino acid | 489.7534 | Cell line | 611.16155
East respiratory syndrome coronavirus | 130.028 | Respiratory syncytial virus | 472.0 | Infectious disease | 607.04063
Middle east respiratory syndrome coronavirus | 129.2481 | Cell lines | 443.7895 | μ g ml | 576.13388
Influenza a virus | 126.0 | Respiratory infections | 426.3549 | Western blot | 568.36754
Bronchitis virus | 125.212 | Epithelial cells | 424.77 | rnase l | 565.0391
Respiratory infections | 125.212 | Virus replication | 420.0151 | Virus replication | 560.6012
Systematic review | 125.212 | Polymerase chain reaction | 408.0 | Cell surface | 543.9591
Ebola virus | 120.4572 | Epidemic diarrhea virus | 406.0 | xx | 542.0572
Acute respiratory | 117.2872 | Epidemic diarrhea | 402.5805 | Host cell | 539.83825
Viral infections | 117.2872 | Host cell | 396.2406 | Codon usage | 523.03765
Virus replication | 115.7023 | Syncytial virus | 378.806 | Viral proteins | 520.6601
Open access | 109.3624 | Porcine epidemic diarrhea virus | 376.1524 | Respiratory viruses | 515.4298
Zika virus | 109.3624 | Antiviral activity | 374.0512 | nk cells | 503.2256
Respiratory tract infections | 102.0 | Risk factors | 374.0512 | Time points | 497.8367
Viral infection | 101.4376 | Immune system | 369.2963 | Influenza viruses | 492.7648
Immune response | 99.8526 | Ebola virus | 364.5414 | Important role | 491.0213
Hepatitis c virus | 98.0 | Chain reaction | 355.0316 | Allergic rhinitis | 486.5835
Gene expression | 96.6827 | Influenza viruses | 348.6918 | Antiviral activity | 481.3531
Pandemic influenza | 96.6827 | Infected cells | 347.1068 | Global health | 473.9038
Respiratory syndrome virus | 96.0 | Diarrhea virus | 340.7669 | mg kg | 470.0998
Epithelial cells | 95.0978 | Host cells | 334.4271 | Frequency domain | 469.1489
Complete genome | 93.5128 | Important role | 331.2572 | Control group | 466.13749
Syndrome virus | 93.5128 | Phylogenetic analysis | 331.2572 | Viral load | 465.34499
Virology journal | 93.5128 | Polymerase chain | 331.2572 | Binding site | 459.6391
Hepatitis c | 91.9278 | Respiratory disease | 326.5023 | Expression levels | 453.6162
Immune responses | 90.3429 | Avian influenza | 324.9173 | Hong Kong | 450.7237
Genome sequence | 88.7579 | Respiratory tract infections | 320.0 | Clinical signs | 448.8613
Dengue virus | 87.1729 | Infectious bronchitis | 285.2933 | Protein expression | 448.2274
Molecular sciences | 84.0029 | Cell culture | 272.6136 | Wild type | 446.7833
Type i | 84.0029 | Hepatitis c virus | 268.0 | Endothelial cells | 441.4120
Acute respiratory syndrome | 84.0 | Health care | 264.6887 | Table s1 | 438.4006
Complete genome sequence | 84.0 | Zika virus | 264.6887 | Flow cytometry | 437.4496
Human coronavirus | 82.4181 | Infectious bronchitis virus | 260.0 | Saudi Arabia | 433.4872
Respiratory infection | 82.4181 | Tract infections | 258.3489 | Viral genome | 433.3992
Case report | 80.8331 | Hepatitis c | 255.179 | Negative control | 433.2230
Tract infections | 80.8331 | Innate immune response | 252.0 |  | 431.7890
Risk factors | 79.2481 | Monoclonal antibodies | 248.8391 | Cell types | 431.1098
Spike protein | 77.6632 | Viral genome | 247.2542 | Viral entry | 427.9399
t cell | 77.6632 | Type I | 242.4993 | Cell death | 425.24544
Acute respiratory infections | 76.0 | Central nervous system | 242.0 | er stress | 423.185
Coronavirus infection | 74.4932 | Amino acids | 239.3293 | Significant differences | 420.6490
RNA viruses | 74.4932 | Animal models | 237.7444 | Health care | 420.4905
Severe acute respiratory | 72.0 | Real-time pcr | 236.1594 | Tcid 50 | 417.3734
Sars coronavirus | 71.3233 | Dengue virus | 232.9895 | Cathepsin l | 410.5053
Isothermal amplification | 69.7384 | Viral load | 232.9895 | Risk factors | 408.9203
Respiratory disease | 69.7384 | World Health Organization | 232.0 | Positive selection | 405.7504
BMC public health | 66.0 | Cell line | 231.4045 | Cell cycle | 400.9955
Disease virus | 64.9835 | Viral proteins | 229.8196 | Nucleotide sequences | 397.8256
t cells | 63.3985 | Nervous system | 226.6496 | Plasma membrane | 393.5990
Influenza viruses | 61.8135 | Wide range | 223.4797 | Intensive care | 392.2782
Acute respiratory infection | 60.0 | Virus infections | 221.8948 | Host cells | 384.82889
Type i interferon | 60.0 | Middle east respiratory syndrome | 220.5832 | Hand hygiene | 383.5609
Journal frontiers | 58.6436 | Immunodeficiency virus | 218.7248 | Significant difference | 382.6099
Fever virus | 57.0587 | Spike protein | 218.7248 | Immune cells | 381.02498
Respiratory syncytial | 57.0587 | Life cycle | 217.1399 | Reference genes | 380.3909
Severe acute | 57.0587 | Recent years | 217.1399 | HIV aids | 377.2211
Respiratory tract infection | 56.0 | Codon usage | 215.5549 | Avian influenza | 376.8688
Antiviral activity | 55.4737 | Viral pathogens | 215.5549 | Serum samples | 375.8625
BMC infectious | 55.4737 | Pandemic influenza | 213.9699 | Body weight | 375.0021
Hong Kong | 55.4737 | Clinical signs | 212.385 | Fig. 1a | 374.0511
Viral replication | 55.4737 | Dendritic cells | 209.2151 | Membrane fusion | 374.0511
Virus infections | 55.4737 | Acute respiratory syndrome coronavirus | 208.9735 | Clinical trials | 373.8750
BMC infectious diseases | 54.0 | Bronchitis virus | 207.6301 | Time point | 373.3719
Respiratory viral infections | 54.0 | Endoplasmic reticulum | 207.6301 | Protein synthesis | 369.2962
Case study | 53.8887 | RNA virus | 207.6301 | Dengue virus | 367.7113
Dendritic cells | 53.8887 | Saudi Arabia | 207.6301 | e protein | 367.7113
Mini review | 53.8887 | Innate immunity | 206.0451 | High levels | 365.3339
RNA virus | 53.8887 | Recent studies | 206.0451 | Virus particles | 364.5414
Transmissible gastroenteritis | 53.8887 | Economic losses | 204.4602 | Target cells | 362.5601
BMC public | 52.3038 | Porcine epidemic | 204.4602 | Viral particles | 360.4204
Monoclonal antibodies | 52.3038 | World health | 204.4602 | Dendritic cells | 357.5675
Creative commons cc-by 4 | 51.0824 | Global health | 202.8752 | Total number | 356.4580
Influenza pandemic | 50.7188 | Type 1 | 202.8752 | Cancer cells | 356.0883
Type 1 | 50.7188 | Vaccine development | 201.2902 | Disease control | 355.2957
This figure highlights which terms are important in each part of the papers. Note that the extracted terms differ for each measure and sub-corpus, but some of them are shared by both measures. For example, terms like public health and immune responses are extracted using both measures from the Abstract corpus. In order to quantitatively display the number of representative intersecting terms from different corpora, we show common terms between Title vs Abstract, and Title vs Content corpora for the Papers2 corpus in Fig. 4. For both measures, Title terms are more representative in the Abstract than in the Content of Papers, i.e. 57% and 27% compared to 28% and 5%, respectively, for Title vs Abstract and Title vs Content. However, we noted that terms extracted with C_Value generated more common terms than those extracted with F-TFIDF-C_M. The common terms are those extracted simultaneously from the Title, Abstract and Content corpora for each measure.
Fig. 4

Common terms in Papers2
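The overlap between two term lists can be sketched as a set intersection. The snippet below is a minimal illustration, not the authors' implementation: it assumes the percentage is computed relative to the number of Title terms (the text does not state the denominator), and the function name `overlap_percentage` and the toy term lists are hypothetical.

```python
def overlap_percentage(reference_terms, other_terms):
    """Share (in %) of reference_terms that also appear in other_terms.

    Assumption: matching is exact after lowercasing; the paper does not
    specify how terms are compared across corpora.
    """
    reference = {t.lower() for t in reference_terms}
    other = {t.lower() for t in other_terms}
    if not reference:
        return 0.0
    return 100.0 * len(reference & other) / len(reference)


# Toy term lists (illustrative only, not the Papers2 results).
title_terms = ["public health", "immune responses", "case report", "spike protein"]
abstract_terms = ["public health", "immune responses", "viral load", "spike protein"]
content_terms = ["public health", "cell culture", "flow cytometry"]

print(overlap_percentage(title_terms, abstract_terms))  # 75.0
print(overlap_percentage(title_terms, content_terms))   # 25.0
```

With real data, the two input lists would be the TOP@k terms of each sub-corpus for a single measure.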

As indicated, extracted terms were compared with the UMLS Metathesaurus. Table 4 shows the TOP@20 terms extracted for the Papers1-content corpus using C_Value and F-TFIDF-C_M measures. Bold terms are not in the UMLS Metathesaurus.
Table 4

TOP@20 terms extracted from Papers1-content using C_Value and F-TFIDF-C_M - SWTs vs MWTs

C_Value Measure
SWTs (TOP 20)
Cells | Virus | Infection | Protein | Study
Data | figure | al | Patients | Expression
rna | Analysis | Result | Disease | p
Mice | c | samples | Influenza | Number
MWTs (TOP 20)
t cells | Public health | Amino acid | Immune response | Gene expression
Viral replication | infected cells | Cell lines | Viral infection | Virus infection
mg ml | Infectious diseases | Present study | Respiratory tract | Epithelial cells
Previous studies | Room temperature | Cell culture | Additional file | Viral infection

F-TFIDF-C_M Measure
SWTs (TOP 20)
Mice | Patients | Influenza | Proteins | Health
dna | Vaccine | Transmission | Research | Model
Children | Outbreak | Vaccination | e | China
Peptide | Fusion | Network | Percent | mers-cov
MWTs (TOP 20)
Additional file | Infectious disease | nk cells | Health care | Endothelial cells
Frequency domain | Ebola virus | Influenza infection | Real-time rt-pcr | Incubation period
Health emergency | Index patient | Membrane rafts | pcr products | 2c atpase
b cell | Close contact | Final dataset | 3d8 scfv | pol ii

In bold terms not in the UMLS thesaurus

According to these TOP@20 terms, we can see that: (1) the majority of the SWTs are in the UMLS Metathesaurus for both statistical measures (C_Value or F-TFIDF-C_M); (2) for MWTs, several terms are not in the UMLS Metathesaurus. These terms can be categorized as:
UMLS sub-terms: these are terms that do not exactly match those present in the UMLS Metathesaurus but could be part of them. For example, health emergency is part of terms like Emergency Health Services in the UMLS Metathesaurus;
New terms: these terms are not in the UMLS Metathesaurus, but are meaningful (or not) in the COVID-19 context. For example, terms like close contact relate to the COVID-19 contagion mode.

Figures 5 and 6 illustrate the number of terms out of the TOP@100 terms (in percentage) for each measure (C_Value, F-TFIDF-C_M) and dataset (Papers1-title, Papers2-title):
In_UMLS: the number of terms in the UMLS Metathesaurus;
Not_In_UMLS_V: the number of terms that do not exactly match the UMLS terms, but have some variants or are part of the UMLS terms;
Not_In_UMLS: the number of terms that do not match the UMLS terms at all. We refer to these as new terms, i.e. terms which are not in the UMLS Metathesaurus but which could have greater meaning in the study context or which could be added to the UMLS Metathesaurus.

According to these statistics, we first note that the C_Value and F-TFIDF-C_M measures enable extraction of more conventional terms or terms in the UMLS Metathesaurus regardless of the corpus. Secondly, we note that F-TFIDF-C_M generates more new terms (Not In UMLS) than C_Value regardless of the corpus. Finally, the number of new terms is more substantial with MWTs (Fig. 6) than SWTs (Fig. 5) regardless of the measure.
Fig. 5

C_Value vs F-TFIDF-C_M SWTs

Fig. 6

C_Value vs F-TFIDF-C_M MWTs
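The three categories In_UMLS, Not_In_UMLS_V and Not_In_UMLS can be illustrated with a small classifier. This is only a sketch under stated assumptions: the paper does not describe the matching procedure, so the word-subset test and the crude plural stripping used here are hypothetical stand-ins, and `umls_sample` is a three-entry toy list rather than the UMLS Metathesaurus.

```python
def normalize(term):
    """Lowercase a term and crudely strip a trailing plural 's' from each word.

    Assumption: this naive normalisation stands in for whatever variant
    matching the authors used.
    """
    return frozenset(
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in term.lower().split()
    )


def classify(term, umls_terms):
    """Assign a term to In_UMLS, Not_In_UMLS_V or Not_In_UMLS."""
    exact = {t.lower() for t in umls_terms}
    if term.lower() in exact:
        return "In_UMLS"
    words = normalize(term)
    # The term is "part of" a UMLS term if all its words occur in that term.
    if any(words <= normalize(u) for u in umls_terms):
        return "Not_In_UMLS_V"
    return "Not_In_UMLS"


# Toy stand-in for the UMLS Metathesaurus (illustrative only).
umls_sample = ["Emergency Health Services", "Public health", "Gene expression"]

for term in ["public health", "health emergency", "close contact"]:
    print(term, "->", classify(term, umls_sample))
```

On this toy list, health emergency falls into Not_In_UMLS_V because both of its words occur in Emergency Health Services, which mirrors the example given in the text.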

Relevant term extraction from corpora for both measures

This involves quantitative and qualitative analysis of the terms extracted within each corpus, while taking both measures (C_Value and F-TFIDF-C_M) into account. In other words, it consists of analysing terms obtained for both measures, i.e. terms detected at the same time, and also terms specific to each of them. The quantitative analysis aims to highlight, for each dataset, the number of terms obtained by each measure, the number of terms obtained by both measures, and whether or not they are available in the UMLS Metathesaurus. The qualitative analysis aims to highlight, in each case, how important the terms obtained are with regard to the study domain. For the data representation, we take advantage of Venn diagrams [16]; Appendix Fig. 10 shows the distribution of the Papers2-title corpus terms. Terms are organised in different sections. For example, gene expression, human coronavirus, case report, public health, respiratory syncytial virus, etc. are available in the UMLS Metathesaurus and are recognized by both measures (C_Value and F-TFIDF-C_M). According to the study domain, these terms will tend to be more representative and important in the whole corpus. Moreover, for each measure there are new terms which are not in the UMLS Metathesaurus.
Fig. 10

Distribution of concepts according to the measures and their presence in the UMLS Metathesaurus: from Papers2-title corpus

Discriminant and common term extraction from corpora

In this case, term analysis is performed per dataset or by jointly considering multiple corpora, i.e. between Title, Abstract and Content corpora. Appendix Fig. 11 corresponds to discriminant and common term extraction from Papers1-title, Papers1-abstract and Papers1-content.
Fig. 11

Distribution of representative concepts when taking multiple corpora into account using C_Value: Papers1 corpora

There are common terms in the overall corpus such as gene expression, virus replication, influenza virus, etc. These terms tend to be relevant in the Title, Content and Abstract corpora. Moreover, [respiratory infection, acute respiratory infection, etc.], [innate immune response, endoplasmic reticulum, etc.], and [nucleotide sequences, room temperature, etc.] are discriminant terms in the Title, Abstract and Content corpora, respectively.

The driven term extraction process

We performed a driven term extraction strategy using FASTR. Our proposal addresses two main questions: (1) For a given set of terms, how can new and relevant variants of these terms be extracted from a corpus? (2) Do some of the new terms exist in the UMLS Metathesaurus? In our experiment, we used the common terms extracted in section 5.1.3 based on the fact that they were more representative and relevant throughout the corpora. Figure 7 shows an example of variant terms extracted with the term infectious disease. Among these variants, we only show those which are not in the UMLS Metathesaurus since they are new and might be more informative.
Fig. 7

Example of term variants

Table 5 contains a list of TOP@10 variants extracted with six initial terms. Among them, we highlighted (in bold) terms matching terms in the UMLS Metathesaurus. Like free mode extraction, term variant mode may be used to extract useful terms.
Table 5

Term extraction variations using FASTR

Terms | Infectious disease | Virus replication | Laboratory tests | Respiratory syndrome | Preventive measure | Syndrome coronavirus
Variations | Diseases including infectious | Replication competent viruses | Laboratory confirmation tests | Respiratory distress syndrome | Preventive measures | Syndrome coronavirus-related coronavirus
Variations | Infectious pulmonary diseases | replication of N1347A virus | Laboratory testing | Respiratory acute syndrome | Preventive hygienic measures | Syndrome human coronavirus
Variations | Infectious bursal disease | Virus optimal replication | Testing presents isolation laboratories | Syndrome coronavirus and respiratory | Prevention community-engaged measures | Syndromic Surveillance Coronavirus
Variations | Infectious lung diseases | Replicating influenza viruses | Laboratory diagnostic testing | Respiratory tract syndromic | Preventive health measures | Syndrome virus coronavirus
Variations | Infectious acute disease | Replication of human viruses | Laboratory genomic testing | Respiratory insufficiency syndrome | Preventive behavioral measures | Coronavirus Associated Syndromes

Terms in the UMLS Metathesaurus in bold


Combined strategies for term analysis

Combined strategies for term analysis concern two levels: (1) intra-corpus term extraction, and (2) inter-corpus term extraction.

Combined intra-corpus term extraction strategies: these are geared towards extracting common or discriminant terms from a given corpus. To this end, the terms extracted with both measures are compared. We show the process in Fig. 8, where the sets of terms Set(Cp) extracted from the corpus Cp (Title, Abstract or Content) using each measure (C_Value, F-TFIDF-C_M) are jointly compared with the UMLS Metathesaurus terms. Set A represents corpus terms specifically extracted with C_Value, set B represents terms that are specific to F-TFIDF-C_M, while set C represents common terms from both measures and UMLS Metathesaurus elements. We consider sets A and B to be discriminant terms of the corpus according to the measures, whereas set C contains the common terms, i.e. the most representative terms of the corpus. The new term extraction process with FASTR is run with one of the combined sets (discriminant or common) and the corpus.
Fig. 8

Combined intra-corpus term extraction strategies
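The intra-corpus comparison can be sketched with plain set operations. This is one possible reading of sets A, B and C, not the authors' code: in particular, treating C as the three-way intersection with the UMLS terms, and ignoring UMLS membership for A and B, are assumptions, and the function name `intra_corpus_sets` and the toy inputs are hypothetical.

```python
def intra_corpus_sets(cvalue_terms, ftfidf_terms, umls_terms):
    """Split the terms of one corpus into discriminant (A, B) and common (C) sets.

    Assumed reading of the strategy:
      A = terms extracted only with C_Value
      B = terms extracted only with F-TFIDF-C_M
      C = terms extracted with both measures and present in the UMLS terms
    """
    cv, ft, umls = set(cvalue_terms), set(ftfidf_terms), set(umls_terms)
    return {"A": cv - ft, "B": ft - cv, "C": cv & ft & umls}


# Toy inputs (illustrative only).
cvalue = ["gene expression", "public health", "present study"]
ftfidf = ["gene expression", "public health", "close contact"]
umls = ["gene expression", "public health"]

sets = intra_corpus_sets(cvalue, ftfidf, umls)
for name in ("A", "B", "C"):
    print(name, sorted(sets[name]))
```

Either the discriminant sets (A or B) or the common set (C) could then serve as the seed list for a variant extraction step on the same corpus.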

Combined inter-corpus term extraction strategies: these are geared towards extracting common and discriminant terms, while taking several corpora into account for a given measure. As illustrated in Fig. 9, for each measure (C_Value or F-TFIDF-C_M), the sets of terms Set(Cp1), Set(Cp2), Set(Cp3) are extracted respectively from corpus Cp1 (Title), Cp2 (Abstract), and Cp3 (Content). These sets are compared in order to compute the common term set D shared by the corpora, and discriminant term sets A, B, C, respectively, for corpora Cp2, Cp1 and Cp3. In this context, new terms are extracted using one of the combined sets with one corpus (Cpx).
Fig. 9

Combined inter-corpus term extraction strategies
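The inter-corpus comparison for a single measure can be sketched in the same way. The snippet below assumes that a discriminant term is one found in exactly one of the three corpora, which is an interpretation rather than a definition given in the text; the function name `inter_corpus_sets` is hypothetical, and the toy inputs reuse the example terms quoted earlier for the Papers1 corpora.

```python
def inter_corpus_sets(title_terms, abstract_terms, content_terms):
    """Common set D and discriminant sets A (Abstract), B (Title), C (Content).

    Assumption: "discriminant" means present in one corpus and absent
    from the two others.
    """
    cp1, cp2, cp3 = set(title_terms), set(abstract_terms), set(content_terms)
    return {
        "D": cp1 & cp2 & cp3,   # common to Title, Abstract and Content
        "A": cp2 - cp1 - cp3,   # specific to Cp2 (Abstract)
        "B": cp1 - cp2 - cp3,   # specific to Cp1 (Title)
        "C": cp3 - cp1 - cp2,   # specific to Cp3 (Content)
    }


common = ["gene expression", "virus replication", "influenza virus"]
title = common + ["respiratory infection", "acute respiratory infection"]
abstract = common + ["innate immune response", "endoplasmic reticulum"]
content = common + ["nucleotide sequences", "room temperature"]

result = inter_corpus_sets(title, abstract, content)
for name in ("D", "A", "B", "C"):
    print(name, sorted(result[name]))
```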


Case study: epidemic intelligence

Epidemic intelligence (EI) aims to detect, investigate and monitor potential health threats in a timely manner [33]. In addition to conventional surveillance system monitoring, such as outbreak notifications from the World Organisation for Animal Health (OIE), the EI process increasingly mainstreams unstructured data from informal sources such as online news. Several web-based surveillance systems have been developed and used to support public health and animal health surveillance (ProMED [25], HealthMap [14], GPHIN [29], PADI-web [38], etc.). In this case study, we focused on the choice of keywords with the PADI-web system for COVID-19 surveillance (i.e. driven surveillance) and for monitoring unknown diseases (i.e. syndromic surveillance). The Platform for Automated extraction of Disease Information from the web (PADI-web) is an automated surveillance system for monitoring the emergence of animal infectious diseases, including zoonoses [1, 38]. PADI-web monitors Google News through specific Really Simple Syndication (RSS) feeds, targeting diseases of interest (e.g. African swine fever, avian influenza, etc.). PADI-web also uses unspecific RSS feeds, consisting of combinations of symptoms and hosts (i.e. species), thus allowing syndromic surveillance and detection of unusual disease events. RSS feeds consist of combinations of different categories of terms (i.e. keywords) including symptoms, disease names and species. PADI-web has been used for monitoring COVID-19 [39]. In this context, the choice of COVID-19 surveillance terms is crucial. In the following subsections, we discuss the choice of terms given by ITEXT-BIO to use in the PADI-web system [38] and other web-based surveillance systems [14, 25, 29] for COVID-19 and syndromic surveillance. This enables evaluation of the relevance of terms generated by our approach for a dedicated task, i.e. web-based health surveillance.

Relevant term extraction

We compared the relevance of the top 10 terms extracted from Papers2 corpora with either C_Value or F-TFIDF-C_M (Table 6). Table 9 gives more details on these terms. The relevance was assessed by classifying the terms in one or more of the following categories: COVID-19 surveillance, syndromic surveillance, domain relevant, and part of disease multiword expression (MWE). Among the terms extracted with C_Value from Titles, Abstracts or Titles and Abstracts, six to seven were parts of disease MWE. Only one term extracted with F-TFIDF-C_M was a part of disease MWE. C_Value could thus be of particular interest for extracting disease name variants, even if they are incomplete. For domain relevant, COVID-19 surveillance and syndromic surveillance terms, F-TFIDF-C_M obtained better results than C_Value, even when the frequency of relevant terms was low (from one to five out of ten terms). No common terms were extracted from (Title + Abstract) or from (Title + Content) using F-TFIDF-C_M. Using C_Value, only three common terms were extracted from Title + Content. Among the top 10 terms extracted from Title + Abstract with this measure, seven were parts of disease MWE. Regardless of the term category, we extracted more relevant terms from Titles and Abstracts than from Contents. This is in line with the fact that Titles and Abstracts are richer in key information and relevant terms due to their length limitation.
Table 6

Relevance of terms extracted from Papers2 depending on the metrics (C_Value or F-TFIDF-C_M)

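Corpus (Papers2) | Measure | n | Domain relevant | COVID-19 surveillance | Syndromic surveillance | Part of disease MWE
Title | C_Value | 10 | 3 | 0 | 2 | 6
Title | F-TFIDF-C_M | 9 | 4 | 1 | 1 | 1
Abstract | C_Value | 10 | 1 | 0 | 0 | 6
Abstract | F-TFIDF-C_M | 10 | 5 | 1 | 2 | 1
Content | C_Value | 10 | 0 | 0 | 0 | 1
Content | F-TFIDF-C_M | 10 | 2 | 0 | 2 | 0
Title + abstract | C_Value | 10 | 3 | 0 | 0 | 7
Title + abstract | F-TFIDF-C_M | 0 | - | - | - | -
Title + content | C_Value | 3 | 1 | 0 | 0 | 2
Title + content | F-TFIDF-C_M | 0 | - | - | - | -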
Table 9

Expanded terms from Table 6

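Sub-corpus | Measure | Term | Domain relevant | COVID-19 surveillance | Syndromic surveillance | Incomplete disease name
title | C_Value | Respiratory syndrome coronavirus | n | n | n | y
title | C_Value | Porcine epidemic diarrhea | y | n | n | n
title | C_Value | Syndrome coronavirus | n | n | n | y
title | C_Value | Epidemic diarrhea virus | n | n | n | y
title | C_Value | Acute respiratory syndrome | n | n | n | y
title | C_Value | Public access | n | n | n | n
title | C_Value | Diarrhea virus | n | n | n | y
title | C_Value | Infectious bronchitis | y | n | y | n
title | C_Value | Acute respiratory | n | n | n | y
title | C_Value | Bronchitis virus | y | n | y | n
title | F-TFIDF-C_M | Journal pre-proof | n | n | n | n
title | F-TFIDF-C_M | Virology journal | n | n | n | n
title | F-TFIDF-C_M | Influenza pandemic | y | n | n | n
title | F-TFIDF-C_M | Coronavirus spike | y | y | n | n
title | F-TFIDF-C_M | BMC public health | n | n | n | n
title | F-TFIDF-C_M | Influenza virus infection | y | n | n | n
title | F-TFIDF-C_M | Emerging infectious | y | n | y | n
title | F-TFIDF-C_M | Porcine circovirus type | n | n | n | y
title | F-TFIDF-C_M | Codon usage | n | n | n | n
title | F-TFIDF-C_M | Respiratory syndrome | n | n | n | y
abstract | C_Value | Acute respiratory syndrome | n | n | n | y
abstract | C_Value | Respiratory syndrome coronavirus | n | n | n | y
abstract | C_Value | East respiratory syndrome | n | n | n | y
abstract | C_Value | Syndrome coronavirus | n | n | n | y
abstract | C_Value | Present study | n | n | n | n
abstract | C_Value | Chain reaction | n | n | n | n
abstract | C_Value | Syncytial virus | n | n | n | y
abstract | C_Value | Porcine epidemic diarrhea | y | n | n | n
abstract | C_Value | Polymerase chain | n | n | n | n
abstract | F-TFIDF-C_M | Virus infections | y | n | y | n
abstract | F-TFIDF-C_M | Porcine epidemic | n | n | n | y
abstract | F-TFIDF-C_M | Clinical samples | n | n | n | n
abstract | F-TFIDF-C_M | Codon usage | n | n | n | n
abstract | F-TFIDF-C_M | Mers-cov infection | y | y | n | n
abstract | F-TFIDF-C_M | Pandemic influenza | y | n | n | n
abstract | F-TFIDF-C_M | Viral entry | y | n | y | n
abstract | F-TFIDF-C_M | 95 percent confidence interval | n | n | n | n
abstract | F-TFIDF-C_M | Immune cells | n | n | n | n
abstract | F-TFIDF-C_M | Influenza pandemic | y | n | n | n
abstract | F-TFIDF-C_M | Sono stati | n | n | n | n
content | C_Value | Infected cells | n | n | n | n
content | C_Value | Respiratory syndrome | n | n | n | y
content | C_Value | Present study | n | n | n | n
content | C_Value | Individual components | n | n | n | n
content | C_Value | Essential medicines | n | n | n | n
content | C_Value | Previous studies | n | n | n | n
content | C_Value | de los | n | n | n | n
content | C_Value | Functional task | n | n | n | n
content | C_Value | der Schwangerschaft | n | n | n | n
content | F-TFIDF-C_M | Health emergency | y | n | y | n
content | F-TFIDF-C_M | Membrane rafts | n | n | n | n
content | F-TFIDF-C_M | pcr products | n | n | n | n
content | F-TFIDF-C_M | afa dr | n | n | n | n
content | F-TFIDF-C_M | COD trypsin | n | n | n | n
content | F-TFIDF-C_M | 2c atpase | n | n | n | n
content | F-TFIDF-C_M | Naked mole | n | n | n | n
content | F-TFIDF-C_M | Intracellular delivery | n | n | n | n
content | F-TFIDF-C_M | Close contact | y | n | y | n
content | F-TFIDF-C_M | Final dataset | n | n | n | n
content | F-TFIDF-C_M | Respiratory syndrome | n | n | n | y
title + abstract | C_Value | Acute respiratory syndrome | n | n | n | y
title + abstract | C_Value | Respiratory syndrome coronavirus | n | n | n | y
title + abstract | C_Value | East respiratory syndrome | n | n | n | y
title + abstract | C_Value | Syndrome coronavirus | n | n | n | y
title + abstract | C_Value | Syncytial virus | n | n | n | y
title + abstract | C_Value | Porcine epidemic diarrhea | y | n | n | n
title + abstract | C_Value | Antiviral activity | y | n | n | n
title + abstract | C_Value | Acute respiratory syndrome coronavirus | n | n | n | y
title + abstract | C_Value | Infectious bronchitis | y | n | n | n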

Each term was evaluated by an expert according to four criteria: domain relevant, COVID-19 surveillance, syndromic surveillance, incomplete disease name (y: yes, n: no)

The categories used to assess term relevance are defined as follows:
COVID-19 surveillance: epidemiological terms specific to COVID-19 (e.g. coronavirus spike).
Syndromic surveillance: epidemiological terms not specific to a particular disease (e.g. infectious bronchitis).
Domain relevant: terms related to health, i.e. either to specific diseases (e.g. porcine epidemic diarrhoea) or unspecific (e.g. virus infections). The Domain relevant category thus includes the two previous categories, plus diseases other than COVID-19.
Part of disease multiword expression (MWE): part of a multiword expression corresponding to a disease name (e.g. East respiratory syndrome for Middle East respiratory syndrome coronavirus).
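The per-term expert judgements (Table 9) can be aggregated into the per-corpus counts (Table 6) with a simple tally. The sketch below uses the ten Title terms judged for C_Value, with each term's four y/n flags given in the order domain relevant, COVID-19 surveillance, syndromic surveillance, part of disease MWE. The function name `tally` and the criterion identifiers are hypothetical.

```python
from collections import Counter

CRITERIA = ("domain_relevant", "covid19", "syndromic", "part_of_disease_mwe")

# Expert judgements for the Title / C_Value terms (one y/n flag per criterion).
title_cvalue = {
    "respiratory syndrome coronavirus": "nnny",
    "porcine epidemic diarrhea": "ynnn",
    "syndrome coronavirus": "nnny",
    "epidemic diarrhea virus": "nnny",
    "acute respiratory syndrome": "nnny",
    "public access": "nnnn",
    "diarrhea virus": "nnny",
    "infectious bronchitis": "ynyn",
    "acute respiratory": "nnny",
    "bronchitis virus": "ynyn",
}


def tally(judgements):
    """Count, per criterion, how many terms were judged 'y'."""
    counts = Counter()
    for flags in judgements.values():
        for criterion, flag in zip(CRITERIA, flags):
            counts[criterion] += flag == "y"
    return {"n": len(judgements), **{c: counts[c] for c in CRITERIA}}


print(tally(title_cvalue))
# {'n': 10, 'domain_relevant': 3, 'covid19': 0, 'syndromic': 2, 'part_of_disease_mwe': 6}
```

The result matches the Title / C_Value row of Table 6 (n = 10, with 3, 0, 2 and 6 terms in the four categories).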

Driven term extraction

We selected terms extracted in Section 6.1: respiratory tract, viral infections, SARS coronavirus, incubation period, influenza virus, respiratory infections and infectious diseases. We randomly extracted the variants with FASTR (Section The driven term extraction approach). An epidemiologist manually evaluated the relevance of 10 randomly selected variants per term. Among the 60 evaluated terms (see Table 10), 72% (43/60) were relevant and 7% (4/60) were irrelevant. For 13 variants (22%), the relevance could not be assessed because the expression was truncated and ambiguous, such as “disease has an infectious” for the term “infectious diseases”. FASTR thus seems to be an effective tool for generating term variants efficiently. However, we noted that FASTR generated up to 774 variants for a single term. Thus, to avoid random selection of terms, it would be interesting to compute a relevance index that could be used to rank the proposed variants. Besides, several extracted variants were fragments of expressions that could not be evaluated. This issue could be overcome by displaying the variant context (i.e. the sentence in which the variants appeared).
Table 10

60 terms randomly selected from FASTR variants (Section The driven term extraction approach)

Influenza virus | Evaluation | Respiratory infections | Evaluation | Infectious diseases | Evaluation
Influenza a/wsn/33 virus | Not relevant | Respiratory virus infections | Relevant | Diseases relates to infectious | Relevant
Viruses and conventional influenza | Relevant | Respiratory viral infection | Relevant | Disease called feline infectious | Relevant
Virus remains the influenza | Lack of context | Infection by respiratory | Relevant | Infectious animal diseases | Relevant
Influenza by virus | Not relevant | Infections of the respiratory | Relevant | Infectious enteric diseases | Relevant
Influenza vaccine virus | Relevant | Infect respiratory | Relevant | Disease without being infectious | Not relevant
Virus and canine influenza | Relevant | Infections are respiratory | Relevant | Disease has an infectious | Lack of context
Virus influenza | Relevant | Infected with respiratory | Relevant | Infectious disease | Relevant
Viruses such as influenza | Relevant | Infection with other respiratory | Relevant | Disease named it infectious | Lack of context
Influenza b viruses | Relevant | Respiratory virus infection | Relevant | Infectious swine diseases | Relevant
Viruses and emerging influenza | Relevant | Infection transmitted via respiratory | Relevant | Disease models for infectious | Lack of context

Viral infections | Evaluation | Sars coronavirus | Evaluation | Incubation period | Evaluation
viral bronchopulmonary infection | Relevant | Coronavirus is urbani sars | Not relevant | Incubating period | Relevant
Virally infected | Relevant | Coronavirus of 18 sars | Not relevant | Periods of incubation | Relevant
Viral respiratory infections | Relevant | Coronavirus that causes sars | Relevant | Incubation periods | Relevant
Infection and encounter virally | Lack of context | Coronavirus named sars | Relevant | Period of incubation | Relevant
Infection or viral | Lack of context | Coronavirus related to sars | Relevant | Period than incubation | Lack of context
Viral skin infection | Relevant | Coronavirus isolated from sars | Relevant | Period and incubation | Lack of context
Virals infection | Relevant | Coronavirus responsable du sars | Relevant | Incubation for period | Lack of context
Infection with one viral | Relevant | Sars -associated coronavirus | Relevant | Period and incubating | Lack of context
viral opportunistic infections | Relevant | Sars human coronavirus | Relevant | Period covering an incubation | Relevant
infection at high viral | Lack of context | Sars and coronavirus | Relevant | Period of extrinsic incubation | Relevant
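Displaying the context of a variant, as suggested above for truncated expressions, could be sketched as a sentence lookup. This is a minimal illustration and not part of FASTR: the regex-based sentence splitting is naive, the function name `variant_contexts` is hypothetical, and the sample text is made up for the example.

```python
import re


def variant_contexts(text, variant):
    """Return the sentences of `text` that contain `variant` (case-insensitive).

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Made-up sample text, for illustration only.
sample = (
    "The disease has an infectious origin. "
    "Samples were stored at room temperature. "
    "Each infectious disease was reported weekly."
)

print(variant_contexts(sample, "disease has an infectious"))
# ['The disease has an infectious origin.']
```

Showing the full sentence next to a fragment such as "disease has an infectious" would give the evaluator the context needed to judge its relevance.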

Conclusion

In this paper we describe ITEXT-BIO, a generic methodology for biomedical term extraction. We show how it allows users to extract terms (or concepts) from different types of textual data using several combined strategies. The free term extraction approach extracts terms from corpora, while the driven term extraction approach extracts, from a corpus and a set of terms, a set of variations of these terms. We illustrate that the proposed combined strategies based on statistical measures and textual segments help efficiently extract and categorize terms (representative, discriminant and new terms) from a corpus or corpora. We also quantitatively and qualitatively analysed the extracted terms to determine those related to the study domain and those that could be considered as emerging terminology for disease monitoring. Our future studies will focus on term extraction and analysis by: (i) taking different sections of papers into account and applying the methodology to different types of corpora derived from newspapers or social media such as Twitter, (ii) considering combinations of tools other than BioTex, and (iii) introducing word embedding strategies like BERT [10] to capture semantic aspects of the extracted terms in order to reduce context ambiguity.
  8 in total

1.  The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors:  Olivier Bodenreider
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  ProMED-mail: an early warning system for emerging diseases.

Authors:  Lawrence C Madoff
Journal:  Clin Infect Dis       Date:  2004-06-28       Impact factor: 9.079

Review 3.  Epidemic intelligence: a new framework for strengthening disease surveillance in Europe.

Authors:  C Paquet; D Coulombier; R Kaiser; M Ciotti
Journal:  Euro Surveill       Date:  2006

Review 4.  Extracting information from textual documents in the electronic health record: a review of recent research.

Authors:  S M Meystre; G K Savova; K C Kipper-Schuler; J F Hurdle
Journal:  Yearb Med Inform       Date:  2008

5.  The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health.

Authors:  Eric Mykhalovskiy; Lorna Weir
Journal:  Can J Public Health       Date:  2006 Jan-Feb

6.  Web monitoring of emerging animal infectious diseases integrated in the French Animal Health Epidemic Intelligence System.

Authors:  Elena Arsevska; Sarah Valentin; Julien Rabatel; Jocelyn de Goër de Hervé; Sylvain Falala; Renaud Lancelot; Mathieu Roche
Journal:  PLoS One       Date:  2018-08-03       Impact factor: 3.240

7.  HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports.

Authors:  Clark C Freifeld; Kenneth D Mandl; Ben Y Reis; John S Brownstein
Journal:  J Am Med Inform Assoc       Date:  2007-12-20       Impact factor: 4.497

8.  Monitoring online media reports for early detection of unknown diseases: Insight from a retrospective study of COVID-19 emergence.

Authors:  Sarah Valentin; Alizé Mercier; Renaud Lancelot; Mathieu Roche; Elena Arsevska
Journal:  Transbound Emerg Dis       Date:  2020-08-02       Impact factor: 4.521
