miRetrieve-an R package and web application for miRNA text mining.

Julian Friedrich¹, Hans-Peter Hammes², Guido Krenning¹.

Abstract

microRNAs (miRNAs) regulate gene expression and thereby influence biological processes in health and disease. As a consequence, miRNAs are intensely studied and literature on miRNAs has been constantly growing. While this growing body of literature reflects the interest in miRNAs, it generates a challenge to maintain an overview, and the comparison of miRNAs that may function across diverse disease fields is complex due to this large number of relevant publications. To address these challenges, we designed miRetrieve, an R package and web application that provides an overview on miRNAs. By text mining, miRetrieve can characterize and compare miRNAs within specific disease fields and across disease areas. This overview provides focus and facilitates the generation of new hypotheses. Here, we explain how miRetrieve works and how it is used. Furthermore, we demonstrate its applicability in an exemplary case study and discuss its advantages and disadvantages.

Entities: Chemical

Year: 2021 PMID: 34988440 PMCID: PMC8696973 DOI: 10.1093/nargab/lqab117

Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268

INTRODUCTION

microRNAs (miRNAs) are small, non-coding RNAs that regulate gene expression (1). By targeting complementary mRNAs, miRNAs repress the translation of genes into proteins and consequently control cellular functions (1,2). As a result, miRNAs add another layer of complexity to the regulation of cellular behavior, which is why an increased understanding of miRNAs may provide deeper insights into the pathophysiology of many diseases (3). The field of miRNA research has expanded substantially over the past decade, as evidenced by the rapidly growing body of miRNA literature. In that period, >110 000 publications containing the MeSH term ‘miRNA’ have appeared on PubMed (https://www.ncbi.nlm.nih.gov/pubmed/). This large number of publications reflects the global interest in miRNAs across research fields, yet it also makes it challenging to maintain an overview of which miRNAs have been investigated and which functions they exert in a specific disease. This is already difficult within one field, and comparing miRNA functions across disease fields is harder still. At the same time, comparing miRNAs across fields may extend the understanding of miRNA function per se, for example by addressing the key question of how miRNAs select their target genes, which is not exactly known to date (4), and may facilitate the classification of miRNAs into functional groups based on the cellular behavior they influence. This wealth of information has been summarized in manually curated databases such as miRTarBase (5), while tools such as fireflybio (https://www.fireflybio.com/portal/search), miRSel (6), miRTex (7) or emiRIT (8) help to automatically extract associations between miRNAs and genes and diseases from text. To further address the challenges above, we have designed miRetrieve.
miRetrieve is an R package (9) and complements the previous tools by combining text mining algorithms to retrieve key information on miRNAs directly generated from text, such as their current frequency of investigation or likely function in a disease. Furthermore, miRetrieve also visualizes and compares miRNA:mRNA interactions by integrating information from external databases such as miRTarBase (5). Additionally, miRetrieve’s most important functions are available as a web application (10). Here we discuss how miRetrieve works and how it is used. Furthermore, we demonstrate its applicability in an exemplary case study and discuss its advantages and disadvantages.

MATERIALS AND METHODS

A detailed explanation of miRetrieve and step-by-step guide with code is given in the vignette (Supplementary File S1). A complete overview of all functions is in miRetrieve’s documentation. This section explains the key tools in a miRetrieve workflow that are also provided in the web application.

Input

miRetrieve uses text as input and works with all databases that provide a MEDLINE or JATS format, but is optimized to work with PubMed abstracts (https://www.ncbi.nlm.nih.gov/pubmed/). We chose to design miRetrieve to text mine abstracts and not perform deep text mining of full papers to ensure access without restriction (compared to full text non-open access publications). Additionally, as abstracts summarize the major research findings, these may serve as a filter which facilitates the incorporation of miRNAs investigated in-depth and limits the incorporation of miRNAs only identified in array or sequencing experiments. One key idea of miRetrieve is to text mine abstracts related to one specific disease field, e.g. miRNAs in atherosclerosis. That way, miRetrieve can identify and characterize miRNA functions distinct for this field, and later compare field-related miRNA functions in subsequent analysis steps. To use abstracts with miRetrieve, abstracts are downloaded from PubMed in a MEDLINE or JATS format and read into R.

Datamining

Extracting miRNAs

After the abstracts are loaded in miRetrieve, analysis is prepared by extracting miRNA names from text using several regular expressions (regex). First, abstracts are prepared for miRNA extraction, e.g. by converting any non-ASCII character to its ASCII equivalent or by inserting a ‘-’ between any combination of RNA and a digit, using the regex (?<=[Rr][Nn][Aa])([0-9]) with the replacement -\\1. The abstracts are then screened for miRNAs, using the regex [Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\\-\\s]?\\d+[a-zA-Z]?[\\-]?[12345]?[\\-]?[35]?[Pp]?. This regex detects single miRNAs, such as miR-146a-5p, MiRNA-146a, microRNA-146 or MIR146. Furthermore, it detects the first miRNA of enumerations, such as miR-23b from miR-23b, -146a and -155 or miR-23b/-146a/-155, but misses subsequent miRNAs, like -146a and -155 or /-146a/-155 in this case. Therefore, any miRNA detected by this initial regex is used to create an individual regex with a positive lookbehind to catch enumerated miRNAs (e.g. (?<=miR-23b)([/,;and\\s?]+[-]?\\d+[a-z]?[-]?[1-5]?[-]?[53]?[pP]?\\s?)+ to catch -146a and -155 or /-146a/-155). Additionally, the regex ([Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\\-\\s]?\\d+)([a-z][/]?)+ identifies miRNAs with enumerated lettered suffixes such as miR-33a/b, which are subsequently divided into their stem (miR-33) and lettered suffixes (a/b) by matching the regex ([Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\\-\\s]?\\d+)(.*). Afterwards, the stem is vectorized over the lettered suffixes to obtain the separated miRNAs, e.g. vectorizing miR-33 over a/b to obtain miR-33a and miR-33b. The identified miRNAs are then cleaned, among other steps by removing trailing -3p and -5p or trailing digits such as -1, using the regex [\\-\\s]?3[Pp]|[\\-\\s]?5[Pp] and (?<=\\d[a-zA-Z]?)(-[12345]), respectively. The remaining numbers are miRNA numbers with or without lettered suffixes, which are extracted with \\d+[a-zA-Z]? and concatenated with miR- to bring miRNA names into a consistent form.
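The detection and clean-up steps described above can be sketched outside R. The following Python translation is an illustration only, not miRetrieve's implementation: the function names `extract_mirnas` and `split_lettered` are hypothetical, R's doubled backslashes become single backslashes in Python raw strings, and the lookbehind-based handling of enumerations such as miR-23b, -146a and -155 is omitted. Because Python's `re` module only accepts fixed-width lookbehinds, trailing digits such as -1 are discarded here by keeping the first number-plus-letter match rather than by the lookbehind regex quoted in the text.

```python
import re

# Detection regex quoted in the text, rewritten as a Python raw string.
MIR_PATTERN = re.compile(
    r"[Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\-\s]?\d+[a-zA-Z]?"
    r"[\-]?[12345]?[\-]?[35]?[Pp]?"
)
# Stem plus enumerated lettered suffixes, e.g. miR-33a/b.
LETTERED = re.compile(
    r"([Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\-\s]?\d+)((?:[a-z]/?)+)"
)


def split_lettered(name):
    """Expand 'miR-33a/b' into ['miR-33a', 'miR-33b'] (hypothetical helper)."""
    match = LETTERED.fullmatch(name)
    if match is None:
        return [name]
    stem, suffixes = match.groups()
    # "Vectorize" the stem over each lettered suffix.
    return [stem + letter for letter in suffixes.split("/") if letter]


def extract_mirnas(abstract):
    """Return the unique, normalised miRNA names found in one abstract."""
    # Insert '-' between 'RNA' and a directly following digit.
    text = re.sub(r"(?<=[Rr][Nn][Aa])([0-9])", r"-\1", abstract)
    names = set()
    for hit in MIR_PATTERN.findall(text):
        # Remove -3p / -5p arm annotations.
        hit = re.sub(r"[\-\s]?3[Pp]|[\-\s]?5[Pp]", "", hit)
        # The first number plus optional letter is the core name; keeping
        # only this match also discards trailing digits such as '-1'.
        core = re.search(r"\d+[a-zA-Z]?", hit)
        if core is not None:
            names.add("miR-" + core.group().lower())
    return sorted(names)
```

For example, `extract_mirnas("miR-146a-5p, MiRNA-146a, microRNA-146 and MIR146 were altered.")` collapses the four spellings into the two names miR-146 and miR-146a.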
miRNA extraction is complemented by using the regex [Ll][Ee][Tt][\\-\\s]?\\d+[a-zA-Z]?[\\-]?[35]?[Pp]? to extract let-7 variants, where -3p and -5p are again removed and let- is converted to lower case. Finally, any lettered suffix is converted to lower case with (?<=\\d)([A-Z]) and \\L\\1, and outdated miRNA names are replaced by their most recent names according to miRBase (11) version 22 (e.g. miR-97 and miR-102 are replaced by miR-30a and miR-29b, respectively). The user can specify whether to extract miRNA names annotated with or without their lettered suffixes (e.g. miR-146a or miR-146). Extracting miRNA names with their lettered suffixes differentiates between miRNAs of similar sequence yet different genomic location and improves accuracy of the downstream analysis, but this accuracy critically depends on the correct and consistent usage of miRNA nomenclature (e.g. hsa-miR-146a-3p). As miRNA nomenclature may be inconsistently used throughout literature, extracting miRNA identifiers without these suffixes increases the consistency of extraction. For this reason, we designed miRetrieve to ignore extensions such as a species prefix, -3p/-5p or genomic suffixes, and to focus on the main miRNA name instead. By default, miRetrieve is programmed to extract each unique miRNA name in an abstract once, so that counts reflect the true number of abstracts mentioning a miRNA. This setting avoids the bias that may be introduced when a miRNA is mentioned multiple times in the same abstract. However, as repeated mention of a miRNA name may reflect the depth at which that miRNA was investigated, miRetrieve offers the possibility to extract only miRNAs mentioned with a minimal frequency in one abstract. Enabling this option ignores miRNAs mentioned only once or rarely and increases the visibility of well-investigated miRNAs. The extracted miRNA names and their associated count files are the cornerstone of subsequent analysis, as detailed in the subsections below.
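The counting behaviour described above (each miRNA counted once per abstract, optionally only if it is mentioned a minimum number of times) can be illustrated with a short Python sketch. This is not miRetrieve's code: the names `mirnas_in_abstract`, `count_abstracts` and `min_mentions` are hypothetical, the regex is a simplified version of the detection pattern, and let-7 variants and enumerations are ignored.

```python
import re
from collections import Counter

# Simplified detection regex; the group captures number plus optional letter.
MIR = re.compile(
    r"[Mm][Ii][Cc]?[Rr][Oo]?[Rr]?[Nn]?[Aa]?[\-\s]?(\d+[a-zA-Z]?)"
)


def mirnas_in_abstract(abstract, min_mentions=1):
    """Unique miRNA names mentioned at least `min_mentions` times."""
    mentions = Counter("miR-" + core.lower() for core in MIR.findall(abstract))
    return {name for name, n in mentions.items() if n >= min_mentions}


def count_abstracts(abstracts, min_mentions=1):
    """Number of abstracts in which each miRNA passes the threshold."""
    counts = Counter()
    for abstract in abstracts:
        # Updating with a set adds one per miRNA per abstract.
        counts.update(mirnas_in_abstract(abstract, min_mentions))
    return counts


abstracts = [
    "miR-155 rises; miR-155 targets SOCS1. miR-21 unchanged.",
    "miR-21 and miR-21-5p fall.",
    "MiR-155 noted.",
]
print(count_abstracts(abstracts))                  # every mention counts
print(count_abstracts(abstracts, min_mentions=2))  # threshold > 1
```

With the toy abstracts above, raising the threshold drops each miRNA from two abstracts to one, which mirrors the trade-off between coverage and focus described in the text.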

Frequency-distribution of miRNAs

Based on the extracted miRNA names, miRetrieve provides a frequency distribution of all identified miRNAs. Determining which miRNAs are frequently or rarely mentioned across abstracts indicates which miRNAs may be well investigated or under-investigated in a certain research field. Thus, the frequency distribution of miRNAs may serve as a surrogate measure of how strongly a certain miRNA influences the cellular or disease phenotype under investigation.

Term-association of miRNAs

miRetrieve summarizes which terms are associated with a specific miRNA over several abstracts by tokenization. Tokenization refers to a process where text is split into smaller pieces, also called ‘tokens’. By default, miRetrieve splits text on white space and strips punctuation. A token then may either represent a single word, such as inflammation or tnf, or adjacent words/numbers, such as mir 146 or low density lipoprotein (12), also referred to as n-grams depending on the number n of words/numbers included in a token. Per miRNA, all abstracts of the provided input containing this miRNA are split into tokens. Afterwards, frequent, but irrelevant tokens such as a, is, western or expression level, are removed, using a manually curated dictionary for common PubMed expressions and a dictionary of the tidytext package (12) for common English words. The remaining tokens are then displayed in decreasing frequencies, thus showing the frequency order of words used in the same abstracts as the respective miRNA. Consequently, miRetrieve supports miRNA annotation to putative functional classes (e.g. inflammation and proliferation): A miRNA that is often mentioned in abstracts containing inflammation, tnf and macrophages, for example, may also play a role in inflammation. Alternatively, miRetrieve can also do the reverse and identify which miRNAs are commonly associated with a user-defined term. Determining which miRNAs are frequently mentioned in combination with drugs or pathways, for example, may indicate possible miRNA-drug or miRNA-pathway interactions.
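A minimal Python sketch of this term-association idea follows. It is an assumption-laden illustration rather than miRetrieve's implementation: the stop-word list is a tiny placeholder for the tidytext and PubMed dictionaries, the function names `ngrams` and `term_association` are hypothetical, and each term is counted once per abstract so that the result is the share of abstracts mentioning it.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the real dictionaries are not reproduced.
STOP_WORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "western"}


def ngrams(text, n=1):
    """Lower-case n-grams of `text`, dropping any that contain a stop word."""
    words = re.findall(r"\w+", text.lower())  # split and strip punctuation
    grams = (words[i:i + n] for i in range(len(words) - n + 1))
    return [" ".join(g) for g in grams if not STOP_WORDS.intersection(g)]


def term_association(abstracts, mirna, n=1):
    """Share of abstracts mentioning `mirna` in which each n-gram occurs."""
    # (?!\d) keeps 'miR-21' from matching 'miR-210'.
    mention = re.compile(re.escape(mirna) + r"(?!\d)", re.IGNORECASE)
    hits = [a for a in abstracts if mention.search(a)]
    counts = Counter()
    for abstract in hits:
        counts.update(set(ngrams(abstract, n)))  # once per abstract
    return {term: k / len(hits) for term, k in counts.items()}


abstracts = [
    "miR-155 promotes inflammation in macrophages.",
    "Inflammation is driven by miR-155 and TNF.",
    "miR-21 targets PTEN.",
]
single = term_association(abstracts, "miR-155")       # single words
pairs = term_association(abstracts, "miR-155", n=2)   # two adjacent words
```

In this toy input, inflammation appears in both miR-155 abstracts, whereas pten never appears because the miR-21 abstract is excluded before tokenization.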

Class-association of miRNAs

Next to establishing which miRNAs are commonly associated with a specific term, miRetrieve can also dissect which miRNAs are commonly associated with multiple terms at once. Here, all abstracts of the text input are split into tokens, and only abstracts containing defined terms remain. If these terms summarize a larger class (e.g. a combination of circulating, biomarker, extracellular vesicles, urine, body fluid for biomarker, or a combination of patients, biopsies, children, women, men, humans for clinical studies), miRetrieve assists in identifying miRNAs that may serve as biomarkers, or that have already been investigated in patients. This feature can support the estimation of the potential clinical translatability of miRNAs and direct further research.

Visualizing validated miRNA–target interactions

Besides the functional association of a miRNA based on term associations, miRetrieve is able to visualize miRNA:mRNA interactions by relying on external databases such as miRTarBase (5). miRTarBase is a manually curated database that contains validated miRNA:mRNA interactions and the PubMed IDs of the publications these were reported in (5). Here, the strength of miRetrieve lies in matching the PubMed IDs of miRTarBase to the PubMed IDs of given abstracts. miRetrieve includes the latest miRTarBase version 8 (5), but any database providing PubMed IDs is suitable. When these abstracts summarize miRNAs in a specific disease, e.g. miRNAs in atherosclerosis, miRetrieve consequently visualizes only validated miRNA:mRNA interactions in this field while ignoring unrelated interactions. This feature thus supports the identification of key miRNA:mRNA interactions in any specific disease field.
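The PubMed ID matching described above amounts to a filter of the interaction table by the IDs of the loaded abstracts. The Python sketch below shows the idea under stated assumptions: `match_by_pmid` is a hypothetical name, the rows are simplified dictionaries rather than the miRTarBase file format, and the PubMed IDs are placeholders, not real identifiers.

```python
def match_by_pmid(abstract_pmids, interactions):
    """Keep only interactions reported in one of the given abstracts."""
    field = set(abstract_pmids)
    return [row for row in interactions if row["pmid"] in field]


# Placeholder IDs (101-103) standing in for real PubMed IDs.
interactions = [
    {"mirna": "miR-155", "target": "SOCS1", "pmid": 101},
    {"mirna": "miR-21", "target": "PTEN", "pmid": 102},
    {"mirna": "miR-21", "target": "GENE_X", "pmid": 103},  # other field
]
field_pmids = [101, 102]  # IDs of the abstracts loaded for one disease field

field_targets = match_by_pmid(field_pmids, interactions)
```

Only the first two rows survive, because the third interaction was reported in a publication outside the loaded field.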

Comparing miRNA frequency, function and gene targets across research fields

As detailed above, miRetrieve provides an overview of miRNA frequency, function, and gene targets for any pre-specified research field. These functions can be easily extended to analyze and compare miRNAs across multiple fields. This cross-field comparison may be especially interesting for identifying common miRNA signatures in multiple cancer subtypes, across multiple inflammatory diseases, and so on. By combining the same functions detailed above to compare miRNA frequency distributions, miRNA-term associations, and/or miRNA-target interactions across fields, miRetrieve may uncover common and distinct functions of miRNAs depending on disease context.
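As a rough illustration of a cross-field comparison, the Python sketch below intersects the most frequently mentioned miRNAs of two fields. The function name `shared_top_mirnas` is hypothetical and the counts are invented placeholders, not results from the case study.

```python
from collections import Counter


def shared_top_mirnas(counts_a, counts_b, k=10):
    """miRNAs that rank among the top `k` in both fields."""
    top_a = {name for name, _ in Counter(counts_a).most_common(k)}
    top_b = {name for name, _ in Counter(counts_b).most_common(k)}
    return sorted(top_a & top_b)


# Invented abstract counts per miRNA for two hypothetical fields.
field_a = {"miR-155": 9, "miR-126": 7, "miR-21": 5, "miR-33": 2}
field_b = {"miR-21": 12, "miR-200": 8, "miR-34": 6, "miR-155": 1}

print(shared_top_mirnas(field_a, field_b, k=3))
```

The same pattern extends to term associations or target lists: compute the per-field result first, then compare the two results.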

Usage and reliability of miRetrieve

Demonstration cases

The following case study demonstrates a typical miRetrieve workflow by illustrating how miRetrieve extracts information from thousands of abstracts. Here, miRetrieve identifies miRNAs frequently investigated in atherosclerosis and their putative functions. Furthermore, the interactions between miRNAs and their validated gene targets are visualized by the integration of miRTarBase (5) in miRetrieve. Throughout this demonstration, results generated by miRetrieve are manually validated to determine the accuracy and reliability of the program. In a second part, a similar workflow is applied to compare miRNAs involved in atherosclerosis to miRNAs involved in lung cancer. Additionally, miRetrieve is used to identify biomarker miRNAs in both diseases. As atherosclerosis and lung cancer are largely independent from each other, these analyses provide an example of disease context-dependent and -independent features of the miRNAs under investigation. The code for these demonstrations is available in Supplementary File S2. The graphs are created with ggplot2 (13) and patchwork (14).

Data preparation

To investigate the role of miRNAs in atherosclerosis, a total of 1750 PubMed abstracts (https://www.ncbi.nlm.nih.gov/pubmed/) matching the keywords atherosclerosis miRNA as of 11 February 2021 were loaded into R as Atherosclerosis with read_pubmed().

Accuracy of miRNA extraction and frequency distribution

To determine the precision of the miRNA name extraction algorithm, miRNA names without and with lettered suffixes were extracted with extract_mir_df(). Afterwards, 200 abstracts were randomly selected and automatic miRNA name extraction was compared against manual extraction by computing the precision ([Correctly extracted miRNAs] / ([Correctly extracted miRNAs] + [Falsely extracted miRNAs])) and recall ([Correctly extracted miRNAs] / ([Correctly extracted miRNAs] + [Missed miRNAs])) of automatic miRNA name extraction. Additionally, extracted miRNA names were counted with count_mir(). Frequency distribution of miRNAs was validated by comparing the count of the top thirty miRNAs that carry no lettered suffix by definition against a PubMed search, following the pattern of ‘(miR-(number) atherosclerosis) AND ((‘1990/01/01’[Date - Entry] : ‘2021/02/11’[Date - Entry]))’ as of 11 February 2021. Per miRNA, miRetrieve’s and PubMed’s frequency count was compared by calculating a score with [miRetrieve count] / [PubMed count]. Afterward, all frequency scores were summarized by the mean, where a mean > 1 indicated a higher miRNA count with miRetrieve, on average, while a mean <1 indicated a higher miRNA count with PubMed. Finally, the same steps were repeated for extracted miRNA names with lettered suffixes.
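The three measures defined above are simple ratios, sketched here in Python. The function names are hypothetical; the precision and recall example uses the counts reported in the Results for extraction without lettered suffixes (350 correct, 1 false, 1 missed), while the count pairs passed to `mean_frequency_score` are invented for illustration.

```python
def precision(correct, false_hits):
    """Correctly extracted / (correctly extracted + falsely extracted)."""
    return correct / (correct + false_hits)


def recall(correct, missed):
    """Correctly extracted / (correctly extracted + missed)."""
    return correct / (correct + missed)


def mean_frequency_score(pairs):
    """Mean of [miRetrieve count] / [PubMed count] over all miRNAs.

    A mean > 1 indicates higher counts with miRetrieve, on average.
    """
    scores = [mir_count / pubmed_count for mir_count, pubmed_count in pairs]
    return sum(scores) / len(scores)


print(round(precision(350, 1), 3))  # 0.997
print(round(recall(350, 1), 3))     # 0.997
# Invented (miRetrieve, PubMed) count pairs for two miRNAs.
print(mean_frequency_score([(30, 20), (10, 10)]))  # 1.25
```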

Text mining of miRNAs in atherosclerosis

From the original Atherosclerosis data frame, abstracts were filtered for research articles with subset_research() to focus on original research studies and not to bias the results with articles of other types such as Reviews or Letters. Next, miRNA names mentioned at least twice in an abstract were extracted with extract_mir_df(), resulting in 1003 abstracts and 1432 extracted miRNA names. The miRNA extraction threshold was set to >1 to increase the chance of focusing on relevant abstracts.

Frequency-distribution of miRNAs in atherosclerosis

To identify frequently investigated miRNAs in atherosclerosis, the ten most frequently mentioned miRNAs were plotted with plot_mir_count(). To estimate how well the miRNA frequency distribution reflects the degree of investigation, abstracts mentioning miR-155, miR-126, miR-146, miR-21 and miR-33 (the top 5 miRNAs in atherosclerosis; see also Figure 2) at least twice were compared to the number of abstracts indicating a relevant role of the respective miRNA in atherosclerosis, where ‘relevant’ was defined as the respective miRNA either targeting key proteins or being differentially expressed.

Figure 2.

Most frequently mentioned miRNAs in atherosclerosis. Atherosclerosis abstracts were confined to research articles and miRNA names mentioned at least twice in an abstract were extracted. Ten most frequently mentioned miRNAs in atherosclerosis were plotted. Bargraphs indicate number of abstracts a miRNA was mentioned at least twice in.

miRNA extraction thresholding

The threshold for extracting miRNAs may be altered by the user. Setting the miRNA extraction threshold to >1 may increase the scientific relevance of an extracted miRNA, but ignores abstracts that mention the respective miRNA only once, and thereby limits the diversity of the results. To examine if a higher miRNA extraction threshold improves the ‘precision-to-recall ratio’ for relevant abstracts, all abstracts mentioning miR-155, miR-126, miR-146, miR-21 and miR-33 were compared to the number of abstracts indicating a relevant role of the respective miRNA in atherosclerosis. For each miRNA and each extraction threshold (‘none’ or >1), precision ([Relevant abstracts] / ([Relevant abstracts] + [Irrelevant abstracts])) and recall ([Relevant abstracts] / ([Relevant abstracts] + [Relevant abstracts missed])) were calculated.

Term-association of miRNAs in atherosclerosis

To determine the putative roles of miR-155 and miR-21, two frequently mentioned miRNAs in atherosclerosis, their most frequently associated single and two adjacent words were plotted with plot_mir_terms(). miR-155 was investigated as it is the most frequently mentioned miRNA in atherosclerosis, whereas miR-21 is frequently mentioned in atherosclerosis and the most frequently mentioned miRNA in lung cancer, the second main topic of this case study. Afterwards, these suggested miRNA annotations were validated by manually cross-referencing abstracts and corresponding articles.

Visualizing validated miRNA–target interactions in atherosclerosis

miRNA characterization was concluded by visualizing validated miRNA:mRNA interactions for miR-155 and miR-21 in atherosclerosis with join_targets() and plot_target_mir_scatter(), using the latest version of miRTarBase (5).

Comparing miRNA frequency, function and gene targets between atherosclerosis and lung cancer

To contrast the roles and functions of miRNAs in atherosclerosis and lung cancer, 6228 PubMed abstracts matching the keywords lung cancer miRNA as of 11 February 2021 were loaded as Lung cancer into R and filtered for original research articles, and miRNA names mentioned at least twice in an abstract were extracted, resulting in 4405 abstracts and 5865 miRNAs. To compare the most frequently investigated miRNAs in lung cancer to miRNAs in atherosclerosis, the ten most frequently mentioned miRNAs in lung cancer were plotted with plot_mir_count(). To compare, as an example, the role of miR-21 in atherosclerosis (also see above) and lung cancer, the Atherosclerosis and Lung cancer data frames were combined with combine_df(). Afterward, the frequencies of the shared terms of miR-21 were juxtaposed with compare_mir_terms(). The comparison of miRNAs in atherosclerosis and lung cancer was concluded by identifying potential biomarker miRNAs with calculate_score_biomarker(), which pinpointed abstracts containing at least five marker words for biomarker (bio-marker, biological marker, biomarker, body fluid, bodyfluid, circulating, diagnostic, exosomal, exosomes, extracellular vesicles, plasma, serum, urinary, urine). The choice of marker words for biomarker is empirical and reflects that biomarker miRNAs usually circulate in exosomes or extracellular vesicles in easily accessible body fluids such as plasma, serum or urine (15). The seven most frequently reported miRNAs in these abstracts were visualized and compared with plot_mir_count(). The resulting frequency distribution was manually validated for selected miRNAs.
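The marker-word scoring can be sketched in Python as follows. This is not the code of calculate_score_biomarker(): the names `biomarker_score` and `biomarker_abstracts` are hypothetical, and the sketch assumes that the score counts distinct marker words per abstract, which the text does not specify. Terms are matched at word starts, so plurals such as biomarkers count while cytoplasmatic does not match plasma.

```python
import re

# Marker words for "biomarker" as listed in the text.
MARKERS = [
    "bio-marker", "biological marker", "biomarker", "body fluid",
    "bodyfluid", "circulating", "diagnostic", "exosomal", "exosomes",
    "extracellular vesicles", "plasma", "serum", "urinary", "urine",
]


def biomarker_score(abstract):
    """Number of distinct marker words found in one abstract (assumption)."""
    text = abstract.lower()
    return sum(
        1 for term in MARKERS
        if re.search(r"\b" + re.escape(term), text)
    )


def biomarker_abstracts(abstracts, threshold=5):
    """Abstracts whose score reaches the threshold (five in the case study)."""
    return [a for a in abstracts if biomarker_score(a) >= threshold]


examples = [
    "Circulating exosomal miR-21 in serum and plasma is a diagnostic biomarker.",
    "miR-21 targets PTEN in the cytoplasmatic compartment.",
]
kept = biomarker_abstracts(examples)
```

Only the first example passes the threshold; miRNA names would then be extracted and counted in the retained abstracts.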

RESULTS

Accuracy of miRNA extraction

To validate miRetrieve’s miRNA extraction, automatic miRNA name extraction without and with lettered suffixes was compared to manual extraction from 200 randomly selected abstracts. Without lettered suffixes, manual extraction yielded 351 miRNAs. Of these 351 miRNAs, miRetrieve correctly extracted 350 miRNAs, while missing 1 miRNA and falsely extracting 1 miRNA instead, which amounts to a recall and precision of 99.7%. With lettered suffixes, manual extraction totaled 370 miRNAs. Of these 370 miRNAs, miRetrieve correctly extracted 369 miRNAs, while missing 1 miRNA and falsely extracting 1 miRNA instead, corresponding to a recall and precision of 99.7% (Supplementary File S3). The discrepancy is due to a typing error that is easily compensated for by a human, as miR-221/22 was extracted as miR-221 and miR-22, when miR-222 was meant and therefore missed (16). Throughout, miRetrieve handled different miRNA spelling types and coerced them into one form, even within the same abstract (e.g. MIRN125B, miR-125b and miR125b to miR-125b; miR-19a,-21,-126[…] to miR-19a, miR-21, miR-126 […], Supplementary File S3). Furthermore, combined miRNA names such as miR-33a/b were split into miR-33a and miR-33b and counted accordingly (Supplementary File S3). Thus, miRetrieve reliably detects and extracts miRNA names in abstracts, which provides a trustworthy foundation for subsequent analysis steps.

Accuracy of miRNA frequency distribution

Comparing miRetrieve’s count of miRNA names to a distinct PubMed search per miRNA, miRetrieve arrives, on average, at a slightly higher miRNA count than PubMed. When miRNA names carrying no lettered suffixes are extracted, miRetrieve identifies a higher count per miRNA than PubMed (Figure 1A), as indicated by a mean frequency score of 1.58.

Figure 1.

Comparison of frequency distribution of miRNAs as obtained by miRetrieve and PubMed. miRNA names without (A) and with lettered suffixes (B) were extracted from atherosclerosis abstracts. miRNA names were counted with miRetrieve and PubMed’s search algorithm following the pattern ‘miR-(number) atherosclerosis’, and top 30 miRNAs according to miRetrieve were visualized. Bargraphs indicate number of abstracts containing miRNAs as obtained by miRetrieve (orange) and PubMed’s search algorithm (green).

Also, when miRNA names that carry lettered suffixes are extracted, miRetrieve again identifies a higher count per miRNA than PubMed (Figure 1B), as indicated by a mean frequency score of 1.82. This indicates that miRetrieve counts miRNAs accurately and reflects their occurrence in abstracts, on average, more precisely than PubMed. In contrast to PubMed, miRetrieve recognizes miRNAs regardless of possible lettered suffixes during extraction. This offers the advantage that miRetrieve summarizes miRNAs with the same number when lettered suffixes are ignored (e.g. let-7a, let-7b, let-7c, … are summarized as let-7), which adds flexibility when the specific miRNA name is irrelevant or unknown, or when a general overview of a certain miRNA number is preferred. Additionally, miRetrieve extracts individual miRNAs from chained formats (such as miR-221 and miR-222 from miR-221/222, or miR-33a and miR-33b from miR-33a/b), which reflects miRNA occurrence more precisely.

Frequency-distribution of miRNAs in atherosclerosis

After validating miRetrieve’s miRNA extraction and frequency distribution, miRNAs mentioned at least twice in an abstract were extracted to characterize the miRNA landscape in atherosclerosis. Here, miR-155 is the most frequently mentioned miRNA, referenced at least twice in 80 abstracts (Figure 2). Next to miR-155, miR-126, miR-146, miR-21 and miR-33 are other frequently mentioned miRNAs in atherosclerosis, described in at least 50 abstracts each (Figure 2). On average, about 94% of these abstracts report a relevant role of the respective miRNA in atherosclerosis, as evidenced by manual validation (Supplementary File S4). This demonstrates that the frequency distribution of miRNAs reflects their level of investigation in a research field: frequently mentioned miRNAs are well investigated, whereas rarely mentioned miRNAs lack research and characterization.

miRNA extraction thresholding

In the approach above, the frequency distribution of miRNAs was based on a miRNA extraction threshold >1, meaning that per miRNA, only abstracts mentioning that miRNA at least twice were considered. Although this yielded relevant abstracts, it risks losing relevant abstracts that mention the respective miRNA only once. Indeed, setting a miRNA extraction threshold >1 decreases the recall from 100% to, on average, 88% relevant abstracts, indicating a loss of relevant abstracts (Supplementary File S4). This decrease in recall, however, is contrasted by an increase in precision from 90% to, on average, >94% relevant abstracts for an extraction threshold >1 (Supplementary File S4). This implies that a higher miRNA extraction threshold increases precision for relevant abstracts, and therefore increases the chance to focus on relevant abstracts.

Term-associations of miR-155 in atherosclerosis

In atherosclerosis, miR-155 is the most frequently mentioned miRNA and thus the miRNA that has received most attention. Hence, its role in atherosclerosis is relevant (e.g. (17–25)). In atherosclerosis, miR-155 is often associated with terms related to inflammation (e.g. inflammation, inflammatory response, tnf α, il 6, nf κb) (Figure 3A,B). Furthermore, almost 30% of abstracts link miR-155 to macrophages (macrophage, macrophages) (Figure 3A), while about 20% and 10% of abstracts describe miR-155 alongside miR-21 and miR-126, respectively (Figure 3B). Taken together, these terms indicate that miR-155 is a key player of inflammation in atherosclerosis, likely by affecting macrophage behavior. Additionally, miR-155 might be connected to miR-21 or miR-126.

Figure 3.

Term-associations of miR-155 and miR-21 in atherosclerosis. Most frequently associated single and two adjacent words for miR-155 (A and B) and miR-21 (C and D) in atherosclerosis were plotted, while common words and terms such as a, is, significant, or western were ignored. Additionally, field specific stop words such as atherosclerosis or atherosclerotic were added to aid readability. Bargraphs indicate number of abstracts mentioning single and two adjacent words relative to all abstracts mentioning the respective miRNA.

Validating these propositions with a manual search reveals that miR-155 indeed drives inflammation by targeting SOCS1 (26,27) and by activating the NLRP3 inflammasome via the ERK1/2 pathway (28). Moreover, miR-155 polarizes macrophages to their pro-inflammatory M1 state (29), thus aggravating inflammation (18,30). In addition, miR-155 correlates positively with IL-6 and TNF-α (31), two other terms referenced in the associations. miR-155 increases both proteins (32,33) by targeting SOCS1 (34). Vice versa, TNF-α stimulates miR-155 expression (25) and releases miR-155 rich particles (35). A similar reciprocal relationship exists between miR-155 and NF-κB (25,27). Lastly, the association between miR-155 and miR-21/miR-126 is explained by the fact that miR-155 and miR-21 are inversely expressed in atherosclerosis (36), whereas levels of miR-155 and miR-126 are jointly altered in atherosclerotic plaques (37). Furthermore, a mutually altered expression of all three miRNAs in atherosclerosis is described in at least two studies (38). This demonstrates that miRetrieve’s term-associations summarize the putative role of a miRNA in a focused and accurate manner. Additionally, miRetrieve reveals which miRNAs are jointly differentially expressed in a disease.

Term-associations of miR-21 in atherosclerosis

In atherosclerosis, miR-155 is related to terms of inflammation. However, as inflammation is one of the main drivers in atherosclerosis (39), these inflammatory signaling words might reflect a common mechanism, and thus be unspecific for miR-155. To investigate this possibility, terms frequently associated with miR-21, the fourth most frequently mentioned miRNA in atherosclerosis (Figure 2), are compared to terms associated with miR-155. In contrast to miR-155, miR-21 is not as frequently associated with inflammation (35% of abstracts with miR-155, 18% of abstracts with miR-21), macrophage (20%, 9%) or nf κb (12%, 4%), while the frequency of il 6 is similar (13%, 7%) (Figure 3C,D). This suggests that miR-155 plays a more active role in inflammation than miR-21, which is confirmed by a manual search: although an absence of miR-21 in macrophages results in inflammation (40) and serum miR-21 correlates positively with atherosclerotic inflammation (41), most studies only associate miR-21 with inflammation, but do not demonstrate a clear mechanistic link (e.g. (42)). In contrast, miR-21 frequently associates with apoptosis (18%), pten (16%) and mir 221 (11%). This hints that miR-21 regulates apoptosis in atherosclerosis and is connected to PTEN, while its expression might be associated with miR-221 (Figure 3C,D). Indeed, miR-21 targets PTEN (43) and correlates negatively with PTEN in patients (44). Among other effects, this prevents apoptosis (45), either by stimulating the PI3K pathway (46) or by increasing the expression of Bcl-2 (47). Additionally, miR-21 averts apoptosis by targeting PDCD4 (48) and MKK3, subsequently decreasing p38-CHOP and JNK signaling (40). This illustrates that miRetrieve's term-associations are not only meaningful, but also characterize the role of a miRNA independent from disease state.

Targets of miR-155 and miR-21 in atherosclerosis

miRNAs exert their main function by translational repression of complementary mRNA targets (1). Thus, the characterization of miR-155 and miR-21 in atherosclerosis is completed by viewing their validated targets using miRTarBase (5). As only abstracts reporting miRNAs in atherosclerosis are present, these validated targets are confined to miRNA–mRNA interactions in atherosclerosis. According to miRTarBase, miR-155 targets >20 genes in atherosclerosis, whereas miR-21 targets 7 genes (Figure 4). Among these, miR-155 targets ICAM1, VCAM1 (49), NFKB1 and NSD3 (50), which confirms miR-155’s inflammatory role in atherosclerosis. In addition, PTEN is listed as a target of miR-21 and also encountered in miR-21’s term associations (51).

Figure 4.

Validated targets of miR-155 and miR-21 in atherosclerosis according to miRTarBase (5). PubMed IDs of atherosclerosis abstracts containing miR-155 and miR-21 were matched to PubMed IDs of miRTarBase, and validated targets of miR-155 and miR-21 were displayed. Each dot indicates one study validating the miRNA–mRNA interaction.

However, miRNA listings in miRTarBase are incomplete: While miRTarBase lists only one publication each of miR-155 to target SOCS1 (34) and miR-21 to target PTEN (51), notably more studies confirm miR-155 to target SOCS1 (e.g. (26,27)) or miR-21 to target PTEN (e.g. (43,52)) in atherosclerosis. Moreover, miR-21 has been confirmed to target other proteins that are not listed, e.g. SMAD7 (53) or TIMP3 (54). This discrepancy highlights the dependency of external databases on manual curation: Regular updates are necessary to cover a wide range of known, field specific miRNA–mRNA interactions without bias. As this is barely feasible in practice, these results must be interpreted with care. In conclusion, miRetrieve visualizes field specific miRNA–mRNA interactions by using external databases such as miRTarBase. Although the full scope of known miRNA–mRNA interactions per field is not covered, this approach demonstrates the potential to provide insights into disease specific miRNA behavior.
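The PubMed ID matching described in the Figure 4 legend can be pictured as a simple join between the abstract set and a target database. The sketch below is in Python and makes several assumptions: the record layouts, the PubMed IDs and the function name `targets_in_field` are hypothetical, and matching on the pair (PubMed ID, miRNA) is one plausible reading of the legend rather than the documented behavior of miRetrieve.

```python
# Hypothetical records: (PubMed ID, miRNA) pairs extracted from disease abstracts.
abstract_mirnas = [
    (101, "miR-155"),
    (102, "miR-21"),
    (103, "miR-21"),
]

# Hypothetical database rows: (PubMed ID, miRNA, validated target gene).
database_rows = [
    (101, "miR-155", "SOCS1"),
    (102, "miR-21", "PTEN"),
    (999, "miR-21", "PDCD4"),  # this PubMed ID is not among the disease abstracts
]

def targets_in_field(abstract_mirnas, database_rows):
    """Keep database rows whose PubMed ID and miRNA also occur in the abstract set.

    Returns a dict mapping (miRNA, gene) to the list of supporting PubMed IDs,
    so each list entry corresponds to one validating study.
    """
    seen = set(abstract_mirnas)
    studies = {}
    for pmid, mirna, gene in database_rows:
        if (pmid, mirna) in seen:
            studies.setdefault((mirna, gene), []).append(pmid)
    return studies

field_targets = targets_in_field(abstract_mirnas, database_rows)
print(field_targets)
```

Because only studies present in both sources survive the join, any interaction missing from the database stays invisible, which is the incompleteness discussed above.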

Cross-field comparison of miRNAs in atherosclerosis to miRNAs in lung cancer

Most frequently mentioned miRNAs in lung cancer

As many miRNAs exist and miRNAs are believed to influence almost any disease, it is hoped that a combination of different miRNAs might provide a signature distinctive for any disease. A unique signature could not only help in diagnosing a disease but also aid in understanding its individual pathophysiology. As miRetrieve’s frequency distribution reflects the degree of investigation of miRNAs and the number of papers reporting their substantial role, it is interesting to compare which miRNAs are well investigated between diseases, such as atherosclerosis and lung cancer. In lung cancer, the most frequently mentioned miRNAs are miR-21, miR-200 and miR-34 (Figure 5).

Figure 5.

Most frequently mentioned miRNAs in lung cancer. Lung cancer abstracts were confined to research articles and miRNA names mentioned at least twice in an abstract were extracted. The ten most frequently mentioned miRNAs in lung cancer were plotted. Bar graphs indicate the number of abstracts in which a miRNA was mentioned at least twice.

However, it is notable that 7 of 10 miRNAs are shared among the most frequently mentioned miRNAs in atherosclerosis and lung cancer, namely miR-21, miR-30, miR-29, miR-146, miR-125, miR-155 and miR-145 (Figures 2, 5). As the frequency distribution of miRNAs reflects the knowledge and their substantial role in a disease, these seven miRNAs are unlikely to explain the unique pathophysiology or to serve as a specific miRNA signature in either disease. This is an example of how miRetrieve can identify similarities and dissimilarities between disease-driving miRNAs. This might aid in distinguishing miRNAs that are involved in general processes from miRNAs that contribute to a disease’s unique pathophysiology.
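The counting rule in the Figure 5 legend (a miRNA counts for an abstract only if it is mentioned at least twice) and the cross-disease overlap can be sketched in a few lines of Python. This is an illustrative assumption, not the package's code: the regular expression only recognizes names already written as "miR-<number>", and the function names and toy abstracts are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern; assumes names are already written as "miR-<number>".
MIRNA = re.compile(r"\bmiR-(\d+)", re.IGNORECASE)

def mirnas_mentioned(abstract, threshold=2):
    """Return miRNA names that occur at least `threshold` times in one abstract."""
    counts = Counter("miR-" + number for number in MIRNA.findall(abstract))
    return {name for name, n in counts.items() if n >= threshold}

def top_mirnas(abstracts, n=10, threshold=2):
    """Rank miRNAs by the number of abstracts mentioning them often enough."""
    tally = Counter()
    for abstract in abstracts:
        tally.update(mirnas_mentioned(abstract, threshold))
    return [name for name, _ in tally.most_common(n)]

# Hypothetical toy abstracts, not taken from the study.
athero_abstracts = [
    "miR-155 is induced in macrophages, and miR-155 targets SOCS1. miR-21 was unchanged.",
    "miR-21 is elevated in plaques; silencing miR-21 increased apoptosis.",
]
lung_abstracts = [
    "miR-21 is overexpressed in tumors, and miR-21 targets PTEN.",
    "miR-200 suppresses EMT; loss of miR-200 promotes metastasis.",
]

# miRNAs that rank among the most frequent in both fields.
shared = set(top_mirnas(athero_abstracts)) & set(top_mirnas(lung_abstracts))
print(shared)
```

Raising `threshold` trades coverage for precision: miRNAs named only once in passing are dropped, at the risk of missing rarely mentioned but relevant ones.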

Shared terms of miR-21 in atherosclerosis and lung cancer

It is hypothesized that one miRNA exerts different functions depending on its environment (4). This raises the question of whether one miRNA plays different roles in atherosclerosis and lung cancer. As miR-21 is the most frequently mentioned miRNA in lung cancer and already characterized in atherosclerosis above, its role in both diseases shall be contrasted. In atherosclerosis and lung cancer, miR-21 is associated with apoptosis, pi3k akt and nf κb (Figure 6A,B). Additionally, miR-21 is related to miR-155 and miR-126 in both fields (Figure 6B).

Figure 6.

Shared terms of miR-21 in atherosclerosis and lung cancer. The single words and pairs of adjacent words most frequently associated with miR-21 and shared between atherosclerosis and lung cancer were juxtaposed, while common words and terms such as a, is, significant or western were ignored. Bar graphs indicate the relative number of abstracts mentioning single words and pairs of adjacent words as compared to all abstracts mentioning miR-21 in atherosclerosis (orange) or lung cancer (blue), respectively.

In fact, miR-21 is anti-apoptotic in lung cancer (55) by targeting PTEN (56), just as in atherosclerosis (see above). Additionally, miR-21 prevents apoptosis in lung cancer by targeting KIBRA (57) and the PI3K/Akt/NF-κB signaling pathway (58). However, miR-21 also induces apoptosis in lung cancer by targeting hMSH2 (59). Regarding nf κb, miR-21 either suppresses (60) or increases (61) NF-κB signaling in both entities (see above for atherosclerosis). Concerning the connection between miR-21, miR-126 and miR-155, miR-21 and miR-155 (62) as well as miR-21 and miR-126 (63) are commonly concomitantly aberrantly expressed in lung cancer, just as in atherosclerosis (see above). This is another example that the term associations revealed by miRetrieve are accurate. Furthermore, miRetrieve uncovers general functions of miRNAs, which might help to sort miRNAs into larger functional classes. Additionally, miRetrieve reveals relevant miRNA–miRNA co-expressions, which might hint at regulatory commonalities independent of environment.
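The comparison of shared single words and adjacent word pairs described in the Figure 6 legend can be illustrated with a short Python sketch. The stop-word list, function names and toy abstracts are hypothetical, and the sketch simplifies the procedure: stop words are removed before pairing, so two words separated only by a stop word are treated as adjacent.

```python
import re

# Small illustrative stop-word list; a real analysis would use a longer one.
STOP_WORDS = {"a", "is", "in", "and", "the", "of", "by", "significant", "western"}

def terms(abstract):
    """Return the set of single words and adjacent word pairs in one abstract."""
    words = [w for w in re.findall(r"\w+", abstract.lower()) if w not in STOP_WORDS]
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(pairs)

def term_frequencies(abstracts):
    """Map each term to the share of abstracts (0 to 1) that contain it."""
    counts = {}
    for abstract in abstracts:
        for term in terms(abstract):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / len(abstracts) for term, count in counts.items()}

def shared_terms(abstracts_a, abstracts_b):
    """Return terms present in both fields with their share in each field."""
    freq_a = term_frequencies(abstracts_a)
    freq_b = term_frequencies(abstracts_b)
    return {term: (freq_a[term], freq_b[term]) for term in freq_a.keys() & freq_b.keys()}

# Hypothetical toy abstracts about one miRNA in two fields.
athero = ["apoptosis is reduced by pi3k akt signaling", "nf kb activation and apoptosis"]
lung = ["pi3k akt signaling promotes growth", "apoptosis resistance in tumors"]

shared = shared_terms(athero, lung)
print(sorted(shared.items()))
```

Reporting shares rather than raw counts keeps the two fields comparable even when one field has far more abstracts than the other.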

miRNAs as biomarker in atherosclerosis and lung cancer

miRNAs are highly stable in body fluids and thus hold potential to serve as biomarkers (64). The more specific a biomarker, the better it differentiates between diseases. Thus, potential biomarker miRNAs in atherosclerosis and lung cancer shall be contrasted and their specificity estimated. Considering the 8 most frequent biomarker miRNAs in atherosclerosis and lung cancer, miR-155 is only listed in atherosclerosis, whereas miR-200 is only mentioned in lung cancer. Therefore, miR-155 and miR-200 might be more indicative biomarkers of atherosclerosis and lung cancer, respectively. Furthermore, at least four miRNAs (miR-126, miR-21, miR-30, miR-125) hold potential to serve as biomarkers in both entities (Figure 7).

Figure 7.

Potential biomarker miRNAs in atherosclerosis and lung cancer. Abstracts containing at least five marker words for biomarker (bio-marker, biological marker, biomarker, body fluid, bodyfluid, circulating, diagnostic, exosomal, exosomes, extracellular vesicles, plasma, serum, urinary, urine) in atherosclerosis (A) and lung cancer (B) were identified, and the eight most frequently reported miRNAs in these abstracts were visualized. Bar graphs indicate the number of abstracts in atherosclerosis (A) or lung cancer (B) mentioning a miRNA as a potential biomarker.

Manual validation confirms that miR-155 is a promising biomarker for atherosclerosis (65), whereas miR-200 is a well-confirmed biomarker for lung cancer (66). Of note, miRetrieve lists 16 and 27 abstracts of miR-155 and miR-200 as biomarkers in atherosclerosis and lung cancer, respectively, whereas only 9 and 19 abstracts confirm these miRNAs as biomarkers (Supplementary File S4), which corresponds to a precision of about 56% (9/16) and 70% (19/27), respectively. Regarding miR-126, miR-21, miR-30 and miR-125, all four miRNAs are indeed established biomarkers for either atherosclerosis (e.g. miR-126: (67), miR-21: (41), miR-30: (68), miR-125: (69)) or lung cancer (e.g. miR-126: (70), miR-21: (71), miR-30: (72), miR-125: (73)). This demonstrates that miRetrieve identifies established biomarker miRNAs in a disease. Even though the abstracts listed by miRetrieve correspond to confirmed biomarker miRNAs in only about 56–70% of cases, this cross-disease comparison allows a first estimate of a miRNA’s specificity or commonality as a biomarker, which can guide further studies.
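The two steps behind this comparison, filtering abstracts by marker words and checking the listed abstracts against manual validation, can be sketched in Python as follows. The function names are hypothetical, only a subset of the marker words from the Figure 7 legend is used, and counting distinct marker words against a caller-supplied minimum is an assumption about how the threshold is applied. The validation counts (9 of 16 and 19 of 27) are the ones reported in the text.

```python
# Subset of the marker words named in the Figure 7 legend.
MARKER_WORDS = ("biomarker", "circulating", "diagnostic", "plasma", "serum")

def mentions_marker(abstract, minimum, markers=MARKER_WORDS):
    """True if the abstract contains at least `minimum` distinct marker words."""
    text = abstract.lower()
    return sum(1 for word in markers if word in text) >= minimum

def precision(confirmed, listed):
    """Share of listed abstracts that manual validation confirmed."""
    return confirmed / listed

# Hypothetical abstract used to show the filter.
example = "Circulating miR-21 in serum is a diagnostic biomarker."
print(mentions_marker(example, minimum=3))

# Counts reported in the text for miR-155 (atherosclerosis) and miR-200 (lung cancer).
precision_mir155 = precision(9, 16)
precision_mir200 = precision(19, 27)
print(round(100 * precision_mir155), round(100 * precision_mir200))
```

The ratio computed here is precision (confirmed abstracts among those listed); estimating recall would additionally require knowing how many true biomarker abstracts were not listed at all.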

DISCUSSION

Here, we introduced miRetrieve, a tool to summarize information on miRNAs from thousands of abstracts within a short amount of time. By demonstrating a typical miRetrieve workflow, we showed how miRetrieve effectively condenses the wealth of information on miRNAs, such as their degree of investigation, function in a disease, their larger functional class, or their suitability as a biomarker. Additionally, the accuracy of these results was confirmed by manual validation. This indicates that miRetrieve can not only help to understand the unique pathophysiology of a disease, but also help to improve our understanding of miRNAs in general.

One core function of miRetrieve is to extract differently spelled miRNA names and unify them into one format. Even though a wide variety of miRNA spellings is recognized, not every format can be accounted for, which might cause miRetrieve to miss miRNAs. Especially problematic is the extraction of miRNA names from chained patterns, such as miR-146–155, as these might conflict with the standard nomenclature (e.g. miR-24–1). This is aggravated by the fact that the possibilities of enumerating miRNAs with slashes, dashes and commas are almost endless. However, manual validation showed that miRetrieve extracted >99% of all miRNA names correctly.

Another potential drawback is the extraction of miRNA names from abstracts without proven biological function in the study at hand. Here, one approach is to extract miRNAs that are referenced more than once in an abstract. While this approach bears the risk of ignoring relevant miRNAs that are seldom mentioned, it improves precision, as demonstrated above.

Once extracted, miRetrieve characterizes miRNAs by displaying the terms they are frequently associated with. Whereas these miRNA-term associations depend on a sufficient number of provided abstracts, they are also suggestive and potentially incomplete. Yet, we demonstrated that miRetrieve’s term-associations are not only meaningful but also accurate and reflect a miRNA’s role independent of disease. Additionally, this can be extended to compare a miRNA’s role across different diseases, which may aid in sorting miRNAs into larger functional classes. Lastly, we showed that these miRNA-term associations point at concurrently differently expressed miRNAs, which might help to uncover co-regulated miRNA-miRNA networks and stable miRNA signatures. Hence, miRetrieve’s term-associations provide a reliable foundation to generate new hypotheses and guide further studies.

Next to associating miRNAs with terms, miRetrieve also dissects which miRNAs are frequently related to defined terms, which can be used, among other things, to identify potential biomarker miRNAs. As most biomarker studies investigate multiple miRNAs at once, they also mention miRNAs that fail to be suitable biomarkers, which explains miRetrieve’s diminished accuracy when identifying biomarker miRNAs from abstracts. Furthermore, a miRNA’s suitability as a biomarker depends on various factors, such as individual expression levels, the direction of expression alteration, or the combination of several miRNAs into a signature. While these factors are not covered by miRetrieve, miRetrieve allows users to quickly skim for candidate miRNAs that may serve as biomarkers in a disease. This helps in estimating a miRNA’s specificity as a biomarker, and can subsequently direct further studies.

All miRNA extractions, term- and class-associations are generated directly from text without lexico-syntactic rules or stochastic models. This facilitates miRetrieve’s interpretability and allows miRetrieve to extract new miRNAs or recognize a differentiated scope of miRNA-associations independent of defined named entities, while the option to combine all analysis steps with rule-based tools such as miRSel (6), miRTex (7) and emiRIT (8) still remains.

Apart from characterizing miRNAs using generated associations, miRetrieve visualizes validated miRNA–mRNA interactions confined to the disease at hand by incorporating databases such as miRTarBase (5). However, up until now, these databases depend on manual curation and require extensive updates, which is why they do not cover the full scope of known miRNA–mRNA interactions. Nonetheless, visualizing miRNA–mRNA interactions for specific diseases may direct the investigation of other miRNA–mRNA interactions in similar fields more effectively than predicted miRNA–mRNA interactions that do not take environmental factors into consideration. Additionally, disease-specific miRNA–mRNA interactions may facilitate our understanding of what drives miRNA-target selectivity, and consequently improve miRNA-target predictions. Lastly, hope remains that this approach is reinforced further once validated miRNA–mRNA interactions are automatically extracted from text in the future.
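The unification of differently spelled miRNA names mentioned in this Discussion can be illustrated with a small regular-expression sketch in Python. This is not miRetrieve's implementation: the handful of spelling variants covered, the target format "miR-<number><letter>" and the function name `unify` are assumptions chosen for illustration.

```python
import re

# Assumed spelling variants; real abstracts contain many more.
VARIANT = re.compile(r"\b(?:microRNA|miRNA|miR)[\s-]?(\d+[a-z]?)", re.IGNORECASE)

def unify(text):
    """Rewrite recognized miRNA spellings as 'miR-<number><letter>'."""
    return VARIANT.sub(lambda match: "miR-" + match.group(1).lower(), text)

print(unify("microRNA-21 and miR155 and miRNA 146a"))
```

A pattern this simple leaves chained or enumerated names such as miR-146–155 untouched beyond the first number, which is exactly the kind of ambiguity with the standard nomenclature described above.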

CONCLUSION

In conclusion, miRetrieve provides accurate insights on miRNAs from huge amounts of text within a short amount of time. These insights aid in characterizing miRNAs in one or many fields, which may facilitate and accelerate the generation of new hypotheses on how miRNAs modify diseases, ultimately improving our understanding of miRNAs and diseases alike.

DATA AVAILABILITY

miRetrieve is an R package (9) released under a GPL-3 license and freely available from GitHub (https://github.com/JulFriedrich/miRetrieve) and CRAN. The most important miRetrieve functions are available as a browser based Shiny application (10) under (https://miretrieve.shinyapps.io/miRetrieve/). The raw abstracts for the use case are released as a GitHub tag at https://github.com/JulFriedrich/miRetrieve/releases/tag/v1.3.2, while Supplementary Data are available at https://github.com/JulFriedrich/miRetrieve-paper.

68 in total

Review 1. MicroRNAs: genomics, biogenesis, mechanism, and function.

Authors: David P Bartel
Journal: Cell Date: 2004-01-23 Impact factor: 41.582

2. MiR-126 on mice with coronary artery disease by targeting S1PR2.

Authors: J-L Fan; L Zhang; X-H Bo
Journal: Eur Rev Med Pharmacol Sci Date: 2020-01 Impact factor: 3.507

3. A diagnostic assay based on microRNA expression accurately identifies malignant pleural mesothelioma.

Authors: Hila Benjamin; Danit Lebanony; Shai Rosenwald; Lahav Cohen; Hadas Gibori; Naama Barabash; Karin Ashkenazi; Eran Goren; Eti Meiri; Sara Morgenstern; Marina Perelman; Iris Barshack; Yaron Goren; Tina Bocker Edmonston; Ayelet Chajut; Ranit Aharonov; Zvi Bentwich; Nitzan Rosenfeld; Dalia Cohen
Journal: J Mol Diagn Date: 2010-09-23 Impact factor: 5.568

4. Genistein Protects Against Ox-LDL-Induced Inflammation Through MicroRNA-155/SOCS1-Mediated Repression of NF-ĸB Signaling Pathway in HUVECs.

Authors: Huaping Zhang; Zhenxiang Zhao; Xuefen Pang; Jian Yang; Haixia Yu; Yinhong Zhang; Hui Zhou; Jiahui Zhao
Journal: Inflammation Date: 2017-08 Impact factor: 4.092

5. The relationships among monocyte subsets, miRNAs and inflammatory cytokines in patients with acute myocardial infarction.

Authors: Ewelina Kazimierczyk; Andrzej Eljaszewicz; Paula Zembko; Ewa Tarasiuk; Malgorzata Rusak; Agnieszka Kulczynska-Przybik; Marta Lukaszewicz-Zajac; Karol Kaminski; Barbara Mroczko; Maciej Szmitkowski; Milena Dabrowska; Bozena Sobkowicz; Marcin Moniuszko; Agnieszka Tycinska
Journal: Pharmacol Rep Date: 2018-09-12 Impact factor: 3.024

6. miRSel: automated extraction of associations between microRNAs and genes from the biomedical literature.

Authors: Haroon Naeem; Robert Küffner; Gergely Csaba; Ralf Zimmer
Journal: BMC Bioinformatics Date: 2010-03-16 Impact factor: 3.169

7. Upregulated microRNA miR-21 promotes the progression of lung adenocarcinoma through inhibition of KIBRA and the Hippo signaling pathway.

Authors: Yunxia An; Qianqian Zhang; Xiaoliang Li; Zheng Wang; Ying Li; Xueyi Tang
Journal: Biomed Pharmacother Date: 2018-10-20 Impact factor: 6.529

8. Inhibition of MicroRNA-21-5p Promotes the Radiation Sensitivity of Non-Small Cell Lung Cancer Through HMSH2.

Authors: Yu Song; Yun Zuo; Xiao-Lan Qian; Zhi-Peng Chen; Shao-Kai Wang; Lei Song; Li-Ping Peng
Journal: Cell Physiol Biochem Date: 2017-10-09

Review 9. Insights into the Diagnostic Potential of Extracellular Vesicles and Their miRNA Signature from Liquid Biopsy as Early Biomarkers of Diabetic Micro/Macrovascular Complications.

Authors: Valeria La Marca; Alessandra Fierabracci
Journal: Int J Mol Sci Date: 2017-09-14 Impact factor: 5.923

10. Clinical Significance of Circulatory miRNA-21 as an Efficient Non-Invasive Biomarker for the Screening of Lung Cancer Patients

Authors: F M Abu-Duhier; Jamsheed Javid; M A Sughayer; Rashid Mir; Tariq Albalawi; M Shahid Alauddin
Journal: Asian Pac J Cancer Prev Date: 2018-09-26
