HIT 2.0: an enhanced platform for Herbal Ingredients' Targets.

Deyu Yan¹, Genhui Zheng¹, Caicui Wang¹, Zikun Chen¹, Tiantian Mao¹, Jian Gao^2,3, Yu Yan¹, Xiangyi Chen¹, Xuejie Ji¹, Jinyu Yu¹, Saifeng Mo¹, Haonan Wen¹, Wenhao Han¹, Mengdi Zhou¹, Yuan Wang¹, Jun Wang¹, Kailin Tang¹, Zhiwei Cao⁴.

Abstract

Literature-described targets of herbal ingredients have been explored to facilitate the mechanistic study of herbs as well as new drug discovery. Although several databases provide similar information, the majority of them are limited to literature published before 2010 and urgently need updating. HIT 2.0 was constructed as the latest curated dataset focusing on Herbal Ingredients' Targets, covering PubMed literature from 2000 to 2020. Currently, HIT 2.0 hosts 10 031 compound-target activity pairs with quality indicators between 2208 targets and 1237 ingredients from more than 1250 reputable herbs. The molecular targets cover genes/proteins that are directly or indirectly activated or inhibited, protein binders, and enzyme substrates or products. Also included are genes regulated under treatment with an individual ingredient. Crosslinks were made to TTD, DrugBank, KEGG, PDB, UniProt, Pfam, NCBI, TCM-ID and other databases. More importantly, HIT enables automatic Target-mining and My-target curation from daily released PubMed literature. Thus, users can retrieve and download the latest abstracts containing potential targets for compounds of interest, even for those not yet covered in HIT. Further, users can log into the 'My-target' system to curate personal target profiles on-line based on the retrieved abstracts. HIT is accessible at http://hit2.badd-cao.net.

Year: 2022 PMID: 34986599 PMCID: PMC8728248 DOI: 10.1093/nar/gkab1011

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

As a rich source of drug candidates, herbal active ingredients play a critical role in the development of new drugs. From 1981 to 2019, 33.6% of the drugs approved by the FDA were reported to be derived from natural products or their derivatives (1). To better understand the interactions between herbal compounds and molecular targets, the first herbal-ingredient-target database, HIT, was established in 2010 via manual curation of 1301 literature-described targets of herbal compounds from 3250 publications, with convenient links to the Therapeutic Target Database (TTD), DrugBank and other resources (2,3). The target information in HIT has been extensively exploited to study the mechanisms of natural compounds and to make discoveries from herbal medicine. For instance, based on HIT, Luo et al. revealed the therapeutic mechanism of cryptotanshinone in the treatment of liver cancer (4), and Wang et al. identified potential targets for asthma according to the clinical efficacy of TCM formulations (5). During the past ten years (2011–2021), studies on natural ingredients and their targets increased explosively, and a number of databases covering herbal compound-target interactions were constructed. Notably, NPASS (6), an exemplary database for natural products, provides experimentally determined quantitative activity records for natural products, including nearly 2000 herbal ingredient-target pairs for about 700 herbal ingredients. Other herbal databases also include important information on herbal ingredient-target pairs (7–11), but most of them, such as HERB, TCMID, TCMSP and SymMap (7–10), downloaded and incorporated early HIT data and appended predicted targets. These databases have greatly enriched the target diversity for herbal ingredients. However, carefully curated literature-described targets of herbal ingredients, though valuable, remain limited and need to be updated urgently and regularly.
Fresh evidence of ingredient-target interactions is published every day, yet traditional databases have difficulty keeping track of the latest results. Users often have to search manually for up-to-date publications not yet covered by current databases and extract the targets to link them with database entries, which is highly time-consuming. Thus, a new platform is needed that provides not only regularly updated targets but also real-time checking for potential targets in daily released papers. Here, we introduce HIT 2.0 as such a platform. In this version, advanced text-mining algorithms and rigorous curation were comprehensively employed. In addition to recalibrating the HIT 1.0 data from the literature between 2000 and 2010, our curation team made a complete refresh by adding the literature between 2010 and 2020, resulting in almost twice the data volume as before, plus new target confidence indicators. More importantly, the text-mining system for compound-target suggestion is now open to users, so that researchers can retrieve the most relevant literature via ‘Target-mining’ for any natural compound. Finally, HIT 2.0 provides an on-line ‘My-target’ function for personal curation and downloading.

MATERIALS AND METHODS

Data source

As in HIT 1.0, the ingredient information was sourced from the widely used TCM-ID database, here in its 2020 update (12), which covers 2751 herbs and 7375 herb ingredients. Compound aliases were derived from the Chemical Abstracts Service (CAS). The same set of 59 keywords as in HIT 1.0 was used to describe interactions between compounds and molecular targets (Supplementary Table S1). Some keywords are nouns describing the interaction (Type A), while the others (Type B) are phrases describing the specific effect, such as ‘inhibit the activity of’ proteins.

Text mining

NLP-based text mining was used to identify the targets of herbal ingredients in the literature. The workflow is illustrated in Figure 1:

Figure 1.

Workflow of HIT 2.0. (1, 2) Search PubMed using different names of herbal ingredients. (3) Mine PubMed abstracts to identify gene/protein entities. (4) Detect whether ‘compound’, ‘gene’ and ‘keyword’ are in a directional dependency tree path. (5, 6) Manually check and complete the information.

Step 1: Retrieve abstracts in PubMed that contain the name or an alias of a herb ingredient.

Step 2: Annotate gene/protein entities in the abstract using PubTator Central (13), an automated annotation platform for biomedical entities. These genes/proteins may be targets of ingredients, and only abstracts containing gene/protein entities are retained.

Step 3: Parse the full abstract into sentences and retain those sentences that satisfy either of the two rules below.
Rule 1: ‘Compound name’ AND ‘any word in type A’ AND ‘Gene’. For instance, the sentence ‘EGCG is a novel Hsp90 inhibitor’.
Rule 2: ‘Compound name’ AND ‘any word in type B interaction’ AND ‘any word in type B effect’ AND ‘Gene’. For instance, the sentence ‘Procyanidin B2 directly inhibited MT1-MMP activity.’

Step 4: Use the Stanford Parser engine (14) to extract the syntactic structure and grammatical relations of each sentence (15). Only sentences in which ‘compound’, ‘gene’ and ‘keyword’ are detected in a directional dependency tree path are retained.
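The sentence screening in Step 3 can be sketched as follows. This is a minimal illustration only: the keyword sets are small hypothetical stand-ins for the 59 keywords of Supplementary Table S1, the function names are invented for this example, and simple token matching is assumed in place of whatever matching HIT 2.0 actually uses.

```python
import re

# Hypothetical stand-ins; the real keyword list is in Supplementary Table S1.
TYPE_A = {"inhibitor", "activator", "agonist", "antagonist"}
TYPE_B_INTERACTION = {"inhibit", "inhibited", "activate", "activated"}
TYPE_B_EFFECT = {"activity", "expression", "phosphorylation"}


def tokens(sentence):
    """Lowercase word tokens; hyphenated names such as MT1-MMP stay whole."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower()))


def matches_rules(sentence, compound, gene):
    """Return 'rule1', 'rule2' or None for a single sentence."""
    text = sentence.lower()
    # Both rules require the compound name and the gene in the sentence.
    if compound.lower() not in text or gene.lower() not in text:
        return None
    words = tokens(sentence)
    if words & TYPE_A:  # Rule 1: any Type A keyword
        return "rule1"
    if words & TYPE_B_INTERACTION and words & TYPE_B_EFFECT:  # Rule 2
        return "rule2"
    return None
```

Applied to the two example sentences above, `matches_rules("EGCG is a novel Hsp90 inhibitor", "EGCG", "Hsp90")` selects Rule 1, and the Procyanidin B2 sentence selects Rule 2.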

Manual curation

Finally, a curation team of 11 Ph.D. candidates reviewed 17 000 sentences, and each item was checked by at least two curators. Entries on which the two curators agreed were retained, while those with disagreement were reviewed by a third curator, with the final result decided by majority vote.
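The double-check and majority-vote procedure can be sketched as below. The function name and the use of plain strings as curator labels are assumptions made for illustration; how HIT 2.0 handles a three-way disagreement is not stated, so this sketch simply leaves such a case unresolved.

```python
from collections import Counter


def adjudicate(first, second, third=None):
    """Return the accepted label, or None if the item is still unresolved."""
    if first == second:
        return first  # two curators agree: entry is retained
    if third is None:
        return None  # disagreement: a third curator is needed
    label, count = Counter([first, second, third]).most_common(1)[0]
    # Accept only a real majority (at least two of three votes).
    return label if count >= 2 else None
```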

RESULTS

The construction of HIT 2.0 was based on literature mining and manual curation. For literature mining, PubMed abstracts containing the name or an alias of a herbal ingredient were first retrieved. Each abstract was then parsed into sentences, and only sentences containing herbal ingredients, genes and keywords were kept for further curation. For convenience, an online platform was set up in which each curator has their own account. At any time after logging in, curators can review the PubMed tasks already completed and those still to be completed. The whole platform has now been opened as the ‘Target-mining’ and ‘My-target’ systems, enabling users to identify the most relevant literature and keep close track of the latest targets for their compounds of interest.

Data updating

In this version, more advanced natural language processing (NLP) text-mining algorithms and rigorous manual curation were applied, using the same set of keywords as HIT 1.0 (16). NLP has been widely applied to biomedical text mining. Dependency syntactic analysis parses sentence structure to highlight the relationships between medical entities (15). The NLP algorithm is used in HIT 2.0 to determine whether compounds may interact with genes/proteins. Initially, 7100 abstracts were obtained from PubMed by the text-mining system. After on-line curation, a total of 10 031 compound-target pairs were produced, involving 2208 molecular targets and 1237 compounds from 1250 herbs. Interestingly, 56 miRNA genes have been included in our target list, mainly as up- or down-regulated genes. The types of compound-target interactions cover 10 categories: indirectly inhibit/activate, up/down-regulated gene, directly inhibit/activate, binders, enzyme substrates, enzyme products and other. Quality indicators of compound-target interactions have been developed, covering the nature of the interaction, the number of publications supporting the same pair, and the citations of each publication. Based on these indicators, users can choose their preferred pairs for further analysis. Each compound-target pair can be viewed with a key description parsed from the source publication. Compounds are linked to the PubChem (17) and ChEMBL (18) databases. For targets, crosslinks have been made to TTD (2), DrugBank (3), KEGG (19), PDB (20), UniProt (21), Pfam (22), NCBI, TCM-ID (12) and other databases for more detailed information. HIT 2.0 has doubled the data volume of HIT 1.0 and forms a useful complement to similar herb databases containing literature-described targets. Currently, it is also the largest database in terms of curated ingredient-target pairs for herbs, as Table 1 shows.
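The dependency-path check can be illustrated on a toy example. The sketch below does not call the Stanford Parser; it assumes a hand-written dependency tree (a hypothetical `head_of` mapping from each token to its head) for the sentence ‘Procyanidin B2 directly inhibited MT1-MMP activity’, and it interprets the criterion as "the keyword lies on the tree path between compound and gene", which is an assumption, since the exact test used in HIT 2.0 is not specified.

```python
def path_to_root(node, head_of):
    """Tokens from `node` up to the root of the dependency tree."""
    path = [node]
    while node in head_of:
        node = head_of[node]
        path.append(node)
    return path


def tree_path(a, b, head_of):
    """Tokens on the tree path from `a` to `b` via their lowest common ancestor."""
    up_a = path_to_root(a, head_of)
    up_b = path_to_root(b, head_of)
    ancestors_b = set(up_b)
    for i, node in enumerate(up_a):
        if node in ancestors_b:
            j = up_b.index(node)
            return up_a[: i + 1] + up_b[:j][::-1]
    return []  # the two tokens are not in the same tree


def keyword_on_path(compound, gene, keyword, head_of):
    return keyword in tree_path(compound, gene, head_of)


# Hand-written parse (assumed, not parser output): dependent -> head.
head_of = {
    "Procyanidin_B2": "inhibited",
    "directly": "inhibited",
    "activity": "inhibited",
    "MT1-MMP": "activity",
}
```

With this tree, `keyword_on_path("Procyanidin_B2", "MT1-MMP", "inhibited", head_of)` holds because the path runs through the verb, whereas the adverb ‘directly’ hangs off the path.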

Table 1.

Overview of the literature-described targets in peer databases

Database	Published year	Literature-described targets	Herbal ingredients	Herbal ingredient-target activity pairs	Sourcing literatures
HIT 2.0		2208	1237	10 031	7100 PubMed abstracts
HIT	2011	1301	586	5208	3250 PubMed abstracts
HERB	2020	1241	370	4815	1966 PubMed abstracts
TCMID	2013	680	/	/	4500 Chinese Literatures
NPASS	2017	464	719	1936	1288 PubMed abstracts
TCMSP	2014	3311 predicted by SysDT	29 384	84 260 predicted by SysDT	/

HIT 2.0 allows keyword search and compound similarity search. The search interface and results pages are illustrated in Figure 2. Keyword search is available for herb information [Chinese pinyin, Chinese characters, Latin name and English name], ingredient information [different names, CAS number and CID number] and target information [gene/protein name, gene symbol, UniProt ID]. Auto-completion and fuzzy search are supported for keyword search. In addition, similarity search can be performed via compound structures, with a Tanimoto coefficient above 70% as the cut-off. Both a SMILES string and a structure drawn with the built-in software Ketcher (https://lifescience.opensource.epam.com/ketcher/) can be used as input.
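The similarity cut-off can be illustrated with a small sketch. The fingerprint type used by HIT 2.0 is not stated, so fingerprints are represented here simply as sets of "on" bit positions; the function names and the toy library are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treated as dissimilar
    return len(a & b) / len(union)


def similar_compounds(query_bits, library, cutoff=0.7):
    """Library compounds whose similarity to the query exceeds the cut-off."""
    hits = []
    for name, bits in library.items():
        score = tanimoto(query_bits, bits)
        if score > cutoff:
            hits.append((name, score))
    # Most similar compounds first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: 7 shared bits out of 9 distinct bits gives 7/9, above 0.7.
query = {1, 2, 3, 4, 5, 6, 7, 8}
library = {"analog_A": {1, 2, 3, 4, 5, 6, 7, 9}, "unrelated_B": {1, 20, 21}}
```

In practice the bit sets would come from a cheminformatics toolkit applied to the SMILES input; only the thresholding logic is shown here.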

Figure 2.

Search and results pages in HIT 2.0. (A) Database structure and data statistics. (B) Herbs can be searched via keywords such as Chinese Pinyin, Chinese characters and Latin names. (C) Herbal ingredients can be searched via structure similarity or keywords of name, CID and CAS number. (D) Targets can be searched via keywords of gene/protein name, gene symbol and UniProt ID. (E) Detailed information of the targets. (F) Additional targets of the compound. (G) ‘Literature evidence’ provides the key descriptions parsed from sourcing literatures.

Target-mining and My-target curation

As new PubMed literature is released every day, researchers constantly need to check the latest evidence published after 2020 for their compounds of interest. So far, no software has been available to mine the literature for potential target-ingredient interactions. For the convenience of curation, two options are provided to users. A built-in Target-mining function enables users to retrieve related PubMed abstracts for any compound, from 2010 up to the latest daily update (Figure 3A and B). We recognize that the abstracts identified by text mining may contain false positives. Thus, HIT 2.0 provides an on-line My-target curation function to further check the detailed sentences in the source abstracts (Figure 3C). The sentences containing entities of interest are highlighted in different colors, so that key items can be easily spotted together with the interaction types. A link to the full abstract in PubMed is also provided.

Figure 3.

Target-mining and My-target curation system. (A) The interface of the Target-mining function. Compound name, MeSH ID and PubChem ID can be submitted to retrieve potential targets. (B) PubMed abstract retrieved by Target-mining. (C) The interface of the My-target curation system.

In brief, ‘Target-mining’ was designed to efficiently and precisely retrieve the most relevant abstracts based on our empirically derived rules combined with advanced text-mining algorithms. With it, users can retrieve and download the latest literature for local checking. Alternatively, users can choose ‘My-target curation’ to curate on-line. Each user has their own account. At any time after logging in, users can review the tasks completed and continue with the remaining ones.
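For readers who prefer local checking, the kind of date-bounded PubMed query that underlies such retrieval can be sketched with the public NCBI E-utilities ESearch endpoint. This is not the HIT 2.0 implementation: the function name, the choice of the Title/Abstract field and the default limits are assumptions, and the sketch only builds the request URL without sending it.

```python
from datetime import date
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(compound_names, start_year=2010, end=None, retmax=100):
    """Build an ESearch URL for abstracts mentioning any of the given names."""
    end = end or date.today()
    # Search every name/alias of the compound in titles and abstracts.
    term = " OR ".join(f'"{name}"[Title/Abstract]' for name in compound_names)
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter by publication date
        "mindate": f"{start_year}/01/01",
        "maxdate": end.strftime("%Y/%m/%d"),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"
```

For example, `build_esearch_url(["EGCG", "epigallocatechin gallate"])` yields a URL whose response lists the matching PubMed IDs; the abstracts themselves would then be fetched and screened sentence by sentence.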

DISCUSSION

Complementary to other drug and drug-target databases, herbal databases containing literature-described targets have served as the primary source for mechanistic studies of natural products by providing rich information on medicinal herbs, active compounds and molecular targets under different experimental conditions. With the steady accumulation of literature describing new evidence over the past decade, however, the previous versions need timely updating to meet pressing needs. In this paper, HIT 2.0 was therefore constructed to maintain regular updating by adding another ten years of literature, making it the latest and largest resource of curated ingredient-target pairs for herbs. Meanwhile, we set up the Target-mining and My-target curation systems based on natural language processing technology, allowing researchers to keep track of the latest evidence and curate personal targets of compounds of interest at a convenient time. The launch of HIT 2.0 will be an important addition that bridges herbal ingredients and FDA-approved drugs via molecular targets, and it may facilitate the discovery of new druggable molecules as well as the identification of potential therapeutic targets.

One key technology applied in HIT 2.0 is automatic text mining. Currently, several annotation tools are available to recognize biomedical entities, such as PubTator Central (13), HunFlair (23) and ScispaCy (24). Among them, PubTator Central was developed by NCBI to annotate PubMed abstracts automatically and allows users to download annotations in bulk via PubMed ID lists. In addition, the annotated genes have direct links to the Gene database, which allows HIT 2.0 to access more details of the targets without local installation. Weighing convenience, simplicity and complexity, PubTator Central was chosen for HIT 2.0.

Meanwhile, the last decade has witnessed extensive use of molecular targets in the functional explanation of herbal compounds. Although literature-evidenced targets may be more reliable than computationally predicted ones, we are aware that the results collected in HIT are mostly from in vitro rather than in vivo experiments. In fact, these natural compounds are likely metabolized into different forms, which may change their targeting profiles (25). Furthermore, the biological activity of an ingredient is often related not only to its group of molecular targets but also to the network interactions among those targets. In this sense, HIT 2.0 may serve as a valuable starting point for deriving the collective functions of herbal components, particularly for herbs and herbal formulas. In this direction, HIT will continue to enrich its data and subsequent analyses to maintain a high-quality resource for domain research.

1. HERB: a high-throughput experiment- and reference-guided database of traditional Chinese medicine.

Authors: ShuangSang Fang; Lei Dong; Liu Liu; JinCheng Guo; LianHe Zhao; JiaYuan Zhang; DeChao Bu; XinKui Liu; PeiPei Huo; WanChen Cao; QiongYe Dong; JiaRui Wu; Xiaoxi Zeng; Yang Wu; Yi Zhao
Journal: Nucleic Acids Res Date: 2020-12-02 Impact factor: 16.971

2. Database of traditional Chinese medicine and its application to studies of mechanism and to prescription validation.

Authors: X Chen; H Zhou; Y B Liu; J F Wang; H Li; C Y Ung; L Y Han; Z W Cao; Y Z Chen
Journal: Br J Pharmacol Date: 2006-11-06 Impact factor: 8.739

3. HIT: linking herbal active ingredients to targets.

Authors: Hao Ye; Li Ye; Hong Kang; Duanfeng Zhang; Lin Tao; Kailin Tang; Xueping Liu; Ruixin Zhu; Qi Liu; Y Z Chen; Yixue Li; Zhiwei Cao
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

4. Immune-centric network of cytokines and cells in disease context identified by computational mining of PubMed.

Authors: Ksenya Kveler; Elina Starosvetsky; Amit Ziv-Kenet; Yuval Kalugny; Yuri Gorelik; Gali Shalev-Malul; Netta Aizenbud-Reshef; Tania Dubovik; Mayan Briller; John Campbell; Jan C Rieckmann; Nuaman Asbeh; Doron Rimar; Felix Meissner; Jeff Wiser; Shai S Shen-Orr
Journal: Nat Biotechnol Date: 2018-06-18 Impact factor: 54.908

5. ChEMBL: towards direct deposition of bioassay data.

Authors: David Mendez; Anna Gaulton; A Patrícia Bento; Jon Chambers; Marleen De Veij; Eloy Félix; María Paula Magariños; Juan F Mosquera; Prudence Mutowo; Michal Nowotka; María Gordillo-Marañón; Fiona Hunter; Laura Junco; Grace Mugumbate; Milagros Rodriguez-Lopez; Francis Atkinson; Nicolas Bosc; Chris J Radoux; Aldo Segura-Cabrera; Anne Hersey; Andrew R Leach
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

6. Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics.

Authors: Yunxia Wang; Song Zhang; Fengcheng Li; Ying Zhou; Ying Zhang; Zhengwen Wang; Runyuan Zhang; Jiang Zhu; Yuxiang Ren; Ying Tan; Chu Qin; Yinghong Li; Xiaoxu Li; Yuzong Chen; Feng Zhu
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

7. NPASS: natural product activity and species source database for natural product research, discovery and tool development.

Authors: Xian Zeng; Peng Zhang; Weidong He; Chu Qin; Shangying Chen; Lin Tao; Yali Wang; Ying Tan; Dan Gao; Bohua Wang; Zhe Chen; Weiping Chen; Yu Yang Jiang; Yu Zong Chen
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

8. KEGG: integrating viruses and cellular organisms.

Authors: Minoru Kanehisa; Miho Furumichi; Yoko Sato; Mari Ishiguro-Watanabe; Mao Tanabe
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

9. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences.

Authors: Stephen K Burley; Charmi Bhikadiya; Chunxiao Bi; Sebastian Bittrich; Li Chen; Gregg V Crichlow; Cole H Christie; Kenneth Dalenberg; Luigi Di Costanzo; Jose M Duarte; Shuchismita Dutta; Zukang Feng; Sai Ganesan; David S Goodsell; Sutapa Ghosh; Rachel Kramer Green; Vladimir Guranović; Dmytro Guzenko; Brian P Hudson; Catherine L Lawson; Yuhe Liang; Robert Lowe; Harry Namkoong; Ezra Peisach; Irina Persikova; Chris Randle; Alexander Rose; Yana Rose; Andrej Sali; Joan Segura; Monica Sekharan; Chenghua Shao; Yi-Ping Tao; Maria Voigt; John D Westbrook; Jasmine Y Young; Christine Zardecki; Marina Zhuravleva
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
