Consistency of systematic chemical identifiers within and between small-molecule databases.

Saber A Akhondi, Jan A Kors, Sorel Muresan.

Abstract

BACKGROUND: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these are interchangeable for many cheminformatics purposes, there have been no studies on the inconsistency of these structure identifiers that arises from different approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.
RESULTS: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).
CONCLUSIONS: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.

Year:  2012        PMID: 23237381      PMCID: PMC3539895          DOI: 10.1186/1758-2946-4-35

Source DB:  PubMed          Journal:  J Cheminform        ISSN: 1758-2946            Impact factor:   5.514


Background

The past decade has seen a major increase in the availability of public and commercial chemical databases [1]. Resources such as PubChem (released in 2004) [2] and ChEMBL (released in 2009) [3], with their corresponding web services, have gained the trust of many researchers in the fields of cheminformatics, bioinformatics, systems biology, and translational medicine. Because large numbers of compounds and associated structure-activity relationships (SAR) data are published in journals and patents every year, many new data sources have become available, each covering different aspects of the connectivity between the SAR-related entities [4]. With the increasing usage of these resources by scientists from both academia and the pharmaceutical industry, quality control of chemical structures and associated metadata is becoming a necessity [5]. Correctness of a structure extracted from databases has a great impact on the predictive ability of computational models for quantitative structure-activity relationships (QSAR) [6].

A recent study by Williams and Ekins [7] on a subset of a chemistry database showed more than 70% errors in the absolute structural integrity, a striking difference to the 5-10% level the authors had anticipated. In another study of database quality, Oprea et al. [8] have illustrated how errors within a database are transferred to other databases following data integration (also mentioned by Williams et al. [9]). Quality issues have also been observed in the relationship between chemical structures and the corresponding identifiers, such as chemical names referring to structures with different stereochemistry or CAS numbers incorrectly associated with a particular salt or mixture [9]. Although these problems are known to exist, there have been no studies that quantify the consistency between structures and their identifiers.

Chemical identifiers can be divided into two major classes based on how they are generated. The first consists of systematic identifiers, which are generated algorithmically and should have a one-to-one correspondence with the structure (however, different software could generate different flavours, as is the case for SMILES notations [10,11]). The second class comprises non-systematic chemical identifiers. These are source dependent and usually generated at the point of registration within a particular source (e.g. CAS numbers, PubChem compound identifiers (CIDs) and substance identifiers (SIDs), generic or drug brand names).

Structure depictions are the natural language for chemists. In order to convert the images to a form usable by computers, several file formats and chemical identifiers have been introduced. The MOL file format [12], SMILES notations [10], InChI strings [13], and IUPAC names [14] are arguably the most widely used. In the context of this work we will refer to IUPAC names, SMILES notations, and InChI strings as systematic identifiers. Most chemical databases are built starting from the MOL file representations of chemical structures, which are linked to systematic and non-systematic identifiers. It is thus crucial that different chemical identifier types represent the same compound.

Inconsistencies between systematic identifiers and registered chemical structures can occur for several reasons. For example, systematic identifiers can be generated with different structure-to-identifier conversion tools, with different levels of structure standardisation, or structures and systematic identifiers can be integrated without harmonisation from different sources. In this study we investigate the consistency of systematic identifiers of well-defined structures within and between some of the commonly used chemical resources. We also examine the effect of standardisation on this consistency.

Methods

Databases

For this study we selected a set of well-known publicly available small-molecule databases to cover a wide range of bioactive compounds: DrugBank [15], Chemical Entities of Biological Interest (ChEBI) [16], the Human Metabolome Database (HMDB) [17], PubChem [2], and the NCGC Pharmaceutical Collection (NPC) [18]. Table 1 shows the number of structures and corresponding systematic identifiers in each database. All data were downloaded on March 14, 2012. In this study, only compounds that had MOL files were used. Whenever available, we collected SMILES notations, InChI strings and IUPAC names. If several SMILES notations were available for a single compound, we selected the isomeric SMILES.
Table 1

Number of structures (MOLs) and systematic identifier counts for databases in this study

Database    MOL        InChI      SMILES     IUPAC
DrugBank    6506       6391       6504       6489
ChEBI       21367      19076      19725      18798
HMDB        8534       8534       8534       7727
PubChem     5069294    5069293    5069294    4769031
NPC         8024       0          8018       0
In addition to systematic identifiers, cross-references linking records between databases were also downloaded. The following data were extracted from the resources:

DrugBank [15]. The set of compounds consisted of approved drugs, experimental drugs, nutraceutical drugs, illicit drugs, and withdrawn drugs. Cross-references to other databases were extracted from the DrugCards in DrugBank.

ChEBI [16]. All manually checked and annotated (3 stars) structures with their corresponding systematic identifiers were downloaded. For some of these, ChEBI provides several IUPAC names. In these cases we only used the first IUPAC name in the ChEBI record for our analyses. Cross-references were obtained from the ChEBI ontology file.

HMDB [17]. All small-molecule metabolites with their corresponding structures were downloaded. Cross-references were extracted from the HMDB MetaboCard files.

PubChem [2]. Based on criteria described previously [4], a set of compounds likely to have SAR and/or other bio-annotations was downloaded from PubChem Compound. PubChem cross-references are only provided on the substance level, not on the compound level, and therefore no PubChem cross-references were used in this study.

NPC [18]. NPC contains the clinically approved drugs from the USA, Europe, Canada and Japan. Compounds and cross-references were downloaded through the NPC Browser 1.1.0 [18]. The export option of the NPC Browser was used to extract data in MOL and SMILES formats. NPC does not provide InChI strings or IUPAC names.

Consistency of systematic identifiers within a database

To analyse the structural representation consistency of systematic identifiers within a database, we took the MOL representation of a compound as the reference point. Ideally all associated systematic identifiers should represent the same MOL file. In this work we have used InChI strings for comparisons. InChI (International Chemical Identifier) is a structure-derived tag for a chemical compound. It is an algorithmically produced string of characters, which acts as the unique digital signature of the compound [19]. InChI software, developed by IUPAC and InChI Trust, is open-source software and the de facto standard for generating InChI strings [20]. This is not the case for SMILES or IUPAC names (Figure 1). Various flavours of SMILES or IUPAC names are generated by different software to represent the same molecular structure [11,21,22]. Therefore, MOL files and all systematic identifiers were converted into Standard InChIs, using InChI version 1.03, which were then used to perform all comparisons (Figure 2).
Figure 1

Chemical representations of anastrozole.

Figure 2

Comparison of MOL representation with systematic identifiers.

Several public and commercial cheminformatics tool-kits are currently available for structure manipulation and molecular editing [23]. We used ChemAxon’s MolConverter 5.9.1 [24], which has the necessary functionality and is freely available for academic research. For clarity, we refer to Standard InChI strings generated by ChemAxon’s MolConverter as InChI(ca).

Consistency of systematic identifiers between databases

To analyse the consistency of systematic identifiers between databases, the cross-reference linkage of compounds was examined. Within the constraints of different chemistry business rules, the chemical entities linked together via the cross-references should represent the same structure based on their MOL representation. We compared the structures using the InChI(ca) generated from the MOLs. We did not consider cross-references where conversion to InChI(ca) failed for one or both of the MOL files. If a compound had multiple cross-references to a single database, each cross-reference was investigated independently. For cross-references to PubChem, we only considered compounds within our subset of the PubChem database.
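The cross-reference comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `crossref_agreement`, the record IDs, and the dictionary layout are hypothetical, and the InChI strings are assumed to have been generated beforehand (a value of `None` stands for a failed conversion or a record outside the subset).

```python
from typing import Dict, Iterable, Optional, Tuple


def crossref_agreement(
    source_inchi: Dict[str, Optional[str]],
    target_inchi: Dict[str, Optional[str]],
    crossrefs: Iterable[Tuple[str, str]],
) -> Tuple[float, int]:
    """Return (percent agreement, number of cross-references compared).

    Each cross-reference is judged independently, so a compound with
    several links to the same target database contributes several times.
    """
    compared = 0
    matched = 0
    for source_id, target_id in crossrefs:
        a = source_inchi.get(source_id)
        b = target_inchi.get(target_id)
        # Skip links where a record is missing or its conversion failed.
        if a is None or b is None:
            continue
        compared += 1
        if a == b:
            matched += 1
    percent = 100.0 * matched / compared if compared else float("nan")
    return percent, compared


# Toy data; the record IDs are hypothetical placeholders.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"
WATER = "InChI=1S/H2O/h1H2"

source = {"A1": ETHANOL, "A2": METHANE, "A3": None}
target = {"B1": ETHANOL, "B2": WATER, "B3": METHANE}
links = [("A1", "B1"), ("A1", "B2"), ("A2", "B3"), ("A3", "B2")]

percent, n_compared = crossref_agreement(source, target, links)
print(f"{percent:.1f}% agreement over {n_compared} cross-references")
```

Counting per cross-reference rather than per compound mirrors the choice described in the text, and it means that compounds with many links to one database weigh more heavily in the resulting percentage.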

Standardisation

Inconsistency between systematic identifiers and their MOL representation may partly relate to the different levels of sensitivity in identifier calculation. Currently, different structure normalisation rules can be used to define compound uniqueness [25]. Unfortunately, a unified and agreed set of rules is still lacking [9]. To assess the effect of structure standardisation on the consistency of systematic identifiers within and between databases, we applied a set of rules developed by the Computer-Aided Drug Design group of the National Cancer Institute (NCI/CADD) known as FICTS rules [26,27]. These were applied to each structure and its corresponding systematic identifier. The FICTS rules comprise removing small organic fragments (F), ignoring isotopic labels (I), neutralizing charges (C), generating canonical tautomers (T), and ignoring stereochemistry information (S) for a compound. If any of these rules is applied, the corresponding upper-case letter is replaced with a “u” (standing for “un-sensitive” [26]). We implemented the FICTS rules using ChemAxon’s Standardizer [28]. To make the results comparable with our other analyses, the rules were applied to the InChI(ca) strings.
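The naming scheme for the sensitivity levels, and the effect of ignoring stereochemistry, can be illustrated with a short Python sketch. Two assumptions apply. First, the helper names `sensitivity_label` and `strip_stereo_layers` are hypothetical. Second, the study applied the rules with ChemAxon's Standardizer; the string-level removal of the InChI stereo layers (`/b`, `/t`, `/m`, `/s`) shown here is only a rough stand-in for the "S" rule and does not cover the other four rules.

```python
import re

FICTS = "FICTS"


def sensitivity_label(ignored: str) -> str:
    """Replace the letter of every ignored feature with 'u'.

    sensitivity_label("S") gives "FICTu": stereochemistry is ignored,
    all other features are still taken into account.
    """
    return "".join("u" if letter in ignored else letter for letter in FICTS)


# Stereo layers in a Standard InChI start with /b, /t, /m or /s.
# The formula layer starts with an upper-case element symbol, so it
# is never matched by this pattern.
_STEREO_LAYER = re.compile(r"/[btms][^/]*")


def strip_stereo_layers(inchi: str) -> str:
    """Drop the stereo layers (approximation of the 'S' rule only)."""
    return _STEREO_LAYER.sub("", inchi)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# The five single-rule levels used as column headers in the results.
print([sensitivity_label(letter) for letter in FICTS])

# The two enantiomers differ as full strings but agree without stereo layers.
print(L_ALANINE == D_ALANINE)
print(strip_stereo_layers(L_ALANINE) == strip_stereo_layers(D_ALANINE))
```

A comparison at the FICTu level therefore treats the two alanine enantiomers as the same compound, whereas the fully sensitive FICTS comparison does not.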

Results

Conversion of systematic identifiers

Table 2 shows the percentage of successful conversions of the systematic identifiers into InChI(ca) strings by ChemAxon’s MolConverter. This is high for MOLs, SMILES notations and InChI strings in all databases. The lower (90%) MOL conversion for ChEBI was due to the presence of query atom features such as “R” (R-groups) or “*” (= any atom). The main reason for failures in converting IUPAC names to Standard InChI strings was the difficulty the conversion tool had in handling certain structural classes such as steroids, porphyrins, and carbohydrates. The lowest value of IUPAC to InChI(ca) conversion was for HMDB.
Table 2

Successful conversion (in %) of MOL files and systematic identifiers to InChI(ca)

Database    MOL      InChI    SMILES   IUPAC
DrugBank    98.9     100      99.1     93.6
ChEBI       90.6     100      96.8     69.8
HMDB        100      99.9     100      38.1
PubChem     100      100      100      92.6
NPC         99.7     -        100      -
To investigate whether the IUPAC name conversion for HMDB could be improved, the same procedure was applied with another structure-to-identifier tool, the NCI Chemical Identifier Resolver [29]. This increased successful conversions slightly, by 8%, but still left the majority of IUPAC names in HMDB unconverted.

Consistency of systematic identifiers within databases

For each compound in a database, we compared the InChI(ca) derived from the MOL file with the InChI(ca) strings from the corresponding systematic identifiers (Figure 2). Table 3 shows, for each database, the consistency between the MOL representation and the corresponding systematic identifiers, expressed as percentage agreement of matching InChI(ca) strings. If the InChI(ca) could not be generated for a MOL file or a systematic identifier, no comparison was done.
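As a minimal sketch of this within-database measure, the Python snippet below computes the percentage agreement per identifier type. It assumes that every representation of a record has already been converted to an InChI string and that `None` marks a failed or missing conversion; the function name `within_db_agreement` and the record layout are hypothetical and not taken from the study.

```python
from typing import Dict, List, Optional

# One record per compound; each value is the InChI derived from that
# representation, or None if it is absent or could not be converted.
Record = Dict[str, Optional[str]]


def within_db_agreement(records: List[Record]) -> Dict[str, Optional[float]]:
    """Percent agreement of each identifier type with the MOL reference."""
    result: Dict[str, Optional[float]] = {}
    for kind in ("inchi", "smiles", "iupac"):
        compared = 0
        matched = 0
        for record in records:
            reference = record.get("mol")
            other = record.get(kind)
            # No comparison is made when either conversion is unavailable.
            if reference is None or other is None:
                continue
            compared += 1
            if reference == other:
                matched += 1
        result[kind] = 100.0 * matched / compared if compared else None
    return result


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

toy_records = [
    {"mol": ETHANOL, "inchi": ETHANOL, "smiles": ETHANOL, "iupac": ETHANOL},
    # The SMILES here lacks stereochemistry; the IUPAC name failed to convert.
    {"mol": L_ALANINE, "inchi": L_ALANINE, "smiles": FLAT_ALANINE, "iupac": None},
    # The MOL file failed to convert, so this record is never compared.
    {"mol": None, "inchi": BENZENE, "smiles": BENZENE, "iupac": BENZENE},
]

agreement = within_db_agreement(toy_records)
print(agreement)
```

Because unconvertible entries are excluded from the denominator, the agreement figures should be read alongside the conversion rates: a high agreement computed over few convertible names says little about the names that could not be parsed.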
Table 3

Consistency of MOLs and systematic identifiers (in % agreement) within databases

Database    MOL–InChI    MOL–SMILES    MOL–IUPAC
DrugBank    98.2         98.5          90.0
ChEBI       96.5         96.5          75.3
HMDB        89.3         37.2          55.7
PubChem     97.7         97.8          87.2
NPC         -            93.4          -
In DrugBank there is more than 98% agreement between MOLs and their corresponding InChI strings and SMILES notations, while the consistency drops to around 90% for IUPAC names. PubChem and ChEBI have slightly lower agreement than DrugBank for InChI strings and SMILES notations, but the IUPAC names in ChEBI show a substantially lower agreement of 75%. The figures are lowest in HMDB, with agreements of 37% for MOL-SMILES and 56% for MOL-IUPAC names. NPC only stores SMILES, which have a 93% agreement with their MOL representations.

FICTS rules were applied to the InChI(ca) strings derived from the MOL files and systematic identifiers, and all comparisons were redone. Table 4 shows the results. Disregarding stereochemistry has the largest impact. For example, the consistency for MOL-SMILES notations and MOL-IUPAC names in HMDB increased by 61 and 29 percentage points, respectively. ChEBI and PubChem also show a considerable increase in agreement between IUPAC names and MOL files. In addition to stereochemistry, the changes made by standardising tautomers also improved the consistency, with the largest effect on HMDB. Charges, fragments and isotopic labels had a small or no effect on the consistency.
Table 4

Effect of different standardisation rules on the consistency between MOL files and systematic identifiers (in % agreement)

Database    Comparison    FICTS    uICTS    FuCTS    FIuTS    FICuS    FICTu
DrugBank    MOL–InChI     98.2     99.0     99.0     99.0     99.4     99.8
DrugBank    MOL–SMILES    98.5     98.6     98.6     98.6     99.5     99.7
DrugBank    MOL–IUPAC     90.0     90.1     90.0     90.1     93.5     96.2
ChEBI       MOL–InChI     96.5     98.9     98.5     98.4     99.2     99.6
ChEBI       MOL–SMILES    96.5     96.6     96.6     96.6     99.6     99.8
ChEBI       MOL–IUPAC     75.3     75.6     75.4     77.1     79.7     91.9
HMDB        MOL–InChI     89.3     89.8     89.7     90.3     89.9     98.5
HMDB        MOL–SMILES    37.2     37.3     37.2     38.0     43.1     98.3
HMDB        MOL–IUPAC     55.7     55.8     55.8     57.5     58.8     84.8
PubChem     MOL–InChI     97.7     97.9     97.9     97.9     99.3     99.9
PubChem     MOL–SMILES    97.8     97.9     97.9     97.8     99.2     99.9
PubChem     MOL–IUPAC     87.2     87.7     87.5     87.2     93.7     97.2
NPC         MOL–SMILES    93.4     93.5     93.4     93.4     98.0     99.8
Table 5 shows the agreement between the MOL files for compounds with inter-database cross-references. This varies from 25.8% to 93.7%, but for most cases is around 60-75%. The low value for cross-references from NPC to PubChem can be attributed to 1527 compounds in NPC that have more than one (average 5.7, median 3) cross-reference to PubChem CIDs. The agreement for the 2475 compounds in NPC that have just one cross-reference to PubChem is 79.3%. Note that the agreement for the cross-references in DrugBank or HMDB to ChEBI is about 20% higher than the other way around.
Table 5

Agreement between MOL files of compounds that have a cross-reference in one database (row) to another database (column)

            DrugBank        ChEBI           HMDB           PubChem         NPC
DrugBank    -               72.1% (1666)    -              93.7% (4723)    -
ChEBI       54.3% (1288)    -               45.6% (114)    -               -
HMDB        -               64.0% (1433)    -              76.0% (2217)    -
PubChem     -               -               -              -               -
NPC         76.7% (1320)    -               -              25.8% (9557)    -

The number of cross-references is given in parentheses.

Since our results indicate that stereochemistry standardisation may substantially improve the consistency of systematic identifiers within databases (Table 4), we also assessed the consistency between databases after applying the FICTu rule (Table 6).
Table 6

Agreement between MOL files of compounds that have a cross-reference in one database (row) to another database (column) after stereochemistry standardisation

            DrugBank    ChEBI    HMDB     PubChem    NPC
DrugBank    -           91.4%    -        95.6%      -
ChEBI       68.6%       -        93.0%    -          -
HMDB        -           82.0%    -        89.8%      -
PubChem     -           -        -        -          -
NPC         93.4%       -        -        47.6%      -
Stereochemistry standardisation increases the agreement for most database pairs by around 15-20 percentage points. The largest increase (47.4 percentage points) is seen for cross-references linking ChEBI to HMDB. The agreement between NPC and PubChem also increases, but more than half of the cross-references still link MOL files that do not match. For NPC compounds that have just one cross-reference to PubChem, the agreement increased from 79.3% to 91.0%.
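The size of these increases can be recomputed directly from the published agreement values. The short Python sketch below uses only the percentages reported in Tables 5 and 6; the variable names are illustrative.

```python
# Agreement (%) for each (source, target) database pair, before and after
# stereochemistry standardisation, as reported in Tables 5 and 6.
before = {
    ("DrugBank", "ChEBI"): 72.1,
    ("DrugBank", "PubChem"): 93.7,
    ("ChEBI", "DrugBank"): 54.3,
    ("ChEBI", "HMDB"): 45.6,
    ("HMDB", "ChEBI"): 64.0,
    ("HMDB", "PubChem"): 76.0,
    ("NPC", "DrugBank"): 76.7,
    ("NPC", "PubChem"): 25.8,
}
after = {
    ("DrugBank", "ChEBI"): 91.4,
    ("DrugBank", "PubChem"): 95.6,
    ("ChEBI", "DrugBank"): 68.6,
    ("ChEBI", "HMDB"): 93.0,
    ("HMDB", "ChEBI"): 82.0,
    ("HMDB", "PubChem"): 89.8,
    ("NPC", "DrugBank"): 93.4,
    ("NPC", "PubChem"): 47.6,
}

# Gain in percentage points, rounded to one decimal like the source tables.
gains = {pair: round(after[pair] - before[pair], 1) for pair in before}
largest = max(gains, key=gains.get)

for (source, target), gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{source} -> {target}: +{gain} points")
```

The differences are absolute gains in percentage points, not relative changes: the move from 45.6% to 93.0% for ChEBI to HMDB is a gain of 47.4 points, which in relative terms is roughly a doubling.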

Discussion

While the importance of data quality control in chemical resources has been discussed previously [5-7,9], to our knowledge this is the first study to assess the consistency of structural representations of systematic identifiers within and between small-molecule databases. The assumption was that systematic identifiers should correspond with the registered MOL file. Standard InChI strings were used as a basis for this comparison because of the unique algorithm available, unlike for SMILES notations and IUPAC names where multiple strings can represent the same compound. To provide comparable results and remove the influence of different structure-to-identifier software, only ChemAxon’s MolConverter [24] was used for all name conversions. Compounds where MOL files or systematic identifiers did not convert to InChI strings were disregarded.

To quantify the potential influence of different structure-to-identifier software we compared the Standard InChI strings generated from the MOL files using ChemAxon’s MolConverter [24] with those of Xemistry’s CACTVS chemoinformatics toolkit [30,31]. The comparison showed 98.9% agreement for HMDB, 98.3% for PubChem, 97.6% for DrugBank, 96.4% for ChEBI, and 94.2% for NPC in cases where both tools managed to convert MOL files to InChI strings. The differences are small and likely to be caused by the way the tools handle the MOL files. We consider it unlikely that our results would have changed substantially had we used another conversion tool.

The consistency of systematic identifiers with their corresponding MOL representations varies widely (Table 3). The highest agreement was obtained for DrugBank and PubChem, the lowest for HMDB. The higher consistency values for PubChem may be explained by their procedure for generating systematic identifiers [32]: starting from the MOL files, InChI strings are calculated based on the IUPAC Standard InChI software and SMILES notations and IUPAC names are generated by OpenEye software [33]. Unfortunately, because other databases do not clearly describe their procedures it remains unclear how possible differences may have affected consistency.

Application of the FICTS sensitivity rules [26] gave us further insight. We found that disregarding stereochemistry and, to a lesser extent, tautomers boosted the consistency, in particular of MOL-IUPAC names (Table 4). The other sensitivity levels had a much lower or no effect. Thus, differences in stereochemistry between MOL files and systematic identifiers appear to be the single most important cause of inconsistencies. For ChEBI and HMDB, the agreement between MOLs and IUPAC names remained low even with stereochemistry-insensitive matching.

The consistency of systematic identifiers between databases, as measured by the agreement of MOL files in different databases linked by cross-references, ranged from 26% to 94% (Table 5). The value of cross-references lies in the consistency of the structural representation of the data, and our study shows that these cross-references contain many errors. Disregarding stereochemistry on the registered MOL files increased the agreement, but a considerable percentage of the cross-references remained inconsistent.

Integration of different chemical databases should consider these problems. Merging databases using different structure identifiers as indexes for integration can reduce quality. Instead, a unique representation such as MOL files can be used as the basis of integration. Other systematic identifiers can be generated later on the validated structure within the database. Inconsistencies within databases may steer curation efforts, and combining the information on inconsistencies for a specific compound may even suggest which of the names or representations are wrong.

In a recent article by Williams et al. [9] several solutions have been proposed to reduce errors in databases. In addition to improved curation, the use of structure validation filters for incorrect valence, atom labels, aromatic bonds, charges, stereochemistry and duplication was suggested. In another recent study, O’Boyle [11] proposed a standard method to generate canonical SMILES based on InChI strings, in order to create the same canonical SMILES using different toolkits. Our results quantify the issues raised in these studies. We have shown that a set of well-defined standardisation rules is essential while constructing systematic identifiers (yielding up to a 50% increase in consistency), and that stereochemistry makes an important contribution to this inconsistency. Our approach of testing the consistency of systematic identifiers is general and can be applied to other databases and may prove valuable in data curation and integration efforts. Using a similar approach, we also plan to investigate the consistency of non-systematic identifiers in chemical resources.
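The idea that combining the inconsistencies for one compound may hint at which representation is wrong can be sketched as a simple majority vote. This heuristic is our illustration and is not a procedure described in the study; the function name `flag_outliers` and the record layout are hypothetical, and the input is assumed to be the InChI strings derived from each representation of a single compound.

```python
from collections import Counter
from typing import Dict, List, Optional


def flag_outliers(derived: Dict[str, Optional[str]]) -> List[str]:
    """Return the representations that disagree with a strict majority.

    `derived` maps a representation name ("mol", "smiles", ...) to the
    InChI string derived from it, or None if conversion failed. An empty
    list means there is either full agreement or no clear majority.
    """
    usable = {name: value for name, value in derived.items() if value is not None}
    if len(usable) < 3:
        return []  # too few representations to outvote anything
    top_value, top_count = Counter(usable.values()).most_common(1)[0]
    if top_count <= len(usable) / 2:
        return []  # a tie or plurality is not treated as evidence
    return sorted(name for name, value in usable.items() if value != top_value)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

record = {
    "mol": L_ALANINE,
    "inchi": L_ALANINE,
    "smiles": FLAT_ALANINE,  # stereochemistry missing in this representation
    "iupac": L_ALANINE,
}
suspects = flag_outliers(record)
print(suspects)
```

A flagged representation is only a candidate for manual curation: the majority can itself be wrong, for example when several identifiers were generated from the same faulty structure.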

Conclusions

The degree of consistency within systematic chemical identifiers varies between data sources. When building a new database, de novo recalculation is superior to recycling existing identifiers, and creating systematic identifiers starting from the same primary structural representation (e.g. MOL) will improve the quality of the final product. Extra care should be taken if systematic identifiers are going to be used as a key index for merging databases. Well-defined and documented chemistry standardisation rules applied to all compounds can greatly decrease the number of errors and expedite integration. Finally, we have shown that inconsistency exists between the structural representations of compounds that are linked via cross-references between databases. Inconsistency here can have deleterious effects when merging data from or cross-querying multiple databases.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SAA extracted, processed and analysed the data, and drafted the manuscript. JAK and SM supervised and coordinated the project and revised the manuscript. All authors read and approved the final manuscript.