Markus Sitzmann, Wolf-Dietrich Ihlenfeldt, Marc C Nicklaus.
Abstract
We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. CACTVS's tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, was used; it takes a comprehensive stance on the types of tautomeric interconversion included. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomerism overlap within the same individual database (i.e. at least one other entry was present that was really only a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.
Year: 2010 PMID: 20512400 PMCID: PMC2886898 DOI: 10.1007/s10822-010-9346-4
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 3.686
Fig. 1 General isomerization scheme for tautomers
Fig. 2 Tautomerism can change stereochemistry. Top: change of E/Z geometry. Bottom: change of chirality
Fig. 3 Calculation of the NCI/CADD Chemical Structure Identifiers (FICTS, FICuS, uuuuu)
Fig. 4 Relationship between the different NCI/CADD parent structures and identifiers after structure normalization
SMIRKS transforms for the enumeration of tautomers. CACTVS provides an extended set of attributes for the definition of SMIRKS that have no counterpart in the original SMIRKS syntax. For example, the attribute zn indicates the number n of heteroatom substituents on the corresponding carbon atom. Another CACTVS-specific attribute is en, used in rule 5 (e6 on atom 2), which indicates that the corresponding carbon atom has to be a member of a ring with at least n pi atoms
| SMIRKS transform |
|---|
| [O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3] |
| [O,S,Se,Te;X1:1]=[Cz1H0:2][C:5]=[C:6][CX4z0,NX3:3] |
| [#1,a:5][NX2:1]=[Cz1:2][CX4R{0-2}:3] |
| [Cz0R0X3:1]([C:5])=[C:2][Nz0:3] |
| [N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3] |
| [nX2,NX2,S,O,Se,Te:1]=[C,c,nX2,NX2:6][C,c:5]=[C,c,nX2:2][N,n,S,s,O,o,Se,Te:3] |
| [n,s,o:1]=[c,n:6][c:5]=[c,n:2][n,s,o:3] |
| [nX2,NX2,S,O,Se,Te,Cz0X3:1]=[c,C,NX2,nX2:6][C,c:5]=[C,c,NX2,nX2:2][C,c,NX2,nX2:7]=[C,c,NX2,nX2:8][N,n,S,s,O,o,Se,Te:3] |
| [O,S,Se,Te;X1:1]=[C:2]=[C:3 |
Scoring of structure fragments used for the definition of a canonical tautomer
| Structure fragment | Scoring points |
|---|---|
| Each carbocyclic aromatic ring | +150 |
| Each aromatic ring | +100 |
| Each benzoquinone (including imine and thio analogs, [C]1([C]=[C][C]([C]=[C]1)=,:[N,S,O])=,:[N,S,O], penalize cyclohexanetetrone-like structures) | +25 |
| Each oxime group (C=N[OH]) | +4 |
| Each double bond between a carbon atom (C) and an oxygen atom (O) | +2 |
| Each double bond between a nitrogen atom (N) and an oxygen atom (O) | +2 |
| Each double bond between a phosphorus atom (P) and an oxygen atom (O) | +2 |
| Each non-aromatic double bond between a carbon atom (C) and a heteroatom (X) | +1 |
| Each methyl group (penalize structures with terminal double bonds) | +1 |
| Each guanidine group with a double bond on the terminal nitrogen atom (NC(=N)[N][!H]) | +1 |
| Each guanidine group with an endocyclic double bond ([N;R][C;R]([N])=[N;R]) | +2 |
| Each P-H, S-H, Se-H and Te-H bond | −1 |
| Each aci-nitro group (C=N(=O)[OH]) | −4 |
The scoring points were obtained by an analysis of different sets of tautomers including the known preferred tautomer
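A minimal Python sketch of how such a scoring table could be applied is given here. The weights are copied from the table; everything else is an assumption made for illustration: the fragment key names, the functions `tautomer_score` and `pick_canonical`, the idea that fragment counts are supplied by the caller (no substructure matching is done), and the tie-breaking rule. The source does not say whether a C=O bond is counted under both the C=O rule and the carbon-heteroatom rule; the example assumes it is.

```python
# Weights copied from the scoring table; the key names are hypothetical.
FRAGMENT_SCORES = {
    "carbocyclic_aromatic_ring": 150,
    "aromatic_ring": 100,
    "benzoquinone": 25,
    "oxime": 4,
    "c_o_double_bond": 2,
    "n_o_double_bond": 2,
    "p_o_double_bond": 2,
    "c_x_double_bond_nonaromatic": 1,
    "methyl": 1,
    "guanidine_terminal_double_bond": 1,
    "guanidine_endocyclic_double_bond": 2,
    "p_s_se_te_h_bond": -1,
    "aci_nitro": -4,
}


def tautomer_score(fragment_counts):
    """Weighted sum of fragment counts; unknown fragment names raise KeyError."""
    return sum(FRAGMENT_SCORES[name] * count
               for name, count in fragment_counts.items())


def pick_canonical(tautomers):
    """Return the label of the highest-scoring tautomer.

    Ties go to the alphabetically first label (an assumption of this sketch).
    """
    return max(sorted(tautomers), key=lambda label: tautomer_score(tautomers[label]))


# Hand-assigned fragment counts for the keto and enol forms of acetone.
# Assumption: the C=O bond counts under both the C=O and the C=X rule.
candidates = {
    "keto": {"c_o_double_bond": 1, "c_x_double_bond_nonaromatic": 1, "methyl": 2},
    "enol": {"methyl": 1},
}

for label in sorted(candidates):
    print(label, tautomer_score(candidates[label]))
print("canonical:", pick_canonical(candidates))
```

Under these assumptions the keto form scores higher than the enol form, so it would be chosen as the canonical tautomer in this toy case.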
Fig. 5 Normalization of stereochemistry for the canonical tautomer, involving double bonds whose stereochemistry is disregarded in the final step producing the canonical tautomer when the original stereo bond does not have a fixed location during tautomer generation
Fig. 6 Enumeration of all tautomers of 2-hydroxy-3,4-dimethoxy-6-methylbenzaldehyde (1)
List of all databases present in CSDB, giving the original publisher and the source from which the database was obtained; the number of original structure records; the unique structure counts after de-duplication by the FICTS and FICuS identifiers; the percentage of duplicate structures by both identifiers; and the difference in unique structure counts between the FICTS and FICuS parent structure sets
| Database name | Original publisher | Source/downloaded from | Downloaded/released at | Structure record count | Unique structure count (FICTS) | Unique structure count (FICuS) | % Duplicates by FICTS | % Duplicates by FICuS | Discrepancy FICTS/FICuS structure count | % Duplicates FICTS-FICuS |
|---|---|---|---|---|---|---|---|---|---|---|
| ACD 3D | MDL/Symyx | MDL/Symyx | 1999-01-01 | 221,661 | 215,154 | 214,614 | 2.93 | 3.17 | 540 | 0.24 |
| ACX | CambridgeSoft | CambridgeSoft | 1999-12-31 | 137,001 | 101,009 | 100,729 | 26.27 | 26.47 | 280 | 0.20 |
| Ambinter | Ambinter | PubChem | 2008-06-10 | 2,692,132 | 2,688,795 | 2,678,948 | 0.12 | 0.48 | 9,847 | 0.36 |
| Aronis | Aronis | PubChem | 2008-06-10 | 23,389 | 23,389 | 23,385 | 0.0 | 0.01 | 4 | 0.01 |
| Asinex | Asinex | PubChem | 2007-07-26 | 362,469 | 362,464 | 362,464 | <0.01 | 0.0 | 0 | 0.0 |
| Asinex Building Blocks | Asinex | Asinex | 2005-04-01 | 5,248 | 5,248 | 5,248 | 0.0 | 0.0 | 0 | 0.0 |
| Asinex Gold Collection | Asinex | Asinex | 2006-06-01 | 227,479 | 227,475 | 227,475 | <0.01 | 0.0 | 0 | 0.0 |
| Asinex Platinum Collection | Asinex | Asinex | 2006-06-01 | 130,646 | 130,646 | 130,646 | 0.0 | 0.0 | 0 | 0.0 |
| BIND | BIND | PubChem | 2007-07-26 | 1,207 | 1,205 | 1,203 | 0.16 | 0.33 | 2 | 0.17 |
| BindingDB | BindingDB | PubChem | 2007-07-26 | 8,492 | 8,470 | 8,458 | 0.25 | 0.4 | 12 | 0.15 |
| | | | 2008-06-10 | 12,747 | 12,705 | 12,699 | 0.32 | 0.37 | 6 | 0.05 |
| BioByte QSAR | BioByte | BioByte | 2006-05-01 | 155,296 | 154,679 | 153,801 | 0.39 | 0.96 | 878 | 0.57 |
| BioCyc | BioCyc | PubChem | 2007-07-26 | 1,660 | 1,307 | 1,285 | 21.26 | 22.59 | 22 | 1.33 |
| Biosynth | Biosynth | PubChem | 2008-06-10 | 2,079 | 1,934 | 1,931 | 6.97 | 7.11 | 3 | 0.14 |
| Calbiochem | Calbiochem | PubChem | 2008-06-10 | 1,665 | 1,591 | 1,591 | 4.44 | 4.44 | 0 | 0.0 |
| CambridgeSoft | CambridgeSoft | PubChem | 2007-07-26 | 10,458 | 10,143 | 10,120 | 3.01 | 3.23 | 23 | 0.22 |
| CC PMLSC | CC PMLSC | PubChem | 2007-07-26 | 222 | 217 | 217 | 2.25 | 2.25 | 0 | 0.0 |
| | | | 2008-06-10 | 173 | 173 | 172 | 0.0 | 0.57 | 1 | 0.57 |
| ChEBI | ChEBI | PubChem | 2007-07-26 | 8,767 | 8,373 | 8,238 | 4.49 | 6.03 | 135 | 1.54 |
| | | | 2008-06-10 | 2,604 | 2,563 | 2,515 | 1.57 | 3.41 | 48 | 1.84 |
| ChemBank | ChemBank | PubChem | 2007-07-26 | 413,586 | 339,066 | 338,520 | 18.01 | 18.15 | 546 | 0.14 |
| | | | 2008-06-10 | 1,194,169 | 1,014,374 | 1,011,147 | 15.05 | 15.32 | 3,227 | 0.27 |
| ChemBlock | ChemBlock | PubChem | 2007-07-26 | 107,570 | 107,290 | 107,192 | 0.26 | 0.35 | 98 | 0.09 |
| ChemBridge | ChemBridge | PubChem | 2007-07-26 | 433,971 | 433,970 | 433,970 | <0.01 | 0.0 | 0 | 0.0 |
| ChemBridge 100 k Lib | ChemBridge | ChemBridge | 2002-02-01 | 100,000 | 99,997 | 99,920 | <0.01 | 0.08 | 77 | 0.07 |
| ChemDB | ChemDB | PubChem | 2007-07-26 | 3,564,882 | 3,549,580 | 3,501,958 | 0.42 | 1.76 | 47,622 | 1.34 |
| | | | 2008-06-10 | 2 | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |
| ChemDiv Diversity Collection | ChemDiv | ChemDiv | 2004-09-01 | 495,466 | 495,455 | 495,395 | <0.01 | 0.01 | 60 | 0.01 |
| ChemExper Chemical Directory | ChemExper Chemical Directory | PubChem | 2007-07-26 | 156,258 | 156,113 | 155,698 | 0.09 | 0.35 | 415 | 0.26 |
| ChemSpider | ChemSpider | PubChem | 2008-06-10 | 17,064,543 | 16,871,574 | 16,537,474 | 1.13 | 3.08 | 334,100 | 1.95 |
| CHMIS-C | UMich | UMich | 2004-11-01 | 8,572 | 8,022 | 7,992 | 6.41 | 6.76 | 30 | 0.35 |
| CMC | MDL/Symyx | MDL/Symyx | 2006-01-01 | 8,757 | 8,742 | 8,732 | 0.17 | 0.28 | 10 | 0.11 |
| CMLD-BU | CMLD-BU | PubChem | 2007-07-26 | 1,629 | 1,619 | 1,619 | 0.61 | 0.61 | 0 | 0.0 |
| Columbia University Molecular Screening Center | Columbia University Molecular Screening Center | PubChem | 2008-06-10 | 399 | 391 | 391 | 2.0 | 2.0 | 0 | 0.0 |
| ComGenex | ComGenex | ComGenex | 2006-03-01 | 184,266 | 184,266 | 184,266 | 0.0 | 0.0 | 0 | 0.0 |
| ComGenex Unique Reagents | ComGenex | ComGenex | 2006-03-01 | 330 | 330 | 330 | 0.0 | 0.0 | 0 | 0.0 |
| Diabetic Complications Screening | Diabetic Complications Screening | PubChem | 2007-07-26 | 1,040 | 1 | 1 | 99.9 | 99.9 | 0 | 0.0 |
| DiscoveryGate | Symyx | PubChem | 2007-07-26 | 4,608,993 | 4,602,080 | 4,581,587 | 0.14 | 0.59 | 20,493 | 0.45 |
| | | | 2008-06-10 | 1,261,853 | 1,260,435 | 1,260,101 | 0.11 | 0.13 | 334 | 0.02 |
| DrugBank | DrugBank | PubChem | 2008-06-10 | 4,763 | 4,419 | 4,409 | 7.22 | 7.43 | 10 | 0.21 |
| Dupont Library | Dupont | MDDP/NCI | 2004-04-01 | 179,008 | 174,977 | 174,745 | 2.25 | 2.38 | 232 | 0.13 |
| Emory University Molecular Libraries Screening Center | Emory University Molecular Libraries Screening Center | PubChem | 2007-07-26 | 101,567 | 101,540 | 101,523 | 0.02 | 0.04 | 17 | 0.02 |
| | | | 2008-06-10 | 4,367 | 4,335 | 4,333 | 0.73 | 0.77 | 2 | 0.04 |
| EPA DSSTox | EPA DSSTox | PubChem | 2007-07-26 | 4,258 | 4,103 | 4,101 | 3.64 | 3.68 | 2 | 0.04 |
| | | | 2008-06-10 | 12,950 | 6,635 | 6,630 | 48.76 | 48.8 | 5 | 0.04 |
| Exchemistry | Exchemistry | PubChem | 2008-06-10 | 2,057 | 2,057 | 2,057 | 0.0 | 0.0 | 0 | 0.0 |
| FDA CDER Chronic/Subchronic | FDA/CDER | FDA/CDER | 2006-05-01 | 84 | 84 | 84 | 0.0 | 0.0 | 0 | 0.0 |
| FDA CDER Genetox | FDA/CDER | FDA/CDER | 2006-05-01 | 231 | 181 | 181 | 21.64 | 21.64 | 0 | 0.0 |
| FDA CFSAN Genetox | FDA/CFSAN | FDA/CFSAN | 2006-05-01 | 487 | 400 | 400 | 17.86 | 17.86 | 0 | 0.0 |
| FDA Genet/Reprod/Carcino | FDA/CDER | FDA/CDER | 2006-01-01 | 6,912 | 6,820 | 6,810 | 1.33 | 1.47 | 10 | 0.14 |
| InFarmatik | InFarmatik | PubChem | 2008-06-10 | 1,077 | 1,077 | 1,077 | 0.0 | 0.0 | 0 | 0.0 |
| iResearch Library | ChemNavigator | ChemNavigator | 2004-07-01 | 13,352,734 | 13,350,546 | 13,323,974 | 0.01 | 0.21 | 26,572 | 0.20 |
| | | | 2004-10-01 | 263,325 | 263,311 | 261,858 | <0.01 | 0.55 | 1,453 | 0.55 |
| | | | 2005-01-01 | 5,036,400 | 5,036,247 | 5,035,543 | <0.01 | 0.01 | 704 | 0.01 |
| | | | 2005-04-01 | 480,405 | 480,402 | 480,350 | <0.01 | 0.01 | 52 | 0.01 |
| | | | 2005-07-01 | 479,324 | 479,305 | 479,289 | <0.01 | <0.01 | 16 | <0.01 |
| | | | 2005-10-01 | 377,246 | 377,238 | 376,265 | <0.01 | 0.26 | 973 | 0.25 |
| | | | 2006-01-01 | 210,500 | 210,497 | 210,490 | <0.01 | <0.01 | 7 | <0.01 |
| | | | 2006-04-01 | 580,332 | 580,328 | 579,878 | <0.01 | 0.07 | 450 | 0.07 |
| | | | 2006-07-01 | 4,218,127 | 4,218,019 | 4,217,935 | <0.01 | <0.01 | 84 | <0.01 |
| | | | 2006-10-01 | 220,602 | 220,561 | 220,322 | 0.01 | 0.12 | 239 | 0.11 |
| | | | 2007-01-01 | 4,045,672 | 4,045,656 | 4,045,492 | <0.01 | <0.01 | 164 | <0.01 |
| | | | 2007-07-01 | 333,578 | 333,568 | 333,561 | <0.01 | <0.01 | 7 | <0.01 |
| | | | 2007-10-01 | 3,148,102 | 3,148,066 | 3,148,008 | <0.01 | <0.01 | 58 | <0.01 |
| | | | 2008-01-01 | 528,308 | 528,294 | 528,280 | <0.01 | <0.01 | 14 | <0.01 |
| | | | 2008-04-01 | 536,482 | 536,477 | 536,302 | <0.01 | 0.03 | 175 | 0.03 |
| | | | 2008-07-01 | 548,386 | 548,373 | 548,335 | <0.01 | <0.01 | 38 | <0.01 |
| | | | 2008-10-01 | 380,396 | 380,394 | 380,299 | <0.01 | 0.02 | 95 | 0.02 |
| | | | 2009-01-01 | 564,140 | 564,133 | 564,082 | <0.01 | 0.01 | 51 | <0.01 |
| | | | 2009-04-01 | 784,758 | 784,731 | 784,343 | <0.01 | 0.05 | 388 | 0.04 |
| | | | 2009-07-01 | 22,225,547 | 22,219,140 | 22,211,624 | 0.02 | 0.06 | 7,516 | 0.04 |
| Jubilant Kinase Inhibitors | Jubilant | Jubilant | 2004-12-01 | 170,000 | 163,799 | 163,518 | 3.64 | 3.81 | 281 | 0.17 |
| KEGG | KEGG | PubChem | 2007-07-26 | 16,938 | 14,313 | 14,233 | 15.49 | 15.97 | 80 | 0.48 |
| | | | 2008-06-10 | 3,551 | 2,477 | 2,475 | 30.24 | 30.30 | 2 | 0.06 |
| KUMGM | KUMGM | PubChem | 2007-07-26 | 3,349 | 3,109 | 3,107 | 7.16 | 7.22 | 2 | 0.06 |
| Leadscope FDA | Leadscope/FDA | PubChem | 2007-07-26 | 724 | 588 | 588 | 18.78 | 18.78 | 0 | 0.0 |
| LifeChem Building Blocks | LifeChem | LifeChem | 2006-05-01 | 4,027 | 4,027 | 4,020 | 0.0 | 0.17 | 7 | 0.17 |
| LifeChem Stock Compounds | LifeChem | LifeChem | 2006-05-01 | 204,955 | 204,954 | 204,765 | <0.01 | 0.09 | 189 | 0.09 |
| LifeChem Virtual Compounds | LifeChem | LifeChem | 2006-05-01 | 179,649 | 179,648 | 179,648 | <0.01 | 0.0 | 0 | 0.0 |
| LipidMAPS | LipidMAPS | PubChem | 2007-07-26 | 10,128 | 9,628 | 9,590 | 4.93 | 5.31 | 38 | 0.38 |
| | | | 2008-06-10 | 308 | 284 | 284 | 7.79 | 7.79 | 0 | 0.0 |
| MDDR | MDL/Symyx | MDL/Symyx | 2006-03-01 | 165,595 | 164,666 | 164,561 | 0.56 | 0.62 | 105 | 0.06 |
| MDL Patent Database | MDL/Symyx | MDL/Symyx | 2005-11-01 | 38,363 | 30,980 | 30,840 | 19.24 | 19.61 | 140 | 0.37 |
| MDL Toxicity Database | MDL/Symyx | MDL/Symyx | 2005-11-01 | 147,308 | 147,144 | 147,006 | 0.11 | 0.2 | 138 | 0.09 |
| MDPI | MDPI | MDPI | 2004-11-01 | 10,655 | 10,513 | 10,478 | 1.33 | 1.66 | 35 | 0.33 |
| MICAD | MICAD | PubChem | 2007-07-26 | 188 | 187 | 187 | 0.53 | 0.53 | 0 | 0.0 |
| | | | 2008-06-10 | 76 | 76 | 76 | 0.0 | 0.0 | 0 | 0.0 |
| MLSMR | MLSMR | PubChem | 2007-07-26 | 207,811 | 204,326 | 204,198 | 1.67 | 1.73 | 128 | 0.06 |
| | | | 2008-06-10 | 75,904 | 75,460 | 75,453 | 0.58 | 0.59 | 7 | <0.01 |
| MMDB | MMDB | PubChem | 2007-07-26 | 65,388 | 9,450 | 9,373 | 85.54 | 85.66 | 77 | 0.12 |
| | | | 2008-06-10 | 26,470 | 4,456 | 4,428 | 83.16 | 83.27 | 28 | 0.11 |
| MOLI | MOLI | PubChem | 2007-07-26 | 1,951 | 1,774 | 1,774 | 9.07 | 9.07 | 0 | 0.0 |
| MTDP/NCI | MTDP/NCI | PubChem | 2007-07-26 | 106,186 | 106,016 | 105,931 | 0.16 | 0.24 | 85 | 0.08 |
| | | | 2008-06-10 | 301 | 301 | 301 | 0.0 | 0.0 | 0 | 0.0 |
| NatChemBio | NatChemBio | PubChem | 2007-07-26 | 1,565 | 1,447 | 1,446 | 7.53 | 7.6 | 1 | 0.07 |
| | | | 2008-06-10 | 871 | 844 | 841 | 3.09 | 3.44 | 3 | 0.35 |
| NCGC | NCGC | PubChem | 2007-07-26 | 54,728 | 53,714 | 53,703 | 1.85 | 1.87 | 11 | 0.02 |
| | | | 2008-06-10 | 13,797 | 10,992 | 10,971 | 20.33 | 20.48 | 21 | 0.15 |
| NCI open database | NCI/DTP | NCI/CADD | 2006-07-01 | 263,465 | 253,957 | 253,550 | 3.6 | 3.76 | 407 | 0.16 |
| | | PubChem | 2007-07-26 | 268,696 | 251,396 | 250,854 | 6.43 | 6.64 | 542 | 0.21 |
| | | | 2008-06-10 | 5,383 | 5,372 | 5,365 | 0.2 | 0.33 | 7 | 0.13 |
| NCI-NP | NCI/DTP | NCI/DTP | 2002-02-01 | 124,700 | 120,736 | 119,587 | 3.17 | 4.10 | 1,149 | 0.93 |
| NIAID HIV/OI | NIAID | NIAID | 2006-02-01 | 138,693 | 133,064 | 132,806 | 4.05 | 4.24 | 258 | 0.19 |
| | | PubChem | 2007-07-26 | 155,624 | 149,609 | 149,341 | 3.86 | 4.03 | 268 | 0.17 |
| | | | 2008-06-10 | 4,260 | 4,068 | 4,052 | 4.5 | 4.88 | 16 | 0.38 |
| NIH Clinical Collection | NIH Clinical Collection | PubChem | 2008-06-10 | 489 | 473 | 472 | 3.27 | 3.47 | 1 | 0.20 |
| NINDS-ADSP | NINDS | PubChem | 2007-07-26 | 1,040 | 1,033 | 1,031 | 0.67 | 0.86 | 2 | 0.19 |
| NINDS-PANACHE | NINDS | PubChem | 2007-07-26 | 10 | 10 | 10 | 0.0 | 0.0 | 0 | 0.0 |
| NIST MS-Lib | NIST | NIST | 2006-01-01 | 177,495 | 171,241 | 170,917 | 3.52 | 3.7 | 324 | 0.18 |
| | | PubChem | 2007-07-26 | 177,495 | 171,239 | 170,920 | 3.52 | 3.7 | 319 | 0.18 |
| NIST WebBook | NIST | NIST | 2006-01-01 | 54,146 | 51,621 | 51,451 | 4.66 | 4.97 | 170 | 0.31 |
| | | PubChem | 2007-07-26 | 54,125 | 51,573 | 51,403 | 4.71 | 5.02 | 170 | 0.31 |
| NLM ChemIDplus | NLM | NLM | 2006-03-01 | 269,276 | 255,638 | 255,138 | 5.06 | 5.25 | 500 | 0.19 |
| | | PubChem | 2007-07-26 | 383,789 | 274,146 | 273,597 | 28.56 | 28.71 | 549 | 0.15 |
| NMMLSC | NMMLSC | PubChem | 2007-07-26 | 5,776 | 5,770 | 5,770 | 0.1 | 0.1 | 0 | 0.0 |
| | | | 2008-06-10 | 438 | 438 | 438 | 0.0 | 0.0 | 0 | 0.0 |
| NMRShiftDB | NMRShiftDB | PubChem | 2007-07-26 | 19,414 | 19,016 | 18,956 | 2.05 | 2.35 | 60 | 0.30 |
| NTP-CHSD | NTP | NTP | 1991-08-01 | 1,588 | 1,383 | 1,383 | 12.9 | 12.9 | 0 | 0.0 |
| NTP-PTC | NTP | NTP | 2002-09-01 | 417 | 401 | 401 | 3.83 | 3.83 | 0 | 0.0 |
| ORST Small Molecule Screening Center | ORST Small Molecule Screening Center | PubChem | 2008-06-10 | 1,998 | 1,996 | 1,993 | 0.1 | 0.25 | 3 | 0.15 |
| PASS Training Set | LSFBDD/IBMC/RAMS | LSFBDD/IBMC/RAMS | 2006-02-01 | 60,620 | 60,487 | 60,172 | 0.21 | 0.73 | 315 | 0.52 |
| PCMD | PCMD | PubChem | 2007-07-26 | 27 | 27 | 27 | 0.0 | 0.0 | 0 | 0.0 |
| | | | 2008-06-10 | 65 | 65 | 65 | 0.0 | 0.0 | 0 | 0.0 |
| PDSP | PDSP | PubChem | 2007-07-26 | 2,871 | 2,868 | 2,867 | 0.1 | 0.13 | 1 | 0.03 |
| ProbeDB | ProbeDB | PubChem | 2007-07-26 | 10 | 1 | 1 | 90.0 | 90.0 | 0 | 0.0 |
| | | | 2008-06-10 | 273 | 1 | 1 | 99.63 | 99.63 | 0 | 0.0 |
| Prous Science Drugs of the Future | Prous Science Drugs of the Future | PubChem | 2007-07-26 | 4,426 | 4,421 | 4,417 | 0.11 | 0.2 | 4 | 0.09 |
| | | | 2008-06-10 | 202 | 202 | 202 | 0.0 | 0.0 | 0 | 0.0 |
| R&D Chemicals | R&D Chemicals | PubChem | 2008-06-10 | 8,352 | 8,352 | 8,352 | 0.0 | 0.0 | 0 | 0.0 |
| RTECS | NIOSH/CDC | NIOSH/CDC | 2004-06-01 | 144,729 | 137,216 | 137,094 | 5.19 | 5.27 | 122 | 0.08 |
| SDCCG | SDCCG | PubChem | 2007-07-26 | 54,838 | 54,597 | 54,565 | 0.43 | 0.49 | 32 | 0.06 |
| SDCCG | SDCCG | PubChem | 2008-06-10 | 1,215 | 1,175 | 1,172 | 3.29 | 3.53 | 3 | 0.24 |
| SGC-Ox | SGC-Ox | PubChem | 2007-07-26 | 319 | 311 | 308 | 2.5 | 3.44 | 3 | 0.94 |
| SGC-Sto | SGC-Sto | PubChem | 2007-07-26 | 17 | 17 | 17 | 0.0 | 0.0 | 0 | 0.0 |
| Shanghai Institute of Organic Chemistry | Shanghai Institute of Organic Chemistry | PubChem | 2008-06-10 | 3,080 | 2,443 | 2,428 | 20.68 | 21.16 | 15 | 0.48 |
| Sigma–Aldrich | Sigma–Aldrich | PubChem | 2007-07-26 | 56,080 | 37,534 | 37,519 | 33.07 | 33.09 | 15 | 0.02 |
| SMID | SMID | PubChem | 2007-07-26 | 7,161 | 6,516 | 6,500 | 9.0 | 9.23 | 16 | 0.23 |
| Southern Research Institute—HTS | Southern Research Institute—HTS | PubChem | 2008-06-10 | 1,114 | 1,114 | 1,113 | 0.0 | 0.08 | 1 | 0.08 |
| Specs | Specs | PubChem | 2007-07-26 | 205,958 | 205,954 | 205,954 | 0.0 | 0.0 | 0 | 0.0 |
| SRMLSC | SRMLSC | PubChem | 2008-06-10 | 304 | 304 | 304 | 0.0 | 0.0 | 0 | 0.0 |
| Structural Genomics Consortium | Structural Genomics Consortium | PubChem | 2007-07-26 | 87 | 87 | 87 | 0.0 | 0.0 | 0 | 0.0 |
| | | | 2008-06-10 | 90 | 90 | 90 | 0.0 | 0.0 | 0 | 0.0 |
| The Scripps Research Institute Molecular Screening Center | The Scripps Research Institute Molecular Screening Center | PubChem | 2007-07-26 | 2 | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |
| | | | 2008-06-10 | 16,231 | 16,185 | 16,180 | 0.28 | 0.31 | 5 | 0.03 |
| Thomson Pharma | Thomson Pharma | PubChem | 2007-07-26 | 2,303,463 | 2,285,548 | 2,277,301 | 0.77 | 1.13 | 8,247 | 0.36 |
| | | | 2008-06-10 | 224,196 | 223,873 | 223,630 | 0.14 | 0.25 | 243 | 0.11 |
| Total TOSLab Building Blocks | Total TOSLab Building Blocks | PubChem | 2007-07-26 | 910 | 910 | 909 | 0.0 | 0.1 | 1 | 0.1 |
| UM-BBD | UM-BBD | PubChem | 2007-07-26 | 1,081 | 1,067 | 1,062 | 1.29 | 1.75 | 5 | 0.46 |
| | | | 2008-06-10 | 38 | 38 | 38 | 0.0 | 0.0 | 0 | 0.0 |
| University of Pittsburgh Molecular Library Screening Center | University of Pittsburgh Molecular Library Screening Center | PubChem | 2008-06-10 | 303 | 273 | 273 | 9.9 | 9.9 | 0 | 0.0 |
| UPCMLD | UPCMLD | PubChem | 2007-07-26 | 2,111 | 1,879 | 1,879 | 10.99 | 10.99 | 0 | 0.0 |
| | | | 2008-06-10 | 495 | 493 | 493 | 0.4 | 0.4 | 0 | 0.0 |
| USAMRIID In Silico-Screened Structures | USAMRIID | USAMRIID | 2004-06-01 | 376,062 | 359,673 | 359,554 | 4.35 | 4.38 | 119 | 0.03 |
| WDI | Derwent/Thomson Reuters | Derwent/Thomson Reuters | 2006-02-01 | 79,618 | 69,439 | 69,283 | 12.78 | 12.98 | 156 | 0.20 |
| Web of Science | Web of Science | PubChem | 2007-07-26 | 20 | 20 | 18 | 0.0 | 10.0 | 2 | 10.0 |
| Wombat 2005.02 | Sunset Molecular Discovery | Sunset Molecular Discovery | 2005-02-01 | 135,673 | 120,475 | 120,287 | 11.2 | 11.34 | 188 | 0.14 |
| xPharm | xPharm | PubChem | 2007-07-26 | 2,462 | 2,137 | 2,135 | 13.2 | 13.28 | 2 | 0.08 |
| ZINC | ZINC | PubChem | 2007-07-26 | 3,813,885 | 3,748,592 | 3,707,913 | 1.71 | 2.77 | 40,679 | 1.06 |
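The derived columns of this table appear to follow directly from the three count columns. The Python sketch below shows one plausible reading: the duplicate percentage is the share of records removed by de-duplication, the discrepancy is the difference between the two unique counts, and the last column is the difference between the two percentages. The function name and dictionary keys are hypothetical, and the formulas are inferred from the column headers rather than stated in the source.

```python
def duplicate_stats(records, unique_ficts, unique_ficus):
    """Derive the duplicate columns from the three count columns.

    The formulas are an assumed reading of the table headers.
    """
    pct_ficts = 100.0 * (records - unique_ficts) / records
    pct_ficus = 100.0 * (records - unique_ficus) / records
    return {
        "pct_dup_ficts": pct_ficts,
        "pct_dup_ficus": pct_ficus,
        "discrepancy": unique_ficts - unique_ficus,
        "pct_dup_diff": pct_ficus - pct_ficts,
    }


# Counts taken from the ACD 3D row of the table.
acd = duplicate_stats(221_661, 215_154, 214_614)
for key, value in acd.items():
    print(key, value)
```

For the ACD 3D row this reading gives a discrepancy of 540 and percentages that agree with the listed 2.93, 3.17 and 0.24 to within about 0.01.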
Number of tautomer conflicts. A conflict is a case in which a tautomer-invariant parent structure (FICuS) is assigned to more than one tautomer-sensitive parent structure (FICTS)
| Case | FICuS parent structure (count) | FICuS parent structure (%) | FICTS parent structure (count) | FICTS parent structure (%) | Original structure record (count) | Original structure record (%) |
|---|---|---|---|---|---|---|
| (a) | 119,632 | 0.17 | 285,502 | 0.40 | 315,277 | 0.30 |
| (b) | 398,079 | 0.56 | 877,182 | 1.22 | 1,013,198 | 0.98 |
| (c) | 561,142 | 0.79 | 1,334,800 | 1.85 | 3,590,508 | 3.47 |
| sum | 1,078,853 | 1.52 | 2,497,484 | 3.47 | 4,918,983 | 4.75 |
There are three types of such cases: (a) tautomer conflicts occur for a specific chemical compound in one or more databases, with conflicts even among structure records of a single database; (b) the same chemical compound is represented as different tautomers in different databases, but there are no conflicts within any single database; and (c) several groups of databases each consistently share one tautomer representation for a specific chemical compound, but the different database groups use different tautomer representations
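The three case types can be illustrated with a short Python sketch that groups records by their FICuS identifier and inspects the FICTS identifiers seen in each database. The record layout, the function name `classify_conflicts` and the toy identifiers are hypothetical. The boundary between (b) and (c) is an assumed reading of the description: a conflict is assigned to (c) when at least one tautomer representation is shared by two or more databases, and to (b) otherwise.

```python
from collections import defaultdict


def classify_conflicts(records):
    """records: iterable of (database, ficts, ficus) tuples.

    Returns {ficus: "a" | "b" | "c"} for FICuS identifiers that map to more
    than one FICTS identifier; identifiers without a conflict are omitted.
    """
    # ficus -> database -> set of FICTS identifiers seen in that database
    by_ficus = defaultdict(lambda: defaultdict(set))
    for database, ficts, ficus in records:
        by_ficus[ficus][database].add(ficts)

    result = {}
    for ficus, per_db in by_ficus.items():
        all_ficts = set().union(*per_db.values())
        if len(all_ficts) < 2:
            continue  # one tautomer representation only, no conflict
        if any(len(ficts_set) > 1 for ficts_set in per_db.values()):
            result[ficus] = "a"  # conflict inside at least one database
            continue
        # Every database is internally consistent here, so count how many
        # databases use each FICTS representation.
        users = defaultdict(int)
        for ficts_set in per_db.values():
            users[next(iter(ficts_set))] += 1
        # Assumed reading of (c): some representation is shared by a group
        # of two or more databases.
        result[ficus] = "c" if max(users.values()) > 1 else "b"
    return result


# Toy records with made-up identifiers.
records = [
    ("DB1", "T1", "U1"), ("DB1", "T2", "U1"),
    ("DB1", "T3", "U2"), ("DB2", "T4", "U2"),
    ("DB1", "T5", "U3"), ("DB2", "T5", "U3"),
    ("DB3", "T6", "U3"), ("DB4", "T6", "U3"),
    ("DB1", "T7", "U4"), ("DB2", "T7", "U4"),
]
result = classify_conflicts(records)
print(result)
```

In the toy data, U1 falls under (a), U2 under (b), U3 under (c), and U4 is not reported because all databases agree on a single representation.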
Fig. 7 Example of a tautomer conflict found for 1-phenyl-3-methyl-4-benzoyl-pyrazolone-5 (HPMBP)
Analysis of “global” tautomerism in CSDB
| Database name | Original publisher | Source/downloaded from | Downloaded/released at | Unique structure count (FICuS) | FICuS parent structures with formal tautomerism (count) | FICuS parent structures with formal tautomerism (%) | Occurrences of FICuS parent structures with multiple FICTS parent structure assignment (count) | Occurrences of FICuS parent structures with multiple FICTS parent structure assignment (%) | FICuS parent structures exclusive to the database release (count) | FICuS parent structures exclusive to the database release (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ACD 3D | MDL/Symyx | MDL/Symyx | 1999-01-01 | 214,614 | 124,829 | 58.16 | 30,373 | 14.15 | 21,623 | 10.08 |
| ACX | CambridgeSoft | CambridgeSoft | 1999-12-31 | 100,729 | 54,666 | 54.27 | 6,810 | 6.76 | 8,217 | 8.16 |
| Ambinter | Ambinter | PubChem | 2008-06-10 | 2,678,948 | 2,064,072 | 77.05 | 25,186 | 0.94 | 277,080 | 10.34 |
| Aronis | Aronis | PubChem | 2008-06-10 | 23,385 | 20,347 | 87.01 | 0 | 0.0 | 2,731 | 11.68 |
| Asinex | Asinex | PubChem | 2007-07-26 | 362,464 | 291,172 | 80.33 | 7 | <0.01 | 73,463 | 20.27 |
| Asinex Building Blocks | Asinex | Asinex | 2005-04-01 | 5,248 | 3,262 | 62.16 | 0 | 0.0 | 608 | 11.59 |
| Asinex Gold Collection | Asinex | Asinex | 2006-06-01 | 227,475 | 174,579 | 76.75 | 7 | <0.01 | 39,093 | 17.19 |
| Asinex Platinum Collection | Asinex | Asinex | 2006-06-01 | 130,646 | 115,075 | 88.08 | 10 | 0.01 | 34,174 | 26.16 |
| BIND | BIND | PubChem | 2007-07-26 | 1,203 | 921 | 76.56 | 31 | 2.58 | 226 | 18.79 |
| BindingDB | BindingDB | PubChem | 2007-07-26 | 8,458 | 7,068 | 83.57 | 30 | 0.35 | 1,163 | 13.75 |
| | | | 2008-06-10 | 12,699 | 10,528 | 82.9 | 3,956 | 31.15 | 825 | 6.5 |
| BioByte QSAR | BioByte | BioByte | 2006-05-01 | 153,801 | 91,447 | 59.46 | 52,936 | 34.42 | 10,235 | 6.65 |
| BioCyc | BioCyc | PubChem | 2007-07-26 | 1,285 | 931 | 72.45 | 108 | 8.4 | 275 | 21.4 |
| Biosynth | Biosynth | PubChem | 2008-06-10 | 1,931 | 1,160 | 60.07 | 356 | 18.44 | 218 | 11.29 |
| Calbiochem | Calbiochem | PubChem | 2008-06-10 | 1,591 | 1,184 | 74.42 | 275 | 17.28 | 225 | 14.14 |
| CambridgeSoft | CambridgeSoft | PubChem | 2007-07-26 | 10,120 | 6,213 | 61.39 | 1,408 | 13.91 | 619 | 6.12 |
| CC PMLSC | CC PMLSC | PubChem | 2007-07-26 | 217 | 168 | 77.42 | 0 | 0.0 | 24 | 11.06 |
| | | | 2008-06-10 | 172 | 157 | 91.28 | 0 | 0.0 | 54 | 31.4 |
| ChEBI | ChEBI | PubChem | 2007-07-26 | 8,238 | 4,266 | 51.78 | 684 | 8.3 | 921 | 11.18 |
| | | | 2008-06-10 | 2,515 | 1,307 | 51.97 | 310 | 12.33 | 245 | 9.74 |
| ChemBank | ChemBank | PubChem | 2007-07-26 | 338,520 | 254,500 | 75.18 | 3,632 | 1.07 | 37,414 | 11.05 |
| | | | 2008-06-10 | 1,011,147 | 816,927 | 80.79 | 405,912 | 40.14 | 69,909 | 6.91 |
| ChemBlock | ChemBlock | PubChem | 2007-07-26 | 107,192 | 78,845 | 73.55 | 0 | 0.0 | 17,959 | 16.75 |
| ChemBridge | ChemBridge | PubChem | 2007-07-26 | 433,970 | 322,423 | 74.3 | 201 | 0.05 | 53,926 | 12.43 |
| ChemBridge 100 k Lib | ChemBridge | ChemBridge | 2002-02-01 | 99,920 | 73,884 | 73.94 | 2 | <0.01 | 17,760 | 17.77 |
| ChemDB | ChemDB | PubChem | 2007-07-26 | 3,501,958 | 2,592,957 | 74.04 | 59,112 | 1.69 | 468,965 | 13.39 |
| | | | 2008-06-10 | 2 | 0 | 0.0 | 2 | 100.0 | 0 | 0.0 |
| ChemDiv Diversity Collection | ChemDiv | ChemDiv | 2004-09-01 | 495,395 | 385,876 | 77.89 | 71,699 | 14.47 | 7,345 | 1.48 |
| ChemExper Chemical Directory | ChemExper Chemical Directory | PubChem | 2007-07-26 | 155,698 | 93,621 | 60.13 | 20,170 | 12.95 | 458 | 0.29 |
| ChemSpider | ChemSpider | PubChem | 2008-06-10 | 16,537,474 | 11,418,636 | 69.05 | 930,014 | 5.62 | 5,577,994 | 33.73 |
| CHMIS-C | UMich | UMich | 2004-11-01 | 7,992 | 4,455 | 55.74 | 493 | 6.17 | 3,526 | 44.12 |
| CMC | MDL/Symyx | MDL/Symyx | 2006-01-01 | 8,732 | 5,818 | 66.63 | 950 | 10.88 | 235 | 2.69 |
| CMLD-BU | CMLD-BU | PubChem | 2007-07-26 | 1,619 | 1,240 | 76.59 | 19 | 1.17 | 13 | 0.8 |
| Columbia University Molecular Screening Center | Columbia University Molecular Screening Center | PubChem | 2008-06-10 | 391 | 198 | 50.64 | 14 | 3.58 | 153 | 39.13 |
| ComGenex | ComGenex | ComGenex | 2006-03-01 | 184,266 | 152,569 | 82.8 | 6,579 | 3.57 | 1,482 | 0.8 |
| ComGenex Unique Reagents | ComGenex | ComGenex | 2006-03-01 | 330 | 257 | 77.88 | 70 | 21.21 | 1 | 0.3 |
| Diabetic Complications Screening | Diabetic Complications Screening | PubChem | 2007-07-26 | 1 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| DiscoveryGate | Symyx | PubChem | 2007-07-26 | 4,581,587 | 3,029,660 | 66.13 | 246,048 | 5.37 | 10,503 | 0.23 |
| | | | 2008-06-10 | 1,260,101 | 993,999 | 78.88 | 41,697 | 3.31 | 57,180 | 4.54 |
| DrugBank | DrugBank | PubChem | 2008-06-10 | 4,409 | 3,062 | 69.45 | 605 | 13.72 | 450 | 10.21 |
| Dupont Library | Dupont | MDDP/NCI | 2004-04-01 | 174,745 | 107,268 | 61.39 | 10,644 | 6.09 | 28,654 | 16.4 |
| Emory University Molecular Libraries Screening Center | Emory University Molecular Libraries Screening Center | PubChem | 2007-07-26 | 101,523 | 79,264 | 78.07 | 13,471 | 13.27 | 42 | 0.04 |
| | | | 2008-06-10 | 4,333 | 3,029 | 69.91 | 700 | 16.16 | 152 | 3.51 |
| EPA DSSTox | EPA DSSTox | PubChem | 2007-07-26 | 4,101 | 1,998 | 48.72 | 600 | 14.63 | 1 | 0.02 |
| | | | 2008-06-10 | 6,630 | 2,870 | 43.29 | 770 | 11.61 | 23 | 0.35 |
| Exchemistry | Exchemistry | PubChem | 2008-06-10 | 2,057 | 1,465 | 71.22 | 96 | 4.67 | 0 | 0.0 |
| FDA CDER Chronic/Subchronic | FDA/CDER | FDA/CDER | 2006-05-01 | 84 | 56 | 66.67 | 15 | 17.86 | 0 | 0.0 |
| FDA CDER Genetox | FDA/CDER | FDA/CDER | 2006-05-01 | 181 | 125 | 69.09 | 30 | 16.57 | 0 | 0.0 |
| FDA CFSAN Genetox | FDA/CFSAN | FDA/CFSAN | 2006-05-01 | 400 | 216 | 54.0 | 66 | 16.5 | 2 | 0.5 |
| FDA Genet/Reprod/Carcino | FDA/CDER | FDA/CDER | 2006-01-01 | 6,810 | 3,368 | 49.46 | 788 | 11.57 | 1,165 | 17.11 |
| InFarmatik | InFarmatik | PubChem | 2008-06-10 | 1,077 | 519 | 48.19 | 195 | 18.11 | 1 | 0.09 |
| iResearch Library | ChemNavigator | ChemNavigator | 2004-07-01 | 13,323,974 | 10,687,768 | 80.21 | 512,426 | 3.85 | 9,623,698 | 72.23 |
| | | | 2004-10-01 | 261,858 | 204,119 | 77.95 | 45,095 | 17.22 | 104,306 | 39.83 |
| | | | 2005-01-01 | 5,035,543 | 4,255,548 | 84.51 | 105,694 | 2.1 | 4,328,401 | 85.96 |
| | | | 2005-04-01 | 480,350 | 412,308 | 85.83 | 12,008 | 2.5 | 324,134 | 67.48 |
| | | | 2005-07-01 | 479,289 | 402,593 | 84.0 | 26,610 | 5.55 | 259,516 | 54.15 |
| | | | 2005-10-01 | 376,265 | 307,360 | 81.69 | 14,113 | 3.75 | 190,820 | 50.71 |
| | | | 2006-01-01 | 210,490 | 173,675 | 82.51 | 16,300 | 7.74 | 34,843 | 16.55 |
| | | | 2006-04-01 | 579,878 | 476,119 | 82.11 | 21,960 | 3.79 | 429,856 | 74.13 |
| | | | 2006-07-01 | 4,217,935 | 3,534,703 | 83.8 | 60,074 | 1.42 | 3,456,568 | 81.95 |
| | | | 2006-10-01 | 220,322 | 161,823 | 73.45 | 7,892 | 3.58 | 115,867 | 52.59 |
| | | | 2007-01-01 | 4,045,492 | 3,565,754 | 88.14 | 19,237 | 0.48 | 3,878,268 | 95.87 |
| | | | 2007-07-01 | 333,561 | 271,602 | 81.42 | 5,797 | 1.74 | 245,861 | 73.71 |
| | | | 2007-10-01 | 3,148,008 | 2,805,813 | 89.13 | 17,595 | 0.56 | 2,865,405 | 91.02 |
| | | | 2008-01-01 | 528,280 | 415,137 | 78.58 | 7,759 | 1.47 | 412,591 | 78.1 |
| | | | 2008-04-01 | 536,302 | 366,459 | 68.33 | 5,087 | 0.95 | 438,680 | 81.8 |
| | | | 2008-07-01 | 548,335 | 404,730 | 73.81 | 3,900 | 0.71 | 519,290 | 94.7 |
| | | | 2008-10-01 | 380,299 | 295,198 | 77.62 | 1,690 | 0.44 | 366,496 | 96.37 |
| | | | 2009-01-01 | 564,082 | 373,572 | 66.23 | 8,674 | 1.54 | 436,089 | 77.31 |
| | | | 2009-04-01 | 784,343 | 440,348 | 56.14 | 6,879 | 0.88 | 735,868 | 93.82 |
| | | | 2009-07-01 | 22,211,624 | 21,995,486 | 99.03 | 91,672 | 0.41 | 21,859,016 | 98.41 |
| Jubilant Kinase Inhibitors | Jubilant | Jubilant | 2004-12-01 | 163,518 | 153,169 | 93.67 | 7,916 | 4.84 | 103,571 | 63.34 |
| KEGG | KEGG | PubChem | 2007-07-26 | 14,233 | 9,519 | 66.88 | 1,765 | 12.4 | 884 | 6.21 |
| | | | 2008-06-10 | 2,475 | 1,662 | 67.15 | 216 | 8.73 | 438 | 17.7 |
| KUMGM | KUMGM | PubChem | 2007-07-26 | 3,107 | 1,883 | 60.61 | 130 | 4.18 | 796 | 25.62 |
| Leadscope FDA | Leadscope/FDA | PubChem | 2007-07-26 | 588 | 345 | 58.67 | 98 | 16.67 | 0 | 0.0 |
| LifeChem Building Blocks | LifeChem | LifeChem | 2006-05-01 | 4,020 | 2,614 | 65.02 | 713 | 17.74 | 2 | 0.05 |
| LifeChem Stock Compounds | LifeChem | LifeChem | 2006-05-01 | 204,765 | 158,064 | 77.19 | 26,429 | 12.91 | 581 | 0.28 |
| LifeChem Virtual Compounds | LifeChem | LifeChem | 2006-05-01 | 179,648 | 138,146 | 76.9 | 6,405 | 3.57 | 176 | 0.1 |
| LipidMAPS | LipidMAPS | PubChem | 2007-07-26 | 9,590 | 7,926 | 82.65 | 207 | 2.16 | 1,062 | 11.07 |
| | | | 2008-06-10 | 284 | 174 | 61.27 | 1 | 0.35 | 180 | 63.38 |
| MDDR | MDL/Symyx | MDL/Symyx | 2006-03-01 | 164,561 | 124,584 | 75.71 | 7,020 | 4.27 | 69,148 | 42.02 |
| MDL Patent Database | MDL/Symyx | MDL/Symyx | 2005-11-01 | 30,840 | 20,239 | 65.63 | 1,959 | 6.35 | 12,647 | 41.01 |
| MDL Toxicity Database | MDL/Symyx | MDL/Symyx | 2005-11-01 | 147,006 | 81,460 | 55.41 | 8,847 | 6.02 | 31,877 | 21.68 |
| MDPI | MDPI | MDPI | 2004-11-01 | 10,478 | 6,026 | 57.51 | 2,006 | 19.14 | 14 | 0.13 |
| MICAD | MICAD | PubChem | 2007-07-26 | 187 | 106 | 56.68 | 5 | 2.67 | 24 | 12.83 |
| 2008-06-10 | 76 | 52 | 68.42 | 1 | 1.32 | 52 | 68.42 | |||
| MLSMR | MLSMR | PubChem | 2007-07-26 | 204,198 | 158,268 | 77.51 | 36,550 | 17.9 | 1,182 | 0.58 |
| 2008-06-10 | 75,453 | 59,699 | 79.12 | 11,106 | 14.72 | 2,996 | 3.97 | |||
| MMDB | MMDB | PubChem | 2007-07-26 | 9,373 | 6,805 | 72.6 | 987 | 10.53 | 3,356 | 35.8 |
| 2008-06-10 | 4,428 | 3116 | 70.37 | 488 | 11.02 | 1,651 | 37.29 | |||
| MOLI | MOLI | PubChem | 2007-07-26 | 1,774 | 1030 | 58.06 | 27 | 1.52 | 1,085 | 61.16 |
| MTDP/NCI | MTDP/NCI | PubChem | 2007-07-26 | 105,931 | 77,945 | 73.58 | 18,545 | 17.51 | 242 | 0.23 |
| 2008-06-10 | 301 | 203 | 67.44 | 0 | 0.0 | 1 | 0.33 | |||
| NatChemBio | NatChemBio | PubChem | 2007-07-26 | 1,446 | 1,084 | 74.97 | 160 | 11.07 | 58 | 4.01 |
| 2008-06-10 | 841 | 595 | 70.75 | 103 | 12.25 | 246 | 29.25 | |||
| NCGC | NCGC | PubChem | 2007-07-26 | 53,703 | 41,392 | 77.08 | 4,311 | 8.03 | 184 | 0.34 |
| NCGC | NCGC | PubChem | 2008-06-10 | 10,971 | 7,534 | 68.67 | 994 | 9.06 | 2,083 | 18.99 |
| NCI Open Database | NCI/DTP | NCI/CADD | 2006-07-01 | 253,550 | 150,496 | 59.36 | 28,804 | 11.36 | 29,119 | 11.48 |
| NCI Open Database | NCI/DTP | PubChem | 2007-07-26 | 250,854 | 148,818 | 59.32 | 30,026 | 11.97 | 14,321 | 5.71 |
| NCI Open Database | NCI/DTP | PubChem | 2008-06-10 | 5,365 | 3,813 | 71.07 | 249 | 4.64 | 3,354 | 62.52 |
| NCI-NP | NCI/DTP | NCI/DTP | 2002-02-01 | 119,587 | 76,664 | 64.11 | 3,644 | 3.05 | 72,939 | 60.99 |
| NIAID HIV/OI | NIAID | NIAID | 2006-02-01 | 132,806 | 74,347 | 55.98 | 16,002 | 12.05 | 10,077 | 7.59 |
| NIAID HIV/OI | NIAID | PubChem | 2007-07-26 | 149,341 | 88,058 | 58.96 | 18,212 | 12.19 | 10,495 | 7.03 |
| NIAID HIV/OI | NIAID | PubChem | 2008-06-10 | 4,052 | 3,064 | 75.62 | 591 | 14.59 | 110 | 2.71 |
| NIH Clinical Collection | NIH Clinical Collection | PubChem | 2008-06-10 | 472 | 333 | 70.55 | 85 | 18.01 | 3 | 0.64 |
| NINDS-ADSP | NINDS | PubChem | 2007-07-26 | 1,031 | 716 | 69.45 | 137 | 13.29 | 10 | 0.97 |
| NINDS-PANACHE | NINDS | PubChem | 2007-07-26 | 10 | 10 | 100.0 | 3 | 30.0 | 0 | 0.0 |
| NIST MS-Lib | NIST | NIST | 2006-01-01 | 170,917 | 75,311 | 44.06 | 13,615 | 7.97 | 3,295 | 1.93 |
| NIST MS-Lib | NIST | PubChem | 2007-07-26 | 170,920 | 75,306 | 44.06 | 13,618 | 7.97 | 3,042 | 1.78 |
| NIST WebBook | NIST | NIST | 2006-01-01 | 51,451 | 16,897 | 32.84 | 2,482 | 4.82 | 547 | 1.06 |
| NIST WebBook | NIST | PubChem | 2007-07-26 | 51,403 | 16,888 | 32.85 | 2,481 | 4.83 | 540 | 1.05 |
| NLM ChemIDplus | NLM | NLM | 2006-03-01 | 255,138 | 137,090 | 53.73 | 19,911 | 7.8 | 7,747 | 3.04 |
| NLM ChemIDplus | NLM | PubChem | 2007-07-26 | 273,597 | 147,403 | 53.88 | 21,223 | 7.76 | 8,915 | 3.26 |
| NMMLSC | NMMLSC | PubChem | 2007-07-26 | 5,770 | 4,637 | 80.36 | 916 | 15.88 | 0 | 0.0 |
| NMMLSC | NMMLSC | PubChem | 2008-06-10 | 438 | 354 | 80.82 | 17 | 3.88 | 25 | 5.71 |
| NMRShiftDB | NMRShiftDB | PubChem | 2007-07-26 | 18,956 | 7,729 | 40.77 | 1,584 | 8.36 | 1,810 | 9.55 |
| NTP-CHSD | NTP | NTP | 1991-08-01 | 1,383 | 504 | 36.44 | 202 | 14.61 | 87 | 6.29 |
| NTP-PTC | NTP | NTP | 2002-09-01 | 401 | 157 | 39.15 | 60 | 14.96 | 21 | 5.24 |
| ORST Small Molecule Screening Center | ORST Small Molecule Screening Center | PubChem | 2008-06-10 | 1,993 | 1,396 | 70.05 | 268 | 13.45 | 1 | 0.05 |
| PASS Training Set | LSFBDD/IBMC/RAMS | LSFBDD/IBMC/RAMS | 2006-02-01 | 60,172 | 44,157 | 73.38 | 5,402 | 8.98 | 6,585 | 10.94 |
| PCMD | PCMD | PubChem | 2007-07-26 | 27 | 24 | 88.89 | 2 | 7.41 | 0 | 0.0 |
| PCMD | PCMD | PubChem | 2008-06-10 | 65 | 51 | 78.46 | 1 | 1.54 | 2 | 3.08 |
| PDSP | PDSP | PubChem | 2007-07-26 | 2,867 | 1,895 | 66.1 | 483 | 16.85 | 31 | 1.08 |
| ProbeDB | ProbeDB | PubChem | 2007-07-26 | 1 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| ProbeDB | ProbeDB | PubChem | 2008-06-10 | 1 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| Prous Science Drugs of the Future | Prous Science Drugs of the Future | PubChem | 2007-07-26 | 4,417 | 3,182 | 72.04 | 534 | 12.09 | 109 | 2.47 |
| Prous Science Drugs of the Future | Prous Science Drugs of the Future | PubChem | 2008-06-10 | 202 | 146 | 72.28 | 25 | 12.38 | 9 | 4.46 |
| R&D chemicals | R&D chemicals | PubChem | 2008-06-10 | 8,352 | 3,260 | 39.03 | 585 | 7.0 | 51 | 0.61 |
| RTECS | NIOSH/CDC | NIOSH/CDC | 2004-06-01 | 137,094 | 75,400 | 55.0 | 9,065 | 6.61 | 20,510 | 14.96 |
| SDCCG | SDCCG | PubChem | 2007-07-26 | 54,565 | 40,767 | 74.71 | 9,247 | 16.95 | 11 | 0.02 |
| SDCCG | SDCCG | PubChem | 2008-06-10 | 1,172 | 652 | 55.63 | 193 | 16.47 | 98 | 8.36 |
| SGC-Ox | SGC-Ox | PubChem | 2007-07-26 | 308 | 275 | 89.29 | 92 | 29.87 | 10 | 3.25 |
| SGC-Sto | SGC-Sto | PubChem | 2007-07-26 | 17 | 17 | 100.0 | 8 | 47.06 | 0 | 0.0 |
| Shanghai Institute of Organic Chemistry | Shanghai Institute of Organic Chemistry | PubChem | 2008-06-10 | 2,428 | 1,975 | 81.34 | 388 | 15.98 | 394 | 16.23 |
| Sigma–Aldrich | Sigma–Aldrich | PubChem | 2007-07-26 | 37,519 | 15,908 | 42.4 | 3,122 | 8.32 | 539 | 1.44 |
| SMID | SMID | PubChem | 2007-07-26 | 6,500 | 4,566 | 70.25 | 848 | 13.05 | 1,081 | 16.63 |
| Southern Research Institute—HTS | Southern Research Institute—HTS | PubChem | 2008-06-10 | 1,113 | 716 | 64.33 | 161 | 14.47 | 18 | 1.62 |
| Specs | Specs | PubChem | 2007-07-26 | 205,956 | 150,127 | 72.89 | 28,320 | 13.75 | 33 | 0.02 |
| SRMLSC | SRMLSC | PubChem | 2008-06-10 | 304 | 188 | 61.84 | 29 | 9.54 | 14 | 4.61 |
| Structural Genomics Consortium | Structural Genomics Consortium | PubChem | 2007-07-26 | 87 | 74 | 85.06 | 27 | 31.01 | 0 | 0.0 |
| Structural Genomics Consortium | Structural Genomics Consortium | PubChem | 2008-06-10 | 90 | 77 | 85.56 | 25 | 27.78 | 0 | 0.0 |
| The Scripps Research Institute Molecular Screening Center | The Scripps Research Institute Molecular Screening Center | PubChem | 2007-07-26 | 2 | 2 | 100.0 | 0 | 0.0 | 0 | 0.0 |
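| The Scripps Research Institute Molecular Screening Center | The Scripps Research Institute Molecular Screening Center | PubChem | 2008-06-10 | 16,180 | 10,609 | 65.57 | 2,663 | 16.46 | 50 | 0.31 |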
| Thomson Pharma | Thomson Pharma | PubChem | 2007-07-26 | 2,277,301 | 1,389,970 | 61.04 | 132,517 | 5.82 | 129,218 | 5.67 |
| Thomson Pharma | Thomson Pharma | PubChem | 2008-06-10 | 223,630 | 145,692 | 65.15 | 9,605 | 4.3 | 97,628 | 43.66 |
| Total TOSLab Building Blocks | Total TOSLab Building Blocks | PubChem | 2007-07-26 | 909 | 666 | 73.27 | 244 | 26.84 | 0 | 0.0 |
| UM-BBD | UM-BBD | PubChem | 2007-07-26 | 1,062 | 579 | 54.52 | 198 | 18.64 | 57 | 5.37 |
| UM-BBD | UM-BBD | PubChem | 2008-06-10 | 38 | 21 | 55.26 | 6 | 15.79 | 12 | 31.58 |
| University of Pittsburgh Molecular Library Screening Center | University of Pittsburgh Molecular Library Screening Center | PubChem | 2008-06-10 | 273 | 248 | 90.48 | 67 | 24.54 | 0 | 0.0 |
| UPCMLD | UPCMLD | PubChem | 2007-07-26 | 1,879 | 1,415 | 75.31 | 38 | 2.02 | 25 | 1.33 |
| UPCMLD | UPCMLD | PubChem | 2008-06-10 | 493 | 418 | 84.79 | 10 | 2.03 | 91 | 18.46 |
| USAMRIID In Silico-Screened Structures | USAMRIID | USAMRIID | 2004-06-01 | 359,554 | 326,765 | 90.88 | 146 | 0.04 | 359,429 | 99.97 |
| WDI | Derwent/Thomson Reuters | Derwent/Thomson Reuters | 2006-02-01 | 69,283 | 49,238 | 71.07 | 6,459 | 9.32 | 8,996 | 12.98 |
| Web of Science | Web of Science | PubChem | 2007-07-26 | 18 | 9 | 50.0 | 2 | 11.11 | 0 | 0.0 |
| Wombat 2005.02 | Sunset Molecular Discovery | Sunset Molecular Discovery | 2005-02-01 | 120,287 | 90,072 | 74.88 | 9,272 | 7.71 | 27,974 | 23.26 |
| xPharm | xPharm | PubChem | 2007-07-26 | 2,135 | 1,535 | 71.9 | 374 | 17.52 | 19 | 0.89 |
| ZINC | ZINC | PubChem | 2007-07-26 | 3,707,913 | 2,944,497 | 79.41 | 300,108 | 8.09 | 153,822 | 4.15 |
Frequency of application of CACTVS transforms in the systematic generation of all tautomers for the FICuS parent structure (canonical tautomer) set
| Transform rule | Generated tautomers (count) | Generated tautomers (%) |
|---|---|---|
| Rule 1: 1.3 (thio)keto/(thio)enol | 173,002,712 | 25.4 |
| Rule 2: 1.5 (thio)keto/(thio)enol | 11,541,452 | 1.7 |
| Rule 3: simple (aliphatic) imine | 35,917,415 | 5.3 |
| Rule 4: special imine | 4,306,155 | 0.6 |
| Rule 5: 1.3 aromatic heteroatom H shift | 25,678,446 | 3.8 |
| Rule 6: 1.3 heteroatom H shift | 250,453,882 | 36.8 |
| Rule 7: 1.5 (aromatic) heteroatom H shift (1) | 27,542,770 | 4.0 |
| Rule 8: 1.5 aromatic heteroatom H shift (2) | 26,819 | <0.1 |
| Rule 9: 1.7 (aromatic) heteroatom H shift | 57,242,472 | 8.4 |
| Rule 10: 1.9 (aromatic) heteroatom H shift | 5,061,731 | 0.7 |
| Rule 11: 1.11 (aromatic) heteroatom H shift | 1,374,235 | 0.2 |
| Rule 12: furanones | 17,860,604 | 2.6 |
| Rule 13: keten/ynol exchange | 57,989 | <0.1 |
| Rule 14: ionic nitro/aci-nitro | 428,266 | 0.1 |
| Rule 15: pentavalent nitro/aci-nitro | 129 | <0.1 |
| Rule 16: oxim/nitroso | 505,695 | 0.1 |
| Rule 17: oxim/nitroso via phenol | 131,502 | <0.1 |
| Rule 18: cyanic/iso-cyanic acids | 181 | <0.1 |
| Rule 19: formamidinesulfinic acids | 1,392 | <0.1 |
| Rule 20: isocyanides | 229 | <0.1 |
| Rule 21: phosphonic acids | 54,926 | <0.1 |
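
The percentage column above can be approximately reproduced by dividing each count by the size of the complete tautomer set. The sketch below assumes a rounded denominator of 680 million structures and assumes that shares rounding to 0.0 are reported as "<0.1"; the names `TOTAL_TAUTOMERS` and `share` are illustrative and do not come from the paper.

```python
# Assumption: the percentages are relative to the full set of roughly
# 680 million tautomers (parents plus generated structures).
TOTAL_TAUTOMERS = 680_000_000


def share(count: int, total: int = TOTAL_TAUTOMERS) -> str:
    """Return count/total as a percentage string with one decimal place.

    Non-zero shares that would round to 0.0 are shown as '<0.1',
    which is the convention the table appears to follow.
    """
    pct = 100.0 * count / total
    if 0 < pct < 0.05:
        return "<0.1"
    return f"{pct:.1f}"


if __name__ == "__main__":
    for rule, count in [
        ("Rule 1", 173_002_712),
        ("Rule 6", 250_453_882),
        ("Rule 8", 26_819),
        ("Rule 14", 428_266),
    ]:
        print(rule, share(count))
```

With this denominator, the two dominant transforms (Rules 1 and 6) together account for more than 60% of all structures in the set.
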
Distribution of the number of tautomers generated per FICuS parent structure
| Canonical tautomers (FICuS parent structures) with | Count | % |
|---|---|---|
| no tautomers | 9,756,186 | 13.8 |
| one tautomer | 10,721,845 | 15.2 |
| 2–10 tautomers | 33,532,284 | 47.5 |
| 11–25 tautomers | 10,870,312 | 15.4 |
| 26–50 tautomers | 2,622,587 | 3.7 |
| 51–100 tautomers | 1,136,066 | 1.6 |
| 101–200 tautomers | 565,199 | 0.8 |
| 201–300 tautomers | 104,875 | 0.1 |
| 301–400 tautomers | 35,144 | <0.1 |
| 401–500 tautomers | 17,241 | <0.1 |
| 501–600 tautomers | 4,323 | <0.1 |
| 601–700 tautomers | 1,400 | <0.1 |
| 701–800 tautomers | 362 | <0.1 |
| 801–832 tautomers | 3 | <0.1 |
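
As a cross-check on the tabulated figures, the counts in the tables can be summed directly. The sketch below assumes that the 680 million total consists of the FICuS parent structures plus the tautomers generated from them by the 21 transforms; under that reading, the two sums should add up to the number of structures binned in the Tanimoto similarity table that follows. The variable names are illustrative.

```python
# Generated-tautomer counts for transform rules 1 to 21, in table order.
generated_by_rule = [
    173_002_712, 11_541_452, 35_917_415, 4_306_155, 25_678_446,
    250_453_882, 27_542_770, 26_819, 57_242_472, 5_061_731,
    1_374_235, 17_860_604, 57_989, 428_266, 129,
    505_695, 131_502, 181, 1_392, 229, 54_926,
]

# Parent-structure counts per bin of the tautomers-per-parent distribution.
parents_by_bin = [
    9_756_186, 10_721_845, 33_532_284, 10_870_312, 2_622_587,
    1_136_066, 565_199, 104_875, 35_144, 17_241,
    4_323, 1_400, 362, 3,
]

# Structure counts per Tanimoto index range, lowest range first.
tanimoto_by_bin = [
    0, 6, 6_580, 369_331, 6_304_436,
    36_448_651, 111_954_384, 214_747_976, 310_725_465,
]

generated = sum(generated_by_rule)
parents = sum(parents_by_bin)
tanimoto_total = sum(tanimoto_by_bin)

if __name__ == "__main__":
    print(f"generated tautomers: {generated:,}")
    print(f"parent structures:   {parents:,}")
    print(f"parents + generated: {parents + generated:,}")
    print(f"Tanimoto table sum:  {tanimoto_total:,}")
```

The sums agree exactly, which supports reading the 680 million figure as parents and generated tautomers combined.
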
Distribution of Tanimoto similarities in the entire set of tautomers (680 million structures) between the FICuS parent structure (canonical tautomer) and all derived tautomers
| Tanimoto index range | Count | % |
|---|---|---|
| >0.0–0.2 | 0 | 0.0 |
| >0.2–0.3 | 6 | <0.1 |
| >0.3–0.4 | 6,580 | <0.1 |
| >0.4–0.5 | 369,331 | <0.1 |
| >0.5–0.6 | 6,304,436 | 0.9 |
| >0.6–0.7 | 36,448,651 | 5.3 |
| >0.7–0.8 | 111,954,384 | 16.4 |
| >0.8–0.9 | 214,747,976 | 31.5 |
| >0.9–1.0 | 310,725,465 | 45.6 |
The Tanimoto similarities were calculated using the PubChem fingerprints
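
For reference, the Tanimoto index of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below is a generic illustration in plain Python, with each fingerprint represented as a set of on-bit positions. It does not generate PubChem fingerprints, and `tanimoto`, `bin_label`, and `BIN_EDGES` are hypothetical names chosen to mirror the ranges in the table above.

```python
# Range boundaries used in the table; each range excludes its lower
# bound and includes its upper bound.
BIN_EDGES = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def tanimoto(a: set, b: set) -> float:
    """Tanimoto index |a & b| / |a | b| for two sets of on-bit positions."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


def bin_label(t: float) -> str:
    """Map a Tanimoto index in (0, 1] to the label of its table range."""
    if not 0.0 < t <= 1.0:
        raise ValueError("Tanimoto index must be in (0, 1]")
    for lower, upper in zip(BIN_EDGES, BIN_EDGES[1:]):
        if t <= upper:
            return f">{lower:.1f}–{upper:.1f}"
    raise AssertionError("unreachable for t <= 1.0")


if __name__ == "__main__":
    parent = {1, 2, 3}
    tautomer = {1, 2, 3, 4}
    t = tanimoto(parent, tautomer)
    print(t, bin_label(t))
```

Because the index depends only on which bits are set, the same pair of tautomers can score very differently under different fingerprint definitions.
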
Fig. 8 Low Tanimoto similarity between different tautomers. Structure 21 is regarded as the canonical tautomer by CACTVS; structures 22–24 are formal tautomers generated from 21. The italic numbers are the Tanimoto similarity indices between the canonical tautomer and the respective tautomer structure, calculated using (a) PubChem/CACTVS E_SCREEN fingerprints (881-bit fragment set), (b) extended connectivity fingerprints (ECFPs) with lengths of 2, 4, and 6 as implemented in Pipeline Pilot (1024-bit hashed), (c) functional class fingerprints (FCFPs) with lengths of 2, 4, and 6 as implemented in Pipeline Pilot (1024-bit hashed), and (d) MDL Public Keys as implemented in Pipeline Pilot