Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, Steve Heller.
Abstract
InChIKey is a 27-character compacted (hashed) version of InChI which is intended for Internet and database searching/indexing and is based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes molecular skeleton while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (collision). The appearance of collision itself does not compromise the signature as collision-free hashing is impossible; the only viable approach is to set and keep a reasonable level of collision resistance which is sufficient for typical applications.

We tested, in computational experiments, how well the real-life InChIKey collision resistance corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison to the theoretical statistical frequencies. For the relatively short second block, an exhaustive direct testing was performed. We computed and compared to theory the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets). For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparison of experimental and theoretical frequencies for the various four-letter sequences which may appear in the first block body.

From the results of our computational experiments we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
Year: 2012 PMID: 23256896 PMCID: PMC3558395 DOI: 10.1186/1758-2946-4-39
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
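The block structure described in the abstract can be illustrated with a small parser. This is only a sketch: the abstract states the 27-character length and the existence of a first and a second block, while the exact layout used here (14 letters, then 8 hash letters plus a flag letter and a version letter, then one protonation letter) is an assumption taken from the general InChI documentation. The key in the example is a made-up placeholder, not the key of a real compound, and `split_inchikey` is a hypothetical helper name.

```python
import re

# Assumed layout (not stated in the abstract):
# 14 letters - 8 hash letters + flag letter + version letter - 1 letter.
INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")


def split_inchikey(key: str) -> dict:
    """Split an InChIKey string into its blocks, or raise ValueError."""
    match = INCHIKEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a well-formed 27-character InChIKey: {key!r}")
    first, second, flag, version, protonation = match.groups()
    return {
        "first_block": first,        # hash of the molecular skeleton
        "second_block": second,      # hash of the isomerism layers
        "flag": flag,                # assumed: standard/non-standard marker
        "version": version,          # assumed: InChI version marker
        "protonation": protonation,  # assumed: protonation indicator
    }


# Placeholder key, invented purely to show the shape of the string.
example = "ABCDEFGHIJKLMN-OPQRSTUVSA-N"
parts = split_inchikey(example)
print(len(example), parts["first_block"], parts["second_block"])
```

Keeping the two hashed blocks separate is what allows their collision behavior to be studied independently, as the experiments summarized here do.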
Figure 1. Molecular skeleton of Spongistatin I.
Comparison of observed and theoretically expected average numbers of InChIKey second block collisions for stereoisomers of Spongistatin I
| Number of isomers | Expected (theory) | Observed | Number of samplings |
|---|---|---|---|
| 10000 | 0.0004 | 0.0004 | 10000 |
| 50000 | 0.0091 | 0.0125 | 10000 |
| 100000 | 0.0364 | 0.0364 | 10000 |
| 250000 | 0.2274 | 0.2264 | 10000 |
| 500000 | 0.9095 | 0.9091 | 10000 |
| 1000000 | 3.6380 | 3.6390 | 10000 |
| 2000000 | 14.5519 | 14.6810 | 1000 |
| 4000000 | 58.2076 | 58.6810 | 1000 |
| 8000000 | 232.8306 | 234.9040 | 1000 |
| 16000000 | 931.3225 | 941.2460 | 1000 |
| 32000000 | 3725.2902 | 3766.7530 | 1000 |
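The first of the two collision columns in the table above is consistent with a simple birthday-type estimate: the expected number of colliding pairs among n keys drawn uniformly from a space of N values is n(n − 1)/(2N). The sketch below assumes N = 2^37 for the hashed part of the second block. That value is not stated in this record; it is chosen because it reproduces the tabulated figures, and the function name is hypothetical.

```python
# Assumption: the hashed part of the second block spans 2**37 values.
SECOND_BLOCK_SPACE = 2 ** 37


def expected_colliding_pairs(n: int, space: int = SECOND_BLOCK_SPACE) -> float:
    """Expected number of colliding pairs among n uniformly hashed items."""
    return n * (n - 1) / (2 * space)


for n in (10_000, 500_000, 1_000_000, 32_000_000):
    print(f"{n:>10d}  {expected_colliding_pairs(n):.4f}")
```

Because the estimate grows with the square of n, doubling the number of isomers roughly quadruples the expected number of collisions, which matches the progression in the table from 1 000 000 isomers upward.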
Figure 2. The observed (circles) and theoretically expected (curve) average number of InChIKey second block collisions vs. the number of considered stereoisomers of Spongistatin I. a) The whole data range; abscissa values: log(number of isomers); b) low-collision region; abscissa values: number of isomers.
Figure 3. The dependence of the observed average number of InChIKey second block collisions for 370 000-entry datasets on the number of samplings m.
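Averaging a collision count over m samplings can be mimicked with a toy experiment. The sketch below is not the authors' procedure: it hashes synthetic labels rather than InChI strings of real stereoisomers, truncates SHA-256 to 16 bits so that collisions appear at small dataset sizes, and uses hypothetical function names and parameters throughout. It only shows how an averaged observed count can be set against the estimate n(n − 1)/(2N).

```python
import hashlib
from collections import Counter


def truncated_hash(text: str, bits: int) -> int:
    """Top `bits` bits of the SHA-256 digest of `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def colliding_pairs(labels, bits: int) -> int:
    """Number of label pairs that share the same truncated hash."""
    counts = Counter(truncated_hash(label, bits) for label in labels)
    return sum(c * (c - 1) // 2 for c in counts.values())


def average_collisions(n: int, bits: int, m: int) -> float:
    """Average colliding-pair count over m synthetic datasets of size n."""
    total = 0
    for sample in range(m):
        labels = (f"sample{sample}-item{i}" for i in range(n))
        total += colliding_pairs(labels, bits)
    return total / m


# Toy parameters (assumptions, far smaller than the datasets in the paper).
n, bits, m = 2000, 16, 20
observed = average_collisions(n, bits, m)
expected = n * (n - 1) / (2 * 2 ** bits)
print(f"observed average: {observed:.2f}, expected: {expected:.2f}")
```

Increasing m narrows the scatter of the observed average around the estimate, which is the kind of dependence the figure describes.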
Figure 4. Normalized frequencies of various letters within the first block of InChIKey. Measured using InChIKeys for 1 097 996 constitutional isomers of C8H8Cl3F5; the values are normalized to the frequency of ‘A’.
Selected normalized letter frequencies for various positions (1 to 14) in the InChIKey first block
| 1.0000 | 0.9221 | 0.9130 | 0.9917 | 0.9213 | 0.9227 | 1.0039 | | 0.9250 | 1.2273 | 0.9390 | ||
| … | … | … | … | … | … | … | … | … | … | … | ||
| 0.9898 | 0.9223 | 0.9182 | 1.0038 | 0.9138 | 0.9260 | 0.9938 | … | 0.9291 | 1.2321 | 0.9380 | ||
| 0.9164 | 0.9331 | 0.9208 | 0.9256 | … | 0.9216 | 1.2249 | 0.9436 | |||||
| 0.9986 | 0.9145 | 0.9200 | 0.9964 | 0.9183 | 0.9283 | 0.9957 | … | 0.9171 | 1.2313 | 0.9430 | ||
| … | … | … | … | … | … | … | … | … | … | … | ||
| 0.9965 | 0.9109 | 0.9279 | 1.0015 | 0.9207 | 0.9269 | 0.9916 | … | 0.9213 | 1.2152 | 0.9420 | ||
| 0.9940 | 0.9186 | 0.9294 | 0.9956 | 0.9184 | 0.9259 | 0.9957 | … | 0.9226 | 1.2236 | 0.8924 | ||
| 0.9237 | 0.9303 | 0.9227 | 0.9291 | … | 0.9268 | 0.9037 | ||||||
| 0.9992 | 0.9573 | 0.9341 | 0.9934 | 0.9543 | 0.9303 | 0.9906 | … | 0.9251 | 0.8902 | |||
| … | … | … | … | … | … | | … | … | … | … | ||
| 0.9986 | 0.9627 | 0.9293 | 0.9969 | 0.9569 | 0.9288 | 0.9873 | … | 0.9316 | 0.8857 |
Measured using InChIKeys for 998753 isomers of C9H8O1; the values are normalized to the frequency of ‘A’.
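The normalization used for these letter frequencies can be sketched as follows: count how often each letter occurs at a given position of the first block, then divide every count by the count for 'A' at that position. The function name and the four placeholder blocks are hypothetical and serve only to show the arithmetic; they are not InChIKeys of real structures.

```python
from collections import Counter


def normalized_letter_frequencies(first_blocks, position: int) -> dict:
    """Letter counts at a 1-based position, divided by the count of 'A'."""
    counts = Counter(block[position - 1] for block in first_blocks)
    reference = counts["A"]
    if reference == 0:
        raise ValueError("letter 'A' never occurs at this position")
    return {letter: counts[letter] / reference for letter in sorted(counts)}


# Placeholder 14-letter blocks, invented for illustration only.
blocks = [
    "AAAAAAAAAAAAAA",
    "ABAAAAAAAAAAAA",
    "BBAAAAAAAAAAAA",
    "ABCAAAAAAAAAAA",
]
print(normalized_letter_frequencies(blocks, position=1))
print(normalized_letter_frequencies(blocks, position=2))
```

With this convention the entry for 'A' is always exactly 1, so values such as 0.92 or 1.23 read directly as a letter being less or more frequent than 'A' at that position.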
The occurrence of various 4-letter sequences in the set of first blocks for 1.2002 × 10^9 InChIKeys
| Sequence | Observed | Expected | Observed/Expected | Sequence | Observed | Expected | Observed/Expected |
|---|---|---|---|---|---|---|---|
| ABCD | 33243 | 33041 | 1.006 | EDNA | 20135 | 20254 | 0.994 |
| FMGL | 33389 | 33041 | 1.011 | EGPS | 20466 | 20254 | 1.010 |
| LGRC | 32793 | 33041 | 0.992 | EKPH | 20344 | 20254 | 1.004 |
| RBCQ | 32937 | 33041 | 0.997 | EJDO | 20190 | 20254 | 0.997 |
| Probability | 2.7530 × 10^-5 | | | Probability | 1.6876 × 10^-5 | | |
| TBAC | 20209 | 20254 | 0.998 | ZAMR | 33551 | 33547 | 1.000 |
| TKIL | 20303 | 20254 | 1.002 | ZDKL | 33650 | 33547 | 1.003 |
| TRPC | 20273 | 20254 | 1.001 | ZSBC | 33581 | 33547 | 1.001 |
| TSBF | 20111 | 20254 | 0.993 | ZIII | 33577 | 33547 | 1.001 |
| Probability | 1.6876 × 10^-5 | | | Probability | 2.7951 × 10^-5 | | |
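The numbers in this table are linked by simple arithmetic: the expected count for a sequence is its stated probability multiplied by the total number of InChIKeys, and the last value in each row is the observed count divided by that expectation. The sketch below reproduces this for three rows. The pairing of each probability with the group of sequences listed above it is an assumption about the table layout, and the function names are hypothetical.

```python
TOTAL_KEYS = 1.2002e9  # number of InChIKeys quoted in the table caption


def expected_count(probability: float, total: float = TOTAL_KEYS) -> float:
    """Expected occurrences of a 4-letter sequence with the given probability."""
    return probability * total


def observed_to_expected(observed: int, probability: float) -> float:
    """Ratio of the observed count to the expected count."""
    return observed / expected_count(probability)


# (sequence, observed count, probability) taken from the table rows.
rows = [
    ("ABCD", 33243, 2.7530e-5),
    ("EDNA", 20135, 1.6876e-5),
    ("ZAMR", 33551, 2.7951e-5),
]
for sequence, observed, probability in rows:
    print(sequence,
          round(expected_count(probability)),
          round(observed_to_expected(observed, probability), 3))
```

Since both the total and the probabilities are quoted in rounded form, the recomputed expected counts agree with the tabulated ones only to within about one count.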