Abstract
The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 97 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable.
Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.
Keywords: FOSS; Format; High-performance; Molecular fingerprints; Open source; Performance benchmark; Similarity searching; Tanimoto
Year: 2019 PMID: 33430977 PMCID: PMC6896769 DOI: 10.1186/s13321-019-0398-8
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Relative performance of different popcount implementations
| Popcount method | 166 bits | 881 bits | 1024 bits | 2048 bits |
|---|---|---|---|---|
| 8-bit lookup table | 1× | 1× | 1× | 1× |
| 16-bit lookup table | 2.0 | 2.8 | 2.9 | 2.4 |
| Gillies-Miller | 1.6 | 2.9 | 3.1 | 3.4 |
| Lauradoux | | 3.1 | 3.3 | 3.7 |
| SSSE3 | | | 5.4 | 6.1 |
| POPCNT (8 bytes/loop), dispatch | 3.6 | 6.0 | 6.3 | 6.4 |
| POPCNT (8 bytes/loop), inline | 4.9 | 6.6 | 6.9 | 6.6 |
| POPCNT (fully unrolled), dispatch | 5.3 | 7.9 | 8.2 | 7.8 |
| POPCNT (fully unrolled), inline | 6.7 | 8.2 | 8.4 | 8.0 |
| AVX2, dispatch | | | 8.6 | 9.2 |
| AVX2, dispatch, prefetch | | | 8.7 | 9.3 |
| AVX2, inline | | | 9.8 | 9.9 |
| AVX2, inline, prefetch | | | 11.0 | 10.6 |
Times are scaled relative to an 8-bit lookup table, as measured by the threshold searches from the chemfp benchmark suite. In most cases the search algorithm uses a function pointer to dispatch to the appropriate popcount function, without memory prefetching. The “fully unrolled” variants implement the fingerprint popcount without using a loop. The “inline” and “prefetch” variants inline the calculation and use memory prefetching, respectively. Timings were made with chemfp 3.3. Chemfp 1.5 does not support inlining, AVX2, or prefetching
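To make the baseline row of the table concrete, the following is a minimal Python sketch of an 8-bit lookup-table popcount and the Tanimoto score built on it. It is illustrative only and is not chemfp's C implementation; the function names are hypothetical, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
# 8-bit lookup table: POPCOUNT_8[b] is the number of set bits in byte value b.
POPCOUNT_8 = bytes(bin(b).count("1") for b in range(256))


def popcount_lut8(fp: bytes) -> int:
    """Popcount of a byte string using one table lookup per byte."""
    return sum(POPCOUNT_8[b] for b in fp)


def popcount_bigint(fp: bytes) -> int:
    """Alternative popcount that treats the whole fingerprint as one integer."""
    return bin(int.from_bytes(fp, "little")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints.

    Assumption: two all-zero fingerprints score 0.0 rather than raising.
    """
    both = popcount_lut8(bytes(a & b for a, b in zip(fp1, fp2)))
    either = popcount_lut8(bytes(a | b for a, b in zip(fp1, fp2)))
    return both / either if either else 0.0
```

The faster rows in the table replace the per-byte lookup with wider operations (word-level bit tricks or hardware instructions), but they compute the same quantity.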
Fingerprint target data set sizes in FPS format
| Data set | #Bits | Fingerprint type | #Fingerprints (in millions) | Unique (%) | FPS size (in MiB) | FPS.gz size (in MiB) |
|---|---|---|---|---|---|---|
| chemfp benchmark ChEMBL 23 subset | 166 | OpenEye MACCS | 1.00 | 83.6 | 54 | 17.7 |
| chemfp benchmark PubChem subset | 881 | PubChem/CACTVS | 1.00 | 98.2 | 222 | 53.1 |
| chemfp benchmark ChEMBL 23 subset | 1021 | Open Babel FP2 | 1.00 | 96.0 | 258 | 80.5 |
| chemfp benchmark ChEMBL 23 subset | 2048 | RDKit Morgan | 1.00 | 90.6 | 502 | 59.9 |
| ChEMBL 24 | 2048 | RDKit Morgan | 1.82 | 94.1 | 914 | 99.7 |
| PubChem | 881 | PubChem/CACTVS | 96.9 | 65.3 | 21,500 | 2910 |
“Unique” is the number of distinct fingerprints as a percentage of the total number of fingerprints
Fig. 1 Example FPS file for 166-bit MACCS keys generated by OpenEye’s GraphSim toolkit. Header lines start with a ‘#’. The three record lines start with a hex-encoded fingerprint, followed by a tab and the record id
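The layout described in the Fig. 1 caption is simple enough to read with a few lines of code. The sketch below is a hypothetical minimal reader, not chemfp's parser: it assumes header lines have the form `#key=value` (a bare line such as `#FPS1` is stored with an empty value) and that any extra tab-separated fields after the record id can be ignored.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fps(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Tuple[str, bytes]]]:
    """Split FPS text into header metadata and (record id, fingerprint) pairs."""
    header: Dict[str, str] = {}
    records: List[Tuple[str, bytes]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header line: everything before the first '=' is the key.
            key, _, value = line[1:].partition("=")
            header[key] = value
            continue
        # Record line: hex-encoded fingerprint, a tab, then the record id.
        fields = line.split("\t")
        records.append((fields[1], bytes.fromhex(fields[0])))
    return header, records
```

For example, `parse_fps(open("example.fps"))` (a placeholder filename) would return the header dictionary and the decoded records; a 166-bit fingerprint occupies 42 hex characters, which decode to 21 bytes.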
Fig. 2 Seven fingerprint type strings from different toolkits. Each type string contains space-separated terms. The first term contains the fingerprint family name and version. Remaining terms encode fingerprint parameters as key=value pairs. The OpenEye-Path and RDKit-Morgan types are wrapped over two lines for presentation
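A type string with this structure can be split into its parts as sketched below. This is an illustration under stated assumptions: the family name and version are assumed to be joined by a `/`, parameters are assumed to contain no spaces, and the string in the usage note is a made-up example rather than one of the seven in the figure.

```python
from typing import Dict, Tuple


def parse_type_string(type_string: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (family, version, parameters) for a fingerprint type string."""
    terms = type_string.split()
    # Assumption: the first term looks like "Family/version".
    family, _, version = terms[0].partition("/")
    # Remaining terms are key=value pairs; values are kept as strings.
    params = dict(term.split("=", 1) for term in terms[1:])
    return family, version, params
```

Calling `parse_type_string("Example-FP/1 radius=2 fpSize=2048")` yields the family `"Example-FP"`, the version `"1"`, and the parameters `{"radius": "2", "fpSize": "2048"}`.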
Fingerprint data set sizes in FPB format and largest chunk sizes
| Data set | #Bits | #Fingerprints (in millions) | FPB size (in MiB) | AREN size (in MiB) | FPID size (in MiB) | HASH size (in MiB) |
|---|---|---|---|---|---|---|
| chemfp benchmark | 166 | 1.00 | 54.0 | 22.9 | 15.9 | 15.3 |
| chemfp benchmark | 881 | 1.00 | 134 | 107 | 11.6 | 15.3 |
| chemfp benchmark | 1021 | 1.00 | 153 | 122 | 15.9 | 15.3 |
| chemfp benchmark | 2048 | 1.00 | 275 | 244 | 15.9 | 15.3 |
| ChEMBL 24 | 2048 | 1.82 | 501 | 444 | 29.9 | 27.8 |
| PubChem | 881 | 96.9 | 13,000 | 10,300 | 1130 | 1480 |
The AREN chunk contains the fingerprints, the FPID chunk contains record identifiers indexed by position, and the HASH chunk contains a hash table mapping identifiers to index
Fig. 3 Single query search times for chemfp 3.3. Boxen plots for k = 2, 10, 100, and 1000 nearest-neighbor and threshold = 0.95, 0.80, 0.70, and 0.40 searches of ChEMBL 24 and PubChem (downloaded 2018-12-07). Each search samples 1000 fingerprints from the target data set to use as queries, so each query is always found in the result. Python’s garbage collector was disabled for each timing because it adds a delay of roughly 25 ms about every 1000 timings. The T = 0.40 PubChem search could not be run due to insufficient memory
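The garbage-collector precaution mentioned in the caption can be reproduced with the standard library. The helper below is a hypothetical sketch, not the benchmark's actual timing code: it pauses the cyclic collector around one call and restores the previous state afterwards.

```python
import gc
import time


def time_call(func, *args):
    """Return (result, elapsed_seconds) with the cyclic GC paused during the call."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
    finally:
        # Restore the collector only if it was running before.
        if was_enabled:
            gc.enable()
    return result, elapsed
```

Reference counting still frees most objects while the collector is paused, so short timing runs are usually unaffected apart from losing the periodic collection pauses.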
Average performance of 1000 queries against 1 million targets
| #bits | Method | #Tanimotos (M) | chemfp 1.5 avg. time (ms) | chemfp 1.5 TTanimoto (ns) | chemfp 1.5 bandwidth (GiB/s) | chemfp 3.3 avg. time (ms) | chemfp 3.3 TTanimoto (ns) | chemfp 3.3 bandwidth (GiB/s) |
|---|---|---|---|---|---|---|---|---|
| 166 | k = 1 | 91.8 | 0.25 | 2.68 | 8.34 | 0.19 | 2.08 | 10.7 |
| 166 | k = 1000 | 588 | 2.20 | 3.74 | 5.97 | 1.85 | 3.15 | 7.10 |
| 166 | T = 0.70 | 688 | 1.72 | 2.50 | 8.93 | 1.42 | 2.07 | 10.8 |
| 881 | k = 1 | 146 | 1.50 | 10.3 | 10.2 | 1.22 | 8.35 | 12.5 |
| 881 | k = 1000 | 485 | 5.64 | 11.6 | 8.97 | 4.73 | 9.75 | 10.7 |
| 881 | T = 0.70 | 554 | 5.70 | 10.3 | 10.2 | 4.70 | 8.47 | 12.3 |
| 1021 | k = 1 | 113 | 1.30 | 11.5 | 10.4 | 0.86 | 7.56 | 15.8 |
| 1021 | k = 1000 | 743 | 9.25 | 12.5 | 9.58 | 6.25 | 8.41 | 14.2 |
| 1021 | T = 0.70 | 489 | 5.51 | 11.3 | 10.6 | 3.64 | 7.45 | 16.0 |
| 2048 | k = 1 | 356 | 7.76 | 21.8 | 11.0 | 5.29 | 14.8 | 16.1 |
| 2048 | k = 1000 | 939 | 21.2 | 22.6 | 10.6 | 14.6 | 15.5 | 15.4 |
| 2048 | T = 0.40 | 920 | 19.9 | 21.6 | 11.1 | 13.6 | 14.8 | 16.1 |
The timings use three different search methods to search the four different fingerprint types from the chemfp benchmark data set. The total number of Tanimoto evaluations is less than 1 billion because of BitBound pruning. TTanimoto is the average time per Tanimoto evaluation, including storing the hits. The effective read bandwidth is calculated as storage_size/TTanimoto, where storage_size is 24, 112, 128, and 256 bytes respectively; this is equivalent to #Tanimotos * storage_size divided by the total search time. Note that while shorter fingerprints are faster and more compact, longer fingerprints tend to have better scientific usefulness
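As a worked check of the table's derived columns, the sketch below recomputes TTanimoto and the effective bandwidth from the reported averages. It reads the bandwidth as bytes read per fingerprint divided by the time per Tanimoto, which reproduces the tabulated values to within rounding; the function names are illustrative.

```python
GIB = 2**30  # bytes per GiB


def t_tanimoto_ns(avg_ms_per_query: float, n_queries: int, n_tanimotos: float) -> float:
    """Average time per Tanimoto evaluation, in nanoseconds."""
    total_seconds = avg_ms_per_query * 1e-3 * n_queries
    return total_seconds / n_tanimotos * 1e9


def bandwidth_gib_per_s(storage_bytes: int, t_ns: float) -> float:
    """Effective read bandwidth: bytes per fingerprint over time per Tanimoto."""
    return storage_bytes / (t_ns * 1e-9) / GIB


# 2048-bit row, k = 1000, chemfp 3.3: 14.6 ms/query, 939 million Tanimotos.
t_ns = t_tanimoto_ns(14.6, 1000, 939e6)
print(round(t_ns, 1), round(bandwidth_gib_per_s(256, 15.5), 1))
```

The small differences from the table (for example when starting from the rounded 14.6 ms average) come from rounding in the published figures.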
Multiquery search performance
| Method | Query size | 1 thread time (s) | 2 threads time (s) | 2 threads scaling | 4 threads time (s) | 4 threads scaling |
|---|---|---|---|---|---|---|
| k = 1 | 1000 | 5.31 | 3.93 | 1.35 | 3.69 | 1.44 |
| k = 1 | Sorted | 5.24 | 3.84 | 1.36 | 3.50 | 1.50 |
| k = 1 | N × N | 7130 (= 1 h 58 m) | 5200 (= 1 h 26 m) | 1.37 | 4640 (= 1 h 17 m) | 1.54 |
| k = 1000 | 1000 | 14.6 | 10.5 | 1.39 | 9.54 | 1.53 |
| k = 1000 | Sorted | 14.5 | 8.42 | 1.72 | 6.30 | 2.30 |
| k = 1000 | N × N | 15,300 (= 4 h 14 m) | 8040 (= 2 h 13 m) | 1.90 | 4690 (= 1 h 18 m) | 3.26 |
| T = 0.90 | 1000 | 2.95 | 2.19 | 1.35 | 2.03 | 1.45 |
| T = 0.90 | Sorted | 2.92 | 1.65 | 1.77 | 1.04 | 2.81 |
| T = 0.90 | N × N | 1890 (= 31 m 34 s) | 999 (= 16 m 39 s) | 1.90 | 550 (= 9 m 9 s) | 3.45 |
| T = 0.80 | 1000 | 5.52 | 4.09 | 1.35 | 3.77 | 1.46 |
| T = 0.80 | Sorted | 5.47 | 2.96 | 1.85 | 2.03 | 2.69 |
| T = 0.80 | N × N | 3490 (= 58 m 9 s) | 1830 (= 30 m 25 s) | 1.91 | 1010 (= 16 m 47 s) | 3.46 |
| T = 0.70 | 1000 | 8.09 | 5.95 | 1.36 | 5.43 | 1.49 |
| T = 0.70 | Sorted | 8.07 | 4.37 | 1.85 | 2.80 | 2.88 |
| T = 0.70 | N × N | 4930 (= 1 h 22 m) | 2580 (= 42 m 57 s) | 1.91 | 1430 (= 23 m 49 s) | 3.45 |
| T = 0.40 | 1000 | 13.6 | 9.99 | 1.36 | 8.28 | 1.64 |
| T = 0.40 | Sorted | 13.6 | 7.39 | 1.83 | 4.54 | 2.99 |
| T = 0.40 | N × N | 7120 (= 1 h 58 m) | 3710 (= 1 h 1 m) | 1.92 | 2100 (= 34 m 55 s) | 3.40 |
Time to search the 1 million 2048-bit Morgan fingerprints from the chemfp benchmark data set, for different numbers of threads. A query size of “1000” indicates that the first 1000 benchmark queries were used, “sorted” indicates the same 1000 queries sorted by popcount, and “N × N” generates the full sparse similarity matrix for the 1 million target fingerprints
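One plausible reason that popcount-sorted queries scale better is that, under BitBound pruning, neighboring queries scan nearly the same region of a popcount-sorted target array. The sketch below illustrates the pruning step only. It assumes the standard bound Tanimoto(A, B) ≤ min(a, b)/max(a, b) for popcounts a and b, a threshold greater than zero, and targets already sorted by popcount; the function names are hypothetical and this is not chemfp's implementation.

```python
import math
from bisect import bisect_left, bisect_right
from fractions import Fraction
from typing import Sequence, Tuple


def bitbound_popcount_range(query_popcount: int, threshold: float) -> Tuple[int, int]:
    """Target popcounts that can still reach `threshold` (requires threshold > 0)."""
    # Use exact rational arithmetic so ceil/floor are not upset by rounding.
    t = Fraction(threshold).limit_denominator(10**6)
    lowest = math.ceil(query_popcount * t)
    highest = math.floor(query_popcount / t)
    return lowest, highest


def candidate_slice(sorted_popcounts: Sequence[int], query_popcount: int,
                    threshold: float) -> Tuple[int, int]:
    """Half-open index range [start, end) of targets that must be tested."""
    lowest, highest = bitbound_popcount_range(query_popcount, threshold)
    return bisect_left(sorted_popcounts, lowest), bisect_right(sorted_popcounts, highest)
```

For a query with 100 bits set and T = 0.70, only targets with 70 to 142 bits set need a Tanimoto evaluation; two queries with similar popcounts therefore touch almost the same slice of memory.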
Fig. 4 Example of how the non-distributive nature of IEEE 754 doubles results in different Tversky similarity scores
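The effect named in the Fig. 4 caption can be explored with a short script. This is a sketch, not the figure's own example: it first shows that multiplication does not distribute over addition for doubles, then compares two algebraically equivalent forms of the Tversky index, where a and b are the fingerprint popcounts and c is the popcount of their intersection. The weights 0.1 and 0.9 and the sampled range are arbitrary choices.

```python
def tversky_grouped(a: int, b: int, c: int, alpha: float, beta: float) -> float:
    """Tversky index written as c / (alpha*(a-c) + beta*(b-c) + c)."""
    return c / (alpha * (a - c) + beta * (b - c) + c)


def tversky_distributed(a: int, b: int, c: int, alpha: float, beta: float) -> float:
    """Same index after distributing: c / (alpha*a + beta*b + (1-alpha-beta)*c)."""
    return c / (alpha * a + beta * b + (1 - alpha - beta) * c)


# Multiplication does not distribute over addition in IEEE 754 doubles.
distributes = 100 * (0.1 + 0.2) == 100 * 0.1 + 100 * 0.2
print("distributes:", distributes)

# Count sampled inputs where the two forms disagree in the last bits.
mismatches = [
    (a, b, c)
    for a in range(1, 30)
    for b in range(1, 30)
    for c in range(1, min(a, b) + 1)
    if tversky_grouped(a, b, c, 0.1, 0.9) != tversky_distributed(a, b, c, 0.1, 0.9)
]
print(len(mismatches), "sampled inputs give different scores")
```

Any disagreement is at the level of the last few bits, but that is enough to change which hits pass an exact threshold comparison or how ties are ordered.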