Examples - execution in detail
Sphere exclusion clustering algorithm
Evaluating performance using verbose mode
Output only cluster representative molecules in default (SMILES) format with cluster ID and size info.
In this example, sphere exclusion clustering is run on a random 1k subset of the ZINC drug-like subset (see http://zinc.docking.org/subset1/ ). Execution is followed through verbose printouts, and cluster size statistics are printed to the console as output.
$ cat zinc-druglike.RND0001k.smi.gz | gzip -d | time ./jklustor -c sphex:0.6 - -o wrstat:full -v
Import structures from -
Leaves: 0 Level 1: 360 / 1000 2.360 s / 2.360 s
Leaves: 0 Level 1: 361 / 1006 2.375 s / 0.015 s
Imported 1006 structures
Imported 1006 structures from the input. (Total 1006 structures; 1006 in tree)
All import finished.
Cluste size statistics
[INFO]
Structure count: 1006
Level count: 1
Preferred level: lower
[COVERAGE]
Level: *** 1 ***
---------
Total cluster count: 361
Singleton cluster/struct count: 182
Real cluster count: 179
Clustering coverage (%): 81.90%
[STATISTICS]
Cluster size: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 ...
-----+------+------+------+------+------+------+------+------+------+---- ...
Frequency at level * 1* 182 | 68 | 35 | 20 | 16 | 8 | 4 | 7 | 6 | 2 | 4 ...
[Cluster size-frequency]
excluding singletons
========================
Level: ***** 1 *****
Freq.:Single:Cumul.:Clust.
:cover.:cover.:count
======+======+======+======
Cluster size: 30 1 :2.982%:2.982%:0.5586%
Cluster size: 27 1 :2.683%:5.666%:1.117%
Cluster size: 26 1 :2.584%:8.250%:1.675%
Cluster size: 17 1 :1.689%:9.940%:2.234%
Cluster size: 16 2 :3.180%:13.12%:3.351%
Cluster size: 15 1 :1.491%:14.61%:3.910%
Cluster size: 13 2 :2.584%:17.19%:5.027%
Cluster size: 11 4 :4.373%:21.57%:7.262%
Cluster size: 10 2 :1.988%:23.55%:8.379%
Cluster size: 9 6 :5.367%:28.92%:11.73%
Cluster size: 8 7 :5.566%:34.49%:15.64%
Cluster size: 7 4 :2.783%:37.27%:17.87%
Cluster size: 6 8 :4.771%:42.04%:22.34%
Cluster size: 5 16 :7.952%: 50 %:31.28%
Cluster size: 4 20 :7.952%:57.95%:42.45%
Cluster size: 3 35 :10.43%:68.38%:62.01%
Cluster size: 2 68 :13.51%:81.90%: 100 %
0.04user 0.04system 0:03.10elapsed 2%CPU (0avgtext+0avgdata 800512maxresident)k
0inputs+0outputs (3149major+0minor)pagefaults 0swaps
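The figures in the statistics report can be cross-checked against each other. The following Python sketch is not part of JKlustor; it recomputes the summary values from the numbers printed above, assuming that "Clustering coverage" is the share of structures that fall into non-singleton clusters and that the last three table columns are the per-row coverage, the cumulative coverage, and the cumulative share of real clusters.

```python
# Values copied from the console output above.
total_structures = 1006
singletons = 182
# (cluster size, frequency) rows of the size-frequency table, singletons excluded.
size_frequency = [
    (30, 1), (27, 1), (26, 1), (17, 1), (16, 2), (15, 1), (13, 2), (11, 4),
    (10, 2), (9, 6), (8, 7), (7, 4), (6, 8), (5, 16), (4, 20), (3, 35), (2, 68),
]

real_clusters = sum(freq for _, freq in size_frequency)
clustered = sum(size * freq for size, freq in size_frequency)
# Assumption: coverage = structures in non-singleton clusters / all structures.
coverage = 100.0 * clustered / total_structures

print("real clusters:", real_clusters)
print("total clusters:", real_clusters + singletons)
print("clustered structures:", clustered)
print("coverage (%):", coverage)

# Per-row columns under the assumed meaning of the table header.
cumulative_structures = 0
cumulative_clusters = 0
for size, freq in size_frequency:
    cumulative_structures += size * freq
    cumulative_clusters += freq
    single = 100.0 * size * freq / total_structures
    cumul = 100.0 * cumulative_structures / total_structures
    clust = 100.0 * cumulative_clusters / real_clusters
    print(size, freq, single, cumul, clust)
```

The recomputed counts agree with the report (179 real clusters, 361 clusters in total). The percentages in the report appear to be cut to four significant digits rather than rounded, so the last digit may differ slightly from the values computed here.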
Running on a 16k subset with a high radius; cluster centroids are written to a file as SMILES.
$ cat zinc-druglike.RND0016k.smi.gz | gzip -d | time ./jklustor -c sphex:0.8 - -o wrclus:smiles:clusters.smi -v
Import structures from -
Leaves: 0 Level 1: 35 / 1000 2.812 s / 2.812 s
Leaves: 0 Level 1: 51 / 2000 4.547 s / 1.735 s
Leaves: 0 Level 1: 55 / 3000 6.203 s / 1.656 s
Leaves: 0 Level 1: 57 / 4000 7.891 s / 1.688 s
Leaves: 0 Level 1: 62 / 5000 9.625 s / 1.734 s
Leaves: 0 Level 1: 65 / 6000 11.344 s / 1.719 s
Leaves: 0 Level 1: 65 / 7000 13.031 s / 1.687 s
Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s
Leaves: 0 Level 1: 65 / 9000 16.422 s / 1.672 s
Leaves: 0 Level 1: 67 / 10000 18.250 s / 1.828 s
Leaves: 0 Level 1: 68 / 11000 20.437 s / 2.187 s
Leaves: 0 Level 1: 68 / 12000 22.453 s / 2.016 s
Leaves: 0 Level 1: 68 / 13000 24.391 s / 1.938 s
Leaves: 0 Level 1: 68 / 14000 26.375 s / 1.984 s
Leaves: 0 Level 1: 70 / 15000 28.156 s / 1.781 s
Leaves: 0 Level 1: 71 / 16000 30.109 s / 1.953 s
Leaves: 0 Level 1: 71 / 16402 30.812 s / 0.703 s
Imported 16402 structures
Imported 16402 structures from the input. (Total 16402 structures; 16402 in tree)
All import finished.
Write clusters to clusters.smi in format smiles in order natural level: lower
0.04user 0.03system 0:34.12elapsed 0%CPU (0avgtext+0avgdata 793088maxresident)k
0inputs+0outputs (3120major+0minor)pagefaults 0swaps
The clustering algorithm currently implemented:
Note that any two centroids have a dissimilarity higher than the given radius. The proper dissimilarity radius depends on the input set and on the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
Verbose mode is enabled by passing the option “-v” to the jklustor command-line executable. Messages printed during structure import have the following format (please note that the verbose message format might change in the future, so verbose messages are not suitable for machine processing):
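The centroid property noted above can be illustrated with a generic sphere exclusion (leader-style) procedure. The Python sketch below is an assumption-laden illustration, not JKlustor's implementation: fingerprints are modelled as sets of "on" bit positions, dissimilarity is taken to be Tanimoto distance, and the function names `tanimoto_dissimilarity` and `sphere_exclusion` are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of 'on' bit positions."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def sphere_exclusion(fingerprints, radius):
    """Pick centroids in input order, then assign each structure to its nearest centroid."""
    centroids = []  # indices of the structures chosen as centroids
    for i, fp in enumerate(fingerprints):
        # A structure becomes a new centroid only if it lies outside the
        # sphere (dissimilarity > radius) of every existing centroid.
        if all(tanimoto_dissimilarity(fp, fingerprints[c]) > radius for c in centroids):
            centroids.append(i)
    # Reassignment pass (assumed): every structure goes to its nearest final centroid.
    assignment = [
        min(centroids, key=lambda c: tanimoto_dissimilarity(fp, fingerprints[c]))
        for fp in fingerprints
    ]
    return centroids, assignment


# Toy fingerprints, invented for illustration only.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({1, 2, 3, 4, 5}),
]
centroids, assignment = sphere_exclusion(fps, radius=0.6)
print(centroids)
print(assignment)
```

By construction, every pair of centroids returned by this sketch is more than `radius` apart, which is the property the note describes. A smaller radius makes it harder for a structure to fall inside an existing sphere, so more centroids are created.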
Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s
Interpretation: judging from the example outputs above, "65 / 8000" is the number of clusters found so far over the number of structures read, and "14.750 s / 1.719 s" is the total elapsed time over the time elapsed since the previous message.
Processing time depends on the number of clusters found: as the cluster count increases, significant time is required to identify the nearest centroid for each structure read. With a small dissimilarity radius, the cluster count approaches the number of imported structures and processing time can become prohibitively high.
When individual structures are used, all structures are reassigned after import (see the algorithm description). The time required for this reassignment can be estimated from the time spent in the clustering process; it can reach twice the time spent on clustering when the cluster count is very high.
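The factor of two can be motivated by counting centroid comparisons. The Python sketch below rests on assumptions that are not stated in this document: each incoming structure is compared against every centroid existing at that moment, reassignment compares every structure against every final centroid, and run time is proportional to the number of comparisons. `pass_costs` is a hypothetical helper.

```python
def pass_costs(centroids_before_each_structure, final_cluster_count):
    """Return (clustering comparisons, reassignment comparisons) under the stated assumptions."""
    n = len(centroids_before_each_structure)
    # Clustering pass: structure i is compared with the centroids found so far.
    clustering = sum(centroids_before_each_structure)
    # Reassignment pass: every structure is compared with every final centroid.
    reassign = n * final_cluster_count
    return clustering, reassign


# Extreme case of a very small radius: every structure becomes its own centroid,
# so structure i sees i existing centroids (0, 1, 2, ...).
n = 10_000
clustering, reassign = pass_costs(range(n), n)
ratio = reassign / clustering
print(clustering, reassign, ratio)
```

In this extreme case the clustering pass needs about n²/2 comparisons and the reassignment pass n², so the ratio approaches 2. When the cluster count saturates early, as in the 16k example above, both passes compare each structure with roughly the same number of centroids and the ratio under these assumptions is closer to 1.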
Tracker topic:
https://www.chemaxon.com/forum/ftopic8015.html (JKlustor related documents)
References to this document are in the following topics:
https://www.chemaxon.com/forum/ftopic7691.html ('sphere exclusion' clustering - the speed of performance)