Sphere exclusion clustering

Examples - Online demo

Examples - Command line

Examples - execution in details

Sphere exclusion clustering algorithm

Evaluating performance using verbose mode

Forum links

Examples - Online demo

Go to http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp

Enter http://www.chemaxon.com/shared/libMCS/default.sdf into the left input field (clicking the last “example” link fills it in).
Click “Add!”; when the “Added 100 structures ...” message appears, click “Ok”.
In the “launch clustering” box, select “Sphere exclusion” and click “Launch”.
Click “View Clustering results” when it appears.
Click the floppy icon in the “Total cluster count (including singletons)” line to save the cluster centroids in SMILES format.

Examples - Command line

jklustor -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf -v

Select diverse structures as the centroids of a sphere exclusion clustering (dissimilarity radius = 0.8). Follow execution with verbose messages.

jklustor -l 88 -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf -v

When the message “Launch listening server on port 88” appears, connect with a browser to http://localhost:88

Run the same sphere exclusion clustering (dissimilarity radius = 0.8) and start listening on port 88 with a web user interface similar to the online example above. Follow execution with verbose messages.

jklustor -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf -v -o wrclus

Output only the cluster representative molecules in the default (SMILES) format, with cluster ID and size info.

Examples - execution in details

In this example, sphere exclusion clustering is launched on a 1k random subset of the ZINC drug-like subset (see http://zinc.docking.org/subset1/ ). Execution is followed with verbose printouts, and cluster size statistics are printed to the console as output.

$ cat zinc-druglike.RND0001k.smi.gz | gzip -d | time ./jklustor -c sphex:0.6 - -o wrstat:full -v

Import structures from -

Leaves: 0 Level 1: 360 / 1000 2.360 s / 2.360 s

Leaves: 0 Level 1: 361 / 1006 2.375 s / 0.015 s

Imported 1006 structures

Imported 1006 structures from the input. (Total 1006 structures; 1006 in tree)

All import finished.

Cluste size statistics

[INFO]

Structure count: 1006

Level count: 1

Preferred level: lower

[COVERAGE]

Level: *** 1 ***

---------

Total cluster count: 361

Singleton cluster/struct count: 182

Real cluster count: 179

Clustering coverage (%): 81.90%

[STATISTICS]

Cluster size:              1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |   10 |   11 ...
                        -----+------+------+------+------+------+------+------+------+------+---- ...
Frequency at level * 1*  182 |   68 |   35 |   20 |   16 |    8 |    4 |    7 |    6 |    2 |    4 ...

[Cluster size-frequency]

excluding singletons

========================

Level: ***** 1 *****

                  Freq.:Single:Cumul.:Clust.
                       :cover.:cover.:count
                 ======+======+======+======
Cluster size: 30     1 :2.982%:2.982%:0.5586%
Cluster size: 27     1 :2.683%:5.666%:1.117%
Cluster size: 26     1 :2.584%:8.250%:1.675%
Cluster size: 17     1 :1.689%:9.940%:2.234%
Cluster size: 16     2 :3.180%:13.12%:3.351%
Cluster size: 15     1 :1.491%:14.61%:3.910%
Cluster size: 13     2 :2.584%:17.19%:5.027%
Cluster size: 11     4 :4.373%:21.57%:7.262%
Cluster size: 10     2 :1.988%:23.55%:8.379%
Cluster size:  9     6 :5.367%:28.92%:11.73%
Cluster size:  8     7 :5.566%:34.49%:15.64%
Cluster size:  7     4 :2.783%:37.27%:17.87%
Cluster size:  6     8 :4.771%:42.04%:22.34%
Cluster size:  5    16 :7.952%: 50 %:31.28%
Cluster size:  4    20 :7.952%:57.95%:42.45%
Cluster size:  3    35 :10.43%:68.38%:62.01%
Cluster size:  2    68 :13.51%:81.90%: 100 %

0.04user 0.04system 0:03.10elapsed 2%CPU (0avgtext+0avgdata 800512maxresident)k

0inputs+0outputs (3149major+0minor)pagefaults 0swaps
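The percentages in the statistics above can be reproduced from the cluster size frequencies alone. The following Python sketch is not part of jklustor; it simply recomputes the figures from the numbers printed in the size-frequency table, under the assumption that "Single cover." is the share of all structures in clusters of that size, "Cumul. cover." is its running sum, and "Clust. count" is the running share of non-singleton clusters. The printed values appear to be cut to four significant digits, so the last digit may differ slightly.

```python
# Cluster size -> number of clusters of that size, copied by hand from the
# size-frequency table above (singletons taken from the [COVERAGE] block).
size_freq = {
    1: 182, 2: 68, 3: 35, 4: 20, 5: 16, 6: 8, 7: 4, 8: 7, 9: 6,
    10: 2, 11: 4, 13: 2, 15: 1, 16: 2, 17: 1, 26: 1, 27: 1, 30: 1,
}

total_structures = sum(size * count for size, count in size_freq.items())
total_clusters = sum(size_freq.values())
singletons = size_freq[1]
real_clusters = total_clusters - singletons

# Coverage: share of structures that ended up in a non-singleton cluster.
coverage = 100.0 * (total_structures - singletons) / total_structures

# Rebuild the per-size rows, largest clusters first, excluding singletons.
rows = []
covered_structures = 0
counted_clusters = 0
for size in sorted((s for s in size_freq if s > 1), reverse=True):
    count = size_freq[size]
    covered_structures += size * count
    counted_clusters += count
    rows.append((
        size,
        count,
        100.0 * size * count / total_structures,      # single coverage
        100.0 * covered_structures / total_structures,  # cumulative coverage
        100.0 * counted_clusters / real_clusters,       # cumulative cluster share
    ))

print(f"structures={total_structures} clusters={total_clusters} "
      f"real={real_clusters} coverage={coverage:.2f}%")
for size, count, single, cumulative, cluster_share in rows:
    print(f"size {size:2d}: {count:3d} {single:6.3f}% {cumulative:6.2f}% {cluster_share:6.2f}%")
```

The totals recovered this way (1006 structures, 361 clusters, 179 non-singleton clusters) match the [INFO] and [COVERAGE] blocks, which is a quick consistency check on the interpretation of the columns.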

Running on a 16k random subset with a higher radius (0.8); cluster centroids are written to a file as SMILES.

$ cat zinc-druglike.RND0016k.smi.gz | gzip -d | time ./jklustor -c sphex:0.8 - -o wrclus:smiles:clusters.smi -v

Import structures from -

Leaves: 0 Level 1: 35 / 1000 2.812 s / 2.812 s

Leaves: 0 Level 1: 51 / 2000 4.547 s / 1.735 s

Leaves: 0 Level 1: 55 / 3000 6.203 s / 1.656 s

Leaves: 0 Level 1: 57 / 4000 7.891 s / 1.688 s

Leaves: 0 Level 1: 62 / 5000 9.625 s / 1.734 s

Leaves: 0 Level 1: 65 / 6000 11.344 s / 1.719 s

Leaves: 0 Level 1: 65 / 7000 13.031 s / 1.687 s

Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s

Leaves: 0 Level 1: 65 / 9000 16.422 s / 1.672 s

Leaves: 0 Level 1: 67 / 10000 18.250 s / 1.828 s

Leaves: 0 Level 1: 68 / 11000 20.437 s / 2.187 s

Leaves: 0 Level 1: 68 / 12000 22.453 s / 2.016 s

Leaves: 0 Level 1: 68 / 13000 24.391 s / 1.938 s

Leaves: 0 Level 1: 68 / 14000 26.375 s / 1.984 s

Leaves: 0 Level 1: 70 / 15000 28.156 s / 1.781 s

Leaves: 0 Level 1: 71 / 16000 30.109 s / 1.953 s

Leaves: 0 Level 1: 71 / 16402 30.812 s / 0.703 s

Imported 16402 structures

Imported 16402 structures from the input. (Total 16402 structures; 16402 in tree)

All import finished.

Write clusters to clusters.smi in format smiles in order natural level: lower

0.04user 0.03system 0:34.12elapsed 0%CPU (0avgtext+0avgdata 793088maxresident)k

0inputs+0outputs (3120major+0minor)pagefaults 0swaps

Sphere exclusion clustering algorithm

The clustering algorithm currently implemented:

The first structure read is selected as a cluster centroid.
For every subsequent input structure, the least dissimilar (“nearest”) previously selected centroid is identified.

If the dissimilarity to the nearest centroid is above the given dissimilarity radius, the structure is selected as a new centroid.

When all structures have been read and the individual input structures are used (either by a “wrmols” output action or by giving the option “-l”), every input structure (including the selected centroids) is assigned to its least dissimilar (nearest) centroid.

Note that any two centroids have a dissimilarity higher than the given radius. The proper dissimilarity radius depends on the input set and the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
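The steps above can be sketched in a few lines of Python. This is an illustration only, not the jklustor implementation: fingerprints are modelled as sets of "on" bit indices, the dissimilarity is assumed to be 1 minus the Tanimoto coefficient, and the function names are made up for this example.

```python
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: indices of the set bits


def tanimoto_dissimilarity(a: Fingerprint, b: Fingerprint) -> float:
    """1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def sphere_exclusion(
    fingerprints: Sequence[Fingerprint], radius: float
) -> Tuple[List[int], List[int]]:
    """Return (centroid indices, cluster index assigned to each structure)."""
    centroids: List[int] = []

    # Pass 1: centroid selection in input order.
    for i, fp in enumerate(fingerprints):
        if not centroids:
            centroids.append(i)  # the first structure is always a centroid
            continue
        nearest = min(
            tanimoto_dissimilarity(fp, fingerprints[c]) for c in centroids
        )
        if nearest > radius:
            centroids.append(i)  # farther than the radius from every centroid

    # Pass 2: assign every structure (centroids included) to its nearest centroid.
    assignment: List[int] = []
    for fp in fingerprints:
        best = min(
            range(len(centroids)),
            key=lambda k: tanimoto_dissimilarity(fp, fingerprints[centroids[k]]),
        )
        assignment.append(best)
    return centroids, assignment


# Toy input: two groups of similar bit sets.
toy = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11}),
    frozenset({10, 11, 12}),
    frozenset({1, 2, 3, 4}),
]
toy_centroids, toy_assignment = sphere_exclusion(toy, radius=0.5)
print(toy_centroids, toy_assignment)
```

Because a structure becomes a centroid only when it is farther than the radius from every existing centroid, the result depends on the input order; shuffling the input can yield a different centroid set.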

Evaluating performance using verbose mode

Verbose mode is enabled by passing the option “-v” to the jklustor command-line executable. Messages printed during structure import have the following format (note that the verbose message format might change in the future, so verbose messages are not suitable for machine processing):

Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s

Interpretation:

“Leaves: 0”: no individual structures are stored, because neither an output action nor the sphere exclusion clustering process will access individual structures.
“Level 1:”: some clustering algorithms generate hierarchic clustering; the current implementation of sphere exclusion uses a single level.
“65 / 8000”: 65 clusters have been found and 8000 individual structures processed (average cluster size is 8000/65, about 123 structures per cluster).
“14.750 s / 1.719 s”: 14.75 s have elapsed since the beginning of clustering; the last 1000 structures were processed in 1.719 s.
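The derived figures for the sample line can be worked out as follows. The numbers are typed in by hand rather than parsed, since the verbose format is not meant for machine processing; the variable names are only illustrative.

```python
# Values read off the sample line "Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s".
clusters_found = 65
structures_processed = 8000
elapsed_total_s = 14.750
elapsed_last_batch_s = 1.719
batch_size = 1000  # per the interpretation of the last timing field

average_cluster_size = structures_processed / clusters_found
overall_rate = structures_processed / elapsed_total_s   # structures per second
recent_rate = batch_size / elapsed_last_batch_s         # structures per second

print(f"average cluster size: {average_cluster_size:.1f}")
print(f"overall throughput:   {overall_rate:.0f} structures/s")
print(f"recent throughput:    {recent_rate:.0f} structures/s")
```

Comparing the recent rate with the overall rate across successive lines gives a rough indication of whether processing is slowing down as more clusters are found.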

Processing time depends on the number of clusters found: as the cluster count increases, significant time is required to identify the nearest centroid for each structure read. With a small dissimilarity radius, the cluster count approaches the imported structure count and processing time can become prohibitively high.
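To get a feel for how strongly the cluster count drives the work, the number of dissimilarity comparisons can be bounded from the verbose log. The sketch below assumes that each structure is compared against every centroid found so far (a plain linear scan, which this document does not confirm for jklustor) and uses the progress lines of the 16k run shown earlier.

```python
# (structures imported so far, clusters found so far), copied by hand from the
# verbose output of the 16k run with radius 0.8.
checkpoints = [
    (1000, 35), (2000, 51), (3000, 55), (4000, 57), (5000, 62), (6000, 65),
    (7000, 65), (8000, 65), (9000, 65), (10000, 67), (11000, 68), (12000, 68),
    (13000, 68), (14000, 68), (15000, 70), (16000, 71), (16402, 71),
]


def comparison_upper_bound(points):
    """Upper bound on centroid comparisons, assuming a linear scan per structure.

    The cluster count never decreases, so the count at the end of a batch
    bounds the number of centroids each structure in that batch was compared to.
    """
    total = 0
    previous = 0
    for structures, clusters in points:
        total += (structures - previous) * clusters
        previous = structures
    return total


observed_bound = comparison_upper_bound(checkpoints)

# Extreme case of a very small radius: every structure becomes a centroid,
# so structure i is compared against all i earlier structures.
n = checkpoints[-1][0]
all_centroids_case = n * (n - 1) // 2

print(observed_bound)        # at most about 1.03 million comparisons
print(all_centroids_case)    # about 134.5 million comparisons
print(round(all_centroids_case / observed_bound))
```

Under this assumption, the same input would need on the order of a hundred times more comparisons if the radius were so small that nearly every structure became its own centroid.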

When individual structures are used, all structures are reassigned after import (see the algorithm description). The time required for this reassignment can be estimated from the time spent in the clustering process; when the cluster count is very high, it can reach twice the time spent clustering.

Forum links

Tracker topic:

https://www.chemaxon.com/forum/ftopic8015.html (JKlustor related documents)

References to this document are in the following topics:

https://www.chemaxon.com/forum/ftopic7691.html ('sphere exclusion' clustering - the speed of performance)