Sphere exclusion clustering

Examples - Online demo

Examples - Command line

Examples - execution in details

Sphere exclusion clustering algorithm

Evaluating performance using verbose mode

Forum links

Examples - Online demo

Examples - Command line

Output only the cluster representative molecules in the default (SMILES) format, with cluster ID and size information.

Examples - execution in details

In this example, sphere exclusion clustering is launched on a random 1k subset of the ZINC drug-like subset (see http://zinc.docking.org/subset1/ ). Execution is followed through verbose printouts, and cluster size statistics are printed to the console as output.

$  cat zinc-druglike.RND0001k.smi.gz | gzip -d | time  ./jklustor -c sphex:0.6 - -o wrstat:full -v

Import structures from -

Leaves: 0 Level 1: 360 / 1000 2.360 s / 2.360 s

Leaves: 0 Level 1: 361 / 1006 2.375 s / 0.015 s

Imported 1006 structures

Imported 1006 structures from the input. (Total 1006 structures; 1006 in tree)

All import finished.

Cluste size statistics

[INFO]

Structure count: 1006

Level count:         1

Preferred level: lower

[COVERAGE]

Level:                              *** 1 ***

                                    ---------

Total cluster count:                     361

Singleton cluster/struct count:          182

Real cluster count:                      179

Clustering coverage (%):               81.90%

[STATISTICS]

Cluster size:             1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   |  10  |  11 ...

                         -----+------+------+------+------+------+------+------+------+------+---- ...

Frequency at level * 1*  182  |  68  |  35  |  20  |  16  |   8  |   4  |   7  |   6  |   2  |   4 ...

[Cluster size-frequency]

  excluding singletons

========================

Level:                         *****   1   *****

                           Freq.:Single:Cumul.:Clust.

                                :cover.:cover.:count

                      ======+======+======+======

Cluster size:       30   1  :2.982%:2.982%:0.5586%

Cluster size:       27   1  :2.683%:5.666%:1.117%

Cluster size:       26   1  :2.584%:8.250%:1.675%

Cluster size:       17   1  :1.689%:9.940%:2.234%

Cluster size:       16   2  :3.180%:13.12%:3.351%

Cluster size:       15   1  :1.491%:14.61%:3.910%

Cluster size:       13   2  :2.584%:17.19%:5.027%

Cluster size:       11   4  :4.373%:21.57%:7.262%

Cluster size:       10   2  :1.988%:23.55%:8.379%

Cluster size:        9   6  :5.367%:28.92%:11.73%

Cluster size:        8   7  :5.566%:34.49%:15.64%

Cluster size:        7   4  :2.783%:37.27%:17.87%

Cluster size:        6   8  :4.771%:42.04%:22.34%

Cluster size:        5  16  :7.952%: 50  %:31.28%

Cluster size:        4  20  :7.952%:57.95%:42.45%

Cluster size:        3  35  :10.43%:68.38%:62.01%

Cluster size:        2  68  :13.51%:81.90%: 100 %

0.04user 0.04system 0:03.10elapsed 2%CPU (0avgtext+0avgdata 800512maxresident)k

0inputs+0outputs (3149major+0minor)pagefaults 0swaps
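The coverage figures in the output above can be rechecked by hand. The following is a minimal sketch, not part of JKlustor: it copies the size/frequency pairs from the printout and recomputes the single coverage, cumulative coverage and cumulative cluster count columns. The variable and function names are made up for this illustration.

```python
# Size -> frequency pairs copied from the statistics output above
# (clusters with at least two members; singletons are listed separately).
size_freq = {30: 1, 27: 1, 26: 1, 17: 1, 16: 2, 15: 1, 13: 2, 11: 4, 10: 2,
             9: 6, 8: 7, 7: 4, 6: 8, 5: 16, 4: 20, 3: 35, 2: 68}
structure_count = 1006   # "Structure count"
total_clusters = 361     # "Total cluster count"
singletons = 182         # "Singleton cluster/struct count"

real_clusters = total_clusters - singletons   # clusters with >= 2 members
covered = structure_count - singletons        # structures in such clusters


def coverage_rows(size_freq, structure_count):
    """Return (size, freq, single %, cumulative %, cluster count %) per row."""
    real = sum(size_freq.values())
    rows, cum_structs, cum_clusters = [], 0, 0
    for size in sorted(size_freq, reverse=True):
        freq = size_freq[size]
        cum_structs += size * freq
        cum_clusters += freq
        rows.append((size, freq,
                     100.0 * size * freq / structure_count,
                     100.0 * cum_structs / structure_count,
                     100.0 * cum_clusters / real))
    return rows


rows = coverage_rows(size_freq, structure_count)
print("Real cluster count:", real_clusters)
print("Clustering coverage: %.3f%%" % (100.0 * covered / structure_count))
for size, freq, single, cumul, clust in rows:
    print("%4d %3d  %7.3f%% %7.3f%% %7.3f%%" % (size, freq, single, cumul, clust))
```

The recomputed values agree with the printout up to the last digit; the printed percentages appear to be cut to four significant digits rather than rounded (for example 27 / 1006 = 2.6839% is shown as 2.683%).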

Running on a 16k subset with a high radius; cluster centroids are written to a file as SMILES.

$ cat zinc-druglike.RND0016k.smi.gz | gzip -d | time ./jklustor -c sphex:0.8 - -o wrclus:smiles:clusters.smi -v

Import structures from -

Leaves: 0 Level 1: 35 / 1000 2.812 s / 2.812 s

Leaves: 0 Level 1: 51 / 2000 4.547 s / 1.735 s

Leaves: 0 Level 1: 55 / 3000 6.203 s / 1.656 s

Leaves: 0 Level 1: 57 / 4000 7.891 s / 1.688 s

Leaves: 0 Level 1: 62 / 5000 9.625 s / 1.734 s

Leaves: 0 Level 1: 65 / 6000 11.344 s / 1.719 s

Leaves: 0 Level 1: 65 / 7000 13.031 s / 1.687 s

Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s

Leaves: 0 Level 1: 65 / 9000 16.422 s / 1.672 s

Leaves: 0 Level 1: 67 / 10000 18.250 s / 1.828 s

Leaves: 0 Level 1: 68 / 11000 20.437 s / 2.187 s

Leaves: 0 Level 1: 68 / 12000 22.453 s / 2.016 s

Leaves: 0 Level 1: 68 / 13000 24.391 s / 1.938 s

Leaves: 0 Level 1: 68 / 14000 26.375 s / 1.984 s

Leaves: 0 Level 1: 70 / 15000 28.156 s / 1.781 s

Leaves: 0 Level 1: 71 / 16000 30.109 s / 1.953 s

Leaves: 0 Level 1: 71 / 16402 30.812 s / 0.703 s

Imported 16402 structures

Imported 16402 structures from the input. (Total 16402 structures; 16402 in tree)

All import finished.

Write clusters to clusters.smi in format smiles in order natural level: lower

0.04user 0.03system 0:34.12elapsed 0%CPU (0avgtext+0avgdata 793088maxresident)k

0inputs+0outputs (3120major+0minor)pagefaults 0swaps

Sphere exclusion clustering algorithm

The clustering algorithm currently implemented:

Note that any two centroids have a higher dissimilarity than the given radius. The proper dissimilarity radius depends on the input set and on the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
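To make the centroid property concrete, here is a minimal sketch of a generic sequential sphere exclusion scheme. It is an illustration under stated assumptions, not the JKlustor implementation: fingerprints are represented as plain sets of on-bit indices (a stand-in for CFP/ECFP), dissimilarity is taken to be 1 minus the Tanimoto coefficient, and all function names are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """1 - |a & b| / |a | b| for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def sphere_exclusion(fingerprints, radius):
    """Return (centroid indices, nearest-centroid index for each structure)."""
    centroids = []
    for i, fp in enumerate(fingerprints):
        # A structure becomes a new centroid only if it lies outside
        # the sphere of every centroid found so far.
        if all(tanimoto_dissimilarity(fp, fingerprints[c]) > radius
               for c in centroids):
            centroids.append(i)
    # Reassignment pass: attach every structure to its nearest centroid.
    assignment = [
        min(centroids,
            key=lambda c: tanimoto_dissimilarity(fp, fingerprints[c]))
        for fp in fingerprints
    ]
    return centroids, assignment


# Toy input: two pairs of similar fingerprints and one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12, 13}, {10, 11, 12, 14}, {20, 21}]
centroids, assignment = sphere_exclusion(fps, radius=0.5)
print(centroids)    # indices of the structures chosen as centroids
print(assignment)   # centroid index assigned to each structure
```

Because a structure is promoted to centroid only when it is farther than the radius from all existing centroids, any two centroids end up more dissimilar than the radius. Each new structure is compared against every centroid, so the cost per structure grows with the cluster count.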

Evaluating performance using verbose mode

Verbose mode is enabled by passing the option “-v” to the jklustor command line executable. Messages printed during structure import have the following format (please note that the verbose message format might change in the future, so verbose messages are not suitable for machine processing):

Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s

Interpretation:

Processing time depends on the number of clusters found: as the cluster count increases, significant time is required to identify the nearest centroid for each structure read. With a small dissimilarity radius, the cluster count approaches the number of imported structures and processing time can become prohibitively high.
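For a quick, ad hoc look at how import time develops, the progress messages can be picked apart as in the sketch below. The field meanings used here are an assumption inferred from the sample outputs in this document (clusters found so far / structures read, cumulative time / time since the previous message), and the regular expression and function names are hypothetical. Since the message format may change, this is not meant as a stable parser.

```python
import re

# Pattern matching the progress message format shown in this document.
PROGRESS = re.compile(
    r"Leaves: (\d+) Level (\d+): (\d+) / (\d+) ([\d.]+) s / ([\d.]+) s")


def parse_progress(line):
    """Parse one progress message; return None if the line does not match."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    _leaves, _level, clusters, structures, elapsed, step = m.groups()
    return {
        "clusters": int(clusters),      # assumed: clusters found so far
        "structures": int(structures),  # assumed: structures read so far
        "elapsed_s": float(elapsed),    # assumed: time since import start
        "step_s": float(step),          # assumed: time since previous message
    }


# Two consecutive messages copied from the 16k run above.
previous = parse_progress("Leaves: 0 Level 1: 65 / 7000 13.031 s / 1.687 s")
current = parse_progress("Leaves: 0 Level 1: 65 / 8000 14.750 s / 1.719 s")

batch = current["structures"] - previous["structures"]
ms_per_structure = 1000.0 * current["step_s"] / batch
print("%d clusters, %.3f ms per structure in the last batch"
      % (current["clusters"], ms_per_structure))
```

Comparing the per-structure time between runs with different radii gives a rough picture of how the cluster count affects import speed.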

When individual structures are used, all structures are reassigned after import (see the algorithm description). The time required for this reassignment can be estimated from the time spent in the clustering process. When the cluster count is very high, this additional time can reach twice the time spent on clustering the structures.

 

Forum links

Tracker topic:

https://www.chemaxon.com/forum/ftopic8015.html (JKlustor related documents)

References to this document are in the following topics:

https://www.chemaxon.com/forum/ftopic7691.html ('sphere exclusion' clustering - the speed of performance)