Selecting diverse compounds
Examples - Online demo
Examples - Command line
Maximum of minimal dissimilarity selection (MMDS)
Clustering using MMDS
Using sphere exclusion clustering
JKlustor can be used to select diverse compounds from a set by using the cluster centroid/representative structures.
Examples - Online demo
- enter http://www.chemaxon.com/shared/libMCS/default.sdf into the left input field (this can be done by clicking on the last “example” link)
- click “Add!”
- in the “launch clustering” box select “Diverse subset” and click “Launch”
- click “View Clustering results”
- click on the floppy icon in the line “Total cluster count (including singletons)” to save the diverse subset in SMILES format
Examples - Command line
- jklustor -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and write them to the output
- jklustor -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf
Select the centroids of a sphere exclusion clustering (with dissimilarity radius = 0.8) as the diverse structures
- jklustor -l 88 -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and start listening on port 88 with a web user interface similar to the online example above
When the message “Launch listening server on port 88” appears, connect with a browser to http://localhost:88
Maximum of minimal dissimilarity selection (MMDS)
This selection algorithm yields a diverse subset whose size (k) is specified by the user. The selection algorithm:
- A central node is identified as the first element of the selection
  - Select the node with the smallest root-mean-square (RMS) dissimilarity to the other nodes, that is, the node whose sum of squared dissimilarity scores to the other nodes is the smallest
- Select the remaining k-1 diverse nodes in k-1 selection steps. In each step:
  - For each node not yet selected, find the most similar previously selected node (the “nearest selected” node), that is, the one with the smallest dissimilarity score
  - Select the node whose nearest selected node has the highest dissimilarity score
Note that, apart from the first (central) node, this algorithm tends to select the outliers of the input set.
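The selection steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not JKlustor's implementation: the function name mmds_select, the precomputed square dissimilarity matrix and the toy one-dimensional data are assumptions made for this example.

```python
def mmds_select(dissim, k):
    """Return the indices of k diverse nodes (hypothetical helper).

    dissim is assumed to be a symmetric n x n matrix (list of lists)
    of pairwise dissimilarity scores with zeros on the diagonal.
    """
    n = len(dissim)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of nodes")

    # Central node: smallest sum of squared dissimilarities to the others.
    first = min(range(n), key=lambda i: sum(d * d for d in dissim[i]))
    selected = [first]

    # nearest[i] = dissimilarity of node i to its nearest selected node.
    nearest = list(dissim[first])

    while len(selected) < k:
        chosen = set(selected)
        # Pick the unselected node that is farthest from its nearest selected node.
        nxt = max((i for i in range(n) if i not in chosen),
                  key=lambda i: nearest[i])
        selected.append(nxt)
        # The new selection may now be the nearest selected node for some nodes.
        nearest = [min(a, b) for a, b in zip(nearest, dissim[nxt])]

    return selected


# Toy data: five points on a line, dissimilarity = absolute distance.
coords = [0, 1, 2, 3, 10]
dissim = [[abs(a - b) for b in coords] for a in coords]
picked = mmds_select(dissim, 3)
print(picked)
```

In this toy set the point at 10 is an outlier and is picked right after the central node, which matches the note above about outliers.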
Clustering using MMDS
A clustering algorithm (accessible with “-c mmds:<k>” on the jklustor command line) is built on the MMDS algorithm described above:
- Select k diverse nodes using the MMDS algorithm
- Consider these nodes the cluster representatives
- Assign every input node (including the selected ones) to the cluster whose representative has the smallest dissimilarity value to it (assign to the nearest selected node)
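The assignment step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the tool's code: assign_to_nearest is a hypothetical helper, the dissimilarity matrix is assumed to be precomputed, and the representatives are passed in as a plain list of node indices (here chosen by hand for the toy data).

```python
def assign_to_nearest(dissim, representatives):
    """Map each representative index to the list of nodes assigned to it.

    dissim is assumed to be a symmetric n x n matrix of dissimilarity
    scores; representatives is a list of node indices (hypothetical input,
    for example the result of an MMDS selection).
    """
    clusters = {rep: [] for rep in representatives}
    for node in range(len(dissim)):
        # A representative has dissimilarity 0 to itself,
        # so it ends up in its own cluster.
        nearest = min(representatives, key=lambda rep: dissim[node][rep])
        clusters[nearest].append(node)
    return clusters


# Toy data: five points on a line, dissimilarity = absolute distance.
coords = [0, 1, 2, 3, 10]
dissim = [[abs(a - b) for b in coords] for a in coords]
clusters = assign_to_nearest(dissim, [3, 4, 0])
print(clusters)
```

Every node lands in exactly one cluster, so the number of clusters equals the number of representatives, some of which may be singletons.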
Using sphere exclusion clustering
The cluster centroids identified by the sphere exclusion clustering algorithm can be considered a diverse subset. The clustering algorithm currently implemented:
- The first structure read is selected as a cluster centroid
- For every further input structure the least dissimilar (“nearest”) previously selected centroid is identified
- If the dissimilarity to the nearest centroid is above the given dissimilarity radius, the structure is selected as a new centroid
- When all structures have been read and the individual input structures are needed (either for a “wrmols” output action or because option “-l” is given), every input structure (including the selected centroids) is assigned to the least dissimilar (nearest) centroid
Note that the dissimilarity of any two centroids is higher than the given radius. The proper dissimilarity radius depends on the input set and on the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
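The sphere exclusion steps above can be sketched as a single pass over the input followed by an assignment pass. The code below is an illustrative sketch only: sphere_exclusion is a hypothetical function name, the dissimilarity matrix is assumed to be precomputed, and the toy distances are not fingerprint dissimilarities.

```python
def sphere_exclusion(dissim, radius):
    """Return (centroids, assignment) for a precomputed dissimilarity matrix.

    centroids is the list of selected node indices in input order;
    assignment[i] is the centroid index that node i is assigned to.
    Both the name and the matrix input are assumptions of this sketch.
    """
    n = len(dissim)
    if n == 0:
        return [], []

    # The first structure read always becomes a centroid.
    centroids = [0]
    for i in range(1, n):
        # Dissimilarity to the nearest previously selected centroid.
        nearest = min(dissim[i][c] for c in centroids)
        if nearest > radius:
            centroids.append(i)

    # Second pass: assign every node (centroids included) to its nearest centroid.
    assignment = [min(centroids, key=lambda c: dissim[i][c]) for i in range(n)]
    return centroids, assignment


# Toy data: five points on a line, dissimilarity = absolute distance.
coords = [0.0, 0.5, 2.0, 2.4, 5.0]
dissim = [[abs(a - b) for b in coords] for a in coords]
centroids, assignment = sphere_exclusion(dissim, 1.0)
print(centroids, assignment)
```

Unlike MMDS, the number of centroids is not fixed in advance. Rerunning the sketch with different radius values shows how the subset size changes, which is the kind of iterative refinement mentioned above.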
Forum links
Tracker topic:
https://www.chemaxon.com/forum/ftopic8015.html (JKlustor related documents)
References to this document can be found in the following topics:
https://www.chemaxon.com/forum/ftopic7912.html (Diverse compound selector)