| Task | Completion (%) | Comments and To do |
| --- | --- | --- |
| 1. Study other clustering algorithms | 100% | Read the clustering chapter of the book "Introduction to Data Mining" and additional online sources. |
| 2. Find a way to combine all the features used for finding similar pages into one similarity function | 85% | Combined the 3 features (page content, link structure, URL similarity) into one distance function. This could still be improved by trying different combinations and possibly adding new features. |
| 3. Evaluate different clustering algorithms and similarity functions against the ground truth using the Rand index similarity measure | 90% | Evaluated k-means, agglomerative, and divisive clustering and kept agglomerative clustering because it also helps in finding the appropriate number of clusters. Implemented the L-method for finding the knee of a graph (used to find the best number of clusters). The Rand index results are good, but they could be improved further by trying different cluster distance measures and stopping criteria (such as a statistical test). |
| Erdal Tuleu | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
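
As an illustration of tasks 2 and 3, here is a minimal Python sketch of how three per-feature distances might be folded into one distance and how a clustering can be scored against the ground truth with the Rand index. The function names, the equal default weights, and the assumption that each feature distance is already scaled to [0, 1] are hypothetical and are not taken from the project itself; the actual combination used in the project may differ.

```python
from itertools import combinations


def combined_distance(content_d, link_d, url_d, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three per-feature distances.

    Assumption: each input distance is already normalised to [0, 1],
    so the result also lies in [0, 1]. The weights are illustrative.
    """
    w_content, w_link, w_url = weights
    total = w_content + w_link + w_url
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_content * content_d + w_link * link_d + w_url * url_d) / total


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    items in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

The Rand index depends only on which items are grouped together, so renaming the cluster labels does not change the score. The pairwise loop is quadratic in the number of pages, which is fine for a sketch but may be slow for large collections.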