Milestone 4 Progress Evaluation

Personalized Web Search

Erdal Tuleu
email: etuleu@fit.edu

Faculty sponsor:
Dr. Philip Chan
email: pkc@cs.fit.edu



Milestone 4 (Feb 23, 2009):



Progress of current milestone
Task
Completion (%)
Comments and To do
1. Study other clustering algorithms
100%
Read the chapter on clustering from the book "Introduction to Data Mining" and other online sources
2. Find a way to combine all the features used for finding similar pages into one similarity function
85%
Combined the three features (page content, link structure, URL similarity) into one distance function. There is still room for improvement by trying different combinations and possibly adding new features.
3. Evaluate different clustering algorithms and similarity functions against the ground truth using the Rand index similarity measure
90%
Evaluated k-means, agglomerative, and divisive clustering and stayed with agglomerative because it also helps in finding the appropriate number of clusters. Implemented the L-method for finding the knee of a graph (used to find the best number of clusters). Got good Rand index results, but they could be improved further by trying different cluster distance measures and stopping criteria (such as a statistical test).
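For reference, the Rand index used in task 3 is the fraction of item pairs on which two clusterings agree (both place the pair together, or both place it apart). The sketch below is a minimal illustration, not the project's actual code; the function name rand_index and the label-list representation of a clustering are assumptions made for this example.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of item pairs on which two clusterings agree.

    Each argument is a list of cluster labels, one per item, in the same
    item order (an assumed representation, chosen for illustration).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: nothing to disagree on

    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]  # pair together in clustering A?
        same_b = labels_b[i] == labels_b[j]  # pair together in clustering B?
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)
```

A value of 1.0 means the computed clusters match the ground truth exactly, regardless of how the clusters are named. This version checks every pair, so it is quadratic in the number of pages.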

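The L-method mentioned under task 3 locates the knee of an evaluation graph (for example, merge distance plotted against number of clusters) by fitting one straight line to the points left of a candidate split and another to the points right of it, then choosing the split with the smallest size-weighted fitting error. The sketch below is a simplified single-pass version and is not the project's implementation; the names l_method_knee and _line_rmse are hypothetical, and any iterative refinement of the graph range is omitted.

```python
import math


def _line_rmse(xs, ys):
    """Root-mean-square error of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / n)


def l_method_knee(xs, ys):
    """Return the x value of the last point on the left line at the best split.

    xs must be sorted; each side of a split keeps at least two points.
    """
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four points to fit two lines")

    best_x, best_err = None, math.inf
    for k in range(2, n - 1):  # left = first k points, right = the rest
        err = (k / n) * _line_rmse(xs[:k], ys[:k]) \
            + ((n - k) / n) * _line_rmse(xs[k:], ys[k:])
        if err < best_err:
            best_x, best_err = xs[k - 1], err
    return best_x
```

With xs as candidate cluster counts and ys as the corresponding merge distances, the returned x can serve as the suggested number of clusters, subject to the simplifications noted above.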


Plan for milestone 5 (March 23, 2009):



Sponsor feedback on each task:

1.


2.


3.



Signature: _______________________________ Date: ________



Sponsor Evaluation
Sponsor: detach and return this page to Dr. Chan (EC 242)


Scores: circle a score (or circle two adjacent scores for .5 or write down a real/float number between 0 and 10) for each member:


  1.  
    • Erdal Tuleu
      0 1 2 3 4 5 6 7 8 9 10



Signature: _______________________________ Date: __________