Document dissimilarity based on Kullback-Leibler (KL) divergence
KL divergence, a measure of the difference between two distributions, is one way to quantify how dissimilar one document (one unigram distribution) is from another.
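To make this concrete, here is a minimal sketch (not from the slides) of KL divergence between two unigram distributions; the add-alpha smoothing and its value are my assumptions, since unsmoothed KL is undefined when a term is missing from one document:

```python
import math
from collections import Counter

def kl_divergence(tokens_a, tokens_b, alpha=0.01):
    """KL(P_a || P_b) over the unigram distributions of two token lists.
    Note KL is asymmetric: KL(P_a || P_b) != KL(P_b || P_a) in general.
    Add-alpha smoothing (an assumption, not specified in the slides)
    keeps every term's probability nonzero."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (counts_a[term] + alpha) / total_a
        q = (counts_b[term] + alpha) / total_b
        divergence += p * math.log(p / q)
    return divergence

# Lower values mean the two term distributions are more alike.
print(kl_divergence("a b a c".split(), "a b a d".split()))
```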
Clustering Algorithm
Soft, Non-Hierarchical Clustering (Partition)
Single Pass Clustering with carefully selected seed documents each time
Close to K-Means
No need to define K beforehand
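A minimal sketch of the single-pass idea (my reading of the slide, with the similarity function and threshold left as parameters; the cluster centroid is approximated here by the cluster's seed document):

```python
def single_pass_cluster(docs, similarity, threshold):
    """Each document joins the best-matching existing cluster if its
    similarity to that cluster's seed clears the threshold; otherwise it
    seeds a new cluster. Unlike k-means, no K is fixed in advance."""
    clusters = []  # each cluster is a list; element [0] is its seed
    for doc in docs:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(doc, cluster[0])  # seed stands in for centroid
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])  # becomes the seed of a new cluster
    return clusters
```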
22 of 34
Similarity-based Clustering - II
[Diagram: duplicate candidates are partitioned by similarity-based clustering into duplicate sets (Dup Set 1, Dup Set 2)]
23 of 34
Adaptive Thresholding
The cut-off threshold should be different for different clusters
Documents in a cluster are sorted by their document-centroid similarity scores.
Sample the sorted scores at 10-document intervals
If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
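One reading of this rule as code (a sketch; the slide does not pin down the exact comparison, so the drop test below is my interpretation):

```python
def adaptive_cutoff(sorted_scores, initial_threshold, interval=10, frac=0.05):
    """Scores arrive sorted in descending document-centroid order and are
    sampled every `interval` documents. If the score falls by more than
    `frac` of the initial cut-off threshold within one interval, the
    cut-off moves up to the score at the start of that interval."""
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > frac * initial_threshold:
            return sorted_scores[start]  # new, cluster-specific cut-off
    return initial_threshold  # no sharp drop: keep the global threshold
```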
24 of 34
Does Feature-based Document Retrieval Help?
It is fairly efficient for large clusters
Cuts the number of documents that need to be clustered from the size of the entire dataset to a manageable number
536,975 → 10,995 documents
Bad for small clusters (especially those containing only a single unique document)
Disable feature-based retrieval after most of the big clusters have been found.
Assume that most of the remaining unclustered documents are unique.
Only similarity-based clustering is used on them
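A hypothetical driver for this two-phase design (the function names and the big-cluster cut-off are my assumptions, not from the slides):

```python
def two_phase_clustering(docs, retrieve_candidates, cluster_docs,
                         big_cluster_size=10):
    """Phase 1: feature-based retrieval narrows each document's working
    set before similarity-based clustering, which finds the big clusters.
    Phase 2: retrieval is disabled and the mostly-unique remainder is
    clustered by similarity alone."""
    clusters, remaining = [], set(docs)
    for doc in docs:
        if doc not in remaining:
            continue
        candidates = retrieve_candidates(doc) & remaining
        for group in cluster_docs(candidates | {doc}):
            if len(group) >= big_cluster_size:
                clusters.append(group)
                remaining -= set(group)
    clusters.extend(cluster_docs(remaining))  # phase 2: tail documents
    return clusters
```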
25 of 34
Presentation Outline
Introduction
Problem Definition
System Architecture
Feature-based Document Retrieval
Similarity-based Clustering
Evaluation and Experimental Results
Related Work
Conclusion and Demo
26 of 34
Evaluation Methodology - I
Difficult to evaluate clustering performance
Lack of manpower to produce ground truth for a large dataset
Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
Assessors: two graduate research assistants
Manually organized the documents into clusters they felt were near-duplicates
Manually went through one of the experimental clustering results pair by pair (comparing each document-centroid pair)
27 of 34
Evaluation Methodology - II
Class j vs. Cluster i
F-measure
p_ij = n_ij / n_i ,  r_ij = n_ij / n_j ,  F_ij = 2 p_ij r_ij / (p_ij + r_ij)
F = Σ_j (n_j / n) F_j ,  F_j = max_i {F_ij}
Purity
ρ = Σ_i (n_i / n) ρ_i ,  ρ_i = max_j {p_ij}
Pairwise-measure
Fowlkes and Mallows index
Kappa
κ = (p(A) − p(E)) / (1 − p(E)) ,  p(A) = (a + d) / m
p(E) = ((a + b)(a + c) + (c + d)(b + d)) / m²
(a, b, c, d: pairwise agreement counts between the two clusterings; m = a + b + c + d)
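For concreteness, a sketch computing these metrics from a cluster-by-class contingency table (the NumPy code is mine; it follows the formulas above, deriving a, b, c, d as counts over document pairs):

```python
import math
import numpy as np

def clustering_scores(table):
    """table[i][j] = number of documents of class j placed in cluster i."""
    n = np.asarray(table, dtype=float)
    total = n.sum()
    n_i = n.sum(axis=1)  # cluster sizes
    n_j = n.sum(axis=0)  # class sizes

    p = n / n_i[:, None]  # precision p_ij = n_ij / n_i
    r = n / n_j[None, :]  # recall    r_ij = n_ij / n_j
    with np.errstate(invalid="ignore"):
        f = np.where(p + r > 0, 2 * p * r / (p + r), 0.0)

    F = (n_j / total * f.max(axis=0)).sum()       # F_j = max_i F_ij
    purity = (n_i / total * p.max(axis=1)).sum()  # rho_i = max_j p_ij

    # Pairwise counts: a = same cluster & same class, b = same cluster
    # only, c = same class only, d = neither; m = all document pairs.
    pairs = lambda x: (x * (x - 1) / 2).sum()
    a = pairs(n)
    b = pairs(n_i) - a
    c = pairs(n_j) - a
    m = total * (total - 1) / 2
    d = m - a - b - c

    fm = a / math.sqrt((a + b) * (a + c))  # Fowlkes-Mallows index
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / m ** 2
    kappa = (p_a - p_e) / (1 - p_e)
    return {"F": F, "purity": purity, "FM": fm, "kappa": kappa}
```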
28 of 34
Experimental Results - I
29 of 34
Experimental Results - II
30 of 34
Conclusion
Large Volume Working Set
Duplicate Definition and Automatic Evaluation
Feature-based Duplicate Candidate Retrieval
Similarity-based Clustering
Improved Efficiency
31 of 34
Related Work - I
Duplicate detection in other domains:
databases [Bilenko and Mooney 2003]
to find records referring to the same entity but possibly in different representations
electronic publishing [Brin et al. 1995]
to detect plagiarism or to identify different versions of the same document.
web search [Chowdhury et al. 2002] [Pugh 2004]
more efficient web crawling
more effective ranking of search results
easier archiving of web documents
32 of 34
Related Work - II
Fingerprinting
compute a compact description of each document, then do pair-wise comparisons of document fingerprints
Shingling [Broder et al.]
represents a document as a series of simple numeric encodings of n-term windows
retain every m-th shingle to produce a document sketch (see the code sketch after this list)
super shingles
Selective fingerprinting [Heintze]
selects a subset of the substrings to generate fingerprints
Statistical approach [Chowdhury et al.]
n high idf terms
Improved accuracy over shingling
Efficient: one-fifth the time of shingling
Fingerprint reliability in a dynamic environment [Conrad et al.]
Considers the time factor on the Web
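As an illustration of the shingling family above, a minimal sketch (the window size n and modulus m are illustrative choices, not values taken from the papers):

```python
import hashlib

def shingles(tokens, n=4):
    """Yield a 32-bit numeric encoding for every n-term window."""
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield int(hashlib.md5(window.encode()).hexdigest(), 16) % (1 << 32)

def sketch(tokens, n=4, m=25):
    """Retain shingles whose value is 0 mod m (one common variant of
    'keep every m-th shingle') to form a small document sketch;
    documents with overlapping sketches become duplicate candidates."""
    return {s for s in shingles(tokens, n) if s % m == 0}
```

Super shingles repeat the idea one level up: shingling the sketch itself, so that a single super-shingle match signals high overlap.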
33 of 34
References
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398–409. ACM Press, May 1995.
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 ’97, pages 391–404. Elsevier Science, April 1997.
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), 20(2), 2002.
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM’03, pages 443–452. ACM Press, Nov. 2003.
N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191–200, Nov. 1996.