1 of 34

Near-Duplicate Detection for eRulemaking

Hui Yang, Jamie Callan

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

{huiyang, callan}@cs.cmu.edu

2 of 34

Presentation Outline

  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo

3 of 34

Introduction - I

  • U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations.
  • Some popular regulations attract hundreds of thousands of comments from the general public.
    • In the late 1990s, the USDA’s national organic standard required manually sorting over 250,000 public comments.
    • In 2004, the EPA’s proposed “Mercury rule” (USEPA-OAR-2002-0056) attracted over 530,000 email messages.
  • Processing comments at this scale is very labor-intensive.

4 of 34

Introduction - II

  • The problem is getting worse
    • Many online form letters are available
    • Written by special interest groups
    • Modifying an electronic form letter is extremely easy
  • Special Interest Groups
    • Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action.
    • Provide information and tools to help each individual have the greatest possible impact once a group is assembled.

5 of 34

Introduction - III

  • Public comments are near-duplicates if they were created from the same form letter.
  • Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter.
  • Goal:
    • Recognize near-duplicates and organize them
    • Find the information added by an individual
    • Find the unique comments
  • Our research focuses on recognizing and organizing near-duplicates using text mining and clustering, and on handling large amounts of data.

6 of 34

Problem Definition - I

  • What is a near-duplicate?
    • Pugh declares that “two documents are considered near duplicates if they have more than r features in common”.
    • Conrad et al. state that two documents are near-duplicates if they share more than 80% of the terminology defined by human experts.
  • Our definition is based on the ways near-duplicates are created.

7 of 34

Problem Definition - II

  • Sources of Public Comments
    • from scratch (unique comments)
    • based on a form letter (exact or near-duplicates)
  • A Category-based Definition
    • Block edit
    • Key block
    • Minor change
    • Minor change + block edit
    • Block reordering
    • Exact

8 of 34

Block Edit

9 of 34

Key Block

10 of 34

Minor Change

11 of 34

Minor Change + Block Edit

12 of 34

Block Reordering

13 of 34

Exact

14 of 34

Presentation Outline

  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo

15 of 34

System Architecture

16 of 34

Feature-based Document Retrieval - I

  • Retrieve a duplicate-candidate set for each seed document
    • Avoids working on the entire dataset
  • Steps (a sketch follows this list):
    • Each seed document is broken into chunks
    • The most informative words are selected from each chunk
      • a text span around the term t*, the term with the minimal document frequency in the chunk
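A minimal sketch of this step, assuming whitespace tokenization, fixed-size chunks, and a precomputed document-frequency table df; these are simplifications for illustration, not the exact implementation.

# Sketch: select the most informative text span from each chunk of a seed
# document. `df` maps term -> document frequency over the whole collection.

def chunks(tokens, size=50):
    # break the token stream into fixed-size chunks
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def informative_spans(text, df, chunk_size=50, span_width=2):
    # for each chunk, return a short span around t*, the term with the
    # minimal document frequency in that chunk
    tokens = text.lower().split()
    spans = []
    for c in chunks(tokens, chunk_size):
        star = min(range(len(c)), key=lambda i: df.get(c[i], 1))
        spans.append(" ".join(c[max(0, star - span_width):star + span_width + 1]))
    return spans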

17 of 34

Feature-based Document Retrieval - II

  • Metadata extraction by information extraction (a regex-style sketch follows this list):
      • email senders
      • receivers
      • signatures
      • docket IDs
      • delivered dates
      • email relayers
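A hedged sketch of the metadata extraction, using simple regular expressions over a raw email; the field names and patterns below are illustrative assumptions rather than the extractors actually used.

import re

# Sketch: pull a few metadata fields out of a raw public-comment email.
# The patterns are illustrative simplifications.

def extract_metadata(raw_email):
    patterns = {
        "sender": r"^From:\s*(.+)$",
        "receiver": r"^To:\s*(.+)$",
        "delivered_date": r"^Date:\s*(.+)$",
        "docket_id": r"(USEPA-[A-Z]+-\d{4}-\d{4})",
        "relayer": r"^Received:\s*from\s+(\S+)",
    }
    meta = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, raw_email, re.MULTILINE)
        if match:
            meta[field] = match.group(1).strip()
    return meta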

18 of 34

Feature-based Document Retrieval - III

  • Query Formulation

#AND ( docketoar.20020056 router.moveon #OR(“standards proposed by” “will harm thousands” “unborn children for” “coal plants should” “other cleaner alternative” “by 90 by” “with national standards” “available pollution control”) )
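A sketch of how such a structured query might be assembled from the extracted metadata terms and the informative chunk spans; the #AND / #OR syntax mirrors the example above, while the helper function itself is an assumption.

# Sketch: build a structured query from metadata terms and chunk spans,
# following the #AND / #OR form shown above.

def build_query(metadata_terms, spans):
    quoted = " ".join('"%s"' % s for s in spans)
    return "#AND ( %s #OR(%s) )" % (" ".join(metadata_terms), quoted)

# Example mirroring this slide:
# build_query(["docketoar.20020056", "router.moveon"],
#             ["standards proposed by", "will harm thousands"])
# -> #AND ( docketoar.20020056 router.moveon
#           #OR("standards proposed by" "will harm thousands") )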

19 of 34

20 of 34

Presentation Outline

  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo

21 of 34

Similarity-based Clustering - I

  • Document dissimilarity based on Kullback-Leibler (KL) divergence
    • KL divergence, a distributional measure, quantifies how the unigram distribution of one document diverges from that of another (a sketch of the computation follows this list).

  • Clustering Algorithm
    • Soft, non-hierarchical clustering (partitioning)
    • Single-pass clustering with carefully selected seed documents each time
      • Close to k-means
      • No need to define K beforehand
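A minimal sketch of the dissimilarity computation, assuming unigram maximum-likelihood models with additive smoothing over the joint vocabulary; the smoothing scheme is an assumption (KL divergence is undefined without it when a word appears in only one of the documents).

import math
from collections import Counter

# Sketch: KL divergence between the unigram language models of two documents.

def unigram_model(text, vocab, epsilon=1e-6):
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + epsilon * len(vocab)
    return {w: (counts.get(w, 0) + epsilon) / total for w in vocab}

def kl_divergence(doc_p, doc_q):
    # D(P || Q) = sum_w p(w) * log( p(w) / q(w) ); larger means more dissimilar
    vocab = set(doc_p.lower().split()) | set(doc_q.lower().split())
    p = unigram_model(doc_p, vocab)
    q = unigram_model(doc_q, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)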

22 of 34

Similarity-based Clustering - II

[Figure: the duplicate candidates retrieved for a seed are clustered into duplicate sets (Dup Set 1, Dup Set 2, ...)]

23 of 34

Adaptive Thresholding

  • The cut-off threshold should differ from cluster to cluster (a sketch follows this list)
    • Documents in a cluster are sorted by their document-centroid similarity scores.
    • The sorted scores are sampled at intervals of 10 documents.
    • If the score drop within an interval exceeds 5% of the initial cut-off threshold, a new cut-off threshold is set at the beginning of that interval.
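A sketch of this rule, assuming the document-centroid scores are already sorted in descending order; the 10-document sampling interval and the 5% criterion follow the slide, while the function itself is an assumption.

# Sketch: adapt a cluster's cut-off threshold from its sorted
# document-centroid similarity scores.

def adaptive_threshold(sorted_scores, initial_cutoff, interval=10, ratio=0.05):
    # sample the sorted scores every `interval` documents; if the score drop
    # within an interval exceeds `ratio` of the initial cut-off, move the
    # cut-off to the score at the beginning of that interval
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > ratio * initial_cutoff:
            return sorted_scores[start]
    return initial_cutoff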

24 of 34

Does Feature-based Document Retrieval Help?

  • It is quite efficient for large clusters
    • It cuts the number of documents that need to be clustered from the size of the entire dataset to a manageable number.
    • 536,975 -> 10,995 documents
  • It works poorly for small clusters (especially those containing only a single unique document)
  • Feature-based retrieval is disabled after most of the big clusters have been found.
    • We assume that most of the remaining unclustered documents are unique.
    • Only similarity-based clustering is applied to them.

25 of 34

Presentation Outline

  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo

26 of 34

Evaluation Methodology - I

  • Clustering performance is difficult to evaluate
    • Lack of manpower to produce ground truth for a large dataset
  • Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
  • Assessors: two graduate research assistants
    • Manually organized the documents into clusters of documents that they felt were near-duplicates
    • Manually went through one of the experimental clustering results pair by pair (comparing each document-centroid pair)

27 of 34

Evaluation Methodology - II

  • Class j vs. Cluster i
  • F-measure

$p_{ij} = n_{ij} / n_i$, $r_{ij} = n_{ij} / n_j$

$F_{ij} = \frac{2 p_{ij} r_{ij}}{p_{ij} + r_{ij}}$, $F_j = \max_i \{F_{ij}\}$, $F = \sum_j \frac{n_j}{n} F_j$

  • Purity

$\rho_i = \max_j \{p_{ij}\}$, $\rho = \sum_i \frac{n_i}{n} \rho_i$

  • Pairwise-measure

Fowlkes and Mallows index: $FM = a / \sqrt{(a+b)(a+c)}$

  • Kappa

$\kappa = \frac{p(A) - p(E)}{1 - p(E)}$, $p(A) = \frac{a+d}{m}$, $p(E) = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$

(Here $n_i$ is the size of cluster $i$, $n_j$ the size of class $j$, and $n_{ij}$ the number of class-$j$ documents in cluster $i$; $a$, $b$, $c$, $d$ count document pairs grouped together or apart by the system and by the assessors, with $m = a+b+c+d$.)
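A sketch of how these measures could be computed from gold classes and system clusters, both given as doc_id -> label maps; the pairwise counts a, b, c, d follow the usual pairwise-agreement setup, and the function names are assumptions.

import math
from collections import Counter
from itertools import combinations

# Sketch: F-measure, purity, Fowlkes-Mallows index, and kappa for a clustering,
# given gold classes and system clusters as dicts mapping doc_id -> label.

def f_and_purity(classes, clusters):
    n = len(classes)
    n_i = Counter(clusters.values())                # cluster sizes
    n_j = Counter(classes.values())                 # class sizes
    n_ij = Counter((clusters[d], classes[d]) for d in classes)
    F = 0.0
    for j in n_j:                                   # F_j = max_i F_ij
        best = 0.0
        for i in n_i:
            if n_ij[(i, j)] == 0:
                continue
            p, r = n_ij[(i, j)] / n_i[i], n_ij[(i, j)] / n_j[j]
            best = max(best, 2 * p * r / (p + r))
        F += (n_j[j] / n) * best                    # F = sum_j (n_j / n) F_j
    purity = sum((n_i[i] / n) * max(n_ij[(i, j)] / n_i[i] for j in n_j)
                 for i in n_i)                      # rho = sum_i (n_i / n) rho_i
    return F, purity

def pairwise_and_kappa(classes, clusters):
    a = b = c = d = 0
    for x, y in combinations(classes, 2):
        same_cluster = clusters[x] == clusters[y]
        same_class = classes[x] == classes[y]
        if same_cluster and same_class:
            a += 1
        elif same_cluster:
            b += 1
        elif same_class:
            c += 1
        else:
            d += 1
    m = a + b + c + d
    fm = a / math.sqrt((a + b) * (a + c)) if a else 0.0     # Fowlkes-Mallows
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    kappa = (p_a - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return fm, kappa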

28 of 34

Experimental Results - I

29 of 34

Experimental Results - II

30 of 34

Conclusion

  • Large Volume Working Set
  • Duplicate Definition and Automatic Evaluation
  • Feature-based Duplicate Candidate Retrieval
  • Similarity-based Clustering
  • Improved Efficiency

31 of 34

Related Work - I

  • Duplicate detection in other domains:
    • databases [Bilenko and Mooney 2003]
      • to find records referring to the same entity but possibly in different representations
    • electronic publishing [Brin et al. 1995]
      • to detect plagiarism or to identify different versions of the same document.
    • web search [Chowdhury et al. 2002] [Pugh 2004]
      • more efficient web-crawling
      • effective search results ranking
      • easier web document archiving

32 of 34

Related Work - II

  • Fingerprinting
    • compute a compact description of each document, then do pair-wise comparison of document fingerprints
  • Shingling [Broder et al.] (a rough sketch follows this list)
    • represents a document as a series of simple numeric encodings of n-term windows
    • retains every mth shingle to produce a document sketch
    • super shingles
  • Selective fingerprinting [Heintze]
    • selects a subset of the substrings to generate fingerprints
  • Statistical approach [Chowdhury et al.]
    • uses n high-idf terms
    • improved accuracy over shingling
    • efficient: runs in one-fifth of the time of shingling
  • Fingerprint reliability in a dynamic environment [Conrad et al.]
    • considers the time factor on the Web
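A rough sketch of shingling as described above: each n-term window is hashed, and every mth shingle is retained as the document sketch. Broder et al. select shingles by their hash values; keeping every mth one is a simplification here, as are the parameter values.

# Sketch: shingling. Each n-term window is hashed, every mth shingle is kept
# as the document sketch; resemblance is the Jaccard overlap of shingle sets.

def shingles(text, n=4):
    tokens = text.lower().split()
    return [hash(" ".join(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]

def sketch(text, n=4, m=5):
    return set(shingles(text, n)[::m])

def resemblance(text_a, text_b, n=4):
    a, b = set(shingles(text_a, n)), set(shingles(text_b, n))
    return len(a & b) / len(a | b) if a | b else 0.0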

33 of 34

References

  • M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
  • S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398–409. ACM Press, May 1995.
  • A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 ’97, pages 391–404. Elsevier Science, April 1997.
  • J. Callan. eRulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/, 2004.
  • A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), 20(2), 2002.
  • J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
  • J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM’03, pages 443–452. ACM Press, Nov. 2003.
  • N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX electronic Commerce Workshop, pages 191–200, Nov. 1996.
  • W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf, 2003.

34 of 34

Demo

  • http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/