Document dissimilarity based on Kullback-Leibler (KL) divergence
KL divergence, a measure of the difference between two distributions, is one way to quantify how dissimilar one document (one unigram distribution) is from another.
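To make this concrete, here is a minimal sketch (not from the slides) of KL divergence between two unigram distributions; the add-alpha smoothing and its value are my assumptions, since unsmoothed KL is undefined when a term is missing from one document:

```python
import math
from collections import Counter

def kl_divergence(tokens_a, tokens_b, alpha=0.01):
    """KL(P_a || P_b) over the unigram distributions of two token lists.
    Note KL is asymmetric: KL(P_a || P_b) != KL(P_b || P_a) in general.
    Add-alpha smoothing (an assumption, not specified in the slides)
    keeps every term's probability nonzero."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (counts_a[term] + alpha) / total_a
        q = (counts_b[term] + alpha) / total_b
        divergence += p * math.log(p / q)
    return divergence

# Lower values mean the two term distributions are more alike.
print(kl_divergence("a b a c".split(), "a b a d".split()))
```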
Clustering Algorithm
Soft, Non-Hierarchical Clustering (Partition)
Single Pass Clustering with carefully selected seed documents each time
Close to K-Means
No need to define K beforehand
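A minimal sketch of the single-pass idea (my reading of the slide, with the similarity function and threshold left as parameters; the cluster centroid is approximated here by the cluster's seed document):

```python
def single_pass_cluster(docs, similarity, threshold):
    """Each document joins the best-matching existing cluster if its
    similarity to that cluster's seed clears the threshold; otherwise it
    seeds a new cluster. Unlike k-means, no K is fixed in advance."""
    clusters = []  # each cluster is a list; element [0] is its seed
    for doc in docs:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(doc, cluster[0])  # seed stands in for centroid
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])  # becomes the seed of a new cluster
    return clusters
```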
22 of 34
Similarity-based Clustering - II
[Diagram: duplicate candidates are partitioned by similarity-based clustering into duplicate sets (Dup Set 1, Dup Set 2)]
23 of 34
Adaptive Thresholding
The cut-off threshold should be different for different clusters
Documents in a cluster are sorted by their document-centroid similarity scores.
Sample the sorted scores at 10-document intervals
If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
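One reading of this rule as code (a sketch; the slide does not pin down the exact comparison, so the drop test below is my interpretation):

```python
def adaptive_cutoff(sorted_scores, initial_threshold, interval=10, frac=0.05):
    """Scores arrive sorted in descending document-centroid order and are
    sampled every `interval` documents. If the score falls by more than
    `frac` of the initial cut-off threshold within one interval, the
    cut-off moves up to the score at the start of that interval."""
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > frac * initial_threshold:
            return sorted_scores[start]  # new, cluster-specific cut-off
    return initial_threshold  # no sharp drop: keep the global threshold
```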
24 of 34
Does Feature-based Document Retrieval Help?
It is fairly efficient for large clusters
Cuts the number of documents that need to be clustered from the size of the entire dataset to a manageable number
536,975 → 10,995 documents
Bad for small clusters (especially those containing only a single unique document)
Disable feature-based retrieval after most of the big clusters have been found.
Assume that most of the remaining unclustered documents are unique.
Only similarity-based clustering is used on them
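A hypothetical driver for this two-phase design (the function names and the big-cluster cut-off are my assumptions, not from the slides):

```python
def two_phase_clustering(docs, retrieve_candidates, cluster_docs,
                         big_cluster_size=10):
    """Phase 1: feature-based retrieval narrows each document's working
    set before similarity-based clustering, which finds the big clusters.
    Phase 2: retrieval is disabled and the mostly-unique remainder is
    clustered by similarity alone."""
    clusters, remaining = [], set(docs)
    for doc in docs:
        if doc not in remaining:
            continue
        candidates = retrieve_candidates(doc) & remaining
        for group in cluster_docs(candidates | {doc}):
            if len(group) >= big_cluster_size:
                clusters.append(group)
                remaining -= set(group)
    clusters.extend(cluster_docs(remaining))  # phase 2: tail documents
    return clusters
```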
25 of 34
Presentation Outline
Introduction
Problem Definition
System Architecture
Feature-based Document Retrieval
Similarity-based Clustering
Evaluation and Experimental Results
Related Work
Conclusion and Demo
26 of 34
Evaluation Methodology - I
Difficult to evaluate clustering performance
Lack of manpower to produce ground truth for a large dataset
Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
Assessors: two graduate research assistants
Manually organized the documents into clusters they felt were near-duplicates
Manually went through one of the experimental clustering results pair by pair (comparing each document-centroid pair)
27 of 34
Evaluation Methodology - II
Class j vs. Cluster i
F-measure
p_ij = n_ij / n_i ,  r_ij = n_ij / n_j ,  F_ij = 2 p_ij r_ij / (p_ij + r_ij)
F = Σ_j (n_j / n) F_j ,  F_j = max_i {F_ij}
Purity
ρ = Σ_i (n_i / n) ρ_i ,  ρ_i = max_j {p_ij}
Pairwise-measure
Fowlkes and Mallows index
Kappa
κ = (p(A) − p(E)) / (1 − p(E)) ,  p(A) = (a + d) / m
p(E) = ((a + b)(a + c) + (c + d)(b + d)) / m²
(a, b, c, d: pairwise agreement counts between the two clusterings; m = a + b + c + d)
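For concreteness, a sketch computing these metrics from a cluster-by-class contingency table (the NumPy code is mine; it follows the formulas above, deriving a, b, c, d as counts over document pairs):

```python
import math
import numpy as np

def clustering_scores(table):
    """table[i][j] = number of documents of class j placed in cluster i."""
    n = np.asarray(table, dtype=float)
    total = n.sum()
    n_i = n.sum(axis=1)  # cluster sizes
    n_j = n.sum(axis=0)  # class sizes

    p = n / n_i[:, None]  # precision p_ij = n_ij / n_i
    r = n / n_j[None, :]  # recall    r_ij = n_ij / n_j
    with np.errstate(invalid="ignore"):
        f = np.where(p + r > 0, 2 * p * r / (p + r), 0.0)

    F = (n_j / total * f.max(axis=0)).sum()       # F_j = max_i F_ij
    purity = (n_i / total * p.max(axis=1)).sum()  # rho_i = max_j p_ij

    # Pairwise counts: a = same cluster & same class, b = same cluster
    # only, c = same class only, d = neither; m = all document pairs.
    pairs = lambda x: (x * (x - 1) / 2).sum()
    a = pairs(n)
    b = pairs(n_i) - a
    c = pairs(n_j) - a
    m = total * (total - 1) / 2
    d = m - a - b - c

    fm = a / math.sqrt((a + b) * (a + c))  # Fowlkes-Mallows index
    p_a = (a + d) / m
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / m ** 2
    kappa = (p_a - p_e) / (1 - p_e)
    return {"F": F, "purity": purity, "FM": fm, "kappa": kappa}
```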
28 of 34
Experimental Results - I
29 of 34
Experimental Results - II
30 of 34
Conclusion
Large Volume Working Set
Duplicate Definition and Automatic Evaluation
Feature-based Duplicate Candidate Retrieval
Similarity-based Clustering
Improved Efficiency
31 of 34
Related Work - I
Duplicate detection in other domains:
databases [Bilenko and Mooney 2003]
to find records referring to the same entity but possibly in different representations
electronic publishing [Brin et al. 1995]
to detect plagiarism or to identify different versions of the same document.
web search [Chowdhury et al. 2002] [Pugh 2004]
more efficient web crawling
more effective ranking of search results
easier archiving of web documents
32 of 34
Related Work - II
Fingerprinting
compute a compact description of each document, then do pair-wise comparisons of document fingerprints
Shingling [Broder et al.]
represents a document as a series of simple numeric encodings of n-term windows
retain every m-th shingle to produce a document sketch (see the code sketch after this list)
super shingles
Selective fingerprinting [Heintze]
selects a subset of the substrings to generate fingerprints
Statistical approach [Chowdhury et al.]
n high idf terms
Improved accuracy over shingling
Efficient: one-fifth the time of shingling
Fingerprint reliability in a dynamic environment [Conrad et al.]
Considers the time factor on the Web
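As an illustration of the shingling family above, a minimal sketch (the window size n and modulus m are illustrative choices, not values taken from the papers):

```python
import hashlib

def shingles(tokens, n=4):
    """Yield a 32-bit numeric encoding for every n-term window."""
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield int(hashlib.md5(window.encode()).hexdigest(), 16) % (1 << 32)

def sketch(tokens, n=4, m=25):
    """Retain shingles whose value is 0 mod m (one common variant of
    'keep every m-th shingle') to form a small document sketch;
    documents with overlapping sketches become duplicate candidates."""
    return {s for s in shingles(tokens, n) if s % m == 0}
```

Super shingles repeat the idea one level up: shingling the sketch itself, so that a single super-shingle match signals high overlap.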
33 of 34
References
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398–409. ACM Press, May 1995.
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 ’97, pages 391–404. Elsevier Science, April 1997.
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), 20(2), 2002.
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM’03, pages 443–452. ACM Press, Nov. 2003.
N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191–200, Nov. 1996.