Measuring Approximate Functional Dependencies

M Parciak, S Weytjens, F Neven, L Peeters, N Hens, S Vansummeren

09/06/2023 - Knowledge Graphs for Data Integration Workshop, UHasselt

Introduction: Functional Dependencies (FDs)

Introduction: Approximate FDs (AFDs)

Our Aim

Compare AFD measures proposed in the literature.

AFD measures: literature review

Literature review

Since 1954, 12 AFD measures have been described.

We identify three groups:

  • Violation: measures that quantify the number of FD violations
  • Shannon: measures based on Shannon entropy
  • Logical: measures based on logical entropy
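
To make the three groups concrete, here is a minimal sketch with one well-known representative per group: the g3 error (violation), the fraction of information (Shannon), and Goodman and Kruskal's tau (logical). The function names, the dict-based row format, and the toy rows are assumptions made for illustration; these are textbook definitions and not necessarily the exact variants compared in the talk.

```python
from collections import Counter
from math import log2

def _counts(rows, lhs, rhs):
    """Count occurrences of X values, Y values and (X, Y) pairs."""
    x = Counter(r[lhs] for r in rows)
    y = Counter(r[rhs] for r in rows)
    xy = Counter((r[lhs], r[rhs]) for r in rows)
    return x, y, xy

def g3_error(rows, lhs, rhs):
    """Violation group: minimum fraction of rows to delete so X -> Y holds."""
    _, _, xy = _counts(rows, lhs, rhs)
    best = {}  # per X value, size of its most frequent Y group
    for (xv, _), c in xy.items():
        best[xv] = max(best.get(xv, 0), c)
    return 1 - sum(best.values()) / len(rows)

def fraction_of_information(rows, lhs, rhs):
    """Shannon group: (H(Y) - H(Y|X)) / H(Y)."""
    x, y, xy = _counts(rows, lhs, rhs)
    n = len(rows)
    h_y = -sum(c / n * log2(c / n) for c in y.values())
    h_y_given_x = -sum(c / n * log2(c / x[xv]) for (xv, _), c in xy.items())
    # Convention assumed here: a constant Y is treated as fully determined.
    return (h_y - h_y_given_x) / h_y if h_y > 0 else 1.0

def tau(rows, lhs, rhs):
    """Logical group: (pdep(X, Y) - pdep(Y)) / (1 - pdep(Y))."""
    x, y, xy = _counts(rows, lhs, rhs)
    n = len(rows)
    pdep_y = sum((c / n) ** 2 for c in y.values())
    pdep_xy = sum(c * c / (x[xv] * n) for (xv, _), c in xy.items())
    # Same convention as above for a constant Y.
    return (pdep_xy - pdep_y) / (1 - pdep_y) if pdep_y < 1 else 1.0

# Hypothetical toy relation: "a" maps to a single Y value, "b" to two.
rows = [
    {"X": "a", "Y": 1},
    {"X": "a", "Y": 1},
    {"X": "b", "Y": 2},
    {"X": "b", "Y": 3},
]
scores = (
    g3_error(rows, "X", "Y"),
    fraction_of_information(rows, "X", "Y"),
    tau(rows, "X", "Y"),
)
```

On this toy relation the g3 error is 0.25 (one row must be removed), while the fraction of information and tau are about 0.67 and 0.6. Note the differing orientation: g3 is an error (0 means the FD holds), whereas the other two are scores (1 means the FD holds).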

AFD measures: formal comparison

Identifying two new measures

Evaluation on Real-World Data

Approach

Test AFD measures on a real-world benchmark, which we manually annotate, and compare the measures on precision, recall and rankings.
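
As a rough illustration of this kind of evaluation, the sketch below scores candidate FDs with a measure, thresholds the scores to obtain precision and recall against annotated true FDs, and lists the rank positions of the true FDs. The function names, the threshold, the candidate FDs, and all score values are made up for illustration and are not taken from the benchmark.

```python
def precision_recall(scores, true_fds, threshold):
    """Precision and recall when every candidate scoring >= threshold is accepted."""
    predicted = {fd for fd, s in scores.items() if s >= threshold}
    tp = len(predicted & true_fds)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(true_fds) if true_fds else 1.0
    return precision, recall

def ranks_of_true_fds(scores, true_fds):
    """1-based positions of the annotated FDs when sorting by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [i + 1 for i, fd in enumerate(ranked) if fd in true_fds]

# Hypothetical measure scores for four candidate FDs (invented values).
scores = {
    "zip->city": 0.98,
    "dept->manager": 0.95,
    "city->zip": 0.60,
    "name->age": 0.40,
}
# Hypothetical manual annotation: which candidates are genuine FDs.
true_fds = {"zip->city", "dept->manager"}

precision, recall = precision_recall(scores, true_fds, threshold=0.5)
ranks = ranks_of_true_fds(scores, true_fds)
```

A measure ranks well when the annotated FDs occupy the top positions, independently of any particular threshold.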

Evaluation on Real-World Data

  • three measures outperform the rest
  • largely attributed to R3 & R6
  • two structural properties confuse the measures:
    • LHS-uniqueness
    • RHS-skewness

Evaluating the ranking power
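
One plausible reading of why these two properties mislead a measure can be sketched with the violation-based g3 error. The data below is synthetic and chosen by hand; the helper name and the (x, y) pair format are assumptions for illustration.

```python
from collections import Counter

def g3_error(pairs):
    """Fraction of (x, y) pairs to remove so that each x maps to a single y."""
    xy = Counter(pairs)
    best = {}  # per x value, size of its most frequent y group
    for (x, _), c in xy.items():
        best[x] = max(best.get(x, 0), c)
    return 1 - sum(best.values()) / len(pairs)

# LHS-uniqueness: every x occurs exactly once, so X -> Y holds trivially,
# whatever the y values are.
lhs_unique = [(i, i % 3) for i in range(12)]

# RHS-skewness: y is almost always "a"; x says nothing about y,
# yet few rows violate the dependency.
rhs_skewed = [(i % 2, "a") for i in range(10)] + [(0, "b"), (1, "b")]

# Baseline: x is equally uninformative, but y is evenly spread.
balanced = [(i % 2, (i // 2) % 3) for i in range(12)]

errors = {
    "lhs_unique": g3_error(lhs_unique),
    "rhs_skewed": g3_error(rhs_skewed),
    "balanced": g3_error(balanced),
}
```

Here the LHS-unique case has error 0 and the skewed case about 0.17, against about 0.67 for the balanced baseline, although in none of the three does X carry a meaningful dependency on Y.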

Sensitivity Analysis

Structural properties and AFD measures

Based on synthetic datasets, we find that RFI’+ and 𝜇+ (both orange) separate FDs from non-FDs best.

Contributions

  • uniform AFD measure definitions (including an open-source implementation)
  • formal comparison of AFD measures
  • manually annotated real-world benchmark
  • measure evaluation on the new benchmark
  • sensitivity analysis of the measures
  • AFD measure recommendations

Conclusions

𝜇+

  • accurate (2nd on RWD-)
  • insensitive
  • fast

RFI’+

  • accurate (1st on RWD-)
  • insensitive
  • slow

g’3

  • accurate (3rd on RWD-)
  • sensitive to skew
  • fast