1 of 38

AutoFeat: Transitive Feature Discovery over Join Paths

Asterios Katsifodimos

Rihan

Hai

Florena

Buse

Kiril

Vasilev

Andra

Ionescu

2 of 38

Agenda

2

Problem space

Feature discovery

AutoFeat

Pipeline

Evaluation setup

Results

3 of 38

INTRODUCTION

3

4 of 38

PROBLEM

4

Input data of an ML model is a single table

5 of 38

5

Input dataset is the result of data augmentation and feature selection

6 of 38

Dataset Augmentation

6

Collection of datasets

Training dataset

7 of 38

Dataset Augmentation

7

Training dataset

When PK-FK are known:

  1. Search for datasets
  2. Join datasets
  3. Apply feature selection

Collection of datasets

8 of 38

Dataset Augmentation

8

When PK-FK are missing:

  1. Dataset discovery
  2. Join data
  3. Apply feature selection

9 of 38

Dataset Augmentation

9

  • Spurious relations

When PK-FK are missing:

  1. Dataset discovery
  2. Join data
  3. Apply feature selection

10 of 38

Dataset Augmentation

10

  • Multiple join columns

When PK-FK are missing:

  1. Dataset discovery
  2. Join data
  3. Apply feature selection

11 of 38

FEATURE DISCOVERY

11

12 of 38

Feature Discovery

12

Dataset

collection

}

Base table

{

Target variable

13 of 38

Feature Discovery

13

Dataset

collection

}

Base table

{

Target variable

14 of 38

Feature Discovery

14

Dataset

collection

}

ML

model

}

Base table

{

Target variable

15 of 38

AUTOFEAT

15

16 of 38

AutoFeat

16

  • Join-path length:

single-hop

multi-hop

  • Path / Feature selection:

ranking-based

model-execution based

  • Joinability graph

simple graph

multi-graph

17 of 38

AutoFeat

17

  • Join-path length:

single-hop

multi-hop

  • Path / Feature selection:

ranking-based

model-execution based

  • Joinability graph

simple graph

multi-graph

18 of 38

AutoFeat

18

0

1.2

1

2

2.3

  • Join-path length:

single-hop

multi-hop

  • Path / Feature selection:

ranking-based

model-execution based

  • Joinability graph

simple graph

multi-graph

19 of 38

PIPELINE

19

20 of 38

AutoFeat Pipeline

20

1

21 of 38

Dataset Relation Graph

21

    • Valentine – schema matching tool suite [1]

Dataset Discovery

    • Nodes → Tables
    • Edges → Relationships
      • Weight = 1 (PK-FK)
      • Weight = similarity score

DRG - weighted graph

[1] Christos Koutras, et al. "Valentine: Evaluating matching techniques for dataset discovery." 2021 ICDE

2

22 of 38

STREAMING FEATURE SELECTION

22

FEATURES ARRIVE IN A STREAMING FASHION WITH EVERY JOIN

3

23 of 38

Join Trees

23

    • Breadth First Search (BFS)
    • Evaluate data quality after each level
    • Easier error management

Graph traversal

    • Left join
    • Preserve number of rows
    • Avoid introducing class imbalance

Join type

3a

24 of 38

Join Trees

24

    • Sequence of edges
    • Chain of joins

Join paths

    • Similarity score
    • Data quality – null values ratio

Prune paths

3a

25 of 38

Feature Selection

25

    • Spearman correlation – rank correlation

Relevance

    • MRMR – with more selected features, the effect of redundancy is reduced

Redundancy

    • Linear function of relevance and redundancy scores

Ranking

3b

26 of 38

26

Relevance

Information Gain

Pearson correlation

Spearman correlation

Relief

Redundancy

Mutual Information Feature Selection

Minimum Redundancy Maximum Relevance

Conditional Infomax Feature Extraction

Join Mutual Information

Conditional Mutual Information Maximisation

Feature Selection

27 of 38

27

    • Based on the ranking

Top-k join trees

    • Train ML model

Augment Base Table

Evaluate Join Trees

4

28 of 38

EVALUATION

28

29 of 38

Setup

29

Datasets

7 OpenML

1 SOTA

ML models

Decision trees from AutoGluon

Metrics

Efficiency

Effectiveness

30 of 38

Baselines

30

    • Non-augmented base table

Base

    • Join all tables

Join All

    • Join all, then apply feature selection

Join All + FS

    • Random Injection of noise

ARDA [2]

    • Exploration - Exploitation strategy

Multi-Armed Bandit [3]

[2] Nadiia Chepurko, et al. "ARDA: Automatic Relational Data Augmentation for Machine Learning." 2020 VLDB

[3] Jiabin Liu, et al. "Feature augmentation with reinforcement learning." 2022 ICDE

31 of 38

Scenarios

31

Known Relations

Known PK-FK connections

Star/Snowflake schema

Reproduce the results from baselines

Discovered Relations

Unknown PK-FK connections

Dense multi-graph

Show the predictive power of AutoFeat

32 of 38

32

RESULTS

33 of 38

ON AVERAGE

33

16% AVERAGE INCREASE IN ACCURACY ACROSS ALL DATASETS AND MODELS

34 of 38

KNOWN RELATIONS

34

# joins

Accuracy

Runtime

Less joins

Higher

Lower

More joins

Higher

Lower

AUTOFEAT HAS SAME ACCURACY AS JOIN ALL(+FS)

AT A FRACTION OF TIME

35 of 38

35

Path analysis

AutoFeat explores the join space in depth

Prunes out irrelevant tables

Effectiveness

AutoFeat shows increased accuracy from the base table

Efficiency

10x faster than MAB

3x faster than ARDA

DISCOVERED RELATIONS

36 of 38

Conclusion

36

AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.

AutoFeat works with both star and snowflake schema.

AutoFeat decouples the model training step from feature discovery process

AutoFeat relies on heuristics to prune out irrelevant tables and features.

37 of 38

Thank you!

37

AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.

AutoFeat works with both star and snowflake schema.

AutoFeat decouples the model training step from feature discovery process

AutoFeat relies on heuristics to prune out irrelevant tables and features.

Open for work

a.ionescu-3@tudelft.nl

38 of 38

Thank you!

38

AutoFeat: Transitive Feature Discovery Over Join Paths

andradenisio