1 of 38

AutoFeat: Transitive Feature Discovery over Join Paths

Asterios Katsifodimos

Rihan

Hai

Florena

Buse

Kiril

Vasilev

Andra

Ionescu

2 of 38

Agenda

2

Problem space

Feature discovery

AutoFeat

Pipeline

Evaluation setup

Results

3 of 38

INTRODUCTION

3

4 of 38

PROBLEM

4

Input data of an ML model is a single table

5 of 38

5

Input dataset is the result of data augmentation and feature selection

6 of 38

Dataset Augmentation

6

Collection of datasets

Training dataset

7 of 38

Dataset Augmentation

7

Training dataset

When PK-FK are known:

Search for datasets
Join datasets
Apply feature selection

Collection of datasets

8 of 38

Dataset Augmentation

8

When PK-FK are missing:

Dataset discovery
Join data
Apply feature selection

9 of 38

Dataset Augmentation

9

Spurious relations

When PK-FK are missing:

Dataset discovery
Join data
Apply feature selection

10 of 38

Dataset Augmentation

10

Multiple join columns

When PK-FK are missing:

Dataset discovery
Join data
Apply feature selection

11 of 38

FEATURE DISCOVERY

11

12 of 38

Feature Discovery

12

Dataset

collection

}

Base table

{

Target variable

13 of 38

Feature Discovery

13

Dataset

collection

}

Base table

{

Target variable

14 of 38

Feature Discovery

14

Dataset

collection

}

ML

model

}

Base table

{

Target variable

15 of 38

AUTOFEAT

15

16 of 38

AutoFeat

16

Join-path length:

single-hop

multi-hop

Path / Feature selection:

ranking-based

model-execution based

Joinability graph

simple graph

multi-graph

17 of 38

AutoFeat

17

Join-path length:

single-hop

multi-hop

Path / Feature selection:

ranking-based

model-execution based

Joinability graph

simple graph

multi-graph

18 of 38

AutoFeat

18

0

1.2

1

2

2.3

Join-path length:

single-hop

multi-hop

Path / Feature selection:

ranking-based

model-execution based

Joinability graph

simple graph

multi-graph

19 of 38

PIPELINE

19

20 of 38

AutoFeat Pipeline

20

1

21 of 38

Dataset Relation Graph

21

Valentine – schema matching tool suite [1]

Dataset Discovery

Nodes → Tables
Edges → Relationships

Weight = 1 (PK-FK)
Weight = similarity score

DRG - weighted graph

[1] Christos Koutras, et al. "Valentine: Evaluating matching techniques for dataset discovery." 2021 ICDE

2

22 of 38

STREAMING FEATURE SELECTION

22

FEATURES ARRIVE IN A STREAMING FASHION WITH EVERY JOIN

3

23 of 38

Join Trees

23

Breadth First Search (BFS)
Evaluate data quality after each level
Easier error management

Graph traversal

Left join
Preserve number of rows
Avoid introducing class imbalance

Join type

3a

24 of 38

Join Trees

24

Sequence of edges
Chain of joins

Join paths

Similarity score
Data quality – null values ratio

Prune paths

3a

25 of 38

Feature Selection

25

Spearman correlation – rank correlation

Relevance

MRMR – with more selected features, the effect of redundancy is reduced

Redundancy

Linear function of relevance and redundancy scores

Ranking

3b

26 of 38

26

Relevance

Information Gain

Pearson correlation

Spearman correlation

Relief

Redundancy

Mutual Information Feature Selection

Minimum Redundancy Maximum Relevance

Conditional Infomax Feature Extraction

Join Mutual Information

Conditional Mutual Information Maximisation

Feature Selection

27 of 38

27

Based on the ranking

Top-k join trees

Train ML model

Augment Base Table

Evaluate Join Trees

4

28 of 38

EVALUATION

28

29 of 38

Setup

29

Datasets

7 OpenML

1 SOTA

ML models

Decision trees from AutoGluon

Metrics

Efficiency

Effectiveness

30 of 38

Baselines

30

Non-augmented base table

Base

Join all tables

Join All

Join all, then apply feature selection

Join All + FS

Random Injection of noise

ARDA [2]

Exploration - Exploitation strategy

Multi-Armed Bandit [3]

[2] Nadiia Chepurko, et al. "ARDA: Automatic Relational Data Augmentation for Machine Learning." 2020 VLDB

[3] Jiabin Liu, et al. "Feature augmentation with reinforcement learning." 2022 ICDE

31 of 38

Scenarios

31

Known Relations

Known PK-FK connections

Star/Snowflake schema

Reproduce the results from baselines

Discovered Relations

Unknown PK-FK connections

Dense multi-graph

Show the predictive power of AutoFeat

32 of 38

32

RESULTS

33 of 38

ON AVERAGE

33

16% AVERAGE INCREASE IN ACCURACY ACROSS ALL DATASETS AND MODELS

On the benchmark scenario, we see an overall increase of 16% percent in accuracy across all datasets and models

Now let’s have a look at the number of joins, the accuracy and the runtime on each type of schemata

On a start schemata, AutoFeat prunes out all the irrelevant tables, while ARDA and MAB join all tables

Thus, the number of joined table is lower with AutoFeat, while the accuracy is higher and the runtime lower

On a snowflake schemata, ARDA and MAB are unable to explore the schemata in depth, thus they join only the directly connected tables

Because AutoFeat explores the schemata in depth, the number of joined tables is higher, however, many of these tables are just passage way towards the relevant tables

Compared with join all and join all with featue selection, Autofeat has the same accuracy, thus proving that it discovered the most relevant features, all at a fraction of the time

-- add some wisdom = why it does what it does

34 of 38

KNOWN RELATIONS

34

# joins	Accuracy	Runtime
Less joins	Higher	Lower
More joins	Higher	Lower

AUTOFEAT HAS SAME ACCURACY AS JOIN ALL(+FS)

AT A FRACTION OF TIME

On the benchmark scenario, we see an overall increase of 16% percent in accuracy across all datasets and models

Now let’s have a look at the number of joins, the accuracy and the runtime on each type of schemata

On a start schemata, AutoFeat prunes out all the irrelevant tables, while ARDA and MAB join all tables

Thus, the number of joined table is lower with AutoFeat, while the accuracy is higher and the runtime lower

On a snowflake schemata, ARDA and MAB are unable to explore the schemata in depth, thus they join only the directly connected tables

Because AutoFeat explores the schemata in depth, the number of joined tables is higher, however, many of these tables are just passage way towards the relevant tables

Compared with join all and join all with featue selection, Autofeat has the same accuracy, thus proving that it discovered the most relevant features, all at a fraction of the time

--- less joins vs all the text or plots

--- remove the table

--- add numbers

-- add some wisdom = why it does what it does

-- show the first figure of the paper

35 of 38

35

Path analysis

AutoFeat explores the join space in depth

Prunes out irrelevant tables

Effectiveness

AutoFeat shows increased accuracy from the base table

Efficiency

10x faster than MAB

3x faster than ARDA

DISCOVERED RELATIONS

36 of 38

Conclusion

36

AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.

AutoFeat works with both star and snowflake schema.

AutoFeat decouples the model training step from feature discovery process

AutoFeat relies on heuristics to prune out irrelevant tables and features.

37 of 38

Thank you!

37

AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.

AutoFeat works with both star and snowflake schema.

AutoFeat decouples the model training step from feature discovery process

AutoFeat relies on heuristics to prune out irrelevant tables and features.

Open for work

a.ionescu-3@tudelft.nl

38 of 38

Thank you!

38

AutoFeat: Transitive Feature Discovery Over Join Paths

https://github.com/delftdata/autofeat

andradenisio