AutoFeat: Transitive Feature Discovery over Join Paths
Asterios Katsifodimos
Rihan
Hai
Florena
Buse
Kiril
Vasilev
Andra
Ionescu
Agenda
2
Problem space
Feature discovery
AutoFeat
Pipeline
Evaluation setup
Results
INTRODUCTION
3
PROBLEM
4
Input data of an ML model is a single table
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
5
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Input dataset is the result of data augmentation and feature selection
Dataset Augmentation
6
Collection of datasets
Training dataset
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset Augmentation
7
Training dataset
When PK-FK are known:
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Collection of datasets
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset Augmentation
8
When PK-FK are missing:
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset Augmentation
9
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
When PK-FK are missing:
Dataset Augmentation
10
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
When PK-FK are missing:
FEATURE DISCOVERY
11
Feature Discovery
12
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset
collection
}
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | |
| | |
| | |
| | |
Base table
{
Target variable
Feature Discovery
13
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset
collection
}
Base table
{
Target variable
| | |
| | |
| | |
| | |
Feature Discovery
14
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Dataset
collection
}
ML
model
}
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
Base table
{
Target variable
| | |
| | |
| | |
| | |
AUTOFEAT
15
AutoFeat
16
single-hop
multi-hop
ranking-based
model-execution based
simple graph
multi-graph
AutoFeat
17
single-hop
multi-hop
ranking-based
model-execution based
simple graph
multi-graph
AutoFeat
18
0
1.2
1
2
2.3
single-hop
multi-hop
ranking-based
model-execution based
simple graph
multi-graph
PIPELINE
19
AutoFeat Pipeline
20
1
Dataset Relation Graph
21
Dataset Discovery
DRG - weighted graph
[1] Christos Koutras, et al. "Valentine: Evaluating matching techniques for dataset discovery." 2021 ICDE
2
STREAMING FEATURE SELECTION
22
FEATURES ARRIVE IN A STREAMING FASHION WITH EVERY JOIN
3
Join Trees
23
Graph traversal
Join type
3a
Join Trees
24
Join paths
Prune paths
3a
Feature Selection
25
Relevance
Redundancy
Ranking
3b
26
Relevance
Information Gain
Pearson correlation
Spearman correlation
Relief
Redundancy
Mutual Information Feature Selection
Minimum Redundancy Maximum Relevance
Conditional Infomax Feature Extraction
Join Mutual Information
Conditional Mutual Information Maximisation
Feature Selection
27
Top-k join trees
Augment Base Table
Evaluate Join Trees
4
EVALUATION
28
Setup
29
Datasets
7 OpenML
1 SOTA
ML models
Decision trees from AutoGluon
Metrics
Efficiency
Effectiveness
Baselines
30
Base
Join All
Join All + FS
ARDA [2]
Multi-Armed Bandit [3]
[2] Nadiia Chepurko, et al. "ARDA: Automatic Relational Data Augmentation for Machine Learning." 2020 VLDB
[3] Jiabin Liu, et al. "Feature augmentation with reinforcement learning." 2022 ICDE
Scenarios
31
Known Relations
Known PK-FK connections
Star/Snowflake schema
Reproduce the results from baselines
Discovered Relations
Unknown PK-FK connections
Dense multi-graph
Show the predictive power of AutoFeat
32
RESULTS
ON AVERAGE
33
16% AVERAGE INCREASE IN ACCURACY ACROSS ALL DATASETS AND MODELS
KNOWN RELATIONS
34
| # joins | Accuracy | Runtime |
| Less joins | Higher | Lower |
| More joins | Higher | Lower |
AUTOFEAT HAS SAME ACCURACY AS JOIN ALL(+FS)
AT A FRACTION OF TIME
35
Path analysis
AutoFeat explores the join space in depth
Prunes out irrelevant tables
Effectiveness
AutoFeat shows increased accuracy from the base table
Efficiency
10x faster than MAB
3x faster than ARDA
DISCOVERED RELATIONS
Conclusion
36
AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.
AutoFeat works with both star and snowflake schema.
AutoFeat decouples the model training step from feature discovery process
AutoFeat relies on heuristics to prune out irrelevant tables and features.
Thank you!
37
AutoFeat is a more efficient and effective method for automatic feature discovery over long join paths.
AutoFeat works with both star and snowflake schema.
AutoFeat decouples the model training step from feature discovery process
AutoFeat relies on heuristics to prune out irrelevant tables and features.
Open for work
a.ionescu-3@tudelft.nl
Thank you!
38
AutoFeat: Transitive Feature Discovery Over Join Paths
andradenisio