Data prepping for tabular learning
More learning and less cleaning
Who am I?
Vincent Maladière
Machine Learning Engineer @Inria Soda
Contributor @scikit-learn, @skrub and @hazardous
During ML modeling, how would you deal with this column?
These columns are collinear, so we keep only one of the two
What about this one?
One-hot encoding is impractical for high-cardinality categories
By extension, messy categories impact joining
“Dirty data” = #1 challenge
https://www.kaggle.com/code/ash316/novice-to-grandmaster
We need to treat categories as continuous entities
By embedding these categories
Meet skrub
Prepping tables for machine learning
skrub-data.org
⚠️ Note: we are migrating the project from “dirty_cat” to “skrub”
Current stable version:
pip install dirty_cat
Feature 1: Encoding
Gap Encoder
P. Cerda, G. Varoquaux. Encoding high-cardinality string categorical variables (2019)
The Gamma-Poisson model is well suited to count statistics.
Since f is a vector of n-gram counts, it is natural to model each of its elements with a Poisson distribution.
For the elements of the activation vector x, we use a Gamma prior: it is the conjugate of the Poisson likelihood, and it also fosters soft sparsity (small but non-zero values).
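In symbols, a sketch reconstructed from the cited paper (notation may differ slightly): f is the n-gram count vector of a category, x its latent activations, and Λ the matrix of topics.

% Gamma-Poisson model behind the GapEncoder (reconstruction)
f_j \mid \mathbf{x} \sim \mathrm{Poisson}\big((\mathbf{x}\,\Lambda)_j\big), \qquad j = 1, \dots, m
x_k \sim \mathrm{Gamma}(\alpha_k, \beta_k), \qquad k = 1, \dots, d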
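A minimal usage sketch, assuming the dirty_cat 0.x API (the class keeps its name in skrub); the job titles are illustrative:

# GapEncoder: embeds dirty categories as soft topic activations
from dirty_cat import GapEncoder

X = [["Police Officer"], ["Police Oficer III"],
     ["Fire Fighter"], ["Firefighter II"]]

enc = GapEncoder(n_components=3)   # number of latent topics
embeddings = enc.fit_transform(X)  # shape (4, 3)

Variants of the same entity (e.g. the two police titles) end up with similar activations, so downstream models can treat them as close.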
Feature 1: Encoding
Min Hash Encoder
P. Cerda, G. Varoquaux. Encoding high-cardinality string categorical variables (2019)
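A minimal usage sketch, again assuming the dirty_cat 0.x API, with illustrative data:

# MinHashEncoder: hashes character n-grams; stateless and very fast,
# a good match for tree-based models on large datasets
from dirty_cat import MinHashEncoder

X = [["Police Officer"], ["Police Oficer"], ["Fire Fighter"]]

enc = MinHashEncoder(n_components=30)
hashes = enc.fit_transform(X)  # shape (3, 30)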
Typical performance
Feature 1: Encoding
Table Vectorizer
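A minimal usage sketch of the TableVectorizer, assuming the dirty_cat 0.x API; the column names are illustrative and the dispatch defaults may differ across versions:

import pandas as pd
from dirty_cat import TableVectorizer

df = pd.DataFrame({
    "position": ["Police Officer III", "Master Police Officer", "Firefighter II"],
    "hire_date": pd.to_datetime(["1988-09-12", "2006-06-26", "2012-01-04"]),
    "salary": [69222.18, 97392.47, 42053.83],
})

# Dispatches each column to a suitable encoder: numeric columns pass
# through, datetimes are expanded into features, low-cardinality
# strings are one-hot encoded, high-cardinality ones are embedded
tv = TableVectorizer()
X = tv.fit_transform(df)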
Feature 2: Joining
Fuzzy Join
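A minimal usage sketch, assuming the dirty_cat 0.x API, with made-up country data:

import pandas as pd
from dirty_cat import fuzzy_join

main = pd.DataFrame({"country": ["France", "Germani", "Ity"]})
aux = pd.DataFrame({"country_name": ["France", "Germany", "Italy"],
                    "gdp_trillion_usd": [2.9, 4.2, 2.1]})

# Joins on string similarity rather than exact equality, so typos
# such as "Germani" still match "Germany"
joined = fuzzy_join(main, aux, left_on="country", right_on="country_name")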
Feature 2: Joining
Feature Augmenter
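A minimal usage sketch, assuming the dirty_cat 0.x API, in which FeatureAugmenter wraps fuzzy joins as a scikit-learn transformer; data as in the previous sketch:

import pandas as pd
from dirty_cat import FeatureAugmenter

main = pd.DataFrame({"country": ["France", "Germani", "Ity"]})
aux = pd.DataFrame({"country_name": ["France", "Germany", "Italy"],
                    "gdp_trillion_usd": [2.9, 4.2, 2.1]})

# `tables` is a list of (auxiliary table, key column) pairs to
# fuzzy-join onto the main table, e.g. inside a Pipeline
fa = FeatureAugmenter(tables=[(aux, "country_name")], main_key="country")
augmented = fa.fit_transform(main)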
Feature 3: Deduplication
Hierarchical clustering
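A minimal usage sketch, assuming the dirty_cat 0.x deduplicate function, with illustrative category variants:

from dirty_cat import deduplicate

names = ["online course", "online courses", "onlin course",
         "home office", "home ofice", "Home Office"]

# Hierarchically clusters the strings on n-gram similarity and maps
# each variant to a canonical representative of its cluster
clean = deduplicate(names)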
Jupyter notebook demo
https://github.com/jovan-stojanovic/jupytercon2023
What’s next?
Leverage contextual embeddings and graphs
More learning, less cleaning
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data sources: more learning, rather than more cleaning
Automatic feature extraction
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational Data Embeddings for Feature Enrichment with Background Information
Thank you!
Questions?