1 of 38

Data prepping for tabular learning

More learning and less cleaning

2 of 38

Who am I?

Vincent Maladière

Machine Learning Engineer @Inria Soda

Contributor @scikit-learn, @skrub and @hazardous

3 of 38

During ML modeling, how would you deal with this column?

5 of 38

During ML modeling, how would you deal with this column?

These columns are collinear, so we keep only one of the two

6 of 38

What about this one?

8 of 38

What about this one?

One Hot Encoding is impractical here (see the sketch below)

  • The dimensionality explodes
  • Many very rare categories (a long tail)
  • Unseen categories appear in the test set
  • OHE treats all categories as equidistant!
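
A minimal sketch (not from the talk) using scikit-learn's OneHotEncoder on a hypothetical job-title column, to make these points concrete:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical high-cardinality column with partly redundant spellings.
    train = pd.DataFrame({"position": [
        "Senior Engineer", "senior engineer", "Sr. Engineer",
        "Accountant", "Police Officer",
    ]})
    test = pd.DataFrame({"position": ["Junior Engineer"]})  # never seen during fit

    ohe = OneHotEncoder()                 # default handle_unknown="error"
    X_train = ohe.fit_transform(train)
    print(X_train.shape)                  # one column per distinct string: dimension grows with cardinality
    # ohe.transform(test)                 # would raise an error: unseen category
    # And all one-hot vectors are equidistant, even "Senior Engineer" vs "senior engineer".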

9 of 38

By extension, messy categories impact joining

10 of 38

“Dirty data” = #1 challenge

https://www.kaggle.com/code/ash316/novice-to-grandmaster

11 of 38

We need to treat categories as continuous entities

By embedding these categories

12 of 38

Meet skrub

Prepping tables for machine learning

skrub-data.org

13 of 38

⚠️ Note: we are migrating the project from “dirty_cat” to skrub

Current stable version:

pip install dirty_cat

14 of 38

Feature 1: Encoding

Gap Encoder

  • Factorizes substring count matrices
  • Models strings as a linear combination of substrings (usage sketch below)
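
A hedged usage sketch, using the import path of the current dirty_cat release (the skrub import will be equivalent after the migration); the column name and data are made up:

    import pandas as pd
    from dirty_cat import GapEncoder   # `from skrub import GapEncoder` after the migration

    df = pd.DataFrame({"employee_position_title": [
        "Senior Engineer", "Sr. Engineer", "Police Officer III", "Police Aide",
    ]})

    enc = GapEncoder(n_components=3)                     # number of latent "topics"
    X = enc.fit_transform(df[["employee_position_title"]])
    print(X.shape)                                       # (4, 3): one activation per topic
    print(enc.get_feature_names_out())                   # topics labelled by their frequent substrings
                                                         # (`get_feature_names` in older releases)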

15 of 38

Feature 1: Encoding

Gap Encoder

The Gamma-Poisson model is well suited to count statistics:

As f is a vector of counts, it is natural to model each of its elements with a Poisson distribution:

For the elements of x, we use a Gamma prior: it is the conjugate of the Poisson likelihood, and it also fosters soft sparsity (small but non-zero values):
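
The equations shown on the slide are missing from this export; below is a hedged reconstruction following the Gamma-Poisson factorization behind the GapEncoder (f: substring-count vector of one string, Λ: matrix of substring prototypes, x: non-negative activations used as the embedding):

    % f_j: count of substring j in the string, \Lambda: prototype ("topic") matrix,
    % x: non-negative activations, \alpha, \beta: Gamma hyperparameters.
    f_j \sim \operatorname{Poisson}\!\big((x\,\Lambda)_j\big), \qquad
    x_k \sim \operatorname{Gamma}(\alpha, \beta)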

16 of 38

Feature 1: Encoding

Gap Encoder

  • Much more scalable than the Similarity Encoder
  • Overall, gives better predictions

17 of 38

Feature 1: Encoding

Min Hash Encoder

  • Computes similarities with the MinHash technique, which approximates the Jaccard index (usage sketch below)
  • Extremely fast and scalable
  • But the resulting features aren’t interpretable
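
A hedged usage sketch with a made-up column, again using the current dirty_cat import path:

    import pandas as pd
    from dirty_cat import MinHashEncoder   # `from skrub import MinHashEncoder` after the migration

    df = pd.DataFrame({"employee_position_title": [
        "Senior Engineer", "Sr. Engineer", "Police Officer III", "Police Aide",
    ]})

    enc = MinHashEncoder(n_components=30)
    X = enc.fit_transform(df[["employee_position_title"]])   # shape (4, 30)
    # Each string is hashed through its character n-grams; similar strings share many
    # min-hash values, which approximates their Jaccard similarity.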

18 of 38

Feature 1: Encoding

Min Hash Encoder

19 of 38

Feature 1: Encoding

Min Hash Encoder

20 of 38

Typical performance

21 of 38

Feature 1: Encoding

Table Vectorizer

  • Automatically recognizes which columns need to be encoded
  • One Hot Encoding for low-cardinality columns (<40 categories) and the encoders above for high-cardinality ones (≥40) (usage sketch below)
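
A hedged usage sketch on a made-up mixed-type DataFrame; defaults follow the dirty_cat release current at the time:

    import pandas as pd
    from dirty_cat import TableVectorizer   # `from skrub import TableVectorizer` after the migration

    df = pd.DataFrame({
        "employee_position_title": ["Senior Engineer", "Police Officer III", "Police Aide"],  # high cardinality in practice
        "gender": ["F", "M", "F"],                                                            # low cardinality -> one-hot
        "salary": [85000, 61000, 42000],                                                      # numeric -> passed through
    })

    tv = TableVectorizer()        # cardinality threshold defaults to 40
    X = tv.fit_transform(df)
    print(tv.transformers_)       # inspect which transformer was assigned to which columns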

23 of 38

Feature 2: Joining

Fuzzy Join

  • More flexible than pd.merge
  • Returns the matching score, so we can choose a merging threshold (usage sketch below)
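
A hedged usage sketch; the tables are made up, and parameter and column names follow the dirty_cat docs of the time and may differ slightly between versions:

    import pandas as pd
    from dirty_cat import fuzzy_join   # `from skrub import fuzzy_join` after the migration

    main_table = pd.DataFrame({"country_name": ["France", "Germnay", "Itly"]})   # noisy spellings
    aux_table = pd.DataFrame({"country": ["France", "Germany", "Italy"],
                              "gdp": [2.9, 4.2, 2.1]})

    merged = fuzzy_join(
        main_table, aux_table,
        left_on="country_name", right_on="country",
        return_score=True,
    )
    # Keep only matches above a chosen similarity threshold.
    merged = merged[merged["matching_score"] > 0.5]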

25 of 38

Feature 2: Joining

Feature Augmenter

  • Enriches a base table with multiple fuzzy joins! (usage sketch below)
  • The Feature Augmenter is a scikit-learn transformer
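
A hedged usage sketch; the tables and key columns are made up, and the signature follows the dirty_cat docs at the time:

    import pandas as pd
    from dirty_cat import FeatureAugmenter

    main_table = pd.DataFrame({"airport": ["Charles de Gaule", "Heathrow Airprt"],   # noisy names
                               "delay": [12.0, 9.5]})
    airports = pd.DataFrame({"name": ["Charles de Gaulle", "Heathrow Airport"],
                             "country": ["France", "UK"]})

    # Fuzzy-join the auxiliary table onto the main table on the airport name.
    fa = FeatureAugmenter(tables=[(airports, "name")], main_key="airport")
    enriched = fa.fit_transform(main_table)   # main_table plus the fuzzy-matched airport columns
    # Being a scikit-learn transformer, it can be chained with TableVectorizer and an estimator in a Pipeline.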

26 of 38

Feature 2: Joining

Feature Augmenter

27 of 38

Feature 3: Deduplication

Hierarchical clustering

  • Enables count and groupby operations on noisy entities (usage sketch below)
  • Beware of potential loss of information.
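
A hedged usage sketch; the data is made up, and the exact return type of `deduplicate` (a mapping from noisy values to a representative spelling) may differ between versions:

    import pandas as pd
    from dirty_cat import deduplicate   # `from skrub import deduplicate` after the migration

    messy = pd.Series([
        "online payment", "online payement", "on-line payment",
        "wire transfer", "wire transfert",
    ])
    mapping = deduplicate(messy)          # hierarchical clustering on n-gram similarities
    clean = messy.map(mapping)            # each variant mapped to its cluster representative
    print(clean.value_counts())           # counts/groupby now operate on canonical categories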

28 of 38

Jupyter notebook demo

https://github.com/jovan-stojanovic/jupytercon2023

29 of 38

What’s next?

Leverage contextual embeddings and graphs

30 of 38

More learning, less cleaning

31 of 38

More learning, less cleaning

32 of 38

More learning, less cleaning

33 of 38

Automatic feature extraction

  1. Base table

34 of 38

Automatic feature extraction

35 of 38

Automatic feature extraction

36 of 38

Automatic feature extraction

37 of 38

Automatic feature extraction

38 of 38

Thank you!

Questions?
