1 of 38

Data prepping for tabular learning

More learning and less cleaning

2 of 38

Who am I?

Vincent Maladière

Machine Learning Engineer @Inria Soda

Contributor @scikit-learn, @skrub and @hazardous

3 of 38

During ML modeling, how would you deal with this column?

5 of 38

During ML modeling, how would you deal with this column?

These columns are collinear, so we keep only one of the two

6 of 38

What about this one?

8 of 38

What about this one?

One Hot Encoding is impractical here (see the sketch below)

  • The dimensionality explodes
  • Many very rare categories (a long tail)
  • Unseen categories appear in the test set
  • OHE treats all categories as equidistant!
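
A minimal sketch (not from the talk) using scikit-learn's OneHotEncoder on a hypothetical job-title column, to make these points concrete:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical high-cardinality column with partly redundant spellings.
    train = pd.DataFrame({"position": [
        "Senior Engineer", "senior engineer", "Sr. Engineer",
        "Accountant", "Police Officer",
    ]})
    test = pd.DataFrame({"position": ["Junior Engineer"]})  # never seen during fit

    ohe = OneHotEncoder()                 # default handle_unknown="error"
    X_train = ohe.fit_transform(train)
    print(X_train.shape)                  # one column per distinct string: dimension grows with cardinality
    # ohe.transform(test)                 # would raise an error: unseen category
    # And all one-hot vectors are equidistant, even "Senior Engineer" vs "senior engineer".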

9 of 38

By extension, messy categories impact joining

10 of 38

“Dirty data” = #1 challenge

https://www.kaggle.com/code/ash316/novice-to-grandmaster

11 of 38

We need to treat categories as continuous entities

By embedding these categories

12 of 38

Meet skrub

Prepping tables for machine learning

skrub-data.org

13 of 38

⚠️ Note: we are migrating the project from “dirty_cat” to skrub

Current stable version:

pip install dirty_cat

14 of 38

Feature 1: Encoding

Gap Encoder

  • Factorizes substring count matrices
  • Models strings as a linear combination of substrings (usage sketch below)
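
A hedged usage sketch, using the import path of the current dirty_cat release (the skrub import will be equivalent after the migration); the column name and data are made up:

    import pandas as pd
    from dirty_cat import GapEncoder   # `from skrub import GapEncoder` after the migration

    df = pd.DataFrame({"employee_position_title": [
        "Senior Engineer", "Sr. Engineer", "Police Officer III", "Police Aide",
    ]})

    enc = GapEncoder(n_components=3)                     # number of latent "topics"
    X = enc.fit_transform(df[["employee_position_title"]])
    print(X.shape)                                       # (4, 3): one activation per topic
    print(enc.get_feature_names_out())                   # topics labelled by their frequent substrings
                                                         # (`get_feature_names` in older releases)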

15 of 38

Feature 1: Encoding

Gap Encoder

The Gamma-Poisson model is well suited to count statistics:

As f is a vector of counts, it is natural to model each of its elements with a Poisson distribution:

For the elements of x, we use a Gamma prior: it is the conjugate of the Poisson likelihood, and it also fosters soft sparsity (small but non-zero values):
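
The equations shown on the slide are missing from this export; below is a hedged reconstruction following the Gamma-Poisson factorization behind the GapEncoder (f: substring-count vector of one string, Λ: matrix of substring prototypes, x: non-negative activations used as the embedding):

    % f_j: count of substring j in the string, \Lambda: prototype ("topic") matrix,
    % x: non-negative activations, \alpha, \beta: Gamma hyperparameters.
    f_j \sim \operatorname{Poisson}\!\big((x\,\Lambda)_j\big), \qquad
    x_k \sim \operatorname{Gamma}(\alpha, \beta)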

16 of 38

Feature 1: Encoding

Gap Encoder

  • Much more scalable than the Similarity Encoder
  • Overall, gives better predictions

17 of 38

Feature 1: Encoding

Min Hash Encoder

  • Computes similarities with the MinHash technique, which approximates the Jaccard index (usage sketch below)
  • Extremely fast and scalable
  • But the resulting features aren’t interpretable
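
A hedged usage sketch with a made-up column, again using the current dirty_cat import path:

    import pandas as pd
    from dirty_cat import MinHashEncoder   # `from skrub import MinHashEncoder` after the migration

    df = pd.DataFrame({"employee_position_title": [
        "Senior Engineer", "Sr. Engineer", "Police Officer III", "Police Aide",
    ]})

    enc = MinHashEncoder(n_components=30)
    X = enc.fit_transform(df[["employee_position_title"]])   # shape (4, 30)
    # Each string is hashed through its character n-grams; similar strings share many
    # min-hash values, which approximates their Jaccard similarity.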

18 of 38

Feature 1: Encoding

Min Hash Encoder

19 of 38

Feature 1: Encoding

Min Hash Encoder

20 of 38

Typical performance

21 of 38

Feature 1: Encoding

Table Vectorizer

  • Automatically recognizes which columns need to be encoded
  • One Hot Encoding for low-cardinality columns (<40 categories) and the encoders above for high-cardinality ones (≥40) (usage sketch below)
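
A hedged usage sketch on a made-up mixed-type DataFrame; defaults follow the dirty_cat release current at the time:

    import pandas as pd
    from dirty_cat import TableVectorizer   # `from skrub import TableVectorizer` after the migration

    df = pd.DataFrame({
        "employee_position_title": ["Senior Engineer", "Police Officer III", "Police Aide"],  # high cardinality in practice
        "gender": ["F", "M", "F"],                                                            # low cardinality -> one-hot
        "salary": [85000, 61000, 42000],                                                      # numeric -> passed through
    })

    tv = TableVectorizer()        # cardinality threshold defaults to 40
    X = tv.fit_transform(df)
    print(tv.transformers_)       # inspect which transformer was assigned to which columns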

23 of 38

Feature 2: Joining

Fuzzy Join

  • More flexible than pd.merge
  • Returns the matching score, so we can choose a merging threshold (usage sketch below)
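
A hedged usage sketch; the tables are made up, and parameter and column names follow the dirty_cat docs of the time and may differ slightly between versions:

    import pandas as pd
    from dirty_cat import fuzzy_join   # `from skrub import fuzzy_join` after the migration

    main_table = pd.DataFrame({"country_name": ["France", "Germnay", "Itly"]})   # noisy spellings
    aux_table = pd.DataFrame({"country": ["France", "Germany", "Italy"],
                              "gdp": [2.9, 4.2, 2.1]})

    merged = fuzzy_join(
        main_table, aux_table,
        left_on="country_name", right_on="country",
        return_score=True,
    )
    # Keep only matches above a chosen similarity threshold.
    merged = merged[merged["matching_score"] > 0.5]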

25 of 38

Feature 2: Joining

Feature Augmenter

  • Enriches a base table with multiple fuzzy joins! (usage sketch below)
  • The Feature Augmenter is a scikit-learn transformer
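
A hedged usage sketch; the tables and key columns are made up, and the signature follows the dirty_cat docs at the time:

    import pandas as pd
    from dirty_cat import FeatureAugmenter

    main_table = pd.DataFrame({"airport": ["Charles de Gaule", "Heathrow Airprt"],   # noisy names
                               "delay": [12.0, 9.5]})
    airports = pd.DataFrame({"name": ["Charles de Gaulle", "Heathrow Airport"],
                             "country": ["France", "UK"]})

    # Fuzzy-join the auxiliary table onto the main table on the airport name.
    fa = FeatureAugmenter(tables=[(airports, "name")], main_key="airport")
    enriched = fa.fit_transform(main_table)   # main_table plus the fuzzy-matched airport columns
    # Being a scikit-learn transformer, it can be chained with TableVectorizer and an estimator in a Pipeline.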

26 of 38

Feature 2: Joining

Feature Augmenter

27 of 38

Feature 3: Deduplication

Hierarchical clustering

  • Enables count and groupby operations on noisy entities (usage sketch below)
  • Beware of potential loss of information.
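
A hedged usage sketch; the data is made up, and the exact return type of `deduplicate` (a mapping from noisy values to a representative spelling) may differ between versions:

    import pandas as pd
    from dirty_cat import deduplicate   # `from skrub import deduplicate` after the migration

    messy = pd.Series([
        "online payment", "online payement", "on-line payment",
        "wire transfer", "wire transfert",
    ])
    mapping = deduplicate(messy)          # hierarchical clustering on n-gram similarities
    clean = messy.map(mapping)            # each variant mapped to its cluster representative
    print(clean.value_counts())           # counts/groupby now operate on canonical categories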

28 of 38

Jupyter notebook demo

https://github.com/jovan-stojanovic/jupytercon2023

29 of 38

What’s next?

Leverage contextual embeddings and graphs

30 of 38

More learning, less cleaning

31 of 38

More learning, less cleaning

32 of 38

More learning, less cleaning

33 of 38

Automatic feature extraction

  1. Base table

34 of 38

Automatic feature extraction

35 of 38

Automatic feature extraction

36 of 38

Automatic feature extraction

37 of 38

Automatic feature extraction

38 of 38

Thank you!

Questions?
