1 of 96

scikit-learn and beyond

AXA Developer Summit - Köln, 26th of May 2023

2 of 96

Speakers

François Goupil

scikit-learn growth developer @consortium scikit-learn

Vincent Maladière

machine learning engineer @Inria contributor @scikit-learn @skrub @hazardous

3 of 96

scikit-learn and beyond

  1. What have we achieved?
  2. scikit-learn's take on Deep Learning, GPUs and Big Data
  3. What is our vision?
  4. Prepping tables for machine learning

4 of 96

1- Achievements

5 of 96

around 2009

Created @

6 of 96

Top used machine learning library

Some key elements and figures showing the widespread adoption of the scikit-learn library across the machine learning ecosystem.

7 of 96

Windows

Linux

Mac

source: internal survey

8 of 96

Industry

Academy

Other

source: internal survey

9 of 96

An industry standard

More than 70% of data scientists use scikit-learn on a regular basis.

source: Kaggle survey 2022 (23,997 answers)

10 of 96

Which of the following machine learning frameworks do you use on a regular basis?

[Bar chart, 0–75% of respondents:] scikit-learn, TensorFlow, Keras, PyTorch, XGBoost, LightGBM, None, HuggingFace, CatBoost, PyTorch Lightning, Caret, Fast.ai, Other, Tidymodels, JAX

source: Kaggle survey 2022 (23,997 answers)

11 of 96

A scientific standard

Cited in 73,749 research papers

source: Google Scholar

12 of 96

35,545,012 downloads last month using pip
(16,591,753 for TensorFlow; 8,991,285 for Keras; 46,733 for PyTorch)

13,673,785 total downloads on conda-forge
(11,588,868 for PyTorch on pytorch-lts; 3,431,449 for TensorFlow on conda-forge; 2,374,110 for Keras on conda-forge)

source: PyPI and Anaconda

13 of 96

8.7M users

31M sessions

68M page views in 2022

source: scikit-learn.org analytics

14 of 96

2,664 contributors since the start of the project

source: GitHub, March 23

15 of 96

around 10 dependencies

16 of 96

Whereas

17 of 96

577,169 repositories &

11,948 packages rely on scikit-learn

source: GitHub, May 23

18 of 96

#7 Python project on GitHub
#127 project overall on GitHub

source: GitHub rankings based on stargazers, Gitstar Ranking, March 23

19 of 96

Industrial sponsors of the library

This machine learning library is brought to you by responsible actors fostering Open Source Software and Digital Commons.

20 of 96

2- scikit-learn's take on Deep Learning, GPUs and Big Data

21 of 96

It's not all about

Deep Learning, GPUs and Big Data

We believe these statements hold true, but we aim to stay technologically agnostic and always on the move.

22 of 96

It's not all about Deep Learning

When asked "Which of the following ML algorithms do you use on a regular basis?", the top two answers are classic machine learning methods, not deep learning.

23 of 96

Which of the following ML algorithms do you use on a regular basis?

[Bar chart, 0–75% of respondents:] Linear or Logistic Regression, Decision Trees or Random Forests, Convolutional Neural Networks, Gradient Boosting Machines (XGBoost, LightGBM, etc.), Bayesian Approaches, Dense Neural Networks (MLPs, etc.), Recurrent Neural Networks, Transformer Networks (BERT, GPT-3, etc.), Neural Networks, None, Autoencoder Networks (DAE, VAE, etc.), Generative Adversarial Networks, Evolutionary Approaches, Other

source: Kaggle survey 2022 (23,997 answers)

24 of 96

Why? Deep learning underperforms on tables

source: Why do tree-based models still outperform deep learning on tabular data? - Léo Grinsztajn et al.

25 of 96

A scikit-learn compatible neural network library that wraps PyTorch.

26 of 96

Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

27 of 96

Better interoperability: the standard Array API

#22554, #25956, #26372, #26315, #26243

Interface with other array libraries (CuPy, PyTorch) using the standard Array API

28 of 96

29 of 96

It's not all about GPUs

scikit-learn FAQ: Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.

30 of 96

Meanwhile…

WIP GPU plugins

Aiming at easy software compatibility across CPUs, iGPUs and GPUs from all manufacturers.

We are working on a plugin for GPU backends based on SYCL open standards.

One core developer is also working on a plugin approach for Nvidia.

31 of 96

Meanwhile...

other work paving the GPU's way

Improve pairwise-distance-based algorithms in scikit-learn using Cython (authored by Julien Jerphanion)

#22554, #25956, #26372, #26315, #26243: Interface with other array libraries (CuPy, PyTorch) using the standard Array API

#22438: Plug third-party backends into scikit-learn estimators

32 of 96

33 of 96

It's not all about Big Data

Distorted public discourse.

34 of 96

[Bar chart, 0–25% of respondents per dataset size:] Less than 1 MB; 1.1 to 10 MB; 11 to 100 MB; 101 MB to 1 GB; 1.1 GB to 10 GB; 11 to 100 GB; 101 GB to 1 TB; 1.1 TB to 10 TB; 11 to 100 TB; 101 TB to 1 PB; 1.1 to 10 PB; 11 to 100 PB; Over 100 PB

source: KDnuggets poll 2018 (1,108 answers)

35 of 96

source: Big Data is dead - Jordan Tigani https://motherduck.com/blog/big-data-is-dead/

36 of 96

source: Big Data is dead - Jordan Tigani https://motherduck.com/blog/big-data-is-dead/

37 of 96

source: Big Data is dead - Jordan Tigani https://motherduck.com/blog/big-data-is-dead/

38 of 96

out-of-core approach

Learning from data that doesn't fit into main memory is getting more and more efficient, using one single machine. Ex: 500 million rows aggregated/grouped-by in a few seconds.
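To make the out-of-core idea concrete, here is a minimal pure-Python sketch (hypothetical data: one million synthetic (group, value) rows) that aggregates chunk by chunk, so only one chunk is ever held in memory:

```python
from collections import defaultdict

def chunked_rows(n_rows, chunk_size=100_000):
    """Simulate streaming (group, value) rows in fixed-size chunks."""
    for start in range(0, n_rows, chunk_size):
        yield [(i % 7, 1.0) for i in range(start, min(start + chunk_size, n_rows))]

# Aggregate (sum per group) without ever materializing the full dataset.
totals = defaultdict(float)
for chunk in chunked_rows(1_000_000):
    for group, value in chunk:
        totals[group] += value

print(sum(totals.values()))  # one million rows aggregated, one chunk at a time
```

The same pattern powers out-of-core tools: stream, aggregate, discard.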

39 of 96

Some Nuance

source: Big Data is dead...Long live big data - Aditya Parameswaran https://ponder.io/big-data-is-dead-long-live-big-data/

40 of 96

source: Big Data is dead...Long live big data - Aditya Parameswaran https://ponder.io/big-data-is-dead-long-live-big-data/

41 of 96

What's next? The Future

Promising prospects for funding and sponsorship open up promising prospects for development.

42 of 96

More work on

Data Interoperability

43 of 96

More efforts on Data Wrangling

44 of 96

More Performance

45 of 96

More Explainability

46 of 96

More MLOps

47 of 96

More UX

48 of 96

More

Documentation & Training

49 of 96

More

Community animation

50 of 96

More

Hardware support

51 of 96

3- Vision

52 of 96

No easy access to

Large compute infrastructures (clusters of GPUs)

Trained ML experts

Huge and clean datasets

Huge and Clean Datasets

53 of 96

Data science for the many, not only the mighty

We need to enable machine learning without large resources, infinite data, or an Ivy-League education.

54 of 96

Keep making it easier for non-ML experts to use ML correctly.

We are on a mission

55 of 96

Generative AI, LLMs...

What started with OpenAI's ChatGPT has bloomed into a rapidly evolving subcategory of technology. Its capabilities are growing, and so are the criticism and fears.

56 of 96

Not the cutting-edge AI but the rock-solid ML

We want to build the basic, fundamental machine learning that society needs, with a clear focus on stability and reliability. We want scikit-learn to embody frugal, understandable, explainable and reliable machine learning.

57 of 96

In a nutshell

Fundamental and Trusted Machine Learning

58 of 96

Where to convey the vision?

A lot of space for scikit-learn to deploy its vision

source: Google Cloud - Mlops continuous delivery and automation pipelines in machine learning

59 of 96

Branch out

And extend the vision

60 of 96

4- Prepping tables for machine learning

61 of 96

During ML modeling, how would you deal with this column?

62 of 96

During ML modeling, how would you deal with this column?

63 of 96

During ML modeling, how would you deal with this column?

These columns are collinear, so we keep only one of the two.

64 of 96

What about this one?

65 of 96

What about this one?

66 of 96

What about this one?

One-Hot Encoding is impractical

  • The dimensionality explodes
  • Many very rare categories (long tail)
  • Unseen categories in the test set
  • OHE considers all categories as equidistant!
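To illustrate the first and last points, here is a minimal one-hot sketch on a hypothetical job-title column (plain Python, not scikit-learn's OneHotEncoder):

```python
# A tiny one-hot encoder sketch on a hypothetical high-cardinality column.
train = ["nurse", "Nurse", "sr. nurse", "engineer", "eng.", "manager"]

categories = sorted(set(train))
index = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    vec = [0] * len(categories)
    if value in index:                # unseen categories get an all-zero row
        vec[index[value]] = 1
    return vec

# One dimension per distinct string: near-duplicates each get their own
# column, and every pair of categories is equally distant (Hamming distance 2).
print(len(categories))              # 6 columns for 6 distinct strings
print(one_hot("senior nurse"))      # [0, 0, 0, 0, 0, 0] -> no information
```

On real tables with thousands of distinct strings, the column count explodes accordingly.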

67 of 96

By extension, messy categories impact joining

68 of 96

“Dirty data” = #1 challenge

https://www.kaggle.com/code/ash316/novice-to-grandmaster

69 of 96

We need to treat categories as continuous entities

By embedding these categories

70 of 96

Meet skrub

Prepping tables for machine learning

skrub-data.org

71 of 96

⚠️ Note that we are migrating the project from “dirty_cat”

Current stable version:

pip install dirty_cat

72 of 96

Feature 1: Encoding

Gap Encoder

  • Factorizes substring count matrices
  • Models strings as a linear combination of substrings
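As a sketch of the substring counts being factorized, here is how a character 3-gram count vector can be built in plain Python (hypothetical strings; the real GapEncoder builds such counts internally):

```python
from collections import Counter

def char_ngrams(s, n=3):
    s = f" {s.lower()} "              # pad so word boundaries form n-grams
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# f: the substring-count representation that the Gap Encoder factorizes.
f_nurse = Counter(char_ngrams("senior nurse"))
f_typo = Counter(char_ngrams("senior nursse"))

# Shared 3-grams make the two spellings close, unlike one-hot encoding.
shared = set(f_nurse) & set(f_typo)
print(sorted(shared))
```

Factorizing a matrix of such count vectors yields embeddings where similar strings end up close together.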

73 of 96

Feature 1: Encoding

Gap Encoder

The Gamma-Poisson model is well suited to count statistics:

As f is a vector of counts, it is natural to consider a Poisson distribution for each of its elements.

For the elements of x, we use a Gamma prior: it is the conjugate of the Poisson likelihood, and it also fosters soft sparsity (small but non-zero values).
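In symbols, the model described above can be sketched as follows (a hedged reconstruction: f denotes the substring-count vector of an entry, x its activations, and Λ the matrix of latent substring prototypes; the notation is assumed, not taken verbatim from the slide):

```latex
% Each count f_j gets a Poisson likelihood with rate (x \Lambda)_j:
f_j \sim \operatorname{Poisson}\!\big((x \Lambda)_j\big)
% Each activation x_k gets a Gamma prior, the conjugate of the Poisson,
% which fosters soft sparsity (small but non-zero values):
x_k \sim \operatorname{Gamma}(\alpha_k, \beta_k)
```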

74 of 96

Feature 1: Encoding

Gap Encoder

  • Much more scalable than the Similarity Encoder
  • Gives better predictions overall

75 of 96

Feature 1: Encoding

Min Hash Encoder

  • Computes similarity with the MinHash technique, approximating the Jaccard index
  • Extremely fast and scalable
  • But results aren't interpretable
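A minimal pure-Python sketch of the idea, using zlib.crc32 as a cheap seeded hash over character 3-grams (hypothetical strings; not skrub's actual MinHashEncoder implementation):

```python
import zlib

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash(tokens, n_hashes=128):
    """Signature: for each seed, the minimum hash value over the token set."""
    return [min(zlib.crc32(t.encode(), seed) for t in tokens)
            for seed in range(n_hashes)]

a, b = ngrams("senior nurse"), ngrams("senior nursse")
sig_a, sig_b = minhash(a), minhash(b)

# The fraction of matching signature slots estimates the Jaccard index.
approx = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
exact = len(a & b) / len(a | b)
print(round(approx, 2), round(exact, 2))
```

The signatures are fixed-length vectors, which is what makes the encoder so fast: no pairwise string comparisons at transform time.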

76 of 96

Feature 1: Encoding

Min Hash Encoder

77 of 96

Feature 1: Encoding

Min Hash Encoder

78 of 96

Typical performance

79 of 96

Feature 1: Encoding

Table Vectorizer

  • Automatically recognize categories that need to be encoded
  • One Hot Encoding on low cardinality (<40) and Encoders for high cardinality (≥40)
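The dispatch rule above can be sketched as follows (the threshold of 40 comes from the slide; the function and returned encoder names are illustrative placeholders, not the actual TableVectorizer API):

```python
# Sketch of cardinality-based encoder dispatch (names are placeholders).
def choose_encoder(column_values, threshold=40):
    cardinality = len(set(column_values))
    if cardinality < threshold:
        return "OneHotEncoder"      # low cardinality: one-hot is fine
    return "GapEncoder"             # high cardinality: use an embedding encoder

city = ["Paris", "Köln", "Paris"]                 # low cardinality
job = [f"job title {i}" for i in range(500)]      # high cardinality
print(choose_encoder(city), choose_encoder(job))  # OneHotEncoder GapEncoder
```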

80 of 96

Feature 1: Encoding

Table Vectorizer


81 of 96

Feature 2: Joining

Fuzzy Join

  • More flexible than pd.merge
  • Returns the matching score, so we can choose a merging threshold
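A sketch of the idea using difflib from the standard library (hypothetical data; skrub's fuzzy_join uses more robust string similarities, this only illustrates matching scores and thresholding):

```python
from difflib import SequenceMatcher

def fuzzy_join(left, right, threshold=0.6):
    """For each left key, keep the best-matching right key and its score."""
    matches = []
    for key in left:
        best = max(right, key=lambda r: SequenceMatcher(None, key, r).ratio())
        score = SequenceMatcher(None, key, best).ratio()
        # Matches below the threshold are rejected instead of merged.
        matches.append((key, best if score >= threshold else None, round(score, 2)))
    return matches

countries = ["Germany", "Frnace", "Köln"]     # messy keys
reference = ["France", "Germany", "Cologne"]  # clean reference table
for row in fuzzy_join(countries, reference):
    print(row)
```

The typo "Frnace" still matches "France", while "Köln" scores too low against "Cologne" and is left unmatched rather than wrongly merged.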

82 of 96

Feature 2: Joining

Fuzzy Join


83 of 96

Feature 2: Joining

Feature Augmenter

  • Enrich a base table with multiple fuzzy joins!
  • Feature Augmenter is a scikit-learn transformer

84 of 96

Feature 2: Joining

Feature Augmenter

85 of 96

Feature 3: Deduplication

Hierarchical clustering

  • Enables counting and group-by operations on noisy entities
  • Beware of potential loss of information
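A greedy pure-Python sketch of the idea (hypothetical data; skrub's deduplication uses hierarchical clustering on string distances, this is only an illustration):

```python
from difflib import SequenceMatcher

def deduplicate(values, threshold=0.7):
    """Greedy single-linkage sketch: attach each value to the first cluster
    whose representative is similar enough, else start a new cluster."""
    clusters = []  # list of (representative, members)
    for v in values:
        for rep, members in clusters:
            if SequenceMatcher(None, v.lower(), rep.lower()).ratio() >= threshold:
                members.append(v)
                break
        else:
            clusters.append((v, [v]))
    return {rep: members for rep, members in clusters}

noisy = ["London", "london", "Londn", "Paris", "paris", "Pariss"]
print(deduplicate(noisy))  # spelling variants grouped under one representative
```

Counting per representative then gives clean group-by statistics, at the risk of merging entities that were genuinely distinct.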

86 of 96

Jupyter notebook demo

https://github.com/jovan-stojanovic/jupytercon2023

87 of 96

What’s next?

Leverage contextual embeddings and graphs

88 of 96

More learning, less cleaning

89 of 96

More learning, less cleaning

90 of 96

More learning, less cleaning

91 of 96

Automatic feature extraction

  1. Base table

92 of 96

Automatic feature extraction

93 of 96

Automatic feature extraction

94 of 96

Automatic feature extraction

95 of 96

Automatic feature extraction

96 of 96