scikit-learn and beyond
AXA Developer Summit, Köln, 26 May 2023
Speakers
François Goupil
scikit-learn growth developer @consortium scikit-learn
Vincent Maladière
machine learning engineer @Inria, contributor @scikit-learn @skrub @hazardous
scikit-learn and beyond
1- Achievements
Created around 2009
The most widely used machine learning library
Some key elements and figures to understand the widespread adoption of the scikit-learn library across the machine learning ecosystem.
Chart: users by operating system (Windows, Linux, Mac) - source: internal survey
Chart: users by sector (industry, academia, other) - source: internal survey
An industry standard
scikit-learn is used regularly by more than 70% of data scientists
source: Kaggle survey 2022 (23,997 answers)
Chart: "Which of the following machine learning frameworks do you use on a regular basis?" Answer options shown: scikit-learn, TensorFlow, Keras, PyTorch, XGBoost, LightGBM, None, Hugging Face, CatBoost, PyTorch Lightning, Caret, Fast.ai, Other, Tidymodels, JAX.
source: Kaggle survey 2022 (23,997 answers)
A scientific standard
Cited in 73,749 research papers
source: Google Scholar
35,545,012 downloads last month using pip
(16,591,753 for TensorFlow, 8,991,285 for Keras, 46,733 for PyTorch)
13,673,785 total downloads on conda-forge
(11,588,868 for PyTorch on pytorch-lts, 3,431,449 for TensorFlow on conda-forge, 2,374,110 for Keras on conda-forge)
source: PyPI and Anaconda
8.7M users
31M sessions
68M page views in 2022
source: scikit-learn.org analytics
2,664 contributors since the start of the project
source: GitHub, March 2023
Around 10 dependencies, whereas 577,169 repositories and 11,948 packages rely on scikit-learn
source: GitHub, May 2023
#7 Python project on GitHub, #127 project on GitHub overall
source: GitHub rankings based on stargazers, Gitstar Ranking, March 2023
Industrial sponsors of the library
This machine learning library is brought to you by responsible actors fostering open source software and digital commons.
2- scikit-learn take on Deep Learning, GPU and Big Data
It's not all about
Deep Learning, GPUs and Big Data
We hold these statements to be true today, but we aim to stay technologically agnostic and always on the move.
It's not all about Deep Learning
When asked "Which of the following ML algorithms do you use on a regular basis?", the top two answers are classical (non-deep-learning) machine learning algorithms.
Chart: "Which of the following ML algorithms do you use on a regular basis?" Answer options shown: linear or logistic regression, decision trees or random forests, convolutional neural networks, gradient boosting machines (XGBoost, LightGBM, etc.), Bayesian approaches, dense neural networks (MLPs, etc.), recurrent neural networks, transformer networks (BERT, GPT-3, etc.), neural networks, None, autoencoder networks (DAE, VAE, etc.), generative adversarial networks, evolutionary approaches, Other.
source: Kaggle survey 2022 (23,997 answers)
Why? Deep learning underperforms on tabular data
source: Why do tree-based models still outperform deep learning on tabular data? - Léo Grinsztajn et al.
skorch: a scikit-learn compatible neural network library that wraps PyTorch.
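A minimal sketch of what this looks like in practice, assuming a toy two-class problem (the module, hyperparameters and data below are illustrative, not from the talk):

import numpy as np
import torch.nn as nn
from skorch import NeuralNetClassifier

class MLP(nn.Module):
    # A small PyTorch module; skorch wraps it into a scikit-learn estimator.
    def __init__(self, n_features=20, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32),
            nn.ReLU(),
            nn.Linear(32, n_classes),
            nn.Softmax(dim=-1),
        )

    def forward(self, X):
        return self.net(X)

# The wrapped network behaves like any scikit-learn classifier, so it can be
# used in a Pipeline, GridSearchCV or cross_val_score.
net = NeuralNetClassifier(MLP, max_epochs=10, lr=0.1)

X = np.random.rand(200, 20).astype(np.float32)
y = np.random.randint(0, 2, size=200).astype(np.int64)
net.fit(X, y)
print(net.predict(X[:5]))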
scikit-llm: seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.
Better interoperability: the standard Array API
#22554, #25956, #26372, #26315, #26243
Interface with other array libraries (CuPy, PyTorch) using the standard Array API
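A minimal sketch of what this enables, assuming a recent scikit-learn with experimental Array API support and the array-api-compat package installed (estimator coverage is limited; LinearDiscriminantAnalysis was among the first estimators supported):

import numpy as np
import sklearn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(100, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int64)

# With dispatch enabled, the same estimator code can also consume other
# Array API compatible inputs, e.g. torch tensors on CPU or GPU.
with sklearn.config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)
    preds = lda.predict(X)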
It's not all about GPUs
scikit-learn FAQ: Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.
Meanwhile…
WIP: GPU plugins
Anticipating easy software compatibility across CPUs, iGPUs and GPUs from all manufacturers.
We are working on a plugin for GPU backends based on the SYCL open standard.
One core developer is also working on a plugin approach for NVIDIA GPUs.
Meanwhile...
other work paving the way for GPUs
Improving pairwise-distance-based algorithms in scikit-learn using Cython (authored by Julien Jerphanion)
#22554, #25956, #26372, #26315, #26243: interface with other array libraries (CuPy, PyTorch) using the standard Array API
#22438: plug third-party backends into scikit-learn estimators
It's not all about Big Data
The public discourse around data sizes is distorted.
Chart: largest dataset analyzed, with answer buckets ranging from less than 1 MB to over 100 PB
source: KDnuggets poll 2018 (1,108 answers)
source: Big Data is dead - Jordan Tigani, https://motherduck.com/blog/big-data-is-dead/
Out-of-core approach
Learning from data that does not fit into main memory is increasingly efficient, even on a single machine. Example: 500 million rows aggregated and grouped by in a few seconds.
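On the scikit-learn side, a minimal sketch of the out-of-core pattern using partial_fit and chunked reading (file name, column names and chunk size are illustrative):

import pandas as pd
from sklearn.linear_model import SGDClassifier

# Estimators exposing partial_fit can learn incrementally from chunks streamed
# from disk, so the full dataset never needs to fit in memory.
clf = SGDClassifier()
classes = [0, 1]  # all classes must be declared on the first call

for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    clf.partial_fit(X, y, classes=classes)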
Some Nuance
source: Big Data is dead... Long live big data - Aditya Parameswaran, https://ponder.io/big-data-is-dead-long-live-big-data/
What's next? The Future
Thanks to good prospects for funding and sponsorship, we also have good prospects for development.
More work on data interoperability
More effort on data wrangling
More performance
More explainability
More MLOps
More UX
More documentation & training
More community building
More hardware support
3- Vision
No easy access to:
Large compute infrastructures (clusters of GPUs)
Trained ML experts
Huge and clean datasets
Data science for the many, not only the mighty
We need to enable machine learning without large resources, infinite data, or an Ivy League education.
Keep making it easier for non-ML experts to use ML correctly.
We are on a mission
Generative AI, LLMs...
What started with OpenAI's ChatGPT has bloomed into a rapidly evolving subcategory of technology. Its capabilities are growing, and so are the criticism and fears.
Not the cutting-edge AI, but the rock-solid ML
We want to build the basic, fundamental machine learning that society needs, with a clear focus on stability and reliability. We want scikit-learn to embody frugal, understandable, explainable and reliable machine learning.
In a nutshell
Fundamental and Trusted Machine Learning
Where to convey the vision?
A lot of space for scikit-learn to deploy its vision
source: Google Cloud - MLOps: continuous delivery and automation pipelines in machine learning
Branch out
And extend the vision
4- Prepping tables for machine learning
During ML modeling, how would you deal with this column?
These columns are collinear, so we keep only one of the two
What about this one?
One-hot encoding is impractical
By extension, messy categories impact joining
“Dirty data” = #1 challenge
https://www.kaggle.com/code/ash316/novice-to-grandmaster
We need to treat categories as continuous entities
By embedding these categories
Meet skrub
Prepping tables for machine learning
skrub-data.org
⚠️ Note that we are migrating the project from “dirty_cat”
Current stable version: pip install dirty_cat
Feature 1: Encoding
Gap Encoder
P.Cerda, G.Varoquaux. Encoding high-cardinality string categorical variables (2019)
Feature 1: Encoding
Gap Encoder
The Gamma-Poisson model is well suited to count statistics:
As f is a vector of counts, it is natural to consider a Poisson distribution for each of its elements.
For the prior on the elements of x, we use a Gamma prior, as it is the conjugate of the Poisson likelihood, and also because it fosters soft sparsity (small but non-zero values):
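A sketch of the resulting model, up to minor notational differences with the cited paper: f is the vector of character n-gram counts of a string entry, Lambda the matrix of latent topics, and x the activations to estimate.

f_j \mid x \sim \operatorname{Poisson}\big((x\,\Lambda)_j\big),
\qquad
p(x_k) = \frac{x_k^{\alpha - 1}\, e^{-x_k/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}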
P.Cerda, G.Varoquaux. Encoding high-cardinality string categorical variables (2019)
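A minimal usage sketch of the GapEncoder on a high-cardinality string column (column name and values are illustrative):

import pandas as pd
from dirty_cat import GapEncoder

df = pd.DataFrame({"employee_position_title": [
    "Police Officer II", "Police Officer III",
    "Firefighter/Rescuer I", "Office Services Coordinator",
]})

# Each category becomes a small vector of latent topic activations instead of
# a one-hot column per distinct string.
enc = GapEncoder(n_components=3)
embeddings = enc.fit_transform(df[["employee_position_title"]])
print(embeddings.shape)  # (4, 3)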
Feature 1: Encoding
Min Hash Encoder
P.Cerda, G.Varoquaux. Encoding high-cardinality string categorical variables (2019)
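A minimal usage sketch of the MinHashEncoder, a fast, stateless encoder that hashes character n-grams and pairs well with tree-based models (column name and values are illustrative):

import pandas as pd
from dirty_cat import MinHashEncoder

df = pd.DataFrame({"job": ["Police Officer II", "Firefighter/Rescuer I"]})

# Each string is mapped to a fixed-size vector of min-hash values.
enc = MinHashEncoder(n_components=10)
X = enc.fit_transform(df[["job"]])
print(X.shape)  # (2, 10)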
Typical performance
Feature 1: Encoding
Table Vectorizer
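A minimal usage sketch of the TableVectorizer (column names and data are illustrative): it inspects each column and routes it to a suitable encoder, e.g. numeric passthrough, one-hot encoding for low-cardinality strings, GapEncoder for high-cardinality ones.

import pandas as pd
from dirty_cat import TableVectorizer

df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Köln", "Paris", "Köln"],
    "job_title": ["Police Officer II", "Office Services Coordinator", "Firefighter/Rescuer I"],
})

# One call turns a heterogeneous dataframe into a single ML-ready matrix.
vec = TableVectorizer()
X = vec.fit_transform(df)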
Feature 2: Joining
Fuzzy Join
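A minimal usage sketch of fuzzy_join, joining two tables whose string keys do not match exactly (table contents are illustrative):

import pandas as pd
from dirty_cat import fuzzy_join

main = pd.DataFrame({"country": ["Germany", "France", "Italia"]})
aux = pd.DataFrame({
    "country_name": ["germany", "france", "italy"],
    "gdp_trillion_usd": [4.2, 2.9, 2.1],
})

# Rows are matched on string similarity rather than exact equality.
joined = fuzzy_join(main, aux, left_on="country", right_on="country_name")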
Feature 2: Joining
Feature Augmenter
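A hedged usage sketch of the FeatureAugmenter, a transformer that fuzzy-joins auxiliary tables onto the main table so the enrichment step fits inside a scikit-learn Pipeline (parameter names follow dirty_cat 0.4 and may differ in later versions; the tables reuse the fuzzy_join example above):

from dirty_cat import FeatureAugmenter

# One (auxiliary table, key column) pair per table to join onto the main key.
fa = FeatureAugmenter(tables=[(aux, "country_name")], main_key="country")
augmented = fa.fit_transform(main)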
Feature 3: Deduplication
Hierarchical clustering
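A minimal usage sketch of the deduplicate helper, which clusters spelling variants using hierarchical clustering on n-gram similarity and maps each variant to a canonical form in its cluster (data is illustrative):

from dirty_cat import deduplicate

messy = ["online payment", "online payment", "onlne payment",
         "wire transfer", "wire transfert"]
clean = deduplicate(messy)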
Jupyter notebook demo
https://github.com/jovan-stojanovic/jupytercon2023
What’s next?
Leverage contextual embeddings and graphs
More learning, less cleaning
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data sources: more learning, rather than more cleaning
Automatic feature extraction
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational Data Embeddings for Feature Enrichment with Background Information