A modular active learning framework for Python

Tivadar Danka

Bioimage Analysis and Machine Learning Group

Biological Research Centre, Hungarian Academy of Sciences

About me

Tivadar Danka

2013 - 2016: PhD in mathematics

2017 - : postdoc in computational biology at the Bioimage Analysis and Machine Learning Group, Hungarian Academy of Sciences

What is active learning?

Training dataset

Classification probabilities for a trained classifier

If additional unlabeled data is available, which ones should we label?

Informative

Not informative
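
A minimal sketch of what "informative" can mean, assuming we score each unlabeled sample by how unsure the trained classifier is about it. The probabilities and sample names below are made up for illustration; they are not taken from the slides.

```python
# Hypothetical class probabilities for three unlabeled samples (two classes).
probabilities = {
    "far_from_boundary": [0.98, 0.02],
    "somewhat_unsure": [0.70, 0.30],
    "near_boundary": [0.52, 0.48],
}


def least_confidence(class_probs):
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(class_probs)


# Score every sample and pick the one the classifier is least sure about.
scores = {name: least_confidence(p) for name, p in probabilities.items()}
most_informative = max(scores, key=scores.get)
print(most_informative)
```

Under this assumption, the sample closest to the decision boundary is the one worth labeling first.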

  • Gather data
  • Build a model
  • Can I gather more data?
  • yes: measure the utility of unlabeled samples, query for labels, then gather data and build the model again
  • no: employ the model
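
The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not modAL code: the data are integers from 0 to 100, the "model" is a single threshold, and `oracle`, `fit_threshold` and `utility` are hypothetical helpers invented for the example.

```python
def oracle(x):
    """Stand-in for a human annotator: hidden rule, label 1 if x >= 62."""
    return int(x >= 62)


def fit_threshold(labeled):
    """Build a model: midpoint between the largest 0 and the smallest 1."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2.0


def utility(x, threshold):
    """Samples close to the current decision threshold are the most useful."""
    return -abs(x - threshold)


labeled = [(0, 0), (100, 1)]       # gather data
pool = list(range(1, 100))         # unlabeled samples
budget = 4                         # number of labels we can afford

threshold = fit_threshold(labeled)                     # build a model
while budget > 0 and pool:                             # can I gather more data?
    query = max(pool, key=lambda x: utility(x, threshold))  # measure utility
    pool.remove(query)
    labeled.append((query, oracle(query)))             # query for labels
    threshold = fit_threshold(labeled)                 # rebuild the model
    budget -= 1

# Budget exhausted: employ the model.
print(threshold)
```

With only four queries the threshold moves from 50 to within a few units of the hidden rule, which is the point of choosing queries by utility rather than at random.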

  • uncertainty
  • max margin
  • entropy
  • hierarchical sampling
  • ...
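
The first three measures in the list above can be written down in a few lines. The sketch below uses the common textbook definitions and made-up probability vectors; the function names are chosen for this example and the code is not modAL's implementation.

```python
import math


def classification_uncertainty(probs):
    """Least confidence: 1 minus the highest class probability."""
    return 1.0 - max(probs)


def classification_margin(probs):
    """Gap between the two most likely classes (small gap = informative)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def classification_entropy(probs):
    """Shannon entropy of the predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = [0.90, 0.05, 0.05]   # hypothetical prediction, classifier is sure
unsure = [0.40, 0.35, 0.25]      # hypothetical prediction, classifier is torn

for name, probs in (("confident", confident), ("unsure", unsure)):
    print(
        name,
        classification_uncertainty(probs),
        classification_margin(probs),
        classification_entropy(probs),
    )
```

Note the direction of each measure: high uncertainty and high entropy mark informative samples, while for the margin it is the smallest value that does.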

  • max utility
  • ranked batch
  • spatially balanced
  • weighted random
  • ...
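
Once every unlabeled sample has a utility score, a query strategy decides which ones to send for labeling. A rough sketch of two of the strategies listed above, assuming utilities are non-negative and not all zero; both functions are illustrative and are not taken from modAL.

```python
import random


def max_utility(utilities, n_instances=1):
    """Indices of the n_instances samples with the highest utility."""
    order = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return order[:n_instances]


def weighted_random(utilities, n_instances=1, rng=None):
    """Draw distinct indices with probability proportional to utility.

    Assumes non-negative utilities with at least one positive value
    among the candidates still available at every draw.
    """
    rng = rng or random.Random()
    candidates = list(range(len(utilities)))
    weights = list(utilities)
    chosen = []
    for _ in range(min(n_instances, len(candidates))):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))  # remove so it cannot be drawn twice
        weights.pop(pick)
    return chosen


utilities = [0.10, 0.60, 0.02, 0.45]   # hypothetical utility scores
greedy_picks = max_utility(utilities, n_instances=2)
random_picks = weighted_random(utilities, n_instances=2, rng=random.Random(0))
print(greedy_picks, random_picks)
```

The greedy variant always exploits the current scores, while the weighted random variant trades some utility for diversity in the queried batch.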

  • kNN classifier
  • random forest
  • neural networks
  • ...

modAL: a modular active learning framework

Full scikit-learn compatibility

Intuitive use

Simple and unified API for the pipeline

Modularity and flexibility

Easy extensibility

modal: adjective, relating to structure as opposed to substance

(Merriam-Webster Dictionary)

modAL: modularity

from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

# initializing the learner
learner = ActiveLearner(
    # estimator: any object following the scikit-learn API
    estimator=RandomForestClassifier(),
    # query_strategy: any function taking an estimator object
    query_strategy=uncertainty_sampling,
    # initial labeled data
    X_training=X_training, y_training=y_training
)

# query for labels
query_idx, query_inst = learner.query(X_unlabeled)

# ...obtaining new labels from the Oracle...

# supply label for queried instance
learner.teach(X_unlabeled[query_idx], y_new)

stream-based and pool-based scenarios are supported, with almost no change in this code!
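
To make the difference between the two scenarios concrete: in the pool-based case the learner picks the best sample from a whole pool, while in the stream-based case samples arrive one by one and the learner decides on the spot whether to ask for a label. A minimal sketch of the stream-based decision, assuming a least-confidence score and a fixed threshold; `toy_predict_proba` and the threshold value are hypothetical and stand in for a trained classifier.

```python
import math


def toy_predict_proba(x):
    """Hypothetical two-class model: logistic curve over a 1-D feature."""
    p1 = 1.0 / (1.0 + math.exp(-4.0 * x))
    return [1.0 - p1, p1]


def stream_based_queries(stream, predict_proba, threshold=0.3):
    """Yield the stream items whose least-confidence score reaches threshold."""
    for instance in stream:
        uncertainty = 1.0 - max(predict_proba(instance))
        if uncertainty >= threshold:
            yield instance  # worth asking the oracle for a label


stream = [-2.0, -0.1, 0.05, 1.5, 0.2]
queried = list(stream_based_queries(stream, toy_predict_proba))
print(queried)
```

Only the instances close to the decision boundary at 0 are forwarded for labeling; the confidently classified ones are skipped.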

modAL: flexibility

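from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling, uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessRegressor

# three interchangeable ways to build the learner

# regression: max_std is the custom query strategy from the extensibility slide
learner = ActiveLearner(
    GaussianProcessRegressor(),
    max_std,
    X_training, y_training
)

# neural network: KerasClassifier is the scikit-learn wrapper of a Keras model
# (its import is not shown on the slide)
learner = ActiveLearner(
    KerasClassifier(),
    uncertainty_sampling,
    X_training, y_training
)

# random forest with entropy-based sampling
learner = ActiveLearner(
    RandomForestClassifier(),
    entropy_sampling,
    X_training, y_training
)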

# query for labels
query_idx, query_inst = learner.query(X_unlabeled)
# supply label for queried instance
learner.teach(X_unlabeled[query_idx], y_new)

modAL: extensibility

import numpy as np
from modAL.models import ActiveLearner
from sklearn.gaussian_process import GaussianProcessRegressor

# a custom query strategy: query the point where the
# regressor's predicted standard deviation is largest
def max_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    query_idx = np.argmax(std)
    return query_idx, X[query_idx]

only constraint: first argument must be the active learner
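
As a further illustration of that constraint, here is a sketch of another custom strategy that follows the same calling and return convention as `max_std`. It runs without modAL: `StubLearner` is a hypothetical stand-in that only provides `predict_proba`, and its probabilities are invented for the example.

```python
class StubLearner:
    """Hypothetical stand-in for a learner; NOT a modAL class."""

    def predict_proba(self, X):
        # Pretend two-class probabilities derived from a single feature in [0, 1].
        return [[1.0 - x[0], x[0]] for x in X]


def smallest_margin(learner, X):
    """Custom strategy: pick the sample whose top two classes are closest."""
    margins = []
    for probs in learner.predict_proba(X):
        top, second = sorted(probs, reverse=True)[:2]
        margins.append(top - second)
    query_idx = min(range(len(margins)), key=margins.__getitem__)
    # Same return convention as max_std: index and the queried instance.
    return query_idx, X[query_idx]


X_pool = [[0.9], [0.55], [0.1]]
query_idx, query_inst = smallest_margin(StubLearner(), X_pool)
print(query_idx, query_inst)
```

Because the strategy only relies on `predict_proba` of its first argument, any learner exposing that method could be passed in.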

# initializing the learner
# (kernel is a scikit-learn kernel object defined elsewhere)
learner = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel),
    query_strategy=max_std,
    X_training=X_training, y_training=y_training
)

modAL: scikit-learn compatibility

from modAL.models import ActiveLearner

learner = ActiveLearner(...)

learner.fit(X, y)

learner.predict(X)

learner.predict_proba(X)

learner.score(X, y)

from sklearn.model_selection import cross_val_score

scores = cross_val_score(learner, X, y, cv=10)

the ActiveLearner acts like a scikit-learn estimator

modAL: additional features

  • support
  • support (with )
  • Regression models
  • Stream-based sampling
  • Ensemble models, query by committee
  • Bayesian optimization

TODO:

  • Thorough profiling and optimization
  • Query synthesis
  • Advanced acquisition function optimization for Bayesian optimization
  • More active learning algorithms :)
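
Query by committee, listed among the features above, replaces the single model's uncertainty with the disagreement between several models. A minimal sketch of one common disagreement measure, vote entropy, assuming each committee member casts one label vote per sample; the votes are made up and the code is not modAL's implementation.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for a single sample."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical votes of a three-member committee on three unlabeled samples.
committee_votes = [
    ["cat", "cat", "cat"],    # full agreement
    ["cat", "dog", "cat"],    # mild disagreement
    ["cat", "dog", "bird"],   # maximal disagreement
]

disagreement = [vote_entropy(votes) for votes in committee_votes]
query_idx = max(range(len(disagreement)), key=disagreement.__getitem__)
print(query_idx)
```

The sample on which the members disagree most is queried, on the assumption that its label will do the most to narrow down which members are right.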

Contributions are welcome!

Thank you for your attention!

modAL is available on PyPI and GitHub at

https://github.com/modAL-python/modAL
