J. Duncan
C. Singh
A. Agarwal
R. Kapoor
B. Yu
Veridical Flow
Yu & Kumbier, “Veridical Data Science” (2020)
A Python data science productivity package, heavily influenced by the PCS framework for the data science life cycle (DSLC)
Coding a PCS analysis from scratch is tedious (see the sketch below)
[Diagram: data perturbations feed into modeling choices and then evaluation, fanning out into many pipeline combinations]
Helpful, but not enough. What about:
Data cleaning?
Choice of evaluation metric?
Interpretation?
Prediction screening?
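For concreteness, here is a hypothetical from-scratch version of such a grid (not vflow; it assumes X_train, y_train, X_val, y_val are already in scope): every perturbation family adds another loop, plus manual bookkeeping of which combination produced which result.

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

results = {}
for seed in range(10):  # data perturbations (bootstrap resamples)
    Xb, yb = resample(X_train, y_train, random_state=seed)
    for strategy in ['mean', 'median']:  # cleaning perturbations
        imp = SimpleImputer(strategy=strategy).fit(Xb)
        for model in [LogisticRegression(), DecisionTreeClassifier()]:  # model perturbations
            model.fit(imp.transform(Xb), yb)
            preds = model.predict(imp.transform(X_val))
            results[(seed, strategy, type(model).__name__)] = accuracy_score(y_val, preds)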
Solution: collapse perturbations down to a single step
[Diagram: data perturbations, model perturbations, and evaluation each collapsed into a single pipeline step]
Distributed computation, tracking, caching, and saving with minimal code
Main abstraction: Vset
A “veridical set” (Vset) is a set of arbitrary perturbations for some step in the DSLC.
Vset('my_vset', [func1, func2, ...])  # provide a list of perturbations
build_vset('my_vset', func, params)  # provide a function and parameter variations
A perturbation can be any callable object, or any object that implements one of the unified interface methods below.
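A concrete construction sketch, assuming scikit-learn estimators (keyword names have varied across vflow versions, so arguments are passed positionally; treat the exact signatures as assumptions to check against the installed version):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, build_vset

# an explicit list of perturbations: three model classes
model_set = Vset('model', [LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()])

# one estimator varied over a parameter grid (one perturbation per combination)
rf_set = build_vset('rf', RandomForestClassifier, {'n_estimators': [100, 200], 'max_depth': [3, 5]})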
Vsets provide a unified interface for data science:
resample_set(X_train, y_train) # call a Vset directly to apply functions to input
impute_set.transform(X_train) # e.g., fill missing values
model_set.fit(X_train, y_train) # fit all learners in a Vset
model_set.predict(X_test) # classification / regression
model_set.predict_proba(X_test) # predict class probabilities
eval_set.evaluate(y_test, preds) # evaluate predictions
Veridical Flow constructs an implicit pipeline
[Diagram: the implicit pipeline DAG: init -> resample -> impute -> model -> eval]

# define a Vset for each pipeline step
resample_set = build_vset('resample', resample, ...)
impute_set = Vset('impute', [mean_impute, median_impute], ...)
model_set = Vset('model', [lr, dt, rf], ...)
eval_set = Vset('eval', [acc, auroc], ...)

# wrap the raw data so vflow can track it through the pipeline
X_train, y_train = init_args((X_train, y_train))
X_val, y_val = init_args((X_val, y_val))
X_test, y_test = init_args((X_test, y_test))

# run the pipeline: resample -> impute -> model -> eval
X_trains, y_trains = resample_set(X_train, y_train)
X_trains = impute_set.fit(X_trains).transform(X_trains)
model_set.fit(X_trains, y_trains)
preds = model_set.predict_proba(impute_set.transform(X_val))
metrics = eval_set.evaluate(y_val, preds)
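Each Vset output is a dictionary whose keys record the chain of perturbations that produced each value, which is what makes the inspection step below possible. A quick way to see this (the printed key format is illustrative; the key objects are vflow internals):

# keys trace the perturbation lineage (resample -> impute -> model -> metric)
for key, value in list(metrics.items())[:3]:
    print(key, value)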
Inspect metrics and perturbation intervals
df_metrics = dict_to_df(metrics)  # flatten the metrics dict into a DataFrame
perturbation_stats(df_metrics, 'eval', 'impute')  # summary stats across perturbations
Prediction screening
best_impute_set, best_model_set = \
filter_vset_by_metric(metrics, impute_set, model_set, ...)
[Diagram: the screened pipeline DAG: init -> resample -> impute -> model -> eval, with best_impute and best_model carried to the test path]

# refit the screened Vsets on combined train + validation data
X_trainval = best_impute_set.fit(X_trainval).transform(X_trainval)
best_model_set.fit(X_trainval, y_trainval)

# final evaluation on the held-out test set
test_preds = best_model_set.predict(best_impute_set.transform(X_test))
eval_set.evaluate(y_test, test_preds)
Computation, tracking, & sharing
Distributed compute on heterogeneous resources
Hyperparameter tracking + model saving
Caching
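A sketch of how these options attach to a Vset. The parameter names here (is_async for ray-based parallelism, cache_dir for joblib caching, tracking_dir for mlflow logging) reflect my reading of the package and should be checked against the vflow docs:

from sklearn.linear_model import LogisticRegression
from vflow import Vset

model_set = Vset('model', [LogisticRegression()],
                 is_async=True,            # distribute calls across a ray cluster (assumed flag)
                 cache_dir='./cache',      # cache results via joblib (assumed flag)
                 tracking_dir='./mlruns')  # log params/metrics to mlflow (assumed flag)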
How vflow incorporates PCS
P (predictability): makes evaluation simple; provides prediction screening
C (computability): distributed computation, caching, tracking, and saving
S (stability): vary over large sets of perturbations, with utilities to compute perturbation intervals
Thank you!