J. Duncan
C. Singh
A. Agarwal
R. Kapoor
B. Yu
Veridical Flow
Yu & Kumbier, “Veridical Data Science” (2020)
A Python data science productivity package, heavily influenced by the PCS framework for the data science life cycle (DSLC)
Coding a PCS analysis from scratch is tedious (see the sketch below)
[Diagram: data perturbations feed into modeling choices and then evaluation, fanning out into many pipeline combinations]
Helpful, but not enough. What about:
Data cleaning?
Choice of evaluation metric?
Interpretation?
Prediction screening?
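For concreteness, here is a hypothetical from-scratch version of such a grid (not vflow; it assumes X_train, y_train, X_val, y_val are already in scope): every perturbation family adds another loop, plus manual bookkeeping of which combination produced which result.

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

results = {}
for seed in range(10):  # data perturbations (bootstrap resamples)
    Xb, yb = resample(X_train, y_train, random_state=seed)
    for strategy in ['mean', 'median']:  # cleaning perturbations
        imp = SimpleImputer(strategy=strategy).fit(Xb)
        for model in [LogisticRegression(), DecisionTreeClassifier()]:  # model perturbations
            model.fit(imp.transform(Xb), yb)
            preds = model.predict(imp.transform(X_val))
            results[(seed, strategy, type(model).__name__)] = accuracy_score(y_val, preds)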
Solution: collapse perturbations down to a single step
[Diagram: data perturbations, model perturbations, and evaluation each collapsed into a single pipeline step]
Distributed computation, tracking, caching, and saving with minimal code
Main abstraction: Vset
A “veridical set” (Vset) is a set of arbitrary perturbations for some step in the DSLC.
Vset('my_vset', [func1, func2, ...])  # provide a list of perturbations
build_vset('my_vset', func, params)  # provide a function and parameter variations
A perturbation can be any callable object, or any object that implements one of the unified interface methods below.
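A concrete construction sketch, assuming scikit-learn estimators (keyword names have varied across vflow versions, so arguments are passed positionally; treat the exact signatures as assumptions to check against the installed version):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, build_vset

# an explicit list of perturbations: three model classes
model_set = Vset('model', [LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()])

# one estimator varied over a parameter grid (one perturbation per combination)
rf_set = build_vset('rf', RandomForestClassifier, {'n_estimators': [100, 200], 'max_depth': [3, 5]})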
Vsets provide a unified interface for data science:
resample_set(X_train, y_train) # call a Vset directly to apply functions to input
impute_set.transform(X_train) # e.g., fill missing values
model_set.fit(X_train, y_train) # fit all learners in a Vset
model_set.predict(X_test) # classification / regression
model_set.predict_proba(X_test) # predict class probabilities
eval_set.evaluate(y_test, preds) # evaluate predictions
Veridical Flow constructs an implicit pipeline
[Diagram: the implicit pipeline DAG: init -> resample -> impute -> model -> eval]

# define a Vset for each pipeline step
resample_set = build_vset('resample', resample, ...)
impute_set = Vset('impute', [mean_impute, median_impute], ...)
model_set = Vset('model', [lr, dt, rf], ...)
eval_set = Vset('eval', [acc, auroc], ...)

# wrap the raw data so vflow can track it through the pipeline
X_train, y_train = init_args((X_train, y_train))
X_val, y_val = init_args((X_val, y_val))
X_test, y_test = init_args((X_test, y_test))

# run the pipeline: resample -> impute -> model -> eval
X_trains, y_trains = resample_set(X_train, y_train)
X_trains = impute_set.fit(X_trains).transform(X_trains)
model_set.fit(X_trains, y_trains)
preds = model_set.predict_proba(impute_set.transform(X_val))
metrics = eval_set.evaluate(y_val, preds)
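Each Vset output is a dictionary whose keys record the chain of perturbations that produced each value, which is what makes the inspection step below possible. A quick way to see this (the printed key format is illustrative; the key objects are vflow internals):

# keys trace the perturbation lineage (resample -> impute -> model -> metric)
for key, value in list(metrics.items())[:3]:
    print(key, value)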
Inspect metrics and perturbation intervals
df_metrics = dict_to_df(metrics)  # flatten the metrics dict into a DataFrame
perturbation_stats(df_metrics, 'eval', 'impute')  # summary stats across perturbations
Prediction screening
best_impute_set, best_model_set = \
filter_vset_by_metric(metrics, impute_set, model_set, ...)
[Diagram: the screened pipeline DAG: init -> resample -> impute -> model -> eval, with best_impute and best_model carried to the test path]

# refit the screened Vsets on combined train + validation data
X_trainval = best_impute_set.fit(X_trainval).transform(X_trainval)
best_model_set.fit(X_trainval, y_trainval)

# final evaluation on the held-out test set
test_preds = best_model_set.predict(best_impute_set.transform(X_test))
eval_set.evaluate(y_test, test_preds)
Computation, tracking, & sharing
Distributed compute on heterogeneous resources
Hyperparameter tracking + model saving
Caching
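A sketch of how these options attach to a Vset. The parameter names here (is_async for ray-based parallelism, cache_dir for joblib caching, tracking_dir for mlflow logging) reflect my reading of the package and should be checked against the vflow docs:

from sklearn.linear_model import LogisticRegression
from vflow import Vset

model_set = Vset('model', [LogisticRegression()],
                 is_async=True,            # distribute calls across a ray cluster (assumed flag)
                 cache_dir='./cache',      # cache results via joblib (assumed flag)
                 tracking_dir='./mlruns')  # log params/metrics to mlflow (assumed flag)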
How vflow incorporates PCS
P (predictability): makes evaluation simple; provides prediction screening
C (computability): distributed computation, caching, tracking, and saving
S (stability): vary over large sets of perturbations, with utilities to compute perturbation intervals
Thank you!