Faster data loading
More tricks
Erik Ylipää
Linköping University
AIDA Data Hub AI Support Lead
National Bioinformatics Infrastructure Sweden (NBIS)
SciLifeLab
AIDA Data Hub Support
Training a model
Support for how to split data (independence, resampling etc.)
Support for how to load data (efficiency, compatibility)
Support for designing and debugging machine learning models
Support for evaluating models
Support for secure data storage and compute
Task specific dataset
Training data
Test data
Machine Learning model
Trained Machine Learning model
Experiment results
Outline
CPU utilization
GPU utilization
btop visualization of CPU vs GPU utilization. Training is bottlenecked by data augmentation on CPU.
NVIDIA DALI
Defining a pipeline
While the older class-based Pipeline API is still available, standard randomized augmentations only work with the decorator
Basic example
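A sketch of what a decorated pipeline can look like (assumes the `nvidia-dali` package is installed and a GPU is available; the `file_root` path, batch size, and augmentation choices are illustrative, not part of the original slides):

```python
from nvidia.dali import pipeline_def, fn

# The pipeline_def decorator turns this function into a DALI pipeline
# factory. Inside, fn.* calls build a graph of GPU/CPU operators.
@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def basic_pipeline(data_dir):
    # Read encoded JPEGs and integer labels from class subdirectories
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    # "mixed" decodes partly on CPU, partly on GPU
    images = fn.decoders.image(jpegs, device="mixed")
    # Randomized augmentation: rotate each image by a random angle
    angle = fn.random.uniform(range=(-10.0, 10.0))
    images = fn.rotate(images, angle=angle, fill_value=0)
    return images, labels

pipe = basic_pipeline("/path/to/images")  # illustrative path
pipe.build()
images, labels = pipe.run()  # one batch, as DALI TensorLists
```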
What is a decorator
Decorators are functions that take another function as input and return a function as output
They are a “simple” way, from the user’s point of view, of changing the behaviour of a function
Used in many frameworks to “magically” inject themselves into your Python code
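To illustrate the pattern, a minimal hand-written decorator in plain Python (no DALI involved; the names `shout` and `greet` are made up for this example):

```python
import functools

def shout(func):
    """A decorator: takes a function, returns a new function."""
    @functools.wraps(func)  # preserve func's name and docstring
    def wrapper(*args, **kwargs):
        # Change the behaviour: upper-case whatever func returns
        return func(*args, **kwargs).upper()
    return wrapper

@shout  # equivalent to: greet = shout(greet)
def greet(name):
    return f"hello, {name}"

print(greet("dali"))  # HELLO, DALI
```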
NVIDIA DALI does more than most
The pipeline_def decorator actually transforms the source code of the function (or, more precisely, its Abstract Syntax Tree) to generate a new function using DALI primitives
This means that errors in your pipeline definitions can get weird error messages
DALI AutoGraph
Readers
The “source” of your data.
Analogous to a PyTorch Dataset in a DataLoader pipeline
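A sketch of a file reader (assumes DALI is installed and a GPU is available; the path is illustrative). `fn.readers.file` scans `file_root` for class subdirectories, much like torchvision's `ImageFolder`:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def reader_pipeline():
    jpegs, labels = fn.readers.file(
        file_root="/path/to/images",  # illustrative path
        random_shuffle=True,
        name="Reader",  # naming the reader lets iterators query epoch size
    )
    images = fn.decoders.image(jpegs, device="mixed")  # decode on GPU
    return images, labels
```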
External source
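`fn.external_source` lets arbitrary Python code feed the pipeline, for data DALI has no built-in reader for. A sketch (assumes DALI is installed; the random arrays and shapes are placeholders for real data):

```python
import numpy as np
from nvidia.dali import pipeline_def, fn, types

def batch_generator(batch_size=8):
    # Any Python iterable works: here we yield batches as
    # lists of numpy arrays (one array per sample)
    while True:
        yield [np.random.rand(128, 128, 3).astype(np.float32)
               for _ in range(batch_size)]

@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def external_pipeline():
    images = fn.external_source(source=batch_generator(), dtype=types.FLOAT)
    return images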
Pytorch integration
By default DALI uses its own ndarray format
There are wrappers which convert the arrays to PyTorch Tensors
Here we will use the DALIGenericIterator
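A sketch of wrapping a pipeline in a DALIGenericIterator (assumes DALI with the PyTorch plugin is installed and a GPU is available; the path is illustrative):

```python
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def train_pipeline():
    jpegs, labels = fn.readers.file(
        file_root="/path/to/images", random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    return images, labels

pipe = train_pipeline()
pipe.build()

# output_map names the pipeline outputs; reader_name lets the iterator
# determine the epoch size from the reader we called "Reader" above
loader = DALIGenericIterator(
    pipe, output_map=["images", "labels"], reader_name="Reader",
    last_batch_policy=LastBatchPolicy.PARTIAL)

for batch in loader:
    images = batch[0]["images"]  # torch.Tensor, already on the GPU
    labels = batch[0]["labels"]
```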
Tutorial: Refactoring to DALI
Hands-on refactoring