1 of 37


Agenda (times in CST)

  • Introduction + AutoGluon Tabular: 2:00PM – 2:55PM
  • Break: 2:55PM – 3:05PM
  • AutoGluon Multimodal: 3:05PM – 4:00PM
  • Break: 4:00PM – 4:10PM
  • AutoGluon TimeSeries: 4:10PM – 4:50PM
  • Additional Q&A + Feedback: 4:50PM – 5:00PM

Workshop Website

*Note on Hands-on Notebooks

  • pip install autogluon

  • Restart the kernel

© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

2 of 37

AutoML for Time Series with AutoGluon

AutoGluon: Empowering (Multimodal) AutoML for the Next 10 Million Users at NeurIPS 2022

December 2022 | New Orleans, LA

Caner Türkmen, Sr. Applied Scientist @ Amazon Web Services

3 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: a time series split along the time axis into the observed "past" and the "future" to be forecast]

4 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: the observed "past" serves as the "features"; the "future" values are the "labels" to be predicted]

5 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: past "features" and future "labels" along the time axis]

Uncertainty quantification: (conditional) quantiles of the predictive distribution

  • e.g., the newsvendor problem
  • Finance: value at risk

6 of 37

Time Series Forecasting: Other Features

[Figure: a time series with three numbered kinds of additional features (1–3), spanning "past" and "future"]

7 of 37

Time Series Forecasting: Other Features

[Figure: a time series with three numbered kinds of additional features (1–3), spanning "past" and "future"]

1. "Known" time-varying covariates (related time series): covariates whose values, or surrogates thereof, will be known into the future at prediction time. Examples: weather (forecasts), dummy variables for holidays, promotions (plans).

2. Other time-varying covariates / related time series: related time series whose future values will not be known at prediction time. Examples: prices of related assets in finance, demand for related items in demand forecasting.

3. Item metadata / static features: static features describing the item that do not vary over time. Examples: the category of an item in the catalogue, the industry of a financial asset.
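As a concrete illustration of the three feature types, here is how they might sit in a pandas frame; all column names and values below are hypothetical, invented for the example:

```python
import pandas as pd

# Illustrative demand-forecasting frame: one row per (item, timestamp).
df = pd.DataFrame({
    "item_id":   ["A", "A", "A", "B", "B", "B"],
    "timestamp": pd.to_datetime(["2022-12-01", "2022-12-02", "2022-12-03"] * 2),
    "demand":    [12.0, 15.0, 14.0, 3.0, 4.0, 5.0],  # the target to forecast
    # 1) "known" covariate: holiday dummies are known arbitrarily far ahead
    "is_holiday": [0, 1, 0, 0, 1, 0],
    # 2) other covariate: demand of a related item, unknown in the future
    "related_demand": [7.0, 9.0, 8.0, 1.0, 2.0, 2.0],
})

# 3) static features: one row per item, constant over time
static = pd.DataFrame({"item_id": ["A", "B"], "category": ["grocery", "toys"]})

merged = df.merge(static, on="item_id")  # attach metadata to every time step
print(merged.head())
```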

8 of 37

Global vs local

Very often, a data set comprises multiple "items", i.e., many related time series.

In local models, we fit a separate parameter vector (θ1, θ2, …) to the dynamics of each time series (ETS, ARIMA, etc.).

Global models share a single parameter vector θ across different time series and learn common dynamics (neural networks).
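The distinction can be sketched with a toy AR(1) example (plain NumPy, not AutoGluon code): the local approach estimates one coefficient per series, while the global approach pools all series into a single shared estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy series that share the same AR(1) dynamics: y_t = 0.8 y_{t-1} + noise
series = []
for _ in range(3):
    y = [rng.normal()]
    for _ in range(50):
        y.append(0.8 * y[-1] + 0.1 * rng.normal())
    series.append(np.array(y))

# Local: one parameter per series (here, a per-series least-squares AR(1) fit)
local_thetas = [np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1]) for y in series]

# Global: a single shared parameter, fit on the pooled data from all series
global_theta = (sum(np.dot(y[:-1], y[1:]) for y in series)
                / sum(np.dot(y[:-1], y[:-1]) for y in series))

print(local_thetas, global_theta)
```

Because the series here truly share their dynamics, the pooled (global) estimate uses three times as much data per parameter; when dynamics differ across items, the local fits can adapt where the global one cannot.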

9 of 37

Machine learning methods are increasingly used in real-world forecasting use cases (* the majority of the methods below already power AutoGluon-TimeSeries)

Local: STL-AR, ETS, ARIMA, Theta, Prophet, Naive-1, Seasonal Naive, Naive

Global (DL): DeepAR, TFT, MQ-CNN/RNN, FF Networks, DeepState, Informer, N-BEATS, TCNs

Global (Tree): XGBoost, LightGBM, CatBoost, Rotbaum

10 of 37

Machine learning methods are taking over forecasting, but remain the subject of ongoing debate

  • Common scenario: a neural network forecaster "does not work"

  • Deep-learning-based forecasters have too many tuning knobs to get right, and require some knowledge of the "dark arts"

  • The simplest methods are very strong baselines (e.g., Naive-1), while complex models can require 100x the compute, for inference as well as training

  • The benefits of ML can be marginal and hard to determine: targets are heavy-tailed, and both training and inference are inherently non-deterministic, so there is a lot of noise in model evaluation

  • Validation scores are computed "out of time" and do not generalize as well as they would for exchangeable data
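For reference, the baselines mentioned here are trivial to implement; a minimal NumPy sketch of Naive-1 and the seasonal naive forecaster:

```python
import numpy as np

def naive_1(history, horizon):
    """Naive-1: repeat the last observed value for every future step."""
    return np.full(horizon, history[-1], dtype=float)

def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season (e.g., season_length=7 for daily data)."""
    last_season = history[-season_length:]
    reps = -(-horizon // season_length)           # ceiling division
    return np.tile(last_season, reps)[:horizon]

y = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
print(naive_1(y, 4))                            # [3. 3. 3. 3.]
print(seasonal_naive(y, 4, season_length=3))    # [1. 2. 3. 1.]
```

Any complex model has to beat these, consistently and by enough to justify the compute.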

11 of 37

We’ve created monsters

M4 winner: “ensemble of specialists” and other levels of ensembling on ES-RNN (Smyl, 2020)

M5 winners: most of the reported winners used combinations of ML methods (LightGBM, NN forecasters) (Makridakis et al., 2022)

12 of 37

AutoML

  • A wide selection of models / model architectures
  • A wide selection of potential combinations of models
  • Each model with its own array of hyperparameters (tuning knobs)
  • Data of varying shapes and sizes
  • A time and compute budget

Input: something like 5-10 lines of code → Output: a highly accurate forecasting model (potentially a monster)

13 of 37

AutoML

  • Hyperparameter Optimization: finding the best configuration of hyperparameters for a given model, including regularization, training, and model architecture

  • Model Selection: selecting the model likeliest to generalize with high performance to test data

  • Model Ensembles: combining trained models, and training new models, for even higher performance

  • Thoughtful Data Preprocessing: augmenting / transforming data to boost model performance

  • Battle-tested "Presets": collections of default hyperparameters, hyperparameter ranges, and preprocessing steps

Hyperparameter optimization combined with model selection is also known as CASH (Combined Algorithm Selection and Hyperparameter optimization).
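A toy illustration of the CASH idea: the model family and its hyperparameters are sampled jointly, and the winner is chosen by validation score. The "mean" and "drift" forecasters and all ranges below are invented for the example:

```python
import random
import numpy as np

rng = np.random.default_rng(1)
y = np.sin(np.arange(80) / 4.0) + 0.1 * rng.normal(size=80)
train, val = y[:60], y[60:]

def mean_forecast(h, window):      # constant: average of the last `window` points
    return np.full(len(val), h[-window:].mean())

def drift_forecast(h, damp):       # last value plus a damped average step
    step = damp * (h[-1] - h[0]) / (len(h) - 1)
    return h[-1] + step * np.arange(1, len(val) + 1)

random.seed(0)
candidates = []
for _ in range(20):                # CASH: sample model family AND hyperparameters
    if random.random() < 0.5:
        window = random.randint(2, 30)
        candidates.append((("mean", window), mean_forecast(train, window)))
    else:
        damp = random.uniform(0.0, 1.0)
        candidates.append((("drift", damp), drift_forecast(train, damp)))

# select the configuration with the lowest validation MAE
best_cfg, best_pred = min(candidates, key=lambda c: np.abs(c[1] - val).mean())
print("selected:", best_cfg)
```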

14 of 37

Introducing AutoGluon-TimeSeries

  • Available as of AutoGluon v0.5: AutoML for time series forecasting, built on AutoGluon and GluonTS in Python.

  • Flexible hyperparameter optimization, model selection, and ensembles for GluonTS models and beyond!

  • With v0.6: support for static and dynamic features, powerful tree-based models, and more

15 of 37


AutoGluon-TimeSeries:

  • Model zoo including ETS, ARIMA, Prophet, and many GluonTS models
  • Hyperparameter optimization with Ray Tune, or plain random search (the default backend)
  • Tree-based models via AutoGluon-Tabular (XGBoost, CatBoost, LightGBM)
  • Built on AutoGluon's familiar API

16 of 37

Available as of v0.6

Time Series Problems
  • Forecasting multiple univariate time series (panel data) with related time series (dynamic features) and item metadata (static features)
  • Probabilistic (quantile) forecasting

Models
  • DeepAR and SimpleFeedForward networks based on PyTorch / GluonTS
  • Tree-based models (LightGBM, XGBoost, CatBoost) via AutoGluon-Tabular
  • DeepAR, MQ-CNN/RNN, TFT, Transformers, et al. from GluonTS (MXNet, optional!)
  • Fast and stable implementations of local models (ETS, Theta, ARIMA) based on statsmodels

Model Selection and HPO
  • Basic HPO functionality for all models via random search
  • Bayesian optimization based on integration with Ray Tune
  • Validation via multi-window backtesting

Ensembling
  • Weighted ensembles

Coming in v0.7
  • Faster implementations of deep learning models on PyTorch
  • Faster implementations of local models
  • Post-hoc calibration of models for significantly improved coverage
  • Improved Bayesian optimization capability

Due Q1 2023
  • Support for featurization and time series transforms under the hood: differencing, log transforms, Box-Cox, etc.
  • Stack ensembles with model-based quantile aggregation

17 of 37

AutoGluon-TimeSeries

AG Time Series

  • Hyperparameter Optimization: improve HPO by stabilizing validation: multi-window backtesting, prioritizing models with faster inference, etc.

  • Model Selection: a model zoo encompassing many common benchmark models in time series analysis, with a mix of local, naive, and global models

  • Model Ensembles: building powerful ensembles of time-series models while addressing unique challenges in temporal data, such as quantile model aggregation. [coming in v0.7] Stack ensembles

  • Thoughtful Data Preprocessing: [coming in v0.7] default data transformations known to significantly increase forecast accuracy

  • Battle-tested "Presets": presets and default hyperparameters for deep-learning-based models, tuned over a wide set of benchmark data sets

18 of 37

Time series forecasting in a few lines of code

19 of 37

Time Series Data Frame
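AutoGluon's TimeSeriesDataFrame is, in essence, a long-format pandas DataFrame indexed by (item_id, timestamp). A plain-pandas sketch of that layout (column names and values are illustrative):

```python
import pandas as pd

# Long ("tall") layout: one row per (item_id, timestamp), one target column.
# This mirrors the structure of AutoGluon's TimeSeriesDataFrame, which is a
# pandas DataFrame with a two-level (item_id, timestamp) index.
df = pd.DataFrame({
    "item_id": ["A"] * 4 + ["B"] * 4,
    "timestamp": pd.date_range("2022-01-01", periods=4, freq="D").tolist() * 2,
    "target": [10.0, 11.0, 12.0, 13.0, 5.0, 5.5, 6.0, 6.5],
})
tsdf = df.set_index(["item_id", "timestamp"]).sort_index()

print(tsdf.loc["A"])                    # one item's series
print(tsdf.groupby("item_id").size())   # series length per item
```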

20 of 37

Time series forecasting in a few lines of code

Inside fit()

  • Out-of-time train / val split

  • Train multiple models based on presets

  • Combine predictions with an ensemble

  • Automatically select the best model
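The four steps can be sketched in miniature. This is a toy illustration of the workflow, not AutoGluon's actual implementation; the two stand-in "models" are deliberately trivial:

```python
import numpy as np

y = np.concatenate([np.arange(40, dtype=float), 40 - np.arange(10, dtype=float)])
horizon = 10

# 1) out-of-time split: validation = the LAST `horizon` points, never a random subset
train, val = y[:-horizon], y[-horizon:]

# 2) train multiple simple models (stand-ins for the real model zoo)
preds = {
    "naive": np.full(horizon, train[-1]),
    "drift": train[-1]
             + (train[-1] - train[0]) / (len(train) - 1) * np.arange(1, horizon + 1),
}

# 3) combine predictions with a (uniform) ensemble, treated as one more model
preds["ensemble"] = np.mean(list(preds.values()), axis=0)

# 4) automatically select the best model by validation error
scores = {name: np.abs(p - val).mean() for name, p in preds.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Because the toy series turns downward right at the validation window, the naive model wins here: exactly the kind of surprise an out-of-time split is designed to surface.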

21 of 37

Fine-grained control of component models

22 of 37

Hyperparameter optimization

23 of 37

  • Perform hyperparameter optimization by specifying the search space of hyperparameters
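A minimal sketch of the idea behind random-search HPO; the toy moving-average forecaster and its two tuning knobs are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.sin(np.arange(100) / 3.0)
train, val = y[:80], y[80:]

# Hypothetical search space for one model family: a shrunken moving-average
# forecaster with two knobs.
search_space = {"window": (1, 40), "shrink": (0.0, 1.0)}

def forecast(h, window, shrink):
    return np.full(len(val), (1.0 - shrink) * h[-window:].mean())

best_score, best_cfg = np.inf, None
for _ in range(30):                 # random search: sample, fit, score, keep best
    cfg = {"window": int(rng.integers(*search_space["window"])),
           "shrink": float(rng.uniform(*search_space["shrink"]))}
    score = np.abs(forecast(train, **cfg) - val).mean()
    if score < best_score:
        best_score, best_cfg = score, cfg
print(best_cfg, best_score)
```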

24 of 37

  • Robust training via multi-window backtesting

  • More granular control also available (number of windows, etc.)
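Multi-window backtesting can be sketched as follows (a generic rolling-origin splitter, not AutoGluon's internal code):

```python
import numpy as np

def backtest_splits(y, horizon, num_windows):
    """Multi-window ("rolling origin") backtesting: each window trains on a
    prefix of the series and validates on the `horizon` points right after it,
    with the last window ending at the end of the series."""
    splits = []
    for k in range(num_windows, 0, -1):
        train_end = len(y) - k * horizon
        splits.append((y[:train_end], y[train_end:train_end + horizon]))
    return splits

y = np.arange(100)
for train, val in backtest_splits(y, horizon=10, num_windows=3):
    print(len(train), val[0], val[-1])
# Averaging the validation score over the windows stabilizes model selection.
```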

25 of 37

  • Greedy weighted ensemble
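The greedy weighted ensemble is in the spirit of ensemble selection (Caruana et al.): repeatedly add, with replacement, whichever base model most reduces the ensemble's validation error. A minimal sketch, not AutoGluon's implementation:

```python
import numpy as np

def greedy_weighted_ensemble(preds, target, n_rounds=25):
    """Greedy forward selection: the ensemble is a running average of the
    selected models; final weights = selection counts / n_rounds."""
    names = list(preds)
    counts = {n: 0 for n in names}
    current = np.zeros_like(target, dtype=float)
    for r in range(1, n_rounds + 1):
        best_name, best_err = None, np.inf
        for n in names:
            cand = (current * (r - 1) + preds[n]) / r    # try adding model n
            err = np.abs(cand - target).mean()
            if err < best_err:
                best_name, best_err = n, err
        counts[best_name] += 1
        current = (current * (r - 1) + preds[best_name]) / r
    return {n: c / n_rounds for n, c in counts.items()}

rng = np.random.default_rng(0)
target = rng.normal(size=200)
preds = {
    "good":   target + 0.1 * rng.normal(size=200),
    "biased": target + 1.0,
    "noisy":  rng.normal(size=200),
}
weights = greedy_weighted_ensemble(preds, target)
print(weights)   # most of the weight should land on the "good" model
```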

26 of 37

  • Override time series indexes

27 of 37

  • Get started today: auto.gluon.ai

pip install "autogluon>=0.6"

AutoGluon Website

NeurIPS workshop Website

28 of 37

Q&A

atturkm@amazon.com

AutoGluon Website

NeurIPS workshop Website

29 of 37

References

  • Smyl, Slawek. "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting." International Journal of Forecasting 36.1 (2020): 75-85.
  • Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "M5 accuracy competition: Results, findings, and conclusions." International Journal of Forecasting (2022).
  • Li, Lisha, et al. "Hyperband: A novel bandit-based approach to hyperparameter optimization." The Journal of Machine Learning Research 18.1 (2017): 6765-6816.
  • Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in Neural Information Processing Systems 25 (2012).
  • Hyndman, Rob J., and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
  • Meisenbacher, Stefan, et al. "Review of automated time series forecasting pipelines." arXiv preprint arXiv:2202.01712 (2022).
  • Feurer, Matthias, and Frank Hutter. "Hyperparameter optimization." Automated machine learning. Springer, Cham, 2019. 3-33.
  • Lindauer, Marius, et al. "SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization." J. Mach. Learn. Res. 23 (2022): 54-1.
  • Gneiting, Tilmann, et al. "Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation." Monthly Weather Review 133.5 (2005): 1098-1118.
  • Kim, Taesup, et al. "Deep quantile aggregation." arXiv preprint arXiv:2103.00083 (2021).

30 of 37

Hyperparameter Optimization: Too many DeepARs, too little time 😱

  • num_layers: 1, 2, 3, …
  • num_cells: 10, 20, 30, …
  • cell_type: GRU, LSTM, …
  • dropoutcell_type: Zoneout, RNN ZoneOut, Variational Dropout, Variational Zoneout
  • embedding_dimension: 10, 20, 30, …
  • …

31 of 37

Hyperparameter Optimization (HPO)

How to select the best hyperparameters such that out of sample performance is optimized?

32 of 37

Time Series HPO: AutoETS and AutoARIMA

  • AutoETS: exhaustively enumerate a range of values for Error, Trend, Seasonality, and Damping.

  • AutoARIMA: 1/ establish the order of differencing via non-stationarity tests; 2/ exhaustively enumerate values for the AR and MA orders.

  • Select the best model in-sample according to a chosen information criterion¹

  • For higher-capacity models, one needs multi-fold out-of-sample validation scores instead

1 (Hyndman and Athanasopoulos, 2018)
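A toy version of in-sample selection by information criterion: enumerate AR orders, fit each by least squares, and score with AIC under a Gaussian approximation. This is an illustration of the principle, not the AutoARIMA algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulate an AR(2) process: y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + noise
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

def ar_aic(y, p):
    """Fit AR(p) by least squares; AIC = 2k + n*log(RSS/n) (Gaussian errors)."""
    Y = y[p:]
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    n = len(Y)
    return 2 * (p + 1) + n * np.log(np.mean(resid ** 2))

aics = {p: ar_aic(y, p) for p in range(1, 5)}
best_p = min(aics, key=aics.get)     # in-sample selection, no validation set
print(best_p, aics)
```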

33 of 37

Bagging


Image Source: Wikipedia

34 of 37

Multi-Layer Stack Ensembling

  • Stacker model uses predictions of every base model as extra features
  • Layer L+1 stacker model uses layer L predictions as extra features
  • For simplicity: stacker model types = base model types
  • NOTE: Stacker must be trained with held-out predictions of lower-layer models
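A minimal two-layer stack on a toy regression task. The key point from the note above is that the stacker's extra features must be out-of-fold (held-out) predictions of the base models, never in-sample ones:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Base layer: two ridge models with different regularization. Their features
# for the stacker are OUT-OF-FOLD predictions, produced k-fold style.
k = 5
folds = np.array_split(np.arange(200), k)
oof = np.zeros((200, 2))
for f in folds:
    mask = np.ones(200, dtype=bool)
    mask[f] = False                      # hold fold f out of base training
    for j, lam in enumerate([0.1, 10.0]):
        beta = fit_ridge(X[mask], y[mask], lam)
        oof[f, j] = X[f] @ beta

# Stacker layer: original features plus base predictions as extra features
X_stack = np.hstack([X, oof])
beta_stack = fit_ridge(X_stack, y, lam=0.1)
pred = X_stack @ beta_stack
print("stacker MAE:", np.abs(pred - y).mean())
```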

35 of 37

Cross-Validation


Train k different copies of the model, with a different chunk of data held out from each.
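The procedure in one small sketch, using a trivial "model" (the training-set mean) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = 5.0 + rng.normal(size=20)

k = 4
chunks = np.array_split(np.arange(20), k)    # k disjoint held-out chunks
cv_errors = []
for held_out in chunks:
    train = np.setdiff1d(np.arange(20), held_out)
    model = y[train].mean()                  # "train" one copy of the model
    cv_errors.append(np.abs(y[held_out] - model).mean())

print(np.mean(cv_errors))   # average of the k out-of-sample scores
```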

36 of 37

Ensembles

  • Many are better than one

  • Encompasses key ideas that underlie monsters, bagging, boosting, stacking

  • In AutoML frameworks: not letting models go to waste

37 of 37

Ensembles of Forecasts: Unique Challenges

  • We are often interested in probabilistic forecasts.
  • How does one aggregate probabilistic forecasts? A simple idea: take averages of quantiles (a.k.a. "Vincentization").
  • However, even searching the space of ensemble weights is prohibitive: evaluating quantile losses can be expensive.
  • Other ideas are required, but they come with costs; see, for example,¹ ²
  • How does one draw bootstrap samples?

1 (Gneiting et al., 2005) 2 (Kim et al., 2021)
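Vincentization itself is a one-liner; a small sketch with two hypothetical quantile forecasters (note that quantile-wise averaging with non-negative weights preserves the monotonicity of the quantiles):

```python
import numpy as np

# Two probabilistic forecasters reporting the same quantile levels for one step
levels  = np.array([0.1, 0.5, 0.9])
model_a = np.array([8.0, 10.0, 12.0])   # hypothetical quantile forecasts
model_b = np.array([9.0, 11.0, 15.0])

weights = np.array([0.7, 0.3])

# Vincentization: average the forecasts quantile-by-quantile
combined = weights[0] * model_a + weights[1] * model_b
print(dict(zip(levels, combined)))
```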