1 of 37


Agenda (times in CST)

  • Introduction + AutoGluon Tabular: 2:00PM – 2:55PM
  • Break: 2:55PM – 3:05PM
  • AutoGluon Multimodal: 3:05PM – 4:00PM
  • Break: 4:00PM – 4:10PM
  • AutoGluon TimeSeries: 4:10PM – 4:50PM
  • Additional Q&A + Feedback: 4:50PM – 5:00PM

Workshop Website

*Note on Hands-on Notebooks

  • pip install autogluon

  • Restart the kernel

© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

2 of 37

AutoML for Time Series with AutoGluon

AutoGluon: Empowering (Multimodal) AutoML for the Next 10 Million Users at NeurIPS 2022

December 2022 | New Orleans, LA

Caner Türkmen, Sr. Applied Scientist @ Amazon Web Services

3 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: a time series split along the time axis into the observed "past" and the "future" to be forecast]

4 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: the observed "past" serves as the "features"; the "future" values are the "labels" to be predicted]

5 of 37

Time Series Forecasting: A Short Problem Definition

[Figure: past "features" and future "labels" along the time axis]

Uncertainty quantification: (conditional) quantiles of the predictive distribution

  • e.g., the newsvendor problem
  • Finance: value at risk

6 of 37

Time Series Forecasting: Other Features

[Figure: a time series with three numbered kinds of additional features (1–3), spanning "past" and "future"]

7 of 37

Time Series Forecasting: Other Features

[Figure: a time series with three numbered kinds of additional features (1–3), spanning "past" and "future"]

1. "Known" time-varying covariates (related time series): covariates whose values, or surrogates thereof, will be known into the future at prediction time. Examples: weather (forecasts), dummy variables for holidays, promotions (plans).

2. Other time-varying covariates / related time series: related time series whose future values will not be known at prediction time. Examples: prices of related assets in finance, demand for related items in demand forecasting.

3. Item metadata / static features: static features describing the item that do not vary over time. Examples: the category of an item in the catalogue, the industry of a financial asset.
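As a concrete illustration of the three feature types, here is how they might sit in a pandas frame; all column names and values below are hypothetical, invented for the example:

```python
import pandas as pd

# Illustrative demand-forecasting frame: one row per (item, timestamp).
df = pd.DataFrame({
    "item_id":   ["A", "A", "A", "B", "B", "B"],
    "timestamp": pd.to_datetime(["2022-12-01", "2022-12-02", "2022-12-03"] * 2),
    "demand":    [12.0, 15.0, 14.0, 3.0, 4.0, 5.0],  # the target to forecast
    # 1) "known" covariate: holiday dummies are known arbitrarily far ahead
    "is_holiday": [0, 1, 0, 0, 1, 0],
    # 2) other covariate: demand of a related item, unknown in the future
    "related_demand": [7.0, 9.0, 8.0, 1.0, 2.0, 2.0],
})

# 3) static features: one row per item, constant over time
static = pd.DataFrame({"item_id": ["A", "B"], "category": ["grocery", "toys"]})

merged = df.merge(static, on="item_id")  # attach metadata to every time step
print(merged.head())
```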

8 of 37

Global vs local

Very often, a data set comprises multiple "items", i.e., many related time series.

In local models, we fit a separate parameter vector (θ1, θ2, …) to the dynamics of each time series (ETS, ARIMA, etc.).

Global models share a single parameter vector θ across different time series and learn common dynamics (neural networks).
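The distinction can be sketched with a toy AR(1) example (plain NumPy, not AutoGluon code): the local approach estimates one coefficient per series, while the global approach pools all series into a single shared estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy series that share the same AR(1) dynamics: y_t = 0.8 y_{t-1} + noise
series = []
for _ in range(3):
    y = [rng.normal()]
    for _ in range(50):
        y.append(0.8 * y[-1] + 0.1 * rng.normal())
    series.append(np.array(y))

# Local: one parameter per series (here, a per-series least-squares AR(1) fit)
local_thetas = [np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1]) for y in series]

# Global: a single shared parameter, fit on the pooled data from all series
global_theta = (sum(np.dot(y[:-1], y[1:]) for y in series)
                / sum(np.dot(y[:-1], y[:-1]) for y in series))

print(local_thetas, global_theta)
```

Because the series here truly share their dynamics, the pooled (global) estimate uses three times as much data per parameter; when dynamics differ across items, the local fits can adapt where the global one cannot.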

9 of 37

Machine learning methods are increasingly used in real-world forecasting use cases (* the majority of the methods below already power AutoGluon-TimeSeries)

Local: STL-AR, ETS, ARIMA, Theta, Prophet, Naive-1, Seasonal Naive, Naive

Global (DL): DeepAR, TFT, MQ-CNN/RNN, FF Networks, DeepState, Informer, N-BEATS, TCNs

Global (Tree): XGBoost, LightGBM, CatBoost, Rotbaum

10 of 37

Machine learning methods are taking over forecasting, but remain the subject of ongoing debate

  • Common scenario: a neural network forecaster "does not work"

  • Deep-learning-based forecasters have too many tuning knobs to get right, and require some knowledge of the "dark arts"

  • The simplest methods are very strong baselines (e.g., Naive-1), while complex models can require 100x the compute, for inference as well as training

  • The benefits of ML can be marginal and hard to determine: targets are heavy-tailed, and both training and inference are inherently non-deterministic, so there is a lot of noise in model evaluation

  • Validation scores are computed "out of time" and do not generalize as well as they would for exchangeable data
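For reference, the baselines mentioned here are trivial to implement; a minimal NumPy sketch of Naive-1 and the seasonal naive forecaster:

```python
import numpy as np

def naive_1(history, horizon):
    """Naive-1: repeat the last observed value for every future step."""
    return np.full(horizon, history[-1], dtype=float)

def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season (e.g., season_length=7 for daily data)."""
    last_season = history[-season_length:]
    reps = -(-horizon // season_length)           # ceiling division
    return np.tile(last_season, reps)[:horizon]

y = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
print(naive_1(y, 4))                            # [3. 3. 3. 3.]
print(seasonal_naive(y, 4, season_length=3))    # [1. 2. 3. 1.]
```

Any complex model has to beat these, consistently and by enough to justify the compute.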

11 of 37

We’ve created monsters

M4 winner: “ensemble of specialists” and other levels of ensembling on ES-RNN (Smyl, 2020)

M5 winners: most of the reported winners used combinations of ML methods (LightGBM, NN forecasters) (Makridakis et al., 2022)

12 of 37

AutoML

  • A wide selection of models / model architectures
  • A wide selection of potential combinations of models
  • Each model with its own array of hyperparameters (tuning knobs)
  • Data of varying shapes and sizes
  • A time and compute budget

Input: something like 5-10 lines of code → Output: a highly accurate forecasting model (potentially a monster)

13 of 37

AutoML

  • Hyperparameter Optimization: finding the best configuration of hyperparameters for a given model, including regularization, training, and model architecture

  • Model Selection: selecting the model likeliest to generalize with high performance to test data

  • Model Ensembles: combining trained models, and training new models, for even higher performance

  • Thoughtful Data Preprocessing: augmenting / transforming data to boost model performance

  • Battle-tested "Presets": collections of default hyperparameters, hyperparameter ranges, and preprocessing steps

Hyperparameter optimization combined with model selection is also known as CASH (Combined Algorithm Selection and Hyperparameter optimization).
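A toy illustration of the CASH idea: the model family and its hyperparameters are sampled jointly, and the winner is chosen by validation score. The "mean" and "drift" forecasters and all ranges below are invented for the example:

```python
import random
import numpy as np

rng = np.random.default_rng(1)
y = np.sin(np.arange(80) / 4.0) + 0.1 * rng.normal(size=80)
train, val = y[:60], y[60:]

def mean_forecast(h, window):      # constant: average of the last `window` points
    return np.full(len(val), h[-window:].mean())

def drift_forecast(h, damp):       # last value plus a damped average step
    step = damp * (h[-1] - h[0]) / (len(h) - 1)
    return h[-1] + step * np.arange(1, len(val) + 1)

random.seed(0)
candidates = []
for _ in range(20):                # CASH: sample model family AND hyperparameters
    if random.random() < 0.5:
        window = random.randint(2, 30)
        candidates.append((("mean", window), mean_forecast(train, window)))
    else:
        damp = random.uniform(0.0, 1.0)
        candidates.append((("drift", damp), drift_forecast(train, damp)))

# select the configuration with the lowest validation MAE
best_cfg, best_pred = min(candidates, key=lambda c: np.abs(c[1] - val).mean())
print("selected:", best_cfg)
```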

14 of 37

Introducing AutoGluon-TimeSeries

  • Available as of AutoGluon v0.5: AutoML for time series forecasting, built on AutoGluon and GluonTS in Python.

  • Flexible hyperparameter optimization, model selection, and ensembles for GluonTS models and beyond!

  • With v0.6: support for static and dynamic features, powerful tree-based models, and more

15 of 37


AutoGluon-TimeSeries:

  • Model zoo including ETS, ARIMA, Prophet, and many GluonTS models
  • Hyperparameter optimization with Ray Tune, or plain random search (the default backend)
  • Tree-based models via AutoGluon-Tabular (XGBoost, CatBoost, LightGBM)
  • Built on AutoGluon's familiar API

16 of 37

Available as of v0.6

Time Series Problems
  • Forecasting multiple univariate time series (panel data) with related time series (dynamic features) and item metadata (static features)
  • Probabilistic (quantile) forecasting

Models
  • DeepAR and SimpleFeedForward networks based on PyTorch / GluonTS
  • Tree-based models (LightGBM, XGBoost, CatBoost) via AutoGluon-Tabular
  • DeepAR, MQ-CNN/RNN, TFT, Transformers, et al. from GluonTS (MXNet, optional!)
  • Fast and stable implementations of local models (ETS, Theta, ARIMA) based on statsmodels

Model Selection and HPO
  • Basic HPO functionality for all models via random search
  • Bayesian optimization based on integration with Ray Tune
  • Validation via multi-window backtesting

Ensembling
  • Weighted ensembles

Coming in v0.7
  • Faster implementations of deep learning models on PyTorch
  • Faster implementations of local models
  • Post-hoc calibration of models for significantly improved coverage
  • Improved Bayesian optimization capability

Due Q1 2023
  • Support for featurization and time series transforms under the hood: differencing, log transforms, Box-Cox, etc.
  • Stack ensembles with model-based quantile aggregation

17 of 37

AutoGluon-TimeSeries

AG Time Series

  • Hyperparameter Optimization: improve HPO by stabilizing validation: multi-window backtesting, prioritizing models with faster inference, etc.

  • Model Selection: a model zoo encompassing many common benchmark models in time series analysis, with a mix of local, naive, and global models

  • Model Ensembles: building powerful ensembles of time-series models while addressing unique challenges in temporal data, such as quantile model aggregation. [coming in v0.7] Stack ensembles

  • Thoughtful Data Preprocessing: [coming in v0.7] default data transformations known to significantly increase forecast accuracy

  • Battle-tested "Presets": presets and default hyperparameters for deep-learning-based models, tuned over a wide set of benchmark data sets

18 of 37

Time series forecasting in a few lines of code

19 of 37

Time Series Data Frame
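AutoGluon's TimeSeriesDataFrame is, in essence, a long-format pandas DataFrame indexed by (item_id, timestamp). A plain-pandas sketch of that layout (column names and values are illustrative):

```python
import pandas as pd

# Long ("tall") layout: one row per (item_id, timestamp), one target column.
# This mirrors the structure of AutoGluon's TimeSeriesDataFrame, which is a
# pandas DataFrame with a two-level (item_id, timestamp) index.
df = pd.DataFrame({
    "item_id": ["A"] * 4 + ["B"] * 4,
    "timestamp": pd.date_range("2022-01-01", periods=4, freq="D").tolist() * 2,
    "target": [10.0, 11.0, 12.0, 13.0, 5.0, 5.5, 6.0, 6.5],
})
tsdf = df.set_index(["item_id", "timestamp"]).sort_index()

print(tsdf.loc["A"])                    # one item's series
print(tsdf.groupby("item_id").size())   # series length per item
```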

20 of 37

Time series forecasting in a few lines of code

Inside fit()

  • Out-of-time train / val split

  • Train multiple models based on presets

  • Combine predictions with an ensemble

  • Automatically select the best model
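The four steps can be sketched in miniature. This is a toy illustration of the workflow, not AutoGluon's actual implementation; the two stand-in "models" are deliberately trivial:

```python
import numpy as np

y = np.concatenate([np.arange(40, dtype=float), 40 - np.arange(10, dtype=float)])
horizon = 10

# 1) out-of-time split: validation = the LAST `horizon` points, never a random subset
train, val = y[:-horizon], y[-horizon:]

# 2) train multiple simple models (stand-ins for the real model zoo)
preds = {
    "naive": np.full(horizon, train[-1]),
    "drift": train[-1]
             + (train[-1] - train[0]) / (len(train) - 1) * np.arange(1, horizon + 1),
}

# 3) combine predictions with a (uniform) ensemble, treated as one more model
preds["ensemble"] = np.mean(list(preds.values()), axis=0)

# 4) automatically select the best model by validation error
scores = {name: np.abs(p - val).mean() for name, p in preds.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Because the toy series turns downward right at the validation window, the naive model wins here: exactly the kind of surprise an out-of-time split is designed to surface.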

21 of 37

Fine-grained control of component models

22 of 37

Hyperparameter optimization

23 of 37

  • Perform hyperparameter optimization by specifying the search space of hyperparameters
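A minimal sketch of the idea behind random-search HPO; the toy moving-average forecaster and its two tuning knobs are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.sin(np.arange(100) / 3.0)
train, val = y[:80], y[80:]

# Hypothetical search space for one model family: a shrunken moving-average
# forecaster with two knobs.
search_space = {"window": (1, 40), "shrink": (0.0, 1.0)}

def forecast(h, window, shrink):
    return np.full(len(val), (1.0 - shrink) * h[-window:].mean())

best_score, best_cfg = np.inf, None
for _ in range(30):                 # random search: sample, fit, score, keep best
    cfg = {"window": int(rng.integers(*search_space["window"])),
           "shrink": float(rng.uniform(*search_space["shrink"]))}
    score = np.abs(forecast(train, **cfg) - val).mean()
    if score < best_score:
        best_score, best_cfg = score, cfg
print(best_cfg, best_score)
```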

24 of 37

  • Robust training via multi-window backtesting

  • More granular control also available (number of windows, etc.)
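Multi-window backtesting can be sketched as follows (a generic rolling-origin splitter, not AutoGluon's internal code):

```python
import numpy as np

def backtest_splits(y, horizon, num_windows):
    """Multi-window ("rolling origin") backtesting: each window trains on a
    prefix of the series and validates on the `horizon` points right after it,
    with the last window ending at the end of the series."""
    splits = []
    for k in range(num_windows, 0, -1):
        train_end = len(y) - k * horizon
        splits.append((y[:train_end], y[train_end:train_end + horizon]))
    return splits

y = np.arange(100)
for train, val in backtest_splits(y, horizon=10, num_windows=3):
    print(len(train), val[0], val[-1])
# Averaging the validation score over the windows stabilizes model selection.
```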

25 of 37

  • Greedy weighted ensemble
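The greedy weighted ensemble is in the spirit of ensemble selection (Caruana et al.): repeatedly add, with replacement, whichever base model most reduces the ensemble's validation error. A minimal sketch, not AutoGluon's implementation:

```python
import numpy as np

def greedy_weighted_ensemble(preds, target, n_rounds=25):
    """Greedy forward selection: the ensemble is a running average of the
    selected models; final weights = selection counts / n_rounds."""
    names = list(preds)
    counts = {n: 0 for n in names}
    current = np.zeros_like(target, dtype=float)
    for r in range(1, n_rounds + 1):
        best_name, best_err = None, np.inf
        for n in names:
            cand = (current * (r - 1) + preds[n]) / r    # try adding model n
            err = np.abs(cand - target).mean()
            if err < best_err:
                best_name, best_err = n, err
        counts[best_name] += 1
        current = (current * (r - 1) + preds[best_name]) / r
    return {n: c / n_rounds for n, c in counts.items()}

rng = np.random.default_rng(0)
target = rng.normal(size=200)
preds = {
    "good":   target + 0.1 * rng.normal(size=200),
    "biased": target + 1.0,
    "noisy":  rng.normal(size=200),
}
weights = greedy_weighted_ensemble(preds, target)
print(weights)   # most of the weight should land on the "good" model
```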

26 of 37

  • Override time series indexes

27 of 37

  • Get started today: auto.gluon.ai

pip install "autogluon>=0.6"

AutoGluon Website

NeurIPS workshop Website

28 of 37

Q&A

atturkm@amazon.com

AutoGluon Website

NeurIPS workshop Website

29 of 37

References

  • Smyl, Slawek. "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting." International Journal of Forecasting 36.1 (2020): 75-85.
  • Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "M5 accuracy competition: Results, findings, and conclusions." International Journal of Forecasting (2022).
  • Li, Lisha, et al. "Hyperband: A novel bandit-based approach to hyperparameter optimization." The Journal of Machine Learning Research 18.1 (2017): 6765-6816.
  • Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in Neural Information Processing Systems 25 (2012).
  • Hyndman, Rob J., and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
  • Meisenbacher, Stefan, et al. "Review of automated time series forecasting pipelines." arXiv preprint arXiv:2202.01712 (2022).
  • Feurer, Matthias, and Frank Hutter. "Hyperparameter optimization." Automated machine learning. Springer, Cham, 2019. 3-33.
  • Lindauer, Marius, et al. "SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization." J. Mach. Learn. Res. 23 (2022): 54-1.
  • Gneiting, Tilmann, et al. "Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation." Monthly Weather Review 133.5 (2005): 1098-1118.
  • Kim, Taesup, et al. "Deep quantile aggregation." arXiv preprint arXiv:2103.00083 (2021).

30 of 37

Hyperparameter Optimization: Too many DeepARs, too little time 😱

  • num_layers: 1, 2, 3, …
  • num_cells: 10, 20, 30, …
  • cell_type: GRU, LSTM, …
  • dropoutcell_type: Zoneout, RNN ZoneOut, Variational Dropout, Variational Zoneout
  • embedding_dimension: 10, 20, 30, …
  • …

31 of 37

Hyperparameter Optimization (HPO)

How to select the best hyperparameters such that out of sample performance is optimized?

32 of 37

Time Series HPO: AutoETS and AutoARIMA

  • AutoETS: exhaustively enumerate a range of values for Error, Trend, Seasonality, and Damping.

  • AutoARIMA: 1/ establish the order of differencing via non-stationarity tests; 2/ exhaustively enumerate values for the AR and MA orders.

  • Select the best model in-sample according to a chosen information criterion¹

  • For higher-capacity models, one needs multi-fold out-of-sample validation scores instead

1 (Hyndman and Athanasopoulos, 2018)
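A toy version of in-sample selection by information criterion: enumerate AR orders, fit each by least squares, and score with AIC under a Gaussian approximation. This is an illustration of the principle, not the AutoARIMA algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulate an AR(2) process: y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + noise
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

def ar_aic(y, p):
    """Fit AR(p) by least squares; AIC = 2k + n*log(RSS/n) (Gaussian errors)."""
    Y = y[p:]
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    n = len(Y)
    return 2 * (p + 1) + n * np.log(np.mean(resid ** 2))

aics = {p: ar_aic(y, p) for p in range(1, 5)}
best_p = min(aics, key=aics.get)     # in-sample selection, no validation set
print(best_p, aics)
```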

33 of 37

Bagging


Image Source: Wikipedia

34 of 37

Multi-Layer Stack Ensembling

  • Stacker model uses predictions of every base model as extra features
  • Layer L+1 stacker model uses layer L predictions as extra features
  • For simplicity: stacker model types = base model types
  • NOTE: Stacker must be trained with held-out predictions of lower-layer models
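A minimal two-layer stack on a toy regression task. The key point from the note above is that the stacker's extra features must be out-of-fold (held-out) predictions of the base models, never in-sample ones:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Base layer: two ridge models with different regularization. Their features
# for the stacker are OUT-OF-FOLD predictions, produced k-fold style.
k = 5
folds = np.array_split(np.arange(200), k)
oof = np.zeros((200, 2))
for f in folds:
    mask = np.ones(200, dtype=bool)
    mask[f] = False                      # hold fold f out of base training
    for j, lam in enumerate([0.1, 10.0]):
        beta = fit_ridge(X[mask], y[mask], lam)
        oof[f, j] = X[f] @ beta

# Stacker layer: original features plus base predictions as extra features
X_stack = np.hstack([X, oof])
beta_stack = fit_ridge(X_stack, y, lam=0.1)
pred = X_stack @ beta_stack
print("stacker MAE:", np.abs(pred - y).mean())
```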

35 of 37

Cross-Validation


Train k different copies of the model, with a different chunk of data held out from each.
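The procedure in one small sketch, using a trivial "model" (the training-set mean) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = 5.0 + rng.normal(size=20)

k = 4
chunks = np.array_split(np.arange(20), k)    # k disjoint held-out chunks
cv_errors = []
for held_out in chunks:
    train = np.setdiff1d(np.arange(20), held_out)
    model = y[train].mean()                  # "train" one copy of the model
    cv_errors.append(np.abs(y[held_out] - model).mean())

print(np.mean(cv_errors))   # average of the k out-of-sample scores
```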

36 of 37

Ensembles

  • Many are better than one

  • Encompasses key ideas that underlie monsters, bagging, boosting, stacking

  • In AutoML frameworks: not letting models go to waste

37 of 37

Ensembles of Forecasts: Unique Challenges

  • We are often interested in probabilistic forecasts.
  • How does one aggregate probabilistic forecasts? A simple idea: take averages of quantiles (a.k.a. "Vincentization").
  • However, even searching the space of ensemble weights is prohibitive: evaluating quantile losses can be expensive.
  • Other ideas are required, but they come with costs; see, for example,¹ ²
  • How does one draw bootstrap samples?

1 (Gneiting et al., 2005) 2 (Kim et al., 2021)
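Vincentization itself is a one-liner; a small sketch with two hypothetical quantile forecasters (note that quantile-wise averaging with non-negative weights preserves the monotonicity of the quantiles):

```python
import numpy as np

# Two probabilistic forecasters reporting the same quantile levels for one step
levels  = np.array([0.1, 0.5, 0.9])
model_a = np.array([8.0, 10.0, 12.0])   # hypothetical quantile forecasts
model_b = np.array([9.0, 11.0, 15.0])

weights = np.array([0.7, 0.3])

# Vincentization: average the forecasts quantile-by-quantile
combined = weights[0] * model_a + weights[1] * model_b
print(dict(zip(levels, combined)))
```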