1 of 15

G-Research Crypto Forecasting

Patrick Yam

2 of 15

3 of 15

Final leaderboard

4 of 15

Competition setup

  • Two-stage competition
  • Training phase: 2 Nov 2021 – 1 Feb 2022
  • Evaluation phase: 2 Feb 2022 – 3 May 2022
  • Models are frozen after the training phase and are run against live market data during the evaluation phase

5 of 15

Overview

  • Data: 14 crypto assets from Jan 2018 to Jan 2022, minute-bar data

6 of 15

Minute bar data

7 of 15

Target

8 of 15

Evaluation metric

  • Weighted version of the Pearson correlation coefficient
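The metric can be sketched as follows; `weighted_pearson` is an illustrative name, and in the competition the per-sample weights are derived from per-asset weights in the asset metadata:

```python
import numpy as np

def weighted_pearson(y_true, y_pred, w):
    # Weighted Pearson correlation: weighted means, covariance and
    # variances, then the usual ratio. `w` holds the per-sample weights.
    w = w / w.sum()
    mean_t = np.sum(w * y_true)
    mean_p = np.sum(w * y_pred)
    cov = np.sum(w * (y_true - mean_t) * (y_pred - mean_p))
    var_t = np.sum(w * (y_true - mean_t) ** 2)
    var_p = np.sum(w * (y_pred - mean_p) ** 2)
    return cov / np.sqrt(var_t * var_p)
```

With uniform weights this reduces to the ordinary Pearson correlation coefficient.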

9 of 15

Data preprocessing

  • Additional features: sin/cos encoding of the time of day
  • Model inputs: 9 features in total
  • ['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'time_sin', 'time_cos']

  • Take the latest 45/60/90/120 minutes as input

  • Log-transform 'Volume' and 'VWAP'
  • Standardise locally within each input window (except time_sin and time_cos)
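A minimal sketch of this preprocessing, assuming timestamps in seconds; `make_window` is an illustrative helper name, and `log1p` is one reasonable choice of log transform (the slides do not specify which):

```python
import numpy as np
import pandas as pd

FEATURES = ['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP']

def make_window(df, seq_len=60):
    df = df.copy()
    # sin/cos encoding of the time of day from the minute timestamp
    minute_of_day = (df['timestamp'] // 60) % 1440
    df['time_sin'] = np.sin(2 * np.pi * minute_of_day / 1440)
    df['time_cos'] = np.cos(2 * np.pi * minute_of_day / 1440)
    # log transform for the heavy-tailed columns
    df['Volume'] = np.log1p(df['Volume'])
    df['VWAP'] = np.log1p(df['VWAP'])
    # take the latest seq_len minutes as the model input
    x = df[FEATURES + ['time_sin', 'time_cos']].tail(seq_len)
    x = x.to_numpy(dtype=np.float32)
    # local standardisation per window, skipping the two time features
    mean = x[:, :7].mean(axis=0)
    std = x[:, :7].std(axis=0) + 1e-8
    x[:, :7] = (x[:, :7] - mean) / std
    return x  # shape (seq_len, 9)
```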

10 of 15

Model Architecture

11 of 15

Axial Attention

  • Ordering matters for the time-series axis but not for the asset axis

For each asset, gather information from other timestamps

For each timestamp, gather information from other assets
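The two attention passes can be sketched as below; the slides only state the two axes, so the residual structure, hyperparameters, and the choice of a learned positional embedding (applied to the time axis only, since ordering matters there but not across assets) are illustrative:

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    # Axial attention over a (batch, asset, time, dim) tensor: one
    # attention pass along the time axis, one along the asset axis.
    def __init__(self, dim=64, heads=4, seq_len=60):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.asset_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # positional embedding for the time axis only
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))

    def forward(self, x):                      # x: (B, A, T, D)
        b, a, t, d = x.shape
        # for each asset, gather information from other timestamps
        xt = x.reshape(b * a, t, d) + self.pos
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, a, t, d)
        # for each timestamp, gather information from other assets
        xa = x.permute(0, 2, 1, 3).reshape(b * t, a, d)
        xa, _ = self.asset_attn(xa, xa, xa)
        x = x + xa.reshape(b, t, a, d).permute(0, 2, 1, 3)
        return x
```

Because the asset-axis attention has no positional embedding, it is permutation-equivariant over assets, matching the bullet above.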

12 of 15

MLP pooling

  • Nonlinear version of weighted-average pooling
  • Apply an MLP along the time-series axis, with output dimension = 1

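A sketch of the layer and its forward pass, with illustrative sizes (the slides do not give the hidden width):

```python
import torch
import torch.nn as nn

class MLPPooling(nn.Module):
    # A small MLP applied across the time axis with output dimension 1:
    # a nonlinear version of weighted-average pooling.
    def __init__(self, seq_len=60, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_len, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):          # x: (B, T, D)
        x = x.transpose(1, 2)      # (B, D, T): MLP acts along the time axis
        x = self.mlp(x)            # (B, D, 1)
        return x.squeeze(-1)       # (B, D)
```

With a single linear layer and no activation this would be exactly weighted-average pooling with learned weights; the hidden layer and nonlinearity make the pooling weights input-dependent.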

13 of 15

Loss function

  • Minibatch negative weighted Pearson correlation coefficient

  • The scale of predictions from models trained with different random seeds can be very different
  • e.g. [0.1, 0.2, 0.3] gives the same correlation as [10, 20, 30]
  • This can be problematic for ensembling

  • Fix: apply BatchNorm with affine=False to the model's predictions
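A sketch of this loss, assuming the BatchNorm(affine=False) sits directly on the prediction head so that differently seeded models emit comparably scaled outputs:

```python
import torch
import torch.nn as nn

class NegWeightedPearsonLoss(nn.Module):
    # Minibatch negative weighted Pearson correlation, with a
    # BatchNorm1d(affine=False) normalising the predictions' scale.
    def __init__(self):
        super().__init__()
        self.bn = nn.BatchNorm1d(1, affine=False)

    def forward(self, pred, target, weight):
        pred = self.bn(pred.view(-1, 1)).view(-1)
        w = weight / weight.sum()
        mp = (w * pred).sum()
        mt = (w * target).sum()
        cov = (w * (pred - mp) * (target - mt)).sum()
        var_p = (w * (pred - mp) ** 2).sum()
        var_t = (w * (target - mt) ** 2).sum()
        return -cov / (var_p * var_t + 1e-12).sqrt()
```

Pearson correlation is invariant to affine rescaling of the predictions, so the BatchNorm does not change the loss value; it only pins the output scale for later ensembling.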

14 of 15

Ensemble

Correlation between models trained with different sequence lengths:

Seq len    45      60      90      120
45         1.000   0.702   0.588   0.546
60         0.702   1.000   0.652   0.602
90         0.588   0.652   1.000   0.670
120        0.546   0.602   0.670   1.000

Seq len    CV score
45         0.078
60         0.077
90         0.074
120        0.073
Ensemble   0.089

  • 3-fold CV training
  • 4 different sequence-length settings
  • Total 12 models, simple average

  • Single model performance: Private LB 0.0202 -> 0.0159
  • #28, silver medal

A longer sequence doesn't always give a better result
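The simple-average ensemble and the diversity check behind the correlation table can be sketched as follows (function names illustrative):

```python
import numpy as np

def ensemble_average(preds):
    # preds: (n_models, n_samples) array of per-model predictions,
    # e.g. 12 models = 3 CV folds x 4 sequence lengths.
    # Assumes predictions are on a comparable scale, e.g. via the
    # BatchNorm(affine=False) prediction head.
    return preds.mean(axis=0)

def pairwise_corr(preds):
    # Pairwise Pearson correlation between model predictions,
    # as in the sequence-length correlation table.
    return np.corrcoef(preds)
```

The moderate pairwise correlations (0.55–0.70) are what make the simple average worthwhile: the ensemble CV score (0.089) exceeds every single-model score.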

15 of 15

Thank you!

Q&A