2 of 50

Brief Dataset Recap

Dataset Overview

234 crypto coins/altcoins with open, close, low, high, and volume
7 different datasets covering weekly, daily, and hourly (hours and minutes) timescales. As a group, we chose D1 (daily) because the daily scale offers more data and should give strong predictive capability

Goal of the Dataset

We want to predict Bitcoin prices leveraging deep learning models learned in the course

3 of 50

Data Preprocessing

Data cleaning

Data Cleaning was not needed as the dataset was populated with the correct data types and no missing values were found

Data Normalization

Data (all fields numerical) is normalized prior to training. Data normalization is important when training recurrent neural networks for model stability.

Sample data information of bitcoin dataset from D1 folder
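The slides do not say which normalization was applied. As a minimal sketch, assuming min-max scaling of each numeric column to [0, 1] (the function names and sample values are made up for illustration):

```python
def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; also return (lo, hi) so the scaling can be undone."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column], (lo, hi)
    return [(x - lo) / span for x in column], (lo, hi)


def denormalize(scaled, bounds):
    """Map normalized values back to the original price scale."""
    lo, hi = bounds
    return [x * (hi - lo) + lo for x in scaled]


closes = [100.0, 150.0, 125.0, 200.0]  # made-up closing prices
scaled, bounds = min_max_normalize(closes)
restored = denormalize(scaled, bounds)
```

Keeping the bounds around matters because an RMSE computed on normalized data is in normalized units, and predictions have to be mapped back before they can be read as prices.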

4 of 50

Sequence Separation

Dataset of length N was split into sequences of K+1.
The output sample is saved as the “target” to be learned.
Number of sequences is

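[Figure: a sliding window over the dataset of length N yields sequences Seq K_n, K_(n+1), K_(n+2), …; each window consists of a training sequence followed by its output sample.]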
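The sequence separation above can be sketched as a sliding window. This assumes a stride of 1 between consecutive windows, which the slides do not state; under that assumption a dataset of length N yields N - K sequences. The function name is illustrative.

```python
def make_sequences(series, k):
    """Slide a window of length k + 1 over the series with stride 1.

    Each window is split into k input samples and one target: the sample
    that immediately follows the inputs.
    """
    inputs, targets = [], []
    for start in range(len(series) - k):
        window = series[start:start + k + 1]
        inputs.append(window[:-1])   # training sequence of length k
        targets.append(window[-1])   # output sample to be learned
    return inputs, targets


series = [1, 2, 3, 4, 5, 6]          # N = 6
inputs, targets = make_sequences(series, k=3)
```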

5 of 50

Train/Test Split

Dataset was split into three regions: 75% Train, 15% Test, 10% Validation
The regions were non-overlapping.
Sequence lengths of 10, 20 and 40 were chosen.
Table is for sequence length of 20.

Dataset	Train #	Test #	Val #
1 Day	1,753	335	216
4 Hour	10,614	2,107	1,397
30 Min	84,954	16,975	11,309
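A non-overlapping 75/15/10 split can be sketched as below. The chronological ordering (train first, then test, then validation) is an assumption made for illustration; the slides only state the proportions and that the regions do not overlap.

```python
def chronological_split(samples, train_frac=0.75, test_frac=0.15):
    """Cut an ordered list into train, test, and validation regions.

    The validation region receives whatever remains after the train and
    test regions, so no sample appears in more than one region.
    """
    n = len(samples)
    train_end = int(n * train_frac)
    test_end = train_end + int(n * test_frac)
    return samples[:train_end], samples[train_end:test_end], samples[test_end:]


samples = list(range(100))
train, test, val = chronological_split(samples)
```

For time series, splitting by position rather than shuffling avoids training on samples that come after the ones being evaluated.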

6 of 50

Model Evaluation

The primary evaluation metric used is the Root Mean Squared Error (RMSE).

Two primary loss functions used to compute training loss: MSE and L1 (MAE)
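For reference, the three quantities named above can be written out in a few lines. This is a plain-Python sketch with made-up values, not the training code used for the project.

```python
import math


def mse(pred, target):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error, i.e. the L1 loss."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def rmse(pred, target):
    """Root mean squared error: square root of the MSE, in the units of the data."""
    return math.sqrt(mse(pred, target))


pred = [1.0, 2.0, 3.0]
target = [1.0, 2.0, 5.0]
```

Because MSE squares each error, a single large miss dominates it, whereas L1 weights all errors linearly.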

7 of 50

Model Regularization

Two regularization methods were used:

L1 Lasso Regularization

Dropout Layer

Randomly sets input units to 0 with probability p to prevent overfitting.

8 of 50

Model Optimizer

Two optimizers were evaluated for learning:

Stochastic Gradient Descent (SGD)

Updates the weights (w) based on a preselected learning rate (eta).
A learning rate schedule is often used to improve learning.

Adaptive Moment Estimation (Adam) [4]

“An algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments”

1st Moment = Mean, 2nd Moment = Uncentered Variance

Computes individual adaptive learning rates for different parameters.
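To make the difference between the two optimizers concrete, here is a single-parameter sketch of each update rule. The Adam step follows the update described in [4] with its default beta1, beta2, and epsilon; the toy objective f(w) = (w - 3)^2 and the learning rate of 0.01 are chosen only for illustration.

```python
import math


def sgd_step(w, grad, lr):
    """Plain gradient descent update: w <- w - eta * grad."""
    return w - lr * grad


def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # 1st moment (mean)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # 2nd moment (uncentered variance)
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_sgd, w_adam = 0.0, 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3), lr=0.01)
    w_adam = adam_step(w_adam, 2 * (w_adam - 3), state, lr=0.01)
```

Dividing by the square root of the second-moment estimate is what gives each parameter its own effective step size: on the very first step the update has magnitude close to lr regardless of how large the gradient is.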

9 of 50

Learning Rate Schedule

Three learning rate schedules were selected for SGD optimization [5]:

Stepped Learning Rate (StepLR)

Reduces the learning rate every step_size epochs.

Cosine Annealing (CosineAnnealingLR) [6]

Uses a cosine function to decrease/increase the learning rate

Plateau (ReduceLROnPlateau) [7]

Keeps the learning rate constant while learning, but once learning plateaus, it decreases the learning rate.
Test loss used as the learning metric.
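The first two schedules have simple closed forms, sketched below in plain Python rather than through the PyTorch scheduler classes. The decay factor gamma = 0.1 and the example step_size are assumptions; the T_max = 10 and eta_min = 0 values mirror the cosine annealing settings listed later in the deck.

```python
import math


def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Follow half a cosine wave from base_lr down to eta_min over t_max epochs.

    Past t_max the cosine turns around, so the rate rises again.
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


stepped = [step_lr(0.01, e, step_size=10) for e in (0, 9, 10, 25)]
cosine = [cosine_annealing_lr(0.01, e, t_max=10) for e in (0, 5, 10, 20)]
```

The last cosine value shows the "decrease/increase" behaviour: at epoch 20 the rate is back at its starting value.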

10 of 50

Model Training

Recurrent Neural Network (RNN) [1]: Single hidden linear layer
Long Short Term Memory (LSTM) [2]: Gated hidden linear layers and cell state vectors
Gated Recurrent Unit (GRU) [3]: Gated hidden linear layer
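As a rough illustration of the simplest of the three networks, the sketch below runs a vanilla (Elman) RNN over a short sequence using h_t = ReLU(W_x x_t + W_h h_(t-1) + b). The weights and sizes are made up, and the LSTM and GRU add gating on top of this recurrence, which is not shown here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(x, h, w_x, w_h, b):
    """One recurrence step: combine the input and previous hidden state, then apply ReLU."""
    pre = [a + c + d for a, c, d in zip(matvec(w_x, x), matvec(w_h, h), b)]
    return [max(0.0, v) for v in pre]


def rnn_forward(sequence, h0, w_x, w_h, b):
    """Feed the sequence through the cell and return the final hidden state."""
    h = h0
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
    return h


# Input size 1, hidden size 2; all values are illustrative.
w_x = [[1.0], [-1.0]]
w_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
final_h = rnn_forward([[1.0], [2.0]], [0.0, 0.0], w_x, w_h, b)
```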

11 of 50

Hyperparameter Tuning (RNN)

The RNN was selected as the network to help define the hyperparameters for the other models.
Used the Daily BTC dataset
Evaluated using

Validation RMSE for accuracy
Epochs till convergence

How long do I need to train for?

Train vs. Test Loss

Does it converge at all?

Parameter	Value
# of Layers	3
Hidden Size	1028
# of Epochs	250
Initial LR	0.01
L1 Reg Lambda	0.001
Batch Size	# of Samples
Activation Function	ReLU
Sequence Length	20
Dropout Prob.	0.0

12 of 50

SGD + L1 Regularization

The SGD optimizer had trouble learning with L1 regularization on.
No change in LR, LR scheduler, or number of epochs appeared to help.

Parameter	Value
Optimizer	SGD
Loss Function	MSE
LR Schedule	Cosine Annealing
T Max	10
Eta Min	0

Alpha = 0.0 (Off)

RMSE: 0.03139

Alpha = 0.001 (On)

RMSE = Too Large

13 of 50

SGD + L1 Regularization

(OFF) Alpha = 0.0

(ON) Alpha = 0.001

L1 Regularization introduced instabilities that made it difficult to train with SGD.

14 of 50

SGD + Dropout

A Dropout value of 0.10 had the best performance.
However, this is only a slight improvement over the other dropout values.

Parameter	Value	RMSE
Dropout	0.00	0.0295
Dropout	0.10	0.0266
Dropout	0.25	0.0281

15 of 50

Adam Optimizer

Avg RMSE was 0.0238, an improvement over SGD.

Lessons Learned:

Adam optimizer had trouble converging with a larger model (Hidden Size and # Layers).
LR Schedule didn’t appear to matter for Adam.

Adam adapts the LR for each parameter, so it handles its own schedule.

Dropout caused convergence issues, so it was turned off.
The largest improvement was the stability. Adam consistently converged to a solution.

Parameter	Value
Loss Function	L1 Loss
Optimizer	Adam
# of Layers	3
Hidden Size	256
Dropout	0.0
LR Scheduler	Cosine Annealing (10, 0)

16 of 50

Best RNN

Best Achieved RMSE: 0.02099

Parameter	Value
# of Layers	2
Hidden Size	64
# of Epochs	100
Initial LR	0.01
L1 Reg Lambda	0.001
Batch Size	# of Samples
Activation Function	ReLU
Sequence Length	20
Dropout Prob.	0.0
LR Schedule	ReduceLROnPlateau(‘min’, 0.5, 5)
Loss	L1
Optimizer	Adam
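The plateau schedule in the table can be sketched as a small class. This reads the three arguments as mode = 'min', factor = 0.5, and patience = 5, which is an assumption about the shorthand, and it omits the threshold and cooldown options of the real PyTorch scheduler.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule for a metric that should decrease."""

    def __init__(self, lr, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric and return the learning rate to use next."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # No improvement for more than `patience` epochs: cut the rate.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.01)
history = [scheduler.step(1.0) for _ in range(7)]  # the loss never improves
```

With a loss that stays flat, the rate holds at 0.01 through the first epoch and the five patience epochs, then halves on the next one.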

17 of 50

Data Output

Models were trained to predict Open, Close, Low, High and Volume.
Open, Close, Low, and High trained well.
Volume could not be predicted.
We did not explore removing Volume from the data.

18 of 50

Timescale

The error decreased as smaller time increments were used. Some possible reasons are below:

More time samples are available to learn from. Compared with the daily set, the 4 Hour and 30 Min sets have roughly 6x and 48x more data, respectively, per the sequence counts in the Train/Test Split table.
The price’s stochastic nature may be less chaotic in smaller time increments.

Daily RMSE: 0.02099

4 Hour RMSE: 0.0104

30 Min RMSE: 0.005

19 of 50

Hyperparameter Tuning - LSTM

The LSTM was selected as the network to help define the hyperparameters.
Used the Daily BTC dataset
Evaluated using

Test RMSE for accuracy
Train vs. Predicted Plot

Parameter	Value
# of Layers	1
Hidden Size	50
# of Epochs	100
Initial LR	0.01
L1 Reg Lambda	0.0001
Batch Size	# of Samples
Sequence Length	30
Dropout Prob.	0.0

20 of 50

Hyperparameter Tuning - LSTM

Parameter	Value
# of Layers	1
Hidden Size	50
# of Epochs	100
Initial LR	0.01
L1 Reg Lambda	0.0001
Batch Size	# of Samples
Sequence Length	30
Dropout Prob.	0.0

Reason for Choice

The parameter values are deliberately simple because our data is neither complex nor large. Adding more layers or units might cause overfitting and would consume more computational power.

21 of 50

Hyperparameter Tuning - LSTM

Sequence Length Comparison

Since our data is neither large nor complex, I tried three short sequence lengths: 10, 20, and 30. Because we are predicting coin prices from historical data, a longer sequence length might work better. However, with limited data a longer sequence might cause overfitting, so I stopped at 30. Also, since we are using daily data, a shorter sequence length should be sufficient. (For the hourly dataset we might need a longer sequence to capture daily patterns.)

Sequence Length	Test RMSE
10	0.022302017
20	0.021161662
30	0.021013899

The training speed slowed down noticeably at 30

22 of 50

Model Generalization - LSTM

The dataset for this coin is smaller than for the other coins. The model underfit slightly with the same parameters, so the number of layers and epochs was increased.

23 of 50

Model Output - LSTM

For Aptos Coin:

Parameter	Value
# of Layers	2
Hidden Size	50
# of Epochs	500
Initial LR	0.01
L1 Reg Lambda	0.0001
Batch Size	# of Samples
Sequence Length	30
Dropout Prob.	0.0

Test RMSE improved from 0.047582 to 0.037924

24 of 50

Hyperparameter Tuning (GRU)

The GRU was selected as an alternative network to help define the hyperparameters.
Used the Daily BTC dataset
Evaluated using

Test RMSE for accuracy
Actual vs Predicted Plots

How well does it converge?

Parameter	Value
# of Layers	2
Hidden Size	50
# of Epochs	100
Initial LR	0.01
L1 Reg Lambda	0.0001
Batch Size	# of Samples
Sequence Length	10-40
Dropout Prob.	0.0

25 of 50

GRU - SGD + L1 Regularization

The SGD optimizer had trouble learning both with and without L1 regularization.
The number of epochs remained the same.

Parameter	Value
Optimizer	SGD
Loss Function	MSE

Alpha = 0.0 (Off)

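RMSE (Seq Length 10-40): 0.1983, 0.172, 0.1942, 0.184

Alpha = 0.0001 (On)

RMSE (Seq Length 10-40): 0.1888, 0.1924, 0.1869, 0.1712

With L1 regularization off, sequence length 20 (0.172) matched the performance of sequence length 40 with L1 regularization on (0.1712), i.e. the same performance at half the sequence length.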

26 of 50

GRU - SGD + Dropout

A Dropout value of 0.2 had the best performance.
Each RMSE range covers sequence lengths of 10-40.
Even so, the RMSE remained too large with dropout.

Parameter	Value	RMSE
Dropout	0	0.19-0.2
Dropout	0.1	0.176-0.205
Dropout	0.2	0.133-0.198

27 of 50

GRU - Adam + L1 Regularization + MSE Loss

All the coins performed much better with Adam.
Each coin was trained separately; Ethereum performed the best.

Coin	10 Seq	20 Seq	30 Seq	40 Seq
Bitcoin	0.02731	0.0302	0.0246	0.0291
Ethereum	0.0203	0.01923	0.0247	0.0186
BNB	0.0247	0.02348	0.0267	0.02949
TIA	0.0381	0.0471	0.0437	0.0375
APT	0.0415	0.0408	0.0432	0.0421

28 of 50

Best GRU

Best Achieved RMSE: 0.01793

Parameter	Value
# of Layers	2
Hidden Size	20
# of Epochs	100
Initial LR	0.01
L1 Reg Lambda	0.0001
Batch Size	# of Samples
Sequence Length	30
Dropout Prob.	0.0
Loss	L1
Optimizer	Adam

29 of 50

Data Comparison

One thing we could do for future predictions is remove Volume and look only at the Open, Close, Low, and High variables.
Volume had a large number of outliers, which made it difficult to predict with our models.

An alternative is to preprocess the dataset so that the volume data is transformed into the same working region as the other 4 features.
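One possible way to bring Volume into the same working region as the price features is a log transform followed by min-max scaling. This is only a sketch of one option, not something the project tested, and the sample volumes are made up.

```python
import math


def log_scale(volumes):
    """Compress heavy-tailed volumes with log(1 + v), then scale to [0, 1]."""
    logged = [math.log1p(v) for v in volumes]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]


volumes = [0.0, 9.0, 99.0, 999999.0]  # one extreme outlier
scaled_volumes = log_scale(volumes)
```

With plain min-max scaling the first three values would all sit near 0 because of the outlier; after the log transform they are spread across the range.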

30 of 50

SGD vs. Adam

Adam might work better than SGD for our dataset due to noise. Adam is more robust against noise in the gradient computations because it uses adaptive learning rates.

SGD with L1 regularization added might not work well for our dataset because the data size is small.

The variance introduced by SGD, combined with L1 regularization, can drive coefficients to zero and potentially lead to underfitting.

31 of 50

Model Application and Challenges in Practice

Crypto coin price prediction analysis can be used by many people for different purposes, such as risk management and investment analysis.

In our analysis, the models achieved low prediction error. This is probably because we are using daily data, which is less stochastic. (The coin price can only change so much within a day.)

32 of 50

Model Summary

GRU achieved the best RMSE, by about 0.003, with the fewest parameters.
All networks trained better with the Adam optimizer.
Dropout severely degraded performance in all networks.
Learning rate schedule not as important with Adam optimizer.

Model	Best RMSE	# Param	# Epochs	Seq Len
RNN	0.02099	1280	100	20
LSTM	0.02116	750	100	30
GRU	0.01793	300	100	30

33 of 50

References

[1] https://en.wikipedia.org/wiki/Recurrent_neural_network

[2] https://en.wikipedia.org/wiki/Long_short-term_memory

[3] https://en.wikipedia.org/wiki/Gated_recurrent_unit

[4] https://arxiv.org/abs/1412.6980

[5] https://pytorch.org/docs/stable/optim.html#module-torch.optim.lr_scheduler

[6] https://paperswithcode.com/method/cosine-annealing

[7] https://wiki.cloudfactory.com/docs/mp-wiki/scheduler/reducelronplateau

34 of 50

Initial EDA Presentation

(BACKUP)

35 of 50

Outline

Background
Exploratory Data Analysis (EDA)
Assumptions
Project Goals
Predictive Power
Summary

36 of 50

Background Info

This dataset is from Kaggle and contains 234 Crypto Coins/Altcoins with historical Open, High, Low, Close, and Volume (OHLCV) prices traded on the Binance Exchange. The dataset is around 7GB and holds direct market data from the past 8 years (2024 included).

The data contains a total of 7 different sets that track daily, hourly (hours and minutes), and weekly rate changes. For our analysis, the D1 folder will be our main resource; it holds the daily price and trading volume changes. If needed, we will also refer to data from the other timestamp folders for more precise predictions.

37 of 50

Background Basic EDA

The D1 dataset (the concatenation of all datasets in the D1 folder) has 263,800 entries with no missing values. The data covers 2017-07-14 to 2024-02-13.

When we look at the max open value for each coin, the ten coins with the highest max open values all reached them in 2021, with the following distribution (YFI and BTC being the largest):

With the top two values removed, the distribution is:

38 of 50

Background Basic EDA

We can see that most coins had their max opening price in 2021, with two peaks occurring in early summer and in winter.

39 of 50

Background Basic EDA

We can see the same trend in the coins’ max closing prices.

40 of 50

Background Basic EDA

Below are the scatter plots of the top ten max openings vs. their closing value of the day:

Top 10

Removed top 2

41 of 50

We can see that Bitcoin prices were soaring between early 2021 and early 2022. There was a steep decline between early 2022 and mid-2022. Afterwards, prices steadily started to increase again until 2024. The highest closing price Bitcoin reached is 67525.83 per share.

42 of 50

We can see that Ethereum prices were soaring between early 2021 and early 2022. There was a steep decline between early 2022 and mid-2022. Afterwards, prices steadily started to increase again until 2024. The highest closing price Ethereum reached is 4807.98 per share.

43 of 50

Assumptions

Crypto prices behave like stock prices: think of the closing price as the price of one share of the coin
The most important coins are the most expensive ones
The future price of a coin is based on past performance of a coin

44 of 50

Goal of the project

Predict the daily closing price of bitcoin
These predictions will leverage 4 sources of information.

The trading price and volume of bitcoin.
The trading price and volume of altcoins.
The various timestep granularities (daily, weekly, hourly, etc.), with a focus on the daily closing price.

Only the altcoins which show a strong correlation to bitcoin will be used.

However, there could be relationships with weakly correlated altcoins which yield better prediction accuracy.

Given this is a time sequence of data, we will use RNNs, LSTMs or GRUs as the network model.

45 of 50

Altcoin Correlation (1/2)

Many altcoins are closely tied to the price of bitcoin (BTC).

The figure on the right shows the normalized closing price of the top 10 cryptocurrencies (according to average price).

Many exhibit similar trends to bitcoin.

46 of 50

Altcoin Correlation (2/2)

Of the 234 altcoins (not counting Bitcoin), several exhibited a strong correlation between their closing price and Bitcoin's.
To keep the input vector smaller, we will try to train the algorithm with weakly correlated (>.3) and strongly correlated (>.6) altcoins.

Strong Negative Correlation
Token	Pearson Score
EDU	-0.74
LQTY	-0.68
SSV	-0.65
AGIX	-0.63

Strong Positive Correlation
Token	Pearson Score
ETH	.93
BNB	.87
TIA	.79
APT	.66
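The Pearson scores in these tables and the thresholding step can be sketched as follows. Applying the threshold to the absolute value of the score, so that strong negative correlations are also kept, is an assumption, and the series and coin names below are toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def select_correlated(btc, altcoins, threshold):
    """Keep the altcoins whose |Pearson score| against BTC exceeds the threshold."""
    selected = {}
    for name, closes in altcoins.items():
        score = pearson(btc, closes)
        if abs(score) > threshold:
            selected[name] = score
    return selected


btc = [1.0, 2.0, 3.0, 4.0]
altcoins = {
    "UP": [2.0, 4.0, 6.0, 8.0],      # moves with BTC
    "DOWN": [8.0, 6.0, 4.0, 2.0],    # moves against BTC
    "FLAT": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with BTC
}
strong = select_correlated(btc, altcoins, threshold=0.6)
```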

47 of 50

The importance of cryptocurrency forecasting (1/2)

Investors and Trading

The SEC has approved several bitcoin exchange traded funds (ETFs) which give institutional stability to cryptocurrencies while also enabling a large new pool of people to start trading cryptocurrencies without the concern of exchanges going bankrupt (FTX for example).

Market Analysis

With hundreds of billions of dollars flowing through cryptocurrencies, it is prudent that regulatory bodies such as the SEC have methods for understanding volatility and risk of the underlying security (BTC) in order to enact measures to ensure market stability.
If BTC were to drop to 0, hundreds of billions would be wiped out. Predicting moments of extreme volatility would allow the SEC to enact countermeasures, such as pausing trading of bitcoin ETFs, to allow the volatility to pass.

48 of 50

The importance of cryptocurrency forecasting (2/2)

Policy and Regulation

On Sept 7th, 2021, El Salvador became the first country to adopt bitcoin as legal tender, tying an entire country's currency stability to bitcoin.
Predicting the price of bitcoin would allow policies and regulations to either stabilize BTC or implement other methods to protect the assets of the nation’s population.

These policies and regulations could be similar to a variety of regulations enacted by the U.S. Federal Government after the Great Depression, such as FDIC insured bank accounts.

However, given the volatility of cryptocurrencies, carefully monitoring and forecasting would be required to prevent the consequences of large movements in price.

Imagine how the population of El Salvador would feel if their legal tender value dropped by 50% in a single day.

49 of 50

Summary

The dataset contains 234 Crypto Coins/Altcoins with historical Open, High, Low, Close, and Volume (OHLCV) prices traded in the Binance Exchange.
We will utilize the daily (D1) price information, which contains 263,800 entries with no missing values.
The goal of this project will be to predict the price of Bitcoin.
Predictions will take place by exploring a variety of inputs given the large amount of data contained within the dataset.

50 of 50

Dataset Background Info

Where is the dataset from?
Who is this impacting?
How is this dataset created?
When was this dataset created and timeline?
We have 236800 rows of D1 (one day) data for all 235 coins/altcoins
There are no missing values in the dataset, in particular for Bitcoin and Ethereum
We will only be using D1 data - this is the most efficient way to analyze the data
The highest closing prices for the coins fall between early 2021 and early 2022
Most coins started gaining value in 2018