1 of 94

Applied Big Data Analytics

Zubair Nabi

2 of 94

Lecture 9

Forecasting Industrial Machine Failures

3 of 94

Our Big Data Analytics Stack

4 of 94

Our Big Data Analytics Stack: Today’s Lecture

5 of 94

Use Case for Today

Forecasting Industrial Machine Failures

6 of 94

What is an Industrial Machine?

Industrial machines, equipment, and assets encompass a wide range of physical tools and systems used in manufacturing, production, and processing industries

This includes machinery such as conveyor belts, robotic arms, CNC machines, and assembly lines, as well as supporting equipment like pumps, compressors, and safety systems

These assets play a crucial role in enhancing productivity, efficiency, and quality in industrial operations

7 of 94

Regular Maintenance and Lifecycle Management

Effective management of industrial equipment is essential for ensuring:

Reliability and Performance

Regular maintenance ensures that machines and equipment operate reliably and efficiently

It helps prevent unexpected breakdowns, thereby maintaining optimal performance and minimizing disruptions in production

Cost Savings

Proactive maintenance reduces the likelihood of costly repairs and replacements

By addressing potential issues early, companies can avoid major failures that lead to expensive downtime and lost productivity

Asset Longevity

Lifecycle management focuses on maximizing the useful life of equipment

By implementing scheduled maintenance and monitoring performance, organizations can extend the lifespan of their assets, ensuring a better return on investment over time

8 of 94

Main Types of Maintenance

The four main types of maintenance include:

Corrective Maintenance

Reactive maintenance that addresses equipment failures after they happen. This ensures quick restoration of operations, although it can lead to increased downtime and repair costs

Condition-Based Maintenance

Maintenance performed based on the actual condition of equipment, determined through real-time monitoring. This approach helps optimize maintenance schedules and resource allocation

Preventive Maintenance

Scheduled maintenance activities aimed at preventing equipment failures. This includes inspections, cleaning, and part replacements, helping to minimize downtime and extend asset life

Predictive Maintenance

Utilizes data and analytics to anticipate equipment failures before they occur. This approach reduces unexpected breakdowns by allowing maintenance to be performed based on actual equipment condition

9 of 94

Industrial Use Cases

Failure Pattern Detection

Learn from historical failure data to recognize patterns and trends that lead to equipment breakdown. Helps in root cause analysis and prevention of similar issues

Example: Identifying consistent patterns (like pressure spikes) that precede valve failures

Anomaly Detection

Identify unusual patterns in sensor data (e.g., temperature, vibration, or pressure) that indicate potential equipment issues or deviations from normal operating conditions. Anomalies can signal early signs of failure or wear

Example: Detecting sudden increases in motor vibration that could signal bearing failure

Predictive Maintenance

Forecast equipment failure based on historical data, allowing for timely maintenance before breakdowns occur. This reduces downtime and maintenance costs

Example: Predicting when a turbine might need servicing based on temperature and vibration trends over time

Remaining Useful Life (RUL) Estimation

Estimate how much longer a piece of equipment can operate effectively before failure, using degradation patterns from time series data

Example: Predicting the remaining lifespan of a pump based on wear-related factors such as pressure, flow rates, and runtime

10 of 94

Key Industrial Sensors

Temperature

Measure temperature in processes (e.g., thermocouples)

Pressure

Monitor pressure levels in tanks and pipes (e.g., piezoelectric sensors)

Flow

Measure the flow rate of liquids or gases (e.g., electromagnetic flow meters)

Level

Detect the level of liquids or solids in containers (e.g., capacitive)

Vibration

Monitor equipment vibrations to detect imbalances or mechanical issues (e.g., accelerometers)

11 of 94

The number of sensors in a plant might vary from a few hundred to millions in a complex operation such as gas production or nuclear power generation

12 of 94

In what format should we be collecting and storing sensor data?

13 of 94

Enter Time Series Data

Time series data refers to a sequence of data points collected or recorded at successive points in time, often at uniform intervals

In the context of sensors, time series data is crucial for monitoring, analyzing, and interpreting various physical phenomena in real-time industrial applications

14 of 94

Characteristics of Time Series Data

Temporal Order

Each data point is associated with a specific timestamp, indicating the precise moment the observation was made. This ordering is essential for understanding how values evolve over time

Continuous or Discrete

Time series data can be continuous (measured at every moment, like temperature) or discrete (measured at specific intervals, like daily sales)

Trends

Time series data may exhibit long-term movements or trends, showing consistent increases or decreases over time

Seasonality

Regular patterns or cycles in the data that occur at predictable intervals, such as monthly sales peaks or annual temperature changes

15 of 94

Characteristics of Time Series Data (2)

Stationarity

Many time series analyses assume stationarity, meaning that statistical properties (like mean and variance) do not change over time. Non-stationary data may require transformation to achieve stationarity

Autocorrelation

Time series data often shows correlation between observations at different time points, meaning past values can influence future values

Noise

Time series data typically contains random variations or noise, which can obscure underlying patterns

Lagged Values

The data may include lagged values (previous observations) to capture relationships over time

16 of 94

A Quick Note on Time Stamped Data

A dataset where each individual record includes a timestamp indicating when it was created or recorded

The timestamp provides context for the data but does not imply any specific sequence, relationship, or regular interval among records

Therefore, not all time-stamped data counts as time series data

17 of 94

A Quick Note on Time Stamped Data (2)

Traditional machine learning algorithms (like decision trees) can be applied to time-stamped data by treating the timestamps as features. This allows models to learn from various attributes of the data without necessarily considering the temporal aspect

However, for time series data we need a different approach, because each data point is influenced by previous points

18 of 94

The focus of this lecture is time series analytics

Storage of time series data was already discussed in Lecture 2

Transport/processing of time series/streaming data will be discussed in Lecture 11

19 of 94

Representative Dataset

Timestamp            Machine ID   Temperature (°C)   Pressure (bar)   Vibration (mm/s)   Runtime (hours)   Status (Failure/Normal)
2024-10-01 00:00:00  M001         70                 3.0              0.5                100               Normal
2024-10-01 01:00:00  M001         70                 3.0              0.5                101               Normal
2024-10-01 02:00:00  M001         71                 3.0              0.5                102               Normal
2024-10-01 03:00:00  M001         72                 3.2              0.6                103               Normal
2024-10-01 04:00:00  M001         73                 3.3              0.6                104               Normal
2024-10-01 05:00:00  M001         74                 3.5              0.7                105               Normal
2024-10-01 06:00:00  M001         75                 4.0              0.9                106               Normal
2024-10-01 07:00:00  M001         80                 3.7              1.2                107               Normal
2024-10-01 08:00:00  M001         95                 6.0              2.5                108               Normal
2024-10-01 09:00:00  M001         100                7.0              3.0                109               Normal

20 of 94

Industrial Data Format

We have assumed a flat, simple structure so far, but…

In industrial settings, data collected from sensors and monitoring systems is often organized in a VQT format, which stands for Value, Quality, and Timestamp

This format is essential for effectively managing time series data, particularly in scenarios where precision and reliability are crucial
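As an illustration, a single VQT record might look like the following sketch (the field names and quality codes are hypothetical; actual schemas vary across historians and protocols such as OPC UA):

# Hypothetical VQT record for one sensor reading (real schemas vary by system)
vqt_record = {
    "value": 72.4,                        # V: the measured reading (e.g., °C)
    "quality": "Good",                    # Q: e.g., Good / Uncertain / Bad
    "timestamp": "2024-10-01T03:00:00Z",  # T: when the reading was taken
}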

21 of 94

Feature engineering techniques for time series can be divided into two categories:

  1. techniques that enable standard machine learning models (like decision trees) to handle time series data
  2. techniques designed specifically for time series models, capturing trends, seasonality, and temporal dependencies

22 of 94

Let’s go through the first category

Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/FeatureEngineering-Timeseries.ipynb

23 of 94

Feature Engineering: Lag Features

Description: Create features that represent previous time steps (lags) of the target variable

When: Use lag features when the current observation is influenced by past values of the target variable

Why: Lag features help the model capture temporal dependencies in the data

Model Type: For standard models (e.g., decision trees). These models don't natively handle sequential data, so introducing lag features provides them with necessary temporal context

import pandas as pd

data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['lag_1'] = df['value'].shift(1)  # value one time step back
df['lag_2'] = df['value'].shift(2)  # value two time steps back

value   lag_1   lag_2
10      NaN     NaN
20      10      NaN
30      20      10
40      30      20
50      40      30

24 of 94

Feature Engineering: Rolling Statistics

Description: Calculate rolling statistics (mean, median, standard deviation) over a specified window of time

When: Use rolling statistics when you need to smooth short-term fluctuations or when there’s seasonality in the data

Why: Helps reduce noise and reveal trends, improving model performance by incorporating moving averages

Model Type: Standard models benefit from rolling statistics since they don’t handle temporal smoothness

import pandas as pd

data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['rolling_mean'] = df['value'].rolling(window=3).mean()  # mean over a 3-step window

value   rolling_mean
10      NaN
20      NaN
30      20
40      30
50      40

25 of 94

Feature Engineering: Exponentially Weighted Moving Average (EWMA)

Description: Calculate an exponentially weighted moving average to give more weight to recent observations

When: Use EWMA when you want to prioritize recent values more heavily than older values

Why: EWMA smooths out the time series but allows the model to be more responsive to recent changes

Model Type: Standard models can benefit from EWMA to smooth data and emphasize recent trends

import pandas as pd

data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['ewma'] = df['value'].ewm(span=3, adjust=False).mean()  # recent values weighted more heavily

value   ewma
10      10
20      15
30      22.5
40      31.25
50      40.625

26 of 94

Feature Engineering: Exponentially Weighted Moving Average (EWMA)

s_t = α·x_t + (1 − α)·s_{t−1}, with s_0 = x_0

Where:

  • s_t is the EWMA at time t
  • x_t is the actual observation at time t
  • α is the smoothing factor (with 0 < α ≤ 1)
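As a sanity check, the recurrence reproduces the table on the previous slide. Note that pandas maps span to the smoothing factor via α = 2/(span + 1), so span=3 gives α = 0.5:

import pandas as pd

values = [10, 20, 30, 40, 50]
alpha = 2 / (3 + 1)  # pandas convention: alpha = 2/(span + 1) -> 0.5 for span=3

# Manual recurrence: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 = x_0
s = [values[0]]
for x in values[1:]:
    s.append(alpha * x + (1 - alpha) * s[-1])

print(s)  # [10, 15.0, 22.5, 31.25, 40.625]
print(pd.Series(values).ewm(span=3, adjust=False).mean().tolist())  # same values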

27 of 94

Feature Engineering: Date/Time Features

Description: Extract features from timestamps (e.g., day of the week, hour of the day)

When: Use date/time features when you expect periodic patterns or seasonality (e.g., weekday vs weekend, holiday effects)

Why: Capturing cyclical patterns is critical for improving model accuracy in data with time-based trends

Model Type: Date-based features are especially useful for models like decision trees and random forests that don’t natively consider time

import pandas as pd

date_rng = pd.date_range(start='2023-01-01', periods=4, freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = [10, 20, 30, 40]
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['month'] = df['date'].dt.month

date         value   day_of_week   month
2023-01-01   10      6             1
2023-01-02   20      0             1
2023-01-03   30      1             1
2023-01-04   40      2             1

28 of 94

Feature Engineering: Time-Based Aggregation

Description: Aggregate data over time windows (e.g., sum or mean of hourly/daily data)

When: Use aggregation when you need to reduce the granularity of your data or when working with seasonality patterns

Why: Aggregation reduces noise and emphasizes long-term trends, especially useful when there are multiple readings within a time period

Model Type: Time-based aggregation can simplify time series for standard models like decision trees

import pandas as pd

data = {'date': pd.date_range(start='2023-01-01', periods=10, freq='H'), 'value': range(10)}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
daily_agg = df.resample('D').sum()  # aggregate the hourly readings into daily totals

date         value
2023-01-01   45

29 of 94

Feature Engineering: SMOTE

Description: SMOTE (Synthetic Minority Over-sampling Technique) increases the number of instances in the minority class by generating synthetic samples rather than simply duplicating existing samples

When: You have a dataset with an imbalanced target variable (e.g., very few 'Failure' cases compared to 'Normal' cases)

Why: Improves model performance by providing more training examples for the minority class, allowing the model to learn more about this class's characteristics

Model Type: Helps standard machine learning techniques to deal with class imbalance

import pandas as pd
from imblearn.over_sampling import SMOTE

data = {
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Status': [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)

X = df[['value']]
y = df['Status']

# k_neighbors=1 because the minority class (Status=0) has only two samples
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(X, y)

df_resampled = pd.DataFrame(X_resampled, columns=['value'])
df_resampled['Status'] = y_resampled

30 of 94

How SMOTE Works

1. Choose a Minority Class Sample: Select a random sample from the minority class

2. Find K-Nearest Neighbors: For each selected sample, identify its k-nearest neighbors within the minority class (k is commonly set to 5). These neighbors represent points that are close to the selected sample in the feature space

3. Generate Synthetic Samples: Create a new synthetic sample by interpolating between the selected minority class sample and one of its randomly chosen k-nearest neighbors, as sketched below

4. Repeat: Repeat this process until the minority class has enough synthetic samples to balance out the majority class
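Step 3 is a simple linear interpolation in feature space; a minimal sketch with made-up feature vectors:

import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([10.0, 3.0])   # selected minority-class sample (illustrative values)
x_nn = np.array([12.0, 3.5])  # one of its k-nearest minority-class neighbors

lam = rng.random()                # interpolation weight drawn uniformly from [0, 1)
x_new = x_i + lam * (x_nn - x_i)  # synthetic sample on the segment between the two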

31 of 94

Properties of SMOTE

Prevents Overfitting

By generating synthetic samples rather than duplicating existing ones, SMOTE reduces the risk of overfitting

Minority Class Variability

New samples based on nearby samples in the feature space help the model generalize better for the minority class

Overlap Between Classes

Not always effective for datasets with high overlap between classes

32 of 94

Now that we have mastered the art of feature engineering for time series data, let’s use standard machine learning to predict machine failure

Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/MachineFailure-MultivariateStandardML.ipynb
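A minimal sketch of the idea (the file, column names, and model choice are illustrative; the notebook's actual pipeline may differ): engineer lag and rolling features from the sensor readings, then fit a standard classifier on the labelled rows:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('sensor_readings.csv')  # hypothetical file shaped like the representative dataset

# Temporal context for a non-sequential model: lag and rolling-window features
df['temp_lag_1'] = df['Temperature'].shift(1)
df['vib_roll_mean'] = df['Vibration'].rolling(window=3).mean()
df = df.dropna()

X = df[['Temperature', 'Pressure', 'Vibration', 'Runtime', 'temp_lag_1', 'vib_roll_mean']]
y = (df['Status'] == 'Failure').astype(int)

# shuffle=False keeps the temporal order intact between train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(clf.score(X_test, y_test))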

33 of 94

Detecting machine failure is a binary classification problem

Now let’s focus on a different problem: detecting an anomaly at a particular point in time, for which standard machine learning models will not work

34 of 94

Enter ARIMA

ARIMA stands for AutoRegressive Integrated Moving Average. It combines three key concepts:

AR (AutoRegression): Uses past values to predict future values

I (Integrated): Differences the data to make it stationary, which helps remove trends

MA (Moving Average): Uses past forecast errors to improve accuracy

35 of 94

Key Parameters of ARIMA

ARIMA is defined by three parameters:

Autoregressive Order (p)

Number of lag observations in the autoregressive model

Differencing Order (d)

Number of times data needs differencing to make it stationary

Moving Average Order (q)

Size of the moving average window, or number of lagged forecast errors

36 of 94

Example of ARIMA Parameters

Let’s say we have temperature readings from a machine every hour

The objective is to monitor these readings for anomalies, such as unexpected spikes or drops in temperature that may indicate equipment failure

37 of 94

Example of ARIMA Parameters (2)

Autoregressive Order (p)

If p=2, the model will use the temperature readings from the previous two hours to predict the current hour's reading

This captures the relationship between the current temperature and its past values

In anomaly detection, if an anomaly is detected (e.g., a sudden spike), examining the past values helps in understanding if it’s part of a trend or an outlier

Differencing Order (d)

If the temperature data shows a clear trend (e.g., gradually increasing temperature over weeks), we might set d=1 to subtract the previous hour's temperature from the current one

This differencing helps stabilize the mean, making the data more suitable for modeling

For anomaly detection, stationarity is crucial because it allows us to detect deviations from normal behavior accurately

Moving Average Order (q)

If q=1, the model uses the error from the previous hour’s prediction to adjust its current prediction

This allows the model to learn from its previous mistakes

In anomaly detection, if the model's predictions consistently deviate from actual readings (large errors), it might indicate an underlying anomaly in the temperature readings

38 of 94

Key Assumptions of ARIMA

Stationarity

The time series should be stationary, meaning its statistical properties (mean, variance, autocorrelation) do not change over time

Univariate

The time series consists of observations on a single variable over time

Linearity

The relationship between the observations and the predicted values is linear. ARIMA relies on linear combinations of past observations and errors

Independence of Errors

The residuals (forecast errors) from the model should be uncorrelated with one another

Normality of Residuals

The residuals are often assumed to be normally distributed, especially for statistical inference

Homoscedasticity

The variance of the errors should be constant over time

39 of 94

You might have noticed that some of these assumptions are shared between Linear Regression and ARIMA

This is because both ARIMA and linear regression are built on the framework of statistical modeling, which involves making assumptions about the data-generating process to allow for effective estimation and inference

40 of 94

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable

The term auto-regression indicates that it is a regression of the variable against itself

This technique is similar to a linear regression model in how it uses past values as inputs to the regression
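Formally, an AR(p) model regresses the series on its own p most recent values (standard formulation, with c a constant and ε_t white noise):

y_t = c + φ₁·y_{t−1} + φ₂·y_{t−2} + … + φ_p·y_{t−p} + ε_t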

41 of 94

Let’s go through the second category of time series feature engineering

Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/FeatureEngineering-Timeseries-Specific.ipynb

42 of 94

Feature Engineering: Differencing

Description: Subtract the previous observation from the current observation to stabilize the mean of the series

When: Use differencing when the time series is non-stationary (has a trend or seasonality)

Why: Differencing helps to remove trends and make the series stationary, which is required by many models

Model Type: ARIMA requires stationarity, and differencing is built into the model's design

import pandas as pd

data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['value_diff'] = df['value'].diff()  # first difference: value_t - value_{t-1}

value   value_diff
10      NaN
20      10
30      10
40      10
50      10

43 of 94

Feature Engineering: Fourier Transform

Description: Use Fourier Transforms to capture periodic patterns

When: Use when the data has periodic, trend, or seasonal components

Why: Helps in isolating different components of the series to better model or analyze each

Model Type: Time series models such as ARIMA, which rely heavily on trends and seasonality

44 of 94

Fourier Transform

The Fourier Transform is a mathematical technique that transforms a time-domain signal into its constituent frequencies

It decomposes signals into sinusoidal components, allowing us to analyze their frequency content

This transformation is fundamental in various fields, including signal processing, communications, and time series analysis

45 of 94

Some Example Use Cases of Fourier Transform

Communication

Fourier transforms break a complex signal into simple sine waves. In communication systems, this helps to split a signal into many smaller parts so data can be sent over several channels at once. It makes the process of sending and receiving information faster and more reliable

Signal Processing

When you have a messy signal (like a radio signal with static), Fourier transforms help by breaking the signal into its basic parts. This makes it easier to design filters that remove unwanted noise, leaving a cleaner and clearer signal

Medical Imaging

In MRI machines, the data collected is in the form of complex signals. Fourier transforms convert this raw data into clear images of the inside of the body. This helps doctors see organs and tissues accurately

Audio Processing

Audio signals can be split into different frequencies using Fourier transforms. This allows engineers to isolate and adjust parts of the sound—such as boosting bass or reducing background noise—to improve the overall quality of music or speech

46 of 94

Time vs Frequency Domain

Time Domain

The representation of a signal as it varies over time. For example, an audio signal is typically represented as amplitude versus time

Focus on how things evolve over time, which is great for observing sequences and transitions

Frequency Domain

The representation of a signal in terms of its frequency components. It reveals how much of each frequency is present in the original signal

Break down the data into its fundamental parts, revealing the "recipe" of patterns or cycles present, independent of when they occur

47 of 94

Time vs Frequency Domain (2)

Imagine an orchestra playing a piece of music

Time Domain

You can hear changes in volume, rhythm, and intensity as time progresses, but it’s difficult to separate out individual instruments or underlying patterns precisely

Frequency Domain

You’re examining what pitches and patterns are present in the music, such as a strong bass line, a repeating melody, or high-pitched harmonies

48 of 94

The Mathematical Principle behind Fourier Transform

Any periodic function can be represented as a sum of sine and cosine functions (or complex exponentials)

These sinusoidal functions are characterized by their frequency, amplitude, and phase

49 of 94

Amplitude vs Frequency vs Phase

50 of 94

Fourier Transform Formulae

Continuous Fourier Transform

F(f) = ∫_{−∞}^{∞} f(t)·e^{−2πift} dt

Where:

F(f) is the output frequency spectrum

f(t) is the original time-domain signal

e^{−2πift} represents the complex exponential form of sine and cosine functions

Discrete Fourier Transform

X(k) = Σ_{n=0}^{N−1} x(n)·e^{−2πikn/N}, for k = 0, …, N−1

Where:

X(k) is the output frequency spectrum

x(n) is the input sequence

N is the total number of samples

51 of 94

Continuous vs Discrete Signals

Continuous Fourier Transform

Imagine you have an analog audio signal, like the sound produced by a guitar string vibrating in the air. This signal exists continuously over time, and you can use the Continuous Fourier Transform (CFT) to break it down into its frequency components (e.g., the fundamental frequency and overtones) by integrating over the entire signal

Discrete Fourier Transform

Consider a digital recording of that guitar sound, where the audio is captured at fixed intervals (for instance, 44,100 samples per second). The signal is now a series of numbers, not a continuous waveform. You would use the Discrete Fourier Transform (DFT), typically computed via the Fast Fourier Transform (FFT), to convert these samples into a discrete set of frequency components that represent the original sound

52 of 94

Good way to understand guitar harmonics: https://www.youtube.com/shorts/1hRjMVdTVgE

53 of 94

Fast Fourier Transform (FFT) is an efficient O(N log N), divide-and-conquer implementation of the DFT, which takes O(N²) when computed directly

FFT is one of the most important algorithms in the world

Without the FFT, many modern technologies, such as digital audio, medical imaging (MRI, CT), and telecommunications, would either be impractical or far less efficient

54 of 94

Feature Engineering: Fourier Transform Example

55 of 94

Feature Engineering: Fourier Transform Example (2)

The spike at 0.1 in the amplitude spectrum indicates that the dominant periodic behavior in the time series corresponds to a sinusoidal wave that completes one cycle every 10 days

Relationship between frequency and period

T = 1/f

In our case:

f = 0.1

T = 1/0.1 = 10 days
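A minimal numpy sketch reproducing this result, assuming one sample per day of a sine wave with a 10-day period:

import numpy as np

t = np.arange(100)                     # 100 days, one sample per day
signal = np.sin(2 * np.pi * 0.1 * t)   # period = 10 days, so f = 0.1 cycles/day

spectrum = np.fft.rfft(signal)               # FFT of a real-valued signal
freqs = np.fft.rfftfreq(len(signal), d=1.0)  # frequencies in cycles per day

dominant = freqs[np.argmax(np.abs(spectrum[1:])) + 1]  # skip the DC component
print(dominant, 1 / dominant)          # ~0.1 cycles/day, period ~10 days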

56 of 94

Feature Engineering: Fourier Transform Example (3)

57 of 94

Feature Engineering: Fourier Transform Example (4)

Three frequencies, at 0.05, 0.1, and 0.2

Relationship between frequency and period

T = 1/f

In our case:

f = 0.05 | T = 20 days

f = 0.1 | T = 10 days

f = 0.2 | T = 5 days

58 of 94

Now that we have a good understanding of Fourier Transform, let’s look at how we can use frequency analysis for feature engineering

59 of 94

Feature Engineering: Fourier Transform Usage

Identify and Extract Dominant Frequencies: Examine the amplitude spectrum to identify the dominant frequencies. Use the magnitudes of the dominant frequencies as features in a predictive model, as they can encapsulate recurring patterns in the data

Filter Out Noise or Unwanted Frequencies: Remove noise or less relevant frequencies (e.g., via low-pass or high-pass filtering). Reconstruct a denoised version of the signal using only the dominant frequencies. This cleaned signal can be transformed back to the time domain

Reconstruct Key Frequency Components as Separate Time Series: Decompose the time series based on identified frequencies by creating separate signals for each significant frequency and use as separate features
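A hedged sketch of the first two ideas (the function names, top_k, and the sampling interval d are illustrative assumptions):

import numpy as np

def fourier_features(signal, top_k=3, d=1.0):
    # Return the top_k dominant (frequency, amplitude) pairs as model features
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=d)
    amps = np.abs(spectrum) / len(signal)
    idx = np.argsort(amps[1:])[::-1][:top_k] + 1  # skip the DC component
    return list(zip(freqs[idx], amps[idx]))

def denoise(signal, keep=3):
    # Reconstruct the signal from its `keep` strongest frequency components
    spectrum = np.fft.rfft(signal)
    weak = np.argsort(np.abs(spectrum[1:]))[::-1][keep:] + 1
    spectrum[weak] = 0  # zero out the weaker frequencies (DC term kept)
    return np.fft.irfft(spectrum, n=len(signal))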

60 of 94

We now have a good feel for various time series specific feature engineering techniques

Let’s look at how we can use ARIMA

61 of 94

Recall that ARIMA has two components whose orders must be chosen:

  1. the AutoRegressive term (AR)
  2. the Moving Average term (MA)

The ACF and PACF plots are often used to determine their orders

62 of 94

ACF and PACF Plots

ACF

Definition: ACF measures the correlation between a time series and a lagged version of itself over different time lags

Interpretation:

ACF values range from -1 to 1

A positive ACF value indicates that as the time series value at time t increases, the value at time t+k also tends to increase (k is the lag)

A negative ACF value indicates an inverse relationship

Example: Measures if a warm day tends to be followed by another warm day after a set gap

PACF

Definition: PACF measures the correlation between a time series and a lagged version of itself, while controlling for the correlations at shorter lags

Interpretation:

PACF values also range from -1 to 1

It helps in understanding the direct relationship between a variable and its lags after removing the influence of intermediate lags

Example: Measures the direct effect of today's temperature on a future day, ignoring the days in between

63 of 94

ACF and PACF Plots (2)

ACF: overall persistence or "memory"

Reveals the overall time dependency in data, helping identify seasonal patterns and the persistence of effects over time

PACF: direct impact or "shock effect"

Isolates direct lag relationships, crucial for selecting the right order in autoregressive time series models

64 of 94

ACF and PACF Plots (3)

Component Type   ACF Pattern           PACF Pattern
AR(p)            Gradual decay         Cut-off after lag p
MA(q)            Cut-off after lag q   Gradual decay
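With statsmodels, the two plots can be drawn directly; a minimal sketch on a simulated AR(1) series:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
series = np.zeros(200)
for t in range(1, 200):
    series[t] = 0.8 * series[t - 1] + rng.normal()  # simulated AR(1) process

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=20, ax=axes[0])   # expect gradual decay (AR signature)
plot_pacf(series, lags=20, ax=axes[1])  # expect a cut-off after lag 1
plt.show()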

65 of 94

ACF and PACF Plots Example


66 of 94

ACF and PACF Plots Example (2)


Conclusion: AR (p=5)

67 of 94

ARIMA (p, d, q)

ARIMA(1, 0, 0): This model might be appropriate for a time series that shows no trend or seasonality. You would predict the current value based on the previous day's value

ARIMA(0, 1, 0): This model could be used for a time series with a clear upward trend. You would first difference the series to remove the trend and then model the differenced series with no lags

ARIMA(0, 2, 1): This model could be appropriate for a time series that shows a strong trend and needs to be differenced twice to stabilize the mean. An example could be the monthly sales of a product that exhibits a clear upward trend over time

ARIMA(1, 1, 0): This model is ideal for time series data that exhibit a trend but where only the most recent observation is relevant for making predictions. This can be useful for applications where the latest data point holds significant weight over earlier points, like in daily temperature readings or stock prices

ARIMA(2, 1, 2): A complex process like monthly electricity demand, where the data has a trend that needs to be removed, and the residual series shows both autoregressive and moving average behavior. Differencing once (d = 1) stabilizes the mean. The series then requires an AR component of order 2 (p = 2) and an MA component of order 2 (q = 2) to capture both longer memory and shock effects
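With statsmodels, fitting a chosen (p, d, q) and forecasting is brief; a sketch, assuming series is a pandas Series of observations and the order has been chosen from the ACF/PACF plots:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(series, order=(1, 1, 0))  # p=1, d=1, q=0, per the examples above
fitted = model.fit()
print(fitted.summary())                 # coefficient estimates and diagnostics
forecast = fitted.forecast(steps=24)    # forecast the next 24 periods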

68 of 94

The only thing that remains now is checking for stationarity

69 of 94

Augmented Dickey-Fuller (ADF) Test

Description: A statistical test used to determine if a time series is stationary, meaning its statistical properties do not change over time

Why: It’s widely used in time series analysis, especially when working with autoregressive models, to check if differencing (transformation to stationarity) is needed before applying these models


Calculate the p-value: the likelihood of obtaining the observed data under the null hypothesis (for the ADF test, the null hypothesis is that the series has a unit root, i.e., is non-stationary)

If p-value < 0.05: The series is likely stationary

If p-value >= 0.05: The series is likely non-stationary
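In statsmodels this is a single call; a sketch, assuming series holds the observations:

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(series.dropna())[:2]

if p_value < 0.05:
    print('Likely stationary (reject the unit-root null hypothesis)')
else:
    print('Likely non-stationary; consider differencing and re-testing')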

70 of 94

Let’s detect some univariate anomalies

Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/AnomalyDetection.ipynb

71 of 94

Shortcomings of ARIMA

Stationarity

Non-stationary series need to be transformed (e.g., through differencing) to achieve stationarity, which can complicate the modeling process and lead to loss of information

Univariate

This limits its applicability in scenarios where multiple time series influence each other, such as in economic data

Linearity

Nonlinear relationships can lead to poor forecasts and inaccuracies

Sensitivity to Outliers

This can result in misleading forecasts and poor performance, especially if the outliers are not handled appropriately

Seasonal Patterns

Applying ARIMA to seasonal data without proper adjustments can lead to inaccurate forecasts

72 of 94

Enter LSTM

Long Short-Term Memory (LSTM) networks are a specialized type of recurrent neural network (RNN) designed to effectively capture long-term dependencies in sequential data

They utilize memory cells and gating mechanisms to regulate the flow of information, allowing them to remember relevant data while discarding irrelevant details

73 of 94

To understand LSTMs we first need to understand Neural Networks

74 of 94

What are Neural Networks?

Neural networks are computational models inspired by the structure and function of the human brain

They consist of interconnected nodes, or neurons, organized in layers to form an architecture

Each neuron processes input through a weighted sum and an activation function, passing the output to the next layer
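In code, a single neuron is just a weighted sum passed through a nonlinearity; a minimal numpy sketch with made-up weights:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes the weighted sum into (0, 1)

x = np.array([0.5, -1.2, 3.0])  # inputs from the previous layer
w = np.array([0.4, 0.1, -0.6])  # learned weights
b = 0.2                         # learned bias

output = sigmoid(np.dot(w, x) + b)  # value passed on to the next layer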

75 of 94

What are Neural Network Architectures?

Neural network architecture refers to the structured arrangement of nodes (neurons) and the connections (edges) between them in a neural network

It defines how data flows through the network, including the

number of layers

types of layers (e.g., convolutional, recurrent, fully connected)

configuration of the neurons within those layers

76 of 94

Key Components of Neural Network Architecture

Input Layer

The first layer that receives the raw data (features)

Hidden Layers

Intermediate layers where computations and feature extraction occur. The number and size of these layers can vary significantly depending on the problem

Output Layer

The final layer that produces the output, such as classification or regression results

Activation Functions

Functions applied to neurons that introduce non-linearity, allowing the network to learn complex patterns (e.g., sigmoid, softmax)

Connections

How neurons are interconnected (e.g., fully connected, convolutional, recurrent connections)

77 of 94

Some Other Types of Layers

Dense Layer

Each neuron is connected to every neuron in the previous layer. Used in feedforward neural networks, it’s standard for tasks where features need to interact globally, like classification

Convolutional Layer

Applies convolutional filters to capture spatial hierarchies in data. It preserves spatial relationships and reduces the parameter count by focusing on local features

Recurrent Layer

Specifically designed for sequential data, like time series or text. Layers like LSTM (Long Short-Term Memory) are recurrent layers used to capture long-term dependencies in sequential data

Dropout Layer

Randomly sets a fraction of the input units to zero during training to prevent overfitting. It forces the network to rely less on any particular neuron and is commonly used after Dense layers

78 of 94

Why Does the Architecture Matter?

Performance

The architecture directly affects the model's ability to learn and generalize from data

A well-designed architecture can capture complex patterns and improve predictive accuracy, while a poorly designed one may underperform or overfit

Task Suitability

Different architectures are suited for different types of tasks. For instance:

Convolutional Neural Networks (CNNs) are ideal for image processing tasks

Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, are effective for sequential data, such as time series or natural language

Scalability and Efficiency

The architecture can influence the computational efficiency and scalability of the model

For instance, deeper networks might require more data and training time, while shallower networks may not be capable of capturing all the necessary features

79 of 94

Neural Network Architecture: FNN

Description: Simple architecture where data moves in one direction from input to output

Use Cases: Used for basic regression and classification tasks


Feedforward Neural Networks (FNN)

80 of 94

Neural Network Architecture: CNN

Description: Data moves in a forward direction from input to output, similar to FNNs. However, CNNs incorporate convolutional layers, pooling layers, and often batch normalization layers, which allow them to capture spatial hierarchies and local patterns

Use Cases: Primarily used for image classification, object detection, image segmentation, and other tasks involving visual data


Convolutional Neural Networks (CNN)

81 of 94

Neural Network Architecture: RNN

Description: Architecture with recurrent connections, allowing information to persist across time steps. RNNs can process sequences of varying lengths by maintaining a hidden state that captures information from previous inputs

Use Cases: Commonly used for time series prediction


Recurrent Neural Networks (RNN)

82 of 94

Neural Network Architecture: Autoencoders

Description: Neural networks designed to learn efficient representations of data by encoding the input into a compressed format and then decoding it back to reconstruct the original input. They consist of an encoder and a decoder

Use Cases: Employed for dimensionality reduction, denoising data, and generating new data points, as well as anomaly detection


Autoencoders

83 of 94

Neural Network Architecture: Transformers

Description: Architecture based on self-attention mechanisms that process sequences of data simultaneously, allowing for parallelization and improved handling of long-range dependencies

Use Cases: Predominantly used in natural language processing tasks like machine translation, text summarization, and conversational agents


Transformers

Already discussed in Lecture 5 under GPT

84 of 94

Neural Network Training

1. Model Initialization

Define the Architecture: Specify the type of neural network and its architecture, including the number of layers, types of layers (e.g., convolutional, recurrent), and activation functions

Initialize Weights: Randomly initialize the weights of the network, which will be adjusted during training

2. Forward Propagation

Input Data: Pass the training data through the network

Calculate Output: For each layer, compute the output using the weighted sum of inputs and activation functions

3. Loss Calculation

Define Loss Function: Select a loss function (e.g., mean squared error) that measures the difference between the predicted output and the actual target values

Compute Loss: Calculate the loss using the outputs from the forward propagation step

85 of 94

Neural Network Training (2)

4. Backward Propagation

Calculate Gradients: Use backpropagation to compute the gradients of the loss with respect to the weights in the network

Update Weights: Adjust the weights using an optimization algorithm (e.g., Stochastic Gradient Descent) based on the calculated gradients. This step aims to minimize the loss function

5. Iterate

Epochs: Repeat the forward and backward propagation steps for a number of epochs (complete passes through the training dataset)

Batch Processing: During training, the dataset may be divided into smaller batches, allowing for more frequent updates to the weights and faster convergence

6. Validation

Monitor Performance: After each epoch, evaluate the model's performance on the validation set to check for overfitting and to tune hyperparameters

Early Stopping: Optionally, stop training if the validation performance doesn't improve after a certain number of epochs
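These six steps map onto a few lines in a framework like Keras; a hedged sketch in which the layer sizes, hyperparameters, and the prepared arrays X_train and y_train are illustrative assumptions:

import tensorflow as tf

# 1. Model initialization: architecture plus (random) weight initialization
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# 3. Loss definition; the optimizer performs the weight updates of step 4
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 2, 4, 5, 6: forward/backward passes over epochs and batches, with validation
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X_train, y_train, epochs=100, batch_size=32,
          validation_split=0.2, callbacks=[early_stop])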

86 of 94

Visual analysis of Neural Networks: https://playground.tensorflow.org/

87 of 94

We can now return to LSTMs

88 of 94

Key Features of LSTMs

Memory Cells

LSTMs have memory cells that maintain information over time, allowing them to retain data for long periods. This is essential for capturing long-term dependencies in sequences, such as in time series data

Gates

LSTMs utilize three types of gates to regulate the flow of information:

Forget Gate: Decides what information to discard from the cell state

Input Gate: Determines what new information to store in the cell state

Output Gate: Controls what information from the cell state is sent to the output

Cell State

Carries information across time steps in the sequence

It is essentially a "conveyor belt" that runs through the entire LSTM unit

The cell state is responsible for maintaining long-term dependencies and is modified by various gates (input gate, forget gate, and output gate) at each time step to decide what information to keep or discard

89 of 94

The cell state carries the long-term memory across the network, and the gates (input, forget, and output gates) regulate the flow of information into, out of, and within the memory cells

90 of 94

LSTMs and Time Series Data

Time series data is inherently sequential, with temporal dependencies. LSTMs are designed to process sequences, making them suitable for tasks like predicting future values

When analyzing a time series, LSTMs can learn that specific values at one point in time influence future values

91 of 94

LSTMs and Time Series Data (2)

Unlike ARIMA, LSTMs do not require linearity or stationarity and can also handle multivariate data
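A minimal Keras sketch of an LSTM forecaster over multivariate sensor windows (the window length, layer size, and the prepared arrays X_windows and y_next are illustrative assumptions; the Colab on the next slide shows a full pipeline):

import tensorflow as tf

timesteps, n_features = 24, 3  # e.g., 24 hours of temperature, pressure, vibration

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),  # predict the next value of the target sensor
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_windows, y_next, epochs=20, batch_size=32)

# Windows whose prediction error is unusually large can then be flagged as anomalies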

92 of 94

We are now ready to detect multivariate anomalies:

Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/AnomalyDetection-Multivariate.ipynb

93 of 94

Summary

Time Series Data consists of sequential data points recorded over time, often used to track trends, patterns, and seasonal effects

Fourier Transform decomposes a time series into its frequency components, revealing periodicities and cycles

ARIMA (AutoRegressive Integrated Moving Average) is a statistical model for time series forecasting that combines differencing, autoregression, and moving averages to capture patterns and trends

LSTMs (Long Short-Term Memory Networks) are a type of recurrent neural network designed to learn and predict sequential data with long-term dependencies, commonly used in time series forecasting

94 of 94

Recommended Reading

Chapters 5 to 8 of “Modern Time Series Forecasting with Python: Explore industry-ready time series forecasting using modern machine learning and deep learning” by Manu Joseph

Chapter 13 of “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman

Tensorflow guide: https://www.tensorflow.org/resources/learn-ml