Applied
Big Data Analytics
Zubair Nabi
Lecture 9
Forecasting Industrial Machine Failures
Our Big Data Analytics Stack
Our Big Data Analytics Stack: Today’s Lecture
Use Case for Today
Forecasting Industrial Machine Failures
What is an Industrial Machine?
Industrial machines, equipment, and assets encompass a wide range of physical tools and systems used in manufacturing, production, and processing industries
This includes machinery such as conveyor belts, robotic arms, CNC machines, and assembly lines, as well as supporting equipment like pumps, compressors, and safety systems
These assets play a crucial role in enhancing productivity, efficiency, and quality in industrial operations
Regular Maintenance and Lifecycle Management
Effective management of industrial equipment is essential for ensuring:
Reliability and Performance
Regular maintenance ensures that machines and equipment operate reliably and efficiently
It helps prevent unexpected breakdowns, thereby maintaining optimal performance and minimizing disruptions in production
Asset Longevity
Lifecycle management focuses on maximizing the useful life of equipment
By implementing scheduled maintenance and monitoring performance, organizations can extend the lifespan of their assets, ensuring a better return on investment over time
Cost Savings
Proactive maintenance reduces the likelihood of costly repairs and replacements
By addressing potential issues early, companies can avoid major failures that lead to expensive downtime and lost productivity
Main Types of Maintenance
The four main types of maintenance are:
Corrective Maintenance
Reactive maintenance that addresses equipment failures after they happen. This ensures quick restoration of operations, although it can lead to increased downtime and repair costs
Condition-Based Maintenance
Maintenance performed based on the actual condition of equipment, determined through real-time monitoring. This approach helps optimize maintenance schedules and resource allocation
Preventive Maintenance
Scheduled maintenance activities aimed at preventing equipment failures. This includes inspections, cleaning, and part replacements, helping to minimize downtime and extend asset life
Predictive Maintenance
Utilizes data and analytics to anticipate equipment failures before they occur. This approach reduces unexpected breakdowns by allowing maintenance to be performed based on actual equipment condition
Industrial Use Cases
Failure Pattern Detection
Learn from historical failure data to recognize patterns and trends that lead to equipment breakdown. Helps in root cause analysis and prevention of similar issues
Example: Identifying consistent patterns (like pressure spikes) that precede valve failures
Anomaly Detection
Identify unusual patterns in sensor data (e.g., temperature, vibration, or pressure) that indicate potential equipment issues or deviations from normal operating conditions. Anomalies can signal early signs of failure or wear
Example: Detecting sudden increases in motor vibration that could signal bearing failure
Predictive Maintenance
Forecast equipment failure based on historical data, allowing for timely maintenance before breakdowns occur. This reduces downtime and maintenance costs
Example: Predicting when a turbine might need servicing based on temperature and vibration trends over time
Remaining Useful Life (RUL) Estimation
Estimate how much longer a piece of equipment can operate effectively before failure, using degradation patterns from time series data
Example: Predicting the remaining lifespan of a pump based on wear-related factors such as pressure, flow rates, and runtime
Key Industrial Sensors
Measure temperature in processes (e.g., thermocouples)
Temperature
Monitor pressure levels in tanks and pipes (e.g., piezoelectric sensors)
Pressure
Measure the flow rate of liquids or gases (e.g., electromagnetic flow meters)
Flow
Detect the level of liquids or solids in containers (e.g., capacitive)
Level
Monitor equipment vibrations to detect imbalances or mechanical issues (e.g., accelerometers)
Vibration
The number of sensors in a plant might vary from a few hundred to millions in a complex operation such as gas production or nuclear power generation
In what format should we be collecting and storing sensor data?
Enter Time Series Data
Time series data refers to a sequence of data points collected or recorded at successive points in time, often at uniform intervals
In the context of sensors, time series data is crucial for monitoring, analyzing, and interpreting various physical phenomena in real-time industrial applications
Characteristics of Time Series Data
Temporal Order
Each data point is associated with a specific timestamp, indicating the precise moment the observation was made. This ordering is essential for understanding how values evolve over time
Continuous or Discrete
Time series data can be continuous (measured at every moment, like temperature) or discrete (measured at specific intervals, like daily sales)
Trends
Time series data may exhibit long-term movements or trends, showing consistent increases or decreases over time
Seasonality
Regular patterns or cycles in the data that occur at predictable intervals, such as monthly sales peaks or annual temperature changes
Characteristics of Time Series Data (2)
Stationarity
Many time series analyses assume stationarity, meaning that statistical properties (like mean and variance) do not change over time. Non-stationary data may require transformation to achieve stationarity
Noise
Time series data typically contains random variations or noise, which can obscure underlying patterns
Autocorrelation
Time series data often shows correlation between observations at different time points, meaning past values can influence future values
Lagged Values
The data may include lagged values (previous observations) to capture relationships over time
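As a quick illustration of autocorrelation and lagged values, here is a minimal sketch using pandas with a small, made-up series:
import pandas as pd
# A small, made-up series with an upward drift
s = pd.Series([10, 12, 15, 14, 18, 21, 20, 24])
# Lag-1 autocorrelation: correlation between the series and itself shifted by one step
print(s.autocorr(lag=1))
# The same idea written out explicitly with a lagged copy of the series
print(s.corr(s.shift(1)))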
A Quick Note on Time Stamped Data
A dataset where each individual record includes a timestamp indicating when it was created or recorded
The timestamp provides context for the data but does not imply any specific sequence, relationship, or regular interval among records
Therefore, not all time-stamped data qualifies as time series data
A Quick Note on Time Stamped Data (2)
Traditional machine learning algorithms (like decision trees, etc.) can be applied to timestamped data by treating the timestamps as features. This allows models to learn from various attributes of the data without necessarily considering the temporal aspect
However, time series data requires a different approach, because each data point is influenced by the points that precede it
The focus of this Lecture is time series analytics
Storage of time series is something that we have already discussed in Lecture 2
Transport/processing of time series/streaming data is going to be discussed in Lecture 11
Representative Dataset
Timestamp | Machine ID | Temperature (°C) | Pressure (bar) | Vibration (mm/s) | Runtime (hours) | Status (Failure/Normal) |
2024-10-01 0:00:00 | M001 | 70 | 3 | 0.5 | 100 | Normal |
2024-10-01 1:00:00 | M001 | 70 | 3 | 0.5 | 101 | Normal |
2024-10-01 2:00:00 | M001 | 71 | 3 | 0.5 | 102 | Normal |
2024-10-01 3:00:00 | M001 | 72 | 3.2 | 0.6 | 103 | Normal |
2024-10-01 4:00:00 | M001 | 73 | 3.3 | 0.6 | 104 | Normal |
2024-10-01 5:00:00 | M001 | 74 | 3.5 | 0.7 | 105 | Normal |
2024-10-01 6:00:00 | M001 | 75 | 4 | 0.9 | 106 | Normal |
2024-10-01 7:00:00 | M001 | 80 | 3.7 | 1.2 | 107 | Normal |
2024-10-01 8:00:00 | M001 | 95 | 6 | 2.5 | 108 | Normal |
2024-10-01 9:00:00 | M001 | 100 | 7 | 3 | 109 | Normal |
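As a minimal sketch of turning data laid out like this into a pandas time series, assuming a hypothetical CSV file named machine_readings.csv with the columns shown above:
import pandas as pd
# Parse the Timestamp column as datetimes and use it as the index,
# so each machine's readings form a proper time series
df = pd.read_csv('machine_readings.csv', parse_dates=['Timestamp'], index_col='Timestamp')
# One machine's temperature readings, in time order
m001_temp = df[df['Machine ID'] == 'M001']['Temperature (°C)'].sort_index()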
Industrial Data Format
We have assumed a flat, simple structure, but...
In industrial settings, data collected from sensors and monitoring systems is often organized in a VQT format, which stands for Value, Quality, and Timestamp
This format is essential for effectively managing time series data, particularly in scenarios where precision and reliability are crucial
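A minimal sketch of what VQT-style records might look like in pandas; the 'Good'/'Bad' quality labels here are illustrative placeholders, not a specific vendor's quality codes:
import pandas as pd
# Each reading carries a Value, a Quality flag, and a Timestamp
vqt = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-10-01 00:00', '2024-10-01 01:00', '2024-10-01 02:00']),
    'value': [70.0, 70.5, None],          # the sensor reading itself
    'quality': ['Good', 'Good', 'Bad'],   # e.g., a dropped reading flagged as Bad
})
# Keep only trustworthy readings before analysis
clean = vqt[vqt['quality'] == 'Good']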
Feature engineering techniques for time series can be divided into two categories: general-purpose features that give standard machine learning models temporal context, and transformations specific to time series models such as ARIMA
Let’s go through the first category
Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/FeatureEngineering-Timeseries.ipynb
Feature Engineering: Lag Features
Description: Create features that represent previous time steps (lags) of the target variable
When: Use lag features when the current observation is influenced by past values of the target variable
Why: Lag features help the model capture temporal dependencies in the data
Model Type: For standard models (e.g., decision trees). These models don't natively handle sequential data, so introducing lag features provides them with necessary temporal context
import pandas as pd
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['lag_1'] = df['value'].shift(1)
df['lag_2'] = df['value'].shift(2)
value | lag_1 | lag_2 |
10 | NaN | NaN |
20 | 10 | NaN |
30 | 20 | 10 |
40 | 30 | 20 |
50 | 40 | 30 |
Feature Engineering: Rolling Statistics
Description: Calculate rolling statistics (mean, median, standard deviation) over a specified window of time
When: Use rolling statistics when you need to smooth short-term fluctuations or when there’s seasonality in the data
Why: Helps reduce noise and reveal trends, improving model performance by incorporating moving averages
Model Type: Standard models benefit from rolling statistics since they don’t handle temporal smoothness
import pandas as pd
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['rolling_mean'] = df['value'].rolling(window=3).mean()
value | rolling_mean |
10 | NaN |
20 | NaN |
30 | 20 |
40 | 30 |
50 | 40 |
Feature Engineering: Exponentially Weighted Moving Average (EWMA)
Description: Calculate an exponentially weighted moving average to give more weight to recent observations
When: Use EWMA when you want to prioritize recent values more heavily than older values
Why: EWMA smooths out the time series but allows the model to be more responsive to recent changes
Model Type: Standard models can benefit from EWMA to smooth data and emphasize recent trends
import pandas as pd
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['ewma'] = df['value'].ewm(span=3, adjust=False).mean()
value | ewma |
10 | 10 |
20 | 15 |
30 | 22.5 |
40 | 31.25 |
50 | 40.625 |
Feature Engineering: Exponentially Weighted Moving Average (EWMA) (2)
With adjust=False, pandas applies the recursion EWMA_t = (1 − α) · EWMA_{t−1} + α · x_t
Where:
x_t is the current observation
EWMA_{t−1} is the previous EWMA value
α = 2 / (span + 1) is the smoothing factor (span=3 gives α = 0.5, which produces the values in the table above)
Feature Engineering: Date/Time Features
Description: Extract features from timestamps (e.g., day of the week, hour of the day)
When: Use date/time features when you expect periodic patterns or seasonality (e.g., weekday vs weekend, holiday effects)
Why: Capturing cyclical patterns is critical for improving model accuracy in data with time-based trends
Model Type: Date-based features are especially useful for models like decision trees and random forests that don’t natively consider time
import pandas as pd
date_rng = pd.date_range(start='2023-01-01', periods=4, freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = [10, 20, 30, 40]
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
date | value | day_of_week | month |
2023-01-01 0:00:00 | 10 | 6 | 1 |
2023-01-02 0:00:00 | 20 | 0 | 1 |
2023-01-03 0:00:00 | 30 | 1 | 1 |
2023-01-04 0:00:00 | 40 | 2 | 1 |
Feature Engineering: Time-Based Aggregation
Description: Aggregate data over time windows (e.g., sum or mean of hourly/daily data)
When: Use aggregation when you need to reduce the granularity of your data or when working with seasonality patterns
Why: Aggregation reduces noise and emphasizes long-term trends, especially useful when there are multiple readings within a time period
Model Type: Time-based aggregation can simplify time series for standard models like decision trees
import pandas as pd
data = {'date': pd.date_range(start='2023-01-01', periods=10, freq='H'), 'value': range(10)}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
daily_agg = df.resample('D').sum()
date | value |
2023-01-01 0:00:00 | 45 |
Feature Engineering: SMOTE
Description: SMOTE (Synthetic Minority Over-sampling Technique) increases the number of instances in the minority class by generating synthetic samples rather than simply duplicating existing samples
When: You have a dataset with an imbalanced target variable (e.g., very few 'Failure' cases compared to 'Normal' cases)
Why: Improves model performance by providing more training examples for the minority class, allowing the model to learn more about this class's characteristics
Model Type: Helps standard machine learning techniques to deal with class imbalance
import pandas as pd
from imblearn.over_sampling import SMOTE
data = {
'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'Status': [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)
X = df[['value']]
y = df['Status']
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(X, y)
df_resampled = pd.DataFrame(X_resampled, columns=['value'])
df_resampled['Status'] = y_resampled
How SMOTE Works
1. Choose a Minority Class Sample: Select a random sample from the minority class
2. Find K-Nearest Neighbors: For each selected sample, identify its k-nearest neighbors within the minority class (commonly set to 5). These neighbors represent points that are close to the selected sample in the feature space
3. Generate Synthetic Samples: Create a new synthetic sample by interpolating between the selected minority class sample and one of its randomly chosen k-nearest neighbors
4. Repeat: Repeat this process until the minority class has enough synthetic samples to balance out the majority class
Properties of SMOTE
Prevents Overfitting
By generating synthetic samples rather than duplicating existing ones, SMOTE reduces the risk of overfitting
Minority Class Variability
New samples based on nearby samples in the feature space help the model generalize better for the minority class
Overlap Between Classes
Not always effective for datasets with high overlap between classes
Now that we have mastered the art of feature engineering for time series data, let’s use standard machine learning to predict machine failure
Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/MachineFailure-MultivariateStandardML.ipynb
Detecting machine failure is a binary classification problem
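The Colab notebook above walks through this end to end; as a rough sketch of the idea, and assuming a DataFrame df shaped like the representative dataset (simplified column names, readings from a single machine in time order, and Status encoded as 0 = Normal, 1 = Failure), a standard classifier on engineered features might look like this:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Assumed columns: Temperature, Pressure, Vibration, Runtime, Status (0/1)
features = ['Temperature', 'Pressure', 'Vibration', 'Runtime']
df['temp_lag_1'] = df['Temperature'].shift(1)   # previous hour's temperature as a lag feature
df = df.dropna()
X = df[features + ['temp_lag_1']]
y = df['Status']
# Keep chronological order: train on the past, test on the most recent data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))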
Now let’s focus on a different problem: detecting an anomaly at a particular point in time, for which standard machine learning models will not work
Enter ARIMA
ARIMA stands for AutoRegressive Integrated Moving Average. It combines three key concepts:
AR (AutoRegression): Uses past values to predict future values
I (Integrated): Differencing the data to make it stationary, which helps remove trends
MA (Moving Average): Uses past forecast errors to improve accuracy
Key Parameters of ARIMA
ARIMA is defined by three parameters:
Number of lag observations in the autoregressive model
Autoregressive Order (p)
Number of times data needs differencing to make it stationary
Differencing Order (d)
Size of the moving average window, or number of lagged forecast errors
Moving Average Order (q)
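As a minimal sketch of fitting an ARIMA(p, d, q) model with statsmodels; the series and the chosen order here are illustrative:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# A small, made-up hourly temperature series (same values as the representative dataset)
series = pd.Series([70, 70, 71, 72, 73, 74, 75, 80, 95, 100],
                   index=pd.date_range('2024-10-01', periods=10, freq='H'))
# order=(p, d, q): 2 AR lags, 1 round of differencing, 1 MA term
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()
# Forecast the next 3 hours
print(fitted.forecast(steps=3))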
Example of ARIMA Parameters
Let’s say we have temperature readings from a machine every hour
The objective is to monitor these readings for anomalies, such as unexpected spikes or drops in temperature that may indicate equipment failure
Example of ARIMA Parameters (2)
If p=2, the model will use the temperature readings from the previous two hours to predict the current hour's reading
This captures the relationship between the current temperature and its past values
In anomaly detection, if an anomaly is detected (e.g., a sudden spike), examining the past values helps in understanding if it’s part of a trend or an outlier
Autoregressive Order (p)
If the temperature data shows a clear trend (e.g., gradually increasing temperature over weeks), we might set d=1 to subtract the previous hour's temperature from the current one
This differencing helps stabilize the mean, making the data more suitable for modeling
For anomaly detection, stationarity is crucial because it allows us to detect deviations from normal behavior accurately
Differencing Order (d)
If q=1, the model uses the error from the previous hour’s prediction to adjust its current prediction
This allows the model to learn from its previous mistakes
In anomaly detection, if the model's predictions consistently deviate from actual readings (large errors), it might indicate an underlying anomaly in the temperature readings
Moving Average Order (q)
Key Assumptions of ARIMA
The time series should be stationary, meaning its statistical properties (mean, variance, autocorrelation) do not change over time
Stationarity
The time series consists of observations on a single variable over time
Univariate
The relationship between the observations and the predicted values is linear. ARIMA relies on linear combinations of past observations and errors
Linearity
The residuals (forecast errors) from the model should be uncorrelated with one another
Independence of Errors
The residuals are often assumed to be normally distributed, especially for statistical inference
Normality of Residuals
The variance of the errors should be constant over time
Homoscedasticity
You might have noticed that some of these assumptions are shared between Linear Regression and ARIMA
This is because both ARIMA and linear regression are built on the framework of statistical modeling, which involves making assumptions about the data-generating process to allow for effective estimation and inference
In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable
The term auto-regression indicates that it is a regression of the variable against itself
This technique is similar to a linear regression model in how it uses past values as inputs to the regression
Let’s go through the second category of time series feature engineering
Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/FeatureEngineering-Timeseries-Specific.ipynb
Feature Engineering: Differencing
Description: Subtract the previous observation from the current observation to stabilize the mean of the series
When: Use differencing when the time series is non-stationary (has a trend or seasonality)
Why: Differencing helps to remove trends and make the series stationary, which is required by many models
Model Type: ARIMA requires stationarity, and differencing is built into the model's design
import pandas as pd
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['value_diff'] = df['value'].diff()
value | value_diff |
10 | NaN |
20 | 10 |
30 | 10 |
40 | 10 |
50 | 10 |
Feature Engineering: Fourier Transform
Description: Use Fourier Transforms to capture periodic patterns
When: Use when the data has periodic, trend, or seasonal components
Why: Helps in isolating different components of the series to better model or analyze each
Model Type: ARIMA relies heavily on trends and seasonality
Fourier Transform
The Fourier Transform is a mathematical technique that transforms a time-domain signal into its constituent frequencies
It decomposes signals into sinusoidal components, allowing us to analyze their frequency content
This transformation is fundamental in various fields, including signal processing, communications, and time series analysis
Some Example Use Cases of Fourier Transform
Communication
Fourier transforms break a complex signal into simple sine waves. In communication systems, this helps to split a signal into many smaller parts so data can be sent over several channels at once. It makes the process of sending and receiving information faster and more reliable
Signal Processing
When you have a messy signal (like a radio signal with static), Fourier transforms help by breaking the signal into its basic parts. This makes it easier to design filters that remove unwanted noise, leaving a cleaner and clearer signal
Medical Imaging
In MRI machines, the data collected is in the form of complex signals. Fourier transforms convert this raw data into clear images of the inside of the body. This helps doctors see organs and tissues accurately
Audio Processing
Audio signals can be split into different frequencies using Fourier transforms. This allows engineers to isolate and adjust parts of the sound—such as boosting bass or reducing background noise—to improve the overall quality of music or speech
Time vs Frequency Domain
The representation of a signal as it varies over time. For example, an audio signal is typically represented as amplitude versus time
Focus on how things evolve over time—great for observing sequences and transitions
Time Domain
The representation of a signal in terms of its frequency components. It reveals how much of each frequency is present in the original signal
Break down the data into its fundamental parts, revealing the "recipe" of patterns or cycles present, independent of when they occur
Frequency Domain
Time vs Frequency Domain (2)
Imagine an orchestra playing a piece of music
You can hear changes in volume, rhythm, and intensity as time progresses, but it’s difficult to separate out individual instruments or underlying patterns precisely
Time Domain
You’re examining what pitches and patterns are present in the music—whether it has a strong bass line, a repeating melody, or high-pitched harmonies
Frequency Domain
The Mathematical Principle behind Fourier Transform
Any periodic function can be represented as a sum of sine and cosine functions (or complex exponentials)
These sinusoidal functions are characterized by their frequency, amplitude, and phase
Amplitude vs Frequency vs Phase
Fourier Transform Formulae
Continuous Fourier Transform
F(f) = ∫_{−∞}^{∞} f(t) e^{−2πift} dt
Where:
F(f) is the output frequency spectrum
f(t) is the original time-domain signal
e^{−2πift} represents the complex exponential form of sine and cosine functions
Discrete Fourier Transform
X(k) = Σ_{n=0}^{N−1} x(n) e^{−2πikn/N}
Where:
X(k) is the output frequency spectrum
x(n) is the input sequence
N is the total number of samples
Continuous vs Discrete Signals
Imagine you have an analog audio signal, like the sound produced by a guitar string vibrating in the air. This signal exists continuously over time, and you can use the Continuous Fourier Transform (CFT) to break it down into its frequency components (e.g., the fundamental frequency and overtones) by integrating over the entire signal
Continuous Fourier Transform
Consider a digital recording of that guitar sound, where the audio is captured at fixed intervals (for instance, 44,100 samples per second). The signal is now a series of numbers, not a continuous waveform. You would use the Discrete Fourier Transform (DFT), typically computed via the Fast Fourier Transform (FFT), to convert these samples into a discrete set of frequency components that represent the original sound
Discrete Fourier Transform
Good way to understand guitar harmonics: https://www.youtube.com/shorts/1hRjMVdTVgE
The Fast Fourier Transform (FFT) is an efficient O(N log N), divide-and-conquer implementation of the DFT, which naively takes O(N²)
FFT is one of the most important algorithms in the world
Without the FFT, many modern technologies, such as digital audio, medical imaging (MRI, CT), and telecommunications, would either be impractical or far less efficient
Feature Engineering: Fourier Transform Example
Feature Engineering: Fourier Transform Example (2)
The spike at 0.1 in the amplitude spectrum indicates that the dominant periodic behavior in the time series corresponds to a sinusoidal wave that completes one cycle every 10 days
Relationship between frequency and period
T = 1/f
In our case:
f = 0.1
T = 1/0.1 = 10 days
Feature Engineering: Fourier Transform Example (3)
Feature Engineering: Fourier Transform Example (4)
3 frequencies at
0.05
0.1
0.2
Relationship between frequency and period
T = 1/f
In our case:
f = 0.05 | T = 20 days
f = 0.1 | T = 10 days
f = 0.2 | T = 5 days
Now that we have a good understanding of Fourier Transform, let’s look at how we can use frequency analysis for feature engineering
Feature Engineering: Fourier Transform Usage
Identify and Extract Dominant Frequencies: Examine the amplitude spectrum to identify the dominant frequencies. Use the magnitudes of the dominant frequencies as features in a predictive model, as they can encapsulate recurring patterns in the data
Filter Out Noise or Unwanted Frequencies: Remove noise or less relevant frequencies (known as low-pass or high-pass filtering). Reconstruct a denoised version of the signal using only the dominant frequencies. This cleaned signal can be transformed back to the time domain
Reconstruct Key Frequency Components as Separate Time Series: Decompose the time series based on identified frequencies by creating separate signals for each significant frequency and use as separate features
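A minimal sketch of these three steps with NumPy's FFT, using a made-up daily signal with a 10-day cycle:
import numpy as np
# Made-up daily signal: a 10-day cycle plus noise
n = 200
t = np.arange(n)
signal = np.sin(2 * np.pi * t / 10) + 0.3 * np.random.randn(n)
# 1. Identify dominant frequencies from the amplitude spectrum
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n, d=1)            # frequencies in cycles per day
amplitude = np.abs(spectrum)
top = np.argsort(amplitude)[-3:]           # indices of the 3 largest amplitudes
dominant_freqs = freqs[top]                # candidate features: dominant frequencies
# 2. Filter out everything except the dominant components and
# 3. reconstruct a denoised time-domain signal that can itself be used as a feature
mask = np.zeros_like(spectrum)
mask[top] = spectrum[top]
denoised = np.fft.irfft(mask, n=n)
print(dominant_freqs)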
We now have a good feel for various time series specific feature engineering techniques
Let’s look at how we can use ARIMA
Recall that ARIMA has two lag-based components: the autoregressive order (p) and the moving average order (q)
The ACF and PACF plots are often used to determine their values
ACF and PACF Plots
Definition: ACF measures the correlation between a time series and a lagged version of itself over different time lags
Interpretation:
ACF values range from -1 to 1
A positive ACF value indicates that as the time series value at time t increases, the value at time t+k also tends to increase (k is the lag)
A negative ACF value indicates an inverse relationship
Example: Measures if a warm day tends to be followed by another warm day after a set gap
ACF
Definition: PACF measures the correlation between a time series and a lagged version of itself, while controlling for the correlations at shorter lags
Interpretation:
PACF values also range from -1 to 1.
It helps in understanding the direct relationship between a variable and its lags after removing the influence of intermediate lags
Example: Measures the direct effect of today's temperature on a future day, ignoring the days in between
PACF
ACF and PACF Plots (2)
Reveals the overall time dependency in data, helping identify seasonal patterns and the persistence of effects over time
Overall persistence or "memory"
ACF
Isolates direct lag relationships, crucial for selecting the right order in autoregressive time series models
Direct impact or "shock effect"
PACF
ACF and PACF Plots (3)
Component Type | ACF Pattern | PACF Pattern |
AR(p) | Gradual decay | Cut-off after lag p |
MA(q) | Cut-off after lag q | Gradual decay |
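A minimal sketch of producing these two plots with statsmodels (the series here is illustrative; in practice you would use the, possibly differenced, sensor series):
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
series = pd.Series([70, 70, 71, 72, 73, 74, 75, 80, 95, 100,
                    98, 97, 96, 95, 94, 93, 92, 91, 90, 89])
plot_acf(series, lags=8)    # gradual decay suggests an AR component
plot_pacf(series, lags=8)   # a cut-off after lag p suggests AR(p)
plt.show()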
ACF and PACF Plots Example
Component Type | ACF Pattern | PACF Pattern |
AR(p) | Gradual decay | Cut-off after lag p |
MA(q) | Cut-off after lag q | Gradual decay |
ACF and PACF Plots Example (2)
Component Type | ACF Pattern | PACF Pattern |
AR(p) | Gradual decay | Cut-off after lag p |
MA(q) | Cut-off after lag q | Gradual decay |
AR (p=5)
ARIMA (p, d, q)
p | d | q | Description |
1 | 0 | 0 | This model might be appropriate for a time series that shows no trend or seasonality. You would predict the current value based on the previous day's value |
0 | 1 | 0 | This model could be used for a time series with a clear upward trend. You would first difference the series to remove the trend and then model the differenced series with no lags |
0 | 2 | 1 | This model could be appropriate for a time series that shows a strong trend and needs to be differenced twice to stabilize the mean. An example could be the monthly sales of a product that exhibits a clear upward trend over time |
1 | 1 | 0 | This model is ideal for time series data that exhibit a trend but where only the most recent observation is relevant for making predictions. This can be useful for applications where the latest data point holds significant weight over earlier points, like in daily temperature readings or stock prices |
2 | 1 | 2 | A complex process like monthly electricity demand, where the data has a trend that needs to be removed, and the residual series shows both autoregressive and moving average behavior. Differencing once (d = 1) stabilizes the mean. The series then requires an AR component of order 2 (p = 2) and an MA component of order 2 (q = 2) to capture both longer memory and shock effects |
The only thing that remains now is checking for stationarity
Augmented Dickey-Fuller (ADF) Test
Description: A statistical test used to determine if a time series is stationary, meaning its statistical properties do not change over time
Why: It’s widely used in time series analysis, especially when working with autoregressive models, to check if differencing (transformation to stationarity) is needed before applying these models
value | value_diff |
10 | NaN |
20 | 10 |
30 | 10 |
40 | 10 |
50 | 10 |
Calculate the p-value: the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true (for the ADF test, the null hypothesis is that the series has a unit root, i.e., is non-stationary)
If p-value < 0.05: reject the null hypothesis; the series is likely stationary
If p-value >= 0.05: fail to reject the null hypothesis; the series is likely non-stationary
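A minimal sketch of running the ADF test with statsmodels, on a made-up trending series before and after differencing:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
# Made-up series with a clear upward trend plus a little noise
rng = np.random.default_rng(42)
series = pd.Series(np.arange(50) * 2.0 + rng.normal(0, 1, 50))
p_value = adfuller(series)[1]                          # test the raw series
p_value_diff = adfuller(series.diff().dropna())[1]     # test after one round of differencing
# Typically p >= 0.05 before differencing (non-stationary) and p < 0.05 after
print(p_value, p_value_diff)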
Let’s detect some univariate anomalies
Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/AnomalyDetection.ipynb
Shortcomings of ARIMA
Non-stationary series need to be transformed (e.g., through differencing) to achieve stationarity, which can complicate the modeling process and lead to loss of information
Stationarity
This limits its applicability in scenarios where multiple time series influence each other, such as in economic data
Univariate
Nonlinear relationships can lead to poor forecasts and inaccuracies
Linearity
This can result in misleading forecasts and poor performance, especially if the outliers are not handled appropriately
Sensitivity to Outliers
Applying ARIMA to seasonal data without proper adjustments can lead to inaccurate forecasts
Seasonal Patterns
Enter LSTM
Long Short-Term Memory (LSTM) networks are a specialized type of recurrent neural network (RNN) designed to effectively capture long-term dependencies in sequential data
They utilize memory cells and gating mechanisms to regulate the flow of information, allowing them to remember relevant data while discarding irrelevant details
To understand LSTMs we first need to understand Neural Networks
What are Neural Networks?
Neural networks are computational models inspired by the structure and function of the human brain
They consist of interconnected nodes, or neurons, organized in layers to form an architecture
Each neuron processes input through a weighted sum and an activation function, passing the output to the next layer
What are Neural Network Architectures?
Neural network architecture refers to the structured arrangement of nodes (neurons) and the connections (edges) between them in a neural network
It defines how data flows through the network, including the
number of layers
types of layers (e.g., convolutional, recurrent, fully connected)
configuration of the neurons within those layers
Key Components of Neural Network Architecture
The first layer that receives the raw data (features)
Input Layer
Intermediate layers where computations and feature extraction occur. The number and size of these layers can vary significantly depending on the problem
Hidden Layers
The final layer that produces the output, such as classification or regression results
Output Layer
Functions applied to neurons that introduce non-linearity, allowing the network to learn complex patterns (e.g., sigmoid, softmax)
Activation Functions
How neurons are interconnected (e.g., fully connected, convolutional, recurrent connections)
Connections
Some Other Types of Layers
Each neuron is connected to every neuron in the previous layer. Used in feedforward neural networks, it’s standard for tasks where features need to interact globally, like classification
Dense Layer
Applies convolutional filters to capture spatial hierarchies in data. It preserves spatial relationships and reduces the parameter count by focusing on local features
Convolutional Layer
Specifically designed for sequential data, like time series or text. Layers like LSTM (Long Short-Term Memory) are recurrent layers used to capture long-term dependencies in sequential data
Recurrent Layer
Randomly sets a fraction of the input units to zero during training to prevent overfitting. It forces the network to rely less on any particular neuron and is commonly used after Dense layers
Dropout Layer
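A minimal sketch, assuming TensorFlow/Keras, of stacking some of these layer types into a small feedforward classifier:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
# Input layer -> two Dense hidden layers with Dropout -> sigmoid output for binary classification
model = Sequential([
    Input(shape=(5,)),                 # 5 input features (e.g., sensor readings)
    Dense(32, activation='relu'),
    Dropout(0.2),                      # randomly zero 20% of units during training
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),    # output layer: probability of failure
])
model.summary()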
Why Does the Architecture Matter?
The architecture directly affects the model's ability to learn and generalize from data
A well-designed architecture can capture complex patterns and improve predictive accuracy, while a poorly designed one may underperform or overfit
Performance
Different architectures are suited for different types of tasks:
For instance,
Convolutional Neural Networks (CNNs) are ideal for image processing tasks
Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, are effective for sequential data, such as time series or natural language
Task Suitability
The architecture can influence the computational efficiency and scalability of the model
For instance, deeper networks might require more data and training time, while shallower networks may not be capable of capturing all the necessary features
Scalability and Efficiency
Neural Network Architecture: FNN
Description: Simple architecture where data moves in one direction from input to output
Use Cases: Used for basic regression and classification tasks
Feedforward Neural Networks (FNN)
Neural Network Architecture: CNN
Description: Data moves in a forward direction from input to output, similar to FNNs. However, CNNs incorporate convolutional layers, pooling layers, and often batch normalization layers, which allow them to capture spatial hierarchies and local patterns
Use Cases: Primarily used for image classification, object detection, image segmentation, and other tasks involving visual data
Convolutional Neural Networks (CNN)
Neural Network Architecture: RNN
Description: Architecture with recurrent connections, allowing information to persist across time steps. RNNs can process sequences of varying lengths by maintaining a hidden state that captures information from previous inputs
Use Cases: Commonly used for time series prediction
Recurrent Neural Networks (RNN)
Neural Network Architecture: Autoencoders
Description: Neural networks designed to learn efficient representations of data by encoding the input into a compressed format and then decoding it back to reconstruct the original input. They consist of an encoder and a decoder
Use Cases: Employed for dimensionality reduction, denoising data, and generating new data points, as well as anomaly detection
Autoencoders
Neural Network Architecture: Transformers
Description: Architecture based on self-attention mechanisms that process sequences of data simultaneously, allowing for parallelization and improved handling of long-range dependencies
Use Cases: Predominantly used in natural language processing tasks like machine translation, text summarization, and conversational agents
Transformers
Already discussed in Lecture 5 under GPT
Neural Network Training
1. Model Initialization
Define the Architecture: Specify the type of neural network and its architecture, including the number of layers, types of layers (e.g., convolutional, recurrent), and activation functions
Initialize Weights: Randomly initialize the weights of the network, which will be adjusted during training
2. Forward Propagation
Input Data: Pass the training data through the network
Calculate Output: For each layer, compute the output using the weighted sum of inputs and activation functions
3. Loss Calculation
Define Loss Function: Select a loss function (e.g., mean squared error) that measures the difference between the predicted output and the actual target values
Compute Loss: Calculate the loss using the outputs from the forward propagation step
Neural Network Training (2)
4. Backward Propagation
Calculate Gradients: Use backpropagation to compute the gradients of the loss with respect to the weights in the network
Update Weights: Adjust the weights using an optimization algorithm (e.g., Stochastic Gradient Descent) based on the calculated gradients. This step aims to minimize the loss function
5. Iterate
Epochs: Repeat the forward and backward propagation steps for a number of epochs (complete passes through the training dataset)
Batch Processing: During training, the dataset may be divided into smaller batches, allowing for more frequent updates to the weights and faster convergence
6. Validation
Monitor Performance: After each epoch, evaluate the model's performance on the validation set to check for overfitting and to tune hyperparameters
Early Stopping: Optionally, stop training if the validation performance doesn't improve after a certain number of epochs
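A minimal sketch of these steps with Keras on made-up data; compile() sets up the loss and optimizer, while fit() performs forward propagation, backpropagation, epoch/batch iteration, validation, and early stopping:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import EarlyStopping
# Made-up training data: 1000 samples, 5 features, binary labels
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)
# 1. Model initialization: define the architecture (weights are randomly initialized)
model = Sequential([Input(shape=(5,)),
                    Dense(16, activation='relu'),
                    Dense(1, activation='sigmoid')])
# 3. Loss function and optimizer; 4. weight updates happen via backpropagation inside fit()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# 6. Validation with early stopping: stop if val_loss hasn't improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# 2. Forward propagation and 5. iteration over epochs and batches happen inside fit()
model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2, callbacks=[early_stop])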
Visual analysis of Neural Networks: https://playground.tensorflow.org/
We can now return to LSTMs
Key Features of LSTMs
LSTMs have memory cells that maintain information over time, allowing them to retain data for long periods. This is essential for capturing long-term dependencies in sequences, such as in time series data
Memory Cells
LSTMs utilize three types of gates to regulate the flow of information:
Forget Gate: Decides what information to discard from the cell state
Input Gate: Determines what new information to store in the cell state
Output Gate: Controls what information from the cell state is sent to the output
Gates
Carries information across time steps in the sequence
It is essentially a "conveyor belt" that runs through the entire LSTM unit
The cell state is responsible for maintaining long-term dependencies and is modified by various gates (input gate, forget gate, and output gate) at each time step to decide what information to keep or discard
Cell State
The cell state carries the long-term memory across the network, and the gates (input, forget, and output gates) regulate the flow of information into, out of, and within the memory cells
LSTMs and Time Series Data
Time series data is inherently sequential, with temporal dependencies. LSTMs are designed to process sequences, making them suitable for tasks like predicting future values
When analyzing a time series, LSTMs can learn that specific values at one point in time influence future values
LSTMs and Time Series Data (2)
Unlike ARIMA, LSTMs do not require linearity or stationarity and can also handle multivariate data
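A minimal sketch, assuming Keras, of an LSTM that reads windows of multivariate sensor readings and predicts the next value; the shapes and sizes are illustrative, and the Colab below covers the full anomaly detection workflow:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
# Made-up data: 500 windows, each 24 time steps of 3 sensors (temperature, pressure, vibration)
X = np.random.rand(500, 24, 3)
y = np.random.rand(500)              # next-step target value for each window
model = Sequential([
    Input(shape=(24, 3)),
    LSTM(32),                        # recurrent layer processes the whole 24-step window
    Dense(1),                        # regression output: predicted next value
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
# Windows whose predictions deviate strongly from the actual readings can be flagged as anomalies
preds = model.predict(X[:10])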
We are now ready to detect multivariate anomalies:
Colab: https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture9/AnomalyDetection-Multivariate.ipynb
Summary
Time Series Data consists of sequential data points recorded over time, often used to track trends, patterns, and seasonal effects
Fourier Transform decomposes a time series into its frequency components, revealing periodicities and cycles
ARIMA (AutoRegressive Integrated Moving Average) is a statistical model for time series forecasting that combines differencing, autoregression, and moving averages to capture patterns and trends
LSTMs (Long Short-Term Memory Networks) are a type of recurrent neural network designed to learn and predict sequential data with long-term dependencies, commonly used in time series forecasting
Recommended Reading
Chapters 5 to 8 of “Modern Time Series Forecasting with Python: Explore industry-ready time series forecasting using modern machine learning and deep learning” by Manu Joseph
Chapter 13 of “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
Tensorflow guide: https://www.tensorflow.org/resources/learn-ml