DAMP 2.0 How To
Dear reader, this is a brief user’s guide for DAMP.
DAMP debuted in ACM SIGKDD 2022 [b], however it extends ideas from the Matrix Profile, which debuted in 2016 [b].
We welcome suggestions for new features and bug fixes, please email the first and last authors.
[a] Matrix Profile XXIV:Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. Yue Lu, Renjie Wu , Abdullah Mueen , Maria A. Zuluaga and Eamonn Keogh. ACM SIGKDD 2022
[b] https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
Place the demo dataset (BourkeStreetMall.txt) and the DAMP code in your MATLAB path.
Let's load the data…
>> load('BourkeStreetMall.txt’);
The data is a count of pedestrians walking past a sensor in Melbourne.
It is sampled once an hour, or 24 times a day.
Let's take a quick peek at a snippet of it, by plotting 2 weeks of data.
>> plot(BourkeStreetMall(1: [24* 7 * 2]),’r’);
0
50
100
150
200
250
300
350
0
1000
2000
3000
4000
Saturday
Sunday
Lunchtime
Dinnertime
Late night/ early morning
Note the weekend days have smooth “bumps”, as people are not constrained by classic business lunch hours
Let's run DAMP on the full dataset. We will use the first week, as the training data.
There is only one parameter that affects the anomaly discovered, the subsequence length, lets use 24 hours.
>> [discord_score,position] = DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead',0);
------------------------------------------
Thank you for using DAMP.
This is version 2.0 of DAMP, please email Eamonn Keogh (eamonn@cs.ucr.edu) or Yue Lu (ylu175@ucr.edu) to make sure you have the latest version.
This time series is of length 17490, and the subsequence length you chose is 24.
The lookahead is 0.
Hints:
In most cases, the subsequence length you should use is between about 50 to 90% of a typical period.
A good initial value of lookahead is about 2^nearest_power_of_two(16 times S). The range of lookahead should be 0 to length(T)-location_to_start_processing.
If speed is important, you can tune lookahead to get greater speed-up in your domain. A simple search, doubling and halving the current value should let you quickly converge on a good value.
------------------------------------------
Results:
Pruning Rate: 0
Predicted discord score/position: 4.4569/7828
* If you want to suppress the outputs, please call DAMP using the following format:
>> [discord_score,position] = DAMP_2_0(T,SubsequenceLength,location_to_start_processing,'enable_output',false)
This run spawned the plot below. In the next slide, lets zoom in on the main discords…
The original time series
The left matrix profile
Partial road closure for street maintenance
The Top-1 discord seems plausible
(see also [c], fig 13 and fig 14)
[c] https://www.cs.ucr.edu/~eamonn/MERLIN_Long_version_for_website.pdf
The original time series
The left matrix profile
(A) This is the correct Top-1 discord in the entire time series
(A)
(C)
The value of the left-MP is zero for the training data
Sporting event finishes
The highest peak before the top-1 discord, has a correct value.
That is the top-1 discord in the range 0 to 7828
DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead',0);
The original time series
The left matrix profile
(B) This is the correct Top-1 discord that happens before it is superseded (at 7828)
(B)
(C)
However, if there is a peak to the right of the top-1 discord (encounter so-far), we cannot say it is an anomaly!
It may just be an early abandon.
So, for example, we cannot say that C is an anomaly, because there is a higher value to the left of it.
The original time series
The left matrix profile
(C)
This discussion is for Top-1, but generalizes to top-K
To Summarize
For a left-MP created for Top-K
Computing DAMP in the previous slides took about a second. Suppose you want results faster!
You can use lookahead pruning (forward processing). Let's set the lookahead to 512.
>> [discord_score,position] = DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead’,512);
This worked, it was an order of magnitude faster and the pruning rate was 0.9537.
What is a good value for lookahead? It depends on the data, smooth data tends to allow longer lookaheads.
You only need to try powers of two, something like 128, 256, 512, 1024 etc.
Once you find a good value for your dataset, it should generalize to other datasets in your domain.
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
Important Caveat
If you use a lookahead, then only the top K peaks have the true values, as before.
There is no information about the value or location of K+1 discords.
Now, the other values may be higher or lower than their true left MP values.
Once again do no be visually distracted by them, the top K peaks have the true values, and that is all you can claim.
The top-discord seen up to that point (it will later be superseded)
The top-discord seen up to that point (and, as it happens, it will remain top discord)
Some variants that DAMP 2.0 supports
Note that you can disable the plot and text echoed to the screen, and just grab the discord_score and position
1. Use the default lookahead
[discord_score,position] = DAMP_2_0(T,24,2000);
2. Use the lookahead entered by the user (e.g lookahead = 128)
[discord_score,position] = DAMP_2_0(T,24,2000,'lookahead’,128);
3. Disable output and use the default lookahead.
[discord_score,position] = DAMP_2_0(T,24,2000,'enable_output',false)
4. Disable output and use the lookahead entered by the user (e.g lookahead = 128)
[discord_score,position] = DAMP_2_0(T,24,2000,'lookahead’,128,'enable_output',false)
What is a good value for the subsequence length?
A good starting point is something like 70% of one period to one period.
If you have variable periods (say heartbeats), then try 70% of the shortest period to the shortest period.
0
50
100
150
200
250
300
350
0
1000
2000
3000
4000
one period
0
100
200
300
400
500
0
1000
2000
3000
It is important to note that DAMP does not process datasets containing constant regions that are longer than the subsequence length. Let’s create this situation, we make the original dataset have a constant region of length 51.
>> BourkeStreetMall(300:350)=0;
>> plot(BourkeStreetMall(1: [24* 7 * 3]),'r’);
This is a problem, because we are working in a z-normalized space, and we would get a div-by-zero. This results in imaginary numbers in the calculated left Matrix Profile, which prevents us from visualizing the results and, more importantly, from getting the correct discord score and position.
One quick and dirty fix is to adjust the subsequence length, make it to be longer than the length of the longest constant region. E.g. >> [discord_score,position] = DAMP_2_0(BourkeStreetMall,60,24*7,'lookahead',0);
If there are constant regions in your dataset, you probably need to think about what you want to happen here. Depending on context, constant regions may or may not be anomalies. (If the constant region is the anomaly you are expecting, there are a number of simple and fast linear methods to help you find the constant regions)
A more insidious problem is near constants regions. When such regions are z-normalized, they “blow up”, and may be report as an anomaly. Once again, what you should do here is context dependent, you should take care of this with preprocessing.
Constant region
Takeaway message: If your data can have constant regions or near constant regions longer than the subsequence length, you need to think about what you want to happen, and maybe edit your data.
DAMP will echo this error and terminate.
ERROR: This dataset contains constant and/or near constant regions. We define the time series with an overall variance less than 0.2 or with a constant region within its sliding window as the time series containing constant and/or near constant regions. Such regions can cause both false positives and false negatives depending on how you define anomalies. And more importantly, it can also result in imaginary numbers in the calculated Left Matrix Profile, from which we cannot get the correct score value and position of the top discord.** The program has been terminated. **