1 of 11

DAMP 2.0 How To

Dear reader, this is a brief user’s guide for DAMP.

DAMP debuted in ACM SIGKDD 2022 [b], however it extends ideas from the Matrix Profile, which debuted in 2016 [b].

We welcome suggestions for new features and bug fixes, please email the first and last authors.

[a] Matrix Profile XXIV:Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. Yue Lu, Renjie Wu , Abdullah Mueen , Maria A. Zuluaga and Eamonn Keogh. ACM SIGKDD 2022

[b]  https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

2 of 11

Place the demo dataset (BourkeStreetMall.txt) and the DAMP code in your MATLAB path.

Let's load the data…

>> load('BourkeStreetMall.txt’);

The data is a count of pedestrians walking past a sensor in Melbourne.

It is sampled once an hour, or 24 times a day.

Let's take a quick peek at a snippet of it, by plotting 2 weeks of data.

>> plot(BourkeStreetMall(1: [24* 7 * 2]),’r’);

0

50

100

150

200

250

300

350

0

1000

2000

3000

4000

Saturday

Sunday

Lunchtime

Dinnertime

Late night/ early morning

Note the weekend days have smooth “bumps”, as people are not constrained by classic business lunch hours

3 of 11

Let's run DAMP on the full dataset. We will use the first week, as the training data.

There is only one parameter that affects the anomaly discovered, the subsequence length, lets use 24 hours.

>> [discord_score,position] = DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead',0);

------------------------------------------

Thank you for using DAMP.

This is version 2.0 of DAMP, please email Eamonn Keogh (eamonn@cs.ucr.edu) or Yue Lu (ylu175@ucr.edu) to make sure you have the latest version.

This time series is of length 17490, and the subsequence length you chose is 24.

The lookahead is 0.

Hints:

In most cases, the subsequence length you should use is between about 50 to 90% of a typical period.

A good initial value of lookahead is about 2^nearest_power_of_two(16 times S). The range of lookahead should be 0 to length(T)-location_to_start_processing.

If speed is important, you can tune lookahead to get greater speed-up in your domain. A simple search, doubling and halving the current value should let you quickly converge on a good value.

------------------------------------------

Results:

Pruning Rate: 0

Predicted discord score/position: 4.4569/7828

* If you want to suppress the outputs, please call DAMP using the following format:

>> [discord_score,position] = DAMP_2_0(T,SubsequenceLength,location_to_start_processing,'enable_output',false)

This run spawned the plot below. In the next slide, lets zoom in on the main discords…

The original time series

The left matrix profile

4 of 11

Partial road closure for street maintenance

The Top-1 discord seems plausible

(see also [c], fig 13 and fig 14)

[c] https://www.cs.ucr.edu/~eamonn/MERLIN_Long_version_for_website.pdf

The original time series

The left matrix profile

(A) This is the correct Top-1 discord in the entire time series

(A)

(C)

5 of 11

The value of the left-MP is zero for the training data

Sporting event finishes

The highest peak before the top-1 discord, has a correct value.

That is the top-1 discord in the range 0 to 7828

DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead',0);

The original time series

The left matrix profile

(B) This is the correct Top-1 discord that happens before it is superseded (at 7828)

(B)

(C)

6 of 11

However, if there is a peak to the right of the top-1 discord (encounter so-far), we cannot say it is an anomaly!

It may just be an early abandon.

So, for example, we cannot say that C is an anomaly, because there is a higher value to the left of it.

The original time series

The left matrix profile

(C)

This discussion is for Top-1, but generalizes to top-K

7 of 11

To Summarize

For a left-MP created for Top-K

  • The K highest peaks, really are the true K greatest discords.
  • However, no claim should be made about the K+1 highest peaks.
  • This is true whether or not you use lookahead pruning (see below).

  • The above is really all you need to know. However, consider this fact for the top-1 case (for simplicity)
  • As you read left to right, if you see a peak P, and..
    • P has no taller peak to its left
    • P eventually has a taller peak at location X to the right of P.
  • Then you can say that: P is the top-1 discord in the range 0 to X.
  • The peak labeled (B) in the above is such an example.

8 of 11

Computing DAMP in the previous slides took about a second. Suppose you want results faster!

You can use lookahead pruning (forward processing). Let's set the lookahead to 512.

>> [discord_score,position] = DAMP_2_0(BourkeStreetMall,24,24*7,'lookahead’,512);

This worked, it was an order of magnitude faster and the pruning rate was 0.9537.

What is a good value for lookahead? It depends on the data, smooth data tends to allow longer lookaheads.

You only need to try powers of two, something like 128, 256, 512, 1024 etc.

Once you find a good value for your dataset, it should generalize to other datasets in your domain.

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

Important Caveat

If you use a lookahead, then only the top K peaks have the true values, as before.

There is no information about the value or location of K+1 discords.

Now, the other values may be higher or lower than their true left MP values.

Once again do no be visually distracted by them, the top K peaks have the true values, and that is all you can claim.

The top-discord seen up to that point (it will later be superseded)

The top-discord seen up to that point (and, as it happens, it will remain top discord)

9 of 11

Some variants that DAMP 2.0 supports

Note that you can disable the plot and text echoed to the screen, and just grab the discord_score and position

1. Use the default lookahead

[discord_score,position] = DAMP_2_0(T,24,2000);

2. Use the lookahead entered by the user (e.g lookahead = 128)

[discord_score,position] = DAMP_2_0(T,24,2000,'lookahead’,128);

3. Disable output and use the default lookahead.

[discord_score,position] = DAMP_2_0(T,24,2000,'enable_output',false)

4. Disable output and use the lookahead entered by the user (e.g lookahead = 128)

[discord_score,position] = DAMP_2_0(T,24,2000,'lookahead’,128,'enable_output',false)

10 of 11

What is a good value for the subsequence length?

A good starting point is something like 70% of one period to one period.

If you have variable periods (say heartbeats), then try 70% of the shortest period to the shortest period.

0

50

100

150

200

250

300

350

0

1000

2000

3000

4000

one period

11 of 11

0

100

200

300

400

500

0

1000

2000

3000

It is important to note that DAMP does not process datasets containing constant regions that are longer than the subsequence length. Let’s create this situation, we make the original dataset have a constant region of length 51.

>> BourkeStreetMall(300:350)=0;

>> plot(BourkeStreetMall(1: [24* 7 * 3]),'r’);

This is a problem, because we are working in a z-normalized space, and we would get a div-by-zero. This results in imaginary numbers in the calculated left Matrix Profile, which prevents us from visualizing the results and, more importantly, from getting the correct discord score and position.

One quick and dirty fix is to adjust the subsequence length, make it to be longer than the length of the longest constant region. E.g. >> [discord_score,position] = DAMP_2_0(BourkeStreetMall,60,24*7,'lookahead',0);

If there are constant regions in your dataset, you probably need to think about what you want to happen here. Depending on context, constant regions may or may not be anomalies. (If the constant region is the anomaly you are expecting, there are a number of simple and fast linear methods to help you find the constant regions)

A more insidious problem is near constants regions. When such regions are z-normalized, they “blow up”, and may be report as an anomaly. Once again, what you should do here is context dependent, you should take care of this with preprocessing.

Constant region

Takeaway message: If your data can have constant regions or near constant regions longer than the subsequence length, you need to think about what you want to happen, and maybe edit your data.

DAMP will echo this error and terminate.

ERROR: This dataset contains constant and/or near constant regions. We define the time series with an overall variance less than 0.2 or with a constant region within its sliding window as the time series containing constant and/or near constant regions. Such regions can cause both false positives and false negatives depending on how you define anomalies. And more importantly, it can also result in imaginary numbers in the calculated Left Matrix Profile, from which we cannot get the correct score value and position of the top discord.** The program has been terminated. **