1 of 75

MADRID: A Hyper-Anytime Algorithm to Find Time Series Anomalies of all Lengths

Student: Yue Lu

Committee Chair: Eamonn Keogh

Committee Members: Tao Jiang, Tamar Shinar, Evangelos Papalexakis

2 of 75

I am a fifth-year Ph.D. student in the Department of Computer Science and Engineering at the University of California, Riverside. My advisor is Dr. Eamonn Keogh. 

Prior to beginning my doctoral studies at UC Riverside, I received my B.S. in Software Technology and Applications with First Class Honors from the Macau University of Science and Technology.

Publications

  • Yue Lu, Thirumalai Vinjamoor Akhil Srinivas, Takaaki Nakamura, Makoto Imamura, and Eamonn Keogh. Matrix Profile XXX: MADRID: A Hyper-Anytime and Parameter-Free Algorithm to Find Time Series Anomalies of all Lengths. In Proceedings of the IEEE International Conference on Data Mining (ICDM '23), 2023.
  • Sadaf Tafazoli, Yue Lu, Renjie Wu, Thirumalai Srinivas, Hannah Dela Cruz, Ryan Mercer, and Eamonn Keogh. Matrix Profile XXIX: C22MP: Fusing catch22 and the Matrix Profile to Produce an Efficient and Interpretable Anomaly Detector. In Proceedings of the IEEE International Conference on Data Mining (ICDM '23), 2023.
  • Yue Lu, Renjie Wu, Abdullah Mueen, Maria A. Zuluaga, and Eamonn Keogh. DAMP: Accurate Time Series Anomaly Detection on Trillions of Datapoints and Ultra-fast Arriving Data Streams. Data Mining and Knowledge Discovery, pp. 1–43, 2023.
  • Yue Lu, Renjie Wu, Abdullah Mueen, Maria A. Zuluaga, and Eamonn Keogh. Matrix Profile XXIV: Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD '22), 2022.

About me

3 of 75

Outline

  • Motivation
  • The proposed method: MADRID
  • Empirical Evaluation
  • Conclusion

4 of 75

Outline

  • Motivation
  • The proposed method: MADRID
  • Empirical Evaluation
  • Conclusion

5 of 75

Why is it worth further improving time series discords?

  • Even though numerous time series anomaly detection (TSAD) methods are proposed every year, time series discords remain competitive with the state-of-the-art methods (as we, and other groups, have shown).
    • The discord definition is simple and requires only one parameter.
    • It seems to be the only TSAD technique to produce “superhuman” results.
    • It has been widely used by more than one hundred independent teams to solve different real-world problems.
  • However, time series discords have one noted weakness: the anomalies discovered depend on the algorithm’s only input parameter, the subsequence length.

6 of 75

Limitations of single-length TSAD algorithms I

  • For single-length TSAD algorithms, setting the correct window size is very important. Numerous researchers have pointed this out, for example [1] notes: “failure to provide an optimal window size may incur monetary damage.”
  • However, creating a heuristic to find the “optimal” window size (even assuming that such a task is well-defined) is very hard.
  • The window size setting may depend on domain-specific information. Again from [1], “Finding the optimal window size has remained to be one of the most challenging tasks in TSDM domains, where no domain-agnostic method is known for learning the window size”.

[1] A. Ermshaus, P. Schäfer, and U. Leser, “Window Size Selection in Unsupervised Time Series Analytics: A Review and Benchmark,” in AALTD, 2023, pp. 83–101.

7 of 75

Limitations of single-length TSAD algorithms II

[Figure: Bourke Street Mall (North) Pedestrian Traffic (# of Pedestrians, 0 to 4,000), May 18th 2016 to Jan 25th 2017, with discord scores plotted against Subsequence Length (0 to 70 hours); annotation: Battery change]
  • Some datasets in the real world may have structures and anomalies on multiple scales. Consider this Melbourne pedestrian traffic data.
  • Had we chosen ten hours, we would have discovered the sensor battery change on June 12th.

8 of 75

Limitations of single-length TSAD algorithms II

[Figure: Bourke Street Mall (North) Pedestrian Traffic, May 18th 2016 to Jan 25th 2017, with discord scores plotted against Subsequence Length (0 to 70 hours); annotation: Xmas Day]

  • Some datasets in the real world may have structures and anomalies on multiple scales. Consider this Melbourne pedestrian traffic data.
  • Had we chosen twelve hours (or longer), we would have only discovered Xmas day.

9 of 75

Limitations of single-length TSAD algorithms II

[Figure: Bourke Street Mall (North) Pedestrian Traffic, May 18th 2016 to Jan 25th 2017, with discord scores plotted against Subsequence Length (0 to 70 hours); annotation: Car Attack]

  • Some datasets in the real world may have structures and anomalies on multiple scales. Consider this Melbourne pedestrian traffic data.
  • Had we chosen only eight hours, we would have only discovered the tragic car attack on January 20th.

10 of 75

Limitations of single-length TSAD algorithms II

[Figure: Bourke Street Mall (North) Pedestrian Traffic (# of Pedestrians), May 18th 2016 to Jan 25th 2017, with discord scores at several subsequence lengths (0 to 70 hours); annotations: Battery change, Xmas Day, Car Attack]

  • Some datasets in the real world may have structures and anomalies on multiple scales.
  • An algorithm that relies on a single subsequence length would miss anomalies at other lengths.

11 of 75

Limitations of single-length TSAD algorithms III

  • Even for datasets that only have anomalies of the same length, finding the optimal window size is very hard.
  • Anomaly detection can be hypersensitive to the sliding window size m. In the worst case, a subtle change in m dramatically affects the discord score.
    • In both examples, m must be set exactly to 49, no more, no less.

[Figure: two synthetic time series of length 1,000. (A) The discord score for m = 49 is 8.57, but only 0.277 for m = 50. (B) The discord score for m = 49 is 4.05, but 0.00 for m = 48.]

12 of 75

Limitations of single-length TSAD algorithms III

  • Even for datasets that only have anomalies of the same length, finding the optimal window size is very hard.
  • Anomaly detection can be hypersensitive to the sliding window size m. In the worst case, a subtle change in m dramatically affects the discord score and can cause a single-length TSAD algorithm to fail.
    • For any algorithm considering a window size of m < 54, this trivially simple problem is unsolvable.

[Figure: training data followed by a test region containing an extra “bump”; discord scores for values of m from 10 to 100. “All” TSAD algorithms fail if m < 54.]

13 of 75

Limitations of single-length TSAD algorithms III

[Figure: Bourke Street Mall (North) Pedestrian Traffic (# of Pedestrians), May 18th 2016 to Jan 25th 2017, with discord scores against Subsequence Length (0 to 70 hours). The small battery-change blip is only anomalous when seen in the context of the normal data that surrounds it.]

  • Even if we know the exact duration of the anomaly, it would still be challenging to predict the appropriate window size to detect it.
    • For example, the battery change anomaly is two hours long, yet it reveals itself best at a scale of ten hours.


14 of 75

Can we bypass the difficulties by testing all values?

  • All the previously described limitations have a straightforward solution: simply find anomalies for every possible subsequence length!
  • However, the “test-all-lengths” method seems far too slow to be practical.

15 of 75

Outline

  • Motivation
  • The proposed method: MADRID
  • Empirical Evaluation
  • Conclusion

16 of 75

MERLIN And DAMP Refined Iteratively on Demand

  • In this work, we propose a novel TSAD method, called MADRID, which solves all the problems in the previous slides.
  • Specifically, MADRID has the following two key properties:
    • MADRID is absolutely fast to compute all-discords.

17 of 75

MERLIN And DAMP Refined Iteratively on Demand

  • In this work, we propose a novel TSAD method, called MADRID, which solves all the problems in the previous slides.
  • Specifically, MADRID has the following two key properties:
    • MADRID is absolutely fast to compute all-discords.
    • MADRID produces very efficient anytime convergence; we call an algorithm with this property a Hyper-Anytime Algorithm.
      • The term “Hyper-Anytime Algorithm” means an algorithm that can converge to within 10% of the optimal answer, after using less than 10% of the time needed for full convergence.

18 of 75

MERLIN And DAMP Refined Iteratively on Demand

  • In this work, we propose a novel TSAD method, called MADRID, which solves all the problems in the previous slides.
  • Specifically, MADRID has the following two key properties:
    • MADRID is absolutely fast to compute all-discords.
    • MADRID produces very efficient anytime convergence; we call an algorithm with this property a Hyper-Anytime Algorithm.
      • The term “Hyper-Anytime Algorithm” means an algorithm that can converge to within 10% of the optimal answer, after using less than 10% of the time needed for full convergence.
  • MADRID exploits some ideas from the recently introduced DAMP algorithm [2], which we review in the next four slides.

[2] Y. Lu, et al, “DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams,” Data Mining and Knowledge Discovery, pp. 1–43, 2023. 

19 of 75

Brief Review of DAMP I

  • DAMP is an ultra-fast time series anomaly detection algorithm that operates on a fixed sliding window size.
  • Over a never-ending sliding window, DAMP takes in the current subsequence and looks backwards for its nearest neighbor in its entire history (the training data). This makes discords particularly robust to concept drift, as newly ingested data instantly becomes part of the training data.

[Figure: a time series (0 to 2,500) with its partly computed Left-aMP; the current subsequence is at “Now”, and everything before it is the training data]

20 of 75

Brief Review of DAMP II

  • We chose DAMP as the core subroutine of MADRID, because
    • DAMP inherits all the advantages of time series discords - it is simple, requires a single parameter and performs at superhuman levels.
    • DAMP is not confused by repeated anomalies (twin-freaks) and can handle concept drift in time series.
    • DAMP achieves state-of-the-art performance on several benchmark datasets including the Hexagon ML/UCR Anomaly Detection datasets.
    • MADRID also inherits all the benefits of DAMP listed above.
  • DAMP employs two key strategies to speed up the process of anomaly detection, which are also adopted by MADRID: 
    • Iterative doubling: DAMP only looks back as far as it needs to (early abandoning).
    • Forward pruning: DAMP can “peek” ahead to prune the future subsequences that are unlikely to be a discord.

21 of 75

DAMP only looks back as far as it needs to

[Figure: a time series (0 to 2,500) with its partly computed Left-aMP; the current subsequence is at “Now”, and the Best-So-Far discord score is 2.1]

  • DAMP looks back over a short range of the history. If no close match is found, it doubles the length of the look-back range and searches again.
  • If DAMP finds a subsequence whose distance from the current subsequence is less than the Best-So-Far, it stops searching: the current subsequence cannot be the discord.
  • When DAMP finds the nearest neighbor, it stops searching and updates the Left-aMP score.

This process is “iterative doubling”.
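The look-back loop can be sketched as follows (a simplified sketch of the DAMP idea; the helper names are ours, and the real algorithm aligns its chunks to powers of two and uses the MASS algorithm rather than a scalar loop):

```python
import numpy as np

def znorm_dist(a, b):
    # z-normalized Euclidean distance between two equal-length subsequences
    def z(x):
        s = x.std()
        return (x - x.mean()) / s if s > 1e-12 else np.zeros_like(x)
    return np.linalg.norm(z(a) - z(b))

def backward_search(T, i, m, best_so_far):
    """Iterative doubling: search the history of T[i:i+m] in chunks,
    most recent chunk first, doubling the chunk size each round.
    Return early once any neighbor is closer than best_so_far, because
    the current subsequence can then no longer be the top discord."""
    query = T[i:i + m]
    nn = np.inf
    hi = i - m + 1        # exclusive end of non-overlapping start positions
    span = 2 * m
    while hi > 0:
        lo = max(0, hi - span)
        for j in range(lo, hi):
            nn = min(nn, znorm_dist(query, T[j:j + m]))
            if nn < best_so_far:
                return nn                  # early abandon
        hi = lo
        span *= 2                          # look back twice as far next round
    return nn            # searched the entire history: exact Left-aMP value
```

On periodic data a close neighbor almost always lies in the most recent chunk, so the loop usually exits after examining only a tiny fraction of the history.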

22 of 75

DAMP can “peek” ahead to prune unqualified future subsequences

  • DAMP looks ahead a bit, computing the distance profile between the current subsequence and the near future.
  • If there are future subsequences whose distance from the current subsequence is less than the Best-So-Far (here 2.1), they are pruned and never processed: future subsequences similar to the current subsequence are unlikely to be a discord.

[Figure: a time series (0 to 5,000), the current subsequence, and the distance profile (0 to 15) over the look-ahead region; two future subsequences lie within distance 2.1 of the current subsequence]

This process is “forward pruning”.
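Forward pruning can be sketched similarly (again a naive scalar version of ours; DAMP computes this distance profile with the MASS algorithm):

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 1e-12 else np.zeros_like(x)

def forward_prune(T, i, m, lookahead, best_so_far):
    """Distances from the current subsequence T[i:i+m] to every
    subsequence starting in the next `lookahead` points. Any future
    subsequence closer than best_so_far already has a neighbor in its
    past within best_so_far, so it cannot beat the best-so-far discord:
    it is marked as pruned and skipped later."""
    q = znorm(T[i:i + m])
    end = min(len(T) - m + 1, i + lookahead)
    pruned = []
    for j in range(i + m, end):        # skip trivially overlapping matches
        if np.linalg.norm(q - znorm(T[j:j + m])) < best_so_far:
            pruned.append(j)
    return pruned
```

In a periodic signal most future subsequences repeat the present one, so large stretches of the profile are never computed at all.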

23 of 75

MADRID is absolutely fast to compute all-discords

  • The efficiency of DAMP greatly depends upon the Best-So-Far (BSF) discord score. The larger it is, the more effective the pruning is.
  • If we set the initial BSF to a value just slightly below the final BSF before starting DAMP, this can reduce the absolute time to compute discords.
  • How can we determine this initial BSF?

24 of 75

MADRID is absolutely fast to compute all-discords

  • This is what MADRID’s data structure, the multi-length discord table M, initially looks like:

multi-length discord table M (only a subset of rows is shown)

    Discord   Loc   m
    0         NaN   8
    0         NaN   9
    0         NaN   10
    0         NaN   11
    0         NaN   16
    0         NaN   21
    0         NaN   22
    0         NaN   23
    0         NaN   24

25 of 75

MADRID is absolutely fast to compute all-discords

  • Each row of the multi-length discord table M is a Left-aMP.

[Table M as before, still all 0 / NaN; the highlighted row is the Left-aMP for m=8]

26 of 75

MADRID is absolutely fast to compute all-discords

  • Each row of the multi-length discord table M is a Left-aMP.
  • MADRID incrementally increases the size of m, repeatedly invoking DAMP to compute the Left-aMP row by row.

[Table M as before; the highlighted row is the Left-aMP for m=8]
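Stripped of all optimizations, the row-by-row structure looks like this (a sketch of ours; `top_discord` is a naive stand-in for a real DAMP call and does no pruning):

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 1e-12 else np.zeros_like(x)

def top_discord(T, split, m):
    """Naive stand-in for one DAMP run: for each subsequence after the
    training split, its distance to the nearest neighbor entirely in
    the past; return the largest such distance and its location."""
    best, loc = -np.inf, -1
    for i in range(split, len(T) - m + 1):
        q = znorm(T[i:i + m])
        nn = min(np.linalg.norm(q - znorm(T[j:j + m]))
                 for j in range(0, i - m + 1))
        if nn > best:
            best, loc = nn, i
    return best, loc

def madrid_rows(T, split, minL, maxL):
    # fill the multi-length discord table M one row (one length m) at a time
    return {m: top_discord(T, split, m) for m in range(minL, maxL + 1)}
```

Each returned (score, location) pair corresponds to the meta-data of one row of table M.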

27 of 75

MADRID is absolutely fast to compute all-discords

  • Each row of the multi-length discord table M is a Left-aMP.
  • MADRID incrementally increases the size of m, repeatedly invoking DAMP to compute the Left-aMP row by row.

[Table M as before; the highlighted row is the Left-aMP for m=9]

28 of 75

MADRID is absolutely fast to compute all-discords

  • Each row of the multi-length discord table M is a Left-aMP.
  • MADRID incrementally increases the size of m, repeatedly invoking DAMP to compute the Left-aMP row by row.

[Table M as before; the highlighted row is the Left-aMP for m=10]

29 of 75

MADRID is absolutely fast to compute all-discords

  • The left side of the table (the Discord and Loc columns) is the table’s meta-data.

[Table M as before; the Discord and Loc columns are highlighted as meta-data]

30 of 75

MADRID is absolutely fast to compute all-discords

    Discord   Loc     m
    21.3      7,569   8
    0         NaN     9
    0         NaN     10
    0         NaN     11
    0         NaN     16
    0         NaN     21
    0         NaN     22
    0         NaN     23
    0         NaN     24

[Figure: the time series (0 to 7,500) with its training data; the discord for m=8 is at Loc = 7,569, with discord score 21.3]

  • The meta-data stores the information of the top discord for each row.
  • For example, the highlighted (red) row represents the top discord at m=8.
    • Discord: the highest discord score.
    • Loc: anomaly location.
    • m: anomaly length.

31 of 75

MADRID is absolutely fast to compute all-discords

[Table M and figure as on the previous slide: the top discord for m=8 is at Loc = 7,569, with score 21.3]

  • Recall that to improve the absolute speed of computing discords, we need a large initial BSF.
  • To do so, when MADRID completes the calculation for row m and is about to compute row m+1…

32 of 75

MADRID is absolutely fast to compute all-discords

[Figure: at the same Loc = 7,569, the window size is expanded by one unit (m+1 = 9); the exact discord score at this location is 19.8]

  • MADRID goes to the anomaly location Loc, increases the size of m by one, and computes the discord score there.
  • In general, the discord score of cell M[Loc,m] will be highly correlated with that of cell M[Loc,m+1].

33 of 75

MADRID is absolutely fast to compute all-discords

[Table M: the row m=9 now holds 19.8 (Loc still NaN), the initial BSF for the next row; the row m=8 holds the exact discord 21.3 at Loc = 7,569]

  • This gives MADRID a large initial BSF discord score before computing the next row.
  • We call this optimization warm-start.

This value may be increased in the subsequent calculation.
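A sketch of the warm-start step (our naming; `left_nn_dist` is a naive exact left-nearest-neighbor distance standing in for the DAMP machinery):

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 1e-12 else np.zeros_like(x)

def left_nn_dist(T, i, m):
    # exact distance from T[i:i+m] to its nearest neighbor entirely in the past
    q = znorm(T[i:i + m])
    return min(np.linalg.norm(q - znorm(T[j:j + m]))
               for j in range(0, i - m + 1))

def warm_start_bsf(T, loc_prev, m_next):
    """Before computing row m+1, score the previous row's discord
    location with the window grown by one point. This is the exact
    discord score of one subsequence at the new length, hence a valid
    (and usually large) initial Best-So-Far for the whole row."""
    return left_nn_dist(T, loc_prev, m_next)
```

Because the anomaly at Loc typically remains anomalous at length m+1, this one evaluation yields a BSF close to the row's final value, which is what makes DAMP's pruning effective from the very first subsequence.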

34 of 75

MADRID is absolutely fast to compute all-discords

  • The task involves computing the multi-length discord table M for a slightly noisy sine wave of length 8,192, with minL = 128, maxL = 768, and step size S = 1. This means that there are a total of 5,251,072 cells in the table that need to be processed.
  • We start with the pure brute force algorithm, incrementally add our three optimizations, and observe the execution time of each variant.

[Figure: quality of solution (0% to 100%) vs. wall clock time (0 to 60,000 seconds); pure brute force (NN), with extrapolation, takes ~15.5 hours]

35 of 75

MADRID is absolutely fast to compute all-discords

  • Each optimization contributes to the speedup, reducing what would have taken 15.5 hours for the 5 million cells to just 16 minutes.
  • Compared with pure DAMP, MADRID with warm-start optimization (red curve) is 1.5 times faster.

[Figure (zoom-in): quality of solution (0% to 100%) vs. wall clock time (0 to 4,000 seconds); MASS brute force takes ~one hour, pure DAMP ~24 minutes, and warm-start DAMP ~16 minutes, vs. pure brute force (NN)]

36 of 75

MADRID is a Hyper-Anytime Algorithm

  • We define a hierarchy to characterize the performance of anytime algorithms.
  • Based on the convergence performance of anytime algorithms, we defined five different hierarchical categories.
    • For instance, Ultra-Anytime refers to a 70%-efficient algorithm, meaning it converges to within 30% of the final solution using only 30% of the time required by the batch algorithm. The definitions for other levels follow a similar logic.
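One way to make the hierarchy operational (a sketch of our reading of the definition, not code from the paper): treat a convergence curve as a function q(t) mapping the fraction of batch time used to the fraction of final solution quality, define its efficiency as the largest E with q(1-E) >= E, and bucket the result.

```python
def efficiency(q):
    """Largest E in [0, 1] with q(1 - E) >= E, found by bisection;
    q maps time fraction (0..1) to solution quality (0..1) and is
    assumed non-decreasing."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if q(1 - mid) >= mid:
            lo = mid
        else:
            hi = mid
    return lo

def anytime_level(E):
    # thresholds follow the slide's hierarchy: a 90%-efficient algorithm
    # (within 10% of the answer in 10% of the time) is Hyper-Anytime, etc.
    if E >= 0.9:
        return "Hyper-Anytime"
    if E >= 0.8:
        return "Super-Anytime"
    if E >= 0.7:
        return "Ultra-Anytime"
    if E >= 0.5:
        return "Anytime"
    return "Hypo-Anytime"
```

A curve that improves exactly linearly is 50%-efficient and lands in the plain Anytime bucket; anything worse than linear is Hypo-Anytime.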

[Figure: Quality of Solution (0% to 100%) vs. Computational Resource (0% to 100%), with example convergence curves for the five levels: Hyper-Anytime, Super-Anytime, Ultra-Anytime, Anytime, and Hypo-Anytime]

37 of 75

MADRID is a Hyper-Anytime Algorithm

  • Despite the warm-start optimization, MADRID is essentially a batch-only algorithm as it sequentially processes the cells in the table M row by row.
    • The red curve shows it converges to the final solution slightly better than linearly.
  • Can the MADRID algorithm be adapted to a Hyper-Anytime Algorithm that converges to within 10% of the final solution using only 10% of the time required by the batch algorithm?

[Figure: the convergence curve of warm-start DAMP (quality of solution vs. wall clock time), alongside the anytime-hierarchy template; warm-start DAMP falls short of the Hyper-Anytime region]

38 of 75

MADRID is a Hyper-Anytime Algorithm

  • MADRID employs an additional initialization strategy to quickly estimate the scores and locations of all-discords early in its execution.
  • We start with an empty multi-length discord table M.

[Table M: all cells 0 / NaN]

39 of 75

MADRID is a Hyper-Anytime Algorithm

  • Since DAMP is fast for a single length, we can exploit this to quickly compute three representative discords.
  • The representative lengths we choose are the minimum length minL (8), the mid-length (16), and the maximum length maxL (24).

    Discord   Loc      m
    21.3      7,569    8
    18.7      12,123   16
    23.4      11,895   24
    (all other rows still 0 / NaN)

* Cells in gray represent those visited by DAMP; those in red highlight the top discords.

40 of 75

MADRID is a Hyper-Anytime Algorithm

[Table M as on the previous slide, with the three representative discords recorded]

  • Discord scores/locations are also updated in the meta-data.
  • With information about the three representative discords, we can make quick and reasonable inferences about the remaining discords.

41 of 75

MADRID is a Hyper-Anytime Algorithm

    Discord   Loc      m
    21.3      7,569    8
    20.8      7,569    9
    19.9      7,569    10
    18.4      12,123   11
    18.7      12,123   16
    19.2      12,123   21
    23.7      11,895   22
    22.5      11,895   23
    23.4      11,895   24

  • We compute the scores for all cells in the same columns as the three representative discords.
  • The discord scores of these cells should be very close to the highest discord score, since the discord score of cell M[Loc,m] is highly correlated with that of cell M[Loc,m+1].

42 of 75

MADRID is a Hyper-Anytime Algorithm

[Table M as on the previous slide; in each row the three candidate cells, one per representative column, are labeled 1, 2, and 3]

  • Evaluate the three cells in each row, select the one with the highest value, max{cell1, cell2, cell3}, as the approximate solution for that row, and update the meta-data.
  • After this initialization process, MADRID initiates warm-start DAMP to calculate the exact scores and positions of the top discords row by row.

* Red font in the meta-data represents exact values that match the final results and will not be modified in subsequent computations; black font represents approximations, which may be corrected in subsequent calculations.
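The initialization can be sketched end-to-end (again with a naive stand-in for DAMP; the function names are ours):

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 1e-12 else np.zeros_like(x)

def left_nn_dist(T, i, m):
    # exact distance from T[i:i+m] to its nearest neighbor entirely in the past
    q = znorm(T[i:i + m])
    return min(np.linalg.norm(q - znorm(T[j:j + m]))
               for j in range(0, i - m + 1))

def top_discord(T, split, m):
    # naive stand-in for a single-length DAMP run
    best, loc = -np.inf, -1
    for i in range(split, len(T) - m + 1):
        d = left_nn_dist(T, i, m)
        if d > best:
            best, loc = d, i
    return best, loc

def madrid_init(T, split, minL, maxL):
    """Anytime initialization (sketch): run the single-length search at
    minL, the mid-length and maxL, then score every remaining row only
    at those three discord locations, keeping the best of the three as
    a quick approximation that warm-start DAMP later refines."""
    reps = [minL, (minL + maxL) // 2, maxL]
    exact = {m: top_discord(T, split, m) for m in reps}
    locs = [loc for _, loc in exact.values()]
    M = dict(exact)
    for m in range(minL, maxL + 1):
        if m not in M:
            M[m] = max((left_nn_dist(T, loc, m), loc)
                       for loc in locs if loc + m <= len(T))
    return M
```

Only three full-length searches are performed; every other row costs just three distance evaluations, which is why a good approximation of the whole table appears so early in the run.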

43 of 75

MADRID is a Hyper-Anytime Algorithm

  • Consider two synthetic datasets with lengths of 100,000 and 1,000,000, both being slightly noisy sine waves.
    • We intentionally introduced concept drift in both datasets, their periodicity increases over time.
    • In both cases, we used the first 50,000 datapoints for training and embedded three visually obvious anomalies of different lengths between 50,000 and 60,000.

[Figure: the first 500 and the last 500 datapoints of the synthetic dataset (note the slight change in period, smoothly spread over the full length of the data), and the three embedded anomalies, around locations 51,000–51,500, 54,000–54,500, and 58,000–58,500]

44 of 75

MADRID is a Hyper-Anytime Algorithm

  • Recall that in terms of absolute speed, warm-start DAMP is already much faster than pure DAMP.

[Figure (zoom-in, repeated from earlier): quality of solution vs. wall clock time (0 to 4,000 seconds) for pure brute force (NN), MASS brute force, pure DAMP, and warm-start DAMP]

45 of 75

MADRID is a Hyper-Anytime Algorithm

  • However, the convergence of warm-start DAMP is not optimal, since it accesses the table M row by row in sequence. Within our established hierarchy of anytime algorithms, we expect the convergence of our algorithms to improve to the Hyper-Anytime level.
  • Let’s see how well MADRID converges after applying the additional initialization step.

[Figure: the convergence curve of warm-start DAMP (quality of solution vs. percentage of computational resource), overlaid on the anytime-hierarchy template; it does not reach the Hyper-Anytime region]

46 of 75

MADRID is a Hyper-Anytime Algorithm

  • In both cases, after MADRID completed only ~1.55% of the total computation, it successfully converged to within 10% of the final solution, confirming its robust anytime performance.
    • Overlaying MADRID's convergence plot onto the nomenclature template demonstrates that MADRID is a Hyper-Anytime Algorithm.
  • MADRID also exhibits excellent scalability, with better convergence as the dataset size increases.

[Figure: convergence of MADRID vs. warm-start DAMP on the 100,000- and 1,000,000-datapoint datasets (quality of solution vs. percentage of computational resource); MADRID's curves lie in the Hyper-Anytime region, with the point of semantic convergence marked]

47 of 75

Outline

  • Motivation
  • The proposed method: MADRID
  • Empirical Evaluation
  • Conclusion

48 of 75

Comparison of MADRID with state-of-the-art TSAD methods

  • We compared MADRID with the two most-cited deep learning methods and two Matrix Profile-based methods on the 100K synthetic dataset.

    Method               Window Size   Precision   Train and Test Time
    OmniAnomaly          160           0.05        5.9 hours
    Telemanom            160           1           2.2 hours
    PAN-Matrix Profile   64 to 256     1           1.8 days
    MERLIN               64 to 256     1           3.6 hours
    MADRID               64 to 256     1           3.9 minutes (34 seconds to converge)

  • MADRID outperforms the listed SOTA methods in both accuracy and execution time.
    • MADRID can test 192 different subsequence lengths much faster than the SOTA deep learning algorithms can test a single length, and MADRID successfully finds all anomalies.
    • PAN-Matrix Profile and MERLIN brute-force search every cell in table M, whereas MADRID uses forward/backward pruning and warm-start optimization to significantly reduce the search space.
  • MADRID returns a solution within 10% of the final answer in just 34 seconds.

49 of 75

MADRID addresses the challenge of finding the optimal window size

  • Finding the optimal window size that works best for the algorithm is very hard. MADRID circumvents this problem by testing a range of window sizes, increasing the chances of successful anomaly detection.
  • Consider the HEX/UCR Anomaly Dataset, on which DAMP achieves state-of-the-art performance (as demonstrated by several independent papers). However, there are still cases where DAMP fails due to inappropriate window size settings.
    • For example, for the UCR-202 dataset, DAMP’s heuristic suggested a window size of 32. With this window size, DAMP predicts an anomaly at position 37,548, while the ground truth anomaly spans 10,998 to 11,028.

[Figure: the UCR-202 time series (0 to 33,432) and its Left-aMP; DAMP's top discord is far from the ground truth anomaly]

50 of 75

MADRID addresses the challenge of finding the optimal window size

[Figure: UCR-202 discord score vs. window sizes from 10 to 96; DAMP’s prediction (m=32) is marked]

  • We simply set the minL for MADRID to be one-third of DAMP’s m value (10) and maxL to three times that value (96).
  • Most of MADRID’s predictions successfully identified the ground truth anomaly.
    • Of MADRID's 87 predictions, 75 are correct.
    • Reporting only the location with the highest normalized discord score also yields the correct result.
  • DAMP’s window size of 32 is only one less than the correct window sizes used for successful predictions. However, this small difference led to a wrong result that is far from the ground truth.

* Red represents successful MADRID predictions, blue indicates failed MADRID predictions, and green marks the DAMP prediction.


51 of 75

Visualizing MADRID in 3D

  • This is a demo to visualize MADRID’s meta-data for UCR-202 in 3D.
  • We can get different information by viewing from different sides of the 3D plot.
    • Looking from the top/front, we can identify the locations of anomalies in the time series.
    • Viewed from the side, the height of the discord score gives us clues about the natural length of the anomalies.

52 of 75

MADRID can quickly detect multi-scale anomalies in real datasets

  • This is nine years of foot traffic data from Waterfront City in Melbourne.
  • We used the first full year as training data and the remaining eight years as test data, then ran MADRID and Telemanom on it.

[Figure: Waterfront City Pedestrian Traffic (0 to 10,000), Apr 30th 2009 to Jan 31st 2018]

* Waterfront City is a shopping and entertainment area, just two kilometers from the Melbourne CBD.

53 of 75

MADRID can quickly detect multi-scale anomalies in real datasets

  • MADRID tested 45 different lengths from 4 to 48 hours and reported 8 distinct anomalies at various locations. Here are the top 3.
    • 15-hours: Police say widespread flash flooding is beginning to affect central areas around Melbourne. (January 12, 2011)

[Figure: Waterfront City Pedestrian Traffic (# of Pedestrians), Apr 30th 2009 to Jan 31st 2018, with the flash flooding anomaly (15 hours) highlighted; discord scores shown against subsequence length (0 to 100 hours)]

54 of 75

MADRID can quickly detect multi-scale anomalies in real datasets

  • MADRID tested 45 different lengths from 4 to 48 hours and reported 8 distinct anomalies at various locations. Here are the top 3.
    • 19-hours: 2011 AFL Parade - Thousands of AFL fans have created a carnival-like atmosphere in Melbourne’s CBD as they watch this year’s grand final parade. (September 30, 2011)

[Figure: Waterfront City Pedestrian Traffic, Apr 30th 2009 to Jan 31st 2018, with the AFL Parade anomaly (19 hours) highlighted; discord scores shown against subsequence length (0 to 100 hours)]

55 of 75

MADRID can quickly detect multi-scale anomalies in real datasets

  • MADRID tested 45 different lengths from 4 to 48 hours and reported 8 distinct anomalies at various locations. Here are the top 3.
    • 39-hours: There was a public holiday in Australia for Remembrance Day, which commemorates the end of World War I and honors the fallen soldiers. (November 11, 2010)

[Figure: Waterfront City Pedestrian Traffic, Apr 30th 2009 to Jan 31st 2018, with the Remembrance Day anomaly (39 hours) highlighted; discord scores shown against subsequence length (0 to 100 hours)]

56 of 75

MADRID can quickly detect multi-scale anomalies in real datasets

  • MADRID reported 8 valid anomalies. In contrast, all of Telemanom's results point to the same anomaly: the 2014 national holiday.
    • Telemanom’s results are independent of the window length chosen by the user, which suggests that the algorithm may be biased toward specific types of anomalies, making some anomalies hard to find.
  • MADRID took 40.5 minutes to test all 45 lengths, converging to within 10% of the optimal answer with just 6.5% of the computation. Telemanom is an order of magnitude slower, requiring an estimated 8.6 hours for all 45 lengths.
    • With MADRID’s anytime feature, users can run TSAD for a few minutes, peek at the results, and then make decisions based on what they see, such as whether to continue running or modify the data. However, this type of interaction is simply not tenable if each run takes hours.

[Figure: Waterfront City Pedestrian Traffic, Apr 30th 2009 to Jan 31st 2018]

57 of 75

[Figure: three years of relative humidity data for northern California (0 to 100%), Jan 1st 2018 to Dec 31st 2020]

MADRID searches efficiently on millions of datapoints

  • This is relative humidity data for northern California from 2018 to 2020.
  • We used 2018 for training to detect anomalies in 2019 and 2020. With minute-level sampling, our training and test data are 500,000 and 1,000,000 in length, respectively.

58 of 75

[Figure: the relative humidity data, Jan 1st 2018 to Dec 31st 2020, with four detected anomalies highlighted: the Kincade Fire (one week), winter storm Kai (half a day), winter storm Nadia (one day), and the Thanksgiving-week storm (two days)]

MADRID searches efficiently on millions of datapoints

  • MADRID tested 14 different lengths from 0.5 to 7 days and reported 4 distinct anomalies, each corresponding to extreme weather events of varying durations.
    • Shorter anomalies are tied to three different storms, while longer ones exceeding two days are caused by the Kincade Fire, which ravaged Northern California for a protracted period of 15 days.

59 of 75

[Figure: the three years of relative humidity data (0 to 100%), Jan 1st 2018 to Dec 31st 2020]

MADRID searches efficiently on millions of datapoints

  • Using deep learning methods on this million-level dataset is not practical.
    • We used the same training and testing data and MADRID’s median search length of 4,320 (3 days) for Telemanom. It took 49.6 hours for just the first of 35 required epochs, projecting a total of 72.6 days for full training. The testing duration couldn’t be estimated: given the large 1-million-point test dataset, there’s no assurance it would finish in a reasonable time or without memory errors.
  • In stark contrast, MADRID only required 10.6 hours to conduct 14 searches of varying lengths on 1 million data points, reporting four meaningful anomalies. This highlights MADRID's efficiency and accuracy.

60 of 75

Conclusion

  • The results of most TSAD algorithms depend on the considered subsequence length. However, finding the optimal length is challenging (it may not even be well defined). By testing all lengths, we can bypass this issue.
  • MADRID can test hundreds of subsequence lengths at least an order of magnitude faster than deep learning models testing a single length and can report more valid anomalies at different scales.

Paper, code and demo are available at: https://sites.google.com/view/madrid-icdm-23/documentation

61 of 75

Dedicated to our friend

Frank Madrid

July 18, 1986 - May 18, 2020

62 of 75

Thank you for your attention!

Any questions?

63 of 75

Matrix Profile XXX: MADRID: A Hyper-Anytime Algorithm to Find Time Series Anomalies of all Lengths

Yue Lu, Thirumalai Vinjamoor Akhil Srinivas, Takaaki Nakamura, Makoto Imamura, and Eamonn Keogh.

64 of 75

Appendix

  • We need to compare the discord scores of time series that are of different lengths. The correct normalization is shown on the next slide.

65 of 75

[Figure (top right): three pairs of slightly noisy sine waves, with lengths ranging from 1,000 to 10,000.]

[Figure (bottom right): raw Euclidean distance versus time series length (1 to 10,000) under the three candidate normalizations: no normalization, divide by length, and divide by the square root of length.]

We may need to compare the Euclidean distances of pairs of time series that are of different lengths.

How should we normalize to compensate for the length differences?

We could:

1) Not normalize at all, but this strongly biases us toward short patterns.

2) Normalize by dividing by the length of the time series. This seems to make sense, but it strongly biases us toward long patterns.

3) Normalize by dividing by the square root of the length of the time series. Without explanation (here), we claim that this is correct.

In the figures on the bottom right, we compare the three ideas above on slightly noisy sine waves of different lengths (shown top right).

As you can see, the "divide by the square root of length" approach is invariant to the length of the patterns.

Which of these three pairs is closest?

Under mild assumptions, we might claim that they are all equally similar.
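The claim above is easy to check empirically. A minimal sketch (the noise level, period, and lengths are illustrative): compare two independently noised sine waves at several lengths, and observe that only the divide-by-√length score stays roughly constant.

```python
import math
import random

def euclidean(a, b):
    # Raw Euclidean distance between two equal-length sequences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def noisy_sine(n, seed, noise=0.1):
    # A sine wave of length n with a little Gaussian noise added
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * i / 100) + rng.gauss(0, noise) for i in range(n)]

for n in (1_000, 4_000, 10_000):
    a, b = noisy_sine(n, seed=1), noisy_sine(n, seed=2)
    raw = euclidean(a, b)
    print(f"n={n:>6}  raw={raw:6.2f}  "
          f"by_length={raw / n:.5f}  by_sqrt_length={raw / math.sqrt(n):.4f}")
```

The raw distance grows like √n and the divide-by-length score shrinks like 1/√n; only the divide-by-√length column is (approximately) constant across lengths, matching the figure.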

66 of 75

Why does this matter?

There are hundreds of papers, containing hundreds of tables, comparing different algorithms on SWaT.

In most cases, these papers report 3, 4 or even 5 significant digits!

It is all meaningless!

67 of 75

It is not Time Series Anomaly Detection

[Figure: the SWaT-FIT401 signal; the labeled "anomaly" is simply the varying signal being replaced by a run of zeros (…,0,0,0,0,0,…).]

Most domains have a special value that marks invalid data. For example:

    • …,NaN, NaN, NaN,...
    • …,-9999,-9999,-9999,… (AspenTech)
    • …,0,0,0,0,… (where zero is a physically impossible value)
    • …,inf,inf,inf,…

But you do not need to do any anomaly detection/machine learning to detect these!

You know about these before you see any data. About 70% of the anomalies in SWaT are of this type!

This is not time series anomaly detection.
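Checks like these require no learning at all, only a membership test against values known in advance. A minimal sketch (the sentinel set and the toy series are illustrative, not taken from SWaT):

```python
import math

# Sentinel values that mark invalid data in this (hypothetical) domain;
# zero belongs here only where it is a physically impossible reading.
SENTINELS = {-9999.0, 0.0}

def invalid_runs(series):
    """Return (start, end) index pairs of runs of invalid readings."""
    runs, start = [], None
    for i, v in enumerate(series):
        bad = v in SENTINELS or math.isnan(v) or math.isinf(v)
        if bad and start is None:
            start = i                       # a run of invalid values begins
        elif not bad and start is not None:
            runs.append((start, i - 1))     # the run just ended
            start = None
    if start is not None:
        runs.append((start, len(series) - 1))
    return runs

print(invalid_runs([1.2, 1.3, 0.0, 0.0, 0.0, 1.1, float("nan")]))
# → [(2, 4), (6, 6)]
```

This rule is fully specified before any data is seen, which is exactly why flagging such runs is not anomaly *detection*.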

68 of 75

It is not Time Series Anomaly Detection

[Figure repeated: SWaT-FIT401, the varying signal becoming a flat line of zeros.]

Dhaval Patel

IBM TSAD expert

"…varying signal becomes flat line…"

"…we can set a simple rule to detect the 'anomaly'"

*Toolkit for Time Series Anomaly Detection. Patel et al. SIGKDD 2022

No one working in the real world would ever consider this an anomaly. It is a trivial "pipeline" issue that can be dealt with using a "simple rule"* that you can define before you see the data.

69 of 75

A seventy-year-old algorithm is often superior

[Figure: SWaT-FIT401 (zoom-in), indices 227,900 to 263,700.]

Let's bend over backwards and pretend it is a TSAD problem.

A classic Statistical Process Control (SPC) algorithm says: "If the next value is more than three standard deviations away from the data in a short history of points, flag it as an anomaly."

This anomaly is tens of standard deviations away, so we can solve it with a simple 70-year-old algorithm.
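The SPC rule above fits in a dozen lines. A minimal sketch (the window size, the three-sigma threshold, and the synthetic FIT401-like signal are all illustrative):

```python
import statistics

def spc_flags(series, window=50, k=3.0):
    """Classic SPC rule: flag any point more than k standard deviations
    from the mean of a short trailing history of points."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sigma = statistics.pstdev(hist)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flags.append(i)
    return flags

# A steadily varying signal that suddenly drops to zero,
# mimicking the flat-line "anomaly" in SWaT-FIT401.
signal = [2.0 + 0.01 * (i % 3) for i in range(100)] + [0.0] * 10
print(spc_flags(signal))
```

The drop to zero is flagged immediately at index 100, with no training and no labels; the small periodic wiggle before it never trips the three-sigma threshold.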


70 of 75

The scoring functions used greatly inflate performance

[Figure: SWaT-FIT401 (zoom-in), indices 227,900 to 263,700.]

Finding this trivial anomaly should be scored as 1-out-of-1; failing to find it should be scored as 0-out-of-1.

However, the anomaly is of length 35,800, so most papers report a score like 35,701 out of 35,800, i.e., "Our accuracy is 0.99723".

But these are not 35,800 independent events! This is a single event.
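The inflation is easy to reproduce. A minimal sketch, with hypothetical predictions chosen to match the 35,701-out-of-35,800 example above:

```python
def pointwise_accuracy(pred, truth):
    # Fraction of individual points labeled correctly
    hits = sum(p == t for p, t in zip(pred, truth))
    return hits / len(truth)

def event_score(pred, truth):
    # The event counts as found (1) if the detector fires
    # anywhere inside it, otherwise missed (0).
    return int(any(p and t for p, t in zip(pred, truth)))

n = 35_800
truth = [1] * n                   # one anomalous event spanning all 35,800 points
pred = [1] * 35_701 + [0] * 99    # a detector covering most of that one event

print(round(pointwise_accuracy(pred, truth), 5))  # → 0.99723
print(event_score(pred, truth))                   # → 1
```

Scored point-by-point, the detector earns an impressive-looking 0.99723; scored as the single event it actually is, it simply earns 1-out-of-1, exactly as a detector that touched the event at only one point would.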


71 of 75

Any high precision claims are necessarily wrong

[Figure: SWaT-FIT401 (zoom-in), indices 227,900 to 263,700.]

The original labels are almost certainly "sloppy" and "rough".

The fact that they are rounded to the nearest 100 suggests as much; moreover, the actual labeling error is greater than 100 datapoints.

I have argued elsewhere that it is essentially never possible or meaningful to make high-precision claims about exactly when an anomaly occurs.

[Figure annotations: the labeled anomaly boundaries are off by 111 and 3,859 points.]

72 of 75

Others in the community are beginning to get it…

[Figure: SWaT-FIT401 (zoom-in).]

…Thus we conclude that evaluations on SWaT are highly unreliable and that these datasets are not suited for multivariate time-series AD evaluation.

Maja Rudolph, Bosch AI

https://ml.informatik.uni-kl.de/publications/2023/TimeSeAD.pdf

Dennis Wagner, Tobias Michels, Florian C. F. Schulz, Arjun Nair, Maja Rudolph, Marius Kloft: TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection. Trans. Mach. Learn. Res. 2023 (2023)

73 of 75

Comparison of DAMP and 5 state-of-the-art TSAD methods

  • We compare DAMP with 5 state-of-the-art TSAD methods on the KDD Cup 2021 dataset.

| Method | Accuracy | Train and Test Time |
| --- | --- | --- |
| USAD | 0.276 | 8.05 hours |
| LSTM-VAE | 0.198 | 23.6 hours |
| AE | 0.236 | 6.11 hours |
| Telemanom | Out of memory error on longer examples | — |
| SCRIMP (Full-MP) | 0.416 | 24.5 minutes |
| DAMP (Left-MP), out-of-the-box | 0.512 | 4.26 hours |
| DAMP (Left-MP), sharpened data | 0.632 | 4.26 hours |

  • DAMP outperforms the deep learning-based methods in both accuracy and execution time.
    • Note that DAMP requires only one parameter, compared to the many of the deep learning methods, and that parameter can be obtained automatically, without any human intervention or tuning.
  • DAMP is more accurate than SCRIMP.
    • SCRIMP computes the Full-MP, which is vulnerable to the "twin-freak" problem.

74 of 75

The reception to DAMP by the community

  • The accuracy of DAMP is also supported by work published by other teams.
  • A recent paper* by a group at Alibaba confirms that DAMP is the most accurate of the 13 approaches they tried, including deep learning methods. (They fall back on noting that their method is faster; however, had they compared against golden DAMP or X-lag DAMP, it would not be.)

*OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting

75 of 75

Comparison of DAMP and rival methods

| | DAMP | Telemanom | NORMA | SCRIMP |
| --- | --- | --- | --- | --- |
| Number of parameters | 1 | 16 | 2 | 1 |
| Model training time | 0 | > 0 | > 0 | 0 |
| Throughput | 358,000 Hz | 748 Hz | 200 Hz | 163 Hz |
| Can handle batch data? | Yes | Yes | Yes | Yes |
| Can handle streaming data? | Yes | Yes (however, it takes too long to model) | No | No |
| Can handle concept drift? | Yes | No | No | AB-join SCRIMP: No; self-join SCRIMP: Yes |
| Can handle multidimensional data? | Yes | Yes in principle, but not shown in the paper | No | No |
| Is stochastic? | No | Yes | Unclear | No |