
X International Conference "Information Technology and Implementation" (IT&I-2023), November 20, 2023, Taras Shevchenko National University of Kyiv, Kyiv, Ukraine


Parallel and Distributed Machine Learning for Anomaly Detection Systems

Bohdan Koval, Iulia Khlevna


The purpose of this research:

  • explore how parallel and distributed machine learning techniques can be harnessed to make anomaly detection systems operate with greater speed and efficiency;
  • carry out benchmarking and performance evaluation to measure the efficiency gains and the reduction in processing time:
    • differentiate parallel and distributed machine learning;
    • formulate a parallel machine learning technique and analyze its performance;
    • elaborate on a distributed machine learning technique and how it can be applied.


Parallel and distributed computing

Parallel Computing:

  • Concurrency is achieved through simultaneous execution of numerous operations.
  • A solitary computing unit is sufficient to execute the tasks.
  • Concurrent operations are performed by multiple processors within a single system.
  • May encompass shared or distributed memory resources.
  • Inter-processor communication typically occurs via a shared memory bus.
  • Enhances the overall performance of a system.

Distributed Computing:

  • System components are geographically distributed across distinct locations.
  • Utilizes a network of multiple distinct computing units.
  • Concurrent operations are distributed across multiple discrete computing systems.
  • Solely employs distributed memory resources.
  • Communication between computing units relies on message passing protocols.
  • Enhances system scalability, fault tolerance, and resource sharing capabilities.


Parallel and distributed computing

  • Parallelism: Utilize the parallel processing or multi-threading capabilities available in your programming environment. Libraries like scikit-learn have options for parallel processing; this can be used to distribute the computation across multiple CPU cores and drastically reduce processing time (see the sketch below).
  • Distributed Computing: If your dataset is extremely large and cannot fit in memory, consider using distributed computing frameworks like Apache Spark. These frameworks can distribute the data and computation across a cluster of machines.
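
As an illustration of the first point, here is a minimal sketch of core-level parallelism with scikit-learn. The IsolationForest model, its parameters, and the synthetic dataset are illustrative assumptions, not the exact setup used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for anomaly detection features (not the study's data)
X, _ = make_classification(n_samples=100_000, n_features=20, random_state=42)

# n_jobs=-1 lets scikit-learn fit the ensemble's trees on all available CPU cores
model = IsolationForest(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X)

labels = model.predict(X)  # 1 = normal, -1 = anomaly
```

The same n_jobs idea applies to many scikit-learn estimators; when the data no longer fits in memory, the equivalent role is played by a cluster framework such as Apache Spark.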


Two primary types of parallelism:

  • model parallelism, which focuses on dividing a large model into smaller, manageable components that can be trained concurrently;
  • data parallelism, a technique where subsets of the dataset are distributed to multiple processing units for simultaneous training (see the sketch after this list).
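
A minimal sketch of the data-parallel pattern, assuming Python's multiprocessing.Pool and scikit-learn; the shard count, the model, and the score-averaging step are illustrative assumptions rather than the study's exact setup.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.ensemble import IsolationForest

def fit_shard(shard: np.ndarray) -> IsolationForest:
    """Train an independent model on one subset (shard) of the data."""
    return IsolationForest(n_estimators=100, random_state=0).fit(shard)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40_000, 10))      # stand-in dataset
    shards = np.array_split(X, 4)          # one subset per processing unit

    # Data parallelism: each worker trains on its own shard simultaneously.
    with Pool(processes=4) as pool:
        models = pool.map(fit_shard, shards)

    # One simple way to combine the shard models: average their anomaly scores.
    scores = np.mean([m.score_samples(X) for m in models], axis=0)
```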



Amdahl's law

S_latency = 1 / ((1 - p) + p / s),

where S_latency is the theoretical speedup in the time it takes to complete the entire task, s is the speedup of the part of the task that benefits from parallelization, and p is the proportion of the total task time that the parallelizable part occupies before parallelization.

Since S_latency ≤ 1 / (1 - p), even a small portion of the program that cannot be parallelized restricts the maximum speedup achievable through parallelization.
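
A small numeric illustration of the formula; the values of p and s below are assumptions chosen for illustration, roughly echoing the ~2x speedup from 4x resources reported later in this work.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup S_latency when the parallelizable fraction p
    of the task is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# If about two thirds of the runtime is parallelizable, 4 workers give 2x overall:
print(amdahl_speedup(p=2 / 3, s=4))     # 2.0
# Even unlimited workers cannot beat the serial part: the limit is 1 / (1 - p) = 3.
print(amdahl_speedup(p=2 / 3, s=1e9))   # ~3.0
```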


Data processing performance comparison

Processing approach                    | Result
Regular (sequential) processing        | 19.341 seconds
Pool parallel processing               |  9.120 seconds
Pool parallel processing (threads)     | 19.125 seconds
Joblib parallel processing             | 16.912 seconds
Joblib parallel processing (threads)   | 19.213 seconds
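
The exact workload behind these figures is not reproduced here, but the comparison can be set up along the following lines; the toy heavy() task, the item count, and the worker count are assumptions. Because heavy() is CPU-bound pure Python, the GIL keeps the thread-based variants close to the sequential time, which is consistent with the pattern in the table.

```python
import math
import time
from multiprocessing import Pool
from joblib import Parallel, delayed

def heavy(n: int) -> float:
    """CPU-bound stand-in for one unit of data processing."""
    return sum(math.sqrt(i) for i in range(n))

ITEMS = [200_000] * 32

if __name__ == "__main__":
    t0 = time.perf_counter()
    _ = [heavy(n) for n in ITEMS]                      # regular (sequential)
    print("sequential:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    with Pool(processes=4) as pool:                    # Pool, separate processes
        _ = pool.map(heavy, ITEMS)
    print("Pool processes:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    _ = Parallel(n_jobs=4, prefer="threads")(          # joblib, threads
        delayed(heavy)(n) for n in ITEMS
    )
    print("joblib threads:", time.perf_counter() - t0)
```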


Parallel feature engineering comparison

Feature engineering processing approach              | Result
Grouping features (sequential)                       | 1.721 seconds
Grouping features shifted (sequential)               | 1.981 seconds
Pool grouping features parallel                      | 2.820 seconds
Pool grouping features parallel (threads)            | 1.101 seconds
Pool grouping features shifted parallel              | 2.521 seconds
Pool grouping features shifted parallel (threads)    | 1.226 seconds
Joblib grouping features parallel (threads)          | 1.009 seconds
Joblib grouping features shifted parallel (threads)  | 1.182 seconds
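
Here the thread-based variants win, which is consistent with grouping work that runs largely inside compiled pandas/NumPy routines and avoids copying the DataFrame to worker processes. A hedged sketch of thread-parallel grouping features follows; the DataFrame and column names are illustrative, not the study's real features.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Illustrative event log standing in for the anomaly detection dataset
n = 500_000
df = pd.DataFrame({
    "user_id": np.random.randint(0, 1_000, size=n),
    "amount": np.random.random(size=n),
    "duration": np.random.random(size=n),
})

def grouping_features(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Per-user aggregates of one column (a 'grouping features' step)."""
    return frame.groupby("user_id")[column].agg(["mean", "std", "max"])

# Threads share the DataFrame in memory, so no inter-process copying is needed.
features = Parallel(n_jobs=4, prefer="threads")(
    delayed(grouping_features)(df, col) for col in ["amount", "duration"]
)
```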


Distributed machine learning

Distributed training becomes a necessity under the following circumstances:

  1. Time-Intensive Training
  2. Storage Constraints
  3. Data Localization
  4. RAM Constraints


Distributed machine learning

Distributed machine learning training typically involves the following steps (a minimal sketch follows the list):

  1. Model Initialization
  2. Model Distribution
  3. Gradient Calculation
  4. Gradient Communication
  5. Model Update
  6. Iterative Process
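
A minimal, single-process simulation of these six steps for a simple linear model; it is purely illustrative, since a real system would place each shard on a separate machine and use a framework such as Apache Spark or a parameter server for the communication step.

```python
import numpy as np

def local_gradient(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Mean-squared-error gradient computed on one worker's data shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8_000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=8_000)

w = np.zeros(5)                                                   # 1. model initialization
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))    # 2. model/data distribution to 4 workers

for step in range(100):                                           # 6. iterative process
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]      # 3. gradient calculation on each worker
    avg_grad = np.mean(grads, axis=0)                             # 4. gradient communication (averaging)
    w -= 0.01 * avg_grad                                          # 5. model update, shared with all workers
```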


Conclusion

  • Parallel processing can improve training speed in both cases, but the size of the gain depends on the input size: the larger the input, the greater the improvement. Broadly speaking, we observe roughly a 2x speedup when allocating 4x the resources.
  • The benefits of distributed training are particularly evident in scenarios involving extensive time-consuming training processes, storage constraints, data localization requirements, and RAM limitations.