NDS: N-Dimensional Storage

Team - sleep(10);

Mahendra Patel

203050078

Outline

  • Introduction
    • One-dimensional, serial, conventional, GPU
  • Problems
  • Challenges
  • NDS
    • NDS Overview
    • STL
    • NDS Prototype
  • Evaluation
  • Conclusion

INTRODUCTION

Multi-dimensional storage

  • Applications using high-dimensional datasets, such as:
    • Neural networks, computer vision, AI, graph analytics
  • Shift from powerful general-purpose processors to highly efficient domain-specific compute engines
  • 1-order vector (Nvidia’s Tensor Core Units), 2-order tensor (Google’s Tensor Processing Units)

Overview

I perform best at [mxn]

I have N parallel channels so I can service N parallel request

But I don't know how to optimally marshal and unmarshal a high-dimensional object and store it on the SSD

But I don't know the exact architecture of accelerator or storage device

I am sorry, application, but you need to do it yourself

NDS

Application-defined Multi-dimensional Memory/Storage Abstraction

  • Modern accelerator-based architectures
  • Problem and challenges
  • NDS

Accessing a submatrix from SSD

  • The raw input matrix is 16K×16K.
  • SSD has 8KB pages and 8 parallel channels; each SSD page stores 2K elements in floating-point format.
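
The numbers above can be turned into a quick page count. This is only a sketch: it assumes a row-major layout with each row starting on a page boundary, and the helper name `pages_for_submatrix` is illustrative, not part of NDS.

```python
PAGE_BYTES = 8 * 1024        # 8 KB SSD page
ELEMS_PER_PAGE = 2 * 1024    # 2K elements per page
MATRIX_DIM = 16 * 1024       # 16K x 16K raw input matrix

elem_bytes = PAGE_BYTES // ELEMS_PER_PAGE      # 4 bytes per element
pages_per_row = MATRIX_DIM // ELEMS_PER_PAGE   # 8 pages hold one matrix row


def pages_for_submatrix(sub_rows, sub_cols, col_offset=0):
    """Pages touched when reading a submatrix from a row-major layout.

    Assumes every matrix row starts on a page boundary.
    """
    first_page = col_offset // ELEMS_PER_PAGE
    last_page = (col_offset + sub_cols - 1) // ELEMS_PER_PAGE
    return sub_rows * (last_page - first_page + 1)
```

Under these assumptions, a 512×512 submatrix touches 512 pages but uses only a quarter of the elements in each page it reads.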

  • Modern accelerator-based architectures
  • Problem and challenges
  • NDS

Problems

  • The overhead of marshalling input data.
  • Underutilization of interconnect bandwidth.
  • Underutilization of device bandwidth.

Execution time of matrix multiplication

  • Data already present in memory
    • If the application needs to fetch an 8K×8K submatrix, the row-store format requires the program to issue 8,192 I/O requests, one for each of the 8,192 rows that contain the submatrix items.
    • The sequential baseline configuration requires 2.11× more time to avoid CPU overhead than the sub-block configuration does.

Problems

  • The overhead of marshalling input data.
  • Underutilization of interconnect bandwidth.
  • Underutilization of device bandwidth.

  • Interconnect bandwidth saturates once each request is larger than 2 MB.
  • Each I/O request that fetches one row of an 8K×8K submatrix moves only 32 KB of data.
  • Result: the interconnect bandwidth is underutilized.
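
A few lines of arithmetic make the gap concrete. The sketch assumes 4-byte elements (implied by 2K elements per 8 KB page) and takes the 2 MB saturation point from the first bullet; the variable names are illustrative.

```python
ELEM_BYTES = 4                       # assumed: 8 KB page / 2K elements
SUB_DIM = 8 * 1024                   # 8K x 8K submatrix
SATURATION_BYTES = 2 * 1024 * 1024   # request size that saturates the link

# Row-store: one request per submatrix row.
row_request_bytes = SUB_DIM * ELEM_BYTES             # bytes moved per request
requests_row_store = SUB_DIM                         # number of requests

# Hypothetical alternative: the same data moved in 2 MB requests.
total_bytes = SUB_DIM * SUB_DIM * ELEM_BYTES
requests_at_saturation = total_bytes // SATURATION_BYTES

# How far each row request falls short of the saturation size.
shortfall_factor = SATURATION_BYTES // row_request_bytes
```

With these inputs each row request carries 32 KB, which is 64 times smaller than the saturating request size, and the same 256 MB could in principle move in 128 requests instead of 8,192.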

Problems

  • The overhead of marshalling input data.
  • Underutilization of interconnect bandwidth.
  • Underutilization of device bandwidth.

This problem worsens as the number of channels in the device increases.

Execution time of matrix multiplication

  • Data already present in SSD
    • The baseline spends 1.92× more time fetching data than an SSD configuration with an optimal data layout.

Challenges

  • Unavailability of internal memory-device architecture to applications.
  • Unpredictability of optimal dimensionality in compute kernels.
  • Demand mismatch between storage devices and compute kernels.

I don’t know the internal device structure.

Also, SSD functions such as garbage collection and wear leveling can shuffle data locations.

Challenges

  • Unavailability of internal memory-device architecture to applications.
  • Unpredictability of optimal dimensionality in compute kernels.
  • Demand mismatch between storage devices and compute kernels.

A row-oriented layout can maximize GPU throughput for pair-wise matrix operations, but it is inefficient for matrix-multiplication kernels.

Optimal Dimensionality of different devices

  • The optimal submatrix size that maximizes performance on the CUDA cores is 2048×2048, while on Tensor Cores it is 512×512.
  • This large gap in optimal data sizes stems from the devices' different internal architectures.

Challenges

  • Unavailability of internal memory-device architecture to applications.
  • Unpredictability of optimal dimensionality in compute kernels.
  • Demand mismatch between storage devices and compute kernels.

The kernel worked best on 512×512 submatrices, but the bandwidth of the consumer SSD is maximized when each I/O request fetches a 16K×16K submatrix.

  • Modern accelerator-based architectures
  • Problem and challenges
  • NDS

NDS

  • Provides commands in multi-dimensional addressing modes.
    • Minimizes the number of I/O requests while maximizing the data volume that each request moves.
  • NDS automatically and implicitly transforms a data object into an application’s required dimensionality.
  • NDS decouples storage-device granularity and data layout from an application’s view.
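
To illustrate the idea of multi-dimensional addressing, here is a toy in-memory model. It is not the NDS command set: the class `ToyNDSpace` and its `write`/`read` methods are hypothetical, and it handles only 2-D data stored as square tiles.

```python
class ToyNDSpace:
    """In-memory stand-in for a 2-D address space (hypothetical API)."""

    def __init__(self, rows, cols, block):
        self.rows, self.cols, self.block = rows, cols, block
        self.blocks = {}  # (block_row, block_col) -> tile as list of lists

    def write(self, matrix):
        """Store a row-major matrix as block x block tiles."""
        b = self.block
        for br in range(0, self.rows, b):
            for bc in range(0, self.cols, b):
                tile = [row[bc:bc + b] for row in matrix[br:br + b]]
                self.blocks[(br // b, bc // b)] = tile

    def read(self, origin, shape):
        """Return the submatrix at `origin` with `shape` in a single call.

        The caller addresses data by coordinates and never sees the tiles.
        """
        (r0, c0), (h, w) = origin, shape
        b = self.block
        return [
            [self.blocks[(r // b, c // b)][r % b][c % b]
             for c in range(c0, c0 + w)]
            for r in range(r0, r0 + h)
        ]


# Usage: an 8x8 matrix stored as 4x4 tiles, read back as a 3x2 submatrix.
space = ToyNDSpace(rows=8, cols=8, block=4)
space.write([[r * 8 + c for c in range(8)] for r in range(8)])
sub = space.read(origin=(2, 3), shape=(3, 2))
```

The caller issues one request for the submatrix, while the tile layout stays an internal detail that could change without affecting the application.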

Accessing a submatrix from NDS

STL - Space translation layer

  • Gauges application demands and memory/storage-device characteristics
  • Breaks down datasets into building blocks
    • A building block is a fixed-size logical chunk of data storage
    • A complete building block stores its data in units accessible through all parallel channels
  • Maintains a mapping of the dataset’s building blocks
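
A minimal sketch of how a building block could be sized and how a request maps onto blocks. It assumes a square 2-D block that fills exactly one page on every channel; the slides do not state how the block shape is chosen, and both function names are hypothetical.

```python
import math


def building_block_side(channels, page_bytes, elem_bytes):
    """Side length of a square block holding one page per channel (assumed)."""
    elems = channels * page_bytes // elem_bytes
    side = math.isqrt(elems)
    if side * side != elems:
        raise ValueError("block capacity is not a perfect square")
    return side


def blocks_for_region(origin, shape, side):
    """Block coordinates overlapped by a 2-D region of the dataset."""
    (r0, c0), (h, w) = origin, shape
    return [
        (br, bc)
        for br in range(r0 // side, (r0 + h - 1) // side + 1)
        for bc in range(c0 // side, (c0 + w - 1) // side + 1)
    ]


# Example device: 8 channels, 8 KB pages, 4-byte elements.
side = building_block_side(channels=8, page_bytes=8 * 1024, elem_bytes=4)
touched = blocks_for_region(origin=(100, 0), shape=(100, 256), side=side)
```

For this example device the block is 128×128 elements, so reading any one block can draw on all eight channels at once.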

Locating and allocating building blocks

  • STL maintains a B-tree structure for each N-D space to locate a building block.
  • Each level corresponds to one dimension.
  • Node degree:
    • D_i is the size of the i-th dimension of the space
    • bb_i is the size of the i-th dimension of a building block
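
The lookup can be sketched with one tree level per dimension. Two assumptions are made here: the node degree at level i is taken to be ceil(D_i / bb_i), which the text above does not state, and nested dictionaries stand in for a real B-tree. The names `node_degrees` and `BlockIndex` are hypothetical.

```python
import math


def node_degrees(dims, bb):
    """Assumed per-level degree: ceil(D_i / bb_i)."""
    return [math.ceil(d / b) for d, b in zip(dims, bb)]


class BlockIndex:
    """Per-dimension lookup tree (nested dicts, not a real B-tree)."""

    def __init__(self, dims, bb):
        self.dims, self.bb = dims, bb
        self.root = {}

    def _path(self, coord):
        # One key per level: the block index along that dimension.
        for x, d in zip(coord, self.dims):
            if not 0 <= x < d:
                raise IndexError("coordinate outside the N-D space")
        return [x // b for x, b in zip(coord, self.bb)]

    def allocate(self, coord, location):
        """Record where the block containing `coord` is stored."""
        path = self._path(coord)
        node = self.root
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = location

    def locate(self, coord):
        """Walk one level per dimension; raises KeyError if unallocated."""
        node = self.root
        for key in self._path(coord):
            node = node[key]
        return node


index = BlockIndex(dims=(16384, 16384), bb=(128, 128))
index.allocate((130, 5), "block-A")      # placeholder location value
found = index.locate((255, 127))         # same building block as (130, 5)
```

Any coordinate inside the same building block resolves to the same stored location, so the tree depth equals the number of dimensions.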

System implementations

Evaluation

Row access

  • Matrix dimensions range from 512×32,768 to 4,096×32,768
  • Hardware-assisted NDS achieves performance identical to the baseline SSD

Column access

  • Matrix dimensions range from 32,768×512 to 32,768×4,096
  • Hardware NDS performance is similar to the baseline with column-store.

Sub-matrices access

  • Fetches submatrices of various sizes.
  • NDS significantly outperforms the baseline SSD regardless of NDS implementation type, because the baseline underutilizes both the interconnect and the device bandwidth.

End-to-end application latency

  • The software-only implementation achieves a speedup of 5.07×.
  • Hardware NDS raises the speedup to 5.73×, because the hardware implementation removes both computation overhead and host traffic.

Conclusion

  • NDS provides multidimensional address spaces for applications.
  • Decouples storage dimensionality from an application's optimal dataset dimensionality by reconstructing data objects.
  • Addresses underutilization of both interconnect bandwidth and device bandwidth.
  • The hardware-assisted NDS version achieves an average 5.73× speedup over a datacenter-class SSD baseline.

Thank you

Any questions?