1 of 42

NDS: N-Dimensional Storage

Team - sleep(10);

Mahendra Patel

203050078

2 of 42

Outline

Introduction

One-dimensional, serial, conventional, GPU

Problems
Challenges
NDS

NDS Overview
STL
NDS Prototype

Evaluation
Conclusion

3 of 42

INTRODUCTION

4 of 42

Multi-dimensional storage

Applications use high-dimensional datasets, for example:

Neural nets, vision, AI, graph analytics

Shift from powerful general-purpose processors to highly efficient domain-specific compute engines
1-order vector (Nvidia’s Tensor Core Units), 2-order tensor (Google’s Tensor Processing Units)

5 of 42

Overview

I perform best at [mxn]

I have N parallel channels so I can service N parallel request

But I don't know how to optimally marshal and unmarshal a high-dimensional object and store it on the SSD

But I don't know the exact architecture of the accelerator or the storage device

I am sorry, application, but you need to do it yourself

6 of 42

NDS

Application-defined Multi-dimensional Memory/Storage Abstraction

7 of 42

Modern accelerator-based architectures
Problem and challenges
NDS

8 of 42

Accessing a submatrix from SSD

Raw input matrix is 16K×16K.
SSD has 8KB pages and 8 parallel channels; each SSD page stores 2K elements in floating-point format.
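
The arithmetic behind this setup can be checked with a short sketch. It assumes 4-byte floats (implied by an 8KB page holding 2K elements) and a plain row-store layout; the variable names are illustrative only.

```python
# Geometry taken from the slide; ELEM_BYTES = 4 is an assumption implied by
# "each 8KB page stores 2K elements".
PAGE_BYTES = 8 * 1024
ELEM_BYTES = 4
N = 16 * 1024                      # the raw matrix is N x N

elems_per_page = PAGE_BYTES // ELEM_BYTES   # elements that fit in one page
pages_per_row = N // elems_per_page         # pages needed for one matrix row
total_pages = N * pages_per_row             # pages for the whole matrix
matrix_bytes = N * N * ELEM_BYTES           # total size of the matrix

print(elems_per_page, pages_per_row, total_pages, matrix_bytes // 2**30, "GiB")
```

Under these assumptions one matrix row occupies exactly as many pages as the SSD has channels (8), so a full-row read can in principle keep every channel busy, while a partial-row read cannot.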

9 of 42

Modern accelerator-based architectures
Problem and challenges
NDS

10 of 42

Problems

The overhead of marshalling input data.
Underutilization of interconnect bandwidth.
Underutilization of device bandwidth.

11 of 42

Execution time of matrix multiplication

Data already present in memory

If the application needs to fetch an 8K×8K submatrix, the row-store format requires the program to issue 8,192 I/O requests, one for each of the 8,192 rows that contain the submatrix items.
The sequential baseline configuration requires 2.11× more time than the sub-block configuration, which avoids the CPU marshalling overhead.

12 of 42

Problems

The overhead of marshalling input data.
Underutilization of interconnect bandwidth.
Underutilization of device bandwidth.

13 of 42

Interconnect bandwidth saturates only if each request is larger than 2MB.
Each I/O request that fetches one row of an 8K×8K submatrix moves only 32KB of data.
Such small requests underutilize the interconnect bandwidth.
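
A minimal sketch of the numbers above, assuming 4-byte elements and one I/O request per submatrix row in a row-store layout; `row_store_requests` is a hypothetical helper, not part of any real API.

```python
ELEM_BYTES = 4                          # assumption: 4-byte floats
SATURATION_BYTES = 2 * 1024 * 1024      # request size quoted on the slide

def row_store_requests(sub_rows, sub_cols):
    """Return (request count, bytes per request) for a row-store layout,
    where every row of the submatrix needs its own I/O request."""
    return sub_rows, sub_cols * ELEM_BYTES

count, size = row_store_requests(8192, 8192)
fraction = size / SATURATION_BYTES      # share of the saturating request size

print(count, size // 1024, "KB", fraction)
```

Each request carries only 1/64 of the size needed to saturate the interconnect, which is why thousands of small row requests leave the link underutilized.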

14 of 42

Problems

The overhead of marshalling input data.
Underutilization of interconnect bandwidth.
Underutilization of device bandwidth.

15 of 42

This problem worsens as the number of channels in the device increases.

16 of 42

Execution time of matrix multiplication

Data already present in SSD

The baseline spends 1.92× more time fetching data than an SSD configuration with an optimal data layout.

17 of 42

Challenges

Unavailability of internal memory-device architecture to applications.
Unpredictability of optimal dimensionality in compute kernels.
Demand mismatch between storage devices and compute kernels.

18 of 42

I don’t know about internal device structure

Also, functions such as garbage collection and wear leveling in the SSD can shuffle data locations.

19 of 42

Challenges

Unavailability of internal memory-device architecture to applications.
Unpredictability of optimal dimensionality in compute kernels.
Demand mismatch between storage devices and compute kernels.

A row-oriented layout can maximize GPU throughput for pair-wise matrix operations, but it is inefficient for matrix-multiplication kernels.

20 of 42

Optimal Dimensionality of different devices

The optimal submatrix size that maximizes performance on the CUDA cores is 2048×2048, while on the Tensor Cores it is 512×512.
This extreme difference in optimal data sizes comes from the different internal device architectures.

21 of 42

Challenges

Unavailability of internal memory-device architecture to applications.
Unpredictability of optimal dimensionality in compute kernels.
Demand mismatch between storage devices and compute kernels.

The kernel worked best on 512×512 submatrices, but the bandwidth of the consumer SSD is maximized only if each I/O request fetches 16K×16K submatrices.

22 of 42

Modern accelerator-based architectures
Problem and challenges
NDS

23 of 42

NDS

Provides commands in multi-dimensional addressing modes.

Minimizes the number of I/O requests while maximizing the data volume that each request moves.

NDS automatically and implicitly transforms a data object into an application’s required dimensionality.
NDS decouples storage-device granularity and data layout from an application’s view.
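
To make the idea of multi-dimensional addressing concrete, here is a toy in-memory model. `ToyNDSpace` and `read_submatrix` are hypothetical names invented for this sketch; they are not the actual NDS commands, and the flat list merely stands in for whatever layout the device uses internally.

```python
class ToyNDSpace:
    """Hypothetical stand-in for an NDS-style 2-D space (not the real API)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.flat = [0.0] * (rows * cols)   # backing layout, hidden from the app
        self.requests = 0                   # application-visible request count

    def write(self, r, c, value):
        self.flat[r * self.cols + c] = value

    def read_submatrix(self, origin, shape):
        """Serve one request addressed by a (row, col) origin and an extent."""
        r0, c0 = origin
        h, w = shape
        if r0 < 0 or c0 < 0 or r0 + h > self.rows or c0 + w > self.cols:
            raise IndexError("submatrix out of bounds")
        self.requests += 1
        out = []
        for i in range(h):
            start = (r0 + i) * self.cols + c0
            out.append(self.flat[start:start + w])
        return out

space = ToyNDSpace(8, 8)
for r in range(8):
    for c in range(8):
        space.write(r, c, float(r * 10 + c))

tile = space.read_submatrix(origin=(2, 4), shape=(2, 3))
print(tile, space.requests)
```

The application names the region it wants in its own coordinates and issues a single request; how the region is gathered from the underlying layout stays behind the interface.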

31 of 42

Accessing a submatrix from NDS

32 of 42

STL - Space translation layer

Gauges application demands and memory/storage-device characteristics
Breaks down datasets into building blocks

A building block is a fixed-size logical chunk of data storage
A complete building block stores its data in units accessible through all parallel channels

Maintains a mapping of the dataset’s building blocks
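
The points above can be sketched as follows. The 128×128 building-block size, the round-robin placement across channels, and the function name `build_block_map` are all assumptions made for illustration; 128×128 4-byte elements happen to fill one 8KB page on each of 8 channels, which matches the SSD example used earlier in the deck.

```python
from itertools import product

PAGE_BYTES, ELEM_BYTES, CHANNELS = 8192, 4, 8   # ELEM_BYTES is an assumption

def build_block_map(dims, bb):
    """Map every building-block coordinate to the (channel, page) units
    that hold its data, placing units round-robin across channels."""
    block_bytes = ELEM_BYTES
    for b in bb:
        block_bytes *= b
    pages_per_block = -(-block_bytes // PAGE_BYTES)     # ceiling division
    grid = [-(-d // b) for d, b in zip(dims, bb)]       # blocks per dimension
    next_page = [0] * CHANNELS                          # next free page per channel
    mapping = {}
    for coord in product(*(range(g) for g in grid)):
        units = []
        for p in range(pages_per_block):
            ch = p % CHANNELS
            units.append((ch, next_page[ch]))
            next_page[ch] += 1
        mapping[coord] = units
    return mapping

block_map = build_block_map(dims=(512, 512), bb=(128, 128))
print(len(block_map), block_map[(0, 1)])
```

With these sizes every building block touches all 8 channels exactly once, so fetching a single block can use the full internal parallelism of the device.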

33 of 42

Locating and allocating building blocks

STL maintains a B-tree structure for each N-D space to locate a building block.

Each level corresponds to one dimension.
Node degree at level i: Di / bbi

Di represents the size of the ith-order dimension
bbi represents the size of the ith-order dimension of a building block
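
A minimal sketch of such a lookup structure, using nested dictionaries in place of a real B-tree. It takes the node degree at level i to be ceil(Di / bbi); the ceiling, the lazy allocation policy, and the names `BlockTree` and `locate` are assumptions for illustration.

```python
import math

class BlockTree:
    """One tree level per dimension; leaves hold building-block ids."""

    def __init__(self, dims, bb):
        self.dims, self.bb = dims, bb
        # Assumed node degree per level: ceil(D_i / bb_i).
        self.degrees = [math.ceil(d / b) for d, b in zip(dims, bb)]
        self.root = {}
        self.next_id = 0

    def _path(self, coord):
        for x, d in zip(coord, self.dims):
            if not 0 <= x < d:
                raise IndexError("coordinate outside the N-D space")
        # Child index at level i is the block index along dimension i.
        return [x // b for x, b in zip(coord, self.bb)]

    def locate(self, coord, allocate=False):
        """Return the id of the block containing coord, or None if the block
        is not allocated (allocating it first when allocate=True)."""
        path = self._path(coord)
        node = self.root
        for idx in path[:-1]:
            if idx not in node:
                if not allocate:
                    return None
                node[idx] = {}
            node = node[idx]
        leaf = path[-1]
        if leaf not in node:
            if not allocate:
                return None
            node[leaf] = self.next_id
            self.next_id += 1
        return node[leaf]

tree = BlockTree(dims=(16384, 16384), bb=(128, 128))
first = tree.locate((0, 0), allocate=True)
second = tree.locate((128, 0), allocate=True)
print(tree.degrees, first, tree.locate((127, 127)), second)
```

Any element coordinate inside the same 128×128 region resolves to the same building block, and blocks are created only when first written.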

34 of 42

System implementations

35 of 42

Evaluation

36 of 42

Row access

Matrix dimensions range from 512×32,768 to 4,096×32,768.
Hardware-assisted NDS achieves performance identical to the baseline SSD.

37 of 42

Column access

Matrix dimensions range from 32,768×512 to 32,768×4,096.
H/W NDS performance is similar to the baseline with col-store.

38 of 42

Sub-matrices access

Fetches submatrices of various sizes.
NDS significantly outperforms the baseline SSD, regardless of the NDS implementation type, because the baseline underutilizes both the interconnect and the device bandwidth.

39 of 42

End-to-end application latency

The software-only implementation achieves a speedup of 5.07×.
Hardware NDS accelerates applications further, to a speedup of 5.73×,
because the H/W implementation removes both computation overhead and host traffic.

40 of 42

Conclusion

NDS provides multidimensional address spaces for applications.
Decouples storage dimensionality from the application's optimal dataset dimensionality by reconstructing data objects.
Addresses underutilization of both interconnect bandwidth and device bandwidth.
Hardware-assisted NDS version achieves an average 5.73× speedup over a datacenter-class SSD baseline.

41 of 42

Reference

NDS: N-Dimensional Storage: https://dl.acm.org/doi/pdf/10.1145/3466752.3480122
Domain Specific Architecture: https://www.cse.iitb.ac.in/~biswa/courses/CS773/lectures/L7.pdf

42 of 42

Thank you

Any questions?
