1 of 67


Be SMART, Save I/O: Probabilistic Approach to Avoid Uncorrectable Errors in Storage Systems

Md Arifuzzaman, Masudul Bhuiyan, Mehmet Gumus, and Engin Arslan

University of Nevada, Reno, CISPA Helmholtz Center for Information Security

2 of 67

Big Data of Science

The Event Horizon Telescope produces 4.5 petabytes of astronomy data every day.

The Large Hadron Collider (LHC) experiment at CERN produces more than 120 PB of data per year.

The storage capacity of the Frontier supercomputer is around 700 petabytes.

3 of 67

Error is Inevitable


  • Hard drives use Error Correcting Code (ECC) to detect and recover from errors.

  • However, a non-negligible portion of errors escapes ECC; these are referred to as Uncorrectable Errors (UE).

  • Uncorrectable errors occur once in every 10^12 (125 GB) to 10^15 (125 TB) bits of I/O.

  • 20-57% of disks in Google and Facebook data centers experienced at least one UE in their lifetime.
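For intuition, the quoted error rates translate directly into data volumes; a quick stdlib-only sketch (the 125 GB / 125 TB figures use decimal units, 1 GB = 10^9 bytes):

```python
def error_interval_bytes(bits_per_error: float) -> float:
    """Bytes of I/O per expected uncorrectable error (8 bits per byte)."""
    return bits_per_error / 8

# One error per 10^12 bits -> 125 GB; one per 10^15 bits -> 125 TB
print(error_interval_bytes(1e12) / 1e9)   # 125.0 (GB)
print(error_interval_bytes(1e15) / 1e12)  # 125.0 (TB)
```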

4 of 67

Failure is Costly


  • Uncorrectable errors can lead to data loss, disk failure, or even service interruption.

  • According to Gartner, server downtime costs $5,600 per minute on average.

  • The average cost of a data center outage is $740,357.

5 of 67

Problem Statement


Can we predict uncorrectable errors before they happen?

Almost!

6 of 67

SMART Values

  • SMART (Self-Monitoring, Analysis and Reporting Technology) is a monitoring system
    • It is supported by most disk drives.
    • It provides various indicators of drive health.
    • It reports various types of disk errors such as Seek Error Rate, Read Error Rate, and Uncorrectable Error.
    • It also reports operational data, such as drive temperature and power-on hours.

  • SMART logs are typically populated once a day for each disk.


7 of 67

Dataset

  • February 2014 to September 2021 (93 months)

  • 270M SMART logs.

  • 200K unique disks in total, but only 30K-120K of them were active on any given day.

  • 9.25K disks had at least one uncorrectable error (error percentage: 8.7%).

  • 5% of disks accounted for 89% of all errors, and less than 10% of disks accounted for 92.4% of all errors.


8 of 67

SMART Attributes

ID    Attribute Name        Importance Score
7     Seek Error Rate       0.1711
9     Power-On Hours        0.1633
1     Read Error Rate       0.1346
193   Load Cycle Count      0.1208
187   Uncorrectable Error   0.1176

9 of 67

Performance Comparison of Machine Learning Models (accuracy %)

Drive           CNN-LSTM   DNN     DT     SVC     RF      RF*    XGB
ST12000NM0007   95.41      96.9    90.4   97.93   97.85   98.19  98.25
ST4000DM000     93.79      94.17   81.68  93.98   93.81   94.45  94.7
ST8000NM0055    90.08      91.82   81.94  94.81   94.32   94.87  95.57
ST8000DM002     92.2       93.85   83.9   95.87   95.17   95.47  96.05
ST3000DM001     88.13      91.3    76.8   88.71   91.83   92.63  91.78

CNN-LSTM and DNN are neural network models; RF* denotes the previously proposed model.

10 of 67

Application Scenarios to Utilize the Predictability of Uncorrectable Errors


11 of 67

End-to-End File Transfer

[Figure: data flows from the sender's disk through memory and the network to the receiver's memory and disk; an uncorrectable error can occur at the disk.]

12 of 67

End-to-End Integrity Verification

[Figure: checksums computed at the sender's and receiver's disks (e.g., 2367213558 on both ends) are compared after the transfer to detect corruption.]

13 of 67

Additional I/O Size for End-to-End Integrity Verification​​​

[Figure: for a 200 GB transfer, end-to-end integrity verification incurs 200 GB of additional I/O on the sender and 200 GB on the receiver.]

14 of 67

Impact of Integrity Verification on Transfer Time

[Figure: checksum computation takes 400 s on the sender and 487 s on the receiver, while the actual transfer time is only 312 s.]

15 of 67


16 of 67

Probabilistic Integrity Verification


17 of 67

Probabilistic Integrity Verification

[Figure: data in memory is striped across Disk 1-4.]

18 of 67

Probabilistic Integrity Verification

[Figure: an ML model assigns each disk a predicted error probability (Disk 1: p=0.05, Disk 2: p=0.02, Disk 3: p=0.6, Disk 4: p=0.9).]

19 of 67

Probabilistic Integrity Verification

[Figure: low-probability disks (p=0.05, p=0.02) are predicted error-free ("No Error") and skipped; verification runs only on the high-probability disks (p=0.6, p=0.9).]

20 of 67

Problem Definition

 


21 of 67

Verify all disks:
  • Captures all errors
  • Incurs high I/O overhead
  • Slow transfer speed

Verify no disks:
  • No I/O overhead
  • Fast transfers
  • Fails to capture errors

Verify only high-risk disks:
  • Low I/O overhead
  • Fast transfers
  • Captures most errors

Probabilistic Integrity Verification aims to select disks for integrity verification such that most errors are captured while checking as few disks as possible.

22 of 67

Problem Definition

Minimize the total data size on which we run integrity verification, while bounding the error probability for the rest of the data.
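The equation objects on this slide did not survive extraction. A plausible formalization of the objective stated above, with notation that is mine rather than the authors' (s_i: data size on disk i; p_i: its predicted error probability; ε: tolerated residual error probability; disk errors assumed independent):

```latex
\min_{S \subseteq \{1,\dots,N\}} \sum_{i \in S} s_i
\quad \text{subject to} \quad
1 - \prod_{i \notin S} (1 - p_i) \le \varepsilon
```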

23 of 67

Brute Force Solution

Enumerate every possible subset of the N disks (no disk, {1}, {2}, ..., {1,2}, {2,3}, ..., {1,2,...,N}), evaluate each candidate, and keep the subset that satisfies the error-probability constraint while verifying the least data. The search space grows as 2^N, so brute force is only feasible for small N.
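The brute-force enumeration can be sketched as follows (a minimal model assuming independent per-disk error probabilities; the sizes and probabilities are the illustrative values from the earlier four-disk figure, not real data):

```python
from itertools import combinations

def brute_force_select(sizes, probs, eps):
    """Return the subset of disks to verify that minimizes verified I/O
    while keeping the error probability of the unverified remainder <= eps.
    Assumes independent per-disk error probabilities."""
    n = len(sizes)
    best, best_io = None, float("inf")
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            chosen = set(subset)
            # Probability that at least one unverified disk has an error
            residual = 1.0
            for i in range(n):
                if i not in chosen:
                    residual *= 1.0 - probs[i]
            if 1.0 - residual <= eps and sum(sizes[i] for i in chosen) < best_io:
                best, best_io = chosen, sum(sizes[i] for i in chosen)
    return best, best_io

# Four disks with the example probabilities from the earlier figure
print(brute_force_select([50, 20, 10, 20], [0.05, 0.02, 0.6, 0.9], eps=0.1))
# → ({2, 3}, 30)
```

Exhaustive search is exact but costs O(2^N) subset evaluations, which motivates the greedy alternative on the next slide.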

24 of 67

Greedy Solution

Sort the disks by predicted error probability and repeatedly add the riskiest remaining disk to the result set (e.g., Disk 2 first, then Disk 3, ...) until the error-probability constraint is satisfied.
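A minimal sketch of the greedy strategy (assuming, as before, independent error probabilities; picking by highest predicted probability first is one plausible selection criterion):

```python
def greedy_select(sizes, probs, eps):
    """Greedily verify the riskiest disks until the error probability of
    whatever remains unverified drops to eps or below."""
    def residual_error(selected):
        r = 1.0
        for i, p in enumerate(probs):
            if i not in selected:
                r *= 1.0 - p
        return 1.0 - r

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = set()
    for i in order:
        if residual_error(selected) <= eps:
            break
        selected.add(i)
    return selected, sum(sizes[i] for i in selected)

print(greedy_select([50, 20, 10, 20], [0.05, 0.02, 0.6, 0.9], eps=0.1))
# → ({2, 3}, 30)
```

Greedy matches brute force on this toy example but runs in polynomial time; unlike brute force, it does not guarantee the minimum verified I/O in general.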

25 of 67

Evaluation Metrics

[Figure: four disks with predicted error probabilities p=0.05, 0.02, 0.6, 0.9 and per-disk data sizes of 10-50 MB, used to illustrate the coverage ratio and I/O save ratio.]
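A hedged sketch of the two metrics as I read them from the figure (these definitions are a reconstruction, not necessarily the paper's exact formulas): coverage ratio as the share of predicted error mass on verified disks, and I/O save ratio as the fraction of bytes left unverified.

```python
def coverage_ratio(probs, verified):
    """Share of total predicted error probability on verified disks."""
    return sum(p for p, v in zip(probs, verified) if v) / sum(probs)

def io_save_ratio(sizes, verified):
    """Fraction of bytes that skip integrity verification."""
    return 1 - sum(s for s, v in zip(sizes, verified) if v) / sum(sizes)

probs = [0.05, 0.02, 0.6, 0.9]         # per-disk error probabilities (figure)
sizes = [50, 20, 10, 20]               # MB per disk (illustrative, from figure)
verified = [False, False, True, True]  # verify only the two riskiest disks
print(round(coverage_ratio(probs, verified), 3))  # 0.955
print(io_save_ratio(sizes, verified))             # 0.7
```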

26 of 67

Comparison of Brute Force and Greedy Algorithms


27 of 67

Improvement Ratio

  • A larger improvement ratio increases the coverage ratio for all disk types.

  • Consequently, the I/O save ratio decreases.

28 of 67

Coverage Ratio Comparison


Batch Workload

Streaming Workload

29 of 67

I/O Overhead Comparison


Batch Workload

Streaming Workload

30 of 67

Overhead Analysis on Production File Systems


31 of 67

Impact of Probabilistic Write Verification

[Figure: write-verification overhead measured on the Darwin, Expanse, Campus Cluster, and Stampede2 production systems.]

32 of 67

Conclusion

  • If not handled properly, uncorrectable errors can lead to complete data loss or even system failure.

  • We analyzed 270M SMART logs from 200K disks over a period of 93 months and developed prediction models for uncorrectable errors.

  • Our models achieve up to 98% accuracy in uncorrectable bit error prediction while keeping false positive rates below 3%.


33 of 67

Conclusion

  • We incorporated the prediction model into two application scenarios.

  • Our probabilistic integrity verification method achieves up to 97% reduction in I/O overhead caused by the integrity verification process of wide area file transfers.


34 of 67

Performance Analysis for Different Improvement Factors

  • A larger improvement ratio increases the coverage ratio for all disk types.

  • Consequently, the I/O save ratio decreases.

  • For a 90% improvement ratio, we can achieve an 85-95% coverage ratio while saving 40-75% of I/O.

35 of 67

Impact of Probabilistic Integrity Verification on Transfer Time

I/O Saving Ratio   0%            10%           50%           90%
Time (sec)         360.8 ± 5.3   353.6 ± 6.8   263.0 ± 1.7   168.5 ± 2.0

Transfer time is reduced by 53% at a 90% I/O saving ratio.
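The 53% figure follows directly from the reported means (quick arithmetic check):

```python
baseline = 360.8  # mean transfer time with full verification (0% saving), sec
at_90 = 168.5     # mean transfer time at the 90% I/O saving ratio, sec
reduction = (baseline - at_90) / baseline
print(f"{reduction:.1%}")  # 53.3%
```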

36 of 67

Avoiding Uncorrectable Errors through I/O Redirection


37 of 67

I/O Redirection

[Figure: writes from memory are steered among Disk 1-4 based on predicted error probabilities (p=0.1, 0.2, 0.6, 0.9).]

38 of 67

Problem Definition

  •  


39 of 67

Problem Definition

Let

    • γ : the increase in the error ratio due to increased I/O load

Then the error probability becomes:
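The equation itself did not survive extraction; one plausible form consistent with the definition of γ above (an assumption, not the authors' formula) is a multiplicative inflation of each disk's predicted probability:

```latex
p_i' = \min\bigl(1,\; p_i\,(1 + \gamma)\bigr)
```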

40 of 67

I/O Redirection

[Figure: disks are ranked by predicted error probability (e.g., p=0.6, 0.9, 0.1, 0.8); high-risk disks are moved to an offline list, and writes are redirected to the remaining disks.]
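The redirection policy suggested by the figure can be sketched as follows (the 0.5 threshold and the disk-to-probability mapping are illustrative assumptions; the actual policy may differ):

```python
def redirect(disks, threshold=0.5):
    """Split disks into online/offline sets by predicted error probability.
    Writes go only to the online set; offline disks are skipped."""
    online = sorted(d for d, p in disks.items() if p < threshold)
    offline = sorted(d for d, p in disks.items() if p >= threshold)
    return online, offline

disks = {"Disk 1": 0.1, "Disk 2": 0.6, "Disk 3": 0.8, "Disk N": 0.9}
print(redirect(disks))  # → (['Disk 1'], ['Disk 2', 'Disk 3', 'Disk N'])
```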

41 of 67

Evaluation Metrics

[Figure: the same four-disk example (p=0.05, 0.02, 0.6, 0.9) with per-disk data sizes, used to evaluate I/O redirection.]

42 of 67

Performance Analysis of I/O Redirection


  • We can achieve more than 50% coverage ratio while keeping overhead less than 5%.

  • For disk model ST12000NM0007, we can even achieve 85-90% coverage ratio for less than 5% overhead.

 

43 of 67

Scale is Increasing and So Are Failures


  • 90% of current data centers were established after 2016.

  • Google uses more than 2.5 million custom servers that are equipped with multiple hard drives.

  • Facebook’s new data center has 500 racks that each hold 2 petabytes of data.

44 of 67

Disk Models Used in This Work

Disk Model       Capacity   Total Disks   Ratio of Disks with Uncorrectable Error
ST12000NM0007    12 TB      38,272        2.2%
ST4000DM000      4 TB       36,950        7.1%
ST8000NM0055     8 TB       14,811        2.2%
ST8000DM002      8 TB       10,161        2.7%
ST3000DM001      3 TB       4,286         33.2%

45 of 67

Evaluation Metrics

  •  


46 of 67

Model Evaluation


Random Forest achieves a 97% TPR at less than 3% FPR.

47 of 67

Performance Evaluation for Random Forest Classifier

  • Random Forest achieves more than 95% TPR while keeping FPR below 5% for ST12000NM0007 and ST4000DM000.


  • For ST8000NM0055 and ST8000DM002, achieving 90% TPR incurs an FPR of around 15% or less.

48 of 67

Probabilistic Integrity Verification

[Figure: data in memory striped across Disk 1-4 with predicted error probabilities p=0.1, 0.2, 0.6, 0.9.]

49 of 67

Era of Big Data


Every minute on Facebook: 510K comments are posted, 293K statuses are updated, and 136K photos are uploaded. This generates about 500 TB of data each day.

Over 100 million Instagram posts are uploaded each day.

300 hours of video are uploaded to YouTube every minute.

50 of 67

Problem Definition

Let

      • E(T) : the error probability without running integrity verification.

51 of 67

Probabilistic Integrity Verification

[Figure: the four-disk example (p=0.05, 0.02, 0.6, 0.9) annotated with the quantities used in the problem formulation.]

52 of 67

Additional I/O Size for End-to-End Integrity Verification


53 of 67


54 of 67

Evaluation Metrics

  •  


55 of 67

Evaluation Metrics

  •  


56 of 67


57 of 67

Comparison of Brute Force and Greedy Algorithms


58 of 67

Deployment Challenges and Opportunities

  • We developed our model using SMART metrics, which are supported by all disk manufacturers.
  • The storage footprint for daily SMART logs is negligible (32.5 MB for 120K disks).
  • Transfer learning is challenging, but our model can serve as a good starting point.
  • Both RAID arrays and parallel file systems have options to deactivate disks.


59 of 67

Feature Selection


60 of 67

Checksum Calculation Speedup Ratio


61 of 67

Dataset Characteristics

  • 14.5% of disks raised only one error.

  • 21.7% of disks raised more than 20 uncorrectable errors.

  • 6% of disks had more than 100 errors.


62 of 67

Handling Data Imbalance


63 of 67

Problem Definition

  •  


64 of 67

Simulation Default Settings

  • We simulated our algorithm on Backblaze SMART logs.
    • RAID level: 0 (no parity or mirroring)
    • Number of disks in RAID array: 8
    • Stripe size: 128 KB
    • Stripe count: 1
    • Number of disks used for each file transfer: 8
    • Improvement ratio k: 90% (10× improvement)


65 of 67

Dataset Characteristics


66 of 67

Performance Analysis for Different Stripe Counts


67 of 67

I/O Redirection

[Figure: disks ranked by predicted error probability (p=0.6, 0.9, 0.1, 0.8); high-risk disks are moved to an offline list, and writes are redirected to the rest.]