1 of 67


Be SMART, Save I/O: Probabilistic Approach to Avoid Uncorrectable Errors in Storage Systems

Md Arifuzzaman, Masudul Bhuiyan, Mehmet Gumus, and Engin Arslan

University of Nevada, Reno, CISPA Helmholtz Center for Information Security

2 of 67

Big Data of Science

The Event Horizon Telescope produces 4.5 petabytes of astronomy data every day.

The Large Hadron Collider (LHC) experiment at CERN produces more than 120 PB of data per year.

The storage capacity of the Frontier supercomputer is around 700 petabytes.

3 of 67

Error is Inevitable


  • Hard drives use Error Correcting Code (ECC) to detect and recover from errors.

  • However, a non-negligible portion of errors escapes ECC; these are referred to as Uncorrectable Errors (UE).

  • Uncorrectable errors occur once in every 10^12 (125 GB) to 10^15 (125 TB) bits of I/O.

  • 20-57% of disks in Google and Facebook data centers experienced at least one UE in their lifetime.
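For intuition, the quoted error rates translate directly into data volumes; a quick stdlib-only sketch (the 125 GB / 125 TB figures use decimal units, 1 GB = 10^9 bytes):

```python
def error_interval_bytes(bits_per_error: float) -> float:
    """Bytes of I/O per expected uncorrectable error (8 bits per byte)."""
    return bits_per_error / 8

# One error per 10^12 bits -> 125 GB; one per 10^15 bits -> 125 TB
print(error_interval_bytes(1e12) / 1e9)   # 125.0 (GB)
print(error_interval_bytes(1e15) / 1e12)  # 125.0 (TB)
```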

4 of 67

Failure is Costly


  • Uncorrectable errors can lead to data loss, disk failure, or even service interruption.

  • According to Gartner, server downtime costs $5,600 per minute on average.

  • The average cost of a data center outage is $740,357.

5 of 67

Problem Statement


Can we predict uncorrectable errors before they happen?

Almost!

6 of 67

SMART Values

  • SMART (Self-Monitoring, Analysis and Reporting Technology) is a monitoring system
    • It is supported by most disk drives.
    • It provides various indicators of drive health.
    • It reports various types of disk errors such as Seek Error Rate, Read Error Rate, and Uncorrectable Error.
    • It also reports operational data, such as drive temperature and power-on hours.

  • SMART logs are typically populated once a day for each disk.


7 of 67

Dataset

  • February 2014 to September 2021 (93 months)

  • 270M SMART logs.

  • 200K unique disks in total, but only 30K-120K of them were active on any given day.

  • 9.25K disks had at least one uncorrectable error (error percentage: 8.7%).

  • 5% of disks accounted for 89% of all errors, and less than 10% of disks accounted for 92.4% of all errors.


8 of 67

SMART Attributes

ID    Attribute Name        Importance Score
7     Seek Error Rate       0.1711
9     Power-On Hours        0.1633
1     Read Error Rate       0.1346
193   Load Cycle Count      0.1208
187   Uncorrectable Error   0.1176

9 of 67

Performance Comparison of Machine Learning Models (accuracy %)

Drive           CNN-LSTM   DNN     DT     SVC     RF      RF*    XGB
ST12000NM0007   95.41      96.9    90.4   97.93   97.85   98.19  98.25
ST4000DM000     93.79      94.17   81.68  93.98   93.81   94.45  94.7
ST8000NM0055    90.08      91.82   81.94  94.81   94.32   94.87  95.57
ST8000DM002     92.2       93.85   83.9   95.87   95.17   95.47  96.05
ST3000DM001     88.13      91.3    76.8   88.71   91.83   92.63  91.78

CNN-LSTM and DNN are neural network models; RF* denotes the previously proposed model.

10 of 67

Application Scenarios to Utilize the Predictability of Uncorrectable Errors


11 of 67

End-to-End File Transfer

[Figure: data flows from the sender's disk through memory and the network to the receiver's memory and disk; an uncorrectable error can occur at the disk.]

12 of 67

End-to-End Integrity Verification

[Figure: checksums computed at the sender's and receiver's disks (e.g., 2367213558 on both ends) are compared after the transfer to detect corruption.]

13 of 67

Additional I/O Size for End-to-End Integrity Verification​​​

[Figure: for a 200 GB transfer, end-to-end integrity verification incurs 200 GB of additional I/O on the sender and 200 GB on the receiver.]

14 of 67

Impact of Integrity Verification on Transfer Time

[Figure: checksum computation takes 400 s on the sender and 487 s on the receiver, while the actual transfer time is only 312 s.]

15 of 67


16 of 67

Probabilistic Integrity Verification


17 of 67

Probabilistic Integrity Verification

[Figure: data in memory is striped across Disk 1-4.]

18 of 67

Probabilistic Integrity Verification

[Figure: an ML model assigns each disk a predicted error probability (Disk 1: p=0.05, Disk 2: p=0.02, Disk 3: p=0.6, Disk 4: p=0.9).]

19 of 67

Probabilistic Integrity Verification

[Figure: low-probability disks (p=0.05, p=0.02) are predicted error-free ("No Error") and skipped; verification runs only on the high-probability disks (p=0.6, p=0.9).]

20 of 67

Problem Definition

 


21 of 67

Verify all disks:
  • Captures all errors
  • Incurs high I/O overhead
  • Slow transfer speed

Verify no disks:
  • No I/O overhead
  • Fast transfers
  • Fails to capture errors

Verify only high-risk disks:
  • Low I/O overhead
  • Fast transfers
  • Captures most errors

Probabilistic Integrity Verification aims to select disks for integrity verification such that most errors are captured while checking as few disks as possible.

22 of 67

Problem Definition

Minimize the total data size on which we run integrity verification, while bounding the error probability for the rest of the data.
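The equation objects on this slide did not survive extraction. A plausible formalization of the objective stated above, with notation that is mine rather than the authors' (s_i: data size on disk i; p_i: its predicted error probability; ε: tolerated residual error probability; disk errors assumed independent):

```latex
\min_{S \subseteq \{1,\dots,N\}} \sum_{i \in S} s_i
\quad \text{subject to} \quad
1 - \prod_{i \notin S} (1 - p_i) \le \varepsilon
```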

23 of 67

Brute Force Solution

Enumerate every possible subset of the N disks (no disk, {1}, {2}, ..., {1,2}, {2,3}, ..., {1,2,...,N}), evaluate each candidate, and keep the subset that satisfies the error-probability constraint while verifying the least data. The search space grows as 2^N, so brute force is only feasible for small N.
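The brute-force enumeration can be sketched as follows (a minimal model assuming independent per-disk error probabilities; the sizes and probabilities are the illustrative values from the earlier four-disk figure, not real data):

```python
from itertools import combinations

def brute_force_select(sizes, probs, eps):
    """Return the subset of disks to verify that minimizes verified I/O
    while keeping the error probability of the unverified remainder <= eps.
    Assumes independent per-disk error probabilities."""
    n = len(sizes)
    best, best_io = None, float("inf")
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            chosen = set(subset)
            # Probability that at least one unverified disk has an error
            residual = 1.0
            for i in range(n):
                if i not in chosen:
                    residual *= 1.0 - probs[i]
            if 1.0 - residual <= eps and sum(sizes[i] for i in chosen) < best_io:
                best, best_io = chosen, sum(sizes[i] for i in chosen)
    return best, best_io

# Four disks with the example probabilities from the earlier figure
print(brute_force_select([50, 20, 10, 20], [0.05, 0.02, 0.6, 0.9], eps=0.1))
# → ({2, 3}, 30)
```

Exhaustive search is exact but costs O(2^N) subset evaluations, which motivates the greedy alternative on the next slide.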

24 of 67

Greedy Solution

Sort the disks by predicted error probability and repeatedly add the riskiest remaining disk to the result set (e.g., Disk 2 first, then Disk 3, ...) until the error-probability constraint is satisfied.
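A minimal sketch of the greedy strategy (assuming, as before, independent error probabilities; picking by highest predicted probability first is one plausible selection criterion):

```python
def greedy_select(sizes, probs, eps):
    """Greedily verify the riskiest disks until the error probability of
    whatever remains unverified drops to eps or below."""
    def residual_error(selected):
        r = 1.0
        for i, p in enumerate(probs):
            if i not in selected:
                r *= 1.0 - p
        return 1.0 - r

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = set()
    for i in order:
        if residual_error(selected) <= eps:
            break
        selected.add(i)
    return selected, sum(sizes[i] for i in selected)

print(greedy_select([50, 20, 10, 20], [0.05, 0.02, 0.6, 0.9], eps=0.1))
# → ({2, 3}, 30)
```

Greedy matches brute force on this toy example but runs in polynomial time; unlike brute force, it does not guarantee the minimum verified I/O in general.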

25 of 67

Evaluation Metrics

[Figure: four disks with predicted error probabilities p=0.05, 0.02, 0.6, 0.9 and per-disk data sizes of 10-50 MB, used to illustrate the coverage ratio and I/O save ratio.]
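A hedged sketch of the two metrics as I read them from the figure (these definitions are a reconstruction, not necessarily the paper's exact formulas): coverage ratio as the share of predicted error mass on verified disks, and I/O save ratio as the fraction of bytes left unverified.

```python
def coverage_ratio(probs, verified):
    """Share of total predicted error probability on verified disks."""
    return sum(p for p, v in zip(probs, verified) if v) / sum(probs)

def io_save_ratio(sizes, verified):
    """Fraction of bytes that skip integrity verification."""
    return 1 - sum(s for s, v in zip(sizes, verified) if v) / sum(sizes)

probs = [0.05, 0.02, 0.6, 0.9]         # per-disk error probabilities (figure)
sizes = [50, 20, 10, 20]               # MB per disk (illustrative, from figure)
verified = [False, False, True, True]  # verify only the two riskiest disks
print(round(coverage_ratio(probs, verified), 3))  # 0.955
print(io_save_ratio(sizes, verified))             # 0.7
```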

26 of 67

Comparison of Brute Force and Greedy Algorithms


27 of 67

Improvement Ratio

  • A larger improvement ratio increases the coverage ratio for all disk types.

  • Consequently, the I/O save ratio decreases.

28 of 67

Coverage Ratio Comparison


Batch Workload

Streaming Workload

29 of 67

I/O Overhead Comparison


Batch Workload

Streaming Workload

30 of 67

Overhead Analysis on Production File Systems


31 of 67

Impact of Probabilistic Write Verification

[Figure: write-verification overhead measured on the Darwin, Expanse, Campus Cluster, and Stampede2 production systems.]

32 of 67

Conclusion

  • If not handled properly, uncorrectable errors can lead to complete data loss or even system failure.

  • We analyzed 270M SMART logs from 200K disks over a period of 93 months and developed prediction models for uncorrectable errors.

  • Our models achieve up to 98% accuracy in uncorrectable bit error prediction while keeping false positive rates below 3%.


33 of 67

Conclusion

  • We incorporated the prediction model into two application scenarios.

  • Our probabilistic integrity verification method achieves up to 97% reduction in I/O overhead caused by the integrity verification process of wide area file transfers.


34 of 67

Performance Analysis for Different Improvement Factors

  • A larger improvement ratio increases the coverage ratio for all disk types.

  • Consequently, the I/O save ratio decreases.

  • For a 90% improvement ratio, we can achieve an 85-95% coverage ratio while saving 40-75% of I/O.

35 of 67

Impact of Probabilistic Integrity Verification on Transfer Time

I/O Saving Ratio   0%            10%           50%           90%
Time (sec)         360.8 ± 5.3   353.6 ± 6.8   263.0 ± 1.7   168.5 ± 2.0

Transfer time is reduced by 53% at a 90% I/O saving ratio.
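The 53% figure follows directly from the reported means (quick arithmetic check):

```python
baseline = 360.8  # mean transfer time with full verification (0% saving), sec
at_90 = 168.5     # mean transfer time at the 90% I/O saving ratio, sec
reduction = (baseline - at_90) / baseline
print(f"{reduction:.1%}")  # 53.3%
```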

36 of 67

Avoiding Uncorrectable Errors through I/O Redirection


37 of 67

I/O Redirection

[Figure: writes from memory are steered among Disk 1-4 based on predicted error probabilities (p=0.1, 0.2, 0.6, 0.9).]

38 of 67

Problem Definition

  •  


39 of 67

Problem Definition

Let

    • γ : the increase in the error ratio due to increased I/O load

Then the error probability becomes:
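The equation itself did not survive extraction; one plausible form consistent with the definition of γ above (an assumption, not the authors' formula) is a multiplicative inflation of each disk's predicted probability:

```latex
p_i' = \min\bigl(1,\; p_i\,(1 + \gamma)\bigr)
```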

40 of 67

I/O Redirection

[Figure: disks are ranked by predicted error probability (e.g., p=0.6, 0.9, 0.1, 0.8); high-risk disks are moved to an offline list, and writes are redirected to the remaining disks.]
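The redirection policy suggested by the figure can be sketched as follows (the 0.5 threshold and the disk-to-probability mapping are illustrative assumptions; the actual policy may differ):

```python
def redirect(disks, threshold=0.5):
    """Split disks into online/offline sets by predicted error probability.
    Writes go only to the online set; offline disks are skipped."""
    online = sorted(d for d, p in disks.items() if p < threshold)
    offline = sorted(d for d, p in disks.items() if p >= threshold)
    return online, offline

disks = {"Disk 1": 0.1, "Disk 2": 0.6, "Disk 3": 0.8, "Disk N": 0.9}
print(redirect(disks))  # → (['Disk 1'], ['Disk 2', 'Disk 3', 'Disk N'])
```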

41 of 67

Evaluation Metrics

[Figure: the same four-disk example (p=0.05, 0.02, 0.6, 0.9) with per-disk data sizes, used to evaluate I/O redirection.]

42 of 67

Performance Analysis of I/O Redirection


  • We can achieve more than 50% coverage ratio while keeping overhead less than 5%.

  • For disk model ST12000NM0007, we can even achieve 85-90% coverage ratio for less than 5% overhead.

 

43 of 67

Scale is Increasing and So Are Failures


  • 90% of current data centers were established after 2016.

  • Google uses more than 2.5 million custom servers that are equipped with multiple hard drives.

  • Facebook’s new data center has 500 racks that each hold 2 petabytes of data.

44 of 67

Disk Models Used in This Work

Disk Model       Capacity   Total Disks   Ratio of Disks with Uncorrectable Error
ST12000NM0007    12 TB      38,272        2.2%
ST4000DM000      4 TB       36,950        7.1%
ST8000NM0055     8 TB       14,811        2.2%
ST8000DM002      8 TB       10,161        2.7%
ST3000DM001      3 TB       4,286         33.2%

45 of 67

Evaluation Metrics

  •  


46 of 67

Model Evaluation


Random Forest achieves a 97% TPR at less than 3% FPR.

47 of 67

Performance Evaluation for Random Forest Classifier

  • Random Forest achieves more than 95% TPR while keeping FPR below 5% for ST12000NM0007 and ST4000DM000.


  • For ST8000NM0055 and ST8000DM002, achieving 90% TPR incurs an FPR of around 15% or less.

48 of 67

Probabilistic Integrity Verification

[Figure: data in memory striped across Disk 1-4 with predicted error probabilities p=0.1, 0.2, 0.6, 0.9.]

49 of 67

Era of Big Data


Every minute on Facebook: 510K comments are posted, 293K statuses are updated, and 136K photos are uploaded. This generates about 500 TB of data each day.

Over 100 million Instagram posts are uploaded each day.

300 hours of video are uploaded to YouTube every minute.

50 of 67

Problem Definition

Let

      • E(T) : the error probability without running integrity verification.

51 of 67

Probabilistic Integrity Verification

[Figure: the four-disk example (p=0.05, 0.02, 0.6, 0.9) annotated with the quantities used in the problem formulation.]

52 of 67

Additional I/O Size for End-to-End Integrity Verification


53 of 67


54 of 67

Evaluation Metrics

  •  


55 of 67

Evaluation Metrics

  •  


56 of 67


57 of 67

Comparison of Brute Force and Greedy Algorithms


58 of 67

Deployment Challenges and Opportunities

  • We developed our model using SMART metrics, which are supported by all disk manufacturers.
  • The storage footprint for daily SMART logs is negligible (32.5 MB for 120K disks).
  • Transfer learning is challenging, but our model can serve as a good starting point.
  • Both RAID arrays and parallel file systems have options to deactivate disks.


59 of 67

Feature Selection


60 of 67

Checksum Calculation Speedup Ratio


61 of 67

Dataset Characteristics

  • 14.5% of disks raised only one error.

  • 21.7% of disks raised more than 20 uncorrectable errors.

  • 6% of disks had more than 100 errors.


62 of 67

Handling Data Imbalance


63 of 67

Problem Definition

  •  


64 of 67

Simulation Default Settings

  • We simulated our algorithm on Backblaze SMART logs.
    • RAID level: 0 (no parity or mirroring)
    • Number of disks in RAID array: 8
    • Stripe size: 128 KB
    • Stripe count: 1
    • Number of disks used for each file transfer: 8
    • Improvement ratio k: 90% (10× improvement)


65 of 67

Dataset Characteristics


66 of 67

Performance Analysis for Different Stripe Counts


67 of 67

I/O Redirection

[Figure: disks ranked by predicted error probability (p=0.6, 0.9, 0.1, 0.8); high-risk disks are moved to an offline list, and writes are redirected to the rest.]