Be SMART, Save I/O: Probabilistic Approach to Avoid Uncorrectable Errors in Storage Systems
Md Arifuzzaman, Masudul Bhuiyan, Mehmet Gumus, and Engin Arslan
University of Nevada, Reno, CISPA Helmholtz Center for Information Security
Big Data of Science
The Event Horizon Telescope produces 4.5 petabytes of astronomy data every day.
The Large Hadron Collider (LHC) experiment at CERN produces more than 120 PB of data per year.
The storage capacity of the Frontier supercomputer is around 700 petabytes.
Error is Inevitable
Failure is Costly
Problem Statement
Can we predict uncorrectable errors before they happen?
Almost!
SMART Values
Dataset
SMART Attributes
ID | Attribute Name | Importance Score |
7 | Seek Error Rate | 0.1711 |
9 | Power-On Hours | 0.1633 |
1 | Read Error Rate | 0.1346 |
193 | Load Cycle Count | 0.1208 |
187 | Uncorrectable Error | 0.1176 |
Performance Comparison of Machine Learning Models
Drive Models | CNN-LSTM | DNN | DT | SVC | RF | RF | XGB |
ST12000NM0007 | 95.41 | 96.9 | 90.4 | 97.93 | 97.85 | 98.19 | 98.25 |
ST4000DM000 | 93.79 | 94.17 | 81.68 | 93.98 | 93.81 | 94.45 | 94.7 |
ST8000NM0055 | 90.08 | 91.82 | 81.94 | 94.81 | 94.32 | 94.87 | 95.57 |
ST8000DM002 | 92.2 | 93.85 | 83.9 | 95.87 | 95.17 | 95.47 | 96.05 |
ST3000DM001 | 88.13 | 91.3 | 76.8 | 88.71 | 91.83 | 92.63 | 91.78 |
Neural network models
Previously proposed model
Application Scenarios to Utilize the Predictability of Uncorrectable Errors
End-to-End File Transfer
[Figure: sender (disk, memory) and receiver (memory, disk) connected by the network; an uncorrectable error is marked in the diagram.]
End-to-End Integrity Verification
[Figure: sender (disk, memory) and receiver (memory, disk) connected by the network; both sides show the same checksum value, 2367213558.]
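A minimal sketch of the end-to-end check described on this slide: each side reads the file back from storage, computes a checksum, and the two values are compared. The slides do not name the checksum algorithm, so CRC-32 (via Python's `zlib`) is an assumption here, and `file_crc32` and `verify_transfer` are hypothetical helper names.

```python
import zlib

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time (arbitrary choice)


def file_crc32(path):
    """Return the CRC-32 of a file, reading it from storage in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            crc = zlib.crc32(chunk, crc)  # running checksum over all chunks
    return crc


def verify_transfer(sender_path, receiver_path):
    """True when the sender-side and receiver-side checksums match.

    In a real transfer the two checksums are computed on different hosts
    and only the checksum values are exchanged; both paths are local here
    purely for illustration.
    """
    return file_crc32(sender_path) == file_crc32(receiver_path)
```

Each call to `file_crc32` reads the whole file once more, which is the extra I/O that integrity verification adds on both the sender and the receiver.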
Additional I/O Size for End-to-End Integrity Verification
[Figure: additional I/O in sender: 200 GB; additional I/O in receiver: 200 GB.]
Impact of Integrity Verification on Transfer Time
400 s for checksum computation
487 s for checksum computation
Actual transfer time = 312 s
Probabilistic Integrity Verification
[Figure: memory connected to Disk 1 through Disk 4; an ML model labels the disks, in the order shown, with p=0.05, p=0.02, p=0.6, and p=0.9; one label reads "No Error".]
Problem Definition
[Figure: three views of Disk 1 through Disk 4, with different subsets of disks marked X.]
Probabilistic integrity verification aims to select the disks on which to run integrity verification so that most errors are caught while checking as few disks as possible.
Problem Definition
Total data size on which we run integrity verification
Error probability for rest of the data
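The two quantities named above can be sketched in code. The slide's own formulas did not survive in this text, so the following is an assumption: errors on different disks are treated as independent, and the error probability of the unverified data is taken as the chance that at least one unverified disk has an error. The pairing of probabilities with data sizes is illustrative only.

```python
from math import prod

# Hypothetical example: disk name -> (predicted error probability, data size in MB).
DISKS = {
    "Disk 1": (0.05, 50),
    "Disk 2": (0.02, 20),
    "Disk 3": (0.6, 10),
    "Disk 4": (0.9, 20),
}


def verified_size(disks, selected):
    """Total data size (MB) on the disks chosen for integrity verification."""
    return sum(disks[name][1] for name in selected)


def residual_error_probability(disks, selected):
    """Probability that at least one unverified disk has an error.

    Assumes independent errors across disks (an assumption of this sketch,
    not a formula taken from the slides).
    """
    unverified = [p for name, (p, _) in disks.items() if name not in selected]
    return 1.0 - prod(1.0 - p for p in unverified)
```

Under these assumptions, verifying only Disk 3 and Disk 4 covers 30 MB of the 100 MB written and leaves a residual error probability of 1 - 0.95 × 0.98 = 0.069.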
Brute Force Solution
[Figure: candidate sets drawn from Disk 1 through Disk N, from "No disk" through pairs such as 1,2 and 2,3 up to 1,2,…,N.]
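A brute-force search over disk subsets might look like the sketch below. The objective is an assumption made for illustration: among all subsets whose unverified remainder stays at or below a hypothetical `max_residual` error probability (assuming independent disk errors), pick the one with the least data to verify. The disk values are illustrative.

```python
from itertools import combinations
from math import prod


def residual(disks, selected):
    """Chance that at least one unverified disk has an error (independence assumed)."""
    return 1.0 - prod(1.0 - p for name, (p, _) in disks.items() if name not in selected)


def brute_force_select(disks, max_residual):
    """Try every subset of disks; return (verified size, subset) with the smallest size.

    disks maps a name to (error probability, data size). Requires max_residual >= 0,
    so that verifying every disk (residual 0) is always a valid fallback.
    """
    names = list(disks)
    best = None
    for k in range(len(names) + 1):           # subset sizes 0..N
        for subset in combinations(names, k):  # all subsets of that size
            if residual(disks, subset) <= max_residual:
                size = sum(disks[n][1] for n in subset)
                if best is None or size < best[0]:
                    best = (size, set(subset))
    return best
```

The loop visits all 2^N subsets, so the approach is exact but only practical for a small number of disks.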
Greedy Solution
[Figure: disks (Disk 1, Disk 2, Disk 3, ..., Disk N) are moved one at a time into a result set; the result set is shown containing Disk 2.]
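One way a greedy selection could work is sketched below. The ranking rule is an assumption, since the slide text does not state it: disks are taken in order of decreasing predicted error probability and added to the result set until the remaining, unverified disks fall at or below a hypothetical `max_residual` error probability (independent disk errors assumed).

```python
from math import prod


def greedy_select(error_prob, max_residual):
    """Return the disks to verify, most error-prone first.

    error_prob maps a disk name to its predicted error probability.
    Stops as soon as the unverified disks together have an error
    probability of at most max_residual.
    """
    remaining = sorted(error_prob, key=error_prob.get, reverse=True)
    result_set = []
    while remaining:
        residual = 1.0 - prod(1.0 - error_prob[name] for name in remaining)
        if residual <= max_residual:
            break
        result_set.append(remaining.pop(0))  # move the riskiest disk into the result set
    return result_set
```

Sorting dominates the cost, so this runs in O(N log N) plus the residual recomputation, in contrast to the exponential subset enumeration of brute force.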
Evaluation Metrics
[Figure: memory and four disks labeled p=0.05, p=0.02, p=0.6, and p=0.9, with two sets of data-size labels: 50 MB, 20 MB, 10 MB, 20 MB and 50 MB, 10 MB, 20 MB, 20 MB.]
Comparison of Brute Force and Greedy algorithms
Improvement Ratio
Coverage Ratio Comparison
[Figure: two panels, Batch Workload and Streaming Workload.]
I/O Overhead comparison
[Figure: two panels, Batch Workload and Streaming Workload.]
Overhead Analysis on Production File Systems
Impact of Probabilistic Write Verification
[Figure: panels labeled Darwin, Expanse, Campus Cluster, Stampede2, Darwin, and Expanse.]
Conclusion
Performance Analysis for Different Improvement Factors
Impact of Probabilistic Integrity Verification on Transfer Time
I/O Saving Ratio | 0% | 10% | 50% | 90% |
Time (sec) | 360.8 ± 5.3 | 353.6 ± 6.8 | 263.0 ± 1.7 | 168.5 ± 2.0 |
Transfer time is reduced by 53% for a 90% I/O saving.
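The 53% figure can be checked from the mean transfer times in the table. The short sketch below computes the relative reduction against the 0% baseline; the variable names are illustrative and the ± spreads are ignored.

```python
# Mean transfer time in seconds for each I/O saving ratio (percent), from the table.
mean_time = {0: 360.8, 10: 353.6, 50: 263.0, 90: 168.5}

baseline = mean_time[0]

# Relative reduction in transfer time, in percent, versus verifying everything.
reduction = {
    saving: 100.0 * (baseline - seconds) / baseline
    for saving, seconds in mean_time.items()
}

for saving, pct in reduction.items():
    print(f"{saving}% I/O saving: {pct:.1f}% shorter transfer")
```

For the 90% saving ratio this gives (360.8 - 168.5) / 360.8, or about 53%, matching the slide; the 50% ratio works out to about 27%.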
Avoiding Uncorrectable Errors through I/O Redirection
I/O Redirection
[Figure: memory and Disk 1 through Disk 4, labeled in the order shown with p=0.1, p=0.2, p=0.6, and p=0.9.]
Problem Definition
Let
Then error probability will be
I/O Redirection
[Figure: Disk 1 through Disk N with four "Offline List" labels; the disks appear again in the order Disk 2, Disk N, Disk 1, Disk 3, with labels p=0.6, p=0.9, p=0.1, and p=0.8.]
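As a rough illustration of the redirection idea, the sketch below places disks whose predicted error probability reaches a threshold on an offline list and spreads writes over the remaining disks. The threshold rule, the round-robin placement, and the names `build_offline_list` and `redirect_writes` are all assumptions of this sketch; the slide text does not specify how the offline list is built.

```python
from itertools import cycle


def build_offline_list(error_prob, threshold):
    """Disks considered too risky to receive new writes (hypothetical rule)."""
    return {disk for disk, p in error_prob.items() if p >= threshold}


def redirect_writes(blocks, error_prob, threshold):
    """Assign each block to a disk that is not on the offline list, round-robin."""
    offline = build_offline_list(error_prob, threshold)
    online = [disk for disk in error_prob if disk not in offline]
    if not online:
        raise RuntimeError("every disk is on the offline list")
    targets = cycle(online)
    return {block: next(targets) for block in blocks}
```

With illustrative probabilities of 0.1, 0.2, 0.6, and 0.9 for Disk 1 through Disk 4 and a threshold of 0.5, Disk 3 and Disk 4 go offline and writes alternate between Disk 1 and Disk 2. A lower threshold avoids more risky disks but concentrates load on fewer of them.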
Evaluation Metrics
[Figure: memory and four disks labeled p=0.05, p=0.02, p=0.6, and p=0.9, with data-size labels 50 MB, 20 MB, 10 MB, 20 MB, 50 MB, and 10 MB.]
Performance Analysis of I/O Redirection
Scale Is Increasing and So Are Failures
Disk Models Used in This Work
Disk Model | Capacity | Total Disks | Ratio of Disks with Uncorrectable Errors |
ST12000NM0007 | 12 TB | 38,272 | 2.2% |
ST4000DM000 | 4 TB | 36,950 | 7.1% |
ST8000NM0055 | 8 TB | 14,811 | 2.2% |
ST8000DM002 | 8 TB | 10,161 | 2.7% |
ST3000DM001 | 3 TB | 4,286 | 33.2% |
Evaluation Metrics
Model Evaluation
Random Forest achieves a 97% true positive rate (TPR) at a false positive rate (FPR) below 3%.
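For reference, TPR and FPR follow from the confusion-matrix counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below computes both from binary labels, where 1 marks a disk with an uncorrectable error; the function name and the toy labels are illustrative and are not data from the study.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # errors correctly predicted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # errors missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy disks flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy disks passed
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy example: 3 failing disks and 4 healthy disks.
example_true = [1, 1, 1, 0, 0, 0, 0]
example_pred = [1, 1, 0, 0, 0, 0, 1]
```

On the toy labels, two of three failing disks are caught (TPR of 2/3) and one of four healthy disks is flagged (FPR of 0.25).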
Performance Evaluation for Random Forest Classifier
Probabilistic Integrity Verification
[Figure: memory and Disk 1 through Disk 4, labeled in the order shown with p=0.1, p=0.2, p=0.6, and p=0.9.]
Era of Big Data
Every minute on Facebook: 510K comments are posted, 293K statuses are updated, and 136K photos are uploaded. This generates about 500 TB of data each day.
Over 100 million Instagram posts are uploaded each day.
300 hours of video are uploaded to YouTube every minute.
Deployment Challenges and Opportunities
Feature Selection
Checksum Calculation Speedup Ratio
Dataset Characteristics
Handling Data Imbalance
Simulation Default Settings
Performance Analysis for Different Stripe Counts