Enabling Scalability in the Cloud for Scientific
Workflows: An Earth Science Use Case
Paula Olaya*, Jakob Luettgau*, Ricardo Llamas†, Rodrigo Vargas†, Jay Lofstead‡, Sophia Wen§, I-Hsin Chung§, Seetharami Seelam§, and Michela Taufer*
*University of Tennessee, Knoxville; †University of Delaware; ‡Sandia National Laboratories; §IBM-HCIR
NSDF Application Domains targeted
2
Materials Science
Astronomy
Earth Sciences
Scientific workflows: Data transformations
3
[Diagram: data-size transformations (Small Data and Big Data) for the four workflow patterns listed below]
(i) workflows that produce a large output from small input
(ii) workflows that reduce large input to small output
(iii) workflows that process the same input data many times with large data reuse
(iv) workflows with large input that produce large intermediate data and process it to generate a smaller output
Courtesy of Frank Wuerthwein, SDSC
ML-based workflows
6
Earth Sciences
Example of an ML-based Earth science workflow (SOMOSPIE) that uses ML models to predict high-resolution soil moisture values from low-resolution satellite data
General scientific workflow with intermediate data
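To make the SOMOSPIE step concrete, here is a minimal sketch of the train-and-predict pattern described above, assuming scikit-learn's KNeighborsRegressor (the deck later refers to a knn model); the file names and column names are hypothetical placeholders, not SOMOSPIE's actual interfaces.

```python
# Sketch of the SOMOSPIE-style ML step: train on coarse-resolution satellite
# soil moisture joined with terrain parameters, then predict soil moisture on
# a fine-resolution terrain grid. File and column names are placeholders.
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Coarse training data: <long., lat., p1..p4, sm> at the 27 km satellite resolution
train = pd.read_csv("train_27km.csv")
X_train = train[["long", "lat", "p1", "p2", "p3", "p4"]]
y_train = train["sm"]

# Fine-resolution terrain parameters: <long., lat., p1..p4> at 90 m or 10 m
terrain = pd.read_csv("terrain_90m.csv")

# k-NN regression as the ML model (the deck mentions a knn model); k is a choice
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)

# High-resolution soil moisture predictions: <long., lat., sm>
terrain["sm"] = model.predict(terrain[["long", "lat", "p1", "p2", "p3", "p4"]])
terrain[["long", "lat", "sm"]].to_csv("prediction_90m.csv", index=False)
```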
8
For ML-based workflows, data is crucial in obtaining better models and predictions
Data can scale as we move from low to high resolutions in a given region (scale-up)
Data grows as we cover a larger region with the same resolution (scale-out)
Evolution of spatial resolution in Global Climate Models used for IPCC assessment reports (Source: www.wmo.int/pages/themes/climate/climate_models.php).
Scaling up: Resolution
9
Satellite resolution: Midwest region at 27 km → 2x1 data points
Midwest region at 90 m → 453x227 data points
Midwest region at 10 m → 8926x4539 data points
Scaling out: Region
10
Midwest region at 10 m (state scale) → CONUS at 10 m (country scale)
Data scenarios for SOMOSPIE
11
| Region, Resolution | Data Type | Data Description | Points [#] | Size [B] |
|---|---|---|---|---|
| Midwest, 90 m x 90 m | Input | Satellite data, 27 km x 27 km | 450 | 44,198 |
| | Input | Terrain params (4 params), 90 m x 90 m | 36,073,181 | 3.03 G |
| | Intermediate | 1 prediction, Midwest, 90 m x 90 m | 36,073,181 | 1.42 G |
| | Output | 1 visualization, Midwest, 90 m x 90 m | - | 438,126 |
| | Total Midwest 90 m | | | 4.45 G |
| Midwest, 10 m x 10 m | Input | Satellite data, 27 km x 27 km | 450 | 44,198 |
| | Input | Terrain params (4 params), 10 m x 10 m | 2,921,927,661 | 280.51 G |
| | Predictions | 1 prediction, Midwest, 10 m x 10 m | 2,921,927,661 | 140.25 G |
| | Output | 1 visualization, Midwest, 10 m x 10 m | - | 4.57 M |
| | Total Midwest 10 m | | | 420.76 G |
| CONUS, 10 m x 10 m | Input | Satellite data, 27 km x 27 km | 8,214 | 985,600 |
| | Input | Terrain params (4 params), 10 m x 10 m | 99,925,046,500 | 9.00 T |
| | Predictions | 1 prediction, CONUS, 10 m x 10 m | 99,925,046,500 | 5.12 T |
| | Output | 1 visualization, CONUS, 10 m x 10 m | - | 21.14 M |
| | Total CONUS 10 m | | | 14,934.00 G |
Data partitioning
12
Scientists apply data partitioning to process temporal and spatial data at different scales, enabling data parallelism
[Diagram: terrain parameters <long., lat., p1, p2, ..., pn> are partitioned into tiles, each stored as an independent file; each tile is processed on an independent node by the ML model (1. read, 2. compute, 3. write), and the soil moisture predictions <long., lat., sm> are written to independent files]
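A minimal sketch of the per-tile read, compute, write pattern in the diagram, assuming each tile is a CSV file in a shared input directory and the trained model is loaded from disk; in the actual deployment each tile runs on its own node, whereas this sketch uses local processes for illustration. Paths and column names are hypothetical.

```python
# Sketch of the per-tile read -> compute -> write pattern: each tile is an
# independent file, so tiles can be processed in parallel (local processes
# here; one tile per node in the cluster deployment). Paths, column names,
# and the serialized model file are hypothetical.
from multiprocessing import Pool
from pathlib import Path
import joblib
import pandas as pd

INPUT_DIR = Path("/data/terrain_tiles")       # tiles of <long., lat., p1..p4>
OUTPUT_DIR = Path("/data/prediction_tiles")   # tiles of <long., lat., sm>
model = joblib.load("knn_model.joblib")       # trained ML model

def process_tile(tile_path: Path) -> Path:
    tile = pd.read_csv(tile_path)                               # 1. read
    tile["sm"] = model.predict(
        tile[["long", "lat", "p1", "p2", "p3", "p4"]])          # 2. compute
    out_path = OUTPUT_DIR / tile_path.name
    tile[["long", "lat", "sm"]].to_csv(out_path, index=False)   # 3. write
    return out_path

if __name__ == "__main__":
    tiles = sorted(INPUT_DIR.glob("*.csv"))
    with Pool() as pool:            # data parallelism across independent tiles
        pool.map(process_tile, tiles)
```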
Data partitioning for SOMOSPIE scenarios
14
| Region, Resolution | Data Type | Data Description | Total tiles | Size per tile [B] |
|---|---|---|---|---|
| Midwest, 90 m x 90 m | Input | Terrain params (4 params), 90 m x 90 m | 1 | 2.43 G |
| | Intermediate | 1 prediction, Midwest, 90 m x 90 m | 1 | 1.42 G |
| Midwest, 10 m x 10 m | Input | Terrain params (4 params), 10 m x 10 m | 225 | 900 M - 1.6 G |
| | Predictions | 1 prediction, Midwest, 10 m x 10 m | 225 | 350 M - 800 M |
| CONUS, 10 m x 10 m | Input | Terrain params (4 params), 10 m x 10 m | 1,156 | 1.1 G - 11 G |
| | Predictions | 1 prediction, CONUS, 10 m x 10 m | 1,156 | 650 M - 6.1 G |
Scientists apply data partitioning to process temporal and spatial data at different scales, enabling data parallelism
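As an illustration of how such tiles could be generated, the sketch below splits a region's bounding box into an N x N grid; the tile counts in the table (225 and 1,156) are consistent with 15 x 15 and 34 x 34 grids, though the actual SOMOSPIE tiler may partition differently, and the bounding-box coordinates are illustrative only.

```python
# Sketch of splitting a region's bounding box into an N x N grid of tiles.
# The tile counts above (225 and 1,156) are consistent with 15 x 15 and
# 34 x 34 grids, but the actual tiler may differ; the Midwest bounding-box
# coordinates below are illustrative only.
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (min_long, min_lat, max_long, max_lat)

def make_tiles(bbox: BBox, n: int) -> Iterator[BBox]:
    """Yield n * n sub-boxes covering bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    dlon, dlat = (max_lon - min_lon) / n, (max_lat - min_lat) / n
    for i in range(n):
        for j in range(n):
            yield (min_lon + i * dlon, min_lat + j * dlat,
                   min_lon + (i + 1) * dlon, min_lat + (j + 1) * dlat)

midwest = (-104.0, 36.0, -80.5, 49.0)       # illustrative bounding box
print(len(list(make_tiles(midwest, 15))))   # 225 tiles
```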
15
HPC
Cloud
16
When scientists port their workflows to HPC on cloud services, questions arise about:
Automatic scalability of this kind of scientific workflow with cloud technologies
Deployment services for orchestrating their workflows (HPC on cloud services)
Storage technology solutions to support the large data transformations (Cloud Object Storage)
HPC on cloud services
18
[Diagram: two deployment options. LSF (HPC as a service on the cloud): the application runs in userspace on the kernel of a VM instance inside a VPC. Kubernetes (open-source cloud-native service): the application runs in a container inside pods managed by Kubernetes on VM instances inside a VPC]
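As a rough illustration of how the same per-tile task is handed to each service, the sketch below submits a job with LSF's bsub and with a Kubernetes Job manifest via kubectl; the script name, container image, and core count are hypothetical.

```python
# Sketch of submitting one per-tile task on each deployment service.
# The script name, container image, and core count are hypothetical; real
# deployments would template these per tile.
import subprocess

# LSF (HPC as a service on the cloud): submit a batch job with bsub
subprocess.run(
    ["bsub", "-n", "2", "python", "process_tile.py", "tile_001.csv"],
    check=True,
)

# Kubernetes (open-source cloud-native service): apply a Job manifest
job_manifest = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: somospie-tile-001
spec:
  template:
    spec:
      containers:
      - name: worker
        image: somospie/worker:latest              # hypothetical image
        command: ["python", "process_tile.py", "tile_001.csv"]
      restartPolicy: Never
"""
subprocess.run(["kubectl", "apply", "-f", "-"],
               input=job_manifest, text=True, check=True)
```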
HPC on cloud services
20
[Diagram: both deployment stacks (LSF and Kubernetes) are extended with Cloud Object Storage backing the VM instances]
Cloud object storage (COS) distributes data over multiple instances to provide proportional capacity and throughput scalability
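COS is reached through an S3-compatible API; below is a minimal sketch of direct object access with boto3 (IBM also provides an ibm_boto3 SDK with a similar interface), before the POSIX mapping shown on the next slide. The endpoint, bucket names, object keys, and credentials are placeholders.

```python
# Sketch of direct object access to Cloud Object Storage through its
# S3-compatible API (boto3 here; IBM also provides an ibm_boto3 SDK).
# Endpoint, bucket names, object keys, and credentials are placeholders.
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="HMAC_ACCESS_KEY",
    aws_secret_access_key="HMAC_SECRET_KEY",
)

# Pull one input tile, run the model on it locally, push the prediction back
cos.download_file("terrain-tiles", "midwest_10m/tile_001.csv", "/tmp/tile_001.csv")
# ... ML inference producing /tmp/pred_001.csv ...
cos.upload_file("/tmp/pred_001.csv", "prediction-tiles", "midwest_10m/pred_001.csv")
```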
HPC on cloud services
21
[Diagram: in both stacks (LSF and Kubernetes) the application reaches Cloud Object Storage through S3FS, FUSE (Filesystem in USErspace), and the VFS (virtual file system) inside each VM instance; on Kubernetes, the mount is exposed to the pods through a PV (persistent volume) and a PVC (persistent volume claim)]
We use s3fs to map the object storage into POSIX namespaces
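With the buckets mounted through s3fs, the workflow keeps using ordinary POSIX file I/O; a minimal sketch, assuming hypothetical mount points /mnt/cos-in and /mnt/cos-out.

```python
# With s3fs, the COS buckets appear as POSIX directories, so the workflow
# keeps using ordinary file I/O. Mount points and paths are hypothetical.
from pathlib import Path

IN_MOUNT = Path("/mnt/cos-in")    # s3fs mount of the input bucket
OUT_MOUNT = Path("/mnt/cos-out")  # s3fs mount of the output bucket

tile = (IN_MOUNT / "midwest_10m" / "tile_001.csv").read_text()        # POSIX read
prediction = tile   # placeholder for the ML inference step
(OUT_MOUNT / "midwest_10m" / "pred_001.csv").write_text(prediction)   # POSIX write
```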
Tuning I/O parameters
22
[Diagram: the I/O stack, from top to bottom: deployment services → COS-FS mapping package (s3fs) → Cloud Object Storage (COS). COS offers high capacity, high scalability, multiple access, and high data resilience]
Out-of-the-box settings incur data movement costs that can be mitigated by tuning the platform's I/O parameters
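A minimal sketch of an s3fs mount with a few of its I/O-related options (multipart chunk size, parallel request count, local caching); which options matter and the best values depend on the platform and workload, and the bucket, endpoint, credentials file, and values below are illustrative assumptions rather than the settings used in this study.

```python
# Sketch of an s3fs mount with a few I/O-related options that can be tuned.
# Bucket, mount point, endpoint, credentials file, and option values are
# illustrative; the right values depend on the platform and workload.
import subprocess

subprocess.run(
    [
        "s3fs", "terrain-tiles", "/mnt/cos-in",
        "-o", "url=https://s3.us-south.cloud-object-storage.appdomain.cloud",
        "-o", "passwd_file=/etc/passwd-s3fs",   # HMAC credentials
        "-o", "use_path_request_style",
        "-o", "multipart_size=64",              # MB per multipart chunk
        "-o", "parallel_count=8",               # concurrent requests per file
        "-o", "use_cache=/tmp/s3fs-cache",      # local file cache
    ],
    check=True,
)
```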
23
Tuning I/O parameters at a single VM instance level
We study this with the Midwest, 90 m x 90 m scenario: the knn model reads the terrain parameters <long., lat., p1, p2, ..., pn> (1. read), computes the soil moisture inference (2. compute), and writes the predictions <long., lat., sm> (3. write).
| Region, Resolution | Data Type | Data Description | Points [#] | Size [B] |
|---|---|---|---|---|
| Midwest, 90 m x 90 m | Input | Satellite data, 27 km x 27 km | 450 | 44,198 |
| | Input | Terrain params (4 params), 90 m x 90 m | 36,073,181 | 3.03 G |
| | Intermediate | 1 prediction, Midwest, 90 m x 90 m | 36,073,181 | 1.42 G |
| | Output | 1 visualization, Midwest, 90 m x 90 m | - | 438,126 |
| | Total Midwest 90 m | | | 4.45 G |
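The read and write bandwidths reported on the next slides follow from bytes moved divided by the elapsed time of each phase; below is a minimal sketch of that measurement around the read, compute, write steps, with placeholder file paths.

```python
# Sketch of deriving per-phase bandwidth (MB/s) as bytes moved divided by
# elapsed time, wrapped around the read, compute, and write steps.
# File paths and the compute placeholder are hypothetical.
import os
import time

def mb_per_s(nbytes: int, seconds: float) -> float:
    return nbytes / 1e6 / seconds

in_path, out_path = "/mnt/cos-in/tile_001.csv", "/mnt/cos-out/pred_001.csv"

t0 = time.perf_counter()
with open(in_path, "rb") as f:                  # 1. read
    payload = f.read()
read_s = time.perf_counter() - t0

result = payload                                # 2. compute (placeholder)

t0 = time.perf_counter()
with open(out_path, "wb") as f:                 # 3. write
    f.write(result)
    f.flush()
    os.fsync(f.fileno())                        # flush past the page cache
write_s = time.perf_counter() - t0

print(f"read  BW: {mb_per_s(len(payload), read_s):.1f} MB/s")
print(f"write BW: {mb_per_s(len(result), write_s):.1f} MB/s")
```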
24
I/O time for LSF and Kubernetes: Write BW
[Plot: write bandwidth (MB/s) on LSF and Kubernetes; node: 32 cores, 64 GB RAM; 10 runs]
25
I/O time for LSF and Kubernetes: Read BW
[Plot: read bandwidth (MB/s) on LSF and Kubernetes; node: 32 cores, 64 GB RAM; 10 runs]
27
Scaling for multiple reads and writes to COS
[Diagram: worker nodes WN1 ... WNn each read a tile from COS 1 and write their result to COS 2 (WN = worker node)]
As we scale up the number of VM instances (one tile per VM instance), do we find a bottleneck in data reads and writes?
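A minimal sketch of the one-tile-per-VM-instance pattern in the diagram: each worker derives its tile from a scheduler-provided index (LSB_JOBINDEX for LSF job arrays, JOB_COMPLETION_INDEX for Kubernetes indexed Jobs), reads it from the COS 1 mount, and writes its prediction to the COS 2 mount. The mount points and the tile naming scheme are hypothetical.

```python
# Sketch of the one-tile-per-worker-node pattern: each node resolves its own
# tile from a scheduler-provided index (LSB_JOBINDEX on LSF job arrays,
# JOB_COMPLETION_INDEX on Kubernetes indexed Jobs), reads it from the COS 1
# mount, and writes its prediction to the COS 2 mount. Mount points and the
# tile naming scheme are hypothetical.
import os
from pathlib import Path

IN_MOUNT = Path("/mnt/cos-in")    # COS 1 via s3fs
OUT_MOUNT = Path("/mnt/cos-out")  # COS 2 via s3fs

idx = int(os.environ.get("LSB_JOBINDEX")
          or os.environ.get("JOB_COMPLETION_INDEX")
          or 0)

tiles = sorted(IN_MOUNT.glob("tile_*.csv"))
tile = tiles[idx % len(tiles)]                  # this node's tile

payload = tile.read_bytes()                     # read from COS 1
prediction = payload                            # placeholder for inference
(OUT_MOUNT / tile.name.replace("tile", "pred")).write_bytes(prediction)  # write to COS 2
```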
Predicting soil moisture at large scale
28
| Region, Resolution | Data Type | Data Description | Points [#] | Size [B] |
|---|---|---|---|---|
| Midwest, 10 m x 10 m | Input | Satellite data, 27 km x 27 km | 450 | 44,198 |
| | Input | Terrain params (4 params), 10 m x 10 m | 2,921,927,661 | 280.51 G |
| | Predictions | 1 prediction, Midwest, 10 m x 10 m | 2,921,927,661 | 140.25 G |
| | Output | 1 visualization, Midwest, 10 m x 10 m | - | 4.57 M |
| | Total Midwest 10 m | | | 420.76 G |
| CONUS, 10 m x 10 m | Input | Satellite data, 27 km x 27 km | 8,214 | 985,600 |
| | Input | Terrain params (4 params), 10 m x 10 m | 99,925,046,500 | 9.00 T |
| | Predictions | 1 prediction, CONUS, 10 m x 10 m | 99,925,046,500 | 5.12 T |
| | Output | 1 visualization, CONUS, 10 m x 10 m | - | 21.14 M |
| | Total CONUS 10 m | | | 14,934.00 G |
29
Scaling for Midwest at 10 m
Midwest 10 m: 1.2 GB data read per node and 685 MB data written per node
[Plots: read and write bandwidth (average and outliers) on LSF and Kubernetes at 8, 16, 24, 32, 48, 94, 152, and 225 nodes; node: 2 cores, 16 GB RAM; 3 runs]
30
Scaling for CONUS at 10 m
CONUS 10 m: 5.1 GB data read per node and 2.8 GB data written per node
[Plots: read and write bandwidth in MB/s (average and outliers) on LSF and Kubernetes at 8, 16, 24, 32, 48, 94, and 152 nodes; node: 8 cores, 64 GB RAM; 3 runs]
Takeaways
31
The handling of large intermediate data, and the workflow's overall execution, can be significantly affected by the underlying computational infrastructure
Cloud infrastructure comes with new layers that need to be tuned for executing scientific workflows
The flexibility of cloud infrastructure enables automatic scalability of scientific applications with large data transformations
Next directions
32