1 of 4

LEM data flow

ESA WorldCover tile
File type: GeoTIFF
Dims:
Location: AWS public URL

EO data
File type: GeoTIFF
Dims: 256 x 256 x 18 b x 24 m
Location: gs://lem-tifs/tifs
Spec: 3040 files at 70-90 MB
Total: 220.06 GB

Notes:
The entire tile is a polygon, not a box (e.g. NaN ocean data is skipped).
256 x 256 dimensions must be used so that each file fits in the tif-to-np cloud function's memory.

Sampled EO data
File type: npy
Dims: 18 b x 24 m
Source: tif-to-np cloud function
Location: gs://lem-assets/npy
Spec: 1,145,889 files at 3.5 KB
Total: 3.82 GB

Notes:
Npy files are sampled every 10 pixels. Why is the total size reduction nearly 60x?

Theories: NaNs + lighter encoding + cloud function scaling, float32
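A rough sanity check of the size question, using only the figures on this slide. This is a sketch under stated assumptions, not a diagnosis: it reads "18 b x 24 m" as 18 x 24 values per pixel, treats "GB" as binary gigabytes, and compares the per-file size with a hypothetical float64 or float32 payload plus a typical 128-byte npy header.

```python
# Figures copied from the slide.
TIF_FILES = 3040
TIF_TOTAL_GB = 220.06
TILE_PIXELS = 256 * 256
NPY_FILES = 1_145_889
NPY_TOTAL_GB = 3.82
VALUES_PER_PIXEL = 18 * 24  # assumption: "18 b x 24 m" is 18 x 24 values

GIB = 1024 ** 3  # assumption: the slide's "GB" means binary gigabytes

# How many pixels went in versus how many samples came out.
total_pixels = TIF_FILES * TILE_PIXELS
sample_ratio = total_pixels / NPY_FILES

# Overall on-disk reduction.
size_ratio = TIF_TOTAL_GB / NPY_TOTAL_GB

# Average bytes stored per pixel on each side.
tif_bytes_per_pixel = TIF_TOTAL_GB * GIB / total_pixels
npy_bytes_per_sample = NPY_TOTAL_GB * GIB / NPY_FILES

# Hypothetical npy sizes: payload plus a typical 128-byte header.
float64_npy_bytes = VALUES_PER_PIXEL * 8 + 128
float32_npy_bytes = VALUES_PER_PIXEL * 4 + 128

print(f"pixels per kept sample: {sample_ratio:.1f}")
print(f"size reduction:         {size_ratio:.1f}x")
print(f"tif bytes per pixel:    {tif_bytes_per_pixel:.0f}")
print(f"npy bytes per sample:   {npy_bytes_per_sample:.0f}")
print(f"float64 npy would be:   {float64_npy_bytes}")
print(f"float32 npy would be:   {float32_npy_bytes}")
```

Under these assumptions, roughly 174 pixels go in for every sample kept, but each npy sample takes about three times as many bytes as a pixel does in the GeoTIFFs, which together give the ~58x figure. The 3.5 KB per file is closer to the float64 estimate than the float32 one, so the dtype written by the cloud function may be worth checking.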

WebDataset shard
File type: tar
Source: Cloud VM
Location: gs://lem-assets/tars
Spec: 1 tar file at 4.4 GB (within the recommended 1-10 GB range)

Notes:
The download time is frustrating given how little data there is.
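The Cloud VM packaging script is not shown in these slides. As a minimal sketch of what the packaging step might look like, the snippet below writes npy payloads into a plain tar using only the standard library. WebDataset groups tar members that share a basename into one sample, so each member is named `<key>.npy`. The function names and keys here are hypothetical.

```python
import io
import tarfile


def write_shard(samples, fileobj):
    """Write (key, payload_bytes) pairs into an uncompressed tar stream.

    Each member is named "<key>.npy"; WebDataset treats the part before
    the first dot as the sample key.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, payload in samples:
            info = tarfile.TarInfo(name=f"{key}.npy")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def list_shard(fileobj):
    """Return the member names of a tar shard, in stored order."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        return tar.getnames()


# Usage: build a tiny in-memory shard and read its index back.
buffer = io.BytesIO()
write_shard([("000000", b"abc"), ("000001", b"defg")], buffer)
buffer.seek(0)
names = list_shard(buffer)
print(names)
```

Streaming the payloads straight into the tar avoids writing a second copy of every npy file to the VM's disk before archiving.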

Training

Earth Engine [fetching] (~24 hrs)

Tif-to-np function [sampling] (mins)

Cloud VM script [packaging] (~3 hrs)

2 of 4

LEM data flow

ESA WorldCover tile
File type: GeoTIFF
Dims:
Location: AWS public URL

EO data
File type: GeoTIFF
Dims: 256 x 256 x 18 b x 24 m
Location: gs://lem-tifs/tifs
Spec: 3040 files at 70-90 MB
Total: 220.06 GB

Notes:
The entire tile is a polygon, not a box (e.g. NaN ocean data is skipped).
256 x 256 dimensions must be used so that each file fits in the tif-to-np cloud function's memory.

Sampled EO data
File type: npy
Dims: 600 x 18 b x 24 m
Source: tif-to-np cloud function
Location: gs://lem-assets/npy
Spec: 1884 files at 2 MB
Total: 3.82 GB

Notes:
Npy files are sampled every 10 pixels. Why is the total size reduction nearly 60x?

Theories: NaNs + lighter encoding + cloud function scaling, float32
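The tif-to-np function itself is not shown in these slides. The sketch below illustrates one way the sampling step could work, assuming a tile is already loaded as a NumPy array of shape (256, 256, 18, 24): take every 10th pixel along each spatial axis, drop pixels that contain any NaN, and group the survivors into batches of up to 600. The function names and the NaN rule are assumptions.

```python
import numpy as np


def sample_tile(tile, stride=10):
    """Keep every `stride`-th pixel of a (H, W, bands, steps) tile.

    Returns an array of shape (n, bands, steps). Pixels containing any
    NaN are skipped (assumption: the real function may use another rule).
    """
    sampled = tile[::stride, ::stride]
    flat = sampled.reshape(-1, *tile.shape[2:])
    keep = ~np.isnan(flat).any(axis=(1, 2))
    return flat[keep]


def batch_samples(samples, batch_size=600):
    """Split (n, bands, steps) samples into chunks of at most batch_size."""
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]


# Usage on a small synthetic tile: a 40 x 40 grid gives 4 x 4 = 16
# sampled pixels, one of which is NaN and is dropped.
tile = np.zeros((40, 40, 18, 24), dtype=np.float32)
tile[0, 0] = np.nan
samples = sample_tile(tile)
print(samples.shape)
```

Note that a stride of 10 along both axes keeps about 1 pixel in 100, not 1 in 10, before any NaN pixels are removed.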

WebDataset shard
File type: tar
Source: Cloud VM
Location: gs://lem-assets/tars
Spec: 1 tar file at 3.82 GB (within the recommended 1-10 GB range)

Training

Earth Engine [fetching] (~24 hrs)

Tif-to-np function [sampling] (mins)

Cloud VM script [packaging] (mins)

3 of 4

LEM data flow

2022-11-28

Training

S1_S2_ERA5_STRM( EEPipeline )

WorldCover2020( EEPipeline )

4 of 4

LEM data flow

2022-11-28