Geo for Good Summit Sept 25th: What’s Happening Now?

1:45-2:45pm
- Embedding Fields: Accelerate Your Mapping & Monitoring Workflows with Geospatial AI (Room: Experiment)
- Operational field data collection with Ground (Room: Carina Nebula)
- Profiling Scripts and Advanced Debugging (Room: Birr Castle)

2:45-3pm
- Short break - time to walk to your next session!

3-4pm
- Embedding Fields Classification with Open Architectures (Room: Experiment) (You are here)
- Monitoring Land Use Land Cover Changes with Dynamic World (Room: Carina Nebula)
- Demystifying Data Export and Extraction (Room: Birr Castle)

4-4:30pm
- Coffee break outside Experiment

4:30-5pm
- Hackathon Kickoff in Experiment
- Office Hours - meet in Hubble (by appointment only)
Embedding Fields Classification on
Open Architecture
Sean Wohltman, Kel Markert, Jeremy Malczyk, Gennadii Donchyts
September 2024 | #GeoForGood24
Agenda
01: EFM on Earth Engine: what it is and why we think it changes everything, and how EE makes classification “easy” on top of a dataset like this
02: Going outside the sandbox: Model-as-Data for use cases beyond Earth Engine
03: Exporting EFM data: how to get EFM data into storage you can use outside of EE
04: A Desktop / non-scaled Approach: frameworks / tools you can use in place of EE to run classifications on compute you manage
05: Running Classification at Scale: how you can leverage Kubernetes and Dask to classify at scale
Embedding Fields Model on Earth Engine
General Patterns for Geospatial Foundation Models (GeoFMs), across a range of users, task types and domains:
#1: Fine-tune task-specific models: models that specialize in a very specific task and set of classes.
#2: Train open-vocabulary models: build models that specialize in a type of task, but for any class.
#3: Use FM embeddings directly: multi-purpose remote sensing embeddings without specialization.
[Diagram: one foundational model feeding many task types (semantic segmentation, object detection, change detection, task type X) across domains (agri landscape, buildings & roads, tree species, oil/gas/infra, disaster impact, urban features, tasks A and B).]
Proprietary + Confidential
How does fine tuning work in practice?
Training a typical supervised remote sensing model
Large set of labeled images* (~10K-1M) → Model training with a supervised objective → A task-specific model
* In this example we use images as the basic training data for simplicity. In practice many other data sources like weather, DSMs, structured data could be used and labeled data might be at the pixel level (for segmentation) or language (for VQA etc.).
Fine tuning an FM for a specific task
First train (once) a foundational model…
Huge set of unlabeled imagery* (>10M) + Diverse set of pre-training tasks → Multi-task foundational model

… then fine tune it to (many) specific tasks:
Small set of labeled images* (10-1K) → FM fine-tuning (task-specific layers + fine-tuned layers) → Task-specific model
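The frozen-backbone idea above can be sketched locally: assume the foundational model has already turned each input into a 64-d embedding, and train only a small task-specific head on a small labeled set. This is a toy illustration with synthetic data and a scikit-learn logistic regression standing in for the task-specific layers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of fine-tuning-style specialization with a frozen backbone:
# pretend the FM already produced a 64-d embedding per sample, then
# train only a small head on a small labeled set. All data here is
# synthetic; the class centers are hypothetical.
rng = np.random.default_rng(0)
n_per_class, dim = 50, 64
centers = rng.normal(scale=5.0, size=(3, dim))        # 3 task classes
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat([0, 1, 2], n_per_class)

# The "task-specific layer": a linear head over frozen embeddings.
head = LogisticRegression(max_iter=1000).fit(X, y)
print(head.score(X, y))
```

Because the head is tiny relative to the backbone, this is why 10-1K labeled examples can suffice where training from scratch needs 10K-1M.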
130+ enterprise-ready foundation models in Vertex AI Model Garden
Google Foundation Models
Google Task Specific Models
Google Domain Specific Models
Partner & Open Ecosystem
Imagen 2
PaLM 2
Speech-to-Text
Text-to-Speech
Natural Language
Translation
Vision
Video Intelligence
Doc AI OCR
Occupancy analytics
MedLM
Life Science and Healthcare
Sec-PaLM
Cybersecurity
Llama 2, Code Llama
Falcon
Claude 2
Pre-announce
Vertex AI Model Garden
Codey
Chirp
Embeddings
Gemini Foundation Models: 1.0 Pro, 1.5 Pro, 1.0 Ultra
A Vision Transformer (ViT)
Earth Engine Timelapse
Segmented with SAM 2
Pros to the model checkpoint approach
Generalization:
Model checkpoints typically capture an inflection point in training, before diminishing returns, across a wide range of intended tasks. That makes them a good foundation for zero-shot attempts at those tasks, and a good starting point for fine-tuning.
Portability:
These are typically single files of a few gigabytes, small enough to fit in modern GPU RAM.
Flexibility:
They can easily be spread across numerous workers / GPUs to support high-volume inference and classification.
Cons to the model checkpoint approach
Everyone does the same thing with it:
Imagine many users classifying the exact same Sentinel-2 scene with the exact same model checkpoint and getting the exact same result. This is pretty wasteful!
You need to bring it data:
The checkpoint is just one component; you need to feed it input data (e.g. a new Sentinel-2 granule) to produce anything. Do you have a library of all the data you want to classify offline with you?
Temporal Context:
It is very hard to capture both spatial and temporal context in a checkpoint, especially globally.
Cost:
When you scale an inference or classification workload to many workers / GPUs, each one has to load the checkpoint into RAM, which can get very costly, very quickly!
Embedding Fields
Serving a model as data with the power of deep learning packed into every pixel
Bringing an EO paradigm shift to EE that only Alphabet can do
[Diagram: RGB channels (R, G, B) plus additional spectral bands (B04, B05, B06) packed into a 64-dimensional embedding space per pixel.]
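To make "deep learning packed into every pixel" concrete, here is a minimal NumPy sketch comparing two hypothetical 64-dimensional embedding pixels by cosine similarity. The values are random stand-ins, not real Embedding Fields data:

```python
import numpy as np

# Hypothetical 64-d embedding vectors for two pixels. In the real
# dataset each pixel carries such a vector instead of raw band values.
rng = np.random.default_rng(42)
pixel_a = rng.normal(size=64)
pixel_b = rng.normal(size=64)

def cosine_similarity(u, v):
    """1.0 for identical directions, near 0 for unrelated pixels."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(pixel_a, pixel_a))  # identical pixel -> 1.0
print(cosine_similarity(pixel_a, pixel_b))
```

Similarity in this 64-d space, rather than in raw reflectance, is what makes simple downstream classifiers so effective.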
Embedding Fields
The portability of Model as Data
Exporting EFM
Why use Zarr?
Zarr is a relatively new cloud-native, self-describing data format designed specifically for N-dimensional arrays with predefined chunks
“Use the right data model appropriate for your task”
Zarr storage model
├ Zarr dataset
├─── Dataset-level attributes (properties)
├─── Coordinates (x, y, time)
├─── Variable 1 (Band1)
├ ├─── Variable attributes (properties)
├ ├─── x1,y1,t1 data chunk
├ ├─── …
├ ├─── xi,yj,tk data chunk
├─── …
├─── Variable N (Band N)
EE Image Collection model
├ Image Collection ID
├─── Collection-level properties
├─── Image
├ ├─── Properties (t)
├ ├─── Band 1
├ ├ ├─── Coordinates (x,y)
├ ├─── …
├ ├─── Band N
├─── …
├─── Image M
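The Zarr layout above maps closely onto an xarray Dataset: dataset-level attributes, (x, y, time) coordinates, and one chunked variable per band. A toy sketch with synthetic data and hypothetical band names (`to_zarr` would materialize the chunked store):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for an exported EFM cube: dims (time, y, x),
# one variable per band, dataset-level attributes as metadata.
times = np.arange(2)
ys = np.linspace(0.0, 1.0, 4)
xs = np.linspace(0.0, 1.0, 4)

ds = xr.Dataset(
    {
        "band_1": (("time", "y", "x"), np.random.rand(2, 4, 4)),
        "band_2": (("time", "y", "x"), np.random.rand(2, 4, 4)),
    },
    coords={"time": times, "y": ys, "x": xs},
    attrs={"description": "toy dataset mirroring the Zarr storage model"},
)

# ds.to_zarr("efm_subset.zarr")  # would write the chunked store shown above
print(ds)
```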
Xarray
Earth Engine
Xee
Data Extraction
Wraps computePixels within the Xarray paradigm to take advantage of automatic chunking, retries, and merging of data
Plays nicely with other GCP compute services like Dataflow and Dataproc
Exporting data to Zarr format
Used xarray-beam and Xee with Dataflow to make parallel requests to EE and write to Zarr in parallel
A Desktop or non-scaled Approach
k-NN Regression / Classification
Offline with
Open Source Software
jupyter / ipyleaflet / scikit-learn
Create a Slippy Map with ipyleaflet and a TileLayer
Digitize the first Class
Class 3 …
Class 2 …
Create a GeoDataFrame
Show the 3 Classes on the Map
Select a subset of the Zarr
Spatial Join Training Points to the Xarray DataSet
Train a KNN Classifier from the training points and their intersected Embedding Field values and check the accuracy.
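A minimal, self-contained sketch of the k-NN training step with scikit-learn. Synthetic 64-d embeddings stand in for the values joined from the digitized training points, so the clusters and class labels here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for embedding values sampled at digitized points:
# three well-separated class clusters in a 64-d embedding space.
rng = np.random.default_rng(1)
centers = rng.normal(scale=5.0, size=(3, 64))
X = np.vstack([c + rng.normal(size=(40, 64)) for c in centers])
y = np.repeat([0, 1, 2], 40)

# Hold out a test split to check accuracy, as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```

Because the embedding space already separates land-cover classes well, a plain k-NN over it is often enough, with no deep-learning training loop on the user's side.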
Now use that model to predict the classes of all of the points
and plot the classified map
But this is still just point data...
To do this on the DataSet, it needs to be turned into an array and transposed
Then, the transposed array needs to be sliced along the time dimension
Next, create predictions using the previously trained model (knn)
The array with predicted classes can now be plotted
To export as a GeoTIFF, reshape the array
Then use rioxarray to convert array to raster and write as GeoTIFF
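The transpose / flatten / predict / reshape flow above, sketched with NumPy and a stand-in k-NN model. Shapes and data are synthetic; in the notebook the cube comes from the time-sliced DataSet and the final grid would go to rioxarray for GeoTIFF export:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
centers = rng.normal(scale=5.0, size=(3, 64))

# Train a stand-in classifier on synthetic 64-d embedding samples.
X_train = np.vstack([c + rng.normal(size=(30, 64)) for c in centers])
y_train = np.repeat([0, 1, 2], 30)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# A toy "embedding cube" shaped (band, y, x), as it might come out of
# the sliced DataSet: 64 bands over a 5x6 pixel window.
cube = centers[rng.integers(0, 3, size=(5, 6))]  # (y, x, 64)
cube = cube + rng.normal(size=cube.shape)
cube = np.moveaxis(cube, -1, 0)                  # -> (band, y, x)

# Transpose to (y, x, band), flatten to (pixels, band), predict,
# then reshape the predictions back to the (y, x) raster grid.
flat = np.moveaxis(cube, 0, -1).reshape(-1, cube.shape[0])
classes = knn.predict(flat).reshape(cube.shape[1], cube.shape[2])
print(classes.shape)  # (5, 6): one class per pixel, ready for plotting
```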
Running Classification at Scale
Google Kubernetes Engine (GKE)
Borg → Kubernetes
“So let me get this straight. You want to build an external version of the Borg task scheduler. One of our most important competitive advantages. The one we don’t even talk about externally. And, on top of that, you want to open source it?”
– Urs Hölzle
Kubernetes is winning container orchestration
Google Search Trends keyword search 25-Oct-15 to 25-Oct-18
Dask on Google Kubernetes Engine (GKE)
GKE, Kubernetes-as-a-service
[Diagram: GKE control plane (master) and worker nodes, driven via kubectl / helm and gcloud.]
This notebook installs Dask and will multithread locally
It will create a local cluster and Dask client
Call dask_ml KMeans to create 10 classes with unsupervised classification
When the predictions are completed, add them to the DataArray and unstack
Using the Embedding Fields with unsupervised classification gives us 10 very compelling classes!
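dask_ml's KMeans mirrors the scikit-learn API, so the unsupervised step can be sketched locally with scikit-learn on synthetic embeddings; swapping in dask_ml and Dask arrays is what lets the same code scale out on GKE. All values here are random stand-ins for the stacked Embedding Fields DataArray:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 64-d "embedding pixels"; in the notebook these come from
# the stacked Embedding Fields DataArray.
rng = np.random.default_rng(3)
pixels = rng.normal(size=(2000, 64))

# 10 unsupervised classes, as in the deck. With dask_ml this would be
# dask_ml.cluster.KMeans over a chunked Dask array instead.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pixels)
labels = km.labels_  # one of 10 class ids per pixel
print(labels.shape, labels.min(), labels.max())
```

The resulting label vector is what gets added back to the DataArray and unstacked into a classified map.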
Scaling Up and Out with Dask on GKE
Thank you!