1 of 60

AI for Science illustrated by Deep Learning for Geospatial Time Series


Geoffrey Fox, University of Virginia

John Rundle, UC Davis

Bo Feng, Indiana University

The IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC 2022)

January 27, 2022

or especially earthquake nowcasting


2 of 60

Abstract

  • AI is expected to transform both science and the approach to science.
  • As an example, we take the use of deep learning to describe geospatial time series. We present a general approach building on previous work on recurrent neural networks and transformers.
  • We give three examples of so-called spatial bags, from earthquake nowcasting, medical time series, and particle dynamics, and focus on the earthquake case. The latter is presented as an MLCommons benchmark challenge with three different implementations: a pure recurrent network, a spatio-temporal science transformer, and a version of the Google Temporal Fusion Transformer.
  • We discuss how deep learning is used to both clean up the inputs and describe hidden dynamics.
  • We show that both data engineering (wrangling data into desired input format) and data science (the deep learning training/inference) are needed and comment on achieving high performance in both. We briefly speculate how such particular examples can drive broad progress in AI for science.


3 of 60

Operator Formulation of Prediction

  • Suppose we are solving PDEs or sets of coupled ODEs
  • Typically we solve iteratively: New Values = (Differential Operator O) Previous Values
  • Theory-Driven: classic applied math gives you nifty difference equations and spectral methods to represent the operator numerically
  • Data-Driven: deep learning learns the operator from observation
  • Prediction is called Inference, and
  • Inference is New Values = (DL Operator O) Previous Values
  • This new nonlinear trained DL operator can allow much larger time steps, incorporate variations in parameters, learn potentials, etc.
  • The DL Operator O is the new data-driven theory (Newton's laws) of science
  • High-order approximations are traditionally very sensitive to noise and one was taught to avoid them, but deep NNs are the opposite -- both verbose and robust
    • Note a DL operator O with multiple LSTM layers has 1,000s to 10,000,000 parameters
    • Newton's laws for ODEs have 2-4 parameters


4 of 60

Predicting the Future with Time Series

  • Newton's laws: Force = mass × acceleration, with, say, in chemistry force = sum of forces from the other particles, or gravity = mass·g
    • m a(t) = F(x,t)
    • v(t+δt) = v(t) + δt a(t)
    • x(t+δt) = x(t) + δt v(t)
  • These define an operator that produces a time series x(t), v(t), a(t) initialized by values at t = 0 and 1
  • We want to take general time series and analyze them to find the "operator" and then use it to project into the future
    • Note the operator depends on time-series-dependent parameters such as mass and force, so we cannot expect a universal operator
    • e.g. for COVID-19 time series of cases and fatalities, the operator for New York City differs from that for Charlottesville
    • Initial conditions are also very different
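To make the classical "operator" concrete, here is a minimal Python sketch of one difference-equation step matching the updates above. The force law, mass, step size and initial values are illustrative placeholders, not numbers from the talk.

```python
# One application of the classical time-stepping operator: a = F/m, then advance x and v.
import numpy as np

def step(x, v, t, dt, mass, force):
    a = force(x, t) / mass        # m a(t) = F(x, t)
    x_new = x + dt * v            # x(t+dt) = x(t) + dt v(t)
    v_new = v + dt * a            # v(t+dt) = v(t) + dt a(t)
    return x_new, v_new

# Example: a particle falling under gravity, F = (0, -m g)
g, dt = 9.81, 1e-3
x, v = np.array([0.0, 100.0]), np.array([5.0, 0.0])
for i in range(1000):
    x, v = step(x, v, i * dt, dt, mass=1.0, force=lambda x, t: np.array([0.0, -g]))
```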


5 of 60

Learn Newton’s laws with Recurrent Neural Networks

  • Deep Learning is revolutionizing (spatial) time series analysis
  • A good example is integrating sets of differential equations
  • Train the network on traditional 5-time-step series from (Verlet) difference equations
  • Verlet needs a time step of 0.001 for reliable integration, but
  • the learnt LSTM network is reliable for time steps which are 4000 times longer, and also learns the potential
  • Speedup is 30000 on 16 particles interacting with Lennard-Jones potentials
  • 2-layer, 64-units-per-layer LSTM network: 65,072 trainable parameters
  • 5000 training simulations
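A minimal Keras sketch (not the authors' code) of this idea: map a short window of past particle states to the state one large step later. The two 64-unit LSTM layers follow the slide; the window length, state dimension and random data are placeholders standing in for the 5000 Verlet training simulations.

```python
# Train an LSTM "integrator": past window of states -> next state.
import numpy as np
import tensorflow as tf

window, state_dim = 5, 6                       # illustrative sizes, e.g. a few (x, v) coordinates
past = np.random.rand(1024, window, state_dim).astype("float32")     # placeholder training data
future = np.random.rand(1024, state_dim).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, state_dim)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(state_dim),          # learned operator O: window of past values -> new values
])
model.compile(optimizer="adam", loss="mse")
model.fit(past, future, epochs=2, batch_size=64, verbose=0)
```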

Figure: RNN squared error up to step size dT = 4 and total time 10^6, compared with the Verlet squared error at dT = 0.01 and 0.1.

JCS Kadupitiya, Vikram Jadhao


6 of 60

Basic Spatial (bag) Time Series

Figure: schematic of a spatial bag. The data analysis unit is the time sequence at one space point. Along space x sit different data sources (not necessarily nearby); along time t each location carries input properties, both static (e.g. % seniors) and dynamic (e.g. Covid cases per day). From these one can predict now, forecast the future (any number of time units, any number of properties), or perform a Seq2Seq map as in English to French or rainfall to runoff.

For Natural Language Processing, space points are different paragraphs or books, with a few sentences at each point.


7 of 60

1990-2019 dataset overview: (a) 444,589 events with magnitude >= 0; (b) 24,822 events with magnitude >= 2.5; (c) 2,489 events with magnitude >= 3.5; (d) 237 events with magnitude >= 4.5. The number of larger earthquakes is orders of magnitude smaller than the number of smaller ones.


8 of 60

Earthquakes and Deep Learning

  • There are at least two major computational tasks related to earthquakes
    1. Forecast the occurrence of an earthquake; data-driven?
    2. Predict damage once an earthquake happens; theory-driven?
  • The second task (2) is complicated but doable, as you can get data usable as boundary conditions for predicting the movement of waves of earthquake energy
  • Task (1) is challenging: there is physics (theory) governing the movement of the plates that generate an earthquake.
    • However, you don't know the details of the plates underground or the friction laws between them
    • Further, a quake is a "phase transition" and not a deterministic motion
  • Japan built a major computer -- the Earth Simulator -- to solve giant finite element models; FEM technology improved, but earthquake forecasting did not really
  • This implies that there are "lots of hidden variables", and one can hope that deep learning can model these with hidden neurons
  • Looking at it a different way, one can observe data about earthquakes (the seismic shocks) but not the data needed for a physics simulation
  • We need to train a neural network to map observed data into future earthquakes
  • Dogs barking etc. have also been used as possible harbingers (multi-modal data) of an earthquake


9 of 60

Data-driven or Theory Driven approaches

  • Computational science needs exascale supercomputers
  • Data-driven science needs big data, not necessarily big supercomputers
  • The theory-driven approach has lots of successes but also failures
    • We don't know enough to compute the theory
    • Theory is an incomplete model of nature, and it is not practical to parameterize the missing aspects of the theory
  • The data-driven method has spectacular successes, but its applicability is unclear in most areas of science


10 of 60

Structure of the Data

  • The earthquake data consists of sequences associated with each space pixel (0.1 by 0.1 degree subregion)
  • The events from USGS were initially binned daily for each space point but were then aggregated for two reasons
    1. We first accumulated data into 14-day (2-week) samples using the energy-based averaging described elsewhere
    2. Then we formed sequences of length up to 4 years each
  • These sequences were formed for every 2-week starting position, so they are not independent.
  • The deep learning models use sequences that have properties for each 2-week member, and predictions for each sequence for times later than those of any event included in the sequence
  • Both the choice of properties and predictions is quite rich, and we give our current choice later; it is surely not yet optimal.
  • We insisted that all properties were known (i.e. had no missing data) but predictions were allowed to be missing
    • It is not trivial to correct for missing input properties, but missing predictions are simple to address by adjusting the loss function (which sums independently over all predictions) to drop the least squares term corresponding to any missing data, as sketched below
    • The NaN symbol was used to flag this
    • The MSE or MAE loss is summed over the predictions in each sequence but averaged over sequences
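A minimal sketch of such a loss in TensorFlow, assuming missing targets are flagged with NaN as described above; this is an illustration, not the benchmark code.

```python
# Masked MSE: squared error summed over the non-missing predictions of each sequence,
# then averaged over the sequences in the batch.
import tensorflow as tf

def masked_mse(y_true, y_pred):
    mask = tf.math.logical_not(tf.math.is_nan(y_true))            # False where a target is missing
    y_true_clean = tf.where(mask, y_true, tf.zeros_like(y_true))  # replace NaNs so arithmetic is safe
    sq_err = tf.where(mask, tf.square(y_true_clean - y_pred), tf.zeros_like(y_pred))
    per_sequence = tf.reduce_sum(sq_err, axis=-1)                  # sum over predictions in a sequence
    return tf.reduce_mean(per_sequence)                            # average over sequences
```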


11 of 60

Energy Weighted Quantities

  • Use logEnergy = mbin for the magnitude averaged over a bin
  • We need to explore different observables: logEnergy and Energy^n for n = 0.25, 0.5 and 1.
  • As n increases (from n ~ 0, the log, to n = 1), one becomes more sensitive to large earthquakes but loses dynamic range in the network, as measured by mean value/maximum
  • Note all input properties and predictions are independently normalized to have maximum modulus 1 over all space and time values.
  • Current results use logEnergy in aggregating magnitudes over space and time; they use energy weighting for depth but simply add occurrence multiplicities
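An illustrative sketch of energy-based averaging of magnitudes in one space-time bin. It assumes the usual Gutenberg-Richter-style scaling log10(E) ~ 1.5 m (the additive constant cancels); the exact convention used in the benchmark (and whether the energies are summed or averaged over events) may differ.

```python
# Energy-weighted "logEnergy" magnitude for one bin: magnitudes -> relative energies,
# combine, convert back to an equivalent magnitude (dominated by the largest events).
import numpy as np

def mbin(magnitudes):
    if len(magnitudes) == 0:
        return 0.0
    energies = 10.0 ** (1.5 * np.asarray(magnitudes))
    return np.log10(energies.mean()) / 1.5     # energies.sum() is an equally plausible convention

print(mbin([2.0, 2.0, 5.0]))   # close to 5: the magnitude-5 event dominates the bin
```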


12 of 60

Different observable functions versus energy averaged magnitude

Figure panels: Multiplicity v. mag, Multiplicity m>3.29 v. mag, E^0.25 v. mag, and E^0.5 v. mag. Maximum of the mag plot set to 50% of the other variable plotted.


13 of 60

Choosing function f(x) of any input quantity x

  • There are not many rules in deep learning, but you can ask what makes sense
  • In general, if you want to feed in a variable x, any monotonic function f(x) can be used, but note deep learning adds inputs together and multiplies them by weights
  • Does adding magnitudes -- multiplying energies -- make sense?
  • But keeping |f(x)| around the range (0,1) is important for deep learning, as that's where activation layers work
  • Earthquake Energy/Maximum Earthquake Energy is in (0,1), but nearly every value is tiny, around 10^-5
  • So there is a conflict between "physics" and "numerics"
  • In the Covid analysis I used sqrt(x) and in hydrology x^(1/3)
    • x^n (n < 1) increases the importance of small observables -- increases dynamic range
    • sqrt(x) gives correct counting statistics (x has error sqrt(x)) for Mean Square Error
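A short sketch of the two steps described above: apply a power-law transform f(x) = x^n with n < 1 and then scale so the maximum modulus over all space and time values is 1. The exponent and the toy data are illustrative.

```python
# Transform an observable and normalize it into the range where activations work well.
import numpy as np

def transform_and_normalize(energy, n=0.25):
    f = energy ** n                      # x**n with n < 1 boosts small observables (more dynamic range)
    return f / np.max(np.abs(f))         # scale so |f| <= 1 over all space and time values

E = np.random.lognormal(mean=0.0, sigma=3.0, size=(500, 1785))   # toy (location, time) grid
features = transform_and_normalize(E, n=0.25)
```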


14 of 60

LSTM Results - predict 2 weeks for magnitude


15 of 60

Magnitude replaced by Energy^0.25


Errors are dominated by big spikes, and the description is poor in the quiescent region -- if the value is small, then the absolute error is small even if the fractional error is large; Energy or Energy^0.5 is worse!

One could change the loss function.


16 of 60

Nash Sutcliffe Efficiency NSE

NSE = 1 − Σ_t (Q_o^t − Q_m^t)² / Σ_t (Q_o^t − Q̄_o)²

where Q_m^t is the model prediction for quantity Q at time t, Q_o^t is the observed value of Q, and Q̄_o is the mean value of Q_o^t over time.

We use Normalized Nash–Sutcliffe Efficiency NNSE = 1/(2-NSE) as a measure of fit quality

See https://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient
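A short helper computing NSE and the normalized NNSE = 1/(2 − NSE) used as the fit-quality measure; a sketch with the observed and modelled series passed as plain arrays.

```python
import numpy as np

def nnse(observed, modeled):
    observed, modeled = np.asarray(observed), np.asarray(modeled)
    nse = 1.0 - np.sum((observed - modeled) ** 2) / np.sum((observed - observed.mean()) ** 2)
    return 1.0 / (2.0 - nse)      # NNSE maps NSE in (-inf, 1] onto (0, 1]
```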


17 of 60

Depiction of Faults

  • It is attractive to build fault structure into the model, and various ideas including the use of graph neural nets were considered. However, here we adopt a simple, not very powerful approach that relates locations on the same fault
  • As shown in the diagram, we used known Southern California fault locations to group locations together into 36 fault families. These were labelled in four different ways by different choices of space-filling curves
  • These labels were carried by each location as a static property. These and a simple location label were the only static properties used. The fault labels were only used as properties and not as predictions


18 of 60

36 Fault Groupings


Note the region spans 32 to 36 degrees latitude and -120 to -114 degrees longitude


19 of 60

Training and Validation Datasets

  • We use the same dataset for validation and testing purposes; it comprised 20% of the sample
  • We tried to make the validation clean by making it largely independent of the training set. We achieved this by using totally different spatial locations for the two sets, with 100 of the 500 space bins in the sample randomly chosen for validation (sketched below)
  • The common alternative of choosing 20% of the sequence set would lead to biases, as sequences are not independent
    • An unbiased alternative could choose the final 20% of the time window from 1950-2020
  • There is a small correlation between the training and validation sets through the fault family labels
  • Note the transformer network masks out future times, so one can only learn from the past and not from the future.
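A minimal sketch of the spatial split described above: from the 500 retained locations, 100 are chosen at random for validation and the remaining 400 are used for training. Location IDs here are just indices; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
locations = np.arange(500)
validation = rng.choice(locations, size=100, replace=False)   # 100 randomly chosen space bins
training = np.setdiff1d(locations, validation)                 # the remaining 400
assert len(training) == 400 and len(validation) == 100
```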


20 of 60

The 500 locations used - logEnergy

The 2400 0.1 by 0.1 degree subregions analyzed

500 of these regions were used in the analysis after examination of the number of quakes with M > 3.29 in each region

400 (Red) for training

100 (Green) randomly chosen from 500 for validation

Remaining 1900 (black) not used

Locations of top 20 Earthquakes shown


21 of 60

Properties and Predictions

  • Properties and predictions were chosen from the same collection of data but used different time values (properties had to reflect the past and predictions the future).
  • This data collection consisted of different subclasses
    • Data input from USGS and binned into 2-week aggregations for each location
    • Static features, which here are currently just the fault and space position labels
    • Calculated quantities, which consisted of longer time averages with currently 4 years as the maximum interval
    • As discussed, we currently represent magnitude aggregation as logEnergy but intend to investigate other forms such as the Benioff strain (square root of energy) or more generally a power of energy to be determined
    • Mathematical expansion functions, which allow the network to learn rich time dependence
  • As we need to have no missing input data, we only used aggregations up to one year on input and dropped sequences starting in the first year to allow this. The predictions currently go up to 4 years and are missing for sequences ending in the final 4 years of the dataset
    • Missing predictions are easy to deal with; drop them from the MAE or MSE sum


22 of 60

Input and Output Variables

  • Observed Inputs are the basic measured quantities such as the magnitude and depth of earthquakes. In the Covid example, they are daily infections and fatalities, but also time-dependent auxiliary variables such as vaccination rate, measures of social distancing, hospitalization rates etc. In the earthquake case, bin multiplicity counts are auxiliary variables.
  • Divide observed inputs into "(a priori) unknown inputs" and targets
  • Known Inputs are an interesting concept that includes both static features and time series (known time-dependent features).
  • These are parameters that are known in both the past and the future, whereas observed inputs are only known in the past and need to be nowcast into the future.
  • In the Covid and earthquake examples the only measured known inputs are the static features.
  • In commercial applications, daily signatures of holidays or weekends are interesting known inputs.
  • In our analysis, we made extensive use of Mathematical known inputs -- functions of time in terms of which it appears natural to express the time dependence.
    • Here you can feed in approximate models or intuition as to the nature of the prediction
  • Targets are the functions of time that we are trying to predict. They often include all or a subset of the observed inputs, as for training they need to be known for times previous to that at which the nowcast was made.
  • We also sometimes used Synthetic Targets that were predicted but not in the input set. These need to be known for "all training times" so they can be used in training.


23 of 60

Known Inputs: Mathematical expansion functions

  • It seems natural to use mathematical functions to represent time dependence. In isolation these would be identical forms for all locations, but the different values of the other properties can make a complex location-dependent parameterization
  • We didn't find a full discussion of this but adopted a "top-down" and a "bottom-up" approach, sketched below
  • The bottom-up approach was motivated by our Covid study, which sees a strong weekly structure, and by hydrology, which sees a yearly structure -- in each case for daily data
  • Neither of these is appropriate here, so we allowed "bottom-up functions" of the form cos θ and sin θ (2 properties), where θ varies from 0 to 2π over a period of N 2-week intervals. We chose N = 8, 16, 32, 64, a total of 8 properties representing the fine time-scale behavior -- a classic trigonometric expansion
  • For top-down we allowed the choice P_l(cos θ), the Legendre polynomials, where cos θ varied from -1 to 1 over the full time range. This can represent long time scales. We chose l = 0 to 4 in the current models -- a classic expansion in orthogonal polynomials
  • Note all these functions are bounded between -1 and 1.
  • Note we used these in both properties and predictions, but with reduced weight in the predictions
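A sketch of generating these "known input" expansion functions: bottom-up cos/sin pairs with periods of N = 8, 16, 32, 64 two-week intervals, and top-down Legendre polynomials P_l with the argument running from -1 to 1 over the full time range (l = 0 to 4). Time is indexed in two-week steps; the number of time values is illustrative.

```python
import numpy as np
from numpy.polynomial import legendre

n_steps = 1785                                   # number of two-week time values (approximate)
t = np.arange(n_steps)

bottom_up = []
for N in (8, 16, 32, 64):
    theta = 2 * np.pi * t / N                    # one full cycle every N two-week periods
    bottom_up += [np.cos(theta), np.sin(theta)]  # 8 fine-time-scale properties in total

x = np.linspace(-1.0, 1.0, n_steps)              # "cos theta" over the full time range
top_down = [legendre.Legendre.basis(l)(x) for l in range(5)]   # P_0 .. P_4, long time scales

known_inputs = np.stack(bottom_up + top_down, axis=-1)         # shape (n_steps, 13), all in [-1, 1]
```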


24 of 60

LSTM/TFT Description of Covid Data (3142 Counties)

Uses Weekly property plus “top-down” Legendre Polynomials

500 most populous counties in the USA


25 of 60

Inputs and Outputs


LSTM and Science Transformer (these two differ only in their targets)

  • Static Known Inputs (5): 4 space-filling-curve labels of the fault grouping, linear label of the pixel
  • Dynamic Known Inputs (13): P_l(cos θ_Full) for l = 0 to 4; cos_period(t), sin_period(t) for period = 8, 16, 32, 64
  • Dynamic Unknown Inputs (9): energy-averaged depth, multiplicity, multiplicity of m > 3.29 events; mbin(B:Δt,t) for Δt = 2, 4, 8, 14, 26, 52 weeks
  • Targets (24): mbin(F:Δt,t) for Δt = 2, 4, 8, 14, 26, 52, 104, 208 weeks; also skip 52 weeks and predict the next 52, and skip 104 and predict the next 104. With relative weight 0.25, all the known inputs and the linear label of the pixel.

TFT

  • Static Known Inputs, Dynamic Known Inputs and Dynamic Unknown Inputs as above
  • Targets (4): mbin(F:Δt,t) for Δt = 2, 14, 26, 52 weeks, calculated for t-52 to t for the encoder and t to t+52 weeks for the decoder in 2-week intervals; 104 predictions per sequence.
  • Note the TFT targets are restricted to the time period of the decoder LSTM, although predicting the next 26 2-week mbin values does NOT allow one to predict the 52-week mbin, as adding and taking logs (roughly, adding and taking the maximum) do NOT commute

mbin(F:Δt,t) is the energy-averaged total magnitude over time Δt starting at time t


26 of 60

Architectures of LSTM and Hybrid Science Transformer

a) LSTM: Input (B, W, InProp) → Dense Encoder with activation (B, W, 128) → LSTM-1 (B, W, 48) → LSTM-2 (B, 48) → Dense Decoder with activation (B, 128) → Dense Output (B, OutPred); a sketch follows below.

b) Space-Time Transformer (for encoder) and LSTM (for decoder): the diagram merges the initial input with the final transformer output (optional, but best results if you do this) and passes the result through two LSTM layers to the outputs.
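A minimal Keras sketch of architecture (a), the pure LSTM model, following the shape annotations above: a dense encoder of width 128, two LSTM layers of width 48 (the second returning only its final state), a dense decoder of width 128, and a linear output over the predictions. The InProp/OutPred counts and the activation choice are assumptions, and in practice a masked loss (see the earlier sketch) would replace plain MSE.

```python
import tensorflow as tf

W, InProp, OutPred = 26, 27, 24          # window length and property counts: illustrative values

inputs = tf.keras.Input(shape=(W, InProp))                       # (B, W, InProp)
x = tf.keras.layers.Dense(128, activation="relu")(inputs)        # dense encoder, (B, W, 128)
x = tf.keras.layers.LSTM(48, return_sequences=True)(x)           # LSTM-1, (B, W, 48)
x = tf.keras.layers.LSTM(48)(x)                                  # LSTM-2, (B, 48)
x = tf.keras.layers.Dense(128, activation="relu")(x)             # dense decoder, (B, 128)
outputs = tf.keras.layers.Dense(OutPred)(x)                      # (B, OutPred)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```

With these sizes the trainable parameter count is of order 65K, consistent with the ~66-67K weights quoted later for the LSTM model.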


27 of 60

c) TFT Architecture

  • Encoder: Sequence (relatively simple) LSTM of All Inputs to “past/current” Targets
  • Decoder: Sequence (relatively simple) LSTM of Known Inputs to “Future” Targets
  • Temporal Transformer across LSTM Outputs refines “Future” Targets
  • All in Context of Static variables


  • Separate embedding for each input
  • Originally univariate output, but easily extended to multivariate
  • Quantile loss replaced by mean square error
  • Need to look separately at these different choices


28 of 60

Some details of Forecasting Models

  • We train models on the 400+100 locations already described.
  • Number of time values ~1785 (depends a bit on window size and quality cuts)
  • The data is completely shuffled over sequences and divided into batches of variable size
  • The number of sequences is #locations (~500) times the number of time values
    • This is so large that it tends to exhaust available memory, so we use "symbolic windows" -- hand TensorFlow the sequence definitions, not the sequences themselves, and generate the sequence windows dynamically for each batch (sketched below) -- which drastically reduces the memory needed at some cost in time
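One simple way to realize the "symbolic window" idea is a tf.keras.utils.Sequence that stores only (location, start-time) index pairs and materializes the actual windows batch by batch. This is a sketch, not the benchmark implementation; the toy target and shapes are assumptions.

```python
import numpy as np
import tensorflow as tf

class SymbolicWindows(tf.keras.utils.Sequence):
    def __init__(self, series, window, batch_size):
        # series: array of shape (n_locations, n_times, n_properties)
        super().__init__()
        self.series, self.window, self.batch_size = series, window, batch_size
        n_loc, n_time, _ = series.shape
        starts = [(l, t) for l in range(n_loc) for t in range(n_time - window)]
        self.starts = np.random.permutation(np.array(starts))     # shuffled window definitions only

    def __len__(self):
        return len(self.starts) // self.batch_size

    def __getitem__(self, idx):
        batch = self.starts[idx * self.batch_size:(idx + 1) * self.batch_size]
        X = np.stack([self.series[l, t:t + self.window] for l, t in batch])     # built on demand
        y = np.stack([self.series[l, t + self.window, 0] for l, t in batch])    # toy target: next value
        return X, y

# usage sketch: model.fit(SymbolicWindows(series, window=26, batch_size=64), epochs=...)
```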


29 of 60

Search Strategies in TFT and Science Transformer

  • Choose a group of items (space-time collections) to be considered together -- implement the full case statistically by a random choice of sequences linked in a single attention calculation
  • a) Temporal (TFT): at each location, look over the time window of size W = Tseq: O(N_S W²)
  • b) Spatial: look across items at a fixed position in the time window: O(N_S² W)
  • c) Full: look over all space-time windows in the batch: O(N_S² W²)


30 of 60

Forecasting Models Sample Sizes

  • The TFT and LSTM are typical deep learning networks with a significant (>= 64) batch size of simple one-time, one-location samples.
  • The Science Transformer model often has a batch size of 1, as it needs each sample to have multiple space and time sequences to find query-key matches.
  • This large sample size is not practical with our current memory, so we sample these matches statistically, with completely independent shuffling into groups whose lengths we investigated
    • More structured choices were investigated, but full shuffling appeared the best
    • All locations/times do appear once in each epoch, but not once in each sample
    • Multi-GPU parallel implementations could be investigated to explore larger sample sizes
    • This structure implies that the Science Transformer and LSTM end up with similar training times per epoch (the transformer is 15% slower than the LSTM), but the Science Transformer is much slower on inference as it must run over multiple statistical samples of the data. This is no great concern for this type of analysis


31 of 60

Figure: the joint models compared (LSTM, Modified Temporal Fusion Transformer, Science Transformer, AE-TCN Joint Model), all conditioned on a static context.

  • LSTM: embedder, 2-layer LSTM, output mapper (67K weights)
  • Modified Temporal Fusion Transformer: embedders, 2-layer LSTM as backward encoder and 2-layer LSTM as forward decoder, temporal attention, output mappers (8 million weights)
  • Science Transformer: embedder, space-time attention merged with a 2-layer LSTM decoder, output mapper (2.3 million weights)
  • AE-TCN Joint Model: AutoEncoder plus Temporal Convolutional Network, mapping an image to an image prediction


32 of 60

Lots of Weights to Train

  • Allowed by steepest descent which is insensitive to redundant parameters
  • Leads to overfitting: Training loss can be << Validation/Testing Loss
  • No obvious rules other than don’t be too greedy on training loss
  • Rate of change sensitive to learning rate (step size) and batch size

Figure panels: LSTM and TFT.


33 of 60

LSTM and Transformer Results - predict 2 weeks


34 of 60

LSTM and Transformer Results - predict 6 months


35 of 60

LSTM and Transformer Results - predict 4 years


36 of 60

NNSE Summed over Locations

Normalized Nash–Sutcliffe Efficiency NNSE

| Time Period | LSTM Train | TFT Train | Science Transformer Train | LSTM Validation | TFT Validation | Science Transformer Validation |
|---|---|---|---|---|---|---|
| 2 weeks   | 0.903 | 0.925 | 0.893 | 0.868 | 0.87  | 0.856 |
| 4 weeks   | 0.895 |       | 0.916 | 0.867 |       | 0.884 |
| 8 weeks   | 0.886 |       | 0.913 | 0.866 |       | 0.881 |
| 14 weeks  | 0.924 | 0.982 | 0.919 | 0.893 | 0.899 | 0.881 |
| 26 weeks  | 0.946 | 0.985 | 0.954 | 0.897 | 0.895 | 0.896 |
| 52 weeks  | 0.919 | 0.988 | 0.955 | 0.861 | 0.88  | 0.876 |
| 104 weeks | 0.923 |       | 0.937 | 0.853 |       | 0.83  |
| 208 weeks | 0.935 |       | 0.921 | 0.811 |       | 0.77  |

(The TFT targets only the 2, 14, 26 and 52 week horizons, so the other rows have no TFT entry.)

Validation results are similar between methods.

Training quality: TFT > Science Transformer > LSTM. Training results reflect the number of weights: TFT 8M > Science Transformer 2.3M >> LSTM 66K.


37 of 60

Comments on Deep Learning for Earthquakes

  • There is no clear consensus as to
    • Best-practice approach to time series
      • Treatment of static variables
      • Treatment of multi-horizon futures -- 2 weeks to 4 years in this case
      • Univariate versus multivariate
      • Spatial versus temporal attention
      • Role of "known inputs", from "is today a holiday" to a Legendre polynomial in time to an approximate model of the phenomena
    • Measure of success -- I have a liking for the Nash–Sutcliffe Efficiency
    • Definition of the test/validation set -- divide in space, time or both
    • Variables to look at: log(E) to E^1/4 to E^1/2 (Benioff strain) to E to counts to complex derived quantities such as eigenvalues
    • What is the role of faults?
  • Issues from "Computer Science" and "Earthquake Science" and some that mix both


38 of 60

Figure: AI for Science taxonomy.

  • Artificial Intelligence (AI): computational technologies to assist, augment and automate human activities.
  • ML & Probability (statistics: machine learning and probability): probability and the Bayesian approach provide the overall framework, and machine learning specific algorithms to analyze data and use "on its own" (data-driven) or in conjunction with theoretical ideas to give models, which are learnt from training and used in inference.
  • Deep Learning (neural networks): builds flexible models requiring less a priori knowledge, in terms of stacked layers of neural nets such as dense, recurrent, convolutional, graph.
  • Problem classes and specific tasks include: expert systems, sequence-to-sequence maps, forecasting, knowledge reasoning, anomaly detection, natural language processing, recommendation systems, simulation surrogates, vision and perception, regression, classification, clustering, topic modelling, random forests, autoencoders, generative adversarial networks, reinforcement learning, and transformers (attention).
  • Theory-driven (laws of nature, phenomenology, χ² fits) versus data-driven: Model → Nowcasting.


39 of 60

Choices in Scientific Discovery

  • Traditional Theoretical Science
    • Discover by thinking
  • Traditional Observational Science
    • Data interpreted by simple or theory models -- typically a few parameters (<= 100)
  • Computational Science
    • Need a good numerical formulation of the theory
  • Data-Driven Science and AI for Science (nowadays ~all Deep Learning)
    • Surrogates enable very fast ensembles of simulations
    • Time dependent deep learning derives evolutionary behavior of data (the hidden Newton’s laws) whether there is a viable theory or not
  • Please look at other time-dependent problems
    • Need to define Known Inputs
    • Static Features (equivalent of mass and g in Newton's laws)
    • Observed Inputs
    • Targets


40 of 60

Building a model in 1978-1979

  • Quantum-chromodynamic approach for the large-transverse-momentum production of particles and jets; R. P. Feynman, R. D. Field, and G. C. Fox, Phys. Rev. D 18, 3320 – published 1 November 1978
  • In those days models took a long time to develop and embodied the best physics known, needing a Nobel prize winner!!
  • One used χ² least-squares fits to determine the parameters of the model from data such as that on the left


41 of 60

More Such Physics Models

Figure: data from Fermilab experiments E110, E260 and E350 at 200 GeV for π0X, π0X0, η0X and η0X0 production versus -t, compared with the Field–Feynman–Fox model and a Regge theory model.


42 of 60

MLCommons Benchmarks


43 of 60

MLCommons (MLPerf) Consortium Deep Learning Benchmarks

Some Relevant Working Groups

  • Training
  • Inference (Batch and Streaming)
  • TinyML (embedded)
  • Power
  • Datasets
  • HPC (Supercomputer Implementations)
  • Research (Academic-Industry Links)
  • Science (AI for Science)
  • Best Practice (Software)
  • Logging/Infrastructure (metadata)


  • Major effort of 52 companies to produce benchmarks with ongoing challenges
  • Training is at v1.0 (fourth release)
  • Fox set up the Science Working Group with co-chair Tony Hey, who has a significant benchmarking group, SciML
    • Identified ~12 science benchmarks including light source, satellite, surrogate and time series
  • MLCommons aims to accelerate machine learning innovation to benefit everyone: benchmarking, datasets, best practices. Total effort ~50 FTE


44 of 60

MLCommons (MLPerf) Consortium Activity Areas


45 of 60

Science Research MLCommons working group

  • Science, like industry, involves edge and data-center issues, end-to-end systems, inference, and training. There are some similarities in the datasets and analytics, as both industry and science involve image data, but also differences; science data associated with simulations and particle physics experiments are quite different from most industry exemplars
  • When fully contributed, the benchmark suite will cover (at least) the following domains: material sciences, environmental sciences, life sciences, fusion, particle physics, astronomy, earthquake and earth sciences, with more than one representative problem from each of these domains


  • https://mlcommons.org/en/groups/research-science/
  • One aim is to provide a mechanism for assessing the capability of different ML models in addressing different scientific problems
  • i.e. one benchmark measure is scientific discovery
  • Cover rich range of problem classes
  • “End-to-end” is one class
  • Provide common environment to store and run benchmarks (Software)
  • 4 Initial Benchmarks (2 from DOE labs, 1 UK, 1 UVA)
  • Surrogates Included (1 from LLNL next round)
  • Lead use of FAIR metadata for MLCommons


46 of 60

Science-based Metrics

  • Metrics will include those measuring performance on science discovery, e.g., could be one or more of:
    • Accuracy achieved
    • Time to solution (to meet a specific accuracy target)
    • Top-1 or Top-5 score
    • Chance your home will suffer a big earthquake …..
  • Goal of our benchmarks is to stimulate development of new methods relevant for scientific outcomes. We aim to:
    • Offer well-defined “science data” sets
    • Provide a reference implementation - to help others overcome any format/interpretation/usage hurdles
    • Specify target benchmark metrics (to outperform)
    • Require a description of the improved method or code used by respondents
  • The science data should have enough richness to allow experimentation with innovative approaches.
  • Also include traditional system performance benchmarks


47 of 60


| Benchmark | Science | Task | Owner Institute | Specific Benchmark Issues |
|---|---|---|---|---|
| CloudMask | Climate | Segmentation | RAL | Classify cloud pixels in images |
| STEMDL | Material | Classification | ORNL | Classifying the space groups of materials from their electron diffraction patterns |
| CANDLE-UNO | Medicine | Classification | ANL | Cancer prediction at cellular, molecular and population levels |
| TEvolOp Forecasting | Earthquake | Regression | Virginia | Predict earthquake activity from recorded event data |
| ICF (Inertial Confinement Fusion) | Plasma Physics | Simulation surrogate | LLNL | There are other possible LLNL benchmarks from a collection of 10 |

Benchmark contains Datasets, Science Goals, Reference Implementations; hosted at SDSC or RAL

Specification of 4 Benchmarks https://drive.google.com/file/d/1BeefJTj4ZZL4Wa5c3zNz1l5nzQN-ktGR/view?usp=sharing


48 of 60

Current Science WG Benchmark Status


49 of 60

High Performance Data Engineering

Some Details on Cylon


50 of 60


Integrating Data Engineering and Data Science

NIPS 2015: http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

This well-known paper points out that parallel high-performance machine learning (the "ML Code" box in its figure) is perhaps the most fun but is just a part of the system. We need to integrate in the other data and orchestration components.

This integration is not very good or easy, partly because data management systems like Spark are JVM-based, which doesn't cleanly link to the C++/Python world of high-performance ML.

The ML code module is itself built up hierarchically from Numpy and Pandas operations (if Python).

We need to assemble ~10 large modules into the full workflow and efficiently execute Numpy/Pandas etc. inside the modules.


51 of 60

Data Engineering versus Data Science I: Deep Learning Workflow


Workflows often divide into two parts:

  • Data => Information: preprocessing -- Hadoop, Spark, Twister2, Scikit-Learn
  • Information => Knowledge: the compute-intensive step -- Cylon-enhanced Spark, Twister2, PyTorch and Tensorflow

followed by post-processing.


52 of 60

Data Engineering versus Data Science II

  • Data engineering includes producing structured data from raw data with ETL (Extract-Transform-Load) operations.
  • Data engineering enables Deep Learning (DL) and Machine Learning (ML) workflows.
  • There are no clear requirements, but it needs
    • the Java ecosystem, important with its networking focus
    • the Python ecosystem for user-facing capabilities
    • the C++ ecosystem for performance



53 of 60

Two ways


Lines of Open Source Code: Twister2 145000 (Java, Python)

Cylon 25000 (C++, Python, Java)

The pipeline runs from data collection through pre-processing (big data frameworks) to model training and inference/prediction (deep learning frameworks):

  • Data Collection: databases, APIs, adaptors, data streams, IoT, ...
  • Pre-Processing: normalization, filtering, transformation, aggregation, feature extraction, ...
  • Model Training: MLP, autoencoders, CNN, RNN, LSTM, ...
  • Inference/Prediction: classification, generation, prediction, pattern recognition, ...

Big data frameworks (Twister2, Spark, Flink, Hadoop, ...) cover the data engineering stages; deep learning frameworks (PyTorch, TensorFlow, MXNet, Keras, ...) cover the rest. The two ways to bridge them: from data management (DM) to DL with Twister2 (Twister2DL), and from deep learning (DL) back to DM with Cylon.


54 of 60

Must support Parallelism as Automatically as Possible

  • At a low level, parallelism will be
    • NCCL, MPI, Horovod etc.
    • or pleasingly parallel (many-task) computing
  • But the user can be shielded from this by libraries
  • Originally these libraries were linear algebra for simulations, implemented in (not very successful) programming models such as High Performance Fortran (HPF), C++ and Java
  • Data analytics has actually developed a more powerful set of such operator/function libraries, with
    • Pandas and Numpy array, table and dataframe operations
    • Deep learning in PyTorch and Tensorflow
    • Spark transformations
  • Further, this parallelism is proxied with Python frontends invoking parallel (C++) implementations
  • So the community is developing infrastructure for operator-based parallelism


55 of 60

Some Intrinsically Parallel Operators

  • Classic parallel computing (720 MPI functions)
    • AllReduce, Broadcast, Gather, Scatter
  • Linear algebra (320 functions in SCALAPACK at one precision)
    • Matrix and vector operations
  • Tables: 224 Pandas operators for the Dataframe out of 4782 total (see the sketch after this list)
    • Intersect: applicable on two tables having similar schemas, to keep only the records that are present in both tables.
    • Join: combines two tables based on the values of columns. Includes the variations Left, Right, Full, Outer, and Inner joins.
    • OrderBy: sorts the records of the table based on a specified column.
    • Aggregate: performs a calculation on a set of values (records) and outputs a single value (record). Aggregations include summation and multiplication.
    • GroupBy: groups the data using the given columns; GroupBy is usually followed by aggregate operations. Famous from MapReduce
  • Arrays (1085 Numpy)
  • Tensors (>700 Tensorflow, PyTorch, Keras)
    • All the myriad Numpy array operations
    • Add a layer to a deep learning network
    • Forward (calculate loss) and backward (calculate derivative) propagation
    • Checkpoint the weights of the network
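A tiny illustration of the table operators named above (Join, OrderBy, GroupBy, Aggregate) using Pandas dataframes; the data is made up.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "mag": [2.1, 3.4, 5.0]})
right = pd.DataFrame({"id": [2, 3, 4], "depth": [7.0, 11.5, 3.2]})

joined = left.merge(right, on="id", how="inner")         # Join (inner)
ordered = joined.sort_values("mag")                       # OrderBy
summary = joined.groupby("id").agg({"mag": "sum"})        # GroupBy followed by Aggregate
```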


56 of 60


57 of 60

Cylon: A High Performance Distributed Data Table

  • Cylon is a high-performance C++ kernel and a distributed runtime for data pre-processing
    • Apache Parquet and Arrow based storage and in-memory data structure
      • Supports seamless integration with deep learning workloads, Pandas and Numpy
      • Zero-copy data transfer between heterogeneous systems and languages.
  • Table API, an abstraction for ETL (extract, transform, load) for scientific computing and deep learning workloads including Pandas, HDF5
    • Join, Union, Intersect, Difference, Product, Project -- 36 operators
  • Currently we support Joins (all formats); the other components (see all those operators listed earlier) are complete or currently in development.
  • Written in C++, with APIs available in Java and Python (via Cython).
  • Cylon is the high-performance kernel of Twister2.
  • Future: link to RAPIDS (for NVIDIA GPUs), BlazingSQL (SQL operators), and other accelerators such as AMD and Intel GPUs


58 of 60

Strong Scaling Comparison with Other Frameworks

  • 200M records per table (left and right)
  • 160 processes across 10 Intel® Xeon® Platinum 8160 nodes with 255GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth

Figure: strong scaling for Inner join and Union.


59 of 60

Large Scale Experiments with PySpark and Cylon

  • Inner join -- up to 10B records per relation (left/right). The graph shows the total rows
  • 200 processes across 10 Intel® Xeon® Platinum 8160 nodes with 256GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth and 10Gbps Ethernet


60 of 60

Strong Scaling Comparison with Other Frameworks

  • 200 million records per table (left and right)
  • 160 processes across 10 Intel® Xeon® Platinum 8160 nodes with 255GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth

Figure: strong scaling for Aggregations and Group-by + aggregations.
