1 of 60

AI for Science illustrated by Deep Learning for Geospatial Time Series


Geoffrey Fox, University of Virginia

John Rundle, UC Davis

Bo Feng, Indiana University

The IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC 2022)

January 27, 2022

or especially earthquake nowcasting


2 of 60

Abstract

  • AI is expected to transform both science and the approach to science.
  • As an example, we take the use of deep learning to describe geospatial time series. We present a general approach building on previous work on recurrent neural networks and transformers.
  • We give three examples of so-called spatial bags, from earthquake nowcasting, medical time series, and particle dynamics, and focus on the earthquake case. The latter is presented as an MLCommons benchmark challenge with three different implementations: a pure recurrent network, a spatio-temporal science transformer, and a version of the Google Temporal Fusion Transformer.
  • We discuss how deep learning is used to both clean up the inputs and describe hidden dynamics.
  • We show that both data engineering (wrangling data into desired input format) and data science (the deep learning training/inference) are needed and comment on achieving high performance in both. We briefly speculate how such particular examples can drive broad progress in AI for science.


3 of 60

Operator Formulation of Prediction

  • Suppose we are solving PDEs or sets of coupled ODEs
  • Typically we solve iteratively: New Values = (Differential Operator O) Previous Values
  • Theory-Driven: classic applied math gives you nifty difference equations and spectral methods to represent the operator numerically
  • Data-Driven: deep learning learns the operator from observation
  • Prediction is called Inference, and
  • Inference is New Values = (DL Operator O) Previous Values
  • This new nonlinear trained DL operator can allow much larger time steps, incorporate variations in parameters, learn potentials, etc.
  • The DL Operator O is the new data-driven theory (Newton's laws) of science
  • High-order approximations are traditionally very sensitive to noise and one was taught to avoid them, but deep NNs are the opposite -- both verbose and robust
    • Note a DL operator O with multiple LSTM layers has 1,000s to 10,000,000 parameters
    • Newton's laws for ODEs have 2-4 parameters


4 of 60

Predicting the Future with Time Series

  • Newton's laws: Force = mass × acceleration, with, say, in chemistry force = sum of forces from the other particles, or gravity = mass·g
    • m a(t) = F(x,t)
    • v(t+δt) = v(t) + δt a(t)
    • x(t+δt) = x(t) + δt v(t)
  • These define an operator that produces a time series x(t), v(t), a(t) initialized by values at t = 0 and 1
  • We want to take general time series and analyze them to find the "operator" and then use it to project into the future
    • Note the operator depends on time-series-dependent parameters such as mass and force, so we cannot expect a universal operator
    • e.g. for COVID-19 time series of cases and fatalities, the operator for New York City differs from that for Charlottesville
    • Initial conditions are also very different
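To make the classical "operator" concrete, here is a minimal Python sketch of one difference-equation step matching the updates above. The force law, mass, step size and initial values are illustrative placeholders, not numbers from the talk.

```python
# One application of the classical time-stepping operator: a = F/m, then advance x and v.
import numpy as np

def step(x, v, t, dt, mass, force):
    a = force(x, t) / mass        # m a(t) = F(x, t)
    x_new = x + dt * v            # x(t+dt) = x(t) + dt v(t)
    v_new = v + dt * a            # v(t+dt) = v(t) + dt a(t)
    return x_new, v_new

# Example: a particle falling under gravity, F = (0, -m g)
g, dt = 9.81, 1e-3
x, v = np.array([0.0, 100.0]), np.array([5.0, 0.0])
for i in range(1000):
    x, v = step(x, v, i * dt, dt, mass=1.0, force=lambda x, t: np.array([0.0, -g]))
```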


5 of 60

Learn Newton’s laws with Recurrent Neural Networks

  • Deep Learning is revolutionizing (spatial) time series analysis
  • A good example is integrating sets of differential equations
  • Train the network on traditional 5-time-step series from (Verlet) difference equations
  • Verlet needs a time step of 0.001 for reliable integration, but
  • the learnt LSTM network is reliable for time steps which are 4000 times longer, and also learns the potential
  • Speedup is 30000 on 16 particles interacting with Lennard-Jones potentials
  • 2-layer, 64-units-per-layer LSTM network: 65,072 trainable parameters
  • 5000 training simulations
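A minimal Keras sketch (not the authors' code) of this idea: map a short window of past particle states to the state one large step later. The two 64-unit LSTM layers follow the slide; the window length, state dimension and random data are placeholders standing in for the 5000 Verlet training simulations.

```python
# Train an LSTM "integrator": past window of states -> next state.
import numpy as np
import tensorflow as tf

window, state_dim = 5, 6                       # illustrative sizes, e.g. a few (x, v) coordinates
past = np.random.rand(1024, window, state_dim).astype("float32")     # placeholder training data
future = np.random.rand(1024, state_dim).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, state_dim)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(state_dim),          # learned operator O: window of past values -> new values
])
model.compile(optimizer="adam", loss="mse")
model.fit(past, future, epochs=2, batch_size=64, verbose=0)
```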

Figure: RNN squared error up to step size dT = 4 and total time 10^6, compared with the Verlet squared error at dT = 0.01 and 0.1.

JCS Kadupitiya, Vikram Jadhao


6 of 60

Basic Spatial (bag) Time Series

Figure: schematic of a spatial bag. The data analysis unit is the time sequence at one space point. Along space x sit different data sources (not necessarily nearby); along time t each location carries input properties, both static (e.g. % seniors) and dynamic (e.g. Covid cases per day). From these one can predict now, forecast the future (any number of time units, any number of properties), or perform a Seq2Seq map as in English to French or rainfall to runoff.

For Natural Language Processing, space points are different paragraphs or books, with a few sentences at each point.


7 of 60

1990-2019 dataset overview: (a) 444,589 events with magnitude >= 0; (b) 24,822 events with magnitude >= 2.5; (c) 2,489 events with magnitude >= 3.5; (d) 237 events with magnitude >= 4.5. The number of larger earthquakes is orders of magnitude smaller than the number of smaller ones.


8 of 60

Earthquakes and Deep Learning

  • There are at least two major computational tasks related to earthquakes
    1. Forecast the occurrence of an earthquake; data-driven?
    2. Predict damage once an earthquake happens; theory-driven?
  • The second task (2) is complicated but doable, as you can get data usable as boundary conditions for predicting the movement of waves of earthquake energy
  • Task (1) is challenging: there is physics (theory) governing the movement of the plates that generate an earthquake.
    • However, you don't know the details of the plates underground or the friction laws between them
    • Further, a quake is a "phase transition" and not a deterministic motion
  • Japan built a major computer -- the Earth Simulator -- to solve giant finite element models; FEM technology improved, but earthquake forecasting did not really
  • This implies that there are "lots of hidden variables", and one can hope that deep learning can model these with hidden neurons
  • Looking at it a different way, one can observe data about earthquakes (the seismic shocks) but not the data needed for a physics simulation
  • We need to train a neural network to map observed data into future earthquakes
  • Dogs barking etc. have also been used as possible harbingers (multi-modal data) of an earthquake


9 of 60

Data-driven or Theory Driven approaches

  • Computational science needs exascale supercomputers
  • Data-driven science needs big data, not necessarily big supercomputers
  • The theory-driven approach has lots of successes but also failures
    • We don't know enough to compute the theory
    • Theory is an incomplete model of nature, and it is not practical to parameterize the missing aspects of the theory
  • The data-driven method has spectacular successes, but its applicability is unclear in most areas of science


10 of 60

Structure of the Data

  • The earthquake data consists of sequences associated with each space pixel (0.1 by 0.1 degree subregion)
  • The events from USGS were initially binned daily for each space point but were then aggregated for two reasons
    1. We first accumulated data into 14-day (2-week) samples using the energy-based averaging described elsewhere
    2. Then we formed sequences of length up to 4 years each
  • These sequences were formed for every 2-week starting position, so they are not independent.
  • The deep learning models use sequences that have properties for each 2-week member, and predictions for each sequence for times later than those of any event included in the sequence
  • Both the choice of properties and predictions is quite rich, and we give our current choice later; it is surely not yet optimal.
  • We insisted that all properties were known (i.e. had no missing data) but predictions were allowed to be missing
    • It is not trivial to correct for missing input properties, but missing predictions are simple to address by adjusting the loss function (which sums independently over all predictions) to drop the least squares term corresponding to any missing data, as sketched below
    • The NaN symbol was used to flag this
    • The MSE or MAE loss is summed over the predictions in each sequence but averaged over sequences
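A minimal sketch of such a loss in TensorFlow, assuming missing targets are flagged with NaN as described above; this is an illustration, not the benchmark code.

```python
# Masked MSE: squared error summed over the non-missing predictions of each sequence,
# then averaged over the sequences in the batch.
import tensorflow as tf

def masked_mse(y_true, y_pred):
    mask = tf.math.logical_not(tf.math.is_nan(y_true))            # False where a target is missing
    y_true_clean = tf.where(mask, y_true, tf.zeros_like(y_true))  # replace NaNs so arithmetic is safe
    sq_err = tf.where(mask, tf.square(y_true_clean - y_pred), tf.zeros_like(y_pred))
    per_sequence = tf.reduce_sum(sq_err, axis=-1)                  # sum over predictions in a sequence
    return tf.reduce_mean(per_sequence)                            # average over sequences
```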


11 of 60

Energy Weighted Quantities

  • Use logEnergy = mbin for the magnitude averaged over a bin
  • We need to explore different observables: logEnergy and Energy^n for n = 0.25, 0.5 and 1.
  • As n increases (from n ~ 0, the log, to n = 1), one becomes more sensitive to large earthquakes but loses dynamic range in the network, as measured by mean value/maximum
  • Note all input properties and predictions are independently normalized to have maximum modulus 1 over all space and time values.
  • Current results use logEnergy in aggregating magnitudes over space and time; they use energy weighting for depth but simply add occurrence multiplicities
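An illustrative sketch of energy-based averaging of magnitudes in one space-time bin. It assumes the usual Gutenberg-Richter-style scaling log10(E) ~ 1.5 m (the additive constant cancels); the exact convention used in the benchmark (and whether the energies are summed or averaged over events) may differ.

```python
# Energy-weighted "logEnergy" magnitude for one bin: magnitudes -> relative energies,
# combine, convert back to an equivalent magnitude (dominated by the largest events).
import numpy as np

def mbin(magnitudes):
    if len(magnitudes) == 0:
        return 0.0
    energies = 10.0 ** (1.5 * np.asarray(magnitudes))
    return np.log10(energies.mean()) / 1.5     # energies.sum() is an equally plausible convention

print(mbin([2.0, 2.0, 5.0]))   # close to 5: the magnitude-5 event dominates the bin
```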


12 of 60

Different observable functions versus energy averaged magnitude

Figure panels: Multiplicity v. mag, Multiplicity m>3.29 v. mag, E^0.25 v. mag, and E^0.5 v. mag. Maximum of the mag plot set to 50% of the other variable plotted.


13 of 60

Choosing function f(x) of any input quantity x

  • There are not many rules in deep learning, but you can ask what makes sense
  • In general, if you want to feed in a variable x, any monotonic function f(x) can be used, but note deep learning adds inputs together and multiplies them by weights
  • Does adding magnitudes -- multiplying energies -- make sense?
  • But keeping |f(x)| around the range (0,1) is important for deep learning, as that's where activation layers work
  • Earthquake Energy/Maximum Earthquake Energy is in (0,1), but nearly every value is tiny, around 10^-5
  • So there is a conflict between "physics" and "numerics"
  • In the Covid analysis I used sqrt(x) and in hydrology x^(1/3)
    • x^n (n < 1) increases the importance of small observables -- increases dynamic range
    • sqrt(x) gives correct counting statistics (x has error sqrt(x)) for Mean Square Error
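A short sketch of the two steps described above: apply a power-law transform f(x) = x^n with n < 1 and then scale so the maximum modulus over all space and time values is 1. The exponent and the toy data are illustrative.

```python
# Transform an observable and normalize it into the range where activations work well.
import numpy as np

def transform_and_normalize(energy, n=0.25):
    f = energy ** n                      # x**n with n < 1 boosts small observables (more dynamic range)
    return f / np.max(np.abs(f))         # scale so |f| <= 1 over all space and time values

E = np.random.lognormal(mean=0.0, sigma=3.0, size=(500, 1785))   # toy (location, time) grid
features = transform_and_normalize(E, n=0.25)
```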


14 of 60

LSTM Results - predict 2 weeks for magnitude


15 of 60

Magnitude replaced by Energy^0.25


Errors are dominated by big spikes, and the description is poor in the quiescent region -- if the value is small, then the absolute error is small even if the fractional error is large; Energy or Energy^0.5 is worse!

One could change the loss function.


16 of 60

Nash Sutcliffe Efficiency NSE

NSE = 1 − Σ_t (Q_o^t − Q_m^t)² / Σ_t (Q_o^t − Q̄_o)²

where Q_m^t is the model prediction for quantity Q at time t, Q_o^t is the observed value of Q, and Q̄_o is the mean value of Q_o^t over time.

We use Normalized Nash–Sutcliffe Efficiency NNSE = 1/(2-NSE) as a measure of fit quality

See https://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient
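A short helper computing NSE and the normalized NNSE = 1/(2 − NSE) used as the fit-quality measure; a sketch with the observed and modelled series passed as plain arrays.

```python
import numpy as np

def nnse(observed, modeled):
    observed, modeled = np.asarray(observed), np.asarray(modeled)
    nse = 1.0 - np.sum((observed - modeled) ** 2) / np.sum((observed - observed.mean()) ** 2)
    return 1.0 / (2.0 - nse)      # NNSE maps NSE in (-inf, 1] onto (0, 1]
```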


17 of 60

Depiction of Faults

  • It is attractive to build fault structure into the model, and various ideas including the use of graph neural nets were considered. However, here we adopt a simple, not very powerful approach that relates locations on the same fault
  • As shown in the diagram, we used known Southern California fault locations to group locations together into 36 fault families. These were labelled in four different ways by different choices of space-filling curves
  • These labels were carried by each location as a static property. These and a simple location label were the only static properties used. The fault labels were only used as properties and not as predictions


18 of 60

36 Fault Groupings


Note the region spans 32 to 36 degrees latitude and -120 to -114 degrees longitude


19 of 60

Training and Validation Datasets

  • We use the same dataset for validation and testing purposes; it comprised 20% of the sample
  • We tried to make the validation clean by making it largely independent of the training set. We achieved this by using totally different spatial locations for the two sets, with 100 of the 500 space bins in the sample randomly chosen for validation (sketched below)
  • The common alternative of choosing 20% of the sequence set would lead to biases, as sequences are not independent
    • An unbiased alternative could choose the final 20% of the time window from 1950-2020
  • There is a small correlation between the training and validation sets through the fault family labels
  • Note the transformer network masks out future times, so one can only learn from the past and not from the future.
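A minimal sketch of the spatial split described above: from the 500 retained locations, 100 are chosen at random for validation and the remaining 400 are used for training. Location IDs here are just indices; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
locations = np.arange(500)
validation = rng.choice(locations, size=100, replace=False)   # 100 randomly chosen space bins
training = np.setdiff1d(locations, validation)                 # the remaining 400
assert len(training) == 400 and len(validation) == 100
```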


20 of 60

The 500 locations used - logEnergy

The 2400 0.1 by 0.1 degree subregions analyzed

500 of these regions were used in the analysis after examination of the number of quakes with M > 3.29 in each region

400 (Red) for training

100 (Green) randomly chosen from 500 for validation

Remaining 1900 (black) not used

Locations of top 20 Earthquakes shown


21 of 60

Properties and Predictions

  • Properties and predictions were chosen from the same collection of data but used different time values (properties had to reflect the past and predictions the future).
  • This data collection consisted of different subclasses
    • Data input from USGS and binned into 2-week aggregations for each location
    • Static features, which here are currently just the fault and space position labels
    • Calculated quantities, which consisted of longer time averages with currently 4 years as the maximum interval
    • As discussed, we currently represent magnitude aggregation as logEnergy but intend to investigate other forms such as the Benioff strain (square root of energy) or more generally a power of energy to be determined
    • Mathematical expansion functions, which allow the network to learn rich time dependence
  • As we need to have no missing input data, we only used aggregations up to one year on input and dropped sequences starting in the first year to allow this. The predictions currently go up to 4 years and are missing for sequences ending in the final 4 years of the dataset
    • Missing predictions are easy to deal with; drop them from the MAE or MSE sum


22 of 60

Input and Output Variables

  • Observed Inputs are the basic measured quantities such as the magnitude and depth of earthquakes. In the Covid example, they are daily infections and fatalities, but also time-dependent auxiliary variables such as vaccination rate, measures of social distancing, hospitalization rates etc. In the earthquake case, bin multiplicity counts are auxiliary variables.
  • Divide observed inputs into "(a priori) unknown inputs" and targets
  • Known Inputs are an interesting concept that includes both static features and time series (known time-dependent features).
  • These are parameters that are known in both the past and the future, whereas observed inputs are only known in the past and need to be nowcast into the future.
  • In the Covid and earthquake examples the only measured known inputs are the static features.
  • In commercial applications, daily signatures of holidays or weekends are interesting known inputs.
  • In our analysis, we made extensive use of Mathematical known inputs -- functions of time in terms of which it appears natural to express the time dependence.
    • Here you can feed in approximate models or intuition as to the nature of the prediction
  • Targets are the functions of time that we are trying to predict. They often include all or a subset of the observed inputs, as for training they need to be known for times previous to that at which the nowcast was made.
  • We also sometimes used Synthetic Targets that were predicted but not in the input set. These need to be known for "all training times" so they can be used in training.


23 of 60

Known Inputs: Mathematical expansion functions

  • It seems natural to use mathematical functions to represent time dependence. In isolation these would be identical forms for all locations, but the different values of the other properties can make a complex location-dependent parameterization
  • We didn't find a full discussion of this but adopted a "top-down" and a "bottom-up" approach, sketched below
  • The bottom-up approach was motivated by our Covid study, which sees a strong weekly structure, and by hydrology, which sees a yearly structure -- in each case for daily data
  • Neither of these is appropriate here, so we allowed "bottom-up functions" of the form cos θ and sin θ (2 properties), where θ varies from 0 to 2π over a period of N 2-week intervals. We chose N = 8, 16, 32, 64, a total of 8 properties representing the fine time-scale behavior -- a classic trigonometric expansion
  • For top-down we allowed the choice P_l(cos θ), the Legendre polynomials, where cos θ varied from -1 to 1 over the full time range. This can represent long time scales. We chose l = 0 to 4 in the current models -- a classic expansion in orthogonal polynomials
  • Note all these functions are bounded between -1 and 1.
  • Note we used these in both properties and predictions, but with reduced weight in the predictions
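A sketch of generating these "known input" expansion functions: bottom-up cos/sin pairs with periods of N = 8, 16, 32, 64 two-week intervals, and top-down Legendre polynomials P_l with the argument running from -1 to 1 over the full time range (l = 0 to 4). Time is indexed in two-week steps; the number of time values is illustrative.

```python
import numpy as np
from numpy.polynomial import legendre

n_steps = 1785                                   # number of two-week time values (approximate)
t = np.arange(n_steps)

bottom_up = []
for N in (8, 16, 32, 64):
    theta = 2 * np.pi * t / N                    # one full cycle every N two-week periods
    bottom_up += [np.cos(theta), np.sin(theta)]  # 8 fine-time-scale properties in total

x = np.linspace(-1.0, 1.0, n_steps)              # "cos theta" over the full time range
top_down = [legendre.Legendre.basis(l)(x) for l in range(5)]   # P_0 .. P_4, long time scales

known_inputs = np.stack(bottom_up + top_down, axis=-1)         # shape (n_steps, 13), all in [-1, 1]
```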


24 of 60

LSTM/TFT Description of Covid Data (3142 Counties)

Uses Weekly property plus “top-down” Legendre Polynomials

500 most populous counties in the USA


25 of 60

Inputs and Outputs


LSTM and Science Transformer (these two differ only in their targets)

  • Static Known Inputs (5): 4 space-filling-curve labels of the fault grouping, linear label of the pixel
  • Dynamic Known Inputs (13): P_l(cos θ_Full) for l = 0 to 4; cos_period(t), sin_period(t) for period = 8, 16, 32, 64
  • Dynamic Unknown Inputs (9): energy-averaged depth, multiplicity, multiplicity of m > 3.29 events; mbin(B:Δt,t) for Δt = 2, 4, 8, 14, 26, 52 weeks
  • Targets (24): mbin(F:Δt,t) for Δt = 2, 4, 8, 14, 26, 52, 104, 208 weeks; also skip 52 weeks and predict the next 52, and skip 104 and predict the next 104. With relative weight 0.25, all the known inputs and the linear label of the pixel.

TFT

  • Static Known Inputs, Dynamic Known Inputs and Dynamic Unknown Inputs as above
  • Targets (4): mbin(F:Δt,t) for Δt = 2, 14, 26, 52 weeks, calculated for t-52 to t for the encoder and t to t+52 weeks for the decoder in 2-week intervals; 104 predictions per sequence.
  • Note the TFT targets are restricted to the time period of the decoder LSTM, although predicting the next 26 2-week mbin values does NOT allow one to predict the 52-week mbin, as adding and taking logs (roughly, adding and taking the maximum) do NOT commute

mbin(F:Δt,t) is the energy-averaged total magnitude over time Δt starting at time t


26 of 60

Architectures of LSTM and Hybrid Science Transformer

a) LSTM: Input (B, W, InProp) → Dense Encoder with activation (B, W, 128) → LSTM-1 (B, W, 48) → LSTM-2 (B, 48) → Dense Decoder with activation (B, 128) → Dense Output (B, OutPred); a sketch follows below.

b) Space-Time Transformer (for encoder) and LSTM (for decoder): the diagram merges the initial input with the final transformer output (optional, but best results if you do this) and passes the result through two LSTM layers to the outputs.
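A minimal Keras sketch of architecture (a), the pure LSTM model, following the shape annotations above: a dense encoder of width 128, two LSTM layers of width 48 (the second returning only its final state), a dense decoder of width 128, and a linear output over the predictions. The InProp/OutPred counts and the activation choice are assumptions, and in practice a masked loss (see the earlier sketch) would replace plain MSE.

```python
import tensorflow as tf

W, InProp, OutPred = 26, 27, 24          # window length and property counts: illustrative values

inputs = tf.keras.Input(shape=(W, InProp))                       # (B, W, InProp)
x = tf.keras.layers.Dense(128, activation="relu")(inputs)        # dense encoder, (B, W, 128)
x = tf.keras.layers.LSTM(48, return_sequences=True)(x)           # LSTM-1, (B, W, 48)
x = tf.keras.layers.LSTM(48)(x)                                  # LSTM-2, (B, 48)
x = tf.keras.layers.Dense(128, activation="relu")(x)             # dense decoder, (B, 128)
outputs = tf.keras.layers.Dense(OutPred)(x)                      # (B, OutPred)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```

With these sizes the trainable parameter count is of order 65K, consistent with the ~66-67K weights quoted later for the LSTM model.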


27 of 60

c) TFT Architecture

  • Encoder: Sequence (relatively simple) LSTM of All Inputs to “past/current” Targets
  • Decoder: Sequence (relatively simple) LSTM of Known Inputs to “Future” Targets
  • Temporal Transformer across LSTM Outputs refines “Future” Targets
  • All in Context of Static variables


  • Separate embedding for each input
  • Originally univariate output, but easily extended to multivariate
  • Quantile loss replaced by mean square error
  • Need to look separately at these different choices


28 of 60

Some details of Forecasting Models

  • We train models on the 400+100 locations already described.
  • Number of time values ~1785 (depends a bit on window size and quality cuts)
  • The data is completely shuffled over sequences and divided into batches of variable size
  • The number of sequences is #locations (~500) times the number of time values
    • This is so large that it tends to exhaust available memory, so we use "symbolic windows" -- hand TensorFlow the sequence definitions, not the sequences themselves, and generate the sequence windows dynamically for each batch (sketched below) -- which drastically reduces the memory needed at some cost in time
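One simple way to realize the "symbolic window" idea is a tf.keras.utils.Sequence that stores only (location, start-time) index pairs and materializes the actual windows batch by batch. This is a sketch, not the benchmark implementation; the toy target and shapes are assumptions.

```python
import numpy as np
import tensorflow as tf

class SymbolicWindows(tf.keras.utils.Sequence):
    def __init__(self, series, window, batch_size):
        # series: array of shape (n_locations, n_times, n_properties)
        super().__init__()
        self.series, self.window, self.batch_size = series, window, batch_size
        n_loc, n_time, _ = series.shape
        starts = [(l, t) for l in range(n_loc) for t in range(n_time - window)]
        self.starts = np.random.permutation(np.array(starts))     # shuffled window definitions only

    def __len__(self):
        return len(self.starts) // self.batch_size

    def __getitem__(self, idx):
        batch = self.starts[idx * self.batch_size:(idx + 1) * self.batch_size]
        X = np.stack([self.series[l, t:t + self.window] for l, t in batch])     # built on demand
        y = np.stack([self.series[l, t + self.window, 0] for l, t in batch])    # toy target: next value
        return X, y

# usage sketch: model.fit(SymbolicWindows(series, window=26, batch_size=64), epochs=...)
```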


29 of 60

Search Strategies in TFT and Science Transformer

  • Choose a group of items (space-time collections) to be considered together -- implement the full case statistically by a random choice of sequences linked in a single attention calculation
  • a) Temporal (TFT): at each location, look over the time window of size W = Tseq: O(N_S W²)
  • b) Spatial: look across items at a fixed position in the time window: O(N_S² W)
  • c) Full: look over all space-time windows in the batch: O(N_S² W²)


30 of 60

Forecasting Models Sample Sizes

  • The TFT and LSTM are typical deep learning networks with a significant (>= 64) batch size of simple one-time, one-location samples.
  • The Science Transformer model often has a batch size of 1, as it needs each sample to have multiple space and time sequences to find query-key matches.
  • This large sample size is not practical with our current memory, so we sample these matches statistically, with completely independent shuffling into groups whose lengths we investigated
    • More structured choices were investigated, but full shuffling appeared the best
    • All locations/times do appear once in each epoch, but not once in each sample
    • Multi-GPU parallel implementations could be investigated to explore larger sample sizes
    • This structure implies that the Science Transformer and LSTM end up with similar training times per epoch (the transformer is 15% slower than the LSTM), but the Science Transformer is much slower on inference as it must run over multiple statistical samples of the data. This is no great concern for this type of analysis


31 of 60

Figure: the joint models compared (LSTM, Modified Temporal Fusion Transformer, Science Transformer, AE-TCN Joint Model), all conditioned on a static context.

  • LSTM: embedder, 2-layer LSTM, output mapper (67K weights)
  • Modified Temporal Fusion Transformer: embedders, 2-layer LSTM as backward encoder and 2-layer LSTM as forward decoder, temporal attention, output mappers (8 million weights)
  • Science Transformer: embedder, space-time attention merged with a 2-layer LSTM decoder, output mapper (2.3 million weights)
  • AE-TCN Joint Model: AutoEncoder plus Temporal Convolutional Network, mapping an image to an image prediction


32 of 60

Lots of Weights to Train

  • Allowed by steepest descent which is insensitive to redundant parameters
  • Leads to overfitting: Training loss can be << Validation/Testing Loss
  • No obvious rules other than don’t be too greedy on training loss
  • Rate of change sensitive to learning rate (step size) and batch size

Figure panels: LSTM and TFT.


33 of 60

LSTM and Transformer Results - predict 2 weeks


34 of 60

LSTM and Transformer Results - predict 6 months


35 of 60

LSTM and Transformer Results - predict 4 years


36 of 60

NNSE Summed over Locations

Normalized Nash–Sutcliffe Efficiency NNSE

| Time Period | LSTM Train | TFT Train | Science Transformer Train | LSTM Validation | TFT Validation | Science Transformer Validation |
|---|---|---|---|---|---|---|
| 2 weeks   | 0.903 | 0.925 | 0.893 | 0.868 | 0.87  | 0.856 |
| 4 weeks   | 0.895 |       | 0.916 | 0.867 |       | 0.884 |
| 8 weeks   | 0.886 |       | 0.913 | 0.866 |       | 0.881 |
| 14 weeks  | 0.924 | 0.982 | 0.919 | 0.893 | 0.899 | 0.881 |
| 26 weeks  | 0.946 | 0.985 | 0.954 | 0.897 | 0.895 | 0.896 |
| 52 weeks  | 0.919 | 0.988 | 0.955 | 0.861 | 0.88  | 0.876 |
| 104 weeks | 0.923 |       | 0.937 | 0.853 |       | 0.83  |
| 208 weeks | 0.935 |       | 0.921 | 0.811 |       | 0.77  |

(The TFT targets only the 2, 14, 26 and 52 week horizons, so the other rows have no TFT entry.)

Validation results are similar between methods.

Training quality: TFT > Science Transformer > LSTM. Training results reflect the number of weights: TFT 8M > Science Transformer 2.3M >> LSTM 66K.


37 of 60

Comments on Deep Learning for Earthquakes

  • There is no clear consensus as to
    • Best-practice approach to time series
      • Treatment of static variables
      • Treatment of multi-horizon futures -- 2 weeks to 4 years in this case
      • Univariate versus multivariate
      • Spatial versus temporal attention
      • Role of "known inputs", from "is today a holiday" to a Legendre polynomial in time to an approximate model of the phenomena
    • Measure of success -- I have a liking for the Nash–Sutcliffe Efficiency
    • Definition of the test/validation set -- divide in space, time or both
    • Variables to look at: log(E) to E^1/4 to E^1/2 (Benioff strain) to E to counts to complex derived quantities such as eigenvalues
    • What is the role of faults?
  • Issues from "Computer Science" and "Earthquake Science" and some that mix both


38 of 60

Figure: AI for Science taxonomy.

  • Artificial Intelligence (AI): computational technologies to assist, augment and automate human activities.
  • ML & Probability (statistics: machine learning and probability): probability and the Bayesian approach provide the overall framework, and machine learning specific algorithms to analyze data and use "on its own" (data-driven) or in conjunction with theoretical ideas to give models, which are learnt from training and used in inference.
  • Deep Learning (neural networks): builds flexible models requiring less a priori knowledge, in terms of stacked layers of neural nets such as dense, recurrent, convolutional, graph.
  • Problem classes and specific tasks include: expert systems, sequence-to-sequence maps, forecasting, knowledge reasoning, anomaly detection, natural language processing, recommendation systems, simulation surrogates, vision and perception, regression, classification, clustering, topic modelling, random forests, autoencoders, generative adversarial networks, reinforcement learning, and transformers (attention).
  • Theory-driven (laws of nature, phenomenology, χ² fits) versus data-driven: Model → Nowcasting.


39 of 60

Choices in Scientific Discovery

  • Traditional Theoretical Science
    • Discover by thinking
  • Traditional Observational Science
    • Data interpreted by simple or theory models -- typically a few parameters (<= 100)
  • Computational Science
    • Need a good numerical formulation of the theory
  • Data-Driven Science and AI for Science (nowadays ~all Deep Learning)
    • Surrogates enable very fast ensembles of simulations
    • Time dependent deep learning derives evolutionary behavior of data (the hidden Newton’s laws) whether there is a viable theory or not
  • Please look at other time-dependent problems
    • Need to define Known Inputs
    • Static Features (equivalent of mass and g in Newton's laws)
    • Observed Inputs
    • Targets


40 of 60

Building a model in 1978-1979

  • Quantum-chromodynamic approach for the large-transverse-momentum production of particles and jets; R. P. Feynman, R. D. Field, and G. C. Fox, Phys. Rev. D 18, 3320 – published 1 November 1978
  • In those days models took a long time to develop and embodied the best physics known, needing a Nobel prize winner!!
  • One used χ² least-squares fits to determine the parameters of the model from data such as that on the left


41 of 60

More Such Physics Models

Figure: data from Fermilab experiments E110, E260 and E350 at 200 GeV for π0X, π0X0, η0X and η0X0 production versus -t, compared with the Field–Feynman–Fox model and a Regge theory model.


42 of 60

MLCommons Benchmarks


43 of 60

MLCommons (MLPerf) Consortium Deep Learning Benchmarks

Some Relevant Working Groups

  • Training
  • Inference (Batch and Streaming)
  • TinyML (embedded)
  • Power
  • Datasets
  • HPC (Supercomputer Implementations)
  • Research (Academic-Industry Links)
  • Science (AI for Science)
  • Best Practice (Software)
  • Logging/Infrastructure (metadata)


  • Major effort of 52 companies to produce benchmarks with ongoing challenges
  • Training is at v1.0 (fourth release)
  • Fox set up the Science Working Group with co-chair Tony Hey, who has a significant benchmarking group, SciML
    • Identified ~12 science benchmarks including light source, satellite, surrogate and time series
  • MLCommons aims to accelerate machine learning innovation to benefit everyone: benchmarking, datasets, best practices. Total effort ~50 FTE


44 of 60

MLCommons (MLPerf) Consortium Activity Areas


45 of 60

Science Research MLCommons working group

  • Science, like industry, involves edge and data-center issues, end-to-end systems, inference, and training. There are some similarities in the datasets and analytics, as both industry and science involve image data, but also differences; science data associated with simulations and particle physics experiments are quite different from most industry exemplars
  • When fully contributed, the benchmark suite will cover (at least) the following domains: material sciences, environmental sciences, life sciences, fusion, particle physics, astronomy, earthquake and earth sciences, with more than one representative problem from each of these domains


  • https://mlcommons.org/en/groups/research-science/
  • One aim is to provide a mechanism for assessing the capability of different ML models in addressing different scientific problems
  • i.e. one benchmark measure is scientific discovery
  • Cover rich range of problem classes
  • “End-to-end” is one class
  • Provide common environment to store and run benchmarks (Software)
  • 4 Initial Benchmarks (2 from DOE labs, 1 UK, 1 UVA)
  • Surrogates Included (1 from LLNL next round)
  • Lead use of FAIR metadata for MLCommons


46 of 60

Science-based Metrics

  • Metrics will include those measuring performance on science discovery, e.g., could be one or more of:
    • Accuracy achieved
    • Time to solution (to meet a specific accuracy target)
    • Top-1 or Top-5 score
    • Chance your home will suffer a big earthquake …..
  • Goal of our benchmarks is to stimulate development of new methods relevant for scientific outcomes. We aim to:
    • Offer well-defined “science data” sets
    • Provide a reference implementation - to help others overcome any format/interpretation/usage hurdles
    • Specify target benchmark metrics (to outperform)
    • Require a description of the improved method or code used by respondents
  • The science data should have enough richness to allow experimentation with innovative approaches.
  • Also include traditional system performance benchmarks


47 of 60


| Benchmark | Science | Task | Owner Institute | Specific Benchmark Issues |
|---|---|---|---|---|
| CloudMask | Climate | Segmentation | RAL | Classify cloud pixels in images |
| STEMDL | Material | Classification | ORNL | Classifying the space groups of materials from their electron diffraction patterns |
| CANDLE-UNO | Medicine | Classification | ANL | Cancer prediction at cellular, molecular and population levels |
| TEvolOp Forecasting | Earthquake | Regression | Virginia | Predict earthquake activity from recorded event data |
| ICF (Inertial Confinement Fusion) | Plasma Physics | Simulation surrogate | LLNL | There are other possible LLNL benchmarks from a collection of 10 |

Benchmark contains Datasets, Science Goals, Reference Implementations; hosted at SDSC or RAL

Specification of 4 Benchmarks https://drive.google.com/file/d/1BeefJTj4ZZL4Wa5c3zNz1l5nzQN-ktGR/view?usp=sharing


48 of 60

Current Science WG Benchmark Status


49 of 60

High Performance Data Engineering

Some Details on Cylon


50 of 60


Integrating Data Engineering and Data Science

NIPS 2015: http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

This well-known paper points out that parallel high-performance machine learning (the "ML Code" box in its figure) is perhaps the most fun but is just a part of the system. We need to integrate in the other data and orchestration components.

This integration is not very good or easy, partly because data management systems like Spark are JVM-based, which doesn't cleanly link to the C++/Python world of high-performance ML.

The ML code module is itself built up hierarchically from Numpy and Pandas operations (if Python).

We need to assemble ~10 large modules into the full workflow and efficiently execute Numpy/Pandas etc. inside the modules.


51 of 60

Data Engineering versus Data Science I: Deep Learning Workflow


Workflows often divide into two parts:

  • Data => Information: preprocessing -- Hadoop, Spark, Twister2, Scikit-Learn
  • Information => Knowledge: the compute-intensive step -- Cylon-enhanced Spark, Twister2, PyTorch and Tensorflow

followed by post-processing.


52 of 60

Data Engineering versus Data Science II

  • Data engineering includes producing structured data from raw data with ETL (Extract-Transform-Load) operations.
  • Data engineering enables Deep Learning (DL) and Machine Learning (ML) workflows.
  • There are no clear requirements, but it needs
    • the Java ecosystem, important with its networking focus
    • the Python ecosystem for user-facing capabilities
    • the C++ ecosystem for performance



53 of 60

Two ways


Lines of Open Source Code: Twister2 145000 (Java, Python)

Cylon 25000 (C++, Python, Java)

The pipeline runs from data collection through pre-processing (big data frameworks) to model training and inference/prediction (deep learning frameworks):

  • Data Collection: databases, APIs, adaptors, data streams, IoT, ...
  • Pre-Processing: normalization, filtering, transformation, aggregation, feature extraction, ...
  • Model Training: MLP, autoencoders, CNN, RNN, LSTM, ...
  • Inference/Prediction: classification, generation, prediction, pattern recognition, ...

Big data frameworks (Twister2, Spark, Flink, Hadoop, ...) cover the data engineering stages; deep learning frameworks (PyTorch, TensorFlow, MXNet, Keras, ...) cover the rest. The two ways to bridge them: from data management (DM) to DL with Twister2 (Twister2DL), and from deep learning (DL) back to DM with Cylon.


54 of 60

Must support Parallelism as Automatically as Possible

  • At a low level, parallelism will be
    • NCCL, MPI, Horovod etc.
    • or pleasingly parallel (many-task) computing
  • But the user can be shielded from this by libraries
  • Originally these libraries were linear algebra for simulations, implemented in (not very successful) programming models such as High Performance Fortran (HPF), C++ and Java
  • Data analytics has actually developed a more powerful set of such operator/function libraries, with
    • Pandas and Numpy array, table and dataframe operations
    • Deep learning in PyTorch and Tensorflow
    • Spark transformations
  • Further, this parallelism is proxied with Python frontends invoking parallel (C++) implementations
  • So the community is developing infrastructure for operator-based parallelism


55 of 60

Some Intrinsically Parallel Operators

  • Classic parallel computing (720 MPI functions)
    • AllReduce, Broadcast, Gather, Scatter
  • Linear algebra (320 functions in SCALAPACK at one precision)
    • Matrix and vector operations
  • Tables: 224 Pandas operators for the Dataframe out of 4782 total (see the sketch after this list)
    • Intersect: applicable on two tables having similar schemas, to keep only the records that are present in both tables.
    • Join: combines two tables based on the values of columns. Includes the variations Left, Right, Full, Outer, and Inner joins.
    • OrderBy: sorts the records of the table based on a specified column.
    • Aggregate: performs a calculation on a set of values (records) and outputs a single value (record). Aggregations include summation and multiplication.
    • GroupBy: groups the data using the given columns; GroupBy is usually followed by aggregate operations. Famous from MapReduce
  • Arrays (1085 Numpy)
  • Tensors (>700 Tensorflow, PyTorch, Keras)
    • All the myriad Numpy array operations
    • Add a layer to a deep learning network
    • Forward (calculate loss) and backward (calculate derivative) propagation
    • Checkpoint the weights of the network
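A tiny illustration of the table operators named above (Join, OrderBy, GroupBy, Aggregate) using Pandas dataframes; the data is made up.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "mag": [2.1, 3.4, 5.0]})
right = pd.DataFrame({"id": [2, 3, 4], "depth": [7.0, 11.5, 3.2]})

joined = left.merge(right, on="id", how="inner")         # Join (inner)
ordered = joined.sort_values("mag")                       # OrderBy
summary = joined.groupby("id").agg({"mag": "sum"})        # GroupBy followed by Aggregate
```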


56 of 60


57 of 60

Cylon: A High Performance Distributed Data Table

  • Cylon is a high-performance C++ kernel and a distributed runtime for data pre-processing
    • Apache Parquet and Arrow based storage and in-memory data structure
      • Supports seamless integration with deep learning workloads, Pandas and Numpy
      • Zero-copy data transfer between heterogeneous systems and languages.
  • Table API, an abstraction for ETL (extract, transform, load) for scientific computing and deep learning workloads including Pandas, HDF5
    • Join, Union, Intersect, Difference, Product, Project -- 36 operators
  • Currently we support Joins (all formats); the other components (see all those operators listed earlier) are complete or currently in development.
  • Written in C++, with APIs available in Java and Python (via Cython).
  • Cylon is the high-performance kernel of Twister2.
  • Future: link to RAPIDS (for NVIDIA GPUs), BlazingSQL (SQL operators), and other accelerators such as AMD and Intel GPUs


58 of 60

Strong Scaling Comparison with Other Frameworks

  • 200M records per table (left and right)
  • 160 processes across 10 Intel® Xeon® Platinum 8160 nodes with 255GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth

Figure: strong scaling for Inner join and Union.


59 of 60

Large Scale Experiments with PySpark and Cylon

  • Inner join -- up to 10B records per relation (left/right). The graph shows the total rows
  • 200 processes across 10 Intel® Xeon® Platinum 8160 nodes with 256GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth and 10Gbps Ethernet


60 of 60

Strong Scaling Comparison with Other Frameworks

  • 200 million records per table (left and right)
  • 160 processes across 10 Intel® Xeon® Platinum 8160 nodes with 255GB RAM in each node and a mounted SSD. InfiniBand with 40Gbps bandwidth

Figure: strong scaling for Aggregations and Group-by + aggregations.
