1 of 34

Enabling model and simulation provenance and data transparency with PROVENA

Presenter: Jonathan Yu

AVSIG March 2025

Funded by the Australian Climate Service, led by CSIRO and supported by the National Emergency Management Agency

Australia’s National Science Agency

2 of 34

Provena team

  • Jonathan Yu
  • Peter Baker
  • Linda Thomas
  • Sharon Tickell
  • Peter Fitch
  • Fareed Mirza
  • Jordan Munnerley
  • Ross Petridis

 

Prior years:

  • Xinyu Hou
  • David Lemon
  • Omid Rezvani
  • Parth Kulkarni
  • Simon Cox
  • Mark Baird

3 of 34

Outline

  1. Why do we need data provenance? (featuring 2 modelling projects)
  2. What are the data provenance challenges? Current solutions…
  3. Provena as a solution
  4. Learnings and emerging outcomes

4 of 34

Where did this data come from?

5 of 34

Data

Process(es) e.g. modelling

People

Decisions

Actions

“Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact” (W3C, Prov)

Output Data

6 of 34

Reef Restoration and Adaptation Program Modelling and Decision Support (MDS) for the Great Barrier Reef

RRAP MDS Modelling

Modeller

Stakeholders

(e.g. RRAP Program, Reef Managers and Industry Partners)

Researcher

Decision Maker

Fogging

Heat tolerant corals

Rubble stabilisation

Produce data/info/knowledge

through modelling

Make decisions

Actions

7 of 34

Bushfire hazard data

NBIC Modelling

Produce data/info/knowledge

through modelling

Make decisions

Actions

Insurance

Calculate levies more accurately

Fire agencies

Improve bushfire preparedness

Planning authorities

Improve zoning of where buildings go

Infrastructure Managers

Factor bushfire risk into publicly funded road/infrastructure builds

National Bushfire Intelligence Capability | Board Meeting May 2024

8 of 34

Modelling and Decision Support

7 sub-teams, ~40 people total (14 modellers)

1 organisation

7 organisations

5 modelling teams, ~35 people total (25 modellers)

People

Modelling approach

Data formats

Why?

Rigour

Credibility

Transparency

Data quality

Communication

Understand dependencies

Cloud Optimized GeoTIFFs

9 of 34

Some data provenance solutions to date

Embed provenance in a file naming system…

Write (more) docs about our process

Record model runs, config, datasets in Excel

Let’s make copies of inputs and code, and store them with the model outputs

Use a workflow engine

Embed the provenance metadata in or with the dataset

10 of 34

Data Provenance Challenges

Finding / querying provenance info

Describing the data and its provenance

Sharing data/info

Integrating this into (modelling and simulation) processes

Registering, (storing) and accessing data

11 of 34

High-level Design of a Provenance solution

Register Datasets

Find/Access Datasets

Record Details of Modelling Activities

Understand and Trust Modelling Activities

Data Registry

Provenance Registry

Register Dataset

Access Dataset

Record Model Runs

Query Provenance
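
The four operations in this design (register dataset, access dataset, record model runs, query provenance) can be illustrated with a small in-memory sketch. This is not Provena's implementation or API: the class names, method names, and the example storage location are all hypothetical, chosen only to show how a data registry and a provenance registry fit together.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRun:
    """One recorded modelling activity (hypothetical record shape)."""
    run_id: str
    model: str
    inputs: list[str]
    outputs: list[str]


@dataclass
class ProvenanceSketch:
    """Toy stand-in for a data registry plus a provenance registry."""
    datasets: dict[str, str] = field(default_factory=dict)  # dataset id -> location
    runs: list[ModelRun] = field(default_factory=list)

    def register_dataset(self, dataset_id: str, location: str) -> None:
        self.datasets[dataset_id] = location

    def access_dataset(self, dataset_id: str) -> str:
        return self.datasets[dataset_id]

    def record_model_run(self, run: ModelRun) -> None:
        # Refuse runs that refer to datasets nobody has registered.
        missing = [d for d in run.inputs + run.outputs if d not in self.datasets]
        if missing:
            raise KeyError(f"unregistered datasets: {missing}")
        self.runs.append(run)

    def upstream(self, dataset_id: str) -> set[str]:
        """Query provenance: every dataset that dataset_id was derived from."""
        found: set[str] = set()
        frontier = [dataset_id]
        while frontier:
            current = frontier.pop()
            for run in self.runs:
                if current in run.outputs:
                    for source in run.inputs:
                        if source not in found:
                            found.add(source)
                            frontier.append(source)
        return found


# Example usage with made-up identifiers and locations.
registry = ProvenanceSketch()
registry.register_dataset("climate-input", "s3://example-bucket/climate/")
registry.register_dataset("hazard-output", "s3://example-bucket/hazard/")
registry.record_model_run(
    ModelRun("run-001", "example-model", ["climate-input"], ["hazard-output"])
)
print(registry.upstream("hazard-output"))
```

The lineage query walks backwards from an output through every run that generated it, which is the kind of question ("where did this data come from?") a graph-backed provenance registry is meant to answer at scale.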

12 of 34

Provena – a solution to provenance and data challenges

13 of 34

PROVENA

  • Cloud-based provenance management system for projects
  • Provenance of datasets and details of related modelling and simulation workflows
  • (Currently) AWS-based solution

REST APIs

User and Secure Access

Identity Service

Data Repository

Knowledge Registry

Cloud

PROVENA

14 of 34

Provena high-level features

Search index built in

Documented, automation-ready APIs for all system interactions

Leverages a persistent identifier (PID) system (Handle)

Open source and ready to use operationally

Scalable data storage backed by AWS S3 with a user-friendly security implementation

User logins using identity providers such as the Australian Access Federation

Provenance system – scalable registry which records all datasets, modelling activities and related metadata. Records are queryable via a graph database.

15 of 34

The Provena Registry

Model Run Activity Provenance Record

People

Organisations

Models

ADRIA

CoCoNet

C~scape

Reefmod

Datasets

Model run activity records

The Registry is a centralised location to register, update, explore and share persistently identified resources

… enables linking things together and the creation of provenance records

16 of 34

Model Run Data Model (Prov-O extension)

Model

Input Dataset

Output Dataset

Model Run

Start Time

End Time

used

wasGeneratedBy

Organisation

Person

wasAssociatedWith

1..*

1..*

Entity

Activity

Agent

used

17 of 34

Model

Input Dataset

Output Dataset

Model Run

2023-11-01 0800

2023-11-01 1000

used

wasGeneratedBy

UQ

Bob

wasAssociatedWith

1..*

1..*

Entity

Activity

Agent

used

Reefmod

CMIP6 Input Data

Counterfactual result dataset

Model Run Data Model (Prov-O extension)
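
The model run data model above can be sketched as a small Python record that expands into PROV-style statements. The field names, the `model-run-1` identifier, and the reading of the two `1..*` labels as applying to input and output datasets are assumptions made for illustration; the example values (Reefmod, CMIP6 Input Data, Bob, UQ, the 2023-11-01 times) come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ModelRunRecord:
    run_id: str                # Activity
    model: str                 # Entity
    inputs: tuple[str, ...]    # Entities, assumed 1..*
    outputs: tuple[str, ...]   # Entities, assumed 1..*
    person: str                # Agent
    organisation: str          # Agent
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if not self.inputs or not self.outputs:
            raise ValueError("a model run needs at least one input and one output")
        if self.end < self.start:
            raise ValueError("end time precedes start time")

    def statements(self) -> list[tuple[str, str, str]]:
        """Expand the record into (subject, PROV relation, object) triples."""
        triples = [(self.run_id, "used", self.model)]
        triples += [(self.run_id, "used", d) for d in self.inputs]
        triples += [(d, "wasGeneratedBy", self.run_id) for d in self.outputs]
        triples += [
            (self.run_id, "wasAssociatedWith", agent)
            for agent in (self.person, self.organisation)
        ]
        return triples


run = ModelRunRecord(
    run_id="model-run-1",  # hypothetical identifier
    model="Reefmod",
    inputs=("CMIP6 Input Data",),
    outputs=("Counterfactual result dataset",),
    person="Bob",
    organisation="UQ",
    start=datetime(2023, 11, 1, 8, 0),
    end=datetime(2023, 11, 1, 10, 0),
)
for statement in run.statements():
    print(statement)
```

Treating the model itself as something the run `used` mirrors the two `used` edges in the diagram: one to the model and one to the input dataset.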

18 of 34

Modelling team

Modelling workflow

Provena

Register / Update dataset record and model run provenance

Register input datasets

Register model runs

Register output datasets

Lookup provenance info

Project team

Query info for reporting

Modelling team

Find and access datasets

19 of 34

Registered datasets in RRAP MDS

20 of 34

Registry view in RRAP MDS

21 of 34

Using provenance info

22 of 34

Provena features

Find / query provenance info (data, model, workflows…)

Describe the data and its provenance – human and machine readable

Share data/info (with the right people)

Integrate this into modelling and simulation processes

Register, (store) and access data

23 of 34

What Provena can’t do

Finding / querying provenance info (data, model, workflows…)

Describing the data and its provenance

Sharing data/info (with the right people)

Integrating this into modelling and simulation processes

Registering, (storing) and accessing data

Relies on people

24 of 34

What our team thinks modellers would think…

25 of 34

What modellers actually think…

26 of 34

Register datasets, entities, model runs via Web Browser

Register datasets, entities and model runs via Python client

Register model runs via Excel

27 of 34

Register model runs using common tools like CSV file uploads
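
As a rough illustration of the CSV route, the sketch below builds a model run sheet with Python's standard `csv` module and rejects incomplete rows before anything is uploaded. The column names are hypothetical and are not Provena's actual template; a real upload would follow whatever template the system provides.

```python
import csv
import io

# Hypothetical column names, used here only for illustration.
COLUMNS = [
    "model",
    "input_dataset",
    "output_dataset",
    "start_time",
    "end_time",
    "person",
    "organisation",
]


def runs_to_csv(runs: list[dict[str, str]]) -> str:
    """Serialise model run rows to CSV text, failing early on missing values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for row in runs:
        missing = [column for column in COLUMNS if not row.get(column)]
        if missing:
            raise ValueError(f"row is missing {missing}")
        writer.writerow(row)
    return buffer.getvalue()


example_row = {
    "model": "Reefmod",
    "input_dataset": "CMIP6 Input Data",
    "output_dataset": "Counterfactual result dataset",
    "start_time": "2023-11-01T08:00:00",
    "end_time": "2023-11-01T10:00:00",
    "person": "Bob",
    "organisation": "UQ",
}
print(runs_to_csv([example_row]))
```

Checking for empty cells before writing keeps the burden on the tool rather than on the modeller, which is the point of offering a familiar format in the first place.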

28 of 34

Register datasets and model runs from Python environments

29 of 34

Inputs, model run and output record summaries and links automatically generated

Generate a report of modelling activities and datasets

30 of 34

Exporting Data Specification Sheets from Provena Dataset Records using an ISO-19135 template

31 of 34

Tools to integrate with Linked Data using Python libraries

RDF/Turtle serialisation

Prov-N serialisation

Graphviz Prov visualisation using Prov-N doc
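
To give a feel for what a Prov-N serialisation looks like, here is a hand-written sketch that uses only the Python standard library. It is not the tooling shown on the slide: the `to_provn` function, the `ex` prefix, the `http://example.org/` namespace, and the identifiers are all assumptions, and a dedicated PROV library would normally handle escaping, validation, and the RDF/Turtle and Graphviz outputs.

```python
from datetime import datetime


def to_provn(
    run_id: str,
    start: datetime,
    end: datetime,
    used: list[str],
    generated: list[str],
    agents: list[str],
    prefix: str = "ex",
    namespace: str = "http://example.org/",
) -> str:
    """Render one model run as a minimal PROV-N document (illustrative only)."""

    def q(name: str) -> str:
        # Qualified name; local names must not contain spaces.
        return f"{prefix}:{name}"

    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  entity({q(e)})" for e in [*used, *generated]]
    lines += [f"  agent({q(a)})" for a in agents]
    lines.append(
        f"  activity({q(run_id)}, {start.isoformat()}, {end.isoformat()})"
    )
    lines += [f"  used({q(run_id)}, {q(e)}, -)" for e in used]
    lines += [f"  wasGeneratedBy({q(e)}, {q(run_id)}, -)" for e in generated]
    lines += [f"  wasAssociatedWith({q(run_id)}, {q(a)}, -)" for a in agents]
    lines.append("endDocument")
    return "\n".join(lines)


provn_text = to_provn(
    run_id="model-run-1",
    start=datetime(2023, 11, 1, 8, 0),
    end=datetime(2023, 11, 1, 10, 0),
    used=["reefmod", "cmip6-input-data"],
    generated=["counterfactual-result-dataset"],
    agents=["bob", "uq"],
)
print(provn_text)
```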

32 of 34

Outcomes and Learnings

Reflections

  • Provena is a tool to enable rigour in science

  • Challenge - building and iterating on tools with users in mind, and making them as easy to use as possible

  • Role of Provena – partnering with projects as data is produced. This has opened up many interesting conversations about data reproducibility

Highlighted project examples:

  • National Bushfire Intelligence Capability – since Jan 2024
  • RRAP MDS – since Nov 2022

RRAP Registered Items to date

33 of 34

Summary

Provena = Cloud-based provenance management system designed to support modelling and simulation activities

Enabling users to do the “basics” – mint identifiers, capture provenance (entities and processes) and support practical uses of provenance info

Provena approach and tools - general purpose and adaptable to different domains and modes of modelling – e.g. NBIC, RRAP MDS

Using Provena to record data provenance

34 of 34

Thank you

CSIRO Environment, Jonathan Yu

jonathan.yu@csiro.au

Website

GitHub
