Enabling model and simulation provenance and data transparency with PROVENA
Presenter: Jonathan Yu
AVSIG March 2025
Funded by the Australian Climate Service, led by CSIRO and supported by the National Emergency Management Agency
Australia’s National Science Agency
Provena team
Prior years:
Outline
Where did this data come from?
Data
Process(es) e.g. modelling
People
Decisions
Actions
"Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact" (W3C, Prov)
Output Data
Reef Restoration and Adaptation Program Modelling and Decision Support (MDS) for the Great Barrier Reef
RRAP MDS Modelling
Modeller
Stakeholders
(e.g. RRAP Program, Reef Managers and Industry Partners)
Researcher
Decision Maker
Fogging
Heat tolerant corals
Rubble stabilisation
Produce data/info/knowledge
through modelling
Make decisions
Actions
Bushfire hazard data
NBIC Modelling
Produce data/info/knowledge
through modelling
Make decisions
Actions
Insurance
Calculate levies more accurately
Fire agencies
Improve bushfire preparedness
Planning authorities
Improve zoning of where buildings go
Infrastructure Managers
Take bushfire risk into account in publicly funded road/infrastructure builds
National Bushfire Intelligence Capability | Board Meeting May 2024
Modelling and Decision Support
7 sub-teams, ~40 people total (14 modellers)
1 organisation
7 organisations
5 modelling teams, ~35 people total (25 modellers)
People
Modelling approach
Data formats
Why?
Rigour
Credibility
Transparency
Data quality
Communication
Understand dependencies
Cloud Optimized GeoTIFFs
Some data provenance solutions to date
Embed provenance in a file naming system…
Write (more) docs about our process
Record model runs, config, datasets in Excel
Let’s make copies of inputs, code, and store it with the model outputs
Use a workflow engine
Embed the provenance metadata in or with the dataset
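As an illustration of the last option above (provenance metadata stored with the dataset), here is a minimal sketch that writes a JSON "sidecar" file next to a data file. The field names, the `.prov.json` suffix and the helper names are assumptions made for this example; they are not a format used by any of the projects mentioned.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_sidecar(dataset_path: Path, model: str, inputs: list[str]) -> Path:
    """Write a JSON provenance record next to the dataset file."""
    record = {
        "dataset": dataset_path.name,
        # The checksum ties the record to the exact bytes of the dataset.
        "sha256": hashlib.sha256(dataset_path.read_bytes()).hexdigest(),
        "generated_by_model": model,
        "input_datasets": inputs,
    }
    sidecar = dataset_path.parent / (dataset_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


def demo() -> dict:
    """Create a throwaway dataset, write its sidecar and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "output.csv"
        data.write_text("site,cover\nA,0.4\n")
        sidecar = write_sidecar(data, "example-model", ["input-a", "input-b"])
        return json.loads(sidecar.read_text())
```

The sidecar travels with the file, but nothing links it to other records or makes it searchable, which is the gap the challenges on the next slide describe.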
Data Provenance Challenges
Finding / querying provenance info
Describing the data and its provenance
Sharing data/info
Integrating this into (modelling and simulation) processes
Registering, (storing) and accessing data
High-level Design of a Provenance solution
Register Datasets
Find/Access Datasets
Record Details of Modelling Activities
Understand and Trust Modelling Activities
Data Registry
Provenance Registry
Register Dataset
Access Dataset
Record Model Runs
Query Provenance
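The four operations in this design (Register Dataset, Access Dataset, Record Model Runs, Query Provenance) can be sketched as two small in-memory classes. This is only a toy model of the design, assuming made-up identifiers and storage locations; it is not Provena's implementation or API.

```python
class DataRegistry:
    """Toy data registry: register datasets and look them up by identifier."""

    def __init__(self) -> None:
        self._datasets: dict[str, dict] = {}

    def register_dataset(self, name: str, location: str) -> str:
        dataset_id = f"dataset-{len(self._datasets) + 1}"
        self._datasets[dataset_id] = {"name": name, "location": location}
        return dataset_id

    def access_dataset(self, dataset_id: str) -> dict:
        # Raises KeyError for identifiers that were never registered.
        return self._datasets[dataset_id]


class ProvenanceRegistry:
    """Toy provenance registry: record model runs that link registered datasets."""

    def __init__(self, data: DataRegistry) -> None:
        self._data = data
        self._runs: list[dict] = []

    def record_model_run(self, model: str, inputs: list[str], outputs: list[str]) -> None:
        # Refuse to link to datasets that are not in the data registry.
        for dataset_id in [*inputs, *outputs]:
            self._data.access_dataset(dataset_id)
        self._runs.append({"model": model, "inputs": list(inputs), "outputs": list(outputs)})

    def query_provenance(self, dataset_id: str) -> list[dict]:
        """Return the model runs that generated the given dataset."""
        return [run for run in self._runs if dataset_id in run["outputs"]]


data = DataRegistry()
provenance = ProvenanceRegistry(data)
climate = data.register_dataset("climate input", "s3://example-bucket/climate")
result = data.register_dataset("model result", "s3://example-bucket/result")
provenance.record_model_run("example-model", [climate], [result])
```

Keeping the two registries separate mirrors the diagram: datasets can be registered and accessed on their own, while model run records only ever point at identifiers that already exist.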
Provena – a solution to provenance and data challenges
PROVENA
REST APIs
User and Secure Access
Identity Service
Data Repository
Knowledge Registry
Cloud
The Provena high level features
Search index built in
Documented, automation-ready APIs for all system interactions
PID
Leverages persistent identifier system (Handle)
Open source and ready to use operationally
Scalable data storage backed by AWS S3 with a user-friendly security implementation
User logins using identity providers such as Australian Access Federation
Provenance system – Scalable Registry which records all datasets, modelling activities and related metadata. Records are queryable via a graph database.
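To show what a graph-style provenance query can look like, the sketch below walks "upstream" from an output dataset to everything it depends on. A plain Python dictionary stands in for the graph database, and the record identifiers are hypothetical; the query interface of the actual system is not shown in this talk.

```python
from collections import deque

# Edges point from a record to the records it depends on (hypothetical IDs).
EDGES = {
    "output-dataset": ["model-run"],          # wasGeneratedBy
    "model-run": ["model", "input-dataset"],  # used
    "input-dataset": ["upstream-run"],        # wasGeneratedBy
    "upstream-run": ["raw-dataset"],          # used
}


def upstream(start: str, depth: int) -> list[str]:
    """Breadth-first walk collecting every record within `depth` hops upstream."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for parent in EDGES.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append((parent, hops + 1))
    return order
```

With `depth=1` the walk returns only the model run that produced the dataset; larger depths follow the chain back through earlier runs and their inputs.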
The Provena Registry
Model Run Activity Provenance Record
People
Organisations
Models
ADRIA
CoCoNet
C~scape
Reefmod
Datasets
Model run activity records
The Registry is a centralised location to register, update, explore and share persistently identified resources
… enables linking things together and the creation of provenance records
Model Run Data Model (Prov-O extension)
Model
Input Dataset
Output Dataset
Model Run
Start Time
End Time
used
wasGeneratedBy
Organisation
Person
wasAssociatedWith
1..*
1..*
Entity
Activity
Agent
used
Model
Input Dataset
Output Dataset
Model Run
2023-11-01 0800
2023-11-01 1000
used
wasGeneratedBy
UQ
Bob
wasAssociatedWith
1..*
1..*
Entity
Activity
Agent
used
Reefmod
CMIP6 Input Data
Counterfactual result dataset
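The example above (Reefmod run by Bob at UQ, using CMIP6 input data and producing a counterfactual result dataset) can be written down as PROV statements. The sketch below hand-builds PROV-N text with the standard library only; the `ex` prefix, its namespace URI, the identifier spellings and the `ModelRun` class are assumptions for illustration and are not Provena's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ModelRun:
    run_id: str
    model: str
    inputs: list[str]
    outputs: list[str]
    agents: list[str]
    start: datetime
    end: datetime


def to_provn(run: ModelRun, prefix: str = "ex") -> str:
    """Render the run as a PROV-N document (identifiers use a made-up prefix)."""

    def q(name: str) -> str:
        return f"{prefix}:{name}"

    lines = ["document", f"  prefix {prefix} <http://example.org/>"]
    # The model run is the Activity, with its start and end times.
    lines.append(f"  activity({q(run.run_id)}, {run.start.isoformat()}, {run.end.isoformat()})")
    # The model and all datasets are Entities.
    for entity in [run.model, *run.inputs, *run.outputs]:
        lines.append(f"  entity({q(entity)})")
    # People and organisations are Agents associated with the run.
    for agent in run.agents:
        lines.append(f"  agent({q(agent)})")
        lines.append(f"  wasAssociatedWith({q(run.run_id)}, {q(agent)}, -)")
    for used in [run.model, *run.inputs]:
        lines.append(f"  used({q(run.run_id)}, {q(used)}, -)")
    for output in run.outputs:
        lines.append(f"  wasGeneratedBy({q(output)}, {q(run.run_id)}, -)")
    lines.append("endDocument")
    return "\n".join(lines)


RUN = ModelRun(
    run_id="model-run",
    model="reefmod",
    inputs=["cmip6-input-data"],
    outputs=["counterfactual-result-dataset"],
    agents=["UQ", "Bob"],
    start=datetime(2023, 11, 1, 8, 0),
    end=datetime(2023, 11, 1, 10, 0),
)
```

Calling `print(to_provn(RUN))` lists one activity, three entities, two agents and the `used`, `wasGeneratedBy` and `wasAssociatedWith` relations between them, matching the Entity / Activity / Agent roles in the diagram.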
Modelling team
Modelling workflow
Provena
Register / Update dataset record and model run provenance
Register input datasets
Register model runs
Register output datasets
Lookup provenance info
Project team
Query info for reporting
Modelling team
Find and access datasets
Registered datasets in RRAP MDS
Registry view in RRAP MDS
Using provenance info
Provena features
Find / query provenance info (data, model, workflows…)
Describe the data and its provenance – human and machine readable
Share data/info (with the right people)
Integrate this into modelling and simulation processes
Register, (store) and access data
Provena can’t do
Finding / querying provenance info (data, model, workflows…)
Describing the data and its provenance
Sharing data/info (with the right people)
Integrating this into modelling and simulation processes
Registering, (storing) and accessing data
Relies on people
What our team think modellers would think…
What modellers actually think…
Register datasets, entities, model runs via Web Browser
Register datasets, entities and model runs via Python client
Register model runs via Excel
Register model runs using common tools like CSV file uploads
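A rough sketch of what bulk registration from a CSV file involves: check the header, then turn each row into a model run record. The column names and the `;` separator for multiple datasets are hypothetical choices for this example; the real upload template is not shown in this talk and may differ.

```python
import csv
import io

# Hypothetical template columns; a real template may differ.
REQUIRED = ["model", "input_datasets", "output_datasets", "start_time", "end_time"]


def split_ids(cell: str) -> list[str]:
    """Split a cell holding several identifiers separated by ';'."""
    return [value.strip() for value in cell.split(";") if value.strip()]


def parse_model_runs(csv_text: str) -> list[dict]:
    """Validate the header and convert each CSV row into a model run record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [column for column in REQUIRED if column not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    runs = []
    for row in reader:
        runs.append({
            "model": row["model"].strip(),
            "inputs": split_ids(row["input_datasets"]),
            "outputs": split_ids(row["output_datasets"]),
            "start_time": row["start_time"].strip(),
            "end_time": row["end_time"].strip(),
        })
    return runs


SAMPLE = """model,input_datasets,output_datasets,start_time,end_time
model-a,ds-1; ds-2,ds-3,2023-11-01T08:00,2023-11-01T10:00
"""
```

Rejecting a file with missing columns before reading any rows gives modellers one clear error instead of a batch of half-registered runs.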
Register datasets and model runs from python environments
Inputs, model run and output record summaries and links automatically generated
Generate a report of modelling activities and datasets
Exporting Data Specification Sheets from Provena Dataset Records using an ISO-19135 template
Tools to integrate with Linked Data using Python libraries
RDF/Turtle serialisation
Prov-N serialisation
Graphviz Prov visualisation using Prov-N doc
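For the visualisation step, the sketch below shows the general idea of turning provenance relations into Graphviz DOT text. It builds the DOT string by hand with the standard library, which is a stand-in for the Python libraries referred to above, and the node names are taken from the generic model run diagram rather than from real records.

```python
def to_dot(relations: list[tuple[str, str, str]]) -> str:
    """Build Graphviz DOT text from (subject, relation, object) triples."""
    lines = ["digraph provenance {", "  rankdir=BT;"]
    for subject, relation, obj in relations:
        # Each PROV relation becomes a labelled edge from subject to object.
        lines.append(f'  "{subject}" -> "{obj}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


RELATIONS = [
    ("Model Run", "used", "Model"),
    ("Model Run", "used", "Input Dataset"),
    ("Output Dataset", "wasGeneratedBy", "Model Run"),
    ("Model Run", "wasAssociatedWith", "Person"),
]
```

Saving the result of `to_dot(RELATIONS)` to a file such as `provenance.dot` and running `dot -Tsvg provenance.dot -o provenance.svg` renders the graph, assuming Graphviz is installed.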
Outcomes and Learnings
Reflections
Highlighted project examples:
RRAP Registered Items to date
Summary
Provena = Cloud-based provenance management system designed to support modelling and simulation activities
Enabling users to do the “basics” – mint identifiers, capture provenance (entities and processes) and support practical uses of provenance info
Provena approach and tools - general purpose and adaptable to different domains and modes of modelling – e.g. NBIC, RRAP MDS
Using Provena to record data provenance
Thank you
CSIRO Environment, Jonathan Yu
Website
Github