1 of 22

DIRISA: A National Data Infrastructure for Digital Humanities

DH-IGNITE

19 October 2022

2 of 22

About DIRISA & NICIS

DIRISA: A national initiative enabling and supporting data-driven research

Researchers deposit, find and access relevant data in the DIRISA Data Commons. They share, reuse and combine data from other domains with their own research in new ways.

Core services

Networked resources

Skills & expertise

Computing Services (CHPC)

Networking Services (SANReN)

Data Services (DIRISA)

Data-based research environments (Cloud)

Materials & Manuf.

Energy

Earth & Environment

Phy Sci & Eng.

Humans & Society

Health, Bio & Food

3 of 22

DIRISA Objectives and Activities

Build research data infrastructure

    • National research data repository
    • Cloud services for RDM and research
    • Online/Active “Hot” storage: 8 PB
    • Archival/Passive “Cold” repository: 20 PB
    • Services: DMP tool; DOI minting

Develop skills and expertise

    • Postgraduate and e-Research R&D
    • Training & workshops
    • National eScience Masters
    • Data Science Summer School, High-school

Advocate and coordinate

    • Research projects and HCD
    • Stakeholder engagement
    • Global: RDA, CODATA, AOSP, GOSC
    • Local: DSI, NRF, USAf, DCDT
    • Annual Research Data Workshop

Strategic input

    • Strategies and frameworks
    • Policies and guidelines
    • National Big Data strategy
    • Open Science Policy Framework

4 of 22

South African Research Data Commons

Authenticate DIRISA user

Research Data Management and Data Based Research Services

DShare

Register at DIRISA

5 of 22

Data Management Planning: DMP_SA Tool

Create data management plans: https://secure.dirisa.ac.za/SADMPTool/

  • Funder requirements
  • Data quality and preservation
  • Visibility and discovery
  • Asset management
  • Publication provenance
  • Attribution: citable data
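
As an illustration of the "Attribution: citable data" item above, the following is a minimal sketch of how a citation string might be built from the descriptive metadata a data management plan would capture. The `DatasetRecord` class, the `format_citation` function, the field names, and the example record (including its DOI under the `10.5555` example prefix) are all hypothetical and are not part of the DMP_SA tool; the layout loosely follows the creator, year, title, publisher, identifier pattern commonly used for data citations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal metadata needed to cite a data set."""
    creators: tuple[str, ...]
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, without the https://doi.org/ resolver prefix


def format_citation(rec: DatasetRecord) -> str:
    """Build 'Creators (Year). Title. Publisher. DOI-URL' from a record."""
    authors = "; ".join(rec.creators)
    return (
        f"{authors} ({rec.year}). {rec.title}. "
        f"{rec.publisher}. https://doi.org/{rec.doi}"
    )


# Illustrative record only: names, title and DOI are placeholders.
example = DatasetRecord(
    creators=("Doe, J.", "Nkosi, T."),
    year=2022,
    title="Example survey dataset",
    publisher="Example Repository",
    doi="10.5555/example-dataset",
)
print(format_citation(example))
```

Keeping the DOI as a separate field, rather than embedding it in free text, is what lets a repository mint, resolve and count citations to the data set independently of any paper that uses it.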

6 of 22

“Data is the new gold”: National Investment in Data

  • SAEON: Environmental
  • DataFirst: Survey and Administrative 
  • Agincourt: Social & Health demographics
  • SANSA: Earth Observation
  • SADA: Survey & related
  • SARAO: Astronomy
  • SAWS: Meteorology, Climate change
  • Government departments (StatsSA, Water, Energy…)
  • Academia (Universities, FETs)
  • Research councils (HSRC, CSIR, WRC, MRC…)
  • iThemba LABS, MeerKAT, Square Kilometre Array

SKA projected budget

€2 billion to 2020

€650 million for Phase 1

SA so far: R2 billion

“We should get more value from our investments in data” [DST Minister Pandor, 2016]

7 of 22

Data Connects Disciplines

  • Physical & Chemical Sciences
  • Biological Sciences
  • Medicine & Health
  • Engineering & Manufacturing
  • Environmental & Earth Sciences
  • Social Sciences & Humanities
  • Languages
  • Education
  • Business & Economics
  • Law
  • Social media

8 of 22

Data Attribution

9 of 22

The Open (Research) Data Mindset

10 of 22

Data Access Model: Open by Default

Closed | Shared | Open

Internal access

    • Private
    • Confidential
    • Sensitive
    • Example: surveillance data

Named access

    • Assigned by contract
    • Regulation authorised
    • Example: driver's licences

Group based access

    • Project assigned
    • Selected membership
    • Example: genomic data

Public access

    • Licence that limits use
    • Terms and conditions
    • Example: geospatial data

Anyone

    • Open to public
    • No limits on use
    • Example: weather data

Personal | Private | Public

Small | Medium | Big
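
The access model on this slide can be sketched as a small data structure. This is an illustrative sketch only, not DIRISA's implementation: the `AccessLevel`, `Dataset` and `spectrum` names are hypothetical, and the grouping of the five levels into Closed, Shared and Open (internal access as closed, anyone as open, the three levels in between as shared) is an assumption about how the slide's spectrum lines up with its columns. "Open by default" is modelled as the default value of the access field.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessLevel(IntEnum):
    """The five access levels on the slide, ordered from closed to open."""
    INTERNAL = 0  # private, confidential, sensitive
    NAMED = 1     # assigned by contract or authorised by regulation
    GROUP = 2     # project assigned or selected membership
    PUBLIC = 3    # public, but a licence or terms limit use
    ANYONE = 4    # open to the public, no limits on use


@dataclass(frozen=True)
class Dataset:
    name: str
    # "Open by default": a data set is open unless a reason to restrict it is given.
    level: AccessLevel = AccessLevel.ANYONE


def spectrum(level: AccessLevel) -> str:
    """Map an access level onto the Closed / Shared / Open spectrum (assumed grouping)."""
    if level == AccessLevel.INTERNAL:
        return "Closed"
    if level == AccessLevel.ANYONE:
        return "Open"
    return "Shared"


weather = Dataset("weather observations")               # no level given: open
genomes = Dataset("genomic cohort", AccessLevel.GROUP)  # explicitly restricted
print(weather.name, "->", spectrum(weather.level))
print(genomes.name, "->", spectrum(genomes.level))
```

Making the open level the default means every restriction has to be stated explicitly, which is the point of an open-by-default policy: closing data is a decision that must be justified, not the starting position.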

11 of 22

Thank you

Dr Anwar Vahed

NICIS – DIRISA

avahed@csir.ac.za

12 of 22

13 of 22

Data Ecosystem, Data Visibility

Well managed data

Funder

(Private, Public)

Publisher

(Profit, Non-profit)

Repository / Long-Term Archive

Data Steward / Data Manager

Researcher, Collector

Library

14 of 22

Research has changed

  • The edge lies between disciplines
    • Internet of Things, Internet of Everything
    • AI, Machine Learning, Neural Networks
    • From 4IR to Society 5.0
    • Big Data
  • A world where all data is linked
  • Ethics & Privacy vs. “Open & Free”
  • Trust & Security
  • Sharing & access
  • Fully open ↔ Fully closed
  • IP & copyright
  • Quality & integrity
  • Data sharing mindset
  • Legislation and regulation: Laws have borders; data does not

15 of 22

DIRISA Activities

  • Data infrastructure and (cloud-based) services
    • Online / Active (“Hot”) store: 8 PB
    • Archival / Passive (“Cold”) repo: 20 PB
    • Research Data Management
  • Capacity & Expertise
    • National Masters in eScience
    • Coordinating training (Data Science)
  • Advocacy & outreach
    • Local: DSI, DTPS, USAf, ASSAF, NRF
    • Global: RDA, DCC, CODATA, WDS, SKA
  • Coordination & strategy
    • National Big Data strategy
    • AOSP; SADC Cyberinfrastructure framework

16 of 22

South African National Data Commons

Tier 3 (Institutional)

Tier 2 (Regional/Thematic)

Tier 1 (National)

Tier 0 (Global)

CERN, SKA

ARDC (Australia)

Nectar

ANDS

JISC (UK)

EUDAT (EU)

NICIS

SANSA

SAEON

Ilifu

IDIA

H3ABioNet

17 of 22

Tier 1 Conceptual Architecture

40 PB

2 PB

Archival data & staging; DevOps

8 (16) PB

Active data: near real time interactive access

0.5 PB

Services & staging between DIRISA and CHPC storage systems

Storage Virtualisation Service

CHPC Lustre or POSIX storage systems

CHPC compute systems

* PB

Software defined storage hierarchy

iRODS

DIRISA cloud portal

18 of 22

High Level Architecture

Distributed Data Clouds Management (iRODS, OpenStack, Ceph, Resonant, …)

Deposit iRODS client

Data Cloud Interface

T2/3

Regional/Other

8 PB

2 PB WOS

Service and Portal Infrastructure

DEPOSIT | DISCOVERY | APPLICATION

DOI: SAFIRE RA…

DMP tool

RDM services

WebDAV

ORCID, Re3data…

Registries

Data Objects

Services

Users

Data Staging

CHPC

T2/3

T2/3

T2/3

40 PB

Collaborators

EUDAT

ARDC

UK DA

JISC

Data.gov

NIST

Hardware

    • Server cluster
    • 8 PB repo

Middleware

    • iRODS, KVM
    • OpenStack

Service apps

    • VMs, Docker containers
    • Ansible

19 of 22

In Conclusion

  • RDM => Greater return on investment (80% of research is publicly funded)
  • Open Data: make data FAIR (Findable, Accessible, Interoperable, Reusable)
  • Increase and broaden research vibrancy (and promote Open Science)
  • Improve integrity and provenance (Link data and literature)
  • Reward and recognition: new ways of attribution (Altmetrics)

Incentive to produce award-worthy data sets

Set benchmark for good research practice

Improve research practice

More good quality data

More and more diverse research

20 of 22

Research Data Value Chain

Phenomena

Simulation models

Instruments, Sensors & Humans

Data collection tools

Research data repositories

Research Analyses & Visualisation

Innovation

21 of 22

Publication Provenance

22 of 22

Improving Return on Data

Research Ecosystems: cross- and multi-disciplinary research

RDM Services: harmonised data management

Federated Data Infrastructure: observations (models and measurements)

Skills and expertise

SAEON

SANSA

StatsSA

SUN

DIRISA

e-Research Environments

Data Management Services

National Data Infrastructure