1 of 29

Statistics Norway’s Dataplatform

A quick introduction to the Dataplatform and the use of

 

2 of 29

3 of 29

4 of 29

5 of 29

6 of 29

List of tech

  • Google Cloud Platform
    • Kubernetes (K8s), Istio, Keycloak, etc.
  • LDS - Linked Data Store
    • Immutable, time-based versioning for metadata, GraphQL
  • Google Dataproc
    • Managed Hadoop + Spark
  • Parquet (data files)
  • PySpark/Python, R, SparkSQL, java-vtl + Spark
  • Zeppelin (other notebook technologies such as Jupyter and Polynote are also being looked into)
  • Data collector
    • General-purpose, configurable API-based data collector (streams + batch), with support for a Kafka backend if needed
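
A minimal sketch of what "immutable, time-based versioning" can mean in practice. This is not the LDS API; the class `VersionedStore`, its methods, and the sample metadata are hypothetical and only illustrate the idea that a write never overwrites history and a read asks for the state "as of" a point in time.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class VersionedStore:
    """Hypothetical append-only store: every write adds a timestamped version."""

    def __init__(self):
        # resource id -> list of (valid_from, document), sorted by valid_from
        self._versions = {}

    def put(self, resource_id, valid_from, document):
        versions = self._versions.setdefault(resource_id, [])
        if versions and valid_from <= versions[-1][0]:
            raise ValueError("new versions must be newer than existing ones")
        # Store a copy so later changes to the caller's dict cannot alter history.
        versions.append((valid_from, dict(document)))

    def get(self, resource_id, as_of):
        """Return the version that was valid at `as_of`, or None if none existed."""
        versions = self._versions.get(resource_id, [])
        timestamps = [ts for ts, _ in versions]
        index = bisect_right(timestamps, as_of)
        return dict(versions[index - 1][1]) if index else None


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


store = VersionedStore()
store.put("dataset/person", utc(2019, 1, 1), {"name": "Person"})
store.put("dataset/person", utc(2019, 6, 1), {"name": "Person (revised)"})

early = store.get("dataset/person", utc(2019, 3, 1))
late = store.get("dataset/person", utc(2020, 1, 1))
print(early, late)
```

Reading with an earlier timestamp still returns the first version, which is the property that makes metadata lookups reproducible over time.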

7 of 29

Data flow

Collect → Raw data (XML) → Convert → Store → Raw data (Parquet) → Input data (Parquet+GSIM) → Process → Processed data (Parquet+GSIM)

8 of 29

Collect (diagram)

  • FREG (External): XSD, Atom feed + HTTP resources
  • data-collector: specification, feed & XML
  • Raw data: XML stream
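
A minimal sketch of how a collector might walk an Atom feed and pick up the entries it has not seen yet. This is not the data-collector's own code or specification format; the sample feed, the function `entries_after`, and the assumption that the feed lists entries oldest first are all hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed content, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example events</title>
  <entry>
    <id>urn:example:event:1</id>
    <link href="https://example.invalid/events/1"/>
  </entry>
  <entry>
    <id>urn:example:event:2</id>
    <link href="https://example.invalid/events/2"/>
  </entry>
  <entry>
    <id>urn:example:event:3</id>
    <link href="https://example.invalid/events/3"/>
  </entry>
</feed>"""


def entries_after(feed_xml, last_seen_id=None):
    """Return feed entries that come after `last_seen_id` (all if unknown)."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "href": link.get("href") if link is not None else None,
        })
    ids = [e["id"] for e in entries]
    if last_seen_id in ids:
        # Resume just after the last entry that was already collected.
        return entries[ids.index(last_seen_id) + 1:]
    return entries


new_entries = entries_after(SAMPLE_FEED, last_seen_id="urn:example:event:1")
print([e["href"] for e in new_entries])
```

Each `href` would then be fetched as an HTTP resource and appended to the raw XML stream.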

9 of 29

Convert and Store (diagram)

  • Raw data: XML stream
  • Converter: XSD (FREG)
  • Cloud Storage: Parquet
  • Data lineage
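
A minimal sketch of the conversion idea: turning nested XML records into flat, column-like rows that a columnar format such as Parquet can hold. It is not the platform's converter, which is driven by the XSD; the sample XML, `flatten`, and `to_records` are hypothetical, and writing the rows to Parquet is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw data, for illustration only.
SAMPLE_XML = """<persons>
  <person>
    <id>1</id>
    <name><given>Kari</given><family>Nordmann</family></name>
    <address><municipality>0301</municipality></address>
  </person>
  <person>
    <id>2</id>
    <name><given>Ola</given><family>Nordmann</family></name>
    <address><municipality>4601</municipality></address>
  </person>
</persons>"""


def flatten(element, prefix=""):
    """Flatten an element tree into {"dotted.path": text} for its leaves."""
    children = list(element)
    if not children:
        return {prefix: (element.text or "").strip()}
    row = {}
    for child in children:
        key = f"{prefix}.{child.tag}" if prefix else child.tag
        # Note: repeated sibling tags would overwrite each other in this sketch.
        row.update(flatten(child, key))
    return row


def to_records(xml_text, record_tag):
    """Return one flat dict per element named `record_tag`."""
    root = ET.fromstring(xml_text)
    return [flatten(element) for element in root.iter(record_tag)]


records = to_records(SAMPLE_XML, "person")
for record in records:
    print(record)
```

The dotted keys keep a trace of where each value came from in the source XML, which is also the kind of information data lineage needs.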

10 of 29

Raw Data to Input Data (diagram)

  • Cloud Storage
  • LDS
  • Work bench / Tools
  • Cloud Dataproc
  • Data lineage (Raw Data, Input Data)

11 of 29

Input Data to Processed Data (diagram)

  • Cloud Storage
  • LDS
  • Work bench / Tools
  • Cloud Dataproc
  • Data lineage (Input Data, Processed)

12 of 29

Zeppelin: From raw data to Input data ... and beyond

13 of 29

Read a Parquet file from the raw data storage

14 of 29

Select from hierarchy

15 of 29

Create desired data structure

16 of 29

Visual inspection using a simple aggregation
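
A minimal sketch of the three preceding steps: select from the hierarchy, create the desired structure, and aggregate for a quick inspection. It uses plain Python on nested records as a stand-in for a Spark job, so it runs without a cluster; the sample rows, the column names, and the helper `select_path` are hypothetical.

```python
from collections import Counter

# Hypothetical nested rows, shaped like records read from a raw Parquet file.
raw_rows = [
    {"person": {"id": "1", "address": {"municipality": "0301"}, "birth": {"year": 1980}}},
    {"person": {"id": "2", "address": {"municipality": "4601"}, "birth": {"year": 1975}}},
    {"person": {"id": "3", "address": {"municipality": "0301"}, "birth": {"year": 1992}}},
]


def select_path(row, path):
    """Follow a dotted path into a nested dict; return None if it is missing."""
    value = row
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Desired flat structure: output column -> path in the raw hierarchy.
COLUMNS = {
    "person_id": "person.id",
    "municipality": "person.address.municipality",
    "birth_year": "person.birth.year",
}

input_rows = [
    {column: select_path(row, path) for column, path in COLUMNS.items()}
    for row in raw_rows
]

# Simple aggregation for visual inspection: number of persons per municipality.
per_municipality = Counter(row["municipality"] for row in input_rows)
for municipality, count in sorted(per_municipality.items()):
    print(municipality, count)
```

A count per group is usually enough to spot missing values or an unexpected distribution before the data is written onward.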

17 of 29

Write to LDS (in GSIM format)

18 of 29

Zeppelin: Process and connections to Business Group in GSIM

19 of 29

Focus on the “Process” step (Input data to Processed data)

20 of 29

21 of 29

Notes as ProcessStep with code as codeBlocks
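
A minimal sketch of the mapping named on this slide: one notebook note becomes one ProcessStep, and each non-empty paragraph becomes a code block. The sample note and the field names other than `codeBlocks` (`codeBlockIndex`, `codeBlockTitle`, `codeBlockValue`) are hypothetical and may not match the model used in LDS.

```python
# Hypothetical note, using the "name" / "paragraphs" / "title" / "text" layout
# of a notebook export.
note = {
    "name": "Person: raw data to input data",
    "paragraphs": [
        {"title": "Read raw data", "text": "%pyspark\nraw = spark.read.parquet(path)"},
        {"title": "Empty scratch cell", "text": "   "},
        {"title": "Count rows", "text": "%pyspark\nprint(raw.count())"},
    ],
}


def note_to_process_step(note):
    """Map a note to a ProcessStep-like dict with its code as codeBlocks."""
    code_blocks = []
    for index, paragraph in enumerate(note.get("paragraphs", [])):
        text = paragraph.get("text", "")
        if not text.strip():
            continue  # skip empty paragraphs, they carry no process logic
        code_blocks.append({
            "codeBlockIndex": index,
            "codeBlockTitle": paragraph.get("title", ""),
            "codeBlockValue": text,
        })
    return {"name": note["name"], "codeBlocks": code_blocks}


process_step = note_to_process_step(note)
print(process_step["name"], len(process_step["codeBlocks"]))
```

Keeping the original paragraph index makes it possible to link a code block back to its position in the note.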

22 of 29

Sneak preview: Process graph based on the notebook (v0.2 alpha)
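
A minimal sketch of how a process graph could be derived once each step declares what it reads and writes. This is one possible approach, not the one used in the preview; the step names, dataset paths, and `build_graph` are hypothetical.

```python
# Hypothetical steps with the datasets they read and write.
steps = [
    {"name": "convert", "reads": ["raw/freg"], "writes": ["input/person"]},
    {"name": "aggregate", "reads": ["input/person"], "writes": ["processed/by_municipality"]},
    {"name": "publish", "reads": ["processed/by_municipality"], "writes": []},
]


def build_graph(steps):
    """Return edges (producer step, dataset, consumer step)."""
    producers = {}
    for step in steps:
        for dataset in step["writes"]:
            producers[dataset] = step["name"]
    edges = []
    for step in steps:
        for dataset in step["reads"]:
            producer = producers.get(dataset)
            # Datasets with no producer in the list are external inputs.
            if producer is not None and producer != step["name"]:
                edges.append((producer, dataset, step["name"]))
    return edges


edges = build_graph(steps)
for producer, dataset, consumer in edges:
    print(f"{producer} -> {consumer} via {dataset}")
```

The same edges double as data lineage: each one states which step produced a dataset and which step consumed it.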

23 of 29

Notes as Process steps: Templates and best practice

24 of 29

Data and metadata browsers + tools for managing production

Integration with Spark APIs

Offer alternative views for certain operations

25 of 29

26 of 29

27 of 29

28 of 29

29 of 29

All software is on GitHub and is open source.