Statistics Norway’s Dataplatform
A quick introduction to the Dataplatform and the use of
List of tech
[Diagram] Pipeline steps: Collect, Convert, Store, Process
[Diagram] Data states: Raw data (XML), Raw data (Parquet), Input data (Parquet+GSIM), Processed data (Parquet+GSIM)
Data flow
[Diagram] FREG (External): XSD; Atom feed + HTTP resources
[Diagram] Raw data data-collector: specification; Feed & XML; XML stream
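The collect step reads an Atom feed that points to HTTP resources. A minimal sketch of that idea, not the data-collector's actual code: the sample feed, the `list_resources` helper and the `after_id` resume parameter are all assumptions made for illustration, and no network call is made.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed content; the real FREG feed layout is not shown in the slides.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example feed</title>
  <entry>
    <id>urn:example:1</id>
    <link rel="alternate" href="https://example.invalid/resource/1"/>
  </entry>
  <entry>
    <id>urn:example:2</id>
    <link rel="alternate" href="https://example.invalid/resource/2"/>
  </entry>
</feed>
"""


def list_resources(feed_xml, after_id=None):
    """Return (entry id, resource URL) pairs, optionally resuming after a known entry."""
    root = ET.fromstring(feed_xml.strip())
    resources = []
    skipping = after_id is not None
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", namespaces=ATOM_NS)
        if skipping:
            # Skip everything up to and including the last entry already collected.
            if entry_id == after_id:
                skipping = False
            continue
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href") if link is not None else None
        resources.append((entry_id, href))
    return resources
```

A collector built this way would fetch each returned URL and pass the XML on as a stream; remembering the last entry id lets it resume without re-reading the whole feed.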
[Diagram] Raw data Converter: XSD (FREG); XML stream; Cloud Storage; Parquet
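The convert step turns XML into a structure that can be stored as Parquet while keeping the hierarchy. A minimal sketch of the XML-to-record part only, assuming a made-up sample document and a hypothetical `element_to_record` helper; the real converter is driven by the FREG XSD, and writing the Parquet file is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; not the FREG schema.
SAMPLE_XML = (
    "<person><id>1</id>"
    "<name><first>Kari</first><last>Nordmann</last></name>"
    "<address>A</address><address>B</address></person>"
)


def element_to_record(element):
    """Convert an XML element into nested dicts, lists and strings."""
    children = list(element)
    if not children:
        # Leaf element: keep its text content.
        return (element.text or "").strip()
    record = {}
    for child in children:
        value = element_to_record(child)
        if child.tag in record:
            # A repeated tag becomes a list of values.
            if not isinstance(record[child.tag], list):
                record[child.tag] = [record[child.tag]]
            record[child.tag].append(value)
        else:
            record[child.tag] = value
    return record


record = element_to_record(ET.fromstring(SAMPLE_XML))
```

Nested records like this map naturally onto Parquet's nested column types. This sketch keeps every value as a string, whereas a schema-driven converter could use the XSD to choose proper types.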
Data lineage
[Diagram] Cloud Storage; LDS; Work bench / Tools; Cloud Dataproc
[Diagram] Data lineage: Raw Data, Input Data
[Diagram] Cloud Storage; LDS; Work bench / Tools; Cloud Dataproc
[Diagram] Data lineage: Input Data, Processed
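Data lineage links each dataset back to the data it was derived from (raw data to input data to processed data). A minimal sketch of that idea as a plain mapping, assuming hypothetical dataset names and a hypothetical `upstream` helper; it does not reflect how LDS stores lineage.

```python
# Hypothetical dataset names; each key maps to the datasets it was derived from.
LINEAGE = {
    "input/person": ["raw/person"],
    "processed/person_by_municipality": ["input/person"],
}


def upstream(dataset, lineage=LINEAGE):
    """Return all datasets that the given dataset depends on, nearest first."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(lineage.get(current, []))
    return seen
```

Calling `upstream("processed/person_by_municipality")` walks back through the input data to the raw data it came from.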
Zeppelin: From raw data to Input data ... and beyond
Read a Parquet file from the raw data storage
Select from hierarchy
Create desired data structure
Visual inspection using a simple aggregation
Write to LDS (in GSIM format)
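The middle steps of this notebook (select from the hierarchy, create the desired structure, inspect with a simple aggregation) can be sketched in plain Python. This is a stand-in under stated assumptions, not the notebook's code: the records, field names and helpers are hypothetical, and reading Parquet and writing to LDS are left out because the slides do not show those APIs.

```python
from collections import Counter

# Hypothetical nested records, standing in for rows read from a raw-data Parquet file.
RAW_RECORDS = [
    {"person": {"id": "1", "address": {"municipality": "0301"}}},
    {"person": {"id": "2", "address": {"municipality": "0301"}}},
    {"person": {"id": "3", "address": {"municipality": "1103"}}},
]


def select(record, dotted_path):
    """Pick one value out of the hierarchy, e.g. 'person.address.municipality'."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def to_input_rows(records):
    """Create the desired flat structure from the nested raw records."""
    return [
        {
            "person_id": select(r, "person.id"),
            "municipality": select(r, "person.address.municipality"),
        }
        for r in records
    ]


def count_by(rows, column):
    """Simple aggregation for visual inspection: number of rows per value."""
    return Counter(row[column] for row in rows)


rows = to_input_rows(RAW_RECORDS)
counts = count_by(rows, "municipality")
```

In a Zeppelin notebook the same three steps would typically be expressed against a Spark DataFrame, with the aggregation shown as a chart.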
Zeppelin: Process and connections to Business Group in GSIM
Focus on the “Process” step (Input data to Processed data)
Notes as ProcessStep with code as codeBlocks
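Treating a note as a ProcessStep with its code as codeBlocks can be illustrated with a small mapping function. This is a sketch under assumptions: the note is a simplified dict with a `name` and a list of `paragraphs` that each carry `text`, and the output layout is hypothetical rather than the GSIM model used on the platform.

```python
# Hypothetical, simplified note; a real Zeppelin note carries more fields.
SAMPLE_NOTE = {
    "name": "Input data to Processed data",
    "paragraphs": [
        {"text": "df = read_input()"},
        {"text": "   "},  # empty paragraph, ignored below
        {"text": "result = df.groupBy('municipality').count()"},
    ],
}


def note_to_process_step(note):
    """Map one note to one ProcessStep-like dict, one code block per non-empty paragraph."""
    code_blocks = [
        paragraph["text"]
        for paragraph in note.get("paragraphs", [])
        if paragraph.get("text", "").strip()
    ]
    return {"name": note["name"], "codeBlocks": code_blocks}


step = note_to_process_step(SAMPLE_NOTE)
```

Keeping the paragraphs in order preserves the sequence of operations, which is what a process graph derived from the notebook would need.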
Sneak preview: Process graph based on the notebook (v0.2 alpha)
Notes as Process steps: Templates and best practice
Data and metadata browsers + tools for managing production
Integration with Spark APIs
Offer alternative views for certain operations
All software is on GitHub ... and
open source.