1 of 70

LinkedPipes ETL

Jakub Klímek, Petr Škoda

2 of 70

Publication of Linked Open Data (LOD)

  1. Gather sources (RDB, XML, CSV, JSON, XLSX, proprietary)
  2. Analyze them (UML class diagrams)
  3. Design transformations to RDF (vocabularies to be used - LOV)
  4. Implement transformations to RDF
    • Wrappers (D2RQ)
    • Transformers (RML)
  5. Publish files (RDF Turtle, RDF TriG, JSON-LD, ...)
  6. Upload to SPARQL endpoint (Virtuoso, Blazegraph, RDF4J, Jena, …)
  7. Register in Open Data catalog (CKAN, DKAN, …)

2

3 of 70

LinkedPipes ETL - Extract Transform Load for LOD

3

EXTRACT

TRANSFORM

LOAD

components in a pipeline

4 of 70

LinkedPipes ETL - examples of components

4

EXTRACT

TRANSFORM

LOAD

5 of 70

LinkedPipes ETL - Passing Files

  1. Component 1 creates files
  2. Component 2 consumes files
  3. Data is passed as a directory in local file system

5

6 of 70

LinkedPipes ETL - Passing RDF data

  • Component 1 creates data
  • Component 2 consumes data
  • Data is passed as an RDF4J repository (Native or In-Memory)

6

7 of 70

Designing a pipeline in LP-ETL

8 of 70

LP-ETL Component types

8

9 of 70

LP-ETL Pipeline designer

9

10 of 70

LP-ETL Pipeline designer

10

  • label
  • function

11 of 70

LP-ETL Pipeline designer

11

Drag from data unit�& Drop

12 of 70

LP-ETL Pipeline designer

12

Filter by:

  • compatible input
  • frequent usage

13 of 70

LP-ETL Pipeline designer

13

14 of 70

LP-ETL Pipeline designer

14

Configuration

“Run before”

Enable & Disable

Debug to

Delete

Copy

15 of 70

Debugging in LP-ETL

16 of 70

LP-ETL Pipeline debugging - overview

16

17 of 70

LP-ETL Pipeline debugging - component states

17

18 of 70

LP-ETL Pipeline debugging - error details

18

19 of 70

LP-ETL Pipeline debugging - fixing components

19

20 of 70

LP-ETL Pipeline debugging - “debug from”

20

21 of 70

LP-ETL Pipeline debugging - debug data - files

21

22 of 70

LP-ETL Pipeline debugging - debug data - RDF

22

23 of 70

LP-ETL Pipeline debugging - data invalidation

23

24 of 70

LP-ETL Pipeline debugging - data invalidation

24

25 of 70

LP-ETL Architecture & API

26 of 70

LP-ETL Architecture

26

27 of 70

LP-ETL API

27

https://demo.etl.linkedpipes.com/#/pipelines/edit/canvas?pipeline=https:%2F%2Fdemo.etl.linkedpipes.com%2Fresources%2Fpipelines%2Fcreated-1509344965849&execution=https:%2F%2Fdemo.etl.linkedpipes.com%2Fresources%2Fexecutions%2Fff1f83a8-5d68-4bbd-82ea-e5cbbfa129b8

28 of 70

LP-ETL API - execute a pipeline

28

curl -i -X POST https://demo.etl.linkedpipes.com/resources/executions?pipeline=https://demo.etl.linkedpipes.com/resources/pipelines/created-1509344965849

HTTP/1.1 200 OK

Date: Mon, 30 Oct 2017 07:27:45 GMT

Content-Type: application/json;charset=UTF-8

{"iri":"https://demo.etl.linkedpipes.com/resources/executions/f38208c0-a6a9-410e-8ae2-79c06e27642f"}

29 of 70

LP-ETL API - monitor execution

29

30 of 70

LP-ETL Components Overview

31 of 70

LP-ETL Components overview - more than 60

  • Extractors
    • HTTP GET, SPARQL endpoint (and variants)
    • Text holder
    • Extract from FTP
    • Pipeline input
    • Files from local
  • Transformers
    • SPARQL Construct, SPARQL Update
    • Files to RDF, RDF to file
    • XSLT, JSON to JSON-LD, Tabular, Excel to CSV
    • Bing translator, Decode base64, File hasher, Compress, Decompress, Mustache
  • Loaders
    • Files to SCP
    • SPARQL Update, SPARQL Graph Store Protocol
    • DCAT-AP to CKAN
    • Files to local

31

32 of 70

LP-ETL Component Templates

33 of 70

LP-ETL Component Templates

33

34 of 70

LP-ETL Component Templates

34

35 of 70

Runtime configuration of

LP-ETL components

36 of 70

LP-ETL Runtime configuration

@prefix httpList: <http://plugins.linkedpipes.com/ontology/e-httpGetFiles#> .��<http://localhost/resource/configuration> a httpList:Configuration ;� httpList:reference <http://localhost/resource/ref/2015-04-30> , <http://localhost/resource/ref/2015-05-18> , <http://localhost/resource/ref/2015-06-16> , <http://localhost/resource/ref/2015-07-21> .� �<http://localhost/resource/ref/2015-04-30> a httpList:Reference ;� httpList:fileName "2015-04-30.xml" ;� httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-04.xml" .��<http://localhost/resource/ref/2015-05-18> a httpList:Reference ;� httpList:fileName "2015-05-18.xml" ;� httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-05.xml" .��<http://localhost/resource/ref/2015-06-16> a httpList:Reference ;� httpList:fileName "2015-06-16.xml" ;� httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-06.xml" .��<http://localhost/resource/ref/2015-07-21> a httpList:Reference ;� httpList:fileName "2015-07-21.xml" ;� httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-07.xml" .

36

37 of 70

LP-ETL Tasks

<https://nkod.opendata.cz/sparql> a <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#Task> ;� <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#endpoint> "https://nkod.opendata.cz/sparql" ;� <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#group> "https://nkod.opendata.cz/sparql" ;� <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#query> """PREFIX adhoc: <http://linked.opendata.cz/ontology/adhoc/>�CONSTRUCT {� [] adhoc:class ?Class ;� adhoc:endpointUri \"https://nkod.opendata.cz/sparql\";� adhoc:numberOfInstances ?numberOfInstances .�} WHERE {� {� SELECT ?Class (COUNT(?resource) AS ?numberOfInstances)� WHERE {� ?resource a ?Class.� }� GROUP BY ?Class� }�}""" .

37

38 of 70

Processing larger data in LP-ETL

with RDF data chunking

39 of 70

LinkedPipes ETL - Passing RDF data

  • Component 1 creates data
  • Component 2 consumes data
  • Data is passed as an RDF4J repository (Native or In-Memory)
  • Small data => OK
  • Larger data (e.g. 1 GB CSV file and larger) => OutOfMemoryException

39

40 of 70

LP-ETL - CSV to RDF conversion example

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

40

Company name

ID

My First Company

00122344

Unlimited Ltd.

11334499

ACME

99778811

Trading One

19971375

Trading Two

99771133

The Company

00990099

according to�Generating RDF from Tabular Data on the Web W3C Recommendation

41 of 70

LP-ETL - RDF data chunk size: 6 rows ~ 12 triples

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

41

Company name

ID

My First Company

00122344

Unlimited Ltd.

11334499

ACME

99778811

Trading One

19971375

Trading Two

99771133

The Company

00990099

according to�Generating RDF from Tabular Data on the Web W3C Recommendation

42 of 70

LP-ETL - RDF data chunk size: 2 rows ~ 4 triples

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

42

Company name

ID

My First Company

00122344

Unlimited Ltd.

11334499

ACME

99778811

Trading One

19971375

Trading Two

99771133

The Company

00990099

according to�Generating RDF from Tabular Data on the Web W3C Recommendation

43 of 70

LP-ETL - RDF data chunk size: 1 row ~ 2 triples

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

43

Company name

ID

My First Company

00122344

Unlimited Ltd.

11334499

ACME

99778811

Trading One

19971375

Trading Two

99771133

The Company

00990099

according to�Generating RDF from Tabular Data on the Web W3C Recommendation

44 of 70

LP-ETL - RDF data chunk: bad chunking

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

44

Company name

ID

My First Company

00122344

Unlimited Ltd.

11334499

ACME

99778811

Trading One

19971375

Trading Two

99771133

The Company

00990099

according to�Generating RDF from Tabular Data on the Web W3C Recommendation

45 of 70

LP-ETL - Chunked components in a pipeline

Yellow = Chunked�Only N chunks in memory at one time

45

46 of 70

LP-ETL - Not everything can be solved by chunking

SELECT (COUNT (DISTINCT ?s) AS ?dsubjects) �WHERE {?s ?p ?o}�...

46

47 of 70

LinkedPipes ETL - Chunked components

  • Extractors
    • SPARQL endpoint
  • Transformers
    • Creating chunks
      • Tabular
      • Files To RDF
    • Transforming chunks
      • SPARQL Construct
      • SPARQL Update
      • Chunked To Files - each chunk forms an RDF file
      • Mustache - files from data + text templates
      • SPARQL Linker
    • Merging chunks
      • Merger
  • Loaders
    • SPARQL Update

47

48 of 70

Datasets: List of Czech business entity IDs

  • 4.3M rows, 200MB CSV file => 14.5M RDF triples, avg 3.3 triples per record
  • 5h 30m = 19 800 seconds w/o chunking in UnifiedViews
  • 48m = 2 880 seconds using TARQL

48

49 of 70

Datasets: Sample of Czech Business Registry

  • 220K RDF files = 26M triples => 10M triples (anonymization)
  • avg 119.8 triples per record, 856 max, 29 min
  • Did not finish in UnifiedViews

49

50 of 70

LP-ETL Documentation

51 of 70

LP-ETL Component Documentation

  • Web - https://etl.linkedpipes.com/
  • Components documentation
  • Pipeline fragments
  • Tutorials & How-Tos

51

52 of 70

LP-ETL Component Documentation

52

53 of 70

LP-ETL Component Documentation

53

54 of 70

LP-ETL Pipeline fragments

54

55 of 70

LP-ETL Pipeline fragments

55

56 of 70

LP-ETL Tutorials & How-Tos

56

57 of 70

LP-ETL Tutorials & How-Tos

  • https://etl.linkedpipes.com/tutorials/
  • Tutorials
    • Converting tabular data to RDF
    • Geocoding with Nominatim
  • How-Tos
    • How to generate SPARQL queries via Mustache
    • How to cache linked data
    • How to map RDF properties using SPARQL Update
    • How to test data
    • How to process large RDF data
    • How to load data to Virtuoso
    • How to compress and decompress data
    • How to input fixed data
    • How to map JSON to RDF
    • How to convert CSV to RDF
    • How to convert RDF to CSV
    • How to convert XML to RDF

57

58 of 70

LP-ETL NKOD Pipeline

58

59 of 70

LP-ETL Known instances

  • Charles University - teaching LOD, research projects, OpenData.cz datasets
  • Czech Technical University - teaching LOD
  • University of Economics - teaching LOD, research projects, data publishing
  • Ministry of the Interior of the Czech Republic - National Open Data Catalog transformation and data quality
  • Datlowe - working with semantic models for text processing
  • Municipality of Brno - data publishing and catalogization
  • Greek National Library
  • Fraunhofer IAIS - OpenBudgets.eu platform

59

60 of 70

Advanced RDF data consumption

using LP-ETL

61 of 70

LP-ETL and larger data

CONSTRUCT {�?point a gml:Point; � gml:pos ?o .�}�WHERE �{�?point a gml:Point; � gml:pos ?o . �}

61

62 of 70

LP-ETL - bad chunking ~ bad paging

_:1 :Company+name "My First Company" ;� :ID "00122344" .

_:2 :Company+name "Unlimited Ltd." ;� :ID "11334499" .

_:3 :Company+name "ACME" ;� :ID "99778811" .

_:4 :Company+name "Trading One" ;� :ID "19971375" .

_:5 :Company+name "Trading Two" ;� :ID "99771133" .

_:6 :Company+name "The Company" ;� :ID "00990099" .

CONSTRUCT {�?point a gml:Point; � gml:pos ?o .�}�WHERE �{�?point a gml:Point; � gml:pos ?o . �}

LIMIT 3

62

63 of 70

LP-ETL and larger data

SELECT ?point �WHERE {� [] ruian:adresniBod ?point. �}

SELECT ?point �WHERE {� [] ruian:adresniBod ?point. �}

LIMIT 100�OFFSET 10200

63

HttpException: 500 SPARQL Request Failed

Virtuoso 22023 Error SR353: Sorted TOP clause specifies more then 41000 rows to sort.

Only 40000 are allowed.

Either decrease the offset and/or row count or use a scrollable cursor

64 of 70

LP-ETL and Virtuoso Scrollable Cursor

SELECT ?point�WHERE�{� {� SELECT ?point � WHERE {� [] ruian:adresniBod ?point. � }� ORDER BY ASC(?point) � }�}�LIMIT 100�OFFSET 10200

64

65 of 70

LP-ETL and larger data

65

CSV

66 of 70

LP-ETL and larger data - 1 record

CONSTRUCT {�my:Point a gml:Point; � gml:pos ?o .�}�WHERE �{�my:Point a gml:Point; � gml:pos ?o . �}

66

67 of 70

LP-ETL and larger data - 2 records

CONSTRUCT {�?point a gml:Point; � gml:pos ?o .�}�WHERE �{�?point a gml:Point; � gml:pos ?o . �VALUES ?point { my:Point1 my:Point2 }�}

67

68 of 70

LP-ETL and larger data - all records

CONSTRUCT {�?point a gml:Point; � gml:pos ?o .�}�WHERE �{�?point a gml:Point; � gml:pos ?o . �${VALUES}�}

68

69 of 70

LP-ETL and larger data

69

CSV

70 of 70

LinkedPipes ETL

Jakub Klímek, Petr Škoda