LinkedPipes ETL
Jakub Klímek, Petr Škoda
Publication of Linked Open Data (LOD)
LinkedPipes ETL - Extract Transform Load for LOD
EXTRACT, TRANSFORM, LOAD components in a pipeline
LinkedPipes ETL - examples of components
EXTRACT, TRANSFORM, LOAD
LinkedPipes ETL - Passing Files
LinkedPipes ETL - Passing RDF data
Designing a pipeline in LP-ETL
LP-ETL Component types
LP-ETL Pipeline designer
Drag from data unit & Drop
Filter by:
Configuration
“Run before”
Enable & Disable
Debug to
Delete
Copy
Debugging in LP-ETL
LP-ETL Pipeline debugging - overview
LP-ETL Pipeline debugging - component states
LP-ETL Pipeline debugging - error details
LP-ETL Pipeline debugging - fixing components
LP-ETL Pipeline debugging - “debug from”
LP-ETL Pipeline debugging - debug data - files
LP-ETL Pipeline debugging - debug data - RDF
LP-ETL Pipeline debugging - data invalidation
LP-ETL Architecture & API
LP-ETL Architecture
LP-ETL API
https://demo.etl.linkedpipes.com/#/pipelines/edit/canvas?pipeline=https:%2F%2Fdemo.etl.linkedpipes.com%2Fresources%2Fpipelines%2Fcreated-1509344965849&execution=https:%2F%2Fdemo.etl.linkedpipes.com%2Fresources%2Fexecutions%2Fff1f83a8-5d68-4bbd-82ea-e5cbbfa129b8
LP-ETL API - execute a pipeline
curl -i -X POST "https://demo.etl.linkedpipes.com/resources/executions?pipeline=https://demo.etl.linkedpipes.com/resources/pipelines/created-1509344965849"
HTTP/1.1 200 OK
Date: Mon, 30 Oct 2017 07:27:45 GMT
Content-Type: application/json;charset=UTF-8

{"iri":"https://demo.etl.linkedpipes.com/resources/executions/f38208c0-a6a9-410e-8ae2-79c06e27642f"}
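The same call can be scripted. The following is a minimal Python sketch of the request shown above, not an official client: the helper names are hypothetical, and percent-encoding the pipeline IRI assumes the server decodes query parameters in the usual way. Only the endpoint path, the `pipeline` parameter and the `iri` field of the response are taken from the slide.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://demo.etl.linkedpipes.com"  # demo instance from the slides


def execution_request_url(base: str, pipeline_iri: str) -> str:
    """Build the POST target; the pipeline IRI is percent-encoded as a query value."""
    query = urllib.parse.urlencode({"pipeline": pipeline_iri})
    return f"{base}/resources/executions?{query}"


def parse_execution_iri(body: str) -> str:
    """Extract the execution IRI from the JSON response body."""
    return json.loads(body)["iri"]


def execute_pipeline(base: str, pipeline_iri: str) -> str:
    """POST to the executions resource and return the new execution IRI.

    Needs network access to a running LP-ETL instance.
    """
    request = urllib.request.Request(
        execution_request_url(base, pipeline_iri), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return parse_execution_iri(response.read().decode("utf-8"))
```

Calling `execute_pipeline(BASE, BASE + "/resources/pipelines/created-1509344965849")` would start the pipeline from the slide and return the IRI of the new execution.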
LP-ETL API - monitor execution
LP-ETL Components Overview
LP-ETL Components overview - more than 60
LP-ETL Component Templates
Runtime configuration of LP-ETL components
LP-ETL Runtime configuration
@prefix httpList: <http://plugins.linkedpipes.com/ontology/e-httpGetFiles#> .

<http://localhost/resource/configuration> a httpList:Configuration ;
    httpList:reference <http://localhost/resource/ref/2015-04-30> ,
        <http://localhost/resource/ref/2015-05-18> ,
        <http://localhost/resource/ref/2015-06-16> ,
        <http://localhost/resource/ref/2015-07-21> .

<http://localhost/resource/ref/2015-04-30> a httpList:Reference ;
    httpList:fileName "2015-04-30.xml" ;
    httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-04.xml" .

<http://localhost/resource/ref/2015-05-18> a httpList:Reference ;
    httpList:fileName "2015-05-18.xml" ;
    httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-05.xml" .

<http://localhost/resource/ref/2015-06-16> a httpList:Reference ;
    httpList:fileName "2015-06-16.xml" ;
    httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-06.xml" .

<http://localhost/resource/ref/2015-07-21> a httpList:Reference ;
    httpList:fileName "2015-07-21.xml" ;
    httpList:fileUri "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-07.xml" .
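A configuration of this shape is repetitive enough to generate. The sketch below renders the same Turtle structure from a mapping of reference keys to download URLs. The function name is hypothetical, the vocabulary terms are copied from the example above, and the code assumes keys and URLs contain no quotes or backslashes and that at least one file is given.

```python
PREFIX = "http://plugins.linkedpipes.com/ontology/e-httpGetFiles#"
BASE = "http://localhost/resource/"


def runtime_configuration(files: dict[str, str]) -> str:
    """Render a Turtle configuration with one httpList:Reference per file.

    `files` maps a reference key (used in the IRI and the file name)
    to the URL the file should be downloaded from.
    """
    refs = [f"<{BASE}ref/{key}>" for key in files]
    lines = [
        f"@prefix httpList: <{PREFIX}> .",
        "",
        f"<{BASE}configuration> a httpList:Configuration ;",
        "    httpList:reference " + " ,\n        ".join(refs) + " .",
    ]
    for key, url in files.items():
        lines += [
            "",
            f"<{BASE}ref/{key}> a httpList:Reference ;",
            f'    httpList:fileName "{key}.xml" ;',
            f'    httpList:fileUri "{url}" .',
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(runtime_configuration({
        "2015-04-30": "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-04.xml",
        "2015-05-18": "http://portal.gov.cz/portal/rejstriky/data/97898/index-2015-05.xml",
    }))
```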
LP-ETL Tasks
<https://nkod.opendata.cz/sparql> a <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#Task> ;
    <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#endpoint> "https://nkod.opendata.cz/sparql" ;
    <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#group> "https://nkod.opendata.cz/sparql" ;
    <http://plugins.linkedpipes.com/ontology/e-sparqlEndpointList#query> """PREFIX adhoc: <http://linked.opendata.cz/ontology/adhoc/>
CONSTRUCT {
  [] adhoc:class ?Class ;
    adhoc:endpointUri \"https://nkod.opendata.cz/sparql\";
    adhoc:numberOfInstances ?numberOfInstances .
} WHERE {
  {
    SELECT ?Class (COUNT(?resource) AS ?numberOfInstances)
    WHERE {
      ?resource a ?Class.
    }
    GROUP BY ?Class
  }
}""" .
Processing larger data in LP-ETL with RDF data chunking
LinkedPipes ETL - Passing RDF data
LP-ETL - CSV to RDF conversion example
Company name     | ID
My First Company | 00122344
Unlimited Ltd.   | 11334499
ACME             | 99778811
Trading One      | 19971375
Trading Two      | 99771133
The Company      | 00990099

converted according to the "Generating RDF from Tabular Data on the Web" W3C Recommendation:

_:1 :Company+name "My First Company" ;
    :ID "00122344" .
_:2 :Company+name "Unlimited Ltd." ;
    :ID "11334499" .
_:3 :Company+name "ACME" ;
    :ID "99778811" .
_:4 :Company+name "Trading One" ;
    :ID "19971375" .
_:5 :Company+name "Trading Two" ;
    :ID "99771133" .
_:6 :Company+name "The Company" ;
    :ID "00990099" .
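The row-to-entity mapping in this example can be imitated in a few lines of Python. This is only a sketch that reproduces the shape of the output on the slide (one blank node per row, one triple per column, spaces in column names turned into `+`). It is not an implementation of the W3C Recommendation, and the comma-separated input is an assumption, since the slide shows the table rather than the raw file.

```python
import csv
import io
from urllib.parse import quote_plus

# Two rows of the example table, assumed to be comma-separated.
CSV_TEXT = """Company name,ID
My First Company,00122344
Unlimited Ltd.,11334499
"""


def rows_to_turtle(csv_text: str) -> list[str]:
    """Turn each CSV row into one blank node with one triple per column."""
    entities = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        # quote_plus turns "Company name" into "Company+name", as on the slide.
        pairs = [f':{quote_plus(column)} "{value}"' for column, value in row.items()]
        entities.append(f"_:{number} " + " ;\n    ".join(pairs) + " .")
    return entities


if __name__ == "__main__":
    print("\n".join(rows_to_turtle(CSV_TEXT)))
```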
LP-ETL - RDF data chunk size: 6 rows ~ 12 triples
LP-ETL - RDF data chunk size: 2 rows ~ 4 triples
LP-ETL - RDF data chunk size: 1 row ~ 2 triples
LP-ETL - RDF data chunk: bad chunking
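The chunk sizes on the preceding slides can be reproduced with a short Python sketch. Both function names are hypothetical. Splitting by triple count is one plausible reading of "bad chunking": a chunk boundary that falls inside an entity separates its name from its ID. For simplicity the row-based variant groups all triples in memory first, which a real chunked component would avoid.

```python
from itertools import islice

ROWS = [
    ("My First Company", "00122344"), ("Unlimited Ltd.", "11334499"),
    ("ACME", "99778811"), ("Trading One", "19971375"),
    ("Trading Two", "99771133"), ("The Company", "00990099"),
]
# Two triples per row: 6 rows ~ 12 triples.
TRIPLES = [
    triple
    for number, (name, company_id) in enumerate(ROWS, start=1)
    for triple in (
        (f"_:{number}", ":Company+name", name),
        (f"_:{number}", ":ID", company_id),
    )
]


def chunk_by_rows(triples, rows_per_chunk):
    """Yield chunks in which every entity (row) stays whole."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)
    entities = iter(by_subject.values())
    while batch := list(islice(entities, rows_per_chunk)):
        yield [triple for entity in batch for triple in entity]


def chunk_by_triples(triples, triples_per_chunk):
    """Naive split by triple count; it may cut an entity in half."""
    remaining = iter(triples)
    while batch := list(islice(remaining, triples_per_chunk)):
        yield batch


if __name__ == "__main__":
    for size in (6, 2, 1):
        print(size, "rows per chunk:", [len(c) for c in chunk_by_rows(TRIPLES, size)])
    print("3 triples per chunk:", list(chunk_by_triples(TRIPLES, 3))[0])
```

With three triples per chunk, the first chunk contains the name of `_:2` but not its ID, so a transformation that needs both values of an entity would miss it.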
LP-ETL - Chunked components in a pipeline
Yellow = Chunked
Only N chunks in memory at one time
LP-ETL - Not everything can be solved by chunking
SELECT (COUNT(DISTINCT ?s) AS ?dsubjects)
WHERE { ?s ?p ?o }
...
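A small Python sketch shows why a global aggregate such as the distinct-subject count above does not survive chunking: running it per chunk and adding up the results counts a subject once for every chunk it appears in. The triples and the `ex:` names are made up for illustration, and only the visible part of the truncated query is modelled.

```python
def distinct_subjects(triples):
    """Equivalent of SELECT (COUNT(DISTINCT ?s) ...) over a set of triples."""
    return len({subject for subject, _, _ in triples})


# Hypothetical data: subject ex:a has triples in both chunks.
triples = [
    ("ex:a", "ex:p", "1"),
    ("ex:b", "ex:p", "2"),
    ("ex:a", "ex:q", "3"),
    ("ex:c", "ex:p", "4"),
]
chunks = [triples[:2], triples[2:]]

per_chunk_sum = sum(distinct_subjects(chunk) for chunk in chunks)
whole_dataset = distinct_subjects(triples)

if __name__ == "__main__":
    print("sum over chunks:", per_chunk_sum)   # ex:a is counted twice
    print("whole dataset:", whole_dataset)
```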
LinkedPipes ETL - Chunked components
Datasets: List of Czech business entity IDs
Datasets: Sample of Czech Business Registry
LP-ETL Documentation
LP-ETL Component Documentation
LP-ETL Pipeline fragments
LP-ETL Tutorials & How-Tos
LP-ETL NKOD Pipeline
LP-ETL Known instances
Advanced RDF data consumption using LP-ETL
LP-ETL and larger data
CONSTRUCT {
  ?point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  ?point a gml:Point ;
    gml:pos ?o .
}
LP-ETL - bad chunking ~ bad paging
_:1 :Company+name "My First Company" ;
    :ID "00122344" .
_:2 :Company+name "Unlimited Ltd." ;
    :ID "11334499" .
_:3 :Company+name "ACME" ;
    :ID "99778811" .
_:4 :Company+name "Trading One" ;
    :ID "19971375" .
_:5 :Company+name "Trading Two" ;
    :ID "99771133" .
_:6 :Company+name "The Company" ;
    :ID "00990099" .

CONSTRUCT {
  ?point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  ?point a gml:Point ;
    gml:pos ?o .
}
LIMIT 3
LP-ETL and larger data
SELECT ?point
WHERE {
  [] ruian:adresniBod ?point .
}

SELECT ?point
WHERE {
  [] ruian:adresniBod ?point .
}
LIMIT 100
OFFSET 10200
HttpException: 500 SPARQL Request Failed
Virtuoso 22023 Error SR353: Sorted TOP clause specifies more then 41000 rows to sort.
Only 40000 are allowed.
Either decrease the offset and/or row count or use a scrollable cursor
LP-ETL and Virtuoso Scrollable Cursor
SELECT ?point
WHERE {
  {
    SELECT ?point
    WHERE {
      [] ruian:adresniBod ?point .
    }
    ORDER BY ASC(?point)
  }
}
LIMIT 100
OFFSET 10200
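The wrapped query above lends itself to templating: the ordered inner SELECT stays fixed and only LIMIT and OFFSET change per page. The following Python sketch generates such page queries. The function names and the `total` parameter are hypothetical, and prefix declarations such as `ruian:` are left out exactly as on the slide, so they would have to be added before the queries are sent to an endpoint.

```python
INNER = """SELECT ?point
WHERE {
  [] ruian:adresniBod ?point .
}
ORDER BY ASC(?point)"""


def paged_query(inner_select: str, variable: str, limit: int, offset: int) -> str:
    """Wrap an ordered inner SELECT and apply LIMIT/OFFSET on the outer query."""
    indented = "\n".join("    " + line for line in inner_select.splitlines())
    return (
        f"SELECT ?{variable}\n"
        "WHERE {\n"
        "  {\n"
        f"{indented}\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}\n"
        f"OFFSET {offset}"
    )


def pages(inner_select: str, variable: str, page_size: int, total: int):
    """Yield one query per page until `total` rows are covered."""
    for offset in range(0, total, page_size):
        yield paged_query(inner_select, variable, page_size, offset)


if __name__ == "__main__":
    print(paged_query(INNER, "point", 100, 10200))
```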
LP-ETL and larger data
CSV
LP-ETL and larger data - 1 record
CONSTRUCT {
  my:Point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  my:Point a gml:Point ;
    gml:pos ?o .
}
LP-ETL and larger data - 2 records
CONSTRUCT {
  ?point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  ?point a gml:Point ;
    gml:pos ?o .
  VALUES ?point { my:Point1 my:Point2 }
}
LP-ETL and larger data - all records
CONSTRUCT {
  ?point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  ?point a gml:Point ;
    gml:pos ?o .
  ${VALUES}
}
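The `${VALUES}` placeholder can be imitated outside LP-ETL to see what the resulting queries look like. The sketch below fills the placeholder with a VALUES clause for one batch of IRIs at a time. How LP-ETL performs this substitution internally is not described here, so the helper names, the batch size and the example IRIs are all assumptions.

```python
from string import Template

QUERY = Template("""CONSTRUCT {
  ?point a gml:Point ;
    gml:pos ?o .
}
WHERE {
  ?point a gml:Point ;
    gml:pos ?o .
  ${VALUES}
}""")


def values_clause(variable: str, iris: list[str]) -> str:
    """Build a SPARQL VALUES clause binding `variable` to the given IRIs."""
    terms = " ".join(f"<{iri}>" for iri in iris)
    return f"VALUES ?{variable} {{ {terms} }}"


def batched_queries(iris: list[str], batch_size: int):
    """Yield one query per batch of IRIs, replacing ${VALUES} in the template."""
    for start in range(0, len(iris), batch_size):
        batch = iris[start:start + batch_size]
        yield QUERY.substitute(VALUES=values_clause("point", batch))


if __name__ == "__main__":
    # Hypothetical point IRIs, e.g. read from the CSV of a previous SELECT.
    points = [f"http://example.org/point/{n}" for n in range(1, 4)]
    for query in batched_queries(points, batch_size=2):
        print(query, end="\n\n")
```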
LP-ETL and larger data
CSV