Monthly OpenLineage TSC meeting
March 19, 2025
Recording of calls
Reminder: the meeting is recorded and archived on the wiki
https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
Roll Call
Julien Le Dem, Paweł Leszczyński, Kacper Muda
Mandy Chessell, Will Johnson, Zhenqiu Huang
Daniel Henneberger, Michael Robinson, Jens Pfau
Drew Banin, Ross Turk, Sheeri Cabral
James Campbell, Howard Yoo
Ryan Blue, Jakub Dardziński
Willy Lulciuc, Tomasz Nazarewicz
Zhamak Dehghani, Minkyu Park
Michael Collado, Benji Lampel
Maciej Obuchowski, Kengo Seki
Harel Shein, Damien Hawes
Communication
Agenda
Announcements
Warsaw Meetup @ Google in April
Date: April 3 (Thursday)
Time: 17:30-20:30 CEST
Place: Google, Rondo Daszyńskiego 2C, 00-843 Warsaw
Format: Hybrid (Zoom link to follow)
RSVP: required by March 31
https://www.meetup.com/warsaw-openlineage-meetup-group/events/305919584
OpenLineage @ move(data) 2025
Recent Releases
OpenLineage 1.29.0
Added
Changed
Fixed
OpenLineage 1.30.0
Added
Changed
Fixed
dbt structured logs
Massy BOURENNANI, SWE @ Datadog
Agenda
What is dbt?
dbt: data build tool
dbt DAG: jaffle shop example
dbt Artifacts
dbt artifact: manifest.json
dbt artifact: run_results.json
How are dbt OpenLineage events generated?
Consume run_results.json
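As a rough illustration of this approach, the sketch below reads the contents of a finished run_results.json and derives simplified lineage-style events from it. The sample payload is a heavily trimmed, hypothetical excerpt, and `events_from_run_results` and the reduced event dictionaries are assumptions made for illustration, not the integration's actual code.

```python
import json

# Hypothetical, heavily trimmed run_results.json content (illustrative only).
SAMPLE_RUN_RESULTS = json.loads("""
{
  "results": [
    {
      "unique_id": "model.jaffle_shop.stg_customers",
      "status": "success",
      "timing": [
        {"name": "compile", "started_at": "2025-03-18T15:21:58.100Z",
         "completed_at": "2025-03-18T15:21:58.200Z"},
        {"name": "execute", "started_at": "2025-03-18T15:21:58.200Z",
         "completed_at": "2025-03-18T15:21:59.500Z"}
      ]
    },
    {
      "unique_id": "model.jaffle_shop.customers",
      "status": "error",
      "timing": []
    }
  ]
}
""")


def events_from_run_results(run_results, job_namespace="dbt"):
    """Derive simplified START/COMPLETE/FAIL events from a finished run.

    The artifact only exists once dbt has finished, so every event here
    is produced after the fact.
    """
    events = []
    for result in run_results["results"]:
        # Index the timing entries by phase name ("compile", "execute").
        timing = {t["name"]: t for t in result.get("timing", [])}
        execute = timing.get("execute")
        job = {"namespace": job_namespace, "name": result["unique_id"]}
        if execute:
            events.append(
                {"eventType": "START", "eventTime": execute["started_at"], "job": job}
            )
        final_type = "COMPLETE" if result["status"] == "success" else "FAIL"
        events.append(
            {
                "eventType": final_type,
                "eventTime": execute["completed_at"] if execute else None,
                "job": job,
            }
        )
    return events


if __name__ == "__main__":
    for event in events_from_run_results(SAMPLE_RUN_RESULTS):
        print(event["eventType"], event["job"]["name"], event["eventTime"])
```

Because the input is an artifact written at the end of the run, even the START events in this sketch can only be emitted once the whole pipeline is done.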
Problems with consuming run_results.json
Problem I: High latency
The dbt pipeline must finish completely before the user sees the first OL event
Problem II: Lack of granularity in OL events
Only dbt model SQL queries are forwarded by the OpenLineage integration
Is there another way to generate dbt OpenLineage events?
Solution: Structured Logs
Structured Logs
Structured Logs: examples
Structured Logs: real time monitoring
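To illustrate the real-time idea, the sketch below consumes JSON log lines one at a time, as dbt writes them when run with `--log-format json`, and yields a lineage-style event as soon as a relevant line appears. The sample lines are simplified and hypothetical, the field layout (`info.name`, `data.node_info`) is assumed from dbt's JSON log format and may differ between dbt versions, and both `stream_events` and the mapping to event types are illustrative assumptions rather than the integration's code.

```python
import json
from typing import Iterable, Iterator

# Simplified, hypothetical log lines; real dbt logs carry many more fields.
SAMPLE_LOG_LINES = [
    '{"info": {"name": "NodeStart", "ts": "2025-03-18T15:21:58.200Z"},'
    ' "data": {"node_info": {"unique_id": "model.jaffle_shop.stg_customers"}}}',
    '{"info": {"name": "SQLQuery", "ts": "2025-03-18T15:21:58.300Z"},'
    ' "data": {"node_info": {"unique_id": "model.jaffle_shop.stg_customers"},'
    ' "sql": "create view stg_customers as select 1"}}',
    "not a json line",
    '{"info": {"name": "NodeFinished", "ts": "2025-03-18T15:21:59.500Z"},'
    ' "data": {"node_info": {"unique_id": "model.jaffle_shop.stg_customers",'
    ' "node_status": "success"}}}',
]

# Assumed mapping from dbt log event names to lineage event types.
EVENT_TYPES = {"NodeStart": "START", "SQLQuery": "RUNNING", "NodeFinished": "COMPLETE"}


def stream_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one simplified event per relevant log line, in arrival order."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip output that is not a JSON log record
        if not isinstance(record, dict):
            continue
        info = record.get("info", {})
        name = info.get("name")
        if name not in EVENT_TYPES:
            continue  # ignore log events that carry no lineage signal
        data = record.get("data", {})
        node_info = data.get("node_info", {})
        event_type = EVENT_TYPES[name]
        # Assumption: any final status other than "success" is a failure.
        if name == "NodeFinished" and node_info.get("node_status") != "success":
            event_type = "FAIL"
        yield {
            "eventType": event_type,
            "eventTime": info.get("ts"),
            "job": node_info.get("unique_id"),
            "sql": data.get("sql"),
        }


if __name__ == "__main__":
    for event in stream_events(SAMPLE_LOG_LINES):
        print(event["eventType"], event["job"], event["eventTime"])
```

Because `stream_events` is a generator, it could just as well be fed the standard output of a running dbt process line by line, so each event becomes available while the pipeline is still executing.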
Structured Logs vs. run_results.json
OpenLineage Events: run_results.json
OpenLineage Events: Structured Logs
In Datadog
Datadog Waterfall: run_results.json
Datadog Waterfall: Structured Logs
Datadog Waterfall: Structured Logs jaffle shop
Datadog Flame Graph: dbt jaffle shop with 2 threads
Benefits of Structured Logs
| | Structured Logs | run_results.json |
| Latency | Low (events are sent in real time) | High (events are sent after the pipeline finishes) |
| Granularity | High (all SQL queries are forwarded) | Low (a single SQL query is forwarded) |
| SQL Platform | Agnostic | Agnostic |
Next Steps
Thank you!
Apache Hive integration
Quick Introduction
Tomasz Nazarewicz, GetInData
What's around the corner?
OpenLineage Hive integration
Status: the integration is working; we are now merging it into the main OpenLineage repository
PR: #3555
Main code contributor: @jphalip at Google
Integration with OpenLineage project: @ddebowczyk92, @tnazarew
What was?
(…not that impressive)
What is?
{
  "eventTime": "2025-03-18T15:21:59.561Z",
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.0.0-SNAPSHOT/integration/hive",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
  "eventType": "COMPLETE",
  "run": {
    "runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
    "facets": {
      "processing_engine": {...},
      "hive_properties": {...}
    }
  },
  "job": {
    "namespace": "jobNamespace",
    "name": "jobName"
  },
  "inputs": [
    {
      "namespace": "sourceNamespace",
      "name": "sourceName",
      "facets": {
        "schema": {...},
        "symlinks": {...}
      }
    }
  ],
  "outputs": [
    {
      "namespace": "targetNamespace",
      "name": "targetName",
      "facets": {
        "schema": {...},
        "columnLineage": {...},
        "symlinks": {...}
      }
    }
  ]
}
(…much more impressive)
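A consumer might read such an event roughly as follows. This is only a sketch: the event mirrors the one on the slide, with the elided facet bodies replaced by empty objects so that the JSON parses, and `summarize_event` is a hypothetical helper rather than part of any OpenLineage client.

```python
import json

# Event from the slide; facet bodies that were elided there are left empty.
EVENT_JSON = """
{
  "eventTime": "2025-03-18T15:21:59.561Z",
  "eventType": "COMPLETE",
  "run": {
    "runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
    "facets": {"processing_engine": {}, "hive_properties": {}}
  },
  "job": {"namespace": "jobNamespace", "name": "jobName"},
  "inputs": [
    {"namespace": "sourceNamespace", "name": "sourceName",
     "facets": {"schema": {}, "symlinks": {}}}
  ],
  "outputs": [
    {"namespace": "targetNamespace", "name": "targetName",
     "facets": {"schema": {}, "columnLineage": {}, "symlinks": {}}}
  ]
}
"""


def dataset_id(dataset):
    """Build a readable identifier from a dataset's namespace and name."""
    return f'{dataset["namespace"]}.{dataset["name"]}'


def summarize_event(event):
    """Collect the job, run facets, dataset edges and output facets of an event."""
    inputs = event.get("inputs", [])
    outputs = event.get("outputs", [])
    return {
        "job": dataset_id(event["job"]),
        "run_facets": sorted(event["run"].get("facets", {})),
        # Every input is treated as feeding every output of the same run.
        "edges": [(dataset_id(i), dataset_id(o)) for i in inputs for o in outputs],
        "output_facets": {dataset_id(o): sorted(o.get("facets", {})) for o in outputs},
    }


if __name__ == "__main__":
    summary = summarize_event(json.loads(EVENT_JSON))
    for key, value in summary.items():
        print(key, value)
```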
What do we have?
Facets
"run": {
  "runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
  "facets": {
    "processing_engine": {...},
    "hive_properties": {...}
  }
},
"outputs": [
  {
    "namespace": "targetNamespace",
    "name": "targetName",
    "facets": {
      "schema": {...},
      "columnLineage": {...},
      "symlinks": {...}
    }
  }
]
What do we not have?
Running
DEMO
Demo using Hive + Dataproc + Console Transport
Thank you!
Open Discussion