1 of 53

Monthly OpenLineage TSC meeting

March 19, 2025

2 of 53

Recording of calls

Reminder:

The meeting is recorded and archived on the wiki

https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

3 of 53

Roll Call

TSC voting members:

  • Julien Le Dem
  • Paweł Leszczyński
  • Kacper Muda
  • Mandy Chessell
  • Will Johnson
  • Zhenqiu Huang
  • Daniel Henneberger
  • Michael Robinson
  • Jens Pfau
  • Drew Banin
  • Ross Turk
  • Sheeri Cabral
  • James Campbell
  • Howard Yoo
  • Ryan Blue
  • Jakub Dardziński
  • Willy Lulciuc
  • Tomasz Nazarewicz
  • Zhamak Dehghani
  • Minkyu Park
  • Michael Collado
  • Benji Lampel
  • Maciej Obuchowski
  • Kengo Seki
  • Harel Shein
  • Damien Hawes

4 of 53

Communication

5 of 53

Agenda

  • Announcements
    • Warsaw OpenLineage meetup
  • Recent releases
  • Presentations
    • dbt structured logs
      • Massy Bourennani, Datadog
    • Apache Hive integration
      • Tomasz Nazarewicz, GetInData
  • Open discussion

6 of 53

Announcements

7 of 53

Warsaw Meetup @ Google in April

Date: April 3 (Thursday)

Time: 17:30-20:30 CEST

Place: Google, Rondo Daszyńskiego 2C, 00-843 Warsaw

Format: Hybrid (Zoom link to follow)

RSVP: required by March 31

https://www.meetup.com/warsaw-openlineage-meetup-group/events/305919584

  • GCP Composer Migration to OpenLineage (Augusto Hidalgo, Google)
  • Adding OpenLineage support to Airflow operators (Kacper Muda, Astronomer)
  • Are they compatible? Standardized compatibility testing for OpenLineage (Tomasz Nazarewicz, GetInData)
  • Connecting OpenLineage within Observability Ecosystems (Maciej Obuchowski, Datadog)
  • Exposing metrics through Spark OpenLineage connector (Paweł Leszczyński, GetInData)

8 of 53

OpenLineage @ move(data) 2025

  • Virtual & free
  • March 20, 2025
  • Registration link

9 of 53

Recent Releases

10 of 53

OpenLineage 1.29.0

Added

  • Python: allow adding user-supplied tags facets from config #3471 @leogodin217
    User-supplied tags will allow the client to inject new tags or override tags provided by the integrations for jobs and runs.
  • Java: allow adding user-supplied tags facets from config #3493 @mobuchowski
    Enabled parsing tags from config in Java client and Spark conf.

Changed

  • Java: rename the async breaker timeout setting, which is not a real timeout. #3487 @pawel-big-lebowski
    Properly names the case where TaskQueueCircuitBreaker allows a configurable blocking time after submitting a callable.
  • Flink: enabled circuit breaker for Flink 2 integration. #3503 @pawel-big-lebowski
    Native Flink integration is now isolated within a circuit breaker call.

Fixed

  • Spark: use all of the underlying classloaders to find META-INF/services resources. #3483 @ddebowczyk92
    ServiceLoader should not fail to load OpenLineageExtensionProvider implementations in certain configurations.
  • Flink: handle default null job manager address. #3486 @MarquisC
    Null Flink Job Manager address will default to localhost.
  • dbt: handle tests on sources for structured logs option. #3488 @MassyB
    Handle case for tests on sources which don't have the attached_node defined in the manifest.

11 of 53

OpenLineage 1.30.0

Added

  • Flink 2: added support for CheckpointFacet #3531 @pawel-big-lebowski
    Similar to Flink 1 integration, Flink 2 integration will emit CheckpointFacet.
  • Flink 2: include SQL comments in OL events #3528 @pawel-big-lebowski
    Table and field comments are available within generated OL events.

Changed

  • Java: remove deprecated configs: 'disabledFacets' and 'timeout'. #3522 @pawel-big-lebowski
    Configs have been replaced with: 'facets.<facet name>.disabled=true' and 'timeoutInMillis'.

Fixed

  • Spark: enable Iceberg metrics reporting for SparkSessionCatalog. #3538 @sakjung
    Fixes support for metrics in Iceberg SparkSessionCatalog.
  • Spark: inject metrics reporter to Iceberg's RESTCatalog. #3515 @pawel-big-lebowski
    Fixes support for metrics in Iceberg RESTCatalog.
  • dbt: fix race condition on dbt log file #3535 @MassyB
    Fixes race condition that was happening when using structured logs output.
  • dbt: fixes handling of skipped dbt nodes #3545 @MassyB
    Skipped nodes no longer cause exceptions.
  • Spark: fix record count stats mixed with bytes stats #3550 @pawel-big-lebowski
    InputStatistics facet for Iceberg datasets no longer produces incorrect stats.
  • Spark: catch input duplicates for SubqueryAlias. #3548 @pawel-big-lebowski
    Subquery alias no longer duplicates the inputs.
  • Spark: catch Java 17 add opens exception. #3552 @mobuchowski
    Fixes catching InaccessibleMethodException in Java 17 within SparkExtensionVisitor.

12 of 53

dbt structured logs

Massy BOURENNANI, SWE @ Datadog

13 of 53

Agenda

  • What is dbt?
  • dbt artifacts
  • How are dbt OpenLineage events generated?
  • Problems with consuming run_results.json
  • Solution: Structured Logs
  • Comparison
  • Benefits of Structured Logs
  • Next steps
  • Links

14 of 53

What is dbt?

15 of 53

dbt: data build tool

16 of 53

dbt DAG: jaffle shop example

17 of 53

dbt Artifacts

18 of 53

dbt artifact: manifest.json

19 of 53

dbt artifact: run_results.json

20 of 53

How are dbt OpenLineage events generated?

21 of 53

Consume run_results.json

22 of 53

Problems with consuming run_results.json

23 of 53

Problem I: High latency

The dbt pipeline needs to completely finish before the first OL event is ever seen by the user
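
To make the latency point concrete, here is a minimal Python sketch of artifact-based consumption. The sample dictionary is hand-written and only mimics a few fields of run_results.json (results, unique_id, status, execution_time), and the function name summarize_results is hypothetical. Because the artifact is only complete once the dbt invocation ends, nothing like this can run until the whole pipeline has finished.

```python
import json

# Hand-written sample that mimics a few fields of dbt's run_results.json.
# A real artifact carries many more fields.
SAMPLE_RUN_RESULTS = json.loads("""
{
  "results": [
    {"unique_id": "model.jaffle_shop.stg_orders", "status": "success", "execution_time": 0.42},
    {"unique_id": "model.jaffle_shop.orders", "status": "error", "execution_time": 1.30}
  ],
  "elapsed_time": 2.05
}
""")


def summarize_results(run_results):
    """Return (unique_id, status, execution_time) for every node in the artifact."""
    return [
        (r["unique_id"], r["status"], r["execution_time"])
        for r in run_results.get("results", [])
    ]


for unique_id, status, seconds in summarize_results(SAMPLE_RUN_RESULTS):
    print(f"{unique_id}: {status} in {seconds:.2f}s")
```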

24 of 53

Problem II: Lack of granularity in OL events

Only the dbt model SQL queries are forwarded by the OpenLineage integration

25 of 53

Problem II: Lack of granularity in OL events

26 of 53

Is there another way to generate dbt OpenLineage events?

27 of 53

Solution: Structured Logs

28 of 53

Structured Logs

  • Structured Logs are structured dbt events
  • They are generated while dbt runs
  • They are written in real time to stdout and to the log file
  • They are sent at different times:
    • When a dbt model starts
    • When a SQL query is executed
    • When a dbt test starts
    • When a dbt model finishes
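
The bullets above can be sketched as a small line-by-line reader. This is an illustration, not the integration's actual code: the sample lines are hand-written to resemble what dbt prints with `--log-format json`, the event names (NodeStart, SQLQuery, NodeFinished) and field paths are assumptions that may differ between dbt versions, and parse_events is a hypothetical helper.

```python
import json

# Hand-written lines shaped like dbt's JSON log output (an assumption:
# real events carry more fields, and event names may vary by dbt version).
SAMPLE_LOG_LINES = [
    '{"info": {"name": "NodeStart"}, "data": {"node_info": {"unique_id": "model.jaffle_shop.orders"}}}',
    '{"info": {"name": "SQLQuery"}, "data": {"node_info": {"unique_id": "model.jaffle_shop.orders"}, "sql": "create table orders as select 1"}}',
    "plain text that is not JSON",
    '{"info": {"name": "NodeFinished"}, "data": {"node_info": {"unique_id": "model.jaffle_shop.orders"}}}',
]


def parse_events(lines):
    """Yield (event_name, unique_id) for each line that parses as a JSON object."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # the stream may be mixed with non-JSON text
        if not isinstance(event, dict):
            continue
        name = event.get("info", {}).get("name")
        node_info = event.get("data", {}).get("node_info") or {}
        yield name, node_info.get("unique_id")


for name, unique_id in parse_events(SAMPLE_LOG_LINES):
    print(name, unique_id)
```

Because each line is handled as soon as it is read, a consumer built this way could react to a model start or a SQL query while dbt is still running, instead of waiting for the end of the pipeline.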

29 of 53

Structured Logs: examples

30 of 53

Structured Logs: real time monitoring

31 of 53

Structured Logs

VS

run_results.json

32 of 53

OpenLineage Events: run_results.json

33 of 53

OpenLineage Events: Structured Logs

34 of 53

In Datadog

35 of 53

Datadog Waterfall: run_results.json

36 of 53

Datadog Waterfall: Structured Logs

37 of 53

Datadog Waterfall: Structured Logs jaffle shop

38 of 53

Datadog Flame Graph: dbt jaffle shop with 2 threads

39 of 53

Benefits of Structured Logs

              Structured Logs                        run_results.json
Latency       Low (events are sent in real time)     High (events are sent after pipeline finishes)
Granularity   High (all SQL queries are forwarded)   Low (a single SQL query is forwarded)
SQL Platform  Agnostic                               Agnostic

40 of 53

Next Steps

  • To have even more visibility into the data platform, OpenLineage could include:
    • query_id
    • bytes scanned
    • queue time/compile time/execution time
  • Standardization for dbt. Does it make sense to have both ways (artifacts and structured logs)?
  • OpenLineage could be integrated into dbt
    • dbt users would benefit from OL out of the box

41 of 53

Links

  • Structured logs in dbt-core
  • Main dbt structured logs PR in OpenLineage repository

42 of 53

Thank you!

43 of 53

Apache Hive integration

Quick Introduction

Tomasz Nazarewicz, GetInData

44 of 53

What's around the corner?

OpenLineage Hive integration

Status: the integration is working; we are currently merging it into the main OpenLineage repository

PR: #3555

Main code contributor: @jphalip at Google

Integration with OpenLineage project: @ddebowczyk92, @tnazarew

45 of 53

What was?

  • Before this integration, events generated by Hive looked something like this:

(…not that impressive)

46 of 53

What is?

  • Now they look more like this:

{
  "eventTime": "2025-03-18T15:21:59.561Z",
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.0.0-SNAPSHOT/integration/hive",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
  "eventType": "COMPLETE",
  "run": {
    "runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
    "facets": {
      "processing_engine": {...},
      "hive_properties": {...}
    }
  },
  "job": {
    "namespace": "jobNamespace",
    "name": "jobName"
  },
  "inputs": [
    {
      "namespace": "sourceNamespace",
      "name": "sourceName",
      "facets": {
        "schema": {...},
        "symlinks": {...}
      }
    }
  ],
  "outputs": [
    {
      "namespace": "targetNamespace",
      "name": "targetName",
      "facets": {
        "schema": {...},
        "columnLineage": {...},
        "symlinks": {...}
      }
    }
  ]
}

(…much more impressive)
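
A quick way to check what such an event carries is to list its facet keys. The Python sketch below assumes an event shaped like the one on this slide; the elided facet bodies are replaced with empty objects so the sample parses, and facet_names is a hypothetical helper rather than part of any OpenLineage client.

```python
import json

# Sample shaped like the event on the slide. The elided "{...}" facet bodies
# are replaced with empty objects so that the text is valid JSON.
SAMPLE_EVENT = json.loads("""
{
  "eventType": "COMPLETE",
  "run": {"runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
          "facets": {"processing_engine": {}, "hive_properties": {}}},
  "job": {"namespace": "jobNamespace", "name": "jobName"},
  "inputs": [{"namespace": "sourceNamespace", "name": "sourceName",
              "facets": {"schema": {}, "symlinks": {}}}],
  "outputs": [{"namespace": "targetNamespace", "name": "targetName",
               "facets": {"schema": {}, "columnLineage": {}, "symlinks": {}}}]
}
""")


def facet_names(event):
    """Map the run and each dataset to the sorted facet keys it carries."""
    found = {"run": sorted(event.get("run", {}).get("facets", {}))}
    for side in ("inputs", "outputs"):
        for dataset in event.get(side, []):
            key = f'{side}:{dataset["namespace"]}/{dataset["name"]}'
            found[key] = sorted(dataset.get("facets", {}))
    return found


for part, names in facet_names(SAMPLE_EVENT).items():
    print(part, names)
```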

47 of 53

What do we have?

Facets

  • ProcessingEngineRunFacet
  • HivePropertiesFacet
  • SchemaDatasetFacet
  • SymlinksDatasetFacet
  • ColumnLineageDatasetFacet
    • with Transformation Types

"run": {
  "runId": "f7835a55-aa11-44ce-85aa-024aac55c7c5",
  "facets": {
    "processing_engine": {...},
    "hive_properties": {...}
  }
},
"outputs": [
  {
    "namespace": "targetNamespace",
    "name": "targetName",
    "facets": {
      "schema": {...},
      "columnLineage": {...},
      "symlinks": {...}
    }
  }
]

48 of 53

What do we not have?

  • Limited number of operation types handled
    • Only QUERY and CREATETABLE_AS_SELECT
    • Simple inserts or table creations omitted
  • Only POST_EXEC_HOOK and ON_FAILURE_HOOK
    • no START or RUNNING events, only COMPLETE and FAIL
  • …there could always be more facets…

49 of 53

Running

  1. Put the openlineage-hive jar in the Hive libraries location on your cluster
  2. Start Hive
    • Set property hive.exec.post.hooks=io.openlineage.hive.hooks.HiveOpenLineageHook
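
As a rough illustration of step 2, the Python sketch below assembles a Hive CLI invocation that sets the hook property through `--hiveconf`. The post-execution hook property and class name come from the slide; registering the same class under `hive.exec.failure.hooks` is an assumption made here for illustration, and hive_command is a hypothetical helper. The same properties could equally be set in hive-site.xml.

```python
import shlex

# Hook class name as given on the slide.
HOOK_CLASS = "io.openlineage.hive.hooks.HiveOpenLineageHook"


def hive_command(query, with_failure_hook=False):
    """Build the argument list for a Hive CLI call with the OpenLineage hook set."""
    props = {"hive.exec.post.hooks": HOOK_CLASS}
    if with_failure_hook:
        # Assumption: the same class is also registered as the failure hook.
        props["hive.exec.failure.hooks"] = HOOK_CLASS
    args = ["hive"]
    for key, value in props.items():
        args += ["--hiveconf", f"{key}={value}"]
    args += ["-e", query]
    return args


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(hive_command("SELECT 1")))
```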

50 of 53

DEMO

Demo using Hive + Dataproc + Console Transport

51 of 53

Thank you!

52 of 53

Open Discussion

53 of 53

Open Discussion

  • Daniel Rolls: https://www.bearingnode.com/post/openlineage-vendor-integration-status
    • compliance status across the market
