1 of 35

GA4 + BigQuery + Dataform

Fast way to start

2 of 35

Plan

Dataform and the GA4 dataset are huge topics, but I’ll try to cover:

  • Quick intro to Dataform
  • What are the challenges?
  • dataform-ga4-sessions package to simplify the process

This is the VERY FIRST MeasureCamp in Malmö, and my first one too, so I think we are both a little bit nervous.

3 of 35

The best GA4 UI feature is the free data export

BFF

4 of 35

We have GA4 raw data, so what’s next?

5 of 35

Dataform (the answer) is already here in BigQuery

6 of 35

Why Dataform?

We are already GCP Engineers (Server-Side GTM, Cloud Run, App Engine, BigQuery, Firestore)

We already use JavaScript for tracking

Full stack tracking engineers from data collection to data preparation (everything except profit)

7 of 35

But first you need to:

  1. Do the initial Dataform setup
  2. Understand Dataform basics
  3. Understand how to work with the GA4 export schema
  4. Set up a daily schedule

And then do something cool with data

8 of 35

The first-time setup can be tricky. You need to:

  • Enable APIs
  • Create a GitHub repo and tokens
  • Connect Dataform to GitHub
  • Grant permissions
  • Create a repo / workspace

https://www.youtube.com/watch?v=zcmaBPnOxl0

https://artemkorneevga.github.io/dataform-ga4-sessions/guides/installation

9 of 35

Dataform Basics

  • SQLX: configuration block + SQL
  • Incremental and non-incremental context
  • ref(): manage dependencies
  • JavaScript helpers
  • Data assertions

https://cloud.google.com/dataform/docs/quickstart-create-workflow

https://artemkorneevga.github.io/dataform-ga4-sessions/guides/dataform

10 of 35

SQLX example (super short intro)

REPORT.SQLX

config {
  type: "incremental",
  uniqueKey: ["session_id"]
}

SELECT
  session_id,
  user_pseudo_id,
  date
FROM ${ref("sessions")}
${when(incremental(), `WHERE date = date_sub(current_date(), interval 1 day)`)}

11 of 35

How to work with GA4 raw data

  • SQL
    • unnest / arrays
  • Extract sessions from raw data (not as simple as it may look)
    • Session ID?
    • Source / medium
    • Channel grouping
    • Last non-direct attribution
    • Landing page data
  • Events
    • Event ID?
    • Standard events
    • Ecommerce data

I’ll show more examples later…
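As a first taste of the unnesting involved, here is a minimal sketch that pulls a session ID and the page location out of the raw events (the table name is a placeholder, and the real session extraction also handles attribution, channel grouping, etc.):

```sql
-- Minimal sketch, not production logic: extract a session id and page
-- location from raw GA4 events. The table name is a placeholder.
SELECT
  -- ga_session_id is only unique per user, so concatenate with user_pseudo_id
  CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
  ) AS session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `project.analytics_XXXXXX.events_*`
WHERE event_name = 'session_start'
```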

12 of 35

Same same, but different

Different different, but same

13 of 35

dbt-ga4

14 of 35

dataform-ga4-sessions

sorry guys, some self-promotion

15 of 35

The package main concepts

Dataform’s recommended project structure: staging / reporting

  • The package provides models for the staging tables: a session table and an event table for each needed event (layer 1)
  • All other tables (dashboards etc.) should use the staging tables, not the raw data
  • It’s a free and open-source project: use it in production or for learning; you can fork it and, of course, contribute

16 of 35

Session basic example (show me the code)

ga4.js

const ga4 = require("dataform-ga4-sessions");

// Define your config
const config = {
  dataset: "analytics_XXXXXX",
  incrementalTableName: "events_XXXXXX",
};

// Declare GA4 source tables
ga4.declareSources(config);

// Create sessions object
const sessions = new ga4.Session(config);

// Publish session table
sessions.publish();

17 of 35

Result

Generates SQL:

After execution you have a table:

18 of 35

Standard cases

19 of 35

JavaScript helpers

  • To unnest event_params / user properties
  • To get query params from page location
  • Get values from item-scoped dimensions
  • Add more columns
  • Apply presets

20 of 35

JavaScript helpers (example)

Usage:

sessions.addEventParams([
  { name: "page_title", type: "string" },
  { name: "custom1", type: "coalesce" },
]);

Result:

(SELECT ep.value.string_value
 FROM UNNEST(event_params) ep
 WHERE ep.key = 'page_title'
 LIMIT 1) AS page_title,

(SELECT COALESCE(
    ep.value.string_value,
    SAFE_CAST(ep.value.int_value AS STRING),
    SAFE_CAST(ep.value.double_value AS STRING),
    SAFE_CAST(ep.value.float_value AS STRING))
 FROM UNNEST(event_params) ep
 WHERE ep.key = 'custom1'
 LIMIT 1) AS custom1

21 of 35

Source / medium

Source and medium from event_params alone are not enough: there are also all sorts of click IDs to handle.

sessions.sourceMediumRules = [
  {
    columns: ["gclid_url"],
    conditionType: "NOT_NULL",
    conditionValue: "",
    value: {
      source: "'google'",
      medium: "'cpc'",
      campaign: "campaign",
    },
  },
  ...sessions.sourceMediumRules,
];
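Roughly speaking, each rule compiles into one branch of a CASE expression in the generated SQL; a sketch of the shape (not the package’s exact output):

```sql
-- Sketch only: the shape of the SQL a rule like the one above produces.
CASE
  WHEN gclid_url IS NOT NULL THEN 'google'
  -- ...branches for the remaining rules follow here
END AS source,
CASE
  WHEN gclid_url IS NOT NULL THEN 'cpc'
  -- ...
END AS medium
```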

22 of 35

Default Channel Grouping

Stays close to Google’s definitions, thanks again to dbt-ga4 (about 30 rules):

https://support.google.com/analytics/answer/9756891?hl=en

And source category type based on referrer (about 900 rules):

https://support.google.com/analytics/answer/9756891?hl=en

[Screenshot: source category schema]

23 of 35

Attributions

Last-click and last non-direct click attribution.

By default the lookback window is 30 days, but you can change it:

sessions.LastNonDirectLookBackWindow = 90;
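In BigQuery, last non-direct attribution is typically a `LAST_VALUE ... IGNORE NULLS` window over the lookback period; a sketch under assumed table and column names (not the package’s exact SQL):

```sql
-- Sketch of the last non-direct pattern; table and column names are
-- assumptions, not the package's generated SQL.
SELECT
  session_id,
  LAST_VALUE(IF(source = '(direct)', NULL, source) IGNORE NULLS) OVER (
    PARTITION BY user_pseudo_id
    ORDER BY UNIX_DATE(date)  -- RANGE frames need an INT64 ordering key
    RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
  ) AS last_non_direct_source
FROM sessions
```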

24 of 35

Session id

We can’t use ga_session_id on its own, but we can CONCAT it with user_pseudo_id.

Even that doesn’t guarantee that session_id will be unique.

You can provide your own SQL for session_id:

sessions.getSqlUniqueId = () => {
  return `FARM_FINGERPRINT(CONCAT((select value.int_value from unnest(event_params) where key = 'ga_session_id'), user_pseudo_id)) as session_id`;
};

25 of 35

Sessions Assertions

Usage:

sessions.publishAssertions();

Result:

Sessions Timeliness

Sessions Completeness

Sessions Validity
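Under the hood a Dataform assertion is just a query that must return zero rows; a hand-written timeliness check might look like this (a sketch, not the package’s generated assertion):

```sqlx
-- Hand-rolled assertion sketch (not the package's generated SQL).
-- An assertion fails when its query returns rows, so this one fails
-- if yesterday produced no sessions at all.
config { type: "assertion" }

SELECT 'no sessions for yesterday' AS failure
FROM (
  SELECT COUNT(*) AS n
  FROM ${ref("sessions")}
  WHERE date = date_sub(current_date(), interval 1 day)
)
WHERE n = 0
```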

26 of 35

Recommended and auto events

The Event and Session classes extend the same base class, so you can use the same helper methods.

For recommended and automatically collected events you can use the EventFactory:

let ef = new ga4.EventFactory(config);
let addToCart = ef.createAddToCart();
addToCart.publish();

For the full list of supported events (40 in total):

https://developers.google.com/analytics/devguides/collection/ga4/reference/events?client_type=gtm

https://support.google.com/analytics/answer/9234069?hl=en

27 of 35

Purchase value

The value can arrive as int_value, float_value, or double_value, so check all of those fields:

COALESCE(ep.value.float_value, SAFE_CAST(ep.value.int_value AS FLOAT64), ep.value.double_value)

Or, if you are using the package for a custom event, set the coalesce_float type like this:

event.addEventParams([
  { name: "value", type: "coalesce_float" }
]);

28 of 35

Event ID

By default the event ID is derived from event_timestamp, event_name, user_pseudo_id and engagement_time_msec.

It’s not guaranteed to be unique, so it’s better to collect an event timestamp or event ID in GTM.

Or you can define your own SQL:

pageView.getSqlUniqueId = () => {
  return `FARM_FINGERPRINT(CONCAT(event_timestamp, event_name, user_pseudo_id, ifnull((select ep.value.int_value from unnest(event_params) as ep where ep.key = 'engagement_time_msec'), 0))) as event_id`;
};

29 of 35

Scheduler / daily updates

A two-step process:

Release Configuration: compiles the code; provide variables if needed.

Workflow Configuration: executes a release on a cron schedule.

The GA4 export time is unknown (even for GA4 360).

https://artemkorneevga.github.io/dataform-ga4-sessions/guides/scheduling-daily

Update the config:

const config = {
  dataset: "analytics_XXXXXX",
  incrementalTableName: "events_*",
  incrementalTableEventStepWhere:
    "_TABLE_SUFFIX between format_date('%Y%m%d', date_sub(current_date(), interval 3 day)) and format_date('%Y%m%d', date_sub(current_date(), interval 1 day)) and contains_substr(_TABLE_SUFFIX, 'intraday') is false",
  nonIncrementalTableName: "events_*",
  nonIncrementalTableEventStepWhere:
    "contains_substr(_TABLE_SUFFIX, 'intraday') is false",
};

30 of 35

Scheduler / event based

Steps:

  • Subscribe to GA4 table-creation events
  • Get the table name
  • Create a Release Configuration via the API and pass the table name as a variable
  • Execute the release configuration via the API

The GitHub export can fail, so you need retries: use Cloud Tasks instead of Pub/Sub (3 retries are usually enough).

31 of 35

Deploy Dataform + Scheduling + Alerts + Anomalies

  • Terraform with modules for separate steps (not all clients need all steps)
  • GitHub Actions to run terraform commands
  • Separate Cloud Bucket for Terraform State
  • Configuration file / secrets for each client

32 of 35

I have sessions and events so what?

Build dashboards

Send a Pub/Sub event on success / failure

Use BigQuery ML (prepare tables upfront, so there are no joins when you create the model)

Try multi-touch attribution models (the open-source ChannelAttribution package by David Loris)

Export to Parquet

Send offline conversions to ad platforms

More…

33 of 35

Lessons learned

Unit testing (provide input and expected output tables, then run the actions): it doesn’t support incremental tables.
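For reference, a Dataform unit test fakes a dataset’s inputs and compares the output row for row; a sketch with assumed dataset and column names:

```sqlx
-- Unit test sketch; "sessions", "events" and the columns are assumptions.
-- Dataform swaps the "events" input for the faked rows, runs the
-- "sessions" query, and compares the result with the expected SELECT.
config {
  type: "test",
  dataset: "sessions"
}

input "events" {
  SELECT 'u1' AS user_pseudo_id, 'session_start' AS event_name
}

SELECT 'u1' AS user_pseudo_id, 1 AS session_count
```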

Documentation: JSDoc + Docusaurus + Endless time

Open questions: multiple paid channels within one session; registration and logout sessions.

34 of 35

Artem Korneev

OMMAX / Digital Consultancy / München

Analytics Developer / GCP Engineer

Please connect with me on LinkedIn using the QR code:

linkedin.com/in/artem-korneev or email artem.korneev@ommax.de

35 of 35

Links