GA4 + BigQuery + Dataform
Fast way to start
Plan
Dataform and the GA4 dataset are huge topics, but I’ll try:
This is the VERY FIRST MeasureCamp in Malmö, and the first in my life, so I think we are both a little bit nervous.
The best GA4 UI feature is the free data export
BFF
We have GA4 raw data, so what’s next?
Dataform (the answer) is already here in BigQuery
Why Dataform?
We are already GCP Engineers (Server-Side GTM, Cloud Run, App Engine, BigQuery, Firestore)
We already use JavaScript for tracking
Full-stack tracking engineers: from data collection to data preparation (everything except profit)
But first you need to:
And then do something cool with data
The first-time setup can be tricky
You need to enable APIs
Create GitHub repo and tokens
Connect Dataform / GitHub
Grant permissions
Create Repo / Workspace
https://www.youtube.com/watch?v=zcmaBPnOxl0
https://artemkorneevga.github.io/dataform-ga4-sessions/guides/installation
Dataform Basics
SQLX: configuration block + sql
Incremental and non-incremental context
REF: manage dependencies
JavaScript helpers
Data Assertions
https://cloud.google.com/dataform/docs/quickstart-create-workflow
https://artemkorneevga.github.io/dataform-ga4-sessions/guides/dataform
SQLX example (super short intro)
REPORT.SQLX
config {
type: "incremental",
uniqueKey: ["session_id"]
}
SELECT
session_id,
user_pseudo_id,
date
FROM ${ref("sessions")}
${ when(incremental(), `WHERE date = date_sub(current_date(), interval 1 day)`)}
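A quick mental model of the `when()` helper, as a sketch (not Dataform internals): it emits the SQL fragment only when the condition is true, so the WHERE clause appears in incremental runs and disappears on a full refresh.

```javascript
// Sketch of how Dataform's when() helper behaves (not the real
// implementation): the SQL fragment is returned only when the condition
// holds, otherwise an empty string is compiled into the query.
const when = (condition, sql) => (condition ? sql : "");

const clause = (isIncremental) =>
  when(isIncremental, "WHERE date = date_sub(current_date(), interval 1 day)");

console.log(clause(true));  // incremental run: only yesterday is processed
console.log(clause(false)); // full refresh: empty string, whole table rebuilt
```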
How to work with GA4 raw data
I’ll show examples later…
Same same, but different
Different different, but same
dbt-ga4
dataform-ga4-sessions
(sorry, a bit of self-promotion)
The package main concepts
Dataform recommended project structure: staging / reporting
Session basic example (show me the code)
ga4.js
const ga4 = require("dataform-ga4-sessions");
// Define your config
const config = {
dataset: "analytics_XXXXXX",
incrementalTableName: "events_XXXXXX",
};
// Declare GA4 source tables
ga4.declareSources(config);
// Create sessions object
const sessions = new ga4.Session(config);
// Publish session table
sessions.publish();
Result
Generates SQL:
After execution you have a table:
Standard cases
JavaScript helpers
JavaScript helpers (example)
Usage:
sessions.addEventParams([
{ name: "page_title", type: "string" },
{ name: "custom1", type: "coalesce" },
])
Result:
(SELECT ep.value.string_value FROM UNNEST(event_params) ep WHERE ep.key = 'page_title' LIMIT 1) AS page_title,
(SELECT COALESCE(ep.value.string_value, SAFE_CAST(ep.value.int_value AS STRING), SAFE_CAST(ep.value.double_value AS STRING), SAFE_CAST(ep.value.float_value AS STRING)) FROM UNNEST(event_params) ep WHERE ep.key = 'custom1' LIMIT 1) AS custom1
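The generated subqueries follow a simple pattern. Here is a hypothetical sketch of a helper producing them; the real implementation inside dataform-ga4-sessions may differ:

```javascript
// Hypothetical sketch of a helper that builds the event_params subqueries
// shown above: "string" reads string_value directly, "coalesce" tries every
// value field, cast to STRING.
function eventParamSql({ name, type }) {
  const valueExpr =
    type === "string"
      ? "ep.value.string_value"
      : "COALESCE(ep.value.string_value, SAFE_CAST(ep.value.int_value AS STRING), SAFE_CAST(ep.value.double_value AS STRING), SAFE_CAST(ep.value.float_value AS STRING))";
  return `(SELECT ${valueExpr} FROM UNNEST(event_params) ep WHERE ep.key = '${name}' LIMIT 1) AS ${name}`;
}

console.log(eventParamSql({ name: "page_title", type: "string" }));
```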
Source / medium
In the BigQuery export, source and medium from event_params are not enough: all sorts of click IDs (gclid and friends) need their own rules.
sessions.sourceMediumRules = [
{
columns: ["gclid_url"],
conditionType: "NOT_NULL",
conditionValue: "",
value: {
source: "'google'",
medium: "'cpc'",
campaign: "campaign",
},
},
...sessions.sourceMediumRules,
];
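A hedged sketch of how such a rule list could be evaluated (the real logic lives in the package): rules run in order and the first match wins, which is why the custom rule is prepended to the defaults above.

```javascript
// Sketch of rule evaluation, assumed semantics: "NOT_NULL" matches when the
// listed columns have a value (e.g. a gclid captured from the page URL).
function applyRules(rules, row) {
  for (const rule of rules) {
    const matched =
      rule.conditionType === "NOT_NULL" &&
      rule.columns.every((col) => row[col] != null);
    if (matched) return rule.value;
  }
  // Fallback when nothing matches (an assumed default, not the package's).
  return { source: "'(direct)'", medium: "'(none)'" };
}

const rules = [
  {
    columns: ["gclid_url"],
    conditionType: "NOT_NULL",
    conditionValue: "",
    value: { source: "'google'", medium: "'cpc'", campaign: "campaign" },
  },
];

console.log(applyRules(rules, { gclid_url: "some-click-id" }).medium); // 'cpc'
console.log(applyRules(rules, {}).medium); // '(none)'
```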
Default Channel Grouping
Stay close to Google’s definitions; thanks again to dbt-ga4 (about 30 rules)
https://support.google.com/analytics/answer/9756891?hl=en
And a source category lookup based on the referrer (about 900 entries):
https://support.google.com/analytics/answer/9756891?hl=en
Screenshot with SCHEMA
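A deliberately simplified sketch of a few of those ~30 rules; the real grouping also uses the ~900-entry source category list to split Paid Search, Paid Social and so on, which is omitted here.

```javascript
// Simplified subset of GA4's default channel rules (illustration only).
// The real definitions combine a paid-medium regex with source categories.
function channelGroup({ source, medium }) {
  if (source === "(direct)" && (medium === "(none)" || medium === "(not set)"))
    return "Direct";
  // GA4 treats any medium matching this regex as paid traffic.
  if (/^(.*cp.*|ppc|retargeting|paid.*)$/.test(medium)) return "Paid Other";
  if (medium === "organic") return "Organic Search";
  if (medium === "referral") return "Referral";
  return "Unassigned";
}

console.log(channelGroup({ source: "(direct)", medium: "(none)" })); // Direct
console.log(channelGroup({ source: "google", medium: "cpc" })); // Paid Other
```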
Attributions
Last click and last non-direct click attribution.
The default lookback window is 30 days, but we can change it:
sessions.LastNonDirectLookBackWindow = 90;
Session id
We can’t use ga_session_id alone, but we can CONCAT it with user_pseudo_id.
Even that doesn’t guarantee that session_id will be unique.
You can provide your own SQL for session_id:
sessions.getSqlUniqueId = () => {
return `FARM_FINGERPRINT(CONCAT((select value.int_value from unnest(event_params) where key = 'ga_session_id'), user_pseudo_id)) as session_id`;
};
Sessions Assertions
Usage:
sessions.publishAssertions();
Result:
Sessions Timeliness
Sessions Completeness
Sessions Validity
Recommended and auto events
The Event and Session classes extend the same base class, so you can use the same helper methods
For recommended and automatically collected events, use the EventFactory:
let ef = new ga4.EventFactory(config);
let addToCart = ef.createAddToCart();
addToCart.publish();
For the full list of supported events (40 in total):
https://developers.google.com/analytics/devguides/collection/ga4/reference/events?client_type=gtm
Purchase value
The value can arrive as int_value, float_value or double_value, so check all of them:
COALESCE(ep.value.float_value, SAFE_CAST(ep.value.int_value AS FLOAT64), ep.value.double_value)
Or, if you are using the package for a custom event, set the coalesce_float type like this:
event.addEventParams([
{ name: "value", type: "coalesce_float" }
])
Event Id
By default, event_id is built from event_timestamp, event_name, user_pseudo_id and engagement_time_msec
Even then uniqueness isn’t guaranteed, so it’s better to collect an event timestamp or event_id in GTM
Or you can define your own SQL:
pageView.getSqlUniqueId = () => {
return `FARM_FINGERPRINT(CONCAT(event_timestamp, event_name, user_pseudo_id, ifnull((select ep.value.int_value from unnest(event_params) as ep where ep.key = 'engagement_time_msec' ),0))) as event_id`;
};
Scheduler / daily updates
A 2-step process:
Release Configuration: compiles the code (provide variables if needed)
Workflow Configuration: executes the release on a cron schedule
GA4 export time is unknown (even for 360)
https://artemkorneevga.github.io/dataform-ga4-sessions/guides/scheduling-daily
Update actions:
const config = {
dataset: "analytics_XXXXXX",
incrementalTableName: "events_*",
incrementalTableEventStepWhere:
"_TABLE_SUFFIX between format_date('%Y%m%d',date_sub(current_date(), interval 3 day)) and format_date('%Y%m%d',date_sub(current_date(), interval 1 day)) and contains_substr(_TABLE_SUFFIX, 'intraday') is false",
nonIncrementalTableName: "events_*",
nonIncrementalTableEventStepWhere:
"contains_substr(_TABLE_SUFFIX, 'intraday') is false",
};
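The `_TABLE_SUFFIX` filter above covers the last 3 complete export days while skipping intraday tables. A small sketch of the date window it selects (illustration only, plain JS date arithmetic):

```javascript
// Sketch of the date window the _TABLE_SUFFIX filter covers:
// from 3 days ago through yesterday, in the export's YYYYMMDD suffix format.
function suffixRange(today, lookbackDays = 3) {
  const fmt = (d) => d.toISOString().slice(0, 10).replace(/-/g, "");
  const day = (offset) => {
    const d = new Date(today);
    d.setUTCDate(d.getUTCDate() - offset);
    return fmt(d);
  };
  return { from: day(lookbackDays), to: day(1) };
}

console.log(suffixRange(new Date("2024-05-10")));
// { from: '20240507', to: '20240509' }
```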
Scheduler / event based
Steps:
Problem with the GitHub export: you need retries. So instead of Pub/Sub, use Cloud Tasks (3 retries are usually enough)
Deploy Dataform + Scheduling + Alerts + Anomalies
I have sessions and events, so what?
Build dashboards
Send Pub / Sub event on success / failure
Use BigQuery ML (prepare tables upfront, so there are no joins when you create the model)
Try multi-touch attribution models (the open-source ChannelAttribution package by David Loris)
Export to Parquet
Send offline conversions to platforms
More..
Lessons learned
Unit testing: provide input and expected output tables, then run the actions. It doesn’t support incremental tables.
Documentation: JSDoc + Docusaurus + Endless time
Open questions: multiple paid channels during one session; registration and logout sessions
Artem Korneev
Please connect with me on LinkedIn using the QR code:
linkedin.com/in/artem-korneev or email artem.korneev@ommax.de