Snowplow drives everything we do

What and why?

Bauer Media

  • Digital and print publisher
  • Family-owned German company
  • 116 sites across Australia and New Zealand
  • Tag management across all sites

Just start collecting

  • Snowplow data collection in 2014
  • We didn’t really have a use case

Stuff we record

  • Page views
  • Metadata around content
  • User logins
  • Email click-throughs
  • Ad impressions

Use cases started showing up

  • Cross-site integrated reporting
  • Ad hoc tricky analysis
  • Sanity checking industry audience reporting
  • Stalking individual users
  • Audience overlaps
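
As a rough illustration of the audience-overlap use case, the sketch below counts users shared between each pair of sites. It assumes page views have already been exported as (site, user_id) pairs; the function name, the sample IDs and the site name "site_b" are hypothetical and not taken from our pipeline.

```python
from collections import defaultdict


def audience_overlap(page_views):
    """Return {(site_a, site_b): number of users seen on both sites}."""
    users_by_site = defaultdict(set)
    for site, user_id in page_views:
        users_by_site[site].add(user_id)

    sites = sorted(users_by_site)
    overlaps = {}
    # Compare every unordered pair of sites exactly once.
    for i, site_a in enumerate(sites):
        for site_b in sites[i + 1:]:
            shared = users_by_site[site_a] & users_by_site[site_b]
            overlaps[(site_a, site_b)] = len(shared)
    return overlaps


# Hypothetical sample rows: (site, user_id)
sample = [
    ("dolly", "u1"),
    ("dolly", "u2"),
    ("dolly", "u2"),   # repeat views by one user count once
    ("site_b", "u2"),
    ("site_b", "u3"),
]
print(audience_overlap(sample))
```

Using sets per site means repeat page views by the same user do not inflate the overlap.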

Dolly usage by hour

User behaviour

Ad impressions

Content metadata

Trending service

Recommendations

Dashboards

Ad hoc analysis

Some things you can’t do in GA

  • Tag-based reporting
  • Accurate reporting of in-app Facebook using user-agent contains FBAN
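
The Facebook in-app check above comes down to a substring test on the raw user-agent string. A minimal sketch, assuming the user agents have been pulled out of the events data as plain strings; the function names and the sample user-agent values are illustrative only.

```python
def is_facebook_in_app(user_agent):
    """True when the user-agent string contains the FBAN marker."""
    return "FBAN" in (user_agent or "")


def facebook_in_app_share(user_agents):
    """Fraction of the given user agents that carry the FBAN marker."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    hits = sum(1 for ua in agents if is_facebook_in_app(ua))
    return hits / len(agents)


# Illustrative user-agent strings, not real log lines
sample_agents = [
    "Mozilla/5.0 (iPhone) AppleWebKit/537.51.2 Mobile/11D167 [FBAN/FBIOS;FBAV/10.0]",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/35.0 Safari/537.36",
    None,  # events with no user agent recorded
    "",
]
print(facebook_in_app_share(sample_agents))
```

Missing or empty user agents are counted as non-Facebook rather than dropped, so the denominator stays equal to the number of events.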

We’re using Snowplow 0.9.2 from 2014-04-29!

  • It just works
  • We’ve been busy building other stuff

But...

  • Page pings are b0rken: no time spent or scroll depth
  • (Out-of-the-box) browser categorisation is terrible
  • Hourly batches mean higher latency than we’d like
  • No context shredding, but JSON queries are performant enough
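
Without shredding, each event carries its context as a raw JSON string that has to be parsed at query time. In Redshift that would typically be done with its built-in JSON functions; the sketch below shows the same idea in Python. The row layout and the "tags" key are assumptions made for illustration, not our actual schema.

```python
import json


def events_with_tag(rows, tag):
    """Return the event ids whose context JSON lists the given tag."""
    matches = []
    for event_id, context_json in rows:
        if not context_json:
            continue  # event recorded without any context
        try:
            context = json.loads(context_json)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole query
        if isinstance(context, dict) and tag in context.get("tags", []):
            matches.append(event_id)
    return matches


# Hypothetical rows: (event_id, context JSON string)
sample_rows = [
    ("e1", '{"tags": ["fashion", "beauty"]}'),
    ("e2", '{"tags": ["music"]}'),
    ("e3", None),
    ("e4", "not json"),
]
print(events_with_tag(sample_rows, "fashion"))
```

Parsing per row costs more than reading a shredded column, which is the trade-off the bullet above accepts.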

runSnowPlow.sh

  • Web page (JavaScript in page creates image beacon)
  • S3
  • CloudFront
  • SnowCannon (Node app in Elastic Beanstalk)
  • ETL (Elastic MapReduce)
  • S3
  • events (Redshift)
  • events_temp (Redshift)
  • x_events (Redshift)

Arrow labels in the diagram: "Redirects to", "Writes logs to"

Tips

  • Redshift can get very expensive very quickly
  • Decent dashboarding platforms are rare
  • And plenty of crap ones are overpriced
  • Just tip everything in and worry about what you’ll do later

What’s next?

Future plans

  • Upgrade ETL to real-time: probably our own solution
  • Time spent and scroll depth
  • Shredding?