1 of 21

1

Google BigQuery Story:�Optimizing 25PB Storage

2 of 21

BigQuery in AppsFlyer

  • Analytical storage with over 25PB of data
  • Serves raw data reports
  • Used for internal analytics

2

3 of 21

Highly scalable

Powerful toolset

Fully managed

Fast feeding

Fast SQL queries

Why

4 of 21

BigQuery Data Flows

4

5 of 21

BigQuery Internals

  • Dremel Implementation
  • Columnar Storage
  • Tree Architecture

5

6 of 21

Columnar Storage

6

7 of 21

Columnar Storage

  • Less traffic
  • Efficient scans (SIMD instructions)
  • Highly compressible
  • No updates

7

8 of 21

Tree Architecture

Massively Parallel Execution

8

9 of 21

Legacy Storage Overview

  • Legacy SQL dialect
  • Table-based partitioning
  • Manual data retention

9

10 of 21

Legacy Storage Overview

  • installs_ios_google_20181001
  • installs_ios_google_20181002
  • installs_ios_google_20190101

10

11 of 21

The Goal

  • Reduce slots consumption
  • Optimize query time
  • Improve table management

11

12 of 21

The Solution

  • Legacy SQL -> Standard SQL
  • Modern partitioning
  • Implement clustered tables

12

13 of 21

Change Partitioning

  • installs_ios_google_20181001
  • installs_ios_google_20181002
  • installs_ios_google_20190101

13

installs_ios_google

partitioned by event date

14 of 21

Clustered Tables

  • Organizes data inside files
  • Adjacent data next to each other
  • We always have `app_id` predicate in queries

14

15 of 21

How to migrate�with zero downtime?

16 of 21

Migration Process

  1. Create new tables, start writing
  2. Migrate legacy data (duplicate storage)
  3. Test
  4. Switch to new tables
  5. Repeat steps 1-4 for all topics

16

17 of 21

Achievements

  • Query time reduce
  • Slots consumption reduce
  • Easier tables management

17

18 of 21

Challenges

  • Data storage duplication
  • GDPR
  • SQL Dialect change

18

19 of 21

What didn’t help

  • Making schema strongly typed
  • Simplify queries (no regex)

19

20 of 21

Links

20

21 of 21

Thanks!