1 of 14

BigQuery DataFrames

BigFrames

Jiaxun Wu (jiaxun@google.com)

Feb 2024

Proprietary + Confidential

2 of 14

Agenda

01  Overview
02  bigframes.ml
03  bigframes.pandas
04  Architecture
05  Demo

3 of 14

AI/ML Pipelines Overview in BQ

  • Prepare data for AI
    • Feature preprocessing to prepare data for training, inference or evaluation
    • Train and fine-tune models
  • Process data with AI
    • Extract information, generate embeddings, perform vector similarity search and RAG
    • Join resulting datasets with other data in BQ, and/or
    • Make it available for online and analytical AI apps/agents
  • Structured and unstructured data
  • SQL and Python

SQL:

  • CREATE MODEL()
  • ML.PREDICT()
  • ML.GENERATE_TEXT()
  • ML.GENERATE_EMBEDDING()
  • ML.MIN_MAX_SCALER()
  • …
  • CREATE VECTOR INDEX()
  • VECTOR_SEARCH()

Python:

  • model.fit()
  • model.predict()
  • PaLM2TextGenerator()
  • PaLM2TextEmbeddingGenerator()
  • MaxAbsScaler.fit()
  • …
  • similarity_search()
  • similarity_search_by_vector()
  • …

4 of 14

BigFrames Overview

BigQuery DataFrames, also known as BigFrames, is an open-source Python client that simplifies working with BigQuery and GCP by compiling popular Python APIs (pandas-like and sklearn-like) into scalable BigQuery SQL queries and API calls.

Comparable industry tools: Snowpark, PyStarburst, pyspark.pandas, PyODPS

How popular are Python and SQL?

import bigframes.pandas as bf

# Read a BQ table

bq_df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Calculate aggregations on the full dataset without downsampling

average_body_mass = bq_df["body_mass_g"].mean()
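
The idea of compiling a Python call into SQL can be illustrated with a toy example. This is a minimal sketch and not how BigFrames is implemented: `LazyFrame` and `LazyColumn` are hypothetical names, and the standard-library `sqlite3` module stands in for BigQuery so the example runs locally. The point is only that `df["body_mass_g"].mean()` can become an `AVG` query executed by the engine, so no rows are pulled into Python.

```python
import sqlite3


class LazyColumn:
    """Hypothetical column handle: aggregations compile to SQL, rows stay in the engine."""

    def __init__(self, conn, table, column):
        self.conn = conn
        self.table = table
        self.column = column

    def mean_sql(self):
        # Build the SQL text that represents the Python-level .mean() call.
        return f'SELECT AVG("{self.column}") FROM "{self.table}"'

    def mean(self):
        # Only the single aggregated value travels back to Python.
        return self.conn.execute(self.mean_sql()).fetchone()[0]


class LazyFrame:
    """Hypothetical frame: a reference to a table, not a copy of its data."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def __getitem__(self, column):
        return LazyColumn(self.conn, self.table, column)


# sqlite3 is a local stand-in for BigQuery (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3000.0), ("Chinstrap", 4000.0), ("Gentoo", 5000.0)],
)

df = LazyFrame(conn, "penguins")
compiled_sql = df["body_mass_g"].mean_sql()
average_body_mass = df["body_mass_g"].mean()
```

Here `compiled_sql` holds the generated query text and `average_body_mass` holds the value computed by the engine.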

5 of 14

bigframes.ml

sklearn-like APIs on top of BigQuery ML

6 of 14

bigframes.ml

[Diagram labels: Model Register, Model Prediction, Pipeline, Serving, Evaluation, Training, Data Preparation]

bigframes.ml.preprocessing

  • StandardScaler
  • KBinsDiscretizer
  • LabelEncoder
  • MaxAbsScaler
  • OneHotEncoder

bigframes.ml.metrics

  • accuracy_score
  • f1_score
  • r2_score

bigframes.ml.cluster

  • KMeans

bigframes.ml.decomposition

  • PCA

bigframes.ml.forecasting

  • ARIMAPlus

bigframes.ml.imported

  • ONNXModel
  • TensorFlowModel

bigframes.ml.model_selection

  • train_test_split

Predictive AI
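
The modules listed above can be combined in the usual sklearn style. The following is a hedged sketch, not taken from the deck: it assumes a recent `bigframes` release, a configured GCP project, and the public penguins table. The function name `train_penguin_regressor`, the feature choice, and the `model_name` parameter are illustrative assumptions. The imports are deferred into the function so that the snippet can be loaded without `bigframes` installed; calling it runs BigQuery jobs.

```python
# Illustrative feature and label columns from the public penguins table.
FEATURES = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
LABEL = "body_mass_g"


def train_penguin_regressor(model_name=None):
    """Train and evaluate a regressor in BigQuery ML (sketch, requires GCP access)."""
    # Deferred imports: nothing below runs until the function is called.
    import bigframes.pandas as bf
    from bigframes.ml.linear_model import LinearRegression
    from bigframes.ml.model_selection import train_test_split
    from bigframes.ml.pipeline import Pipeline
    from bigframes.ml.preprocessing import StandardScaler

    # Data preparation: read the table and drop rows with missing values.
    df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()
    X = df[FEATURES]
    y = df[[LABEL]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Training: preprocessing and the model are chained in one pipeline.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("regress", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)

    # Evaluation: returns a DataFrame of regression metrics.
    scores = pipeline.score(X_test, y_test)

    # Optional model registration; model_name is a hypothetical
    # "dataset.model" identifier supplied by the caller.
    if model_name is not None:
        pipeline.to_gbq(model_name, replace=True)
    return pipeline, scores
```

A call such as `train_penguin_regressor("my_dataset.penguin_model")` would train, score, and save the model, where the dataset name is a placeholder.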

7 of 14

bigframes.ml.llm

2024

2023

PaLM2TextGenerator

PaLM2TextEmbeddingGenerator

More…

More…

Generative AI

8 of 14

bigframes.pandas

I/O

bq_df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")

local_df = bf.read_csv("/content/sample_data/california_housing_test.csv")

in_memory_df = bf.read_pandas(pandas_df)

in_memory_df.to_gbq("project.dataset.table")

local_df.to_csv(GCS_BUCKET + "california_housing_test_COPY.csv")

Data Manipulation

400+ pandas APIs

  • Indexing
  • Ordering
  • Mutations

Arbitrary Python UDFs

Data Visualization & OSS Integration

  • Round trip to pandas dataframe
  • Smart downsampling
  • Integration with matplotlib

Scale & Performance

Scalable:

  • TBs of data
  • Everything runs in BQ

Performant:

  • Opportunistic execution
  • Mutation caching

9 of 14

BigFrames Architecture

10 of 14

Internal Stack

indexing

ordering

11 of 14

Indexing & Ordering (deferred execution)

# Consistent head

df.head()

# Row access

gcs_df.iloc[12]
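
SQL tables have no inherent row order, so a consistent `head()` and positional row access need an explicit ordering that is applied when the query finally runs. The toy below sketches that idea under stated assumptions: `OrderedLazyFrame`, its `order_by` column, and the `queries_run` counter are hypothetical, and `sqlite3` stands in for BigQuery. It is not the BigFrames implementation.

```python
import sqlite3


class OrderedLazyFrame:
    """Hypothetical frame that remembers an ordering column and defers execution."""

    def __init__(self, conn, table, order_by):
        self.conn = conn
        self.table = table
        self.order_by = order_by
        self.queries_run = 0  # counts queries actually sent to the engine

    def _run(self, sql, params=()):
        self.queries_run += 1
        return self.conn.execute(sql, params).fetchall()

    def head(self, n=5):
        # The ORDER BY makes the first n rows the same on every call.
        sql = f'SELECT * FROM "{self.table}" ORDER BY "{self.order_by}" LIMIT ?'
        return self._run(sql, (n,))

    def iloc(self, position):
        # Positional access: skip `position` rows in the defined order.
        sql = (
            f'SELECT * FROM "{self.table}" '
            f'ORDER BY "{self.order_by}" LIMIT 1 OFFSET ?'
        )
        rows = self._run(sql, (position,))
        if not rows:
            raise IndexError("position out of range")
        return rows[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (seq INTEGER, value INTEGER)")
# Insert in reverse so physical order differs from the logical order.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i, i * 10) for i in reversed(range(20))],
)

df = OrderedLazyFrame(conn, "readings", order_by="seq")
queries_before = df.queries_run  # nothing has executed yet
first_rows = df.head(3)
row_12 = df.iloc(12)
```

Constructing the frame sends no query; work happens only when `head` or `iloc` is called.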

12 of 14

Pandas Ops Implementation (dynamic operator)

13 of 14

Demo: LLM powered Synthetic data generator

14 of 14

Thank You!

bigframes-feedback@google.com
