BigQuery DataFrames
BigFrames
Jiaxun Wu (jiaxun@google.com)
Feb 2024
Proprietary + Confidential
Agenda
01 Overview
02 bigframes.ml
03 bigframes.pandas
04 Architecture
05 Demo
AI/ML Pipelines Overview in BQ
SQL:
CREATE MODEL()
ML.PREDICT()
ML.GENERATE_TEXT()
ML.GENERATE_EMBEDDING()
ML.MIN_MAX_SCALER()
…
CREATE VECTOR INDEX()
VECTOR_SEARCH()
…
Python:
model.fit()
model.predict()
PaLM2TextGenerator()
PaLM2TextEmbeddingGenerator()
MaxAbsScaler.fit()
…
similarity_search()
similarity_search_by_vector()
…
BigFrames Overview
BigQuery DataFrames, also known as BigFrames, is an open-source Python client that simplifies the interaction with BigQuery and GCP by compiling popular Python APIs into scalable BQ SQL queries + API calls.
Industry: Snowpark, PyStarburst, pyspark.pandas, PyODPS
How popular are Python and SQL?
import bigframes.pandas as bf
# Read a BQ table
bq_df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")
# Calculate aggregations on the full dataset without downsampling
average_body_mass = bq_df["body_mass_g"].mean()
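To make the "compiling Python APIs into SQL" idea concrete, here is a toy sketch in plain Python. It is not how BigFrames is implemented: `LazyFrame` and `LazyColumn` are hypothetical names invented for illustration, and unlike the real `mean()`, which runs the query and returns a value, this version only returns the SQL text it would send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyColumn:
    """Hypothetical stand-in for a deferred column expression."""

    table: str
    name: str

    def mean(self) -> str:
        # Nothing is computed locally; the call is translated into SQL text.
        return (
            f"SELECT AVG(`{self.name}`) AS mean_{self.name} "
            f"FROM `{self.table}`"
        )


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame backed by a BigQuery table."""

    table: str

    def __getitem__(self, column: str) -> LazyColumn:
        # Column selection only records which column was requested.
        return LazyColumn(self.table, column)


frame = LazyFrame("bigquery-public-data.ml_datasets.penguins")
sql = frame["body_mass_g"].mean()
print(sql)
```

The point of the sketch is that the pandas-style call chain builds a query description, so the aggregation can run over the full table inside BigQuery rather than over a downloaded sample.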
bigframes.ml
sklearn-like APIs on top of BigQuery ML
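A minimal sketch of what "sklearn-like" looks like in practice, assuming the `bigframes` package is installed and Google Cloud credentials are configured. The function name `train_penguin_model` and the choice of feature columns are illustrative assumptions; the imports are placed inside the function so that defining it does not require a BigQuery connection.

```python
def train_penguin_model():
    """Fit and apply a linear regression with the work done in BigQuery.

    Illustrative only: requires the `bigframes` package and valid
    Google Cloud credentials when called.
    """
    import bigframes.pandas as bpd
    from bigframes.ml.linear_model import LinearRegression

    # Load the public penguins table and drop rows with missing values.
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

    # Feature and label columns (an assumed selection for this sketch).
    X = df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
    y = df[["body_mass_g"]]

    model = LinearRegression()
    model.fit(X, y)          # scikit-learn style training call
    return model.predict(X)  # scikit-learn style prediction call
```

Calling `train_penguin_model()` in an authenticated session returns a BigQuery DataFrame of predictions; the `fit` and `predict` calls correspond to the `CREATE MODEL()` and `ML.PREDICT()` statements on the SQL side of the pipeline overview.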
bigframes.ml
Model Register
Model Prediction
Pipeline
Serving
Evaluation
Training
Data Preparation
bigframes.ml.preprocessing
bigframes.ml.metrics
bigframes.ml.cluster
bigframes.ml.decomposition
bigframes.ml.forecasting
bigframes.ml.imported
…
bigframes.ml.model_selection
Predictive AI
bigframes.ml.llm
2024
2023
PaLM2TextGenerator
PaLM2TextEmbeddingGenerator
More…
More…
Generative AI
bigframes.pandas
I/O
bq_df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")
local_df = bf.read_csv("/content/sample_data/california_housing_test.csv")
in_memory_df = bf.read_pandas(pandas_df)
in_memory_df.to_gbq("project.dataset.table")
local_df.to_csv(GCS_BUCKET + "california_housing_test_COPY.csv")
Data Manipulation
400+ pandas APIs
Arbitrary Python UDFs
Data Visualization & OSS Integration
Scale & Performance
Scalable:
Performant:
BigFrames Architecture
Internal Stack
indexing
ordering
Indexing & Ordering (deferred execution)
# Consistent head
df.head()
# Row access
gcs_df.iloc[12]
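SQL tables have no inherent row order, so a consistent `head()` and positional row access need an explicit ordering key carried alongside the deferred query. The toy sketch below shows one way this could be expressed; `OrderedLazyFrame`, the `row_id` column, and the generated SQL are hypothetical illustrations, not the BigFrames internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderedLazyFrame:
    """Hypothetical deferred frame that remembers a total row ordering."""

    table: str
    ordering_column: str  # assumed column that defines a stable row order

    def head(self, n: int = 5) -> str:
        # Ordering by a fixed key makes repeated head() calls return the same rows.
        return (
            f"SELECT * FROM `{self.table}` "
            f"ORDER BY `{self.ordering_column}` LIMIT {n}"
        )

    def iloc(self, position: int) -> str:
        # Positional access becomes an offset into the ordered result.
        if position < 0:
            raise ValueError("this sketch only supports non-negative positions")
        return (
            f"SELECT * FROM `{self.table}` "
            f"ORDER BY `{self.ordering_column}` LIMIT 1 OFFSET {position}"
        )


frame = OrderedLazyFrame("project.dataset.table", ordering_column="row_id")
head_sql = frame.head()
row_sql = frame.iloc(12)
print(head_sql)
print(row_sql)
```

Because both methods only build SQL text, nothing runs until a result is actually needed, which is the sense of "deferred execution" in the slide title.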
Pandas Ops Implementation (dynamic operator)
Demo: LLM powered Synthetic data generator
Thank You!
bigframes-feedback@google.com