1 of 25

Why Rent When You Can Own?

Build your modern data lakehouse with true optionality

2 of 25

What’s The Problem With Today’s Architecture?

01

3 of 25

The Data Warehouse Paradigm Creates Vendor Lock-In

Your data is locked into a proprietary database

4 of 25

Why Data Lakehouse

02

5 of 25

Lakehouse = Data Warehouse Without Vendor Lock-in, With Best-Of-Breed Tools

Data warehouse features coming to the data lake
All use cases in one place (low latency covering PB of data)
Diversity of data
Variety of tools and engines interacting with the data
High performance, fully replicated object storage

6 of 25

What An Open Lakehouse Looks Like

Ingestion of many different types of data in real-time or batch
Layers from raw to fully aggregated
Columnar file formats along with popular table formats
Multi-engine
Variety of end users from traditional BI to data science

7 of 25

Lakehouse Offers More Functionality Without Compromise

Feature	Lakehouse	Data Warehouse
Interactive queries	Yes	Yes
Manipulation of data (DML)	Yes	Yes
Petabytes of data	Yes	No
Indexing and caching to speed up queries	Yes, with Starburst+Varada	Yes
Ability to use the best engine for your use case, not locked into a vendor’s ecosystem	Yes	No
Optionality to switch to open source	Yes, with Starburst/Trino	No
Active data warehousing	No	Yes

8 of 25

Get The Benefits Of Today And Years To Come With The Lakehouse

“Innovate or we’re dead.” This doesn’t just apply to software vendors
“Sticky” can be good or bad, you decide
Data is the energy of your company (not oil)
KISS - Data mesh, decentralized, single-truth all apply
Data warehouses lock you in, yet they are slowly going out of fashion - don’t get caught

9 of 25

Why The Starburst Approach To The Lakehouse

Data is in your account, under your control
Many engines / solutions can interact with your data
Many use cases are supported, including data science and traditional BI reporting
No vendor lock-in (use the best engine for the job)
Enhance your data lake with other sources via data mesh architecture (query federation)

10 of 25

How To Build The Data Lakehouse

03

11 of 25

Lakehouse Architecture

Variety of data sources are ingested
Object storage is primary destination
Table formats bring database functionality

Many engines to choose from (Spark, Trino, etc.)
100s of access tools (BI, Data Science, SQL, etc.)

12 of 25

What It Looks Like With Starburst Galaxy

13 of 25

Operate

Your Data Lakehouse with Starburst

04

14 of 25

Use Case: Data Lakehouse Engine

Deploy to any environment. Also supports HDFS, cloud storage, and S3-compatible storage (Dell ECS, MinIO, etc.)

High concurrency, auto-scaling MPP engine (Trino), which is widely used in industry (replaced Hive)

Full role based access control

EMIS Health Case Study

Just yesterday, I was talking to a data engineer at an infrastructure software company who was encountering what many of you probably encounter. Data is on a system like Redshift, and latencies are really slow. So your business users are yelling at you because the dashboard takes a minute to load.

And in fact, when you consider it, Starburst has all the functionality that you would care about. Great latencies. That’s our strong point. It’s not just used as the big data compute engine at Meta, where we got our start. Companies like Slack, Doordash, Grubhub, Tesla, Expedia, EMIS Health, all use Starburst to deliver much faster latencies on big data that other tools like Redshift simply couldn’t.

EMIS Health, as an example, deployed Starburst to produce complex visualizations like analysis of COVID subvariant trends across Europe. Imagine the scale for a second: all the patient applications streaming their data into the cloud data lake to process, terabytes to petabytes of data. They struggled to get it working. Redshift couldn’t scale. The Snowflake bill would’ve been insanely high. Starburst solved their problem.

15 of 25

Use Case: Data Lakehouse w/ Data Mesh

SELECT
  c.custkey,
  o.orderkey,
  o.shippriority
FROM
  teradata.tpch.customer c
  JOIN sql_server.tpch.orders o ON c.custkey = o.custkey

Query over 35 data sources using standard ANSI SQL

Starburst engine provides really fast speeds via file indexing, caching, cost-based optimizer, dynamic filtering and join pushdown, and more
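To make one of the optimizations named above more concrete, here is a minimal Python sketch of the idea behind dynamic filtering. It is a simplified illustration, not Starburst's or Trino's implementation: the function name, the in-memory row format, and the sample data are all assumptions made for the example. The engine reads the smaller side of a join first, collects the join keys it saw, and uses that set to skip non-matching rows on the larger side before joining. In a real federated query the filter would be pushed toward the remote source so that skipped rows are never transferred at all.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def join_with_dynamic_filter(
    build_rows: List[Row], probe_rows: List[Row], key: str
) -> Tuple[List[Row], int]:
    """Hash join that filters the probe side with keys seen on the build side.

    Returns the joined rows and the number of probe rows that passed the filter.
    """
    # 1. Read the smaller (build) side fully and index it by join key.
    build_index: Dict[object, List[Row]] = {}
    for row in build_rows:
        build_index.setdefault(row[key], []).append(row)

    # 2. The set of build-side keys is the dynamic filter.
    dynamic_filter = set(build_index)

    # 3. Scan the larger (probe) side, dropping rows the filter rules out.
    joined: List[Row] = []
    probe_rows_kept = 0
    for row in probe_rows:
        if row[key] not in dynamic_filter:
            continue  # skipped early; never reaches the join
        probe_rows_kept += 1
        for match in build_index[row[key]]:
            joined.append({**match, **row})
    return joined, probe_rows_kept


# Hypothetical sample data shaped like the TPC-H customer and orders tables.
customers = [{"custkey": 1, "name": "A"}, {"custkey": 3, "name": "C"}]
orders = [
    {"orderkey": 10, "custkey": 1, "shippriority": 0},
    {"orderkey": 11, "custkey": 2, "shippriority": 0},
    {"orderkey": 12, "custkey": 3, "shippriority": 0},
    {"orderkey": 13, "custkey": 4, "shippriority": 0},
    {"orderkey": 14, "custkey": 1, "shippriority": 0},
]
joined, kept = join_with_dynamic_filter(customers, orders, "custkey")
```

With this sample data only three of the five order rows survive the filter, which is the saving the technique aims for when the probe side lives in a remote system.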

Doordash Case Study

Back at Facebook, there were all these data silos. So tons of funny stories. Facebook feed data was stored in external databases, and so people simply weren’t using that data to detect fraud. And you know, ETL’ing that data into your data lake and having multiple copies of the data, it gets really expensive. And if you’re tackling a use case like fraud detection, it just gets very expensive. And today, we have lots of customers that have data sovereignty problems too, where this data has to stay in this country, that data has to stay in that country.

Once we built query federation capabilities at Facebook, one could run queries over the Facebook feed data from different sources to find fraud.

These days, data mesh is a hot buzz phrase, and everyone tries to toss their hat into the ring. I linked a more detailed case study on how Doordash chose us as their data mesh platform. But the synopsis is that they evaluated all the vendors, the likes of Snowflake, Databricks, and AWS Athena, which runs Trino on the backend. The platforms either didn’t have the ability to query external data tables, or they were prohibitively slow and expensive. Think about it: the reality is that people need to process tons of data. Think about the context in which you use big data: you’re trying to analyze all the order data to detect trends over time, or you’re preparing data for ML jobs. Starburst has all these optimizations that make querying external data tables really fast, things like file indexing, caching, a cost-based optimizer, dynamic filtering, and join pushdown. So the net result is that Starburst is 10x faster, and has the suite of connectors you need to get started.

16 of 25

Use Case: Data Processing Engine

ELT processing engine (now with fault tolerance!)
Raw/Stage - 50+ connectors to extract data or create views
Join, curate and enhance data on any data storage w/ standard SQL
Data pipelines accessible to everyone who knows SQL, not bottlenecked by data engineers (TalkDesk)
Learn how Zillow and Lyft use Starburst/Trino for data processing at Trino Summit on Nov 10 (link)
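The raw-to-curated flow in the bullets above can be sketched with plain SQL. The example below is only an illustration under stated assumptions: it uses Python's built-in SQLite as a local stand-in so it runs anywhere, and the table names (raw_orders, stage_orders, curated_customer_spend) and sample rows are hypothetical. On Starburst/Trino the same CREATE TABLE AS SELECT pattern would typically target fully qualified catalog.schema.table names instead.

```python
import sqlite3

# SQLite stands in for the query engine here; it is NOT Trino.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw layer: data as it was landed (hypothetical sample rows).
    CREATE TABLE raw_orders (
        orderkey INTEGER, custkey INTEGER, totalprice REAL, orderstatus TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 100, 50.0, 'F'),
        (2, 100, 25.0, 'O'),
        (3, 200, 10.0, 'F'),
        (4, 200, NULL, 'F');

    -- Stage layer: drop rows that fail a basic quality check.
    CREATE TABLE stage_orders AS
    SELECT orderkey, custkey, totalprice, orderstatus
    FROM raw_orders
    WHERE totalprice IS NOT NULL;

    -- Curated layer: aggregate into a table ready for BI consumption.
    CREATE TABLE curated_customer_spend AS
    SELECT custkey, COUNT(*) AS order_count, SUM(totalprice) AS total_spend
    FROM stage_orders
    GROUP BY custkey;
    """
)

curated = conn.execute(
    "SELECT custkey, order_count, total_spend "
    "FROM curated_customer_spend ORDER BY custkey"
).fetchall()
```

Each layer is just a SQL statement over the previous one, which is why a pipeline like this can be written and debugged by anyone comfortable with SQL.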

Salesforce case study

A bit of backstory on Starburst as a data processing engine. We initially conceived Trino for ad-hoc exploration of big data and for dashboarding. And it got viral adoption not just at Meta, but also at companies like Netflix, Lyft, LinkedIn, Shopify, and Treasure Data. But when we talked to our users, we discovered that everyone was using Trino for their data pipelines. People really valued it: they could get Trino speeds for their data pipelines. The data engineers really valued having the Trino interactive experience when developing data pipelines, where you can test snippets of SQL code as you develop or debug.

Data analysts are often familiar with Trino and Starburst; we’re the tool that people developed their dashboards in. But when a data pipeline needs to be developed, it gets bottlenecked by data engineers. The analysts simply don’t have the advanced Spark knowledge, and often don’t even have access to Spark, because access is limited to prevent exploits that rely on the ability to execute arbitrary Python code.

And quick plug: we have the Trino Summit coming up on November 10th, and there you’ll have companies like Zillow show you how they use Starburst for data processing.

17 of 25

What Makes Starburst (Trino) A Versatile Engine

05

18 of 25

Fast And Cost-Efficient

Compared to open source Spark, Starburst executes queries 38% faster

*Test run on TPC-H 10TB data schema using 5 m5.8xlarge machines

19 of 25

Ability To Run Trino on Spots For Cost Savings

Running on spot instances is desirable because compute is often 50% cheaper
An external exchange buffer service makes Trino resilient to losing spot nodes
The latency penalty from losing nodes is half of what it is in Spark
Case study: 60% cost savings at BlueCat
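The cost argument in the bullets above is simple arithmetic, sketched below. This is a back-of-the-envelope model, not a Starburst pricing formula: the function names, the hourly price, the node-hours, and the rework fraction (extra node-hours re-run after nodes are reclaimed) are all hypothetical assumptions. Only the 50% discount comes from the slide.

```python
def run_cost(
    node_hours: float,
    hourly_price: float,
    discount: float = 0.0,
    rework_fraction: float = 0.0,
) -> float:
    """Cost of a job, where rework_fraction is extra work re-run after lost nodes."""
    return node_hours * (1.0 + rework_fraction) * hourly_price * (1.0 - discount)


def break_even_rework(discount: float) -> float:
    """Rework fraction at which discounted capacity costs the same as full price."""
    return discount / (1.0 - discount)


# Hypothetical job: 100 node-hours at an assumed on-demand price of 2.0 per hour.
on_demand = run_cost(node_hours=100, hourly_price=2.0)

# Same job on spot: 50% cheaper (from the slide), assuming 20% of work is re-run.
spot = run_cost(node_hours=100, hourly_price=2.0, discount=0.5, rework_fraction=0.2)

savings = 1.0 - spot / on_demand
```

Under this model a 50% discount only stops paying off once the job has to redo as much work as it originally did (a rework fraction of 1.0), so an engine that loses little work when a node disappears keeps most of the discount.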

20 of 25

Trino Is Fast And Predictable On Spots

Trino query execution time on spot instances is faster than Spark on-demand instances

21 of 25

Starburst Galaxy+dbt Demo

06

22 of 25

Demo - Building Pipeline and Consumption

23 of 25

Starburst Galaxy Provides Great Ecosystem For Trino

Ecosystem of connectors

Performance and flexibility

Scalability

Ease of use / consumability

Security and compliance

Optionality

24 of 25

Ease of use and consumability

Capabilities that enable easy discovery and consumption of high-quality data

Easy to connect to a rich ecosystem of data sources, BI tools, partner products

Intuitive user experience using the SQL skills and tools you already know

Fully managed SaaS option

Resource elasticity: reduces need for dedicated operational team

Flexible and transparent licensing, pricing, and billing options
