Data & AI from First Principles
Overview
Introduction
We will focus on timeless ideas that allow you to reason about tradeoffs for a specific use case's needs. These concepts will not change even as the industry, products, and roles continue to evolve.
We will not be discussing technical how-to. Questions like "why am I getting this Spark bug?" are off-topic. You will have plenty of opportunities to learn technical implementation through regular experience, certification programs, and other field enablement. That said, we will get very low-level and technical in order to reach our ultimate goal of understanding the "why" behind the tech.
We'll start with history and theory and gradually use this foundation to learn the Databricks product and the modern ecosystem. Many of the readings are several years old, but their continued relevance is a testament to their timelessness, which makes them essential if we want to truly think from first principles.
How to Progress Through a Session
Foundation
The foundation includes classic chapters, papers, and blog posts that have shaped the field of data and analytics. While some of these were written by Databricks founders and employees, they are all vendor-agnostic contributions to the world. Some are many years old but marked the original start of a new trend.
Application
Learn about the given products using your preferred learning style, whether that's coding and tinkering or reading and listening, focusing on areas you're curious about. You will have plenty of time in your day job to troubleshoot and implement specifics; this class is about becoming aware of areas you haven't had a chance to focus on yet. The goal is not to become an expert in everything but to create pointers in your head to areas you will dig into later as you work.
Technical Discussion
We will have a small-group discussion on technical topics that set the stage for why the product is valuable in a new way. Questions are open-ended and have multiple right answers; the goal is not to show off. You don't need to memorize every detail to repeat to a customer, because you will pick those up through other enablement.
Value Discussion
We will open up the "why" behind our product and practice articulating this value to different audiences with diverse needs. This is not "pitch training"; it's about truly understanding the ideas behind products in the industry. Questions are open-ended and have multiple right answers.
The Book
In the earlier sessions, we will be reading Martin Kleppmann, Designing Data-Intensive Applications (2017).
- O’Reilly Media / Other ways to buy
- I recommend expensing an O'Reilly subscription so you can read other O'Reilly books too; many of them are relevant to the work of a data engineer or data scientist.
Schedule
7/12 Session 0: Class Kickoff
We’ll discuss the class structure and answer questions.
7/19 Session 1: Data Systems I: Storage and Retrieval
Foundation
- Read the Overview in this document
- Kleppmann ch. 1: Foundations
- Kleppmann ch. 3: Storage and Retrieval (the SSTables, LSM-Trees, and B-Trees sections are optional)
Application
Technical Discussion
- Compare and contrast Apache Spark to other OLAP systems
- A client wants to know if Spark can replace HBase or Cassandra. What questions should you ask them to determine whether this would make sense?
- Does Delta Lake support indexing?
- How do you evolve schemas in Parquet? How does that compare to Delta Lake?
- How does Spark natively support both dataframe-style Python development and tabular SQL queries? (a minimal sketch follows this list)
- How does Spark natively support structured, semi-structured, and unstructured data?
- How is it possible that Spark provides libraries for ML, SQL, Graph, and Structured Streaming under a single framework and development model?
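As background for the DataFrame/SQL question above, here is a minimal PySpark sketch; the sample data and column names are made up. Both expressions of the query compile to the same logical plan under Spark's Catalyst optimizer, which is what lets one engine serve both development styles.

```python
# A minimal sketch: the same aggregation expressed through the DataFrame
# API and through SQL. The data below is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-vs-sql").getOrCreate()

events = spark.createDataFrame(
    [("click", 3), ("view", 7), ("click", 5)],
    ["event_type", "duration_s"],
)
events.createOrReplaceTempView("events")

# DataFrame-style Python development...
df_result = (
    events.groupBy("event_type")
          .agg(F.avg("duration_s").alias("avg_duration"))
)

# ...and the equivalent tabular SQL query over the same data.
sql_result = spark.sql(
    "SELECT event_type, AVG(duration_s) AS avg_duration "
    "FROM events GROUP BY event_type"
)

# Both produce the same plan and the same rows.
df_result.show()
sql_result.show()
```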
Value Discussion
- How would you deploy Spark without Databricks?
- Why does it matter that Spark and Delta Lake are both open source?
- What is so hard about supporting ETL, ML, and SQL analytics in a single processing framework?
7/26 Session 2: Data Systems II: Consistency and the Cloud
Foundation
- Kleppmann ch. 7: Transactions: The slippery concept of a transaction
- Kleppmann ch. 9: Consistency: Consistency guarantees (Optional)
- Photon Paper
- Snowflake Paper
Application
Technical Discussion
- How do Lakehouse formats give ACID guarantees entirely on cloud object storage, even though cloud object storage itself provides no ACID semantics?
- How did one perform MERGE on columnar files (Parquet, ORC) before Lakehouse formats? (a Delta MERGE sketch follows this list)
- How do Photon and Snowflake implement vectorized processing?
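To ground the MERGE question above, here is a hedged sketch of an upsert with the Delta Lake Python API, assuming a SparkSession already configured for Delta; the path, sample data, and column names are hypothetical. The Delta transaction log, not the object store, is what makes the operation atomic.

```python
# A minimal sketch of a Delta Lake MERGE (upsert). Before Lakehouse
# formats, this typically meant rewriting whole Parquet partitions.
# Assumes an existing Delta-configured SparkSession (`spark`); the path
# and columns are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.createDataFrame(
    [(1, "alice@new.example")], ["id", "email"]
)

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdate(set={"email": "u.email"})
    .whenNotMatchedInsertAll()
    .execute())
```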
Value Discussion
- Why is providing ACID on S3 better than simply using a traditional RDBMS or EDW that already supports ACID?
- Why does it matter that Delta Lake is built on top of Parquet?
- What are use cases for Time Travel (that you wouldn’t normally think of)?
8/2 Session 3: Data Systems III: Processing and Streaming
Foundation
Application
- Spark Structured Streaming Programming Guide: Programming Model, Join Operations, Output Modes, Output Sinks, and Triggers (the rest is optional)
Technical Discussion
- How can you guarantee simultaneous batch/stream reads/writes from cloud storage? (e.g. consistent 10min query from a table while data is being upserted every second)
- How does Spark provide a single API for batch and streaming? (a minimal sketch follows this list)
- What are the limitations of micro-batch processing?
- Discuss common CDC tools and how the ingestion pattern would change for each
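For the single-API question above, a minimal sketch assuming an existing SparkSession and hypothetical paths: the transformation function is identical in both modes; only the read and write calls change.

```python
# A minimal sketch of Spark's unified batch/streaming API; assumes an
# existing Delta-configured SparkSession (`spark`). Paths and the
# checkpoint location are hypothetical.

def summarize(df):
    # Identical transformation logic for both execution modes.
    return df.groupBy("event_type").count()

# Batch: read the table once and overwrite a summary.
batch_df = spark.read.format("delta").load("/tmp/delta/events")
(summarize(batch_df).write
    .format("delta")
    .mode("overwrite")
    .save("/tmp/delta/summary"))

# Streaming: the same logic, applied incrementally and continuously.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")
(summarize(stream_df).writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/chk/summary")
    .start("/tmp/delta/summary_stream"))
```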
Value Discussion
- What’s new about the Medallion Architecture compared to previous ways of doing ETL?
- What are the implications of the unification of batch and streaming for how we design data architectures? (data duplication, freshness, etc.)
- How does the low cost and scale of cloud storage change how we think about ETL?
- What benefits does streaming provide outside of simply reducing latency?
- What value does Delta Live Tables provide to traditional EDW-style ETL workflows?
8/9 Session 4: The Lakehouse
Foundation
Application
Technical Discussion
- What are the key limitations of a data lake?
- What are the common limitations of a data warehouse designed for on-premise hardware? (including Redshift, which was originally designed to be on-prem)
- What is the process for training an ML model using data stored in a cloud DW? (a sketch of the typical path follows this list)
- How do different streaming products (Flink, Snowflake Snowpipe, Redpanda, ksql, etc.) compare, and what kind of latency can you expect from each relative to cost?
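For the cloud-DW training question above, a hedged sketch of the typical path, with hypothetical connection details: the data has to be pulled out of the warehouse (here over JDBC, with the matching driver on the classpath) before an ML framework can use it.

```python
# A hedged sketch: extracting training data from a cloud DW over JDBC
# before any ML can happen. The URL, credentials, and table are
# hypothetical; assumes an existing SparkSession (`spark`) and the
# warehouse's JDBC driver on the classpath.
features = (spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-host:5439/dev")
    .option("dbtable", "analytics.features")
    .option("user", "ml_user")
    .option("password", "***")
    .load())

# The warehouse cannot run the training loop itself, so the data
# leaves it here, bound for whatever ML framework you use.
pdf = features.toPandas()
```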
Value Discussion
- Why should a lakehouse be open?
- Why do most people still use a data warehouse?
- If people don’t make multi-million dollar decisions based on one query benchmark, then why is TPC-DS still important?
8/16 Session 5: MLOps
Note: in the interest of time, we will not be covering model training in this course. For a theoretical introduction to model training, read "An Introduction to Statistical Learning" (ISLR). For a more practical introduction, try Andrew Ng's ML courses. For deep learning specifically, use Jeremy Howard's fast.ai course.
Per the 2015 Google paper "Hidden Technical Debt in Machine Learning Systems": "Only a small fraction of real-world ML systems is composed of the ML code [...]. The required surrounding infrastructure is vast and complex."
Foundation
Application
Technical Discussion
- Do you really need a feature store?
- How do most people serve models?
- What are the different patterns for model serving? (Name 3 and discuss tradeoffs)
Value Discussion
- How many MLOps capabilities does Databricks provide in the AIIA landscape survey?
- How does MLflow solve for the hidden tech debt problem? (a small tracking sketch follows this list)
- How does traditional DevOps compare to MLOps? Why would MLOps require a distinct platform?
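As a concrete anchor for the MLflow question above, a minimal tracking sketch; the parameter and metric values are illustrative, not from a real model. Logging parameters, metrics, and artifacts together per run is one way MLflow tames the "vast and complex" surrounding infrastructure described in the hidden-tech-debt paper.

```python
# A minimal sketch of MLflow experiment tracking; values are
# illustrative placeholders, not real training results.
import mlflow

with mlflow.start_run(run_name="demo-run"):
    mlflow.log_param("max_depth", 5)    # hyperparameters...
    mlflow.log_metric("rmse", 0.42)     # ...and evaluation metrics
    # mlflow.log_artifact("model.pkl")  # plus versioned artifacts
```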
8/23 Session 6: The Modern Data Architecture
Foundation
Application
8/30 Session 7: Managing Data in the Organization
Foundation
- If interested, read: Great divide, Domain Ownership, Data as a Product
- Don't worry about the terminology; just focus on the core concept of data ownership
Application
Technical Discussion
- How is Delta Sharing better than copying data? (a minimal client sketch follows this list)
- How does Databricks Marketplace compare to the Snowflake Data Cloud?
- What does Unity Catalog provide that was previously not possible with Hive Metastore and legacy Databricks access control mechanisms?
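For the Delta Sharing question above, a hedged sketch using the open-source delta-sharing Python client; the profile file and share/schema/table coordinates are hypothetical. The recipient reads the live shared table directly instead of receiving and maintaining a copy.

```python
# A minimal sketch of reading a shared table via the open Delta Sharing
# protocol; the profile file and table coordinates are hypothetical.
import delta_sharing

table_url = "/path/to/provider.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```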
Value Discussion
- When a client wants to discuss "data quality," what types of problems are they likely facing?
- How can zero-copy data sharing provide value internally within a large organization?