Data & AI from First Principles
Overview
Introduction
We will focus on timeless ideas that allow you to reason about tradeoffs for a specific use case's needs. These concepts will not change even as the industry, products, and roles continue to evolve.
We will not be discussing technical how-to. Questions like "why am I getting this Spark bug?" are off-topic. You will have plenty of opportunities to learn technical implementation through regular experience, certification programs, and other field enablement. That said, we will get very low-level and technical in order to reach our ultimate goal of understanding the "why" behind the tech.
We'll start with history and theory and gradually use this foundation to learn the Databricks product and the modern ecosystem. Many of the readings are several years old, but their continued relevance is a testament to their timelessness, which makes them essential if we want to truly think from first principles.
How to Progress Through a Session
Foundation
The foundation includes classic chapters, papers, and blog posts that have shaped the field of data and analytics. While some of these were written by Databricks founders and employees, they are all vendor-agnostic contributions to the world. Some are many years old but marked the original start of a new trend.
Application
Learn about the given products using your preferred learning style, whether that's coding and tinkering or reading and listening, focusing on areas you're curious about. You will have plenty of time in your day job to troubleshoot and implement specifics; this class is about becoming aware of areas you haven't had a chance to focus on yet. The goal is not to become an expert in everything but to create pointers in your head to areas you will dig into later as you work.
Technical Discussion
We will have a small-group discussion on technical topics that set the stage for why the product is valuable in a new way. Questions are open-ended and have multiple right answers; the goal is not to show off. You don't need to memorize every detail to repeat to a customer, because you will pick those up through other enablement.
Value Discussion
We will open up the "why" behind our product and practice articulating this value to different audiences with diverse needs. This is not "pitch training"; it's about truly understanding the ideas behind products in the industry. Questions are open-ended and have multiple right answers.
The Book
In the earlier sessions, we will be reading Martin Kleppmann, Designing Data-Intensive Applications (2017).
- O’Reilly Media / Other ways to buy
- I recommend expensing an O'Reilly subscription so you can read other O'Reilly books too; many of them are relevant to the work of a data engineer or data scientist.
Schedule
7/12 Session 0: Class Kickoff
We’ll discuss the class structure and answer questions.
7/19 Session 1: Data Systems I: Storage and Retrieval
Foundation
- Read the Overview in this document
- Kleppmann ch. 1: Foundations
- Kleppmann ch. 3: Storage and Retrieval (the SSTables, LSM-Trees, and B-Trees sections are optional)
Application
Technical Discussion
- Compare and contrast Apache Spark to other OLAP systems
- A client wants to know if Spark can replace HBase or Cassandra. What questions should you ask them to determine whether this would make sense?
- Does Delta Lake support indexing?
- How do you evolve schemas in Parquet? How does that compare to Delta Lake?
- How does Spark natively support both dataframe-style Python development and tabular SQL queries? (a minimal sketch follows this list)
- How does Spark natively support structured, semi-structured, and unstructured data?
- How is it possible that Spark provides libraries for ML, SQL, Graph, and Structured Streaming under a single framework and development model?
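As background for the DataFrame/SQL question above, here is a minimal PySpark sketch; the sample data and column names are made up. Both expressions of the query compile to the same logical plan under Spark's Catalyst optimizer, which is what lets one engine serve both development styles.

```python
# A minimal sketch: the same aggregation expressed through the DataFrame
# API and through SQL. The data below is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-vs-sql").getOrCreate()

events = spark.createDataFrame(
    [("click", 3), ("view", 7), ("click", 5)],
    ["event_type", "duration_s"],
)
events.createOrReplaceTempView("events")

# DataFrame-style Python development...
df_result = (
    events.groupBy("event_type")
          .agg(F.avg("duration_s").alias("avg_duration"))
)

# ...and the equivalent tabular SQL query over the same data.
sql_result = spark.sql(
    "SELECT event_type, AVG(duration_s) AS avg_duration "
    "FROM events GROUP BY event_type"
)

# Both produce the same plan and the same rows.
df_result.show()
sql_result.show()
```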
Value Discussion
- How would you deploy Spark without Databricks?
- Why does it matter that Spark and Delta Lake are both open source?
- What is so hard about supporting ETL, ML, and SQL analytics in a single processing framework?
7/26 Session 2: Data Systems II: Consistency and the Cloud
Foundation
- Kleppmann ch. 7: Transactions: The slippery concept of a transaction
- Kleppmann ch. 9: Consistency: Consistency guarantees (Optional)
- Photon Paper
- Snowflake Paper
Application
Technical Discussion
- How do Lakehouse formats give ACID guarantees entirely on cloud object storage, even though cloud object storage itself provides no ACID semantics?
- How did one perform MERGE on columnar files (Parquet, ORC) before Lakehouse formats? (a Delta MERGE sketch follows this list)
- How do Photon and Snowflake implement vectorized processing?
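To ground the MERGE question above, here is a hedged sketch of an upsert with the Delta Lake Python API, assuming a SparkSession already configured for Delta; the path, sample data, and column names are hypothetical. The Delta transaction log, not the object store, is what makes the operation atomic.

```python
# A minimal sketch of a Delta Lake MERGE (upsert). Before Lakehouse
# formats, this typically meant rewriting whole Parquet partitions.
# Assumes an existing Delta-configured SparkSession (`spark`); the path
# and columns are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.createDataFrame(
    [(1, "alice@new.example")], ["id", "email"]
)

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdate(set={"email": "u.email"})
    .whenNotMatchedInsertAll()
    .execute())
```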
Value Discussion
- Why is providing ACID on S3 better than simply using a traditional RDBMS or EDW that already supports ACID?
- Why does it matter that Delta Lake is built on top of Parquet?
- What are use cases for Time Travel (that you wouldn’t normally think of)?
8/2 Session 3: Data Systems III: Processing and Streaming
Foundation
Application
- Spark Structured Streaming Programming Guide: Programming Model, Join Operations, Output Modes, Output Sinks, and Triggers (the rest is optional)
Technical Discussion
- How can you guarantee simultaneous batch/stream reads/writes from cloud storage? (e.g. consistent 10min query from a table while data is being upserted every second)
- How does Spark provide a single API for batch and streaming? (a minimal sketch follows this list)
- What are the limitations of micro-batch processing?
- Discuss common CDC tools and how the ingestion pattern would change for each
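For the single-API question above, a minimal sketch assuming an existing SparkSession and hypothetical paths: the transformation function is identical in both modes; only the read and write calls change.

```python
# A minimal sketch of Spark's unified batch/streaming API; assumes an
# existing Delta-configured SparkSession (`spark`). Paths and the
# checkpoint location are hypothetical.

def summarize(df):
    # Identical transformation logic for both execution modes.
    return df.groupBy("event_type").count()

# Batch: read the table once and overwrite a summary.
batch_df = spark.read.format("delta").load("/tmp/delta/events")
(summarize(batch_df).write
    .format("delta")
    .mode("overwrite")
    .save("/tmp/delta/summary"))

# Streaming: the same logic, applied incrementally and continuously.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")
(summarize(stream_df).writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/chk/summary")
    .start("/tmp/delta/summary_stream"))
```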
Value Discussion
- What’s new about the Medallion Architecture compared to previous ways of doing ETL?
- What are the implications of the unification of batch and streaming for how we design data architectures? (data duplication, freshness, etc.)
- How does the low cost and scale of cloud storage change how we think about ETL?
- What benefits does streaming provide outside of simply reducing latency?
- What value does Delta Live Tables provide to traditional EDW-style ETL workflows?
8/9 Session 4: The Lakehouse
Foundation
Application
Technical Discussion
- What are the key limitations of a data lake?
- What are the common limitations of a data warehouse designed for on-premise hardware? (including Redshift, which was originally designed to be on-prem)
- What is the process for training an ML model using data stored in a cloud DW? (a sketch of the typical path follows this list)
- How do different streaming products (Flink, Snowflake Snowpipe, Redpanda, ksql, etc.) compare, and what kind of latency can you expect from each relative to cost?
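For the cloud-DW training question above, a hedged sketch of the typical path, with hypothetical connection details: the data has to be pulled out of the warehouse (here over JDBC, with the matching driver on the classpath) before an ML framework can use it.

```python
# A hedged sketch: extracting training data from a cloud DW over JDBC
# before any ML can happen. The URL, credentials, and table are
# hypothetical; assumes an existing SparkSession (`spark`) and the
# warehouse's JDBC driver on the classpath.
features = (spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-host:5439/dev")
    .option("dbtable", "analytics.features")
    .option("user", "ml_user")
    .option("password", "***")
    .load())

# The warehouse cannot run the training loop itself, so the data
# leaves it here, bound for whatever ML framework you use.
pdf = features.toPandas()
```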
Value Discussion
- Why should a lakehouse be open?
- Why do most people still use a data warehouse?
- If people don’t make multi-million dollar decisions based on one query benchmark, then why is TPC-DS still important?
8/16 Session 5: MLOps
Note: in the interest of time, we will not be covering model training in this course. For a theoretical introduction to model training, read "An Introduction to Statistical Learning" (ISLR). For a more practical introduction, try Andrew Ng's ML courses. For deep learning specifically, use Jeremy Howard's fast.ai course.
Per the 2015 Google paper "Hidden Technical Debt in Machine Learning Systems": "Only a small fraction of real-world ML systems is composed of the ML code [...]. The required surrounding infrastructure is vast and complex."
Foundation
Application
Technical Discussion
- Do you really need a feature store?
- How do most people serve models?
- What are the different patterns for model serving? (Name 3 and discuss tradeoffs)
Value Discussion
- How many MLOps capabilities does Databricks provide in the AIIA landscape survey?
- How does MLflow solve for the hidden tech debt problem? (a small tracking sketch follows this list)
- How does traditional DevOps compare to MLOps? Why would MLOps require a distinct platform?
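As a concrete anchor for the MLflow question above, a minimal tracking sketch; the parameter and metric values are illustrative, not from a real model. Logging parameters, metrics, and artifacts together per run is one way MLflow tames the "vast and complex" surrounding infrastructure described in the hidden-tech-debt paper.

```python
# A minimal sketch of MLflow experiment tracking; values are
# illustrative placeholders, not real training results.
import mlflow

with mlflow.start_run(run_name="demo-run"):
    mlflow.log_param("max_depth", 5)    # hyperparameters...
    mlflow.log_metric("rmse", 0.42)     # ...and evaluation metrics
    # mlflow.log_artifact("model.pkl")  # plus versioned artifacts
```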
8/23 Session 6: The Modern Data Architecture
Foundation
Application
8/30 Session 7: Managing Data in the Organization
Foundation
- If interested, read: Great divide, Domain Ownership, Data as a Product
- Don't worry about the terminology; just focus on the core concept of data ownership
Application
Technical Discussion
- How is Delta Sharing better than copying data? (a minimal client sketch follows this list)
- How does Databricks Marketplace compare to the Snowflake Data Cloud?
- What does Unity Catalog provide that was previously not possible with Hive Metastore and legacy Databricks access control mechanisms?
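For the Delta Sharing question above, a hedged sketch using the open-source delta-sharing Python client; the profile file and share/schema/table coordinates are hypothetical. The recipient reads the live shared table directly instead of receiving and maintaining a copy.

```python
# A minimal sketch of reading a shared table via the open Delta Sharing
# protocol; the profile file and table coordinates are hypothetical.
import delta_sharing

table_url = "/path/to/provider.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```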
Value Discussion
- When a client wants to discuss "data quality," what types of problems are they likely facing?
- How can zero-copy data sharing provide value internally within a large organization?