Transactions
April 17, 2025
1
LECTURE 22
Wrapping Up:
Reservoir Sampling
MapReduce
MapReduce’s Legacy and Spark
—
Database Sampling
Sampling Tips and Pitfalls
Reservoir Sampling
Reservoir Sampling Probabilities/Proofs
[extra] Algorithm L: Optimization
Stratified Sampling
[extra] Visualizations
2
Lecture 22, Data 101, Spring 2025
What other sampling methods exist?
Database sampling has several drawbacks!
When do you sample the database (i.e., table sample)?
3
Up Next: Sampling Methods
In this class, we’ll discuss two flexible ways of sampling that address the issues of table sampling.
4
Reservoir Sampling: The idea
Goals: k-sized simple random sample i.i.d. without replacement.� Linear runtime and single pass
5
(i.e., total records n unknown during sampling).
Reservoir Sampling: The idea
Goals: k-sized simple random sample i.i.d. without replacement.� Linear runtime and single pass
6
Strategy:
Reservoir sampling process:
(i.e., total records n unknown during sampling).
Simple Reservoir Sample
# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling
from random import randrange
def reservoir_sample(data, n, k):
# fill the reservoir array
r = []
for i in range(k):
r.append(data[i])
# replace elements w/gradually decreasing prob.
for i in range(k, n):
# generates a uniform integer between 0 and a-1
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
data = list(range(1000))
n, k = len(data), 5
r = reservoir_sample(data, n, k)
r
7
Single pass!�Constant sample size!
Demo
Reservoir Sampling Probabilities/�Proofs
MapReduce
MapReduce’s Legacy and Spark
—
Database Sampling
Sampling Tips and Pitfalls
Reservoir Sampling
Reservoir Sampling Probabilities/Proofs
[extra] Algorithm L: Optimization
Stratified Sampling
[extra] Visualizations
8
Lecture 22, Data 101, Spring 2025
Reservoir Sampling SRS Proof Outline
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
9
Reservoir Sampling SRS Proof Outline
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
10
Reservoir sample S_i of size k records produced after step i of algorithm.
→ Every record must have equal probability of being in S_n (nth sample).
i.e., = ????
🤔
A. 1/n
B. k/n
C. (k-1)/n
D. 1/(n^k)
E.
F. Something else
Pick a rand number from 1..i, if <=k, then add to sample Pi = k/i
What is the probability of record i being in S_n, the nth (and final) reservoir sample?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Reservoir Sampling SRS Proof Outline
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
12
Reservoir sample S_i of size k records produced after step i of algorithm.
→ Every record must have equal probability of being in S_n (nth sample).
i.e., = ????
A. 1/n
B. k/n
C. (k-1)/n
D. 1/(n^k)
E.
F. Something else
*proof by induction* (CS 70/Data 140)
The gist of the proof
Proof by Induction
Property holds until i-1 (after I’ve processed 1…i-1)
every item until i-1 is included with probability k/(i-1)
Now, want to ensure that after i (after I’ve processed 1…i)
every item until i is included with probability k/i
Case 1: ensure that ith item is included with prob k/i
P_i = k/i easy, by definition
Case 2: ensure that remaining items is included with prob k/i
Probabiliity that a given item m from 1…i-i is included in the sample of k after processing i-1 items: k/(i-1)
Probability that it will get to stay at after the ith step:
13
Empirical Proof
import numpy as np
K = 5
N_SAMPLES = 10000
samples = []
N = 1000
data = list(range(N))
for j in range(N_SAMPLES):
samples += \
reservoir_sample(data, N, k)
unique, counts = np.unique(
samples, return_counts=True)
ax = pd.Series(samples).plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
ax.set_title(f"Distribution of frequencies of all {N} values")
14
Demo
[Extra] Reservoir Sampling SRS Proof Outline
15
Proof by induction (see CS 70, Data 140)
🏡
Algorithm L: Reservoir Sampling Optimization
MapReduce
MapReduce’s Legacy and Spark
—
Database Sampling
Sampling Tips and Pitfalls
Reservoir Sampling
Reservoir Sampling Probabilities/Proofs
[extra] Algorithm L: Optimization
Stratified Sampling
[extra] Visualizations
16
not covered (out of scope)
Lecture 22, Data 101, Spring 2025
Algorithm L: Small optimization to Reservoir
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
17
Calling a random number generator for every record can be slow, particularly when we have trillions of records.
Algorithm L: Small optimization to Reservoir
How can we reduce the number of calls to randrange()?
Consider the number of rows skipped between chosen values:
Algorithm L approach: pick geometrically distributed random gaps, then only call randrange on a much smaller set of records.
18
Sampling gap distribution for S_1000 from 100k records�(gaps between chosen values 1 to 100k)
[Extra] Algorithm L Implementation
[extra]
# This is called Algorithm L, ported from
# https://en.wikipedia.org/wiki/Reservoir_sampling
from random import random, randrange
from math import exp, log, floor
def reservoir_sample_L(data, n, k):
# fill the reservoir array
r = []
for i in range(k):
r.append(data[i])
# generates a uniform [0,1) random number
w = exp(log(random())/k)
while i < n:
i = i + floor(log(random())/log(1-w)) + 1
if i < n:
# replace random reservoir item with item i
r[randrange(k)] = data[i]
w = w * exp(log(random())/k)
return r
19
Assume this works as an optimal strategy for reservoir sampling.
Demo
[Extra] The plotting code
[extra]
import numpy as np
import pandas as pd
def plot_gaps(r):
r.sort()
gaps = pd.Series(np.diff(r))
gaps.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
data = list(range(100000))
n = len(data)
k = 1000
r = reservoirSample(data, n, k)
plot_gaps(r)
20
Demo
[Extra]
Reservoir Sampling in SQL
MapReduce
MapReduce’s Legacy and Spark
—
Database Sampling
Sampling Tips and Pitfalls
Reservoir Sampling
Reservoir Sampling Probabilities/Proofs
[extra] Algorithm L: Optimization
Stratified Sampling
[extra] Visualizations
21
Lecture 22, Data 101, Spring 2025
Reservoir Sampling as a Table-Valued UDF
%%sql
-- create the reservoir_swaps UDF --
DROP TYPE IF EXISTS reservoir_pair CASCADE;
CREATE TYPE reservoir_pair AS (rownum integer, pos integer);
CREATE OR REPLACE FUNCTION reservoir_swaps(k integer, n integer) RETURNS setof reservoir_pair
AS $$
# optimized reservoir sampling algorithm, � # Algorithm L
$$
LANGUAGE 'plpython3u'
VOLATILE
RETURNS NULL ON NULL INPUT;
CREATE OR REPLACE FUNCTION
reservoir_rows(k integer, n integer)
RETURNS setof integer
AS $$
SELECT MAX(rownum) AS rownum
FROM reservoir_swaps(k, n)
GROUP by pos $$
LANGUAGE 'sql'
VOLATILE;
22
We skip the details of this User Defined Function (UDF)
It returns a table with one attribute reservoir_rows for the k row numbers in S_n.
Demo
Reservoir Sampling in SQL
%%sql
WITH rrows AS (� SELECT reservoir_rows(10, count(*)::integer) � AS rows
FROM batting),
rbatting AS (
SELECT ROW_NUMBER() OVER(), * FROM batting)
SELECT *
FROM rbatting, rrows
WHERE row_number = rows;
23
Demo
Stratified Sampling
MapReduce
MapReduce’s Legacy and Spark
—
Database Sampling
Sampling Tips and Pitfalls
Reservoir Sampling
Reservoir Sampling Probabilities/Proofs
[extra] Algorithm L: Optimization
Stratified Sampling
[extra] Visualizations
24
Lecture 22, Data 101, Spring 2025
What about Stratified Sampling?
Stratified sampling: Sample subpopulations.
aka GROUP BY sampling (in SQL):
In PostgreSQL:
25
Demo
What about Stratified Sampling?
%%sql
-- Stratified Sampling with Reservoirs
WITH grprows AS (
SELECT teamid,
reservoir_rows(10, COUNT(*)::integer) AS rows
FROM batting
GROUP BY teamid
),
rbatting AS (
SELECT row_number() OVER(� PARTITION BY teamid),� *
FROM batting�)
SELECT *
FROM rbatting b, grprows g
WHERE row_number = rows AND b.teamid = g.teamid
ORDER BY b.teamid;
26
4181385
Demo
Join at slido.com�#transactions
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
Updates create challenges
So far, we’ve largely focused on data science/read-only workloads.�In many settings, we need to also support updates.
When updating data, we want correctness + speed, particularly when users are accessing and modifying the same relations.
Course/lecture goals:
28
Today, a glance at database internals: transactions. More in CS186: Database Systems!
transactions
Two Main Features We Expect from our Database System for Updates
Concurrency Control:
Recovery:
To understand these features, we need to introduce the concept of transactions.
29
transactions
Transactions/�TCL
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
30
Lecture 22, Data 101, Spring 2025
What is a Transaction?
Colloquially, a transaction in a database is a unit of work�that should appear to “happen together.”
Classic example: Debit/credit banking transaction, i.e.,�moving $1k from one account (1111) to another (9999).
31
BEGIN
-- "debit" one account
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;
-- "credit" the other account
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
COMMIT
These SQL commands need to “happen together.”
BEGIN, COMMIT are SQL TCL (Transaction Control Language) commands.
transactions
What constitutes “happening together”? Observation #1
BEGIN
-- "debit" one account
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;
-- "credit" the other account
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
COMMIT
A few observations:
(to be continued…)
32
transactions
SQL TCL
33
Data Definition Language
Transaction Control Language
Data Manipulation Language
Data�Query Language
Data�Control�Language
✅
✅
✅
BEGIN
transactions
SQL TCL, Briefly
A transaction in SQL is a list of commands sandwiched by BEGIN and COMMIT.
If BEGIN/COMMIT not specified, then most systems will autocommit individual SQL commands, i.e., auto-wrap each command in its own transaction.
Generally (with slight syntax variation across systems):
34
BEGIN� <command 1>� …� <command n>�COMMIT
transactions
[extra] SQL TCL: Savepoints
Savepoints let you break your transactions up into pieces. You can then “partially rollback” to a prior savepoint, or abort altogether.
Use case: beyond the scope of this class, but generally used with SQL conditionals or as part of database constraints.
35
BEGIN
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
SAVEPOINT debit_done;
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
SAVEPOINT credit_done;
ROLLBACK TO SAVEPOINT debit_done;
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
END
(arbitrary example; what is this doing?)
transactions
The ACID Principle
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
36
Lecture 22, Data 101, Spring 2025
What constitutes “happening together”? Observations
BEGIN
-- "debit" one account
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;
-- "credit" the other account
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
COMMIT
A few observations:
37
These four properties, known as ACID, define how transactions guarantee�(1) concurrency control and (2) recovery.
transactions
ACID: Basic Guarantees
ACID defines four properties of transactions that guarantee concurrency control and recovery.
Atomicity �
Consistency
�
Isolation
�.
Durability
38
transactions
ACID: Basic Guarantees
ACID defines four properties of transactions that guarantee concurrency control and recovery.
39
Either all the commands are reflected in the database, or none are.�Ex: Both debit+credit should occur, or both should fail to occur.
If COMMIT succeeds, all the database integrity checks hold true.�(primary key/foreign keys, constraints, etc.)
Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not “see” each other’s intermediate results.
If COMMIT succeeds, all changes from the transaction persist, even if there is a power failure or a reboot, until the transaction is overwritten by a later transaction.
Atomicity�
Consistency�
Isolation��
Durability
transactions
[History] Why ACID? Unknown, but…
40
Jim Gray, PhD, UC Berkeley, Industry/ Academic researcher. 1998 Turing Award Winner “For seminal contributions to database and transaction processing research and technical leadership in system implementation.”
1983
[Exercise] ACID
BEGIN
-- "debit" one account
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;
-- "credit" the other account
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
COMMIT
A few observations:
41
Match 1-4 with A, C, I, and D from the ACID Principle.
🤔
A. D, I, C, A
B. I, C, A, D
C. A, C, I, D
D. A, C, D, I
E. Something else
transactions
Match 1-4 with A, C, I, and D from the ACID Principle.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
[Solution] ACID
BEGIN
-- "debit" one account
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;
-- "credit" the other account
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 9999;
COMMIT
A few observations:
A. We need both debit and credit to happen, i.e., we should not have partial transactions.
C. At the end of transactions, any database constraints should still be satisfied.
I. Even if another transaction happens simultaneously, one should appear to have finished “first.”
D. A committed transaction should appear to have happened, even if there is a power failure/reboot later.
43
transactions
How does the database address each ACID property?
44
�
If COMMIT succeeds, all the database integrity checks hold true.�(primary key/foreign keys, constraints, etc.)
Atomicity�
Consistency�
Isolation��
Durability
Standard database checks (relatively efficient to check for core things like attribute types, keys, constraints, etc.)
transactions
How does the database address each ACID property?
45
Either all the commands are reflected in the database, or none are.�Ex: Both debit+credit should occur, or both should fail to occur.
�
��
If COMMIT succeeds, all changes from the transaction persist, even if there is a power failure or a reboot, until the transaction is overwritten by a later transaction.
Atomicity�
Consistency�
Isolation��
Durability
The database’s internal recovery system.
After a crash:
See CS186 for the implementation. Devil is in the details!
transactions
How does the database address each ACID property?
46
Atomicity�
Consistency�
Isolation��
Durability
�
�
Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not “see” each other’s intermediate results.
Provided by concurrency control, a component of the database. We’ll grasp the intuition today!!
transactions
Isolation
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
47
Lecture 22, Data 101, Spring 2025
Isolation
Isolation: Concurrent transactions should externally appear to run sequentially.
48
Hire Mercy as the new VP of Engineering!
Move the entire payroll of the London Office to the Cambridge Office!
Prepare tax projections for the 2nd quarter!
transactions
Isolation,
The challenge: How do we execute these transactions “in isolation” but “concurrently”? With one single machine?
49
Prepare tax projections for the 2nd quarter!
Hire Mercy as the new VP of Engineering!
Move the entire payroll of the London Office to the Cambridge Office!
Assumption: the precise order of these three transactions doesn’t matter. What matters is that they appeared to have been executed by the DBMS in some order.
Isolation,
For simplicity, we will limit our discussion to reads and writes of individual “objects”:
“Objects” := records (for now)
i-th transaction has Read from O:
Ri(O)
i-th transaction has Write to O:
Wi(O [= value])
50
Prepare tax projections for the 2nd quarter!
Hire Mercy as the new VP of Engineering!
Move the entire payroll of the London Office to the Cambridge Office!
Transaction Schedules
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
51
Lecture 22, Data 101, Spring 2025
Determining Transaction Schedules that Maintain Isolation
Our goal: Understand how multiple transactions can run concurrently (for performance) but also in isolation (for ACID).
To do so, we’ll define the following:
52
transactions
Example: Two Transactions
53
-- set Parth’s salary to 10% more �-- than Jonah’s
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Jonah')
WHERE name = 'Parth';
-- set Parth’s salary to 10% more �
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Parth')
WHERE name = ‘Parth';
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
transactions
Transaction Schedules
A transaction schedule is a ordered list of actions from a set of transactions.
54
A proposed Transaction Schedule�of T1 and T2
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
time
While there are many possible transaction schedules, a DBMS will pick one with which to schedule and execute read/write actions.
transactions
Transaction Schedules
A transaction schedule is a ordered list of actions from a set of transactions.
55
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
time
time slot 1
time slot 2
time slot 3
time slot 4
✅
transactions
Transaction Schedules
A transaction schedule is a ordered list of actions from a set of transactions.
56
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
time
✅
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
transactions
Determining Transaction Schedules that Maintain Isolation
Our goal: Understand how multiple transactions can run concurrently (for performance) but also in isolation (for ACID).
To do so, we’ll define the following:
57
✅
transactions
Serial Schedules
A serial schedule is a transaction schedule for which �actions from different transactions are not interleaved.
58
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
serial schedule
time
serial schedule
not a serial schedule;�transactions interleaved
Serial schedules exhibit no concurrency, because actions of a transaction are executed together and separate from those in other transactions.
transactions
Do these schedules satisfy the isolation property?
59
Which of these three schedules satisfy the isolation property? Select all.
🤔
Isolation: Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not (appear to) “see” each other’s intermediate results.
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
time
1.
2.
3.
transactions
Which of these three schedules satisfy the isolation property? Select all.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Serializability
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
61
Lecture 22, Data 101, Spring 2025
Do these schedules satisfy the isolation property?
62
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
time
1.
2.
3.
Yes
Yes
Yes!
Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.
transactions
All three schedules satisfy the isolation property!
63
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
R1(J) | |
W1(P) | |
| R2(P) |
| W2(P) |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
time
1.
2.
3.
It is okay that these two serial schedules produce non-equivalent database outcome states!
Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.
transactions
All three schedules satisfy the isolation property!
64
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
time
1.
2.
3.
Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.
Despite the interleaving, this schedule has an equivalent database outcome to one of the serial schedules!
transactions
Schedule 3 is a serializable schedule
65
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
time
1.
2.
3.
Despite the interleaving, this schedule has an equivalent database outcome to one of the serial schedules!
Schedule 3 is a serializable schedule: a transaction schedule whose database outcome is equivalent to some serial schedule.
transactions
Unserializable Schedules
66
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
time
2.
3.
Not all schedules are serializable! This is an unserializable schedule, because there is no serial equivalent, and therefore transactions do not appear isolated.
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
serializable schedule
serial schedule
4.
T1 | T2 |
| R2(P) |
R1(J) | |
W1(P) | |
| W2(P) |
⚠️
transactions
A Joke
67
Strict 2-Phase Locking
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
68
Lecture 22, Data 101, Spring 2025
Summary so far
Our goal: Allow multiple transactions to run concurrently (for performance)�but also in isolation (for ACID).
To do so, we’ve traced the following steps:
69
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
W1(P) | |
| W2(P) |
serial schedule
serializable schedule
unserializable schedule
transactions
DBMS goals
Under the covers, a database system can allow serializable schedules that may not be serial, but after execution have the same outcome as some serial schedule.
CS186: Build systems that guarantee serializability for all executed schedules.
70
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
| R2(P) |
| W2(P) |
R1(J) | |
W1(P) | |
T1 | T2 |
| R2(P) |
R1(J) | |
W1(P) | |
| W2(P) |
serial schedule
serializable schedule
unserializable schedule
❌
transactions
Two final goals of this lecture
71
transactions
Database locking
One of the most straightforward implementations that databases can use to ensure serializability is called Strict Two-Phase Locking (Strict 2PL).
Skipped slides: details on Strict 2PL.
72
transactions
Determining Serializability:
Conflicting Actions
Transactions/TCL
The ACID Principle
Isolation
Transaction Schedules
Serializability
Details:
[Extra] Additional slides
73
Lecture 22, Data 101, Spring 2025
How do we know if a schedule is serializable?
We like serializable schedules.
74
Conflicting actions between transactions will determine if a schedule is serializable.
What does it mean???
Let’s dive in!
transactions
Conflicting Actions
Def: Two actions conflict if:
Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.
75
transactions
Which of the following are conflicting actions?
Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.
Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.
76
W1(P)
R2(P)
T1 writes �Parth salary �as 110000
T2 reads �Parth salary �as 110000
R1(G)
W2(G)
T1 reads Gabi salary as 100000
T2 writes Gabi salary as 121000
W1(J)
W2(J)
T1 writes Jonah salary as 300000
T2 writes Jonah salary as 0
R1(G)
R2(G)
T1 reads Gabi salary as 121000
T2 reads Gabi salary as 121000
W1(J)
W2(P)
T1 writes Jonah salary as 0
T2 writes Parth salary as 200000
A.
B.
C.
D.
E.
For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.
🤔
transactions
Select all for which the following is true: Flipping the order of the two actions in T1 and T2 would result in a different database outcome state.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Which of the following are conflicting actions?
78
W1(P)
R2(P)
T1 writes �Parth salary �as 110000
T2 reads �Parth salary �as 110000
R1(G)
W2(G)
T1 reads Gabi salary as 100000
T2 writes Gabi salary as 121000
W1(J)
W2(J)
T1 writes Jonah salary as 300000
T2 writes Jonah salary as 0
R1(G)
R2(G)
T1 reads Gabi salary as 121000
T2 reads Gabi salary as 121000
W1(J)
W2(P)
T1 writes Jonah salary as 0
T2 writes Parth salary as 200000
R2(P)
W1(P)
T2 reads �Parth salary �as ????
T1 writes�Parth salary �as ????
W2(G)
R1(G)
T2 writes Gabi salary as ????
T1 reads Gabi salary as ????
W2(J)
W1(J)
T2 writes Jonah salary as ????
T1 writes Jonah salary as ????
R2(G)
R1(G)
T2 reads Gabi salary as ????
T1 reads Gabi salary as ????
W2(P)
W1(J)
T2 writes Parth salary as ????
T1 writes Jonah salary as ????
Suppose T1 → T2 in a schedule. For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.
hypothetically
hypothetically
hypothetically
hypothetically
hypothetically
Which of the following are conflicting actions?
Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.
Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.
79
For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.
W1(P)
R2(P)
T1 writes �Parth salary �as 110000
T2 reads �Parth salary �as 110000
R1(G)
W2(G)
T1 reads Gabi salary as 100000
T2 writes Gabi salary as 121000
W1(J)
W2(J)
T1 writes Jonah salary as 300000
T2 writes Jonah salary as 0
R1(G)
R2(G)
T1 reads Gabi salary as 121000
T2 reads Gabi salary as 121000
W1(J)
W2(P)
T1 writes Jonah salary as 0
T2 writes Parth salary as 200000
A.
B.
C.
D.
E.
transactions
Which of the following are conflicting actions?
Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.
Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.
80
W1(P)
R2(P)
T1 writes �Parth salary �as 110000
T2 reads �Parth salary �as 110000
R1(G)
W2(G)
T1 reads Gabi salary as 100000
T2 writes Gabi salary as 121000
W1(J)
W2(J)
T1 writes Jonah salary as 300000
T2 writes Jonah salary as 0
R1(G)
R2(G)
T1 reads Gabi salary as 121000
T2 reads Gabi salary as 121000
W1(J)
W2(P)
T1 writes Jonah salary as 0
T2 writes Parth salary as 200000
can be flipped! no conflicts!
cannot be flipped! conflicting actions!
For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.
transactions
How do we know if a schedule is serializable?
We like serializable schedules.
A schedule is serializable if all conflicting actions dictate a specific ordering of the transactions (with no cycles)
81
Conflicting actions between transactions determine if a schedule is serializable.
transactions
The example again
82
T1:
R1(J) |
W1(P) |
R2(P) |
W2(P) |
T2:
3.
T1 | T2 |
| R2(P) |
R1(J) | |
| W2(P) |
W1(P) | |
serializable
4.
T1 | T2 |
| R2(P) |
R1(J) | |
W1(P) | |
| W2(P) |
unserializable
5.
T1 | T2 |
R1(J) | |
| R2(P) |
| W2(P) |
W1(P) | |
R2(P)
W1(P)
W2(P)
W1(P)
R2(P)
W1(P)
W2(P)
W1(P)
R2(P)
W1(P)
W2(P)
W1(P)
⚠️
⚠️
serializable
Schedule 4: The two pairs of conflicting actions imply two different orders to T1 and T2.
transactions
Performance Tradeoffs: Snapshot Isolation
Transactions
The ACID Principle
Isolation and Serializability
Strict 2-Phase Locking
Conflicting Actions
[Extra] Conflict Graphs
[Extra] Conflict Serializable
Weak Isolation
[Extra] Additional slides
83
Lecture 22, Data 101, Spring 2025
Final thoughts: Why Should Data Engineers Know Transactions?
You may think that transactions (and serializability) are very much in the weeds of DBMS design, which we don’t particularly implement in this course. However…
84
Inevitably you will update a database and manage data from transactional databases!
If your DB is slow for transactional reasons:
Finally, transaction concepts are also quite useful outside of databases.
transactions
Serialized Transactions: A summary
Serialized transactions ensure ACID properties of shared access, particularly Isolation.
85
Life is good?…Except…
transactions
Approximating Serialized Transactions with Weak Isolation
Serialized transactions ensure ACID properties of shared access, particularly Isolation.
86
Enter: Weak Isolation.
Life is good?…Except…
transactions
Snapshot Isolation
Snapshot isolation is a weaker form of isolation than serialization, but it’s good enough when we prefer more concurrency/higher performance.
87
At transaction start: Take a “snapshot” of the database, off which to do reads/writes.
transactions
Snapshot Isolation is Actually Popular
Isolation levels (both default and maximum) vary in support across different database engines.
Marketing also varies!
88
The maximum levels of many cloud DBMSs is not always the theoretical maximum,�which is “serializable” transactions.
[Bonus] Strict 2-Phase Locking: Details
Transactions
The ACID Principle
Isolation and Serializability
Strict 2-Phase Locking
Conflicting Actions
[Extra] Conflict Graphs
[Extra] Conflict Serializable
Weak Isolation
[Extra] Additional slides
89
Lecture 22, Data 101, Spring 2025
Database locking
How do databases ensure serializability?
One of the most straightforward implementations is called Strict Two-Phase Locking (Strict 2PL).
Locking is the process of ensuring that 2 conflicting actions happen in order.
90
transactions
Strict Two-Phase Locking (Strict 2PL)
Phase 1: During the transaction, lock objects before use. �Two types of locks:
Phase 2: At the end of the transaction (i.e., � COMMIT or ROLLBACK),� release all locks at once.
91
# locks held
acquisition phase
time
release all locks at end of xact
The Strict 2PL algorithm allows only serializable schedules!
Note that schedules can result in deadlock. See Discussion for more info/practice!
transactions
Strict Two-Phase Locking (Strict 2PL), Practically
What objects are we locking?
What does it mean to “acquire” or “release” a lock?
92
# locks held
acquisition phase
time
release all locks at end of xact
transactions
[Extra]
Determining Serializability: Conflict Graphs
Transactions
The ACID Principle
Isolation and Serializability
Strict 2-Phase Locking
Conflicting Actions
[Extra] Conflict Graphs
[Extra] Conflict Serializable
Weak Isolation
[Extra] Additional slides
93
Lecture 22, Data 101, Spring 2025
[Exercise] Determining Conflicting Actions
94
time
1.
2.
🤔
What are the conflicting actions in each of the schedules?
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
transactions
[Exercise] Determining Conflicting Actions
95
time
1.
2.
What are the conflicting actions in each of the schedules?
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
1. W1(P) and W2(P)�// write/write to same object
2. R1(G) and W2(G); // read/write same obj� W1(P) and R2(P) // read/write same obj
transactions
[Exercise] Determining Serializability
Suppose we have the following conflicting actions:
1. W1(P) and W2(P)
2. R1(G) and W2(G); W1(P) and R2(P)
Which of the following schedules are serializable?
96
time
1.
2.
A. Serializable schedule, i.e., equivalent to some serial schedule of T1 and T2
B. Unserializable schedule, i.e., no equivalent serial schedule exists
🤔
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
(no slido)
transactions
[Exercise] Determining Serializability
Suppose we have the following conflicting actions:
1. W1(P) and W2(P)
2. R1(G) and W2(G); W1(P) and R2(P)
Which of the following schedules are serializable?
97
time
1.
2.
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
A. Serializable! Equivalent to T2 happening before T1.
B. Unserializable! Conflicting actions can’t be “flipped.”
transactions
An Algorithm for Determining Serializability
Serializability of a schedule can be determined by drawing its conflict graph.
98
An Algorithm for Determining Serializability
Serializability of a schedule can be determined by drawing its conflict graph.
99
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
time
1.
2.
T1
T2
Given: Conflicting actions
1. W1(P) and W2(P)
2. R1(G) and W2(G); � W1(P) and R2(P)
Serializable!
Unserializable!
An Algorithm for Determining Serializability
Serializability of a schedule can be determined by drawing its conflict graph.
100
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
time
1.
2.
T1
T2
Given: Conflicting actions
1. W1(P) and W2(P)
2. R1(G) and W2(G); � W1(P) and R2(P)
Serializable!
Unserializable!
T1
T2
An Algorithm for Determining Serializability
101
time
1.
2.
T1 | T2 |
| R2(J) |
R1(J) | |
| W2(P) |
W1(P) | |
T1 | T2 |
R1(G) | |
| R2(P) |
W1(P) | |
| W2(G) |
Serializable! Equivalent to T2 happening before T1.
Unserializable! Conflicting actions can’t be “flipped.”
If the conflict graph has no cycles (acyclic), then the schedule is serializable. Otherwise, it has cycles and it is unserializable.
T1
T2
T1
T2
Given: Conflicting actions
1. W1(P) and W2(P)
2. R1(G) and W2(G); � W1(P) and R2(P)
(proof later)
transactions
How do we know if a schedule is serializable?
We like serializable schedules.
102
Conflicting actions between transactions will determine if a schedule is serializable.
We did it!
The strategy for a determining serializability of a given schedule of interleaved transactions:
🌟
transactions
[Extra]
Formal Terminology: Conflict Serializable
Transactions
The ACID Principle
Isolation and Serializability
Strict 2-Phase Locking
Conflicting Actions
[Extra] Conflict Graphs
[Extra] Conflict Serializable
Weak Isolation
[Extra] Additional slides
103
Lecture 22, Data 101, Spring 2025
Why does acyclic conflict graph → serializability?
Observations:
104
W1(P)
R2(P)
R1(G)
W2(G)
order
W1(P)
R2(P)
R1(G)
W2(G)
1.
2.
T1
T2
T1
T2
Unserializable! Conflicting actions can’t be “flipped.”
Serializable! Equivalent to T2 happening before T1.
no cycles!
cycles!
transactions
Formal Terminology: Conflict Serializable Schedule
Def A schedule is conflict serializable if and only if the conflict graph is acyclic (has no cycles).
Observations:
105
transactions
Formal Terminology: Conflict Serializable Schedule
Def A schedule is conflict serializable if and only if the conflict graph is acyclic (has no cycles).
Observations:
106
Lemma
If a schedule is conflict serializable, then it is serializable.
Proof:
transactions
[Extra] The converse is not true
Note that some serializable schedules are not necessarily conflict serializable!
From R&G Database Management Systems, Third Edition, Section 17.1, Figure 17.1 (p.550-1):
This schedule is equivalent to executing the transactions serially in the order T1, T2, T3, but it is not conflict equivalent to this serial schedule because the writes of T1 and T2 are ordered differently.
107
T1 | T2 | T3 |
R(A) | | |
| W(A) | |
| Commit | |
W(A) | | |
Commit | | |
| | W(A) |
| | Commit |
transactions
[Extra] Weak Isolation: Read Commit
Transactions
The ACID Principle
Isolation and Serializability
Strict 2-Phase Locking
Conflicting Actions
[Extra] Conflict Graphs
[Extra] Conflict Serializable
Weak Isolation
[Extra] Additional slides
108
Lecture 22, Data 101, Spring 2025
“Read Committed” Isolation level
109
transactions
Does it help us?
110
BEGIN � ISOLATION LEVEL serializable;
UPDATE students
SET gpa = 4.0
WHERE sid = 1234;
END;
BEGIN
ISOLATION LEVEL read committed;
SELECT avg(gpa)
FROM students;
END;
Locks every student but doesn’t need to be 100% correct!
Locks 1 student but must be serializable!
transactions
What could go wrong in Read Committed?
111
transactions
Repeatable Read Isolation
112
transactions
Snapshot Isolation is not (quite) Serializable
113
x = R1(G0) | | |
| y = R2(P0) | |
W2(P1 = 110000) | | |
| W2(G1 = 220000) | |
| | |
| | |
| | |
| |
Time
Pat (T1)
Gabi (T2)
P0: 200000
G0: 100000
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
NOT equivalent to either order (not serializable!)
Write skew anomaly: concurrent reads, and writes reflect the fact that they didn’t read each other’s writes!
transactions