2 of 113

Wrapping Up:

Reservoir Sampling

MapReduce

MapReduce’s Legacy and Spark

—

Database Sampling

Sampling Tips and Pitfalls

Reservoir Sampling

Reservoir Sampling Probabilities/Proofs

[extra] Algorithm L: Optimization

Stratified Sampling

[extra] Visualizations

Lecture 22, Data 101, Spring 2025

3 of 113

What other sampling methods exist?

Database sampling has several drawbacks!

Have to pick the “right” Bernoulli probability to set for queries
No control of output size
Sample only during scan; can’t sample based on data attributes themselves
Finally, not all data sources support this sort of sampling!

When do you sample the database (i.e., table sample)?

(Note: sampling is effectively a form of selection pushdown)
For efficiency: sample as close to scanning the data as possible.
This allows a smaller amount of data to flow through your pipeline, facilitating throughput.

4 of 113

Up Next: Sampling Methods

In this class, we’ll discuss two flexible ways of sampling that address the issues of table sampling.

Reservoir sampling picks a k-sized simple random sample in a single pass.
Stratified sampling (using reservoir sampling) to get k-sized samples per grouped attribute.

5 of 113

Reservoir Sampling: The idea

Goals: k-sized simple random sample i.i.d. without replacement.� Linear runtime and single pass

(i.e., total records n unknown during sampling).

6 of 113

Reservoir Sampling: The idea

Goals: k-sized simple random sample i.i.d. without replacement.� Linear runtime and single pass

Strategy:

Build a reservoir, i.e., a fixed-size array that holds k records.
Then scan the table.�For record r_i, where i = 1, …, n:

If i < k, reservoir has space. Insert record r_i into reservoir.
Else, insert record r_i by evicting a random reservoir record,�with some probability P_i based on the scan order i.

Reservoir sampling process:

produces an SRS
is linear in # rows

(i.e., total records n unknown during sampling).

7 of 113

Simple Reservoir Sample

# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling

from random import randrange

def reservoir_sample(data, n, k):

# fill the reservoir array

r = []

for i in range(k):

r.append(data[i])

# replace elements w/gradually decreasing prob.

for i in range(k, n):

# generates a uniform integer between 0 and a-1

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

data = list(range(1000))

n, k = len(data), 5

r = reservoir_sample(data, n, k)

Single pass!�Constant sample size!

n can be unbounded! (w/small changes to the function)
Can also parallelize (since you know P_i for every record r_i)

Demo

#transactions

8 of 113

Reservoir Sampling Probabilities/�Proofs

MapReduce

MapReduce’s Legacy and Spark

—

Database Sampling

Sampling Tips and Pitfalls

Reservoir Sampling

Reservoir Sampling Probabilities/Proofs

[extra] Algorithm L: Optimization

Stratified Sampling

[extra] Visualizations

Lecture 22, Data 101, Spring 2025

9 of 113

Reservoir Sampling SRS Proof Outline

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

10 of 113

Reservoir Sampling SRS Proof Outline

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Reservoir sample S_i of size k records produced after step i of algorithm.

→ Every record must have equal probability of being in S_n (nth sample).

i.e., = ????

🤔

A. 1/n

B. k/n

C. (k-1)/n

D. 1/(n^k)

F. Something else

Pick a rand number from 1..i, if <=k, then add to sample Pi = k/i

11 of 113

What is the probability of record i being in S_n, the nth (and final) reservoir sample?

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

12 of 113

Reservoir Sampling SRS Proof Outline

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Reservoir sample S_i of size k records produced after step i of algorithm.

→ Every record must have equal probability of being in S_n (nth sample).

i.e., = ????

A. 1/n

B. k/n

C. (k-1)/n

D. 1/(n^k)

F. Something else

*proof by induction* (CS 70/Data 140)

The gist of the proof

13 of 113

Proof by Induction

Property holds until i-1 (after I’ve processed 1…i-1)

every item until i-1 is included with probability k/(i-1)

Now, want to ensure that after i (after I’ve processed 1…i)

every item until i is included with probability k/i

Case 1: ensure that ith item is included with prob k/i

P_i = k/i easy, by definition

Case 2: ensure that remaining items is included with prob k/i

Probabiliity that a given item m from 1…i-i is included in the sample of k after processing i-1 items: k/(i-1)

Probability that it will get to stay at after the ith step:

case 2a: the ith item isn’t included in the sample

(1- k/i) = (i-k)/i

case 2b: ith item is included in the sample, but a diff item got evicted

k/i * (k-1)/k = (k-1)/i

sum = (i-1)/i
Overall probability k/(i-1) * (i-1)/i = k/i

14 of 113

Empirical Proof

import numpy as np

K = 5

N_SAMPLES = 10000

samples = []

N = 1000

data = list(range(N))

for j in range(N_SAMPLES):

samples += \

reservoir_sample(data, N, k)

unique, counts = np.unique(

samples, return_counts=True)

ax = pd.Series(samples).plot.hist(grid=True, bins=20, rwidth=0.9,

color='#607c8e')

ax.set_title(f"Distribution of frequencies of all {N} values")

Demo

#transactions

15 of 113

[Extra] Reservoir Sampling SRS Proof Outline

Proof by induction (see CS 70, Data 140)

🏡

16 of 113

Algorithm L: Reservoir Sampling Optimization

MapReduce

MapReduce’s Legacy and Spark

—

Database Sampling

Sampling Tips and Pitfalls

Reservoir Sampling

Reservoir Sampling Probabilities/Proofs

[extra] Algorithm L: Optimization

Stratified Sampling

[extra] Visualizations

not covered (out of scope)

Lecture 22, Data 101, Spring 2025

17 of 113

Algorithm L: Small optimization to Reservoir

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Calling a random number generator for every record can be slow, particularly when we have trillions of records.

18 of 113

Algorithm L: Small optimization to Reservoir

How can we reduce the number of calls to randrange()?

Consider the number of rows skipped between chosen values:

approximately geometric, has closed form! (we won’t prove this)
idea: don’t call randrange() on records that will be discarded anyway

Algorithm L approach: pick geometrically distributed random gaps, then only call randrange on a much smaller set of records.

Sampling gap distribution for S_1000 from 100k records�(gaps between chosen values 1 to 100k)

19 of 113

[Extra] Algorithm L Implementation

[extra]

# This is called Algorithm L, ported from

# https://en.wikipedia.org/wiki/Reservoir_sampling

from random import random, randrange

from math import exp, log, floor

def reservoir_sample_L(data, n, k):

# fill the reservoir array

r = []

for i in range(k):

r.append(data[i])

# generates a uniform [0,1) random number

w = exp(log(random())/k)

while i < n:

i = i + floor(log(random())/log(1-w)) + 1

if i < n:

# replace random reservoir item with item i

r[randrange(k)] = data[i]

w = w * exp(log(random())/k)

return r

[wikipedia]

Assume this works as an optimal strategy for reservoir sampling.

Demo

#transactions

20 of 113

[Extra] The plotting code

[extra]

import numpy as np

import pandas as pd

def plot_gaps(r):

r.sort()

gaps = pd.Series(np.diff(r))

gaps.plot.hist(grid=True, bins=20, rwidth=0.9,

color='#607c8e')

data = list(range(100000))

n = len(data)

k = 1000

r = reservoirSample(data, n, k)

plot_gaps(r)

Demo

#transactions

21 of 113

[Extra]

Reservoir Sampling in SQL

MapReduce

MapReduce’s Legacy and Spark

—

Database Sampling

Sampling Tips and Pitfalls

Reservoir Sampling

Reservoir Sampling Probabilities/Proofs

[extra] Algorithm L: Optimization

Stratified Sampling

[extra] Visualizations

Lecture 22, Data 101, Spring 2025

22 of 113

Reservoir Sampling as a Table-Valued UDF

%%sql

-- create the reservoir_swaps UDF --

DROP TYPE IF EXISTS reservoir_pair CASCADE;

CREATE TYPE reservoir_pair AS (rownum integer, pos integer);

CREATE OR REPLACE FUNCTION reservoir_swaps(k integer, n integer) RETURNS setof reservoir_pair

AS $$

# optimized reservoir sampling algorithm, � # Algorithm L

LANGUAGE 'plpython3u'

VOLATILE

RETURNS NULL ON NULL INPUT;

CREATE OR REPLACE FUNCTION

reservoir_rows(k integer, n integer)

RETURNS setof integer

AS $$

SELECT MAX(rownum) AS rownum

FROM reservoir_swaps(k, n)

GROUP by pos $$

LANGUAGE 'sql'

VOLATILE;

We skip the details of this User Defined Function (UDF)

It returns a table with one attribute reservoir_rows for the k row numbers in S_n.

Demo

#transactions

23 of 113

Reservoir Sampling in SQL

%%sql

WITH rrows AS (� SELECT reservoir_rows(10, count(*)::integer) � AS rows

FROM batting),

rbatting AS (

SELECT ROW_NUMBER() OVER(), * FROM batting)

SELECT *

FROM rbatting, rrows

WHERE row_number = rows;

Demo

#transactions

24 of 113

Stratified Sampling

MapReduce

MapReduce’s Legacy and Spark

—

Database Sampling

Sampling Tips and Pitfalls

Reservoir Sampling

Reservoir Sampling Probabilities/Proofs

[extra] Algorithm L: Optimization

Stratified Sampling

[extra] Visualizations

Lecture 22, Data 101, Spring 2025

25 of 113

What about Stratified Sampling?

Stratified sampling: Sample subpopulations.

aka GROUP BY sampling (in SQL):

want a k-sized sample per GROUP
"GROUP BY" columns are called subpopulations

In PostgreSQL:

Impossible with table samples (Bernoulli sampling on initial scans, not strata)
But can be implemented with reservoir_rows UDF and GROUP BY (next slide)

Demo

#transactions

26 of 113

What about Stratified Sampling?

%%sql

-- Stratified Sampling with Reservoirs

WITH grprows AS (

SELECT teamid,

reservoir_rows(10, COUNT(*)::integer) AS rows

FROM batting

GROUP BY teamid

rbatting AS (

SELECT row_number() OVER(� PARTITION BY teamid),� *

FROM batting�)

SELECT *

FROM rbatting b, grprows g

WHERE row_number = rows AND b.teamid = g.teamid

ORDER BY b.teamid;

4181385

Demo

#transactions

27 of 113

Join at slido.com�#transactions

ⓘ

Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.

28 of 113

Updates create challenges

So far, we’ve largely focused on data science/read-only workloads.�In many settings, we need to also support updates.

Single-user, one-at-a-time updates are easy.
But multi-user, simultaneous updates are challenging.

When updating data, we want correctness + speed, particularly when users are accessing and modifying the same relations.

Course/lecture goals:

Understand challenges that updates cause
Understand APIs/related guarantees.

Today, a glance at database internals: transactions. More in CS186: Database Systems!

transactions

29 of 113

Two Main Features We Expect from our Database System for Updates

Concurrency Control:

Many users query and update a database simultaneously.
How do we avoid confusion / incorrect state?

Recovery:

What happens when things fail?
Many such failure modes: Cancel modification partway,
app failure, DB engine failure, HW failure…

To understand these features, we need to introduce the concept of transactions.

transactions

30 of 113

Transactions/�TCL

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

31 of 113

What is a Transaction?

Colloquially, a transaction in a database is a unit of work�that should appear to “happen together.”

Classic example: Debit/credit banking transaction, i.e.,�moving $1k from one account (1111) to another (9999).

BEGIN

-- "debit" one account

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;

-- "credit" the other account

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

COMMIT

These SQL commands need to “happen together.”

BEGIN, COMMIT are SQL TCL (Transaction Control Language) commands.

transactions

32 of 113

What constitutes “happening together”? Observation #1

BEGIN

-- "debit" one account

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;

-- "credit" the other account

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

COMMIT

A few observations:

We need both debit and credit to happen, i.e., we should not have partial transactions.
…?

(to be continued…)

transactions

33 of 113

SQL TCL

Data Definition Language

Transaction Control Language

Data Manipulation Language

Data�Query Language

Data�Control�Language

✅

BEGIN

transactions

34 of 113

SQL TCL, Briefly

A transaction in SQL is a list of commands sandwiched by BEGIN and COMMIT.

If BEGIN/COMMIT not specified, then most systems will autocommit individual SQL commands, i.e., auto-wrap each command in its own transaction.

Generally (with slight syntax variation across systems):

BEGIN equivalent to START, BEGIN WORK, START TRANSACTION, etc.
COMMIT equivalent to END, END WORK, END TRANSACTION, etc.

Postgres BEGIN, START TRANSACTION, COMMIT, END, ROLLBACK, ABORT, SAVEPOINT

BEGIN� <command 1>� …� <command n>�COMMIT

transactions

35 of 113

[extra] SQL TCL: Savepoints

Savepoints let you break your transactions up into pieces. You can then “partially rollback” to a prior savepoint, or abort altogether.

ABORT
SAVEPOINT save_name
ROLLBACK TO SAVEPOINT save_name

Undo any commands that happened after the specified savepoint; and
Implicitly destroy any savepoints created after the specified one.

Use case: beyond the scope of this class, but generally used with SQL conditionals or as part of database constraints.

BEGIN

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

SAVEPOINT debit_done;

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

SAVEPOINT credit_done;

ROLLBACK TO SAVEPOINT debit_done;

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

END

(arbitrary example; what is this doing?)

Postgres BEGIN, START TRANSACTION, COMMIT, END, ROLLBACK, ABORT, SAVEPOINT

transactions

36 of 113

The ACID Principle

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

37 of 113

What constitutes “happening together”? Observations

BEGIN

-- "debit" one account

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;

-- "credit" the other account

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

COMMIT

A few observations:

We need both debit and credit to happen, i.e., we should not have partial transactions.
At the end of transactions, any database constraints should still be satisfied.
Even if another transaction happens simultaneously, one should appear to have finished “first.”
A committed transaction should appear to have happened, even if there is a power failure/reboot later.

These four properties, known as ACID, define how transactions guarantee�(1) concurrency control and (2) recovery.

transactions

38 of 113

ACID: Basic Guarantees

ACID defines four properties of transactions that guarantee concurrency control and recovery.

Atomicity �

Consistency

�

Isolation

�.

Durability

transactions

39 of 113

ACID: Basic Guarantees

ACID defines four properties of transactions that guarantee concurrency control and recovery.

Either all the commands are reflected in the database, or none are.�Ex: Both debit+credit should occur, or both should fail to occur.

If COMMIT succeeds, all the database integrity checks hold true.�(primary key/foreign keys, constraints, etc.)

Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not “see” each other’s intermediate results.

If COMMIT succeeds, all changes from the transaction persist, even if there is a power failure or a reboot, until the transaction is overwritten by a later transaction.

Atomicity�

Consistency�

Isolation��

Durability

transactions

40 of 113

[History] Why ACID? Unknown, but…

Jim Gray, PhD, UC Berkeley, Industry/ Academic researcher. 1998 Turing Award Winner “For seminal contributions to database and transaction processing research and technical leadership in system implementation.”

1983

Gray, 1981, Hearder and Reuter, 1983, Wikipedia (acid test (disambiguation))

41 of 113

[Exercise] ACID

BEGIN

-- "debit" one account

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;

-- "credit" the other account

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

COMMIT

A few observations:

We need both debit and credit to happen, i.e., we should not have partial transactions.
At the end of transactions, any database constraints should still be satisfied.
Even if another transaction happens simultaneously, one should appear to have finished “first.”
A committed transaction should appear to have happened, even if there is a power failure/reboot later.

Match 1-4 with A, C, I, and D from the ACID Principle.

🤔

A. D, I, C, A

B. I, C, A, D

C. A, C, I, D

D. A, C, D, I

E. Something else

transactions

42 of 113

Match 1-4 with A, C, I, and D from the ACID Principle.

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

43 of 113

[Solution] ACID

BEGIN

-- "debit" one account

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1111;

-- "credit" the other account

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 9999;

COMMIT

A few observations:

A. We need both debit and credit to happen, i.e., we should not have partial transactions.

C. At the end of transactions, any database constraints should still be satisfied.

I. Even if another transaction happens simultaneously, one should appear to have finished “first.”

D. A committed transaction should appear to have happened, even if there is a power failure/reboot later.

transactions

44 of 113

How does the database address each ACID property?

�

If COMMIT succeeds, all the database integrity checks hold true.�(primary key/foreign keys, constraints, etc.)

Atomicity�

Consistency�

Isolation��

Durability

Standard database checks (relatively efficient to check for core things like attribute types, keys, constraints, etc.)

transactions

45 of 113

How does the database address each ACID property?

Either all the commands are reflected in the database, or none are.�Ex: Both debit+credit should occur, or both should fail to occur.

�

��

If COMMIT succeeds, all changes from the transaction persist, even if there is a power failure or a reboot, until the transaction is overwritten by a later transaction.

Atomicity�

Consistency�

Isolation��

Durability

The database’s internal recovery system.

After a crash:

Redo all committed work; and
Undo all uncommitted work!

See CS186 for the implementation. Devil is in the details!

transactions

46 of 113

How does the database address each ACID property?

Atomicity�

Consistency�

Isolation��

Durability

�

Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not “see” each other’s intermediate results.

Provided by concurrency control, a component of the database. We’ll grasp the intuition today!!

transactions

47 of 113

Isolation

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

48 of 113

Isolation

Isolation: Concurrent transactions should externally appear to run sequentially.

If the database receives these transactions simultaneously, we should be able to successfully execute all three as if they happened “in isolation.”

Hire Mercy as the new VP of Engineering!

Move the entire payroll of the London Office to the Cambridge Office!

Prepare tax projections for the 2^nd quarter!

transactions

49 of 113

Isolation,

The challenge: How do we execute these transactions “in isolation” but “concurrently”? With one single machine?

Prepare tax projections for the 2^nd quarter!

Hire Mercy as the new VP of Engineering!

Move the entire payroll of the London Office to the Cambridge Office!

Assumption: the precise order of these three transactions doesn’t matter. What matters is that they appeared to have been executed by the DBMS in some order.

50 of 113

Isolation,

For simplicity, we will limit our discussion to reads and writes of individual “objects”:

“Objects” := records (for now)

i-th transaction has Read from O:

R_i(O)

i-th transaction has Write to O:

W_i(O [= value])

Prepare tax projections for the 2^nd quarter!

Hire Mercy as the new VP of Engineering!

Move the entire payroll of the London Office to the Cambridge Office!

51 of 113

Transaction Schedules

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

52 of 113

Determining Transaction Schedules that Maintain Isolation

Our goal: Understand how multiple transactions can run concurrently (for performance) but also in isolation (for ACID).

To do so, we’ll define the following:

Define transaction schedules (i.e., list of read/writes).

Define serial schedules, which satisfy isolation by definition.

Define serializable schedules, which allow for concurrency while maintaining isolation.

transactions

53 of 113

Example: Two Transactions

-- set Parth’s salary to 10% more �-- than Jonah’s

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Jonah')

WHERE name = 'Parth';

-- set Parth’s salary to 10% more �

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Parth')

WHERE name = ‘Parth';

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

transactions

54 of 113

Transaction Schedules

A transaction schedule is a ordered list of actions from a set of transactions.

A proposed Transaction Schedule�of T1 and T2

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

time

While there are many possible transaction schedules, a DBMS will pick one with which to schedule and execute read/write actions.

transactions

55 of 113

Transaction Schedules

A transaction schedule is a ordered list of actions from a set of transactions.

The ordered schedule of actions (reads from/writes to objects) represents the actual/potential execution sequence in time, as seen by the DBMS.

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

time

time slot 1

time slot 2

time slot 3

time slot 4

✅

transactions

56 of 113

Transaction Schedules

A transaction schedule is a ordered list of actions from a set of transactions.

The ordered schedule of actions (reads from/writes to objects) represents the actual/potential execution sequence in time, as seen by the DBMS.
The order in which two actions from the same transaction T are scheduled must reflect the order in which they appear in T.

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

time

✅

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

transactions

57 of 113

Determining Transaction Schedules that Maintain Isolation

Our goal: Understand how multiple transactions can run concurrently (for performance) but also in isolation (for ACID).

To do so, we’ll define the following:

Define transaction schedules (i.e., list of read/writes).

Define serial schedules, which satisfy isolation by definition.

Define serializable schedules, which allow for concurrency while maintaining isolation.

✅

transactions

58 of 113

Serial Schedules

A serial schedule is a transaction schedule for which �actions from different transactions are not interleaved.

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

serial schedule

time

serial schedule

not a serial schedule;�transactions interleaved

Serial schedules exhibit no concurrency, because actions of a transaction are executed together and separate from those in other transactions.

transactions

59 of 113

Do these schedules satisfy the isolation property?

Which of these three schedules satisfy the isolation property? Select all.

🤔

Isolation: Concurrent transactions should externally appear to run sequentially, i.e., 2 concurrent transactions should not (appear to) “see” each other’s intermediate results.

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

time

transactions

60 of 113

Which of these three schedules satisfy the isolation property? Select all.

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

61 of 113

Serializability

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

62 of 113

Do these schedules satisfy the isolation property?

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

time

Yes

Yes!

Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.

transactions

63 of 113

All three schedules satisfy the isolation property!

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
R₁(J)
W₁(P)
	R₂(P)
	W₂(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

time

It is okay that these two serial schedules produce non-equivalent database outcome states!

Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.

transactions

64 of 113

All three schedules satisfy the isolation property!

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

time

Isolation. If we execute a given schedule, from the DBMS’s POV, individual transactions appear to be executed sequentially.

Despite the interleaving, this schedule has an equivalent database outcome to one of the serial schedules!

transactions

65 of 113

Schedule 3 is a serializable schedule

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

time

Despite the interleaving, this schedule has an equivalent database outcome to one of the serial schedules!

Schedule 3 is a serializable schedule: a transaction schedule whose database outcome is equivalent to some serial schedule.

transactions

66 of 113

Unserializable Schedules

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

time

Not all schedules are serializable! This is an unserializable schedule, because there is no serial equivalent, and therefore transactions do not appear isolated.

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

serializable schedule

serial schedule

T₁	T₂
	R₂(P)
R₁(J)
W₁(P)
	W₂(P)

⚠️

transactions

68 of 113

Strict 2-Phase Locking

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

69 of 113

Summary so far

Our goal: Allow multiple transactions to run concurrently (for performance)�but also in isolation (for ACID).

To do so, we’ve traced the following steps:

Define transaction schedules (i.e., list of read/writes).

Define serial schedules, which satisfy isolation by definition.

Define serializable schedules, which allow for concurrency while maintaining isolation.

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
W₁(P)
	W₂(P)

serial schedule

serializable schedule

unserializable schedule

transactions

70 of 113

DBMS goals

Under the covers, a database system can allow serializable schedules that may not be serial, but after execution have the same outcome as some serial schedule.

Allows multiple transactions to run at the same time.
Much better for performance!!

CS186: Build systems that guarantee serializability for all executed schedules.

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
	R₂(P)
	W₂(P)
R₁(J)
W₁(P)

T₁	T₂
	R₂(P)
R₁(J)
W₁(P)
	W₂(P)

serial schedule

serializable schedule

unserializable schedule

❌

transactions

71 of 113

Two final goals of this lecture

Briefly, how do databases build schedule that ensure serializability?

Strict Two-Phase Locking

Conceptually, how do we know a schedule is serializable?

Conflicting actions

transactions

72 of 113

Database locking

One of the most straightforward implementations that databases can use to ensure serializability is called Strict Two-Phase Locking (Strict 2PL).

This is a conservative method to guarantee serializability.
It prevents certain serializable schedules and therefore may suffer some performance hits, but overall there is no harm done because it is always correct / satisfies ACID principle.
What theoretical guarantees? See conflict serializability

Skipped slides: details on Strict 2PL.

transactions

73 of 113

Determining Serializability:

Conflicting Actions

Transactions/TCL

The ACID Principle

Isolation

Transaction Schedules

Serializability

Details:

Strict 2-Phase Locking
Conflicting Actions

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

74 of 113

How do we know if a schedule is serializable?

We like serializable schedules.

Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:

the txns appear to have followed a serial schedule.
that txns can be “rolled back” one-by-one.

Conflicting actions between transactions will determine if a schedule is serializable.

What does it mean???

Let’s dive in!

transactions

75 of 113

Conflicting Actions

Def: Two actions conflict if:

They are two different, concurrent transactions.
They reference the same object.
At least one is a write.

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

transactions

76 of 113

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

🤔

transactions

77 of 113

Select all for which the following is true: Flipping the order of the two actions in T1 and T2 would result in a different database outcome state.

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

78 of 113

Which of the following are conflicting actions?

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

R₂(P)

W₁(P)

T2 reads �Parth salary �as ????

T1 writes�Parth salary �as ????

W₂(G)

R₁(G)

T2 writes Gabi salary as ????

T1 reads Gabi salary as ????

W₂(J)

W₁(J)

T2 writes Jonah salary as ????

T1 writes Jonah salary as ????

R₂(G)

R₁(G)

T2 reads Gabi salary as ????

T1 reads Gabi salary as ????

W₂(P)

W₁(J)

T2 writes Parth salary as ????

T1 writes Jonah salary as ????

Suppose T1 → T2 in a schedule. For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

hypothetically

79 of 113

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

transactions

80 of 113

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

can be flipped! no conflicts!

cannot be flipped! conflicting actions!

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

transactions

81 of 113

How do we know if a schedule is serializable?

We like serializable schedules.

Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:

the txns appear to have followed a serial schedule.
that txns can be “rolled back” one-by-one.

A schedule is serializable if all conflicting actions dictate a specific ordering of the transactions (with no cycles)

A topological sort on the graph of conflicts between transactions.

Conflicting actions between transactions determine if a schedule is serializable.

transactions

82 of 113

The example again

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

serializable

T₁	T₂
	R₂(P)
R₁(J)
W₁(P)
	W₂(P)

unserializable

T₁	T₂
R₁(J)
	R₂(P)
	W₂(P)
W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

⚠️

serializable

Schedule 4: The two pairs of conflicting actions imply two different orders to T1 and T2.

transactions

83 of 113

Performance Tradeoffs: Snapshot Isolation

Transactions

The ACID Principle

Isolation and Serializability

Strict 2-Phase Locking

Conflicting Actions

[Extra] Conflict Graphs

[Extra] Conflict Serializable

Weak Isolation

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

84 of 113

Final thoughts: Why Should Data Engineers Know Transactions?

You may think that transactions (and serializability) are very much in the weeds of DBMS design, which we don’t particularly implement in this course. However…

Inevitably you will update a database and manage data from transactional databases!

This means you should have a sense of its characteristics.

If your DB is slow for transactional reasons:

You should understand why
And how you can trade-off speed and “correctness,” i.e., redefine your transactions.

Finally, transaction concepts are also quite useful outside of databases.

Examples: Queueing systems, e.g., RabbitMQ or Kafka.

transactions

85 of 113

Serialized Transactions: A summary

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

Strict 2PL is a common implementation of serialization, though it is not the only one.

Life is good?…Except…

SELECT avg(gpa) FROM students;

Locks all students!
But we likely don’t need this to be 100% correct!

Sometimes we prefer to trade correctness for a little more performance.

transactions

86 of 113

Approximating Serialized Transactions with Weak Isolation

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

Strict 2PL is a common implementation of serialization, though it is not the only one.

Enter: Weak Isolation.

Each isolation can choose to be a “bit sloppy”...
…as long as it doesn’t mess up other transaction’s choices to do so.
The most common weak isolation implementation is snapshot isolation.
This is a much weaker property of isolation than serialized transactions,�but it’s good enough when we prefer more concurrency/higher performance.

Life is good?…Except…

SELECT avg(gpa) FROM students;

Locks all students!
But we likely don’t need this to be 100% correct!

Sometimes we prefer to trade correctness for a little more performance.

transactions

87 of 113

Snapshot Isolation

Snapshot isolation is a weaker form of isolation than serialization, but it’s good enough when we prefer more concurrency/higher performance.

Database system requirements: Keep multiple versions of tuples.

At transaction start: Take a “snapshot” of the database, off which to do reads/writes.

snapshot reads: All reads of this transaction are from this snapshot.
write validation: This transaction can commit if none of its writes conflict with other transactions since the snapshot was taken.

If write-write conflicts, then abort this transaction.

transactions

88 of 113

Snapshot Isolation is Actually Popular

Isolation levels (both default and maximum) vary in support across different database engines.

Marketing also varies!

When Oracle says “Serializable,” they actually are giving you Snapshot Isolation!!

The maximum levels of many cloud DBMSs is not always the theoretical maximum,�which is “serializable” transactions.

Serializable: Google Cloud Spanner, CockroachDB, Azure SQL Server
Read Commit: Snowflake, AWS Aurora

For more about Read Commit and others, check out the bonus slides.

89 of 113

[Bonus] Strict 2-Phase Locking: Details

Transactions

The ACID Principle

Isolation and Serializability

Strict 2-Phase Locking

Conflicting Actions

[Extra] Conflict Graphs

[Extra] Conflict Serializable

Weak Isolation

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

90 of 113

Database locking

How do databases ensure serializability?

One of the most straightforward implementations is called Strict Two-Phase Locking (Strict 2PL).

This is a conservative method to guarantee serializability.
It prevents certain serializable schedules and therefore may suffer some performance hits, but overall there is no harm done because it is always correct / satisfies ACID principle.
What theoretical guarantees? See conflict serializability

Locking is the process of ensuring that 2 conflicting actions happen in order.

The first action that arrives should “lock” the shared object.
The second action that arrives needs to wait until the first action’s transaction completes.
(we’ll define conflicting action more precisely later)

transactions

91 of 113

Strict Two-Phase Locking (Strict 2PL)

Phase 1: During the transaction, lock objects before use. �Two types of locks:

S lock: Before executing R1(O), transaction T1 �must acquire a shared lock on O.
X lock: Before executing W1(O), transaction T1 �must acquire an exclusive lock on O.

Phase 2: At the end of the transaction (i.e., � COMMIT or ROLLBACK),� release all locks at once.

# locks held

acquisition phase

time

release all locks at end of xact

The Strict 2PL algorithm allows only serializable schedules!

Note that schedules can result in deadlock. See Discussion for more info/practice!

transactions

92 of 113

Strict Two-Phase Locking (Strict 2PL), Practically

What objects are we locking?

For most purposes, assume the DBMS is locking individual records.
It is sometimes useful to lock entire tables at once (e.g., to change a schema/a default attribute), but we won’t go into detail.

What does it mean to “acquire” or “release” a lock?

Under the hood: DBMS maintains some of “lock table” according to an internal protocol.
The system ensures that all transactions follow the internal protocol’s locking rules.

Analogy: red lights at intersections. You trust the protocol.

# locks held

acquisition phase

time

release all locks at end of xact

transactions

93 of 113

[Extra]

Determining Serializability: Conflict Graphs

Transactions

The ACID Principle

Isolation and Serializability

Strict 2-Phase Locking

Conflicting Actions

[Extra] Conflict Graphs

[Extra] Conflict Serializable

Weak Isolation

[Extra] Additional slides

Lecture 22, Data 101, Spring 2025

94 of 113

[Exercise] Determining Conflicting Actions

time

🤔

What are the conflicting actions in each of the schedules?

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

transactions

95 of 113

[Exercise] Determining Conflicting Actions

time

What are the conflicting actions in each of the schedules?

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

1. W1(P) and W2(P)�// write/write to same object

2. R1(G) and W2(G); // read/write same obj� W1(P) and R2(P) // read/write same obj

transactions

96 of 113

[Exercise] Determining Serializability

Suppose we have the following conflicting actions:

1. W1(P) and W2(P)

2. R1(G) and W2(G); W1(P) and R2(P)

Which of the following schedules are serializable?

time

A. Serializable schedule, i.e., equivalent to some serial schedule of T1 and T2

B. Unserializable schedule, i.e., no equivalent serial schedule exists

🤔

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

(no slido)

transactions

97 of 113

[Exercise] Determining Serializability

Suppose we have the following conflicting actions:

1. W1(P) and W2(P)

2. R1(G) and W2(G); W1(P) and R2(P)

Which of the following schedules are serializable?

time

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

A. Serializable! Equivalent to T2 happening before T1.

B. Unserializable! Conflicting actions can’t be “flipped.”

transactions

98 of 113

An Algorithm for Determining Serializability

Serializability of a schedule can be determined by drawing its conflict graph.

One node per transaction Ti.
Edge from Ti to Tj if:

Action a in Ti conflicts with Action b in Tj, AND
Action a happens before Action b in the schedule.

99 of 113

An Algorithm for Determining Serializability

Serializability of a schedule can be determined by drawing its conflict graph.

One node per transaction Ti.
Edge from Ti to Tj if:

Action a in Ti conflicts with Action b in Tj, AND
Action a happens before Action b in the schedule.

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

time

T₁

T₂

Given: Conflicting actions

1. W1(P) and W2(P)

2. R1(G) and W2(G); � W1(P) and R2(P)

Serializable!

Unserializable!

100 of 113

An Algorithm for Determining Serializability

Serializability of a schedule can be determined by drawing its conflict graph.

One node per transaction Ti.
Edge from Ti to Tj if:

Action a in Ti conflicts with Action b in Tj, AND
Action a happens before Action b in the schedule.

100

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

time

T₁

T₂

Given: Conflicting actions

1. W1(P) and W2(P)

2. R1(G) and W2(G); � W1(P) and R2(P)

Serializable!

Unserializable!

T₁

T₂

101 of 113

An Algorithm for Determining Serializability

101

time

T₁	T₂
	R₂(J)
R₁(J)
	W₂(P)
W₁(P)

T₁	T₂
R₁(G)
	R₂(P)
W₁(P)
	W₂(G)

Serializable! Equivalent to T2 happening before T1.

Unserializable! Conflicting actions can’t be “flipped.”

If the conflict graph has no cycles (acyclic), then the schedule is serializable. Otherwise, it has cycles and it is unserializable.

T₁

T₂

T₁

T₂

Given: Conflicting actions

1. W1(P) and W2(P)

2. R1(G) and W2(G); � W1(P) and R2(P)

(proof later)

transactions

102 of 113

How do we know if a schedule is serializable?

We like serializable schedules.

Isolation, again: For multiple concurrent transactions, after the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However:

The order means that the transactions appear to have followed a serial schedule.
The order means that transactions can be “rolled back” one-by-one.

102

Conflicting actions between transactions will determine if a schedule is serializable.

We did it!

The strategy for a determining serializability of a given schedule of interleaved transactions:

Identify the conflicting actions.
Draw the conflict graph.
If the conflict graph is acyclic, then the schedule is serializable. Else, it is unserializable.

🌟

transactions

103 of 113

[Extra]

Formal Terminology: Conflict Serializable

Transactions

The ACID Principle

Isolation and Serializability

Strict 2-Phase Locking

Conflicting Actions

[Extra] Conflict Graphs

[Extra] Conflict Serializable

Weak Isolation

[Extra] Additional slides

103

Lecture 22, Data 101, Spring 2025

104 of 113

Why does acyclic conflict graph → serializability?

Observations:

The acyclic graph has a natural traversal order.
An order of conflicting actions will guide us to an equivalent serial schedule.
To make #1’s serial schedule: Move all T2 actions first, move all T1 actions second.
We cannot do this “bulk” moving for #2, because of the cycle. So no equiv. serial schedule!!

104

W₁(P)

R₂(P)

R₁(G)

W₂(G)

order

W₁(P)

R₂(P)

R₁(G)

W₂(G)

T₁

T₂

T₁

T₂

Unserializable! Conflicting actions can’t be “flipped.”

Serializable! Equivalent to T2 happening before T1.

no cycles!

cycles!

transactions

105 of 113

Formal Terminology: Conflict Serializable Schedule

Def A schedule is conflict serializable if and only if the conflict graph is acyclic (has no cycles).

Observations:

The acyclic graph has a natural traversal order.
An order of conflicting actions will guide us to an equivalent serial schedule.
To make #1’s serial schedule: Move all T2 actions first, move all T1 actions second.
We cannot do this “bulk” moving for #2, because of the cycle. So no equiv. serial schedule!!

105

transactions

106 of 113

Formal Terminology: Conflict Serializable Schedule

Def A schedule is conflict serializable if and only if the conflict graph is acyclic (has no cycles).

Observations:

The acyclic graph has a natural traversal order.
An order of conflicting actions will guide us to an equivalent serial schedule.
To make #1’s serial schedule: Move all T2 actions first, move all T1 actions second.
We cannot do this “bulk” moving for #2, because of the cycle. So no equiv. serial schedule!!

106

Lemma

If a schedule is conflict serializable, then it is serializable.

Proof:

By definition of conflict serializable, the given schedule has an acyclic conflict graph.
Any serial schedule that follows the edges of the given conflict graph has the same ordering of conflicting actions and is therefore equivalent to the given schedule.
By definition of serializable, the given schedule is therefore serializable.

transactions

107 of 113

[Extra] The converse is not true

Note that some serializable schedules are not necessarily conflict serializable!

From R&G Database Management Systems, Third Edition, Section 17.1, Figure 17.1 (p.550-1):

This schedule is equivalent to executing the transactions serially in the order T1, T2, T3, but it is not conflict equivalent to this serial schedule because the writes of T1 and T2 are ordered differently.

107

T1	T2	T3
R(A)
	W(A)
	Commit
W(A)
Commit
		W(A)
		Commit

transactions

108 of 113

[Extra] Weak Isolation: Read Commit

Transactions

The ACID Principle

Isolation and Serializability

Strict 2-Phase Locking

Conflicting Actions

[Extra] Conflict Graphs

[Extra] Conflict Serializable

Weak Isolation

[Extra] Additional slides

108

Lecture 22, Data 101, Spring 2025

109 of 113

“Read Committed” Isolation level

What if we dropped each Shared lock right after reading

But kept our eXclusive locks until COMMIT/ROLLBACK?

Prevents “dirty” (uncommitted) reads from other transactions

Each read is of an unlocked/committed item!

Doesn’t promise much more!

This isolation level is called Read Committed

Note: respects the locks of other, Strict 2PL transactions

109

transactions

110 of 113

Does it help us?

110

BEGIN � ISOLATION LEVEL serializable;

UPDATE students

SET gpa = 4.0

WHERE sid = 1234;

END;

BEGIN

ISOLATION LEVEL read committed;

SELECT avg(gpa)

FROM students;

END;

Locks every student but doesn’t need to be 100% correct!

Locks 1 student but must be serializable!

transactions

111 of 113

What could go wrong in Read Committed?

Non-repeatable reads

Suppose you read a tuple twice in your transaction
Another transaction could run between the two reads and update it!

Phantoms

Suppose you run a query with a non-key WHERE clause

E.g. “find all students with an A grade”

If you run it again, some brand new tuples (phantoms) could appear!

Staleness: Technically you could read a very old (but committed) “version”

Still satisfies the definition!

111

transactions

112 of 113

Repeatable Read Isolation

Prevents dirty reads and non-repeatable reads

A locking-based way to think about it:

All locks are held until COMMIT/ROLLBACK
But could be only tuple-level locks

So phantoms are still possible!

112

transactions

113 of 113

Snapshot Isolation is not (quite) Serializable

113

x = R₁(G₀)
		y = R₂(P₀)
W₂(P₁ = 110000)
		W₂(G₁ = 220000)

Time

Pat (T₁)

Gabi (T₂)

P₀: 200000

G₀: 100000

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

NOT equivalent to either order (not serializable!)

Write skew anomaly: concurrent reads, and writes reflect the fact that they didn’t read each other’s writes!

transactions

1 of 113

2 of 113

3 of 113

4 of 113

5 of 113

6 of 113

7 of 113

8 of 113

9 of 113

10 of 113

11 of 113

12 of 113

13 of 113

14 of 113

15 of 113

16 of 113

17 of 113

18 of 113

19 of 113

20 of 113

21 of 113

22 of 113

23 of 113

24 of 113

25 of 113

26 of 113

27 of 113

28 of 113

29 of 113

30 of 113

31 of 113

32 of 113

33 of 113

34 of 113

35 of 113

36 of 113

37 of 113

38 of 113

39 of 113

40 of 113

41 of 113

42 of 113

43 of 113

44 of 113

45 of 113

46 of 113

47 of 113

48 of 113

49 of 113

50 of 113

51 of 113

52 of 113

53 of 113

54 of 113

55 of 113

56 of 113

57 of 113

58 of 113

59 of 113

60 of 113

61 of 113

62 of 113

63 of 113

64 of 113

65 of 113

66 of 113

67 of 113

68 of 113

69 of 113

70 of 113

71 of 113

72 of 113

73 of 113

74 of 113

75 of 113

76 of 113

77 of 113

78 of 113

79 of 113

80 of 113