Data Engineering��Transactions
This is a Big Topic
Two Main Topics
Why Should Data Engineers Know Transactions?
The Basic API
<command 1>
…
<command n>
The Basic Guarantees: ACID
The Basic Guarantees: ACID
The Basic Guarantees: ACID
The Basic Guarantees: ACID
Guaranteeing ACID
Thinking Transactionally
There are many cases when transactions are extremely useful!
Classic example: debit/credit
Potential Issues:
- Atomicity?
--debit
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
-- credit
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
Thinking Transactionally
There are many cases when transactions are extremely useful!
Classic example: debit/credit
Potential Issues:
- Atomicity?
- Consistency?
--debit
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
-- credit
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
Thinking Transactionally
There are many cases when transactions are extremely useful!
Classic example: debit/credit
Potential Issues:
- Atomicity?
- Consistency?
- Isolation?
--debit
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
-- credit
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
Thinking Transactionally
There are many cases when transactions are extremely useful!
Classic example: debit/credit
Potential Issues:
- Atomicity?
- Consistency?
- Isolation?
- Durability?
--debit
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
-- credit
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
Thinking Transactionally
There are many cases when transactions are extremely useful!
Classic example: debit/credit
Potential Issues:
- Atomicity?
- Consistency?
- Isolation?
- Durability?
BEGIN
--debit
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
-- credit
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
END
Savepoints
BEGIN
UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;
SAVEPOINT debit_done;
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4322;
SAVEPOINT credit_done;
ROLLBACK TO SAVEPOINT debit_done;
UPDATE savings
SET amount = amount + 1000
WHERE acctId = 4321;
END
Quick Summary
Keywords in square brackets are optional
Keywords separated by vertical bars are alternatives
Isolation is tricky and it affects app designers!
Hire Maryam as the new VP of Engineering!
Prepare tax projections for the 2nd quarter!
Shut down the London Office and furlough all employees!
Let’s make it simpler!
R(X1)
W(X1 = X1-275000)
W(Y1 = (“VPE”, “Maryam”))
R(X1-100000)
W(Z1=…)
W(Q3, Q4, Q5 , Q11)
Reads and Writes of individual “objects”
A Transaction Schedule
Ri(O)
Wi(O = value)
Schedule 1
x = R1(G) | | |
W1(P = 110000) | | |
| y = R2(P) | |
| W2(G = 121000) | |
| | |
| | |
| | |
| |
P: 200000
G: 100000
Time
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
Pat (T1)
Gabi (T2)
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Gabi')
WHERE name = 'Pat';
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Pat')
WHERE name = 'Gabi';
Schedule 2
| | y = R2(P) |
| W2(G = 220000) | |
x = R1(G) | | |
W1(P = 242000) | | |
| | |
| | |
| | |
| |
Time
Pat (T1)
Gabi (T2)
P: 200000
G: 100000
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Gabi')
WHERE name = 'Pat';
UPDATE employee
SET salary = (SELECT salary*1.1
FROM employee
WHERE name='Pat')
WHERE name = 'Gabi';
Serial Schedules
Pat (T1)
Gabi (T2)
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
Schedule 3: not serial
| | y = R2(P) |
x = R1(G) | | |
| W2(G = 220000) | |
W1(P = 110000) | | |
| | |
| | |
| | |
| |
Time
Pat (T1)
Gabi (T2)
G: 100000
Set my salary to be 10% bigger than Pat’s
Set my salary to be 10% bigger than Gabi’s
P: 200000
Serializable Schedules
Pat (T1)
Gabi (T2)
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
P: 200000
G: 100000
Definition (once more for good measure)
Different, Serializable Transaction Mix!
| | y = R2(P) |
x = R1(J) | | |
| W2(G = 110000) | |
W2(P = 550000) | | |
| | |
| | |
| | |
| |
T: 200000
P: 100000
Time
Pat (T1)
Gabi (T2)
J: 500000
Set my salary to be 10% bigger than Jin’s
Set my salary to be 10% bigger than Pat’s
Different, Serializable Transaction Mix!
| | y = R2(P) |
x = R1(J) | | |
| W2(G = 110000) | |
W2(P = 550000) | | |
| | |
| | |
| | |
| |
Time
Pat (T1)
Gabi (T2)
Outcome?
Equivalent Serial Schedule?
T: 200000
P: 100000
J: 500000
Set my salary to be 10% bigger than Jin’s
Set my salary to be 10% bigger than Pat’s
Quick Summary
How do we know if a schedule is Serializable?
Strategy:
Conflicting Actions
Scenario: Write → Read
T1 writes Pat’s salary is 110000
Then
T2 reads Pat’s salary is 110000
W1(P)
R2(P)
Scenario: Read → Write
Then
R1(G)
W2(G)
Scenario: Write → Write
Then
W3(G)
W4(G)
W3(P = 300000) | | |
| W4(P = 0) | |
| | |
| | |
| | |
| | |
| | |
| |
Time
Scenario: Read / Read
Then
R5(G)
R6(G)
d = R5(P) | | |
| e = R6(P) | |
W(Q = d) | | |
| W(S = e) | |
| | |
| | |
| | |
| |
Time
W5(Q)
W6(S)
An Unserializable Schedule
R1(G)
W2(G)
Clearly any equivalent serial schedule has T1 before T2
T1
T2
An Unserializable Schedule
R1(G)
R2(P)
W1(P)
W2(G)
Clearly any equivalent serial schedule has T1 before T2
Clearly any equivalent serial schedule has T2 before T1
T1
T2
An Unserializable Schedule
R1(G)
R2(P)
W1(P)
W2(G)
Clearly any equivalent serial schedule has T1 before T2
Clearly any equivalent serial schedule has T2 before T1
T1
T2
Conflict Graphs
T1
T2
Conflict Graph.
A cycle corresponds to a schedule that is not conflict serializable.
Conflict Serializable ⇒ Serializable
Quick Summary
So how do databases ensure serializability?
Database locking
Strict Two Phase Locking
Strict 2PL
# locks held
acquisition phase
time
release all locks at end of xact
Two questions
Quick Summary
That Should be Enough!
UPDATE students
SET gpa = 4.0
WHERE sid = 1234;
SELECT avg(gpa)
FROM students;
Locks every student but doesn’t need to be 100% correct!
Locks 1 student but must be serializable!
Weak Isolation
“Read Committed” Isolation level
Does it help us?
BEGIN � ISOLATION LEVEL serializable;
UPDATE students
SET gpa = 4.0
WHERE sid = 1234;
END;
BEGIN
ISOLATION LEVEL read committed;
SELECT avg(gpa)
FROM students;
END;
Locks every student but doesn’t need to be 100% correct!
Locks 1 student but must be serializable!
What could go wrong in Read Committed?
Repeatable Read Isolation
Snapshot Isolation
Snapshot Isolation is not (quite) Serializable
x = R1(G0) | | |
| y = R2(P0) | |
W2(P1 = 110000) | | |
| W2(G1 = 220000) | |
| | |
| | |
| | |
| |
Time
Pat (T1)
Gabi (T2)
P0: 200000
G0: 100000
Set my salary to be 10% bigger than Gabi’s
Set my salary to be 10% bigger than Pat’s
NOT equivalent to either order (not serializable!)
Write skew anomaly: concurrent reads, and writes reflect the fact that they didn’t read each other’s writes!
What do Database Engines provide?
Summing Up
Additional Slides
Protocol for Writes (Exclusive locks)
Item | Lock Mode | Granted | Wait Queue |
O | S | T2, T3 | |
| | | |
| | | |
Protocol for Reads (Shared locks)
Item | Lock Mode | Granted | Wait Queue |
O | S | T2, T3 | |
| | | |
| | | |
Protocol for X Lock Release
Item | Lock Mode | Granted | Wait Queue |
O | X | T1 | T4:S, T5:S, T6:X |
| | | |
| | | |
You can prove that:
One remaining issue: Deadlock
T1 acquires X-lock on O
T2 acquires S-lock on P
T1 requests X-lock on P.
T1 is descheduled, waiting for T2.
T2 requests S-lock on O
T2 is descheduled, waiting for T1.
Solution: the system periodically detects deadlock cycles, and rolls back one transaction on the cycle.
T1
T2
Waits-for Graph. A slightly different kind of graph: each edge shows when a transaction is waiting for another transaction.
A cycle corresponds to a deadlock. No serializability violations happen, but transactions on the cycle are stuck and objects remain locked.