1 of 65

Data Engineering��Transactions

2 of 65

This is a Big Topic

  • Desire: Make it easy to think about updating data
    • Want correctness as well as speed!

  • Result: An extremely rich topic
    • Complexity in understanding the APIs & correctness guarantees
    • (More complexity in the implementation under the covers!)

  • Our goal in Data Engineering
    • Genuine understanding of the APIs/guarantees
    • Intuition about implementation to help with that understanding

3 of 65

Two Main Topics

  • Concurrency Control
    • Many users query & update a database simultaneously.
    • How do we keep that from being confusing/wrong?

  • Recovery
    • What happens when things fail?
    • E.g. transaction cancel, app failure, DB engine failure, hardware failure

4 of 65

Why Should Data Engineers Know Transactions?

  • Inevitably you will update a database

  • You will likely manage data from transactional databases
    • Should have a sense of its characteristics
    • E.g. consider tuples in two tables that join together

  • Transaction concepts turn out to be useful outside of databases
    • E.g. in queuing systems like RabbitMQ or Kafka
  • If your DB is slow for transactional reasons, you should understand why
    • And how you can tradeoff speed and “correctness”

5 of 65

The Basic API

  • Bracket a sequence of commands:
    • BEGIN

<command 1>

<command n>

    • COMMIT or ROLLBACK

6 of 65

The Basic Guarantees: ACID

  • Atomicity:
    • Either all the commands are reflected in the database, or none are!

7 of 65

The Basic Guarantees: ACID

  • Atomicity:
    • Either all the commands are reflected in the database, or none are!
  • Consistency:
    • If COMMIT succeeds, all the database integrity checks hold true

8 of 65

The Basic Guarantees: ACID

  • Atomicity:
    • Either all the commands are reflected in the database, or none are!
  • Consistency:
    • If COMMIT succeeds, all the database integrity checks hold true
  • Isolation:
    • 2 transactions running concurrently will not “see” each other’s results.

9 of 65

The Basic Guarantees: ACID

  • Atomicity:
    • Either all the commands are reflected in the database, or none are!
  • Consistency:
    • If COMMIT succeeds, all the database integrity checks hold true
  • Isolation:
    • 2 transactions running concurrently will not “see” each other’s results.
  • Durability:
    • If the COMMIT succeeds, all changes from the transaction persist
      • Until overwritten by a later transaction

10 of 65

Guaranteeing ACID

  • A and D: guaranteed by Recovery system
    • After crash, Redo committed work, Undo uncommitted work!
    • The devil is most definitely in the details! CS186.

  • C: DB engine just checks on each command
    • Efficient for core stuff like types, keys, etc.

  • I: provided by Concurrency Control
    • Intuition today

11 of 65

Thinking Transactionally

There are many cases when transactions are extremely useful!

Classic example: debit/credit

Potential Issues:

- Atomicity?

--debit

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

-- credit

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

12 of 65

Thinking Transactionally

There are many cases when transactions are extremely useful!

Classic example: debit/credit

Potential Issues:

- Atomicity?

- Consistency?

--debit

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

-- credit

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

13 of 65

Thinking Transactionally

There are many cases when transactions are extremely useful!

Classic example: debit/credit

Potential Issues:

- Atomicity?

- Consistency?

- Isolation?

--debit

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

-- credit

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

14 of 65

Thinking Transactionally

There are many cases when transactions are extremely useful!

Classic example: debit/credit

Potential Issues:

- Atomicity?

- Consistency?

- Isolation?

- Durability?

--debit

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

-- credit

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

15 of 65

Thinking Transactionally

There are many cases when transactions are extremely useful!

Classic example: debit/credit

Potential Issues:

- Atomicity?

- Consistency?

- Isolation?

- Durability?

BEGIN

--debit

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

-- credit

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

END

16 of 65

Savepoints

  • Savepoints allow you to break your transaction up into pieces
  • Then you can do “partial” rollbacks to a prior Savepoint
  • Can have many Savepoints
    • ROLLBACK TO SAVEPOINT x�implicitly destroys any�SAVEPOINTS after x

BEGIN

UPDATE checking � SET amount = amount – 1000 � WHERE acctId = 1234;

SAVEPOINT debit_done;

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4322;

SAVEPOINT credit_done;

ROLLBACK TO SAVEPOINT debit_done;

UPDATE savings

SET amount = amount + 1000

WHERE acctId = 4321;

END

17 of 65

Quick Summary

  • A Transaction is a list of commands bracketed by BEGIN and END
    • Enjoys the ACID transaction guarantees!
  • SQL Syntax can vary slightly across systems.
    • BEGIN:
      • BEGIN [TRANSACTION | WORK | TRANS]
      • START [TRANSACTION | WORK | TRANS]
    • END:
      • END [TRANSACTION | WORK | TRANS]
      • COMMIT [TRANSACTION | WORK | TRANS]
    • ROLLBACK
      • ROLLBACK [TRANSACTION | WORK | TRANS]
      • ABORT [TRANSACTION | WORK | TRANS]
    • SAVEPOINT savepoint_name;
    • ROLLBACK [TRANSACTION | WORK | TRANS] TO [SAVEPOINT] savepoint_name;
  • “Autocommit”: If you don’t say BEGIN/END, many systems wrap each SQL command in a transaction

Keywords in square brackets are optional

Keywords separated by vertical bars are alternatives

18 of 65

Isolation is tricky and it affects app designers!

Hire Maryam as the new VP of Engineering!

Prepare tax projections for the 2nd quarter!

Shut down the London Office and furlough all employees!

19 of 65

Let’s make it simpler!

R(X1)

W(X1 = X1-275000)

W(Y1 = (“VPE”, “Maryam”))

R(X1-100000)

W(Z1=…)

W(Q3, Q4, Q5 , Q11)

Reads and Writes of individual “objects”

  • For now: object = “records”

20 of 65

21 of 65

22 of 65

A Transaction Schedule

  • A (time-)ordered list of reads and writes to specific objects by different transactions
    • The actions within each transaction are ordered
    • The actions of multiple transactions may be interleaved

  • Notation for today

Ri(O)

Wi(O = value)

23 of 65

Schedule 1

x = R1(G)

W1(P = 110000)

y = R2(P)

W2(G = 121000)

P: 200000

G: 100000

Time

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

Pat (T1)

Gabi (T2)

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Gabi')

WHERE name = 'Pat';

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Pat')

WHERE name = 'Gabi';

24 of 65

Schedule 2

y = R2(P)

W2(G = 220000)

x = R1(G)

W1(P = 242000)

Time

Pat (T1)

Gabi (T2)

P: 200000

G: 100000

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Gabi')

WHERE name = 'Pat';

UPDATE employee

SET salary = (SELECT salary*1.1

FROM employee

WHERE name='Pat')

WHERE name = 'Gabi';

25 of 65

Serial Schedules

  • T1 before T2 is “isolated”! No concurrency.
    • End state: P = 110000, G = 121000
  • T2 before T1 is also “isolated”! No concurrency.
    • End state: P = 242000, G = 220000
  • We call these serial schedules.
    • Serial schedules will be our definition of Isolation
    • All serial schedules are “OK” (by definition!)

Pat (T1)

Gabi (T2)

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

26 of 65

Schedule 3: not serial

y = R2(P)

x = R1(G)

W2(G = 220000)

W1(P = 110000)

Time

Pat (T1)

Gabi (T2)

G: 100000

Set my salary to be 10% bigger than Pat’s

Set my salary to be 10% bigger than Gabi’s

P: 200000

27 of 65

Serializable Schedules

  • End state of that schedule?
    • P = 110000, G = 220000

  • A schedule whose outcome is equivalent to some serial schedule is serializable”
    • Unfortunately this one is not serializable!

  • Serializability: ensure that all schedules the system allows are serializable
    • Under the covers, maybe not actually be serial, just serializable!

Pat (T1)

Gabi (T2)

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

P: 200000

G: 100000

28 of 65

Definition (once more for good measure)

  • A schedule is serializable if its outcome is the same as the outcome of some serial schedule.

  • A system provides serializability if it ensures that all its schedules are serializable!

29 of 65

Different, Serializable Transaction Mix!

y = R2(P)

x = R1(J)

W2(G = 110000)

W2(P = 550000)

T: 200000

P: 100000

Time

Pat (T1)

Gabi (T2)

J: 500000

Set my salary to be 10% bigger than Jin’s

Set my salary to be 10% bigger than Pat’s

30 of 65

Different, Serializable Transaction Mix!

y = R2(P)

x = R1(J)

W2(G = 110000)

W2(P = 550000)

Time

Pat (T1)

Gabi (T2)

Outcome?

Equivalent Serial Schedule?

T: 200000

P: 100000

J: 500000

Set my salary to be 10% bigger than Jin’s

Set my salary to be 10% bigger than Pat’s

31 of 65

Quick Summary

  • A Transaction Schedule is a sequence of Reads and Writes by named transactions on named objects.

  • We will define Isolation by the Serial Schedules in which each transactions runs from start to commit without interleaved actions from any other transaction.

  • Under the covers, a database system can allow Serializable Schedules that may not be Serial, but have the same outcome as some serial schedule.
    • This allows multiple transactions to run at once!
    • Much better for performance.

32 of 65

How do we know if a schedule is Serializable?

Strategy:

    • Define a notion of conflicting actions
    • The order of conflicting actions will guide us to an equivalent Serial schedule
    • If conflicting actions form a cycle, there is no way to define such an order!
      • And that is the non-serializable case!

33 of 65

Conflicting Actions

  • Definition: Two actions conflict if:

    • They are from two different, concurrent transactions
    • They reference the same object.
    • At least one is a Write.

34 of 65

Scenario: Write → Read

T1 writes Pat’s salary is 110000

Then

T2 reads Pat’s salary is 110000

  • Clearly any equivalent serial schedule must have T1 before T2

W1(P)

R2(P)

35 of 65

Scenario: Read → Write

  • T1 reads Gabi’s salary is 100000

Then

  • T2 writes Gabi’s salary is 121000

  • Clearly any equivalent serial schedule must have T1 before T2

R1(G)

W2(G)

36 of 65

Scenario: Write → Write

  • T3 reads Pat’s salary is 300000

Then

  • T4 writes Pat’s salary is 0

  • Clearly any equivalent serial schedule must have T3 before T4

W3(G)

W4(G)

W3(P = 300000)

W4(P = 0)

Time

37 of 65

Scenario: Read / Read

  • T5 reads Pat’s salary is 0

Then

  • T6 reads Pat’s salary is 0

  • The order of the reads makes no difference: no conflict between reads!
  • No other pair of actions references the same object
  • Hence these two transactions have no conflicts at all, and all interleavings of their actions result in the same outcome!

R5(G)

R6(G)

d = R5(P)

e = R6(P)

W(Q = d)

W(S = e)

Time

W5(Q)

W6(S)

38 of 65

An Unserializable Schedule

R1(G)

W2(G)

Clearly any equivalent serial schedule has T1 before T2

T1

T2

39 of 65

An Unserializable Schedule

R1(G)

R2(P)

W1(P)

W2(G)

Clearly any equivalent serial schedule has T1 before T2

Clearly any equivalent serial schedule has T2 before T1

T1

T2

40 of 65

An Unserializable Schedule

R1(G)

R2(P)

W1(P)

W2(G)

Clearly any equivalent serial schedule has T1 before T2

Clearly any equivalent serial schedule has T2 before T1

T1

T2

41 of 65

Conflict Graphs

  • Node per transaction
  • Edge from Ti to Tj if:
    • Action a in Ti conflicts with action b in Tj
    • Action a happens before action b in schedule

  • Def’n: A schedule is conflict serializable if conflict graph has no cycles

T1

T2

Conflict Graph.

A cycle corresponds to a schedule that is not conflict serializable.

42 of 65

Conflict Serializable ⇒ Serializable

  • Going back to our strategy:
    • The order of conflicting actions will guide us to an equivalent Serial schedule
    • If conflicting actions form a cycle, there is no way to define such an order!

  • Lemma: Every conflict serializable schedule is serializable!
  • Proof:
    • A conflict serializable schedule has an acyclic conflict graph (by definition)
    • Any serial schedule that follows the edges of the conflict graph has the same ordering of conflicting actions
      • Hence the same final outcome!

43 of 65

Quick Summary

  • Two actions conflict if they’re in different concurrent transactions, reference the same item, and at least one is a Write.

  • Can draw a conflict graph among the transactions of a schedule.
    • Each edge captures the schedule order of the conflict

  • If conflict graph has no cycles (acyclic) then the schedule is serializable!
    • Follow conflict order

44 of 65

So how do databases ensure serializability?

  • Ensure no conflict cycles: conflict serializability!
    • It’s “conservative”
      • Guarantees serializability
      • May prevent some serializable schedules, but no harm done
        • May be performance hit, but always correct

  • How do they do it? Locking!
    • Specifically, Strict Two-Phase Locking (Strict 2PL)

45 of 65

Database locking

  • Want to ensure 2 conflicting actions happen in order?
    • Have the first action that arrives “lock” the shared object!
    • The second action waits until the first action’s transaction completes!

46 of 65

Strict Two Phase Locking

  • Phase 1: During transaction, lock objects before use!
    • Before R1(O), transaction T1 must acquire a shared (S) lock on O
    • Before W1(O), transaction T1 must acquire an exclusive (X) lock on O

  • Phase 2: At end of transaction, drop all locks at once!

47 of 65

Strict 2PL

# locks held

acquisition phase

time

release all locks at end of xact

48 of 65

Two questions

  • What objects are we locking?
    • For most purposes, assume DBMS is locking individual records.
    • Sometimes useful to lock entire tables at once.

  • What does it mean to “acquire” or “release” a lock?
    • The database system maintains a lock “table” for this protocol
    • It’s all just an internal protocol
      • Like red lights at intersections.
      • Database system ensures that all transactions follow the locking rules

49 of 65

Quick Summary

  • Strict Two Phase Locking
    • Phase 1: S-Lock before you read, X-lock before you write
    • Phase 2: drop all locks at once during Commit or Rollback

  • Locking Protocol
    • Lock table keeps track of objects, lock modes, transaction status
    • All handled automatically by database system read/write interface

50 of 65

That Should be Enough!

  • You have been taught the religion of serializable transactions
    • They ensure the ACID properties
    • Strict 2PL is a common implementation of serializability
  • Life is good! Except ….
    • Sometimes you want to trade correctness for performance!
    • Example:

UPDATE students

SET gpa = 4.0

WHERE sid = 1234;

SELECT avg(gpa)

FROM students;

Locks every student but doesn’t need to be 100% correct!

Locks 1 student but must be serializable!

51 of 65

Weak Isolation

  • Idea: allow a transaction to choose to be a bit “sloppy”
    • As long as it doesn’t mess up other transactions’ choices!

52 of 65

“Read Committed” Isolation level

  • What if we dropped each Shared lock right after reading
    • But kept our eXclusive locks until COMMIT/ROLLBACK?
  • Prevents “dirty” (uncommitted) reads from other transactions
    • Each read is of an unlocked/committed item!
  • Doesn’t promise much more!

  • This isolation level is called Read Committed
    • Note: respects the locks of other, Strict 2PL transactions

53 of 65

Does it help us?

BEGIN � ISOLATION LEVEL serializable;

UPDATE students

SET gpa = 4.0

WHERE sid = 1234;

END;

BEGIN

ISOLATION LEVEL read committed;

SELECT avg(gpa)

FROM students;

END;

Locks every student but doesn’t need to be 100% correct!

Locks 1 student but must be serializable!

54 of 65

What could go wrong in Read Committed?

  • Non-repeatable reads
    • Suppose you read a tuple twice in your transaction
    • Another transaction could run between the two reads and update it!

  • Phantoms
    • Suppose you run a query with a non-key WHERE clause
      • E.g. “find all students with an A grade”
    • If you run it again, some brand new tuples (phantoms) could appear!

  • Staleness: Technically you could read a very old (but committed) “version”
    • Still satisfies the definition!

55 of 65

Repeatable Read Isolation

  • Prevents dirty reads and non-repeatable reads

  • A locking-based way to think about it:
    • All locks are held until COMMIT/ROLLBACK
    • But could be only tuple-level locks

  • So phantoms are still possible!

56 of 65

Snapshot Isolation

  • Possible in database systems that keep multiple versions of tuples

  • “Snapshot reads”: all the reads of a transaction are from the same transaction point (snapshot)
    • Typically as of transaction start

  • “Write validation”: the transaction can commit if none of its writes conflict with any transaction since time of transaction start
    • Abort on any write-write conflicts with overlapping transactions

57 of 65

Snapshot Isolation is not (quite) Serializable

x = R1(G0)

y = R2(P0)

W2(P1 = 110000)

W2(G1 = 220000)

Time

Pat (T1)

Gabi (T2)

P0: 200000

G0: 100000

Set my salary to be 10% bigger than Gabi’s

Set my salary to be 10% bigger than Pat’s

NOT equivalent to either order (not serializable!)

Write skew anomaly: concurrent reads, and writes reflect the fact that they didn’t read each other’s writes!

58 of 65

What do Database Engines provide?

  • Default isolation level varies!
  • Max isolation level varies!
  • Marketing varies!
    • When Oracle says “Serializable”�they actually give you Snapshot!!!
  • Many cloud DBMS max level is NOT serializable
    • Snowflake: RC
    • AWS Aurora: RC
    • Google Cloud Spanner: S
    • CockroachDB: S
    • Azure SQL Server: S

59 of 65

Summing Up

  • The ACID Transaction concept
  • Serializability as a touchstone for correctness
    • “Outcome oriented”
  • Conflict serializability as a conservative way to achieve serializability
  • Strict 2-phase-locking as an implementation of Conflict Serializability
  • Weak isolation modes to increase concurrency

60 of 65

Additional Slides

61 of 65

Protocol for Writes (Exclusive locks)

  • W1(O):
    • Check: Is O in the lock table?
    • If no:
      • Add entry (O, X, T1, NULL) to lock table
      • Allow T1 to write O and proceed
    • If so:
      • Add T1 to the Wait Queue for O

Item

Lock Mode

Granted

Wait Queue

O

S

T2, T3

62 of 65

Protocol for Reads (Shared locks)

  • R1(O):
    • Check: Is O in the lock table?
    • If no:
      • Add entry (O, S, T1, NULL) to lock table
      • Allow T1 to read O and proceed
    • If so
      • Check: is lock mode S?
      • If so:
        • Add T1 to the Granted list for O
        • Allow T1 to read O and proceed
      • If no, lock mode is X.
        • Add T1 to the Wait Queue for O

Item

Lock Mode

Granted

Wait Queue

O

S

T2, T3

63 of 65

Protocol for X Lock Release

  • Release T1(O):
    • Check: Is Wait Queue empty?
    • If so:
      • Delete O’s entry from lock table
    • If no:
      • Remove all mutually compatible items from the head of the wait queue (multiple S locks or a single X lock)
      • Set the lock mode for O to the mode of those items
      • Let those transactions proceed

Item

Lock Mode

Granted

Wait Queue

O

X

T1

T4:S, T5:S, T6:X

64 of 65

You can prove that:

  • Strict 2PL guarantees Conflict Serializability
    • All conflicts in the conflict graph are in the same order as the commit times of the transactions.
    • No cyclic conflict graphs are possible!

65 of 65

One remaining issue: Deadlock

T1 acquires X-lock on O

T2 acquires S-lock on P

T1 requests X-lock on P.

T1 is descheduled, waiting for T2.

T2 requests S-lock on O

T2 is descheduled, waiting for T1.

Solution: the system periodically detects deadlock cycles, and rolls back one transaction on the cycle.

T1

T2

Waits-for Graph. A slightly different kind of graph: each edge shows when a transaction is waiting for another transaction.

A cycle corresponds to a deadlock. No serializability violations happen, but transactions on the cycle are stuck and objects remain locked.