1 of 56

Is CAP Dead?

Presenters: Si and Youshan

Scribers: Richa and Shuo-Yang

1

Consistency, Availability, and Partition Tolerance

2 of 56

2

Presentation Roadmap

3 of 56

  • Consistency
  • Availability
  • Partition Tolerance

⇒ impossible to provide all three together

3

Eric Allen Brewer

CAP Theorem

4 of 56

  • Concurrency
  • Availability
  • Parallelization

⇒ possible to provide all three together

(only for web applications)

  • Strong consistency guarantees - easier to use but have limited performance
  • Weak consistency guarantees - harder to use but have better performance

It is possible to provide distributed txns with strong consistency on top of replication with no consistency guarantees

4

Yesquel 2015

Marcos K. Aguilera, Joshua B. Leners, Michael Walfish

TAPIR 2015

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,

Arvind Krishnamurthy, Dan R. K. Ports

5 of 56

5

6 of 56

Building Consistent Transactions with Inconsistent Replication

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,

Arvind Krishnamurthy, Dan R. K. Ports

University of Washington

6

7 of 56

TAPIR and CAP

IR provides fault tolerance without consistency

IR uses a single protocol for recovering failed replicas and running sync

TAPIR provides linearizable read-write txns

7

[CAP triangle: Consistency, Availability, Partition tolerance]

Designers still have to pick two out of the three. However, a CAP goal today is to maximize consistency and availability.

8 of 56

Common Architecture for Distributed Transactional Systems

  • Distributed Transactional Protocol
  • atomic commitment protocol + concurrency control mechanism
  • e.g., 2PC + (S)2PL / OCC
  • Replication Protocol
  • e.g., Paxos, Viewstamped Replication
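To make the layering concrete, here is a minimal, hypothetical sketch (made-up names, not the interface of any real system) of a 2PC coordinator sitting on top of shards whose operations go through a stand-in replication log:

```python
# Hypothetical sketch of the conventional layering: a 2PC coordinator over
# shards whose state is kept consistent by a replication protocol.
# ReplicatedShard / two_phase_commit are illustrative names only.

class ReplicatedShard:
    """A shard whose operations are ordered by a consensus log (stand-in for Paxos)."""
    def __init__(self, name):
        self.name = name
        self.log = []          # totally ordered op log enforced by "Paxos"
        self.locks = set()     # S2PL-style locks, held until commit/abort

    def replicate(self, op):
        # In a real system this is a Paxos/VR round across replicas;
        # here we append to a local log to show where the coordination cost sits.
        self.log.append(op)

    def prepare(self, txn_id, keys):
        if self.locks & set(keys):               # lock conflict -> vote no
            return False
        self.locks |= set(keys)
        self.replicate(("prepare", txn_id, tuple(keys)))
        return True

    def finish(self, txn_id, keys, commit):
        self.replicate(("commit" if commit else "abort", txn_id))
        self.locks -= set(keys)


def two_phase_commit(writes, txn_id):
    """writes: {shard: [keys]}. Phase 1 collects votes; phase 2 commits or aborts everywhere."""
    votes = [shard.prepare(txn_id, keys) for shard, keys in writes.items()]
    commit = all(votes)
    for shard, keys in writes.items():
        shard.finish(txn_id, keys, commit)
    return commit


a, b = ReplicatedShard("A"), ReplicatedShard("B")
print(two_phase_commit({a: ["x"], b: ["y"]}, txn_id=1))   # True: both shards voted yes
```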

8

Google Spanner: 2PC + S2PL + Paxos

9 of 56

Why is consistency so expensive?

  • Cross-shard coordination (2PC)
  • enforce a serial ordering of txns
  • Cross-replica coordination (Paxos)
  • enforce a serial ordering of ops

Insight

Existing transactional storage systems incorporate a transactional protocol and a replication protocol that both enforce strong consistency

9

10 of 56

Motivation

How we go about it

Is it possible to provide distributed transactions with better performance and strong consistency (linearizable read-write transaction ordering)?

  • strongly consistent protocols impose a high performance cost
  • weakly consistent protocols are less costly but provide weaker guarantees

  • IR (Inconsistent Replication)
  • provides fault tolerance without consistency
  • needs no cross-replica coordination or designated leader for op processing
  • TAPIR (Transactional Application Protocol for IR)
  • provides linearizable read-write txns
  • supports globally-consistent reads across the database at a timestamp
  • TAPIR-KV
  • high-performance transactional storage achieved by TAPIR + IR

10

[Architecture diagram: TAPIR runs on top of per-shard replica groups, each combining OCC (concurrency control at the TAPIR layer) with IR (replication).]

11 of 56

IR Overview

  • Fault-tolerance without consistency
  • unordered op set instead of ordered op log
  • Two-mode interface
  • inconsistent: execute in any order at each replica; conflicts resolved afterwards (similar to Cassandra)
  • consensus: execute in any order at each replica; return a single consensus result
  • Recovery & Sync by a single protocol
  • Guarantee
  • fault tolerance: any op in the op set persists in at least one replica in any quorum
  • visibility: for any 2 ops in the op set, at least one is visible to the other
  • consensus result: persist in at least one replica in any quorum
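A minimal sketch of the replica-side state these bullets imply (an unordered op record with tentative/finalized entries); class and method names are illustrative, not taken from the IR/TAPIR code:

```python
# Illustrative sketch of an IR replica's record: an *unordered* set of operations,
# each marked TENTATIVE when first recorded and FINALIZED once the client confirms.

TENTATIVE, FINALIZED = "tentative", "finalized"

class IRReplica:
    def __init__(self):
        self.record = {}          # op_id -> (op, result, state); no ordering imposed

    def exec_inconsistent(self, op_id, op):
        # Execute in any order; conflicts are resolved later by the application.
        self.record[op_id] = (op, None, TENTATIVE)
        return True                                    # ack to the client

    def exec_consensus(self, op_id, op, execute):
        # Execute in any order but return this replica's (tentative) result.
        result = execute(op)
        self.record[op_id] = (op, result, TENTATIVE)
        return result

    def finalize(self, op_id, result=None):
        op, old_result, _ = self.record[op_id]
        self.record[op_id] = (op, result if result is not None else old_result, FINALIZED)
```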

11

12 of 56

IR Protocol - Inconsistent op processing

IR can successfully complete an inconsistent op with a single round-trip to f+1 replicas and no coordination across replicas

12

[Message flow: the IR client proposes the op to all replicas; each IR replica replies after tentatively recording the op; the client (asynchronously) finalizes after receiving f + 1 replies and returns to the app.]
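A sketch of the client side of this flow, reusing the illustrative IRReplica from the previous sketch (the real protocol finalizes asynchronously; here it is shown inline):

```python
# Sketch of IR's inconsistent-op client path: one round trip to f + 1 replicas,
# no cross-replica coordination. Replicas provide exec_inconsistent / finalize
# from the previous sketch.

import uuid

def invoke_inconsistent(replicas, op, f):
    op_id = uuid.uuid4().hex
    # Propose the op to all replicas; each tentatively records it and acks.
    acks = sum(1 for replica in replicas if replica.exec_inconsistent(op_id, op))
    assert acks >= f + 1, "need f + 1 replies before returning to the app"
    # (Asynchronous in the real protocol:) finalize the op on the replicas.
    for replica in replicas:
        replica.finalize(op_id)
    return op_id
```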

13 of 56

IR Protocol - Consensus op processing

Fast path: a single round trip to ⌈3f/2⌉ + 1 replicas

Slow path: two round trips to at least f + 1 replicas

13

Fast Path: the IR client proposes the op to all replicas; each replica replies (with its result) after tentatively recording the op; the client (asynchronously) finalizes with the result after receiving at least ⌈3f/2⌉ + 1 matching results within a timeout and returns to the app; replicas confirm after (possibly) updating their records.

Slow Path: the client proposes the op to all replicas; replicas reply (with results) after tentative recording; the client (asynchronously) finalizes with a result after receiving f + 1 replies; replicas confirm after (possibly) updating their records, and only then does the client return to the app.
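A sketch of the consensus-op client path with both paths from this slide; timeouts and asynchrony are simplified away, and decide() is the application-supplied function (e.g., TAPIR's prepare decision) mentioned later in the deck:

```python
# Sketch of IR's consensus-op client path (fast and slow paths, simplified).
# Replicas provide exec_consensus / finalize from the earlier sketch.

import math, uuid
from collections import Counter

def invoke_consensus(replicas, op, decide, execute, f):
    op_id = uuid.uuid4().hex
    results = [r.exec_consensus(op_id, op, execute) for r in replicas]

    fast_quorum = math.ceil(3 * f / 2) + 1
    value, count = Counter(results).most_common(1)[0]
    if count >= fast_quorum:
        # Fast path: enough matching results -> finalize (async) and return.
        final = value
    else:
        # Slow path: the application decides from at least f + 1 results,
        # then a second round finalizes that result on the replicas.
        final = decide(results[: f + 1])
    for r in replicas:
        r.finalize(op_id, final)              # replicas confirm / update records
    return final
```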

14 of 56

IR Protocol - Replica Recovery & Sync

  • A single protocol for recovering failed replicas and running periodic sync
  • introduce view changes (similar to Viewstamped Replication)

  • Each IR view change is run by a leader that
  • coordinates only view changes, not op processing
  • must merge records from the latest view into the “master record” that is used to sync replicas
  • relies on app protocol to determine consensus results

14

15 of 56

TAPIR Overview

  • Built atop IR
  • leverage IR’s weak guarantees to provide high-performance linearizable txns
  • apps using TAPIR do not interact with IR directly
  • Clients are front-end app servers
  • possibly located in the same datacenter
  • used as 2PC coordinators
  • Replica state
  • multi-versioned data store (versions identified by timestamps)
  • txn log (committed and aborted txns in timestamp order)

15

16 of 56

TAPIR - Transaction Processing

  • 2PC + OCC + IR support
  • Prepare is the only consensus op
  • decide: if a majority of the replicas replied PREPARE-OK, then commit the txn
  • Commit and Abort are inconsistent ops
  • TAPIR can commit a txn with a single round-trip to all replicas in all shards
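A sketch of how this mapping of 2PC onto IR could look, reusing the invoke_* helpers and IRReplica from the earlier sketches; occ_check stands in for TAPIR's per-replica OCC validation, and all names here are illustrative:

```python
# Sketch: Prepare as the only consensus op (decide = majority PREPARE-OK),
# Commit/Abort as inconsistent ops, one IR replica group per shard.

PREPARE_OK, ABORT = "prepare-ok", "abort"

def decide_prepare(results):
    # Slide: commit if a majority of the replicas replied PREPARE-OK.
    ok = sum(1 for r in results if r == PREPARE_OK)
    return PREPARE_OK if ok > len(results) // 2 else ABORT

def tapir_commit(shards, txn, timestamp, occ_check, f):
    votes = []
    for replicas in shards:                      # Prepare is the only consensus op
        vote = invoke_consensus(replicas, ("prepare", txn, timestamp),
                                decide_prepare, occ_check, f)
        votes.append(vote)
    decision = "commit" if all(v == PREPARE_OK for v in votes) else "abort"
    for replicas in shards:                      # Commit / Abort are inconsistent ops
        invoke_inconsistent(replicas, (decision, txn, timestamp), f)
    return decision

# usage sketch: 3 shards, each with 3 replicas, tolerating f = 1 failure
shards = [[IRReplica() for _ in range(3)] for _ in range(3)]
always_ok = lambda op: PREPARE_OK        # stand-in OCC check that never finds conflicts
print(tapir_commit(shards, txn="t1", timestamp=42, occ_check=always_ok, f=1))  # "commit"
```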

16

17 of 56

Spanner-like system (2PC + S2PL + Paxos) vs. TAPIR (2PC + OCC + IR)

17

18 of 56

Evaluation

  • Does TAPIR improve latency and throughput compared to conventional transactional protocols in both datacenter and wide-area environments?
  • Does TAPIR’s abort rate scale similarly to other OCC-based transaction protocols as contention increases?
  • Is TAPIR comparable to weakly consistent systems?

18

19 of 56

Experimental Setup

  • Google Compute Engine (GCE)
  • VMs placed in 3 geographical regions: US, Europe, Asia
  • Average RTT: Euro-Asia (260ms), Euro-US (110ms), US-Asia (165ms)
  • Comparison systems
  • OCC-STORE (2PC + OCC + Multi-Paxos)
  • LOCK-STORE (Spanner-like; 2PC + S2PL + Multi-Paxos)
  • Workload
  • Retwis: open-source Twitter clone using Redis
  • YCSB+T: extension of YCSB wrapping DB ops inside simple txns

19

20 of 56

Average Retwis transaction latency vs. throughput within a datacenter

  • At low offered load, TAPIR-KV has lower latency
  • single round trip commit
  • TAPIR-KV can provide 3X peak throughput
  • no leader & no cross-replica coord.

20

21 of 56

Average wide-area Retwis transaction latency

  • Leader located in US and client in US, Europe or Asia

21

  • Sharing a DC with the leader: TAPIR-KV (T) must wait for responses from all replicas; OCC-STORE (O) and LOCK-STORE (L) can commit at the local leader

  • In a different DC: LOCK-STORE (L) must go to the leader for locks; OCC-STORE (O) can go to a local replica

22 of 56

Abort rates at varying Zipf coefficients

22

Lower commit latency reduces the time between Prepare and Commit/Abort, making txns less likely to abort

At higher Zipf coefficients, the most popular keys are accessed very frequently

23 of 56

Comparison with weakly consistent storage systems

23

  • TAPIR-KV outperforms MongoDB

  • TAPIR-KV has throughput within a factor of 2 of Cassandra and Redis

24 of 56

Summary

  • Replication does not have to be consistent for transactions to be consistent
  • lower latency and higher throughput compared to conventional transactional storage systems
  • performance competitive with weakly consistent systems while providing much stronger guarantees

24

25 of 56

25

26 of 56

PROS

  • Novel idea: reducing the cost of redundant consistency
  • TAPIR achieved 50% better latency and 3x better throughput compared to distributed systems using conventional protocols
  • Leaders are no longer required, which reduces communication time
  • Having timestamps reduces the chance of duplication in the system

CONS

  • Designing application protocols is difficult; if not designed correctly, the system’s performance will suffer
  • Invariant checks must be performed pairwise, which may be limiting for application protocols
  • IR cannot support applications that require the entire history of transactions
  • No real-world application uses this technique yet; it needs more testing at industry scale

26

27 of 56

Piazza Reviews

QUESTIONS

  • Is there a way to get around the pairwise requirement and make the system more robust?
  • TAPIR has no leader, but the client acts like a leader. What if it fails?
  • Is this type of distributed transaction going to become more popular in the future?

COMMENTS

  • Standard deviation numbers are not provided
  • The authors prove the correctness of IR and TAPIR instead of relying only on experimental results
  • If there is a lot of contention between requests, the system may get stuck retrying operations without doing much useful work; abort rates are high under heavy contention

27

28 of 56

Suggestions for Future Development

Running the fast and slow paths for operation processing in parallel in order to account for varying round trip times.

Read-write transactions can be combined with Spanner’s read-only transactions to provide high performance read-write and read-only transactions.

Code for TAPIR is available to public:

https://gitlab.cs.washington.edu/syslab/tapir

28

29 of 56

29

30 of 56

Yesquel

scalable SQL storage for Web applications

30

31 of 56

Yesquel and CAP

[Diagram relating Yesquel’s mechanisms to CAP: consistency via concurrency control protocols and logs and replicas for data requests; availability via load splits and replits under heavy requests and parallelized reads under failures; partition tolerance via local, independent query computation that requires no cross-node concurrency.]

31

32 of 56

32

33 of 56

Problems with existing storage techniques

  • SQL:
    • Rich functionality: joins, compression, secondary indexes, transactions, etc.
    • Not scalable
  • NoSQL:
    • Weak functionality: lacks SQL features.
    • Manual caching, sharding & denormalization
      • increased application complexity, error-prone, limited
    • Each existing NoSQL system has a small subset of features.
    • Lock-in problem: hard to replace one NoSQL system with another.

33

Solution for web apps: Yesquel = NoSQL performance + SQL functionality

rich database system = simple application

weak database system = complex application

NoSQL Example

34 of 56

Yesquel overview

34

  • support all SQL features, efficient operations to insert, lookup, delete, and enumerate items in order
  • support transactions that span many ordered maps

query processor = transactional ordered maps

primary key → the rest of the row

secondary index → primary key
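An illustrative sketch (not Yesquel code) of this mapping: each table becomes an ordered map from primary key to the rest of the row, and each secondary index becomes an ordered map back to the primary key. The OrderedMap class is a toy stand-in for YDBT:

```python
# Toy stand-in for a transactional ordered map: insert, lookup, delete omitted
# for brevity, enumerate in key order. Names and layout are illustrative only.

from bisect import insort, bisect_left

class OrderedMap:
    def __init__(self):
        self._keys, self._vals = [], {}

    def insert(self, key, value):
        if key not in self._vals:
            insort(self._keys, key)          # keep keys sorted for in-order scans
        self._vals[key] = value

    def lookup(self, key):
        return self._vals.get(key)

    def enumerate_from(self, key):
        for k in self._keys[bisect_left(self._keys, key):]:
            yield k, self._vals[k]

# table users(id PRIMARY KEY, name, email) with a secondary index on name
users      = OrderedMap()                    # primary key -> the rest of the row
name_index = OrderedMap()                    # (name, id) -> primary key

def insert_user(uid, name, email):
    users.insert(uid, {"name": name, "email": email})
    name_index.insert((name, uid), uid)

insert_user(2, "bob", "bob@example.com")
insert_user(1, "alice", "alice@example.com")
print(users.lookup(name_index.lookup(("bob", 2))))   # secondary index -> primary key -> row
print(list(users.enumerate_from(1)))                 # rows enumerated in primary-key order
```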

35 of 56

Yesquel overview (cont.)

35

  • support all SQL features, efficient operations to insert, lookup, delete, and enumerate items in order
  • support transactions that span many ordered maps
  • one in each client application

query processor = transactional ordered maps

  • partition the data of the ordered maps and store the tree nodes across the transactional storage servers
  • is consistent, fault-tolerant, latency efficient

YDBT = distributed balanced tree

[Diagram: the query processor performs the computation; YDBT stores the data. A lookup searches the tree and fetches the key and its value from the storage servers.]

36 of 56

YDBT feature: B+ tree => locality, scalability, latency efficiency

  • B-tree (self-balancing tree):
    • (key, value) ordered map = (index, data)
    • contiguous key intervals
  • Locality: neighboring keys are often in the same server, allowing them to be fetched together during enumeration.
  • B+tree:
    • Interior nodes: (key) → only used as guides.
    • Leaf nodes: (key, value) → where values are stored.
    • The leaf level forms a doubly-linked list.
  • For trees with high fan-out (like a file system), the linked leaf nodes reduce the number of I/O operations required to find an element in the tree.
  • Better locality (efficient fetching of neighboring nodes)

36

Data and logs are stored in a distributed fashion across the storage servers (local storage). Each interior node also stores its children’s key ranges.

37 of 56

[Figure: example YDBT node with fence interval 5~7; fence interval = the key range a node is responsible for.]

37

38 of 56

YDBT feature 1: load splits => balance load

split the workload of a popular server

38

[Figure: clients A-F send data requests to servers 1-3; server 3 has a larger workload. After a load split adds server 4 and moves part of server 3’s key range to it, each server gets a similar workload.]
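A hypothetical sketch of the load-split decision illustrated above (the imbalance threshold and bookkeeping are made up, not Yesquel’s actual policy):

```python
# Hypothetical load split: when one server handles far more requests than the
# others, split its hottest node's key range and hand half of it to the least
# loaded server.

def load_split(servers):
    """servers: {name: {(lo, hi) key range: request_count}} (illustrative shape)."""
    load = {s: sum(nodes.values()) for s, nodes in servers.items()}
    hot, cold = max(load, key=load.get), min(load, key=load.get)
    if load[hot] <= 2 * load[cold]:              # imbalance threshold (made up)
        return None
    node, count = max(servers[hot].items(), key=lambda kv: kv[1])
    lo, hi = node
    mid = (lo + hi) // 2
    # keep [lo, mid) on the hot server, move [mid, hi) to the cold server
    del servers[hot][node]
    servers[hot][(lo, mid)] = count // 2
    servers[cold][(mid, hi)] = count - count // 2
    return hot, cold, (mid, hi)

servers = {"s1": {(0, 100): 10}, "s2": {(100, 200): 12}, "s3": {(200, 300): 80}}
print(load_split(servers))   # moves half of server 3's hot key range to server 1
```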

39 of 56

YDBT feature 1: replits => balance load, fault tolerance

If a node is popular for read requests, replicate it (the node on the popular server)

If it becomes popular for update requests, remove the replicas

39

[Figure: when a node on server 3 becomes read-popular, it is replicated onto server 4 (a replit), so read requests spread across both copies and each server gets a similar workload.]

40 of 56

YDBT feature 1: delegate splits => serialize splits & reduce latency

40

[Figure: nodes b and c are both popular. 1) Splitting them all at once can assign both to the same new server (server 4), recreating the imbalance. 2) Splitting in series, by delegating splits to a split processor, serializes the splits so the load spreads across servers 4 and 5.]

41 of 56

YDBT feature 2: back-down search => use of caching

  • cache inner nodes without coherence
  • maintain a stack for backtrack

41

[Figure (slide 41): example lookup of key 7. The client walks down inner nodes held in its cache (e.g., >=1, >=3, >=5, >=7, >=9 spread over servers 1-3), pushing visited nodes on a stack. Validation checks that the key falls in the fetched node’s fence interval (the key range of that node). Fetch = reading a node remotely from the server that stores it.]

42 of 56

YDBT feature 2: back-down search => use of caching

  • cache inner nodes without coherence
  • maintain a stack for backtrack

42

in client’s cache

VALIDATION:

fence interval = the key range of this node

>=5

server 3

server 2

server 1

>=1

>=3

>=7

>=9

8

9

Ex. what is data for index 7?

need to update my cache

whoops! not found!

fetch = reading a node remotely from the server that stores it.

>=5

>=7

stack

>=8

>=9

5

7

43 of 56

43

[Figure (slide 43): the server’s index and data have moved (the node was split), but the client’s cached copy has not been updated yet, so the lookup for key 7 misses (“whoops! not found!”); fence interval = key range.]

44 of 56

44

[Figure (slide 44): the client fetches the >=7 node from the server, updates its cache, and the lookup for key 7 now succeeds (“found it!”).]
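A simplified, hypothetical sketch of back-down search as walked through on slides 41-44 (the node layout, cache shape, and helper names are made up):

```python
# Back-down search sketch: walk down cached inner nodes keeping a stack of
# visited nodes; on every node actually used, validate the fence interval;
# on a miss, drop the stale copy, back up the stack, and refetch fresh nodes.

def back_down_search(key, root, cache, fetch):
    """Nodes: {'fence': (lo, hi), 'children': [(sep, node_id)]} or {'fence': ..., 'leaf': {...}}."""
    stack, node_id = [], root
    while True:
        node = cache.get(node_id)
        if node is None:
            node = fetch(node_id)             # remote read from the storage server
        lo, hi = node["fence"]
        if not (lo <= key < hi):              # VALIDATION failed: node split or moved
            cache.pop(node_id, None)          # drop the stale copy
            node_id = stack.pop() if stack else root
            cache.pop(node_id, None)          # force a fresh fetch of the parent too
            continue
        if "leaf" in node:
            return node["leaf"].get(key)
        cache[node_id] = node                 # cache inner nodes without coherence
        stack.append(node_id)
        # descend to the rightmost child whose separator key is <= key
        node_id = max((c for c in node["children"] if c[0] <= key),
                      key=lambda c: c[0])[1]

# toy "servers": node_id -> node; a real YDBT spreads these across storage servers
SERVERS = {
    "root":  {"fence": (0, 100), "children": [(0, "leafA"), (50, "leafB")]},
    "leafA": {"fence": (0, 50),  "leaf": {7: "seven"}},
    "leafB": {"fence": (50, 100), "leaf": {77: "seventy-seven"}},
}
fetch = SERVERS.__getitem__      # "fetch = reading a node remotely from the server"
cache = {}                       # client-side cache of inner nodes, no coherence
print(back_down_search(7, "root", cache, fetch))    # 'seven'
```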

45 of 56

YDBT feature 3: atomic transactions => concurrency

  • multi-version concurrency control (MVCC) for concurrent access
    • transaction T: a logical unit of work on the database (e.g., read/write/update)
    • TimeStamp (TS), Read TimeStamp (RTS), Write TimeStamp (WTS)
    • read-write and write-write accesses are mutually exclusive

45

one client → many transactions

one object → many RTS/WTS (one per version)

On a write, check the transaction’s timestamp against the object’s existing read and write timestamps; if it is smaller than one of them, abort (and retry later). Otherwise, create a new version of the object tagged with the transaction’s timestamp.
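A minimal sketch of this timestamp check (classic MVCC / timestamp ordering, simplified; not Yesquel’s exact rules):

```python
# Multi-versioned object with read/write timestamps; writes that would
# invalidate a later read or write are aborted, otherwise a new version is added.

class MVObject:
    def __init__(self):
        self.versions = []      # list of (wts, value), one per committed write
        self.max_rts = 0        # largest timestamp of any read so far

    def read(self, ts):
        self.max_rts = max(self.max_rts, ts)
        older = [(w, v) for w, v in self.versions if w <= ts]
        return max(older)[1] if older else None   # latest version visible at ts

    def write(self, ts, value):
        # Abort if a read or write with a later timestamp already happened.
        if ts < self.max_rts or any(w > ts for w, _ in self.versions):
            return False                           # caller retries later
        self.versions.append((ts, value))          # new version tagged with ts
        return True

obj = MVObject()
obj.write(1, "a")
print(obj.read(5))        # 'a'   (records RTS = 5)
print(obj.write(3, "b"))  # False: a later read (ts=5) already saw the object
```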

[Timeline figure: clients C1 and C2 read the object, recording RTS(T1) and RTS(T2); client C3 writes, recording WTS(T3) and creating a new version; a conflicting write with a too-small timestamp is aborted. Trade-off: storage consumption (many versions per object).]

46 of 56

YDBT feature 3: atomic transactions => concurrency

  • run commit at the clients
    • Use the Sinfonia protocol to recover from crashes: a recovery process periodically visits the servers to check on pending transactions
    • trade-off: complexity
  • clocks with transactional protocol
    • two-phase commit: commit request + commit handling
    • two-phase locking: lock an object + release the lock
    • timestamp-ordering: transactions in order
    • trade-off: time drifts
  • snapshots = state of data
    • obtain snapshots for free from its transactional storage, which ensures that each transaction reads from a snapshot

46

47 of 56

Yesquel query processor optimizations

  • deferred seek = read keys without values (request the values later on)
    • trade-off: seeking an iterator causes an RPC (remote procedure call) to fetch a leaf node without its value.

47

  • deferred write = buffer small writes at the client until the transaction commits, then send them in one large write
  • optimistic insert = reuse back-down search to insert without first fetching the leaf node.
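An illustrative sketch of the deferred-write idea: small writes are buffered in the client-side transaction and flushed in one batch at commit time (class name and the send_batch hook are hypothetical):

```python
# Deferred writes: no per-write RPC; all buffered writes go out with the commit.

class DeferredWriteTxn:
    def __init__(self, send_batch):
        self.buffer = {}                 # key -> value, latest write wins
        self.send_batch = send_batch     # hypothetical hook: one RPC with all writes

    def write(self, key, value):
        self.buffer[key] = value         # buffered locally, no RPC here

    def read(self, key, fetch):
        if key in self.buffer:           # read-your-own-writes from the buffer
            return self.buffer[key]
        return fetch(key)

    def commit(self):
        return self.send_batch(self.buffer)   # flush everything with the commit

store = {}
txn = DeferredWriteTxn(send_batch=lambda writes: (store.update(writes), True)[1])
txn.write("k1", "v1"); txn.write("k2", "v2")
print(txn.read("k1", fetch=store.get))   # 'v1' from the buffer, no server round trip
print(txn.commit(), store)               # True {'k1': 'v1', 'k2': 'v2'}
```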

48 of 56

Yesquel limitations

  • Limited target usage: web applications
    • Many operations in the Yesquel API modify only a small part of a node’s state.
    • Long-lived queries that touch a lot of data (e.g., computing an average over all data) would move all the data from the storage servers to the query processors (data analytics). ⇒ better to ship computation to the data (e.g., computing partial averages at each server)

48

CAP?

  • Relational operators are not distributed (they run at the client)
    • SQLite (the query processor) is not distributed.
    • Possible fix: run relational operators at the storage servers.

49 of 56

Yesquel limitations (cont.)

  • Time drift
  • The implementation lacks
    • replicated storage servers (using Paxos or primary-backup)
    • replicated keys (using replits)
  • In the worst case, the validation process of back-down search may take 100 iterations.

49

50 of 56

Yesquel evaluation - baseline

50

[Table: feature matrix comparing BASE, BASE+, and YDBT on DBT, MVCC, back-down search, and splits; the baselines omit some of the features that YDBT includes.]

  • back-down (YDBT) vs. non-back-down (BASE): fraction of root accesses
  • load splits (YDBT) vs. size splits (BASE+): insert operations
  • MVCC (YDBT) vs. no MVCC (Minuet): scan operations (retrieving ~10% of keys) = snapshots

51 of 56

Alternative tech: Network DB

MySQL Network database (MySQL-NDB)

  • network data model: an upside-down tree that allows many-to-many relations

  • SQL functionality
  • shared-nothing: no single point of failure
  • auto-partitioning
  • difficult to modify
  • no MVCC for concurrent access
  • no table compression
  • network I/O consumption hurts performance

Yesquel spreads its nodes over the network

  • Asynchronous data replication

Yesquel uses clocks to synchronize (time drifts are rare)

51

52 of 56

Yesquel evaluation (cont.) - one server

52

  • NoSQL performs better at the expense of SQL functionality
  • NDB performs the worst
  • Yesquel performs well while providing SQL functionality
  • NoSQL uses fewer client & network resources but more server resources
  • NDB performs the worst
  • Yesquel uses more client & network resources (larger network transmission) but fewer server resources

53 of 56

Yesquel evaluation (cont.) - many servers

53

All systems scale linearly

Yesquel & Redis scale linearly; NDB is limited by coarse-grained locking

54 of 56

Summary

  • Yesquel is a system providing performance and scalability comparable to NoSQL systems, and at the same time with all the features of SQL systems
  • Yesquel is designed for serving Web applications, not as a general-purpose database system
  • The architecture of Yesquel supplies a query processor per client and a storage engine (YDBT) that serves all of the query processors
  • YDBT combines an optimistic caching scheme with back-down search to provide data locality
  • YDBT splits its nodes, as in balanced trees, using a technique called replit

54

55 of 56

Pros

  • Better scalability than SQL while preserving the functionality
  • Designs a novel data structure (YDBT) to store data across the network
  • New caching scheme
  • New mechanisms and policies to split nodes
  • New ways to handle and avoid contention
  • New approach to tree snapshots

Cons

  • Additional network bandwidth, memory, and client CPU cycles
  • As clients execute all relational operators, iterative or statistical queries can be very expensive

55

56 of 56

Questions

  • What are some possible ways to solve Yesquel’s bad performance on data analysis queries that operate on lots of data?

56

  • Is it possible to improve the performance of Yesquel with the approach proposed in the paper “Building Consistent Transactions with Inconsistent Replication”?
  • Are enterprises nowadays truly using Yesquel-like distributed database systems?
  • Is Wikipedia a representative workload for a web application?