1 of 56

Is CAP Dead?

Presenters: Si and Youshan

Scribers: Richa and Shuo-Yang

1

Consistency, Availability, and Partition Tolerance

2 of 56

2

Presentation Roadmap

3 of 56

  • Consistency
  • Availability
  • Partition Tolerance

⇒ impossible to provide all three together

3

Eric Allen Brewer

CAP Theorem

4 of 56

  • Concurrency
  • Availability
  • Parallelization

⇒ possible to provide all three together

(only for web applications)

  • Strong consistency guarantees - easier to use but have limited performance
  • Weak consistency guarantees - harder to use but have better performance

It is possible to provide distributed txns with strong consistency on top of replication with no consistency guarantees

4

Yesquel 2015

Marcos K. Aguilera, Joshua B. Leners, Michael Walfish

TAPIR 2015

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,

Arvind Krishnamurthy, Dan R. K. Ports

5 of 56

5

6 of 56

Building Consistent Transactions with Inconsistent Replication

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,

Arvind Krishnamurthy, Dan R. K. Ports

University of Washington

6

7 of 56

TAPIR and CAP

IR provides fault tolerance without consistency

IR uses a single protocol for recovering failed replicas and running sync

TAPIR provides linearizable read-write txns

7

[CAP triangle: Consistency, Availability, Partition tolerance]

Designers still have to pick two out of the three. However, a CAP goal today is to maximize consistency and availability.

8 of 56

Common Architecture for Distributed Transactional Systems

  • Distributed Transactional Protocol
  • atomic commitment protocol + concurrency control mechanism
  • e.g., 2PC + (S)2PL / OCC
  • Replication Protocol
  • e.g., Paxos, Viewstamped Replication
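To make the layering concrete, here is a minimal, hypothetical sketch (made-up names, not the interface of any real system) of a 2PC coordinator sitting on top of shards whose operations go through a stand-in replication log:

```python
# Hypothetical sketch of the conventional layering: a 2PC coordinator over
# shards whose state is kept consistent by a replication protocol.
# ReplicatedShard / two_phase_commit are illustrative names only.

class ReplicatedShard:
    """A shard whose operations are ordered by a consensus log (stand-in for Paxos)."""
    def __init__(self, name):
        self.name = name
        self.log = []          # totally ordered op log enforced by "Paxos"
        self.locks = set()     # S2PL-style locks, held until commit/abort

    def replicate(self, op):
        # In a real system this is a Paxos/VR round across replicas;
        # here we append to a local log to show where the coordination cost sits.
        self.log.append(op)

    def prepare(self, txn_id, keys):
        if self.locks & set(keys):               # lock conflict -> vote no
            return False
        self.locks |= set(keys)
        self.replicate(("prepare", txn_id, tuple(keys)))
        return True

    def finish(self, txn_id, keys, commit):
        self.replicate(("commit" if commit else "abort", txn_id))
        self.locks -= set(keys)


def two_phase_commit(writes, txn_id):
    """writes: {shard: [keys]}. Phase 1 collects votes; phase 2 commits or aborts everywhere."""
    votes = [shard.prepare(txn_id, keys) for shard, keys in writes.items()]
    commit = all(votes)
    for shard, keys in writes.items():
        shard.finish(txn_id, keys, commit)
    return commit


a, b = ReplicatedShard("A"), ReplicatedShard("B")
print(two_phase_commit({a: ["x"], b: ["y"]}, txn_id=1))   # True: both shards voted yes
```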

8

Google Spanner: 2PC + S2PL + Paxos

9 of 56

Why is consistency so expensive?

  • Cross-shard coordination (2PC)
  • enforce a serial ordering of txns
  • Cross-replica coordination (Paxos)
  • enforce a serial ordering of ops

Insight

Existing transactional storage systems incorporate a transactional protocol and a replication protocol that both enforce strong consistency

9

10 of 56

Motivation

How we go about it

Is it possible to provide distributed transactions with better performance and strong consistency (linearizable read-write transaction ordering)?

  • strongly consistent protocols impose a high performance cost
  • weakly consistent protocols are less costly but provide weaker guarantees

  • IR (Inconsistent Replication)
  • provides fault tolerance without consistency
  • needs no cross-replica coordination or designated leader for op processing
  • TAPIR (Transactional Application Protocol for IR)
  • provides linearizable read-write txns
  • supports globally-consistent reads across the database at a timestamp
  • TAPIR-KV
  • high-performance transactional storage achieved by TAPIR + IR

10

[Architecture diagram: TAPIR runs on top of per-shard replica groups, each combining OCC (concurrency control at the TAPIR layer) with IR (replication).]

11 of 56

IR Overview

  • Fault-tolerance without consistency
  • unordered op set instead of ordered op log
  • Two-mode interface
  • inconsistent: execute in any order at each replica; conflicts resolved afterwards (similar to Cassandra)
  • consensus: execute in any order at each replica; return a single consensus result
  • Recovery & Sync by a single protocol
  • Guarantee
  • fault tolerance: any op in the op set persists in at least one replica in any quorum
  • visibility: for any 2 ops in the op set, at least one is visible to the other
  • consensus result: persist in at least one replica in any quorum
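A minimal sketch of the replica-side state these bullets imply (an unordered op record with tentative/finalized entries); class and method names are illustrative, not taken from the IR/TAPIR code:

```python
# Illustrative sketch of an IR replica's record: an *unordered* set of operations,
# each marked TENTATIVE when first recorded and FINALIZED once the client confirms.

TENTATIVE, FINALIZED = "tentative", "finalized"

class IRReplica:
    def __init__(self):
        self.record = {}          # op_id -> (op, result, state); no ordering imposed

    def exec_inconsistent(self, op_id, op):
        # Execute in any order; conflicts are resolved later by the application.
        self.record[op_id] = (op, None, TENTATIVE)
        return True                                    # ack to the client

    def exec_consensus(self, op_id, op, execute):
        # Execute in any order but return this replica's (tentative) result.
        result = execute(op)
        self.record[op_id] = (op, result, TENTATIVE)
        return result

    def finalize(self, op_id, result=None):
        op, old_result, _ = self.record[op_id]
        self.record[op_id] = (op, result if result is not None else old_result, FINALIZED)
```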

11

12 of 56

IR Protocol - Inconsistent op processing

IR can successfully complete an inconsistent op with a single round-trip to f+1 replicas and no coordination across replicas

12

[Message flow: the IR client proposes the op to all replicas; each IR replica replies after tentatively recording the op; the client (asynchronously) finalizes after receiving f + 1 replies and returns to the app.]
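A sketch of the client side of this flow, reusing the illustrative IRReplica from the previous sketch (the real protocol finalizes asynchronously; here it is shown inline):

```python
# Sketch of IR's inconsistent-op client path: one round trip to f + 1 replicas,
# no cross-replica coordination. Replicas provide exec_inconsistent / finalize
# from the previous sketch.

import uuid

def invoke_inconsistent(replicas, op, f):
    op_id = uuid.uuid4().hex
    # Propose the op to all replicas; each tentatively records it and acks.
    acks = sum(1 for replica in replicas if replica.exec_inconsistent(op_id, op))
    assert acks >= f + 1, "need f + 1 replies before returning to the app"
    # (Asynchronous in the real protocol:) finalize the op on the replicas.
    for replica in replicas:
        replica.finalize(op_id)
    return op_id
```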

13 of 56

IR Protocol - Consensus op processing

Fast path: a single round trip to ⌈3f/2⌉ + 1 replicas

Slow path: two round trips to at least f + 1 replicas

13

Fast Path: the IR client proposes the op to all replicas; each replica replies (with its result) after tentatively recording the op; the client (asynchronously) finalizes with the result after receiving at least ⌈3f/2⌉ + 1 matching results within a timeout and returns to the app; replicas confirm after (possibly) updating their records.

Slow Path: the client proposes the op to all replicas; replicas reply (with results) after tentative recording; the client (asynchronously) finalizes with a result after receiving f + 1 replies; replicas confirm after (possibly) updating their records, and only then does the client return to the app.
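A sketch of the consensus-op client path with both paths from this slide; timeouts and asynchrony are simplified away, and decide() is the application-supplied function (e.g., TAPIR's prepare decision) mentioned later in the deck:

```python
# Sketch of IR's consensus-op client path (fast and slow paths, simplified).
# Replicas provide exec_consensus / finalize from the earlier sketch.

import math, uuid
from collections import Counter

def invoke_consensus(replicas, op, decide, execute, f):
    op_id = uuid.uuid4().hex
    results = [r.exec_consensus(op_id, op, execute) for r in replicas]

    fast_quorum = math.ceil(3 * f / 2) + 1
    value, count = Counter(results).most_common(1)[0]
    if count >= fast_quorum:
        # Fast path: enough matching results -> finalize (async) and return.
        final = value
    else:
        # Slow path: the application decides from at least f + 1 results,
        # then a second round finalizes that result on the replicas.
        final = decide(results[: f + 1])
    for r in replicas:
        r.finalize(op_id, final)              # replicas confirm / update records
    return final
```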

14 of 56

IR Protocol - Replica Recovery & Sync

  • A single protocol for recovering failed replicas and running periodic sync
  • introduce view changes (similar to Viewstamped Replication)

  • Each IR view change is run by a leader that
  • coordinates only view changes, not op processing
  • must merge records from the latest view into the “master record” that is used to sync replicas
  • relies on app protocol to determine consensus results

14

15 of 56

TAPIR Overview

  • Built atop IR
  • leverage IR’s weak guarantees to provide high-performance linearizable txns
  • apps using TAPIR do not interact with IR directly
  • Clients are front-end app servers
  • possibly located in the same datacenter
  • used as 2PC coordinators
  • Replica state
  • multi-versioned data store (versions identified by timestamps)
  • txn log (committed and aborted txns in timestamp order)

15

16 of 56

TAPIR - Transaction Processing

  • 2PC + OCC + IR support
  • Prepare is the only consensus op
  • decide: if a majority of the replicas replied PREPARE-OK, then commit the txn
  • Commit and Abort are inconsistent ops
  • TAPIR can commit a txn with a single round-trip to all replicas in all shards
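A sketch of how this mapping of 2PC onto IR could look, reusing the invoke_* helpers and IRReplica from the earlier sketches; occ_check stands in for TAPIR's per-replica OCC validation, and all names here are illustrative:

```python
# Sketch: Prepare as the only consensus op (decide = majority PREPARE-OK),
# Commit/Abort as inconsistent ops, one IR replica group per shard.

PREPARE_OK, ABORT = "prepare-ok", "abort"

def decide_prepare(results):
    # Slide: commit if a majority of the replicas replied PREPARE-OK.
    ok = sum(1 for r in results if r == PREPARE_OK)
    return PREPARE_OK if ok > len(results) // 2 else ABORT

def tapir_commit(shards, txn, timestamp, occ_check, f):
    votes = []
    for replicas in shards:                      # Prepare is the only consensus op
        vote = invoke_consensus(replicas, ("prepare", txn, timestamp),
                                decide_prepare, occ_check, f)
        votes.append(vote)
    decision = "commit" if all(v == PREPARE_OK for v in votes) else "abort"
    for replicas in shards:                      # Commit / Abort are inconsistent ops
        invoke_inconsistent(replicas, (decision, txn, timestamp), f)
    return decision

# usage sketch: 3 shards, each with 3 replicas, tolerating f = 1 failure
shards = [[IRReplica() for _ in range(3)] for _ in range(3)]
always_ok = lambda op: PREPARE_OK        # stand-in OCC check that never finds conflicts
print(tapir_commit(shards, txn="t1", timestamp=42, occ_check=always_ok, f=1))  # "commit"
```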

16

17 of 56

Spanner-like system (2PC + S2PL + Paxos) vs. TAPIR (2PC + OCC + IR)

17

18 of 56

Evaluation

  • Does TAPIR improve latency and throughput compared to conventional transactional protocols in both datacenter and wide-area environments?
  • Does TAPIR’s abort rate scale similarly to other OCC-based transaction protocols as contention increases?
  • Is TAPIR comparable to weakly consistent systems?

18

19 of 56

Experimental Setup

  • Google Compute Engine (GCE)
  • VMs placed in 3 geographical regions: US, Europe, Asia
  • Average RTT: Euro-Asia (260ms), Euro-US (110ms), US-Asia (165ms)
  • Comparison systems
  • OCC-STORE (2PC + OCC + Multi-Paxos)
  • LOCK-STORE (Spanner-like; 2PC + S2PL + Multi-Paxos)
  • Workload
  • Retwis: open-source Twitter clone using Redis
  • YCSB+T: extension of YCSB wrapping DB ops inside simple txns

19

20 of 56

Average Retwis transaction latency vs. throughput within a datacenter

  • At low offered load, TAPIR-KV has lower latency
  • single round trip commit
  • TAPIR-KV can provide 3X peak throughput
  • no leader & no cross-replica coord.

20

21 of 56

Average wide-area Retwis transaction latency

  • Leader located in US and client in US, Europe or Asia

21

  • Sharing a DC with the leader: TAPIR-KV (T) must wait for responses from all replicas; OCC-STORE (O) and LOCK-STORE (L) can commit at the local leader

  • In a different DC: LOCK-STORE (L) must go to the leader for locks; OCC-STORE (O) can go to a local replica

22 of 56

Abort rates at varying Zipf coefficients

22

Lower commit latency reduces the time between Prepare and Commit/Abort, making txns less likely to abort

At higher Zipf coefficients, the most popular keys are accessed very frequently

23 of 56

Comparison with weakly consistent storage systems

23

  • TAPIR-KV outperforms MongoDB

  • TAPIR-KV has throughput within a factor of 2 of Cassandra and Redis

24 of 56

Summary

  • Replication does not have to be consistent for transactions to be consistent
  • lower latency and higher throughput compared to conventional transactional storage systems
  • performance competitive with weakly consistent systems while providing much stronger guarantees

24

25 of 56

25

26 of 56

PROS

  • Novel idea: reducing the cost of redundant consistency
  • TAPIR achieved 50% better latency and 3x better throughput compared to distributed systems using conventional protocols
  • Leaders are no longer required, which reduces communication time
  • Having timestamps reduces the chance of duplication in the system

CONS

  • Designing application protocols is difficult; if not designed correctly, the system’s performance will suffer
  • Invariant checks must be performed pairwise, which may be limiting for application protocols
  • IR cannot support applications that require the entire history of transactions
  • No real-world application uses this technique yet; it needs more testing at industry scale

26

27 of 56

Piazza Reviews

QUESTIONS

  • Is there a way to get around the pairwise requirement and make the system more robust?
  • TAPIR has no leader, but the client acts like a leader. What if it fails?
  • Is this type of distributed transaction going to become more popular in the future?

COMMENTS

  • Standard deviation numbers are not provided
  • The authors prove the correctness of IR and TAPIR instead of relying only on experimental results
  • If there is a lot of contention between requests, the system may get stuck retrying operations without doing much useful work; abort rates are high under heavy contention

27

28 of 56

Suggestions for Future Development

Running the fast and slow paths for operation processing in parallel in order to account for varying round trip times.

Read-write transactions can be combined with Spanner’s read-only transactions to provide high performance read-write and read-only transactions.

Code for TAPIR is available to public:

https://gitlab.cs.washington.edu/syslab/tapir

28

29 of 56

29

30 of 56

Yesquel

scalable SQL storage for Web applications

30

31 of 56

Yesquel and CAP

[Diagram relating Yesquel’s mechanisms to CAP: consistency via concurrency control protocols and logs and replicas for data requests; availability via load splits and replits under heavy requests and parallelized reads under failures; partition tolerance via local, independent query computation that requires no cross-node concurrency.]

31

32 of 56

32

33 of 56

Problems with existing storage techniques

  • SQL:
    • Rich functionality: joins, compression, secondary indexes, transactions, etc.
    • Not scalable
  • NoSQL:
    • Weak functionality: lacks SQL features.
    • Manual caching, sharding & denormalization
      • increased application complexity, error-prone, limited
    • Each existing NoSQL system has a small subset of features.
    • Lock-in problem: hard to replace one NoSQL system with another.

33

Solution for web apps: Yesquel = NoSQL performance + SQL functionality

rich database system = simple application

weak database system = complex application

NoSQL Example

34 of 56

Yesquel overview

34

  • support all SQL features, efficient operations to insert, lookup, delete, and enumerate items in order
  • support transactions that span many ordered maps

query processor = transactional ordered maps

primary key → the rest of the row

secondary index → primary key
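An illustrative sketch (not Yesquel code) of this mapping: each table becomes an ordered map from primary key to the rest of the row, and each secondary index becomes an ordered map back to the primary key. The OrderedMap class is a toy stand-in for YDBT:

```python
# Toy stand-in for a transactional ordered map: insert, lookup, delete omitted
# for brevity, enumerate in key order. Names and layout are illustrative only.

from bisect import insort, bisect_left

class OrderedMap:
    def __init__(self):
        self._keys, self._vals = [], {}

    def insert(self, key, value):
        if key not in self._vals:
            insort(self._keys, key)          # keep keys sorted for in-order scans
        self._vals[key] = value

    def lookup(self, key):
        return self._vals.get(key)

    def enumerate_from(self, key):
        for k in self._keys[bisect_left(self._keys, key):]:
            yield k, self._vals[k]

# table users(id PRIMARY KEY, name, email) with a secondary index on name
users      = OrderedMap()                    # primary key -> the rest of the row
name_index = OrderedMap()                    # (name, id) -> primary key

def insert_user(uid, name, email):
    users.insert(uid, {"name": name, "email": email})
    name_index.insert((name, uid), uid)

insert_user(2, "bob", "bob@example.com")
insert_user(1, "alice", "alice@example.com")
print(users.lookup(name_index.lookup(("bob", 2))))   # secondary index -> primary key -> row
print(list(users.enumerate_from(1)))                 # rows enumerated in primary-key order
```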

35 of 56

Yesquel overview (cont.)

35

  • support all SQL features, efficient operations to insert, lookup, delete, and enumerate items in order
  • support transactions that span many ordered maps
  • one in each client application

query processor = transactional ordered maps

  • partition the data of the ordered maps and store the tree nodes across the transactional storage servers
  • is consistent, fault-tolerant, latency efficient

YDBT = distributed balanced tree

[Diagram: the query processor performs the computation; YDBT stores the data. A lookup searches the tree and fetches the key and its value from the storage servers.]

36 of 56

YDBT feature: B+ tree => locality, scalability, latency efficiency

  • B-tree (self-balancing tree):
    • (key, value) ordered map = (index, data)
    • contiguous key intervals
  • Locality: neighboring keys are often in the same server, allowing them to be fetched together during enumeration.
  • B+tree:
    • Interior nodes: (key) → only used as guides.
    • Leaf nodes: (key, value) → where values are stored.
    • The leaf level forms a doubly-linked list.
  • For trees with high fan-out (like a file system), the linked leaf nodes reduce the number of I/O operations required to find an element in the tree.
  • Better locality (efficient fetching of neighboring nodes)

36

Data and logs are stored in a distributed fashion across the storage servers (local storage). Each interior node also stores its children’s key ranges.

37 of 56

[Figure: example YDBT node with fence interval 5~7; fence interval = the key range a node is responsible for.]

37

38 of 56

YDBT feature 1: load splits => balance load

split the workload of a popular server

38

[Figure: clients A-F send data requests to servers 1-3; server 3 has a larger workload. After a load split adds server 4 and moves part of server 3’s key range to it, each server gets a similar workload.]
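A hypothetical sketch of the load-split decision illustrated above (the imbalance threshold and bookkeeping are made up, not Yesquel’s actual policy):

```python
# Hypothetical load split: when one server handles far more requests than the
# others, split its hottest node's key range and hand half of it to the least
# loaded server.

def load_split(servers):
    """servers: {name: {(lo, hi) key range: request_count}} (illustrative shape)."""
    load = {s: sum(nodes.values()) for s, nodes in servers.items()}
    hot, cold = max(load, key=load.get), min(load, key=load.get)
    if load[hot] <= 2 * load[cold]:              # imbalance threshold (made up)
        return None
    node, count = max(servers[hot].items(), key=lambda kv: kv[1])
    lo, hi = node
    mid = (lo + hi) // 2
    # keep [lo, mid) on the hot server, move [mid, hi) to the cold server
    del servers[hot][node]
    servers[hot][(lo, mid)] = count // 2
    servers[cold][(mid, hi)] = count - count // 2
    return hot, cold, (mid, hi)

servers = {"s1": {(0, 100): 10}, "s2": {(100, 200): 12}, "s3": {(200, 300): 80}}
print(load_split(servers))   # moves half of server 3's hot key range to server 1
```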

39 of 56

YDBT feature 1: replits => balance load, fault tolerance

If a node is popular for read requests, replicate it (the node on the popular server)

If it becomes popular for update requests, remove the replicas

39

[Figure: when a node on server 3 becomes read-popular, it is replicated onto server 4 (a replit), so read requests spread across both copies and each server gets a similar workload.]

40 of 56

YDBT feature 1: delegate splits => serialize splits & reduce latency

40

[Figure: nodes b and c are both popular. 1) Splitting them all at once can assign both to the same new server (server 4), recreating the imbalance. 2) Splitting in series, by delegating splits to a split processor, serializes the splits so the load spreads across servers 4 and 5.]

41 of 56

YDBT feature 2: back-down search => use of caching

  • cache inner nodes without coherence
  • maintain a stack for backtrack

41

[Figure (slide 41): example lookup of key 7. The client walks down inner nodes held in its cache (e.g., >=1, >=3, >=5, >=7, >=9 spread over servers 1-3), pushing visited nodes on a stack. Validation checks that the key falls in the fetched node’s fence interval (the key range of that node). Fetch = reading a node remotely from the server that stores it.]

42 of 56

YDBT feature 2: back-down search => use of caching

  • cache inner nodes without coherence
  • maintain a stack for backtrack

42

in client’s cache

VALIDATION:

fence interval = the key range of this node

>=5

server 3

server 2

server 1

>=1

>=3

>=7

>=9

8

9

Ex. what is data for index 7?

need to update my cache

whoops! not found!

fetch = reading a node remotely from the server that stores it.

>=5

>=7

stack

>=8

>=9

5

7

43 of 56

43

[Figure (slide 43): the server’s index and data have moved (the node was split), but the client’s cached copy has not been updated yet, so the lookup for key 7 misses (“whoops! not found!”); fence interval = key range.]

44 of 56

44

[Figure (slide 44): the client fetches the >=7 node from the server, updates its cache, and the lookup for key 7 now succeeds (“found it!”).]
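A simplified, hypothetical sketch of back-down search as walked through on slides 41-44 (the node layout, cache shape, and helper names are made up):

```python
# Back-down search sketch: walk down cached inner nodes keeping a stack of
# visited nodes; on every node actually used, validate the fence interval;
# on a miss, drop the stale copy, back up the stack, and refetch fresh nodes.

def back_down_search(key, root, cache, fetch):
    """Nodes: {'fence': (lo, hi), 'children': [(sep, node_id)]} or {'fence': ..., 'leaf': {...}}."""
    stack, node_id = [], root
    while True:
        node = cache.get(node_id)
        if node is None:
            node = fetch(node_id)             # remote read from the storage server
        lo, hi = node["fence"]
        if not (lo <= key < hi):              # VALIDATION failed: node split or moved
            cache.pop(node_id, None)          # drop the stale copy
            node_id = stack.pop() if stack else root
            cache.pop(node_id, None)          # force a fresh fetch of the parent too
            continue
        if "leaf" in node:
            return node["leaf"].get(key)
        cache[node_id] = node                 # cache inner nodes without coherence
        stack.append(node_id)
        # descend to the rightmost child whose separator key is <= key
        node_id = max((c for c in node["children"] if c[0] <= key),
                      key=lambda c: c[0])[1]

# toy "servers": node_id -> node; a real YDBT spreads these across storage servers
SERVERS = {
    "root":  {"fence": (0, 100), "children": [(0, "leafA"), (50, "leafB")]},
    "leafA": {"fence": (0, 50),  "leaf": {7: "seven"}},
    "leafB": {"fence": (50, 100), "leaf": {77: "seventy-seven"}},
}
fetch = SERVERS.__getitem__      # "fetch = reading a node remotely from the server"
cache = {}                       # client-side cache of inner nodes, no coherence
print(back_down_search(7, "root", cache, fetch))    # 'seven'
```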

45 of 56

YDBT feature 3: atomic transactions => concurrency

  • multi-version concurrency control (MVCC) for concurrent access
    • transaction T: a logical unit of work on the database (e.g., read/write/update)
    • TimeStamp (TS), Read TimeStamp (RTS), Write TimeStamp (WTS)
    • read-write and write-write accesses are mutually exclusive

45

one client → many transactions

one object → many RTS/WTS (one per version)

On a write, check the transaction’s timestamp against the object’s existing read and write timestamps; if it is smaller than one of them, abort (and retry later). Otherwise, create a new version of the object tagged with the transaction’s timestamp.
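A minimal sketch of this timestamp check (classic MVCC / timestamp ordering, simplified; not Yesquel’s exact rules):

```python
# Multi-versioned object with read/write timestamps; writes that would
# invalidate a later read or write are aborted, otherwise a new version is added.

class MVObject:
    def __init__(self):
        self.versions = []      # list of (wts, value), one per committed write
        self.max_rts = 0        # largest timestamp of any read so far

    def read(self, ts):
        self.max_rts = max(self.max_rts, ts)
        older = [(w, v) for w, v in self.versions if w <= ts]
        return max(older)[1] if older else None   # latest version visible at ts

    def write(self, ts, value):
        # Abort if a read or write with a later timestamp already happened.
        if ts < self.max_rts or any(w > ts for w, _ in self.versions):
            return False                           # caller retries later
        self.versions.append((ts, value))          # new version tagged with ts
        return True

obj = MVObject()
obj.write(1, "a")
print(obj.read(5))        # 'a'   (records RTS = 5)
print(obj.write(3, "b"))  # False: a later read (ts=5) already saw the object
```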

[Timeline figure: clients C1 and C2 read the object, recording RTS(T1) and RTS(T2); client C3 writes, recording WTS(T3) and creating a new version; a conflicting write with a too-small timestamp is aborted. Trade-off: storage consumption (many versions per object).]

46 of 56

YDBT feature 3: atomic transactions => concurrency

  • run commit at the clients
    • Use the Sinfonia protocol to recover from crashes: a recovery process periodically visits the servers to check on pending transactions
    • trade-off: complexity
  • clocks with transactional protocol
    • two-phase commit: commit request + commit handling
    • two-phase locking: lock an object + release the lock
    • timestamp-ordering: transactions in order
    • trade-off: time drifts
  • snapshots = state of data
    • obtain snapshots for free from its transactional storage, which ensures that each transaction reads from a snapshot

46

47 of 56

Yesquel query processor optimizations

  • deferred seek = read keys without values (request the values later on)
    • trade-off: seeking an iterator causes an RPC (remote procedure call) to fetch a leaf node without its value.

47

  • deferred write = buffer small writes at the client until the transaction commits, then send them in one large write
  • optimistic insert = reuse back-down search to insert without first fetching the leaf node.
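An illustrative sketch of the deferred-write idea: small writes are buffered in the client-side transaction and flushed in one batch at commit time (class name and the send_batch hook are hypothetical):

```python
# Deferred writes: no per-write RPC; all buffered writes go out with the commit.

class DeferredWriteTxn:
    def __init__(self, send_batch):
        self.buffer = {}                 # key -> value, latest write wins
        self.send_batch = send_batch     # hypothetical hook: one RPC with all writes

    def write(self, key, value):
        self.buffer[key] = value         # buffered locally, no RPC here

    def read(self, key, fetch):
        if key in self.buffer:           # read-your-own-writes from the buffer
            return self.buffer[key]
        return fetch(key)

    def commit(self):
        return self.send_batch(self.buffer)   # flush everything with the commit

store = {}
txn = DeferredWriteTxn(send_batch=lambda writes: (store.update(writes), True)[1])
txn.write("k1", "v1"); txn.write("k2", "v2")
print(txn.read("k1", fetch=store.get))   # 'v1' from the buffer, no server round trip
print(txn.commit(), store)               # True {'k1': 'v1', 'k2': 'v2'}
```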

48 of 56

Yesquel limitations

  • Limited target usage: web applications
    • Many operations in the Yesquel API modify only a small part of a node’s state.
    • Long-lived queries that touch a lot of data (e.g., computing an average over all data) would move all the data from the storage servers to the query processors (data analytics). ⇒ better to ship computation to the data (e.g., computing partial averages at each server)

48

CAP?

  • Relational operators are not distributed (they run at the client)
    • SQLite (the query processor) is not distributed.
    • Possible fix: run relational operators at the storage servers.

49 of 56

Yesquel limitations (cont.)

  • Time drift
  • The implementation lacks
    • replicated storage servers (using Paxos or primary-backup)
    • replicated keys (using replits)
  • In the worst case, the validation process of back-down search may take 100 iterations.

49

50 of 56

Yesquel evaluation - baseline

50

[Table: feature matrix comparing BASE, BASE+, and YDBT on DBT, MVCC, back-down search, and splits; the baselines omit some of the features that YDBT includes.]

  • back-down (YDBT) vs. non-back-down (BASE): fraction of root accesses
  • load splits (YDBT) vs. size splits (BASE+): insert operations
  • MVCC (YDBT) vs. no MVCC (Minuet): scan operations (retrieving ~10% of keys) = snapshots

51 of 56

Alternative tech: Network DB

MySQL Network database (MySQL-NDB)

  • network data model: an upside-down tree that allows many-to-many relations

  • SQL functionality
  • shared-nothing: no single point of failure
  • auto-partitioning
  • difficult to modify
  • no MVCC for concurrent access
  • no table compression
  • network I/O consumption hurts performance

Yesquel spreads its nodes over the network

  • Asynchronous data replication

Yesquel uses clocks to synchronize (time drifts are rare)

51

52 of 56

Yesquel evaluation (cont.) - one server

52

  • NoSQL performs better at the expense of SQL functionality
  • NDB performs the worst
  • Yesquel performs well while providing SQL functionality
  • NoSQL uses fewer client & network resources but more server resources
  • NDB performs the worst
  • Yesquel uses more client & network resources (larger network transmission) but fewer server resources

53 of 56

Yesquel evaluation (cont.) - many servers

53

All systems scale linearly

Yesquel & Redis scale linearly; NDB is limited by coarse-grained locking

54 of 56

Summary

  • Yesquel is a system providing performance and scalability comparable to NoSQL systems, and at the same time with all the features of SQL systems
  • Yesquel is designed for serving Web applications, not as a general-purpose database system
  • The architecture of Yesquel supplies a query processor per client and a storage engine (YDBT) that serves all of the query processors
  • YDBT combines an optimistic caching scheme with back-down search to provide data locality
  • YDBT splits its nodes, as in balanced trees, using a technique called replit

54

55 of 56

Pros

  • Better scalability than SQL while preserving the functionality
  • Designs a novel data structure (YDBT) to store data across the network
  • New caching scheme
  • New mechanisms and policies to split nodes
  • New ways to handle and avoid contention
  • New approach to tree snapshots

Cons

  • Additional network bandwidth, memory, and client CPU cycles
  • As clients execute all relational operators, iterative or statistical queries can be very expensive

55

56 of 56

Questions

  • What are some possible ways to solve Yesquel’s bad performance on data analysis queries that operate on lots of data?

56

  • Is it possible to improve the performance of Yesquel with the approach proposed in the paper “Building Consistent Transactions with Inconsistent Replication”?
  • Are enterprises nowadays truly using Yesquel-like distributed database systems?
  • Is Wikipedia a representative workload for a web application?