Is CAP Dead?
Presenters: Si and Youshan
Scribers: Richa and Shuo-Yang
1
Consistency, Availability, and Partition Tolerance
2
Presentation Roadmap
⇒ impossible to provide all three together
3
Eric Allen Brewer
CAP Theorem
⇒ possible to provide all three together (only for web applications)
⇒ possible to provide distributed txns with strong consistency using replication with no consistency guarantees
4
Yesquel 2015
Marcos K. Aguilera, Joshua B. Leners, Michael Walfish
TAPIR 2015
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,
Arvind Krishnamurthy, Dan R. K. Ports
5
Building Consistent Transactions with Inconsistent Replication
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres,
Arvind Krishnamurthy, Dan R. K. Ports
University of Washington
6
TAPIR and CAP
IR provides fault tolerance without consistency
IR uses a single protocol for recovering failed replicas and running sync
TAPIR provides linearizable read-write txns
7
[Diagram: the CAP triangle of Consistency, Availability, and Partition tolerance]
Designers still have to pick two out of three. The goal today, however, is to maximize consistency and availability while tolerating partitions.
Common Architecture for Distributed Transactional Systems
8
Google Spanner: 2PC + S2PL + Paxos
Why is consistency so expensive?
Insight
Existing transactional storage systems incorporate a transactional protocol and a replication protocol that both enforce strong consistency
9
Motivation
How we go about it
Is it possible to provide distributed transactions with better performance and strong consistency (linearizable read-write transaction ordering)?
10
[Diagram: the TAPIR stack: the transaction protocol (OCC) layered over inconsistent replication (IR) at each shard]
IR Overview
11
IR Protocol - Inconsistent op processing
IR can successfully complete an inconsistent op with a single round-trip to f+1 replicas and no coordination across replicas
12
[Message diagram: the IR client proposes the op to all IR replicas; each replica replies after tentatively recording it; the client returns to the app and (asynchronously) finalizes after receiving f + 1 replies]
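A minimal sketch of the inconsistent-op flow above, not the authors' code: the class and method names (`IRClient`, `IRReplica`, `propose`, `finalize`) are hypothetical, and what the paper does asynchronously is simulated synchronously here. The point is that one round of f + 1 tentative replies is enough to return to the application.

```python
# Minimal sketch of IR inconsistent-operation processing (hypothetical names;
# the asynchronous finalize step of the real protocol is run inline here).

class IRReplica:
    def __init__(self, rid):
        self.rid = rid
        self.record = {}          # op_id -> (op, state)

    def propose(self, op_id, op):
        # Tentatively record the operation and reply immediately.
        self.record[op_id] = (op, "TENTATIVE")
        return "REPLY"

    def finalize(self, op_id):
        # Mark the operation finalized; IR never orders inconsistent ops.
        op, _ = self.record[op_id]
        self.record[op_id] = (op, "FINALIZED")


class IRClient:
    def __init__(self, replicas, f):
        self.replicas, self.f = replicas, f
        self.next_id = 0

    def invoke_inconsistent(self, op):
        op_id, self.next_id = self.next_id, self.next_id + 1
        # One round trip: propose to all replicas, need f + 1 replies.
        replies = sum(1 for r in self.replicas if r.propose(op_id, op) == "REPLY")
        assert replies >= self.f + 1, "not enough replicas responded"
        # Return to the app now; finalization is asynchronous in the paper.
        for r in self.replicas:
            r.finalize(op_id)
        return "OK"


if __name__ == "__main__":
    f = 1
    replicas = [IRReplica(i) for i in range(2 * f + 1)]
    print(IRClient(replicas, f).invoke_inconsistent({"type": "abort", "txn": 42}))
```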
IR Protocol - Consensus op processing
Fast path: a single round trip, requiring ⌈3f/2⌉ + 1 matching replies
Slow path: two round trips to at least f + 1 replicas
13
[Message diagrams:
Fast path: the IR client proposes to all replicas; each IR replica replies (with a result) after tentative recording; the client returns to the app and (asynchronously) finalizes after receiving at least ⌈3f/2⌉ + 1 matching results within a timeout; replicas confirm after (possibly) updating their records.
Slow path: the client proposes to all replicas; replicas reply (with results) after tentative recording; after f + 1 replies the client finalizes (with the decided result) and returns to the app only once replicas confirm, (possibly) updating their records.]
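A sketch of just the quorum arithmetic behind the two paths above (hypothetical function names, no networking): the fast path needs at least ⌈3f/2⌉ + 1 matching results; otherwise, with at least f + 1 results, the client applies a decide function and takes the slow path. The example decide rule approximates TAPIR's Prepare decision (commit only if a majority voted PREPARE-OK).

```python
import math
from collections import Counter

def fast_quorum(f):
    # Fast path needs ceil(3f/2) + 1 matching results.
    return math.ceil(3 * f / 2) + 1

def process_consensus(results, f, decide):
    """results: list of per-replica results for one consensus op."""
    best, n = Counter(results).most_common(1)[0]
    if n >= fast_quorum(f):
        # Fast path: return to the app now, finalize asynchronously.
        return best, "fast"
    if len(results) >= f + 1:
        # Slow path: the client decides, finalizes with that result,
        # and only returns after f + 1 replicas confirm.
        return decide(results), "slow"
    raise RuntimeError("fewer than f + 1 replies; retry")

if __name__ == "__main__":
    f = 2
    # Roughly TAPIR's decide rule for Prepare: commit only on a majority of PREPARE-OK.
    decide = lambda rs: "prepare-ok" if rs.count("prepare-ok") >= f + 1 else "abort"
    print(process_consensus(["prepare-ok"] * 4, f, decide))                     # fast path
    print(process_consensus(["prepare-ok", "abort", "prepare-ok"], f, decide))  # slow path
```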
IR Protocol - Replica Recovery & Sync
14
TAPIR Overview
15
TAPIR - Transaction Processing
16
Spanner-like system (2PC + S2PL + Paxos) vs. TAPIR (2PC + OCC + IR)
17
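To make the comparison concrete, here is a much-simplified sketch of the replica side of TAPIR-style transaction processing (hypothetical names, single shard, plain integers standing in for TAPIR's client-chosen timestamps): Prepare runs OCC validation and would be an IR consensus operation, while Commit just applies the writes and can be an inconsistent operation, so no shard leader or Paxos round is needed.

```python
# Sketch of OCC validation at a TAPIR-style replica (hypothetical, simplified:
# single shard, no prepared-read timestamps, no retry timestamps).

class TapirReplica:
    def __init__(self):
        self.store = {}      # key -> (value, version)
        self.prepared = {}   # txn_id -> (timestamp, read_set, write_set)

    def prepare(self, txn_id, timestamp, read_set, write_set):
        """IR consensus op: validate the transaction with OCC checks."""
        for key, version_read in read_set.items():
            _, current = self.store.get(key, (None, 0))
            if version_read != current:            # someone committed after our read
                return "ABORT"
            if any(key in ws for _, _, ws in self.prepared.values()):
                return "ABSTAIN"                   # conflicts with a prepared writer
        for key in write_set:
            for _, rs, ws in self.prepared.values():
                if key in rs or key in ws:
                    return "ABSTAIN"               # conflicts with a prepared txn
        self.prepared[txn_id] = (timestamp, read_set, write_set)
        return "PREPARE-OK"

    def commit(self, txn_id):
        """IR inconsistent op: apply the writes at the prepared timestamp."""
        timestamp, _, write_set = self.prepared.pop(txn_id)
        for key, value in write_set.items():
            self.store[key] = (value, timestamp)


if __name__ == "__main__":
    r = TapirReplica()
    assert r.prepare(1, 10, read_set={}, write_set={"x": "a"}) == "PREPARE-OK"
    r.commit(1)
    # A transaction that read x at version 0 now fails validation.
    print(r.prepare(2, 11, read_set={"x": 0}, write_set={"y": "b"}))   # ABORT
```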
Evaluation
18
Experimental Setup
19
Average Retwis transaction latency vs. throughput within a datacenter
20
Average wide-area Retwis transaction latency
21
Abort rates at varying Zipf coefficients
22
Lower commit latency reduces the time between Prepare and Commit/Abort, making transactions less likely to abort
At higher Zipf coefficients, the most popular keys are accessed very frequently, which increases contention and aborts
Comparison with weakly consistent storage systems
23
Summary
24
25
PROS
CONS
26
Piazza Reviews
QUESTIONS
COMMENTS
27
Suggestions for Future Development
Running the fast and slow paths for operation processing in parallel in order to account for varying round trip times.
Read-write transactions can be combined with Spanner’s read-only transactions to provide high performance read-write and read-only transactions.
28
29
Yesquel
scalable SQL storage for Web applications
30
Yesquel and CAP
[Diagram: Yesquel mechanisms behind each CAP concern: concurrency control protocols for data requests; logs and replicas; query computation that is local and requires no cross-node concurrency control; load splits & replits for heavy requests; parallelized reads under failures; independent query computation]
31
[Diagram: a data request in Yesquel with respect to the three CAP properties: partition tolerance, consistency, and availability]
32
Problems with existing storage techniques
33
Solution for web apps: Yesquel = NoSQL performance + SQL functionality
rich database system = simple application
weak database system = complex application
NoSQL Example
Yesquel overview
34
query processor = transactional ordered maps
primary key -----> the rest of the row
secondary index key -----> primary key
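A toy illustration of this mapping, using plain Python dicts in place of Yesquel's distributed transactional ordered maps (all names here are made up): one map goes from primary key to the rest of the row, and a secondary index is a second map from the indexed column back to the primary key.

```python
# Illustrative only: plain dicts stand in for Yesquel's distributed,
# transactional ordered maps (YDBTs).

users = {}              # primary-key map: user_id -> rest of the row
users_by_email = {}     # secondary index:  email  -> user_id

def insert_user(user_id, name, email):
    users[user_id] = {"name": name, "email": email}
    users_by_email[email] = user_id          # keep the index in sync

def lookup_by_email(email):
    # SELECT * FROM users WHERE email = ... : index lookup, then primary lookup.
    user_id = users_by_email.get(email)
    return None if user_id is None else (user_id, users[user_id])

if __name__ == "__main__":
    insert_user(1, "Ada", "ada@example.com")
    print(lookup_by_email("ada@example.com"))
```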
Yesquel overview (cont.)
35
query processor = transactional ordered maps
YDBT = distributed balanced tree
[Diagram: computation runs at clients, data lives on storage servers; operations search the tree and fetch keys/values from the servers, returning the key or the value]
YDBT feature: B+ tree => locality, scalability, latency efficiency
36
Data and logs are stored distributedly across storage servers (local storage)
[Diagram: an interior YDBT node stores keys plus its children's key ranges; the example node has fence interval 5~7]
37
[Diagram: an example node holding key 6; a node's fence interval equals its key range]
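A sketch of the per-node state the figure implies, with illustrative field names: a node lives on some storage server, holds keys, records its children's key ranges, and carries a fence interval equal to the key range it is responsible for. Back-down search (feature 2 below) validates lookups against this fence interval.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class YDBTNode:
    # Illustrative structure of a YDBT (distributed balanced tree) node.
    node_id: int
    server: int                               # storage server holding this node
    fence: Tuple[int, int]                    # fence interval = key range, inclusive
    keys: List[int] = field(default_factory=list)
    # children: (child key range, child node_id); empty for leaves
    children: List[Tuple[Tuple[int, int], int]] = field(default_factory=list)

    def covers(self, key: int) -> bool:
        lo, hi = self.fence
        return lo <= key <= hi

if __name__ == "__main__":
    leaf = YDBTNode(node_id=7, server=2, fence=(5, 7), keys=[5, 6, 7])
    print(leaf.covers(6), leaf.covers(9))     # True False
```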
YDBT feature 1: load splits => balance load
split the workload of popular server
38
[Diagram: clients send data requests to servers 1-3, which hold keys a-f; server 3 has a larger workload. After a load split, part of server 3's key range moves to a new server 4, and each server gets a similar workload.]
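An illustrative sketch of the load-split decision (thresholds and function names are invented): rather than splitting a node when it grows too large, split when its recent access load is high, and pick the split key so that each half receives roughly half of that load.

```python
# Illustrative load split: split a hot node at the key that balances recent
# access counts, not at the middle of the key range.

def choose_split_key(access_counts):
    """access_counts: sorted list of (key, recent_access_count)."""
    total = sum(c for _, c in access_counts)
    running = 0
    for key, count in access_counts:
        running += count
        if running >= total / 2:
            return key          # left half keeps keys <= key
    return access_counts[-1][0]

def maybe_load_split(access_counts, load_threshold=1000):
    load = sum(c for _, c in access_counts)
    if load < load_threshold:
        return None             # not hot enough: no split
    return choose_split_key(access_counts)

if __name__ == "__main__":
    hot_node = [("a", 50), ("b", 900), ("c", 300), ("d", 250)]
    print(maybe_load_split(hot_node))   # 'b': left half keeps a..b, right gets c..d
```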
YDBT feature 1: replits => balance load, fault-tolerant
If a node is popular for read requests, replicate it onto other servers (a replit)
If it becomes popular for update requests, remove the replicas (see the sketch after the diagram below)
39
[Diagram: clients send data requests to servers 1-3; server 3 has a larger workload. After a replit, the popular node is replicated onto a new server 4, and each server gets a similar workload.]
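An illustrative sketch of the replit policy on this slide (counters and thresholds are invented): a node that is hot for reads gains replicas, and a node that becomes hot for updates drops back to a single copy, since every update would otherwise have to reach all replicas.

```python
# Illustrative replit policy for one node (thresholds are made up).

class NodeStats:
    def __init__(self):
        self.reads = 0
        self.writes = 0
        self.replicas = 1      # number of servers holding a copy of this node

    def record(self, is_write):
        self.writes += is_write
        self.reads += not is_write

    def adjust_replication(self, read_hot=1000, write_hot=200, max_replicas=4):
        if self.writes >= write_hot and self.replicas > 1:
            self.replicas = 1                  # updates are hot: drop replicas
        elif self.reads >= read_hot and self.replicas < max_replicas:
            self.replicas += 1                 # reads are hot: add a replica (replit)
        self.reads = self.writes = 0           # start a new measurement window

if __name__ == "__main__":
    n = NodeStats()
    for _ in range(1200):
        n.record(is_write=False)
    n.adjust_replication()
    print(n.replicas)          # 2: the hot-for-reads node gained a replica
```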
YDBT feature 1: delegate splits => serialize splits & reduce latency
40
[Diagram: 1) nodes b and c are popular, so their servers split them independently; whoops! both new nodes get assigned to server 4. 2) Splitting in series, by assigning splits to a split processor, places the new nodes on different servers (4 and 5).]
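An illustrative sketch of delegating splits to a split processor (names are invented): servers only enqueue split requests, and a single processor performs them in series, so each new node can be placed on the currently least-loaded server instead of every split independently picking the same target.

```python
# Illustrative split processor: splits run in series so new nodes spread out.
from collections import deque

def run_split_processor(split_requests, server_loads):
    """split_requests: iterable of node names to split.
    server_loads: dict server -> current load; mutated as nodes are placed."""
    placements = {}
    queue = deque(split_requests)              # splits are serialized here
    while queue:
        node = queue.popleft()
        target = min(server_loads, key=server_loads.get)   # least-loaded server
        placements[node] = target
        server_loads[target] += 1
    return placements

if __name__ == "__main__":
    loads = {"server1": 2, "server2": 2, "server3": 3, "server4": 0, "server5": 0}
    # If b and c were split independently, both halves could land on server4;
    # serialized, they are spread over server4 and server5.
    print(run_split_processor(["b", "c"], loads))
```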
YDBT feature 2: back-down search => use of caching
41
[Diagram: the client caches interior nodes (separators >=1, >=3, >=5, >=7, >=9) spread across servers 1-3; a fetch reads a node remotely from the server that stores it. Validation: fence interval = the key range of this node. Example: for "what is the data for index 7?", the client follows its cached path (>=5, then >=7, kept on a stack) and fetches the leaf holding 7 and 8.]
YDBT feature 2: back-down search => use of caching
42
[Diagram: the same lookup for index 7, but the data has moved: the fetched node now holds 8 and 9 (whoops! not found!), so the client's cache needs to be updated; it backs down its stack (>=7, >=5), refreshes the cached separators (now >=8, >=9), and retries until it reaches the node holding 5 and 7.]
43
[Diagram: the client's cached copy still points at the old node for key 6, but the server's index and data have moved; the fetched node holds 7 and 8, so the key is not found (whoops!) and the cache should be updated but has not been yet. Fence interval = key range.]
44
[Diagram: the client backs down, follows the >=7 separator, fetches the moved node, updates its cache, and this time finds key 6 (found it!). Fence interval = key range.]
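A simplified sketch of back-down search (illustrative; `fetch` stands in for a remote read, and the "servers" are just a dictionary): the client descends using cached nodes while keeping the path on a stack, validates each fetched node's fence interval against the key, and on a mismatch refreshes the stale entry or pops the remembered path and retries from the previous node.

```python
# Illustrative back-down search: the client descends over cached nodes and
# validates each node's fence interval against the key it wants.

class Node:
    def __init__(self, fence, keys=None, children=None):
        self.fence = fence               # (lo, hi): key range this node covers
        self.keys = keys or {}           # leaf: key -> value
        self.children = children or []   # interior: sorted (separator, node_id)

SERVERS = {}                             # node_id -> current Node ("the servers")

def fetch(node_id):
    return SERVERS[node_id]              # stands in for a remote read

def covers(node, key):
    lo, hi = node.fence
    return lo <= key <= hi

def back_down_search(cache, root_id, key):
    stack, node_id = [], root_id
    while True:
        node = cache.get(node_id)
        if node is None or not covers(node, key):
            node = cache[node_id] = fetch(node_id)      # refresh stale/missing entry
        if not covers(node, key):
            # Even the fresh node does not cover the key: a cached ancestor sent
            # us to the wrong place, so retry from the previous node on the path.
            node_id = stack.pop() if stack else root_id
            cache.pop(node_id, None)                    # force a re-fetch there
            continue
        if not node.children:                           # reached a leaf
            return node.keys.get(key)
        stack.append(node_id)
        # follow the last child whose separator is <= key
        node_id = [nid for sep, nid in node.children if sep <= key][-1]

if __name__ == "__main__":
    SERVERS[1] = Node((0, 99), children=[(0, 2), (50, 3)])
    SERVERS[2] = Node((0, 49), keys={7: "seven"})
    SERVERS[3] = Node((50, 99), keys={70: "seventy"})
    stale_cache = {2: Node((0, 4))}      # stale copy: node 2 "covers" only 0..4
    print(back_down_search(stale_cache, 1, 7))          # 'seven'
```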
YDBT feature 3: atomic transactions => concurrency
45
one client → many transactions
one object → many versions, each with read/write timestamps (RTS/WTS)
On a write, if the transaction's timestamp is older than an existing read or write timestamp on the object, abort (and retry later); otherwise create a new version of the object tagged with the transaction's timestamp (see the sketch below).
[Timeline diagram: the per-object queries manager records read timestamps for transactions T1 and T2 from clients C1 and C2 (RTS(T1), RTS(T2)) and a write timestamp for T3 from client C3 (WTS(T3)); a conflicting write is aborted.]
trade-off: storage consumption
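A sketch of the per-object timestamp check described on this slide (illustrative; Yesquel's actual multi-version scheme has more machinery): a write at timestamp T aborts if the object already carries a read or write timestamp greater than T, and otherwise creates a new version tagged T, which is where the storage-consumption trade-off comes from.

```python
# Illustrative multi-version object with read/write timestamp checks.

class MVObject:
    def __init__(self, initial):
        self.versions = [(0, initial)]   # list of (write_timestamp, value)
        self.max_rts = 0                 # largest timestamp that has read this object

    def read(self, ts):
        self.max_rts = max(self.max_rts, ts)
        # return the latest version written at or before ts
        return max(v for v in self.versions if v[0] <= ts)[1]

    def write(self, ts, value):
        latest_wts = max(w for w, _ in self.versions)
        if ts < self.max_rts or ts < latest_wts:
            return "ABORT"                  # a later transaction already read/wrote
        self.versions.append((ts, value))   # new version: costs storage (the trade-off)
        return "OK"

if __name__ == "__main__":
    x = MVObject("v0")
    print(x.read(ts=5))                  # 'v0', sets the object's RTS to 5
    print(x.write(ts=3, value="late"))   # ABORT: timestamp 3 < RTS 5
    print(x.write(ts=7, value="v1"))     # OK
```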
YDBT feature 3: atomic transactions => concurrency
46
Yesquel query processor optimizations
47
Yesquel limitations
48
CAP?
Yesquel limitations (cont.)
49
Yesquel evaluation - baseline
50
| DBT   | MVCC | back-down | splits |
| BASE  | x    | x         | x      |
| BASE+ | x    | ✔         | x      |
| YDBT  | ✔    | ✔         | ✔      |
back-down (YDBT) vs. non-back-down (BASE): fraction of root accesses
load splits (YDBT) vs. size splits (BASE+): insertion operations
MVCC (YDBT) vs. no MVCC (Minuet): scan operations (retrieve ~10% of keys) as snapshots
Alternative tech: Network DB
MySQL Network database (MySQL-NDB)
Yesquel spreads its nodes over the network
Yesquel uses clocks to synchronize (time drifts are rare)
51
Yesquel evaluation (cont.) - one server
52
Yesquel uses network resources but fewer server resources
Yesquel evaluation (cont.) - many servers
53
[Charts: all systems scale linearly in the first two panels; in the third, Yesquel & Redis scale linearly while NDB is limited by coarse-grained locking]
Summary
54
Pros
Cons
55
Questions
56