
Art of Caching

Ways · Wins · Woes · Weird · Wisdom


Beauty of Caching: Trade-offs

  • A wise software engineer once said: "It depends"

  • How many of you have implemented caching?

  • How many of you have said, "Let's not write caching logic here"?

  • How many of you have faced cache invalidation issues?

  • How many of you have built a backdoor to forcefully invalidate a cache?


Intro

  • Srinivas

  • Founder @ Opti Owl

  • Web Developer to SRE Lead @ Zomato

  • 2 Mil Orders Per Month to 3+ Mil Orders Per Day

  • Deployed Cultural Frameworks & Practices like Postmortems & FMEA at Zomato


Definition

Caching is the process of storing a transformed copy of data in a storage system that is more efficient than the source storage system.


How many here have seen Dr. Strange from the Marvel Cinematic Universe?


TOC

  • Definition
  • Ways & Wins
  • Woes & Weird
  • Wisdom
  • Characteristics


Ways & Wins


Saving Latency

  • Most Common Win
  • Cache SQL Queries
  • Cache Images on CDN


Saving Vendor API Cost

  • Who has heard of Ola Maps?
  • Google Maps API costs
  • Caching map data (while complying with the T&Cs)


Saving Database Cost

  • Caching DynamoDB Reads
  • Caching DynamoDB Writes (10x more costly than reads)
  • DAX


[Charts: database latency distribution vs. Redis latency distribution, and the resulting latency savings]


Improving Resiliency

  • Caching Expensive Queries
    • Story: Deepi’s Followers

  • Fallback on DB Failure


Caching Booleans

  • Bloom Filters
    • Caching Serviceability
    • Caching Restaurant Open / Close
  • Bit Set (Array); a minimal sketch follows
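
A minimal sketch of the bit-set idea (all names are mine, not from the talk): pack one open/closed flag per restaurant ID into a bit array, so millions of booleans fit in a couple of megabytes. A Bloom filter generalises the same trick to set membership over arbitrary keys, at the price of a small false-positive rate.

```go
package main

import "fmt"

// BitSet packs one boolean per ID into 64-bit words:
// roughly 1 bit per restaurant instead of a full cache entry per key.
type BitSet struct{ words []uint64 }

func NewBitSet(n int) *BitSet { return &BitSet{words: make([]uint64, (n+63)/64)} }

func (b *BitSet) Set(id int, v bool) {
	if v {
		b.words[id/64] |= 1 << (id % 64)
	} else {
		b.words[id/64] &^= 1 << (id % 64)
	}
}

func (b *BitSet) Get(id int) bool { return b.words[id/64]&(1<<(id%64)) != 0 }

func main() {
	open := NewBitSet(10_000_000) // 10M restaurants in ~1.2 MB
	open.Set(1234, true)          // restaurant 1234 is open
	fmt.Println(open.Get(1234), open.Get(5678)) // true false
}
```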


Layered Caching

  • Power Law Distribution: a small set of hot keys serves most of the traffic
  • Central Distributed Cache (Redis / Memcache)
  • Local Cache (Container Level: Sidecar or App Global)
  • Request / Thread Local Cache (Thread / Go Routine Level); a read-through sketch follows
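
A minimal read-through sketch across two of these layers (the request-local layer is omitted; go-redis stands in for the central cache, and `load` is a hypothetical source-of-truth loader):

```go
package cache

import (
	"context"
	"errors"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type Layered struct {
	mu    sync.RWMutex
	local map[string]string // container-level layer (no eviction in this sketch)
	rdb   *redis.Client     // central distributed layer
	load  func(ctx context.Context, key string) (string, error) // source of truth
}

func (c *Layered) Get(ctx context.Context, key string) (string, error) {
	// Layer 1: in-process map; no network hop, but per-container.
	c.mu.RLock()
	v, ok := c.local[key]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}

	// Layer 2: central cache; shared across containers, one network hop.
	v, err := c.rdb.Get(ctx, key).Result()
	if err == nil {
		c.mu.Lock()
		c.local[key] = v
		c.mu.Unlock()
		return v, nil
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // a cache error, not a miss (see "Cache Error is a Cache Miss?")
	}

	// Miss everywhere: hit the source of truth, then backfill both layers.
	if v, err = c.load(ctx, key); err != nil {
		return "", err
	}
	c.rdb.Set(ctx, key, v, 5*time.Minute) // best-effort backfill
	c.mu.Lock()
	c.local[key] = v
	c.mu.Unlock()
	return v, nil
}
```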


Woes & Weird

The bill always comes due


When Cache Increases Latency

  • Batch GETs
  • When fetching 1000+ items in one batch, the probability of at least one cache miss is so high that the batch can add more latency than it saves (worked example below)
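
A rough back-of-the-envelope, assuming independent per-key lookups with hit ratio h: the chance that a batch of n keys completes without a single miss is h^n. With h = 0.999 and n = 1000, P(at least one miss) = 1 − 0.999^1000 ≈ 63%, so most batches still pay the backing store's latency on top of the cache round trip.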


Caching Null Values

  • Inserts don't invalidate the cache; only updates do
  • So even after an insert, reads keep responding that the row doesn't exist (sketch below)
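
A minimal sketch of null caching and the invalidation fix (go-redis assumed; `queryDB`, `insertDB` and the key scheme are hypothetical):

```go
package cache

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var (
	rdb         *redis.Client // assumed initialized elsewhere
	ErrNotFound = errors.New("user not found")
	queryDB     func(ctx context.Context, id string) (string, error) // hypothetical; surfaces sql.ErrNoRows
	insertDB    func(ctx context.Context, id, data string) error     // hypothetical
)

const nullSentinel = "\x00null" // sentinel marking "row does not exist"

func GetUser(ctx context.Context, id string) (string, error) {
	if v, err := rdb.Get(ctx, "user:"+id).Result(); err == nil {
		if v == nullSentinel {
			return "", ErrNotFound // cached null: absence answered without a DB hit
		}
		return v, nil
	}
	v, err := queryDB(ctx, id)
	if errors.Is(err, sql.ErrNoRows) {
		// Cache the absence with a short TTL so it cannot stay stale for long.
		rdb.Set(ctx, "user:"+id, nullSentinel, 30*time.Second)
		return "", ErrNotFound
	}
	if err != nil {
		return "", err
	}
	rdb.Set(ctx, "user:"+id, v, 5*time.Minute)
	return v, nil
}

// The pitfall from the slide: if only UPDATE paths invalidate the key, an
// INSERT leaves the stale null sentinel behind. Inserts must invalidate too.
func InsertUser(ctx context.Context, id, data string) error {
	if err := insertDB(ctx, id, data); err != nil {
		return err
	}
	return rdb.Del(ctx, "user:"+id).Err()
}
```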


Not Caching Null Values

  • Every query for a nonexistent row goes to the database
  • In high-frequency scenarios the database load stays high even when it should be low


Thrashing

  • Data is evicted before it can be reused, causing a high miss rate
  • Symptom: a high eviction-to-fetch ratio


To Evict or Not to Evict

  • Variable performance
  • Harder to predict the ideal hit ratio you could achieve
  • Prefer expiry (TTL) tuning over eviction tuning


Compute Hot Nodes

  • Power Law
  • Too much throughput on specific keys
  • Delhi, Bangalore Orders


Memory Hot Nodes

  • Rare, but happens
  • Grouped keys
  • Hash Tags: Redis Cluster hashes only the `{...}` part of the key, so `user_{1234}_feat1` and `user_{1234}_feat2` land on the same node


Bandwidth Hot Nodes

  • Throughput so high that hot nodes form
  • Power Law
  • Levelled Caching
  • App-Level Replicated Cache (sketch below)
    • `user_123_c1`, `user_123_c2`, etc.
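
A minimal sketch of that replication scheme (the replica count and suffix format are assumptions): copies of a hot key are spread across the keyspace so its read bandwidth lands on several nodes instead of one.

```go
package cache

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/redis/go-redis/v9"
)

const replicas = 4 // assumption: sized to the observed hot-key throughput

func replicaKey(key string, i int) string { return fmt.Sprintf("%s_c%d", key, i) }

// GetHot reads one randomly chosen copy of a hot key, spreading load.
func GetHot(ctx context.Context, rdb *redis.Client, key string) (string, error) {
	return rdb.Get(ctx, replicaKey(key, rand.Intn(replicas)+1)).Result()
}

// SetHot writes all copies so every replica serves the same value.
func SetHot(ctx context.Context, rdb *redis.Client, key, val string, ttl time.Duration) error {
	for i := 1; i <= replicas; i++ {
		if err := rdb.Set(ctx, replicaKey(key, i), val, ttl).Err(); err != nil {
			return err
		}
	}
	return nil
}
```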


Bi-Modal Behaviour

  • Behaviour when the cache is online is quite different from when it is offline
  • The system is more fragile because it behaves differently in the two cases
  • Use the constant-work model to control bi-modal behaviour


Cache Poisoning

  • Build serialisation, compression, and encoding into the caching package itself
  • Validate at write time


Backwards Compatibility


X Broke Prod

  • Pre-prod breaks prod
  • Canary breaks prod
  • When environments share a cache, entries written in a newer schema poison reads by older code


Revert Breaks Prod

  • Rolling the code back doesn't roll back cache entries already written in the new schema


Timeout Propagation

  • Remove timeouts on cache sets
  • Don't propagate the request's timeout into the cache call; set an independent timeout (sketch below)
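
A minimal sketch of the fix in Go (assuming go-redis, and Go 1.21+ for `context.WithoutCancel`):

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// setCache detaches the cache write from the caller's deadline.
// context.WithoutCancel keeps the request's values but drops its deadline;
// the SET then gets its own small, independent budget.
func setCache(ctx context.Context, rdb *redis.Client, key, val string) error {
	ctx = context.WithoutCancel(ctx)
	ctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	return rdb.Set(ctx, key, val, 5*time.Minute).Err()
}
```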


Irrecoverable Cache Wipe

  • With timeout propagation, the cache may never refill after a wipe: each request exhausts its deadline on the slow source and the cache set is cancelled


Cache Error is a Cache Miss?


Self-Immolating Cache


External Cache Faster than In-App Cache

  • Fetching from a local cache sidecar can be faster than fetching from an in-process cache
  • Most in-memory cache implementations are not as concurrency-friendly as dedicated external caches


Not Serializing Local Cache

  • Unsafe modifications to shared cached objects
  • Error-prone estimation of total cache size
  • Immutable objects FTW


Serializing Local Cache

  • High compute cost for serialization & deserialization
  • High garbage-collection pressure from serialization allocations


Infinite Loop Caches


Object Size: Cascading Failure

  • Local Cache Object Size > Remote Cache Object Size
    vs.
  • Local Cache Object Size < Remote Cache Object Size


Wisdom


Cache Invalidation

  • There is a reason for the famous saying that cache invalidation is one of the hard problems
  • A separate talk in itself


Eviction Policy: LFU

  • The most ideal policy on paper
  • Only use it if you see a strong power-law distribution
  • Main flaw: localised access spikes skew the frequency data
  • Ensure frequency counters decay over time


Eviction Policy: LRU

  • Easier and more performant to implement than LFU
  • An LFU structure must be reordered on every read, while an LRU structure can defer reordering to writes
  • LFU lock hold times are higher than LRU's
  • LFU uses more memory than LRU (it stores exact or approximate frequency counters)


Eviction Policy: FIFO

  • Even easier and more performant to implement than LRU

  • Newer variants using lazy promotion and quick demotion (LP/QD, e.g. S3-FIFO) are even more performant and achieve better hit ratios than LRU


Cache Schema Changes

  • Incorporate version numbers hierarchically into keys to upgrade the cache
  • Handles schema changes
  • Handles cache poisoning
  • Example: `global_v1/user_v2/1234123` (sketch below)
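
A minimal sketch of such a key builder (the version constants are hypothetical): bumping `userSchemaVersion` invalidates every user entry at once, and bumping `globalVersion` invalidates the whole cache, with no explicit deletes.

```go
package cache

import "fmt"

const (
	globalVersion     = 1 // bump to invalidate everything
	userSchemaVersion = 2 // bump when the cached user struct changes
)

func userKey(id string) string {
	return fmt.Sprintf("global_v%d/user_v%d/%s", globalVersion, userSchemaVersion, id)
}

// userKey("1234123") == "global_v1/user_v2/1234123"
```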


Cache Consistency

  • Consider Replica Lag
  • Node Stickiness for a request
  • Request Aware Local Cache Layer


Mind the Key Size, Not Just the Value

  • `user_service_v1_location_module_v3_most_recent_location_feature_v4`
  • A key like this can be bigger than the value it points to


Cache Miss Injection

  • Finding slow queries
  • Finding queries that could cause an outage during a cache wipe (sketch below)
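
A minimal sketch of miss injection (the rate, threshold, and `load` helper are assumptions): a small random fraction of lookups bypasses the cache so the uncached path is exercised continuously and slow queries surface before a real wipe does.

```go
package cache

import (
	"context"
	"log"
	"math/rand"
	"time"

	"github.com/redis/go-redis/v9"
)

var (
	rdb  *redis.Client                                          // assumed initialized elsewhere
	load func(ctx context.Context, key string) (string, error) // hypothetical source-of-truth loader
)

const injectRate = 0.001 // treat ~0.1% of lookups as forced misses

// GetWithMissInjection occasionally bypasses the cache to measure the
// uncached query path under live traffic.
func GetWithMissInjection(ctx context.Context, key string) (string, error) {
	if rand.Float64() >= injectRate {
		if v, err := rdb.Get(ctx, key).Result(); err == nil {
			return v, nil
		}
	}
	start := time.Now()
	v, err := load(ctx, key)
	if d := time.Since(start); d > 100*time.Millisecond {
		log.Printf("slow uncached query key=%s took=%s", key, d) // outage candidate during a wipe
	}
	if err != nil {
		return "", err
	}
	rdb.Set(ctx, key, v, 5*time.Minute)
	return v, nil
}
```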


Retry Budgets

  • Retry on errors
  • Retry on timeouts
  • Cap retries at a 10-20% budget to avoid overloading the cache (sketch below)
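
A minimal sketch of a token-style retry budget (in the spirit of Finagle's and gRPC's retry budgets; the numbers are assumptions): each request earns a fraction of a token, each retry spends a whole one, so retries stay capped relative to traffic.

```go
package cache

import (
	"math"
	"sync"
)

// RetryBudget caps retries at ~ratio of recent request volume
// (10-20% per the slide), so retry storms cannot double the load.
type RetryBudget struct {
	mu        sync.Mutex
	tokens    float64
	ratio     float64 // tokens earned per request, e.g. 0.1 for a 10% budget
	maxTokens float64 // cap so idle periods cannot bank unlimited retries
}

func NewRetryBudget(ratio, maxTokens float64) *RetryBudget {
	return &RetryBudget{ratio: ratio, maxTokens: maxTokens}
}

// OnRequest is called once per cache request to accrue budget.
func (b *RetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens = math.Min(b.tokens+b.ratio, b.maxTokens)
}

// AllowRetry reports whether a retry (on error or timeout) fits in the budget.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```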


Circuit Breakers

  • Circuit breakers that protect the client container
  • Circuit breakers that protect the server


Mind Durability Guarantees

  • Redis is not durable by default (persistence must be configured explicitly)
  • Memcache doesn't support durability at all


Tolerating Cache Errors

  • Similar to retry budgets, design error budgets: e.g. up to 1% of cache errors are treated as misses rather than surfaced as errors


Serialization

  • Use binary serialisation formats like Arrow or Protobuf
  • Self-describing schema data on every object is often too much bloat


Compression

  • As with any other data, duplication is often high in bigger objects
  • Use different compression trade-offs at different levels


Characteristics


Efficiency

  • Latency
  • Cost
    • Storage Cost
    • Access Cost
    • Vendor Cost
  • Fault Tolerance
  • CPU
  • Network


Efficiency Trade-offs

  • The same dimensions as above, traded off against one another


Transientness

  • Cache data is not necessarily transient

  • Especially if the storage cost of the cache system is extremely low


Durability

  • Making a cache durable adds additional complexity

  • It also adds new failure modes


Source of Truth

  • A cache system is not the source of truth

  • It may sometimes hold newer data, but that is still a transformation of the source data rather than the source of truth


  • More than 36% savings in cost optimization
  • Book a meeting to discuss any problems you are facing in Platform / SRE / DevOps / Cost

Opti Owl


Thank you

Srinivas Devaki