
Art of Caching

Ways · Wins · Woes · Weird · Wisdom


Beauty of Caching: Trade-offs

  • A wise software engineer once said: "It depends"

  • How many of you have implemented caching?

  • How many of you have said, "Let's not write caching logic here"?

  • How many of you have faced cache invalidation issues?

  • How many of you have built a backdoor to forcefully invalidate a cache?


Intro

  • Srinivas

  • Founder @ Opti Owl

  • Web Developer to SRE Lead @ Zomato

  • 2 Mil Orders Per Month to 3+ Mil Orders Per Day

  • Deployed Cultural Frameworks & Practices like Postmortems & FMEA at Zomato


Definition

Caching is the process of storing a transformed copy of data in a storage system that is more efficient than the source storage system.


How many here have seen Dr. Strange from the Marvel Cinematic Universe?


TOC

  • Definition
  • Ways & Wins
  • Woes & Weird
  • Wisdom
  • Characteristics


Ways & Wins


Saving Latency

  • Most Common Win
  • Cache SQL Queries
  • Cache Images on CDN


Saving Vendor API Cost

  • Who has heard of Ola Maps?
  • Google Maps API costs
  • Caching map data (while complying with the T&Cs)


Saving Database Cost

  • Caching DynamoDB Reads
  • Caching DynamoDB Writes (10x more costly than reads)
  • DAX


[Charts: database latency distribution vs. Redis latency distribution, and the resulting latency savings]


Improving Resiliency

  • Caching Expensive Queries
    • Story: Deepi’s Followers

  • Fallback on DB Failure


Caching Booleans

  • Bloom Filters
    • Caching Serviceability
    • Caching Restaurant Open / Close
  • Bit Set (Array); a minimal sketch follows
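
A minimal sketch of the bit-set idea (all names are mine, not from the talk): pack one open/closed flag per restaurant ID into a bit array, so millions of booleans fit in a couple of megabytes. A Bloom filter generalises the same trick to set membership over arbitrary keys, at the price of a small false-positive rate.

```go
package main

import "fmt"

// BitSet packs one boolean per ID into 64-bit words:
// roughly 1 bit per restaurant instead of a full cache entry per key.
type BitSet struct{ words []uint64 }

func NewBitSet(n int) *BitSet { return &BitSet{words: make([]uint64, (n+63)/64)} }

func (b *BitSet) Set(id int, v bool) {
	if v {
		b.words[id/64] |= 1 << (id % 64)
	} else {
		b.words[id/64] &^= 1 << (id % 64)
	}
}

func (b *BitSet) Get(id int) bool { return b.words[id/64]&(1<<(id%64)) != 0 }

func main() {
	open := NewBitSet(10_000_000) // 10M restaurants in ~1.2 MB
	open.Set(1234, true)          // restaurant 1234 is open
	fmt.Println(open.Get(1234), open.Get(5678)) // true false
}
```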


Layered Caching

  • Power Law Distribution: a small set of hot keys serves most of the traffic
  • Central Distributed Cache (Redis / Memcache)
  • Local Cache (Container Level: Sidecar or App Global)
  • Request / Thread Local Cache (Thread / Go Routine Level); a read-through sketch follows
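
A minimal read-through sketch across two of these layers (the request-local layer is omitted; go-redis stands in for the central cache, and `load` is a hypothetical source-of-truth loader):

```go
package cache

import (
	"context"
	"errors"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type Layered struct {
	mu    sync.RWMutex
	local map[string]string // container-level layer (no eviction in this sketch)
	rdb   *redis.Client     // central distributed layer
	load  func(ctx context.Context, key string) (string, error) // source of truth
}

func (c *Layered) Get(ctx context.Context, key string) (string, error) {
	// Layer 1: in-process map; no network hop, but per-container.
	c.mu.RLock()
	v, ok := c.local[key]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}

	// Layer 2: central cache; shared across containers, one network hop.
	v, err := c.rdb.Get(ctx, key).Result()
	if err == nil {
		c.mu.Lock()
		c.local[key] = v
		c.mu.Unlock()
		return v, nil
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // a cache error, not a miss (see "Cache Error is a Cache Miss?")
	}

	// Miss everywhere: hit the source of truth, then backfill both layers.
	if v, err = c.load(ctx, key); err != nil {
		return "", err
	}
	c.rdb.Set(ctx, key, v, 5*time.Minute) // best-effort backfill
	c.mu.Lock()
	c.local[key] = v
	c.mu.Unlock()
	return v, nil
}
```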


Woes & Weird

The bill always comes due


When Cache Increases Latency

  • Batch GETs
  • When fetching 1000+ items in one batch, the probability of at least one cache miss is so high that the batch can add more latency than it saves (worked example below)
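
A rough back-of-the-envelope, assuming independent per-key lookups with hit ratio h: the chance that a batch of n keys completes without a single miss is h^n. With h = 0.999 and n = 1000, P(at least one miss) = 1 − 0.999^1000 ≈ 63%, so most batches still pay the backing store's latency on top of the cache round trip.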


Caching Null Values

  • Inserts don't invalidate the cache; only updates do
  • So even after an insert, reads keep responding that the row doesn't exist (sketch below)
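
A minimal sketch of null caching and the invalidation fix (go-redis assumed; `queryDB`, `insertDB` and the key scheme are hypothetical):

```go
package cache

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var (
	rdb         *redis.Client // assumed initialized elsewhere
	ErrNotFound = errors.New("user not found")
	queryDB     func(ctx context.Context, id string) (string, error) // hypothetical; surfaces sql.ErrNoRows
	insertDB    func(ctx context.Context, id, data string) error     // hypothetical
)

const nullSentinel = "\x00null" // sentinel marking "row does not exist"

func GetUser(ctx context.Context, id string) (string, error) {
	if v, err := rdb.Get(ctx, "user:"+id).Result(); err == nil {
		if v == nullSentinel {
			return "", ErrNotFound // cached null: absence answered without a DB hit
		}
		return v, nil
	}
	v, err := queryDB(ctx, id)
	if errors.Is(err, sql.ErrNoRows) {
		// Cache the absence with a short TTL so it cannot stay stale for long.
		rdb.Set(ctx, "user:"+id, nullSentinel, 30*time.Second)
		return "", ErrNotFound
	}
	if err != nil {
		return "", err
	}
	rdb.Set(ctx, "user:"+id, v, 5*time.Minute)
	return v, nil
}

// The pitfall from the slide: if only UPDATE paths invalidate the key, an
// INSERT leaves the stale null sentinel behind. Inserts must invalidate too.
func InsertUser(ctx context.Context, id, data string) error {
	if err := insertDB(ctx, id, data); err != nil {
		return err
	}
	return rdb.Del(ctx, "user:"+id).Err()
}
```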


Not Caching Null Values

  • Every query for a nonexistent row goes to the database
  • In high-frequency scenarios the database load stays high even when it should be low


Thrashing

  • Data is evicted before it can be reused, causing a high miss rate
  • Symptom: a high eviction-to-fetch ratio


To Evict or Not to Evict

  • Variable performance
  • Harder to predict the ideal hit ratio you could achieve
  • Prefer expiry (TTL) tuning over eviction tuning


Compute Hot Nodes

  • Power Law
  • Too much throughput on specific keys
  • Delhi, Bangalore Orders


Memory Hot Nodes

  • Rare, but happens
  • Grouped keys
  • Hash Tags: Redis Cluster hashes only the `{...}` part of the key, so `user_{1234}_feat1` and `user_{1234}_feat2` land on the same node


Bandwidth Hot Nodes

  • Throughput so high that hot nodes form
  • Power Law
  • Levelled Caching
  • App-Level Replicated Cache (sketch below)
    • `user_123_c1`, `user_123_c2`, etc.
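
A minimal sketch of that replication scheme (the replica count and suffix format are assumptions): copies of a hot key are spread across the keyspace so its read bandwidth lands on several nodes instead of one.

```go
package cache

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/redis/go-redis/v9"
)

const replicas = 4 // assumption: sized to the observed hot-key throughput

func replicaKey(key string, i int) string { return fmt.Sprintf("%s_c%d", key, i) }

// GetHot reads one randomly chosen copy of a hot key, spreading load.
func GetHot(ctx context.Context, rdb *redis.Client, key string) (string, error) {
	return rdb.Get(ctx, replicaKey(key, rand.Intn(replicas)+1)).Result()
}

// SetHot writes all copies so every replica serves the same value.
func SetHot(ctx context.Context, rdb *redis.Client, key, val string, ttl time.Duration) error {
	for i := 1; i <= replicas; i++ {
		if err := rdb.Set(ctx, replicaKey(key, i), val, ttl).Err(); err != nil {
			return err
		}
	}
	return nil
}
```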


Bi-Modal Behaviour

  • Behaviour when the cache is online is quite different from when it is offline
  • The system is more fragile because it behaves differently in the two cases
  • Use the constant-work model to control bi-modal behaviour


Cache Poisoning

  • Build serialisation, compression, and encoding into the caching package itself
  • Validate at write time


Backwards Compatibility


X Broke Prod

  • Pre-prod breaks prod
  • Canary breaks prod
  • When environments share a cache, entries written in a newer schema poison reads by older code


Revert Breaks Prod

  • Rolling the code back doesn't roll back cache entries already written in the new schema


Timeout Propagation

  • Remove timeouts on cache sets
  • Don't propagate the request's timeout into the cache call; set an independent timeout (sketch below)
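
A minimal sketch of the fix in Go (assuming go-redis, and Go 1.21+ for `context.WithoutCancel`):

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// setCache detaches the cache write from the caller's deadline.
// context.WithoutCancel keeps the request's values but drops its deadline;
// the SET then gets its own small, independent budget.
func setCache(ctx context.Context, rdb *redis.Client, key, val string) error {
	ctx = context.WithoutCancel(ctx)
	ctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	return rdb.Set(ctx, key, val, 5*time.Minute).Err()
}
```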


Irrecoverable Cache Wipe

  • With timeout propagation, the cache may never refill after a wipe: each request exhausts its deadline on the slow source and the cache set is cancelled


Cache Error is a Cache Miss?


Self-Immolating Cache


External Cache Faster than In-App Cache

  • Fetching from a local cache sidecar can be faster than fetching from an in-process cache
  • Most in-memory cache implementations are not as concurrency-friendly as dedicated external caches


Not Serializing Local Cache

  • Unsafe modifications to shared cached objects
  • Error-prone estimation of total cache size
  • Immutable objects FTW


Serializing Local Cache

  • High compute cost for serialization & deserialization
  • High garbage-collection pressure from serialization allocations


Infinite Loop Caches


Object Size: Cascading Failure

  • Local Cache Object Size > Remote Cache Object Size
    vs.
  • Local Cache Object Size < Remote Cache Object Size


Wisdom


Cache Invalidation

  • There is a reason for the famous saying that cache invalidation is one of the hard problems
  • A separate talk in itself


Eviction Policy: LFU

  • The most ideal policy on paper
  • Only use it if you see a strong power-law distribution
  • Main flaw: localised access spikes skew the frequency data
  • Ensure frequency counters decay over time


Eviction Policy: LRU

  • Easier and more performant to implement than LFU
  • An LFU structure must be reordered on every read, while an LRU structure can defer reordering to writes
  • LFU lock hold times are higher than LRU's
  • LFU uses more memory than LRU (it stores exact or approximate frequency counters)


Eviction Policy: FIFO

  • Even easier and more performant to implement than LRU

  • Newer variants using lazy promotion and quick demotion (LP/QD, e.g. S3-FIFO) are even more performant and achieve better hit ratios than LRU


Cache Schema Changes

  • Incorporate version numbers hierarchically into keys to upgrade the cache
  • Handles schema changes
  • Handles cache poisoning
  • Example: `global_v1/user_v2/1234123` (sketch below)
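
A minimal sketch of such a key builder (the version constants are hypothetical): bumping `userSchemaVersion` invalidates every user entry at once, and bumping `globalVersion` invalidates the whole cache, with no explicit deletes.

```go
package cache

import "fmt"

const (
	globalVersion     = 1 // bump to invalidate everything
	userSchemaVersion = 2 // bump when the cached user struct changes
)

func userKey(id string) string {
	return fmt.Sprintf("global_v%d/user_v%d/%s", globalVersion, userSchemaVersion, id)
}

// userKey("1234123") == "global_v1/user_v2/1234123"
```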


Cache Consistency

  • Consider Replica Lag
  • Node Stickiness for a request
  • Request Aware Local Cache Layer


Mind the Key Size, Not Just the Value

  • `user_service_v1_location_module_v3_most_recent_location_feature_v4`
  • A key like this can be bigger than the value it points to


Cache Miss Injection

  • Finding slow queries
  • Finding queries that could cause an outage during a cache wipe (sketch below)
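
A minimal sketch of miss injection (the rate, threshold, and `load` helper are assumptions): a small random fraction of lookups bypasses the cache so the uncached path is exercised continuously and slow queries surface before a real wipe does.

```go
package cache

import (
	"context"
	"log"
	"math/rand"
	"time"

	"github.com/redis/go-redis/v9"
)

var (
	rdb  *redis.Client                                          // assumed initialized elsewhere
	load func(ctx context.Context, key string) (string, error) // hypothetical source-of-truth loader
)

const injectRate = 0.001 // treat ~0.1% of lookups as forced misses

// GetWithMissInjection occasionally bypasses the cache to measure the
// uncached query path under live traffic.
func GetWithMissInjection(ctx context.Context, key string) (string, error) {
	if rand.Float64() >= injectRate {
		if v, err := rdb.Get(ctx, key).Result(); err == nil {
			return v, nil
		}
	}
	start := time.Now()
	v, err := load(ctx, key)
	if d := time.Since(start); d > 100*time.Millisecond {
		log.Printf("slow uncached query key=%s took=%s", key, d) // outage candidate during a wipe
	}
	if err != nil {
		return "", err
	}
	rdb.Set(ctx, key, v, 5*time.Minute)
	return v, nil
}
```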


Retry Budgets

  • Retry on errors
  • Retry on timeouts
  • Cap retries at a 10-20% budget to avoid overloading the cache (sketch below)
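
A minimal sketch of a token-style retry budget (in the spirit of Finagle's and gRPC's retry budgets; the numbers are assumptions): each request earns a fraction of a token, each retry spends a whole one, so retries stay capped relative to traffic.

```go
package cache

import (
	"math"
	"sync"
)

// RetryBudget caps retries at ~ratio of recent request volume
// (10-20% per the slide), so retry storms cannot double the load.
type RetryBudget struct {
	mu        sync.Mutex
	tokens    float64
	ratio     float64 // tokens earned per request, e.g. 0.1 for a 10% budget
	maxTokens float64 // cap so idle periods cannot bank unlimited retries
}

func NewRetryBudget(ratio, maxTokens float64) *RetryBudget {
	return &RetryBudget{ratio: ratio, maxTokens: maxTokens}
}

// OnRequest is called once per cache request to accrue budget.
func (b *RetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens = math.Min(b.tokens+b.ratio, b.maxTokens)
}

// AllowRetry reports whether a retry (on error or timeout) fits in the budget.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}
```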


Circuit Breakers

  • Circuit breakers that protect the client container
  • Circuit breakers that protect the server


Mind Durability Guarantees

  • Redis is not durable by default (persistence must be configured explicitly)
  • Memcache doesn't support durability at all


Tolerating Cache Errors

  • Similar to retry budgets, design error budgets: e.g. up to 1% of cache errors are treated as misses rather than surfaced as errors


Serialization

  • Use binary serialisation formats like Arrow or Protobuf
  • Self-describing schema data on every object is often too much bloat


Compression

  • As with any other data, duplication is often high in bigger objects
  • Use different compression trade-offs at different levels


Characteristics


Efficiency

  • Latency
  • Cost
    • Storage Cost
    • Access Cost
    • Vendor Cost
  • Fault Tolerance
  • CPU
  • Network


Efficiency Trade-offs

  • The same dimensions as above, traded off against one another


Transientness

  • Cache data is not necessarily transient

  • Especially if the storage cost of the cache system is extremely low


Durability

  • Making a cache durable adds additional complexity

  • It also adds new failure modes


Source of Truth

  • A cache system is not the source of truth

  • It may sometimes hold newer data, but that is still a transformation of the source data rather than the source of truth


  • More than 36% savings in cost optimization
  • Book a meeting to discuss any problems you are facing in Platform / SRE / DevOps / Cost

Opti Owl


Thank you

Srinivas Devaki