1 of 70

FIFTH ELEPHANT

JULY 27th 2017

JULY 27-28, 2017 • BANGALORE, INDIA

2 of 70

Near Real time Indexing

Building Real Time Search Index For E-Commerce

Work Done @ Flipkart

Umesh Prasad,

Independent Consultant, Solr/Lucene Expert

Search & ML @Unbxd

3 of 70

About Myself

  • SOLR/LUCENE [2.1 to 4.6]
    • User/Hacker
    • Middleware
    • Contributor
    • Independent Consultant
    • NEW USE Cases

  • Independent Search Consultant @Unbxd

  • Ex-Advisor/Consultant @ Lucidworks

  • Search & Data Platform @ Flipkart (4.5 years)

  • Payments @ Amazon (1.4 years)
  • Vertical Search @ Verse Innovation & Naukri ( 3.4 years )

3

4 of 70

  • Ecommerce Search
    • Need for Real Time Search
  • Build a NRT Search Index
    • From first Principles
  • First Principle approach
  • Q & A

4

AGENDA

5 of 70

E-Commerce Marketplace

  • Real Time Search

5

6 of 70

6

231 million docs

  • 90 million sku
  • 160 million listings
  • result collapsing

drill down filters

Top positions at premium*

7 of 70

EXPERIENCE

7

The Good

“Our teams and sellers worked days and nights to make this sale a success – and our efforts paid off. We got a billion hits on our site today and achieved our 24 hour sales target of $100 mn in GMV in just 10 hours

  • Sachin Bansal & Binny Bansal : Press Release

8 of 70

  1. BBD 2014 Analysis

  • Sachin & Binny’s Email
    1. Price Changes
    2. Out-of-stock Issues

8

9 of 70

!! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]

10 of 70

!! Steal Deals !!

  • [Sold Out aka Stolen] ..

10

11 of 70

Engineering 101

11

12 of 70

Normal Day - Ranking

  1. Rank == Function(User Intent, Product)
  2. Intent
    1. Implicit (Historic Behaviour, Session Behaviour)
    2. Explicit ( Query + Filters )
    3. Temporal + Personal
  3. Product
    • Quality : Ratings & Reviews

12

13 of 70

Sales day Ranking

  1. Influence Buying Behaviour through
    1. Price / Discount [Offer]*
    2. Instant Gratification
    3. Delivery Experience ( Availability* / SLA* / Seller Trust )

* But it is all essential for growth

13

14 of 70

Reduce Data Lag

Source of Truth → Search Index → Front End

14

15 of 70

Source of Truth

15

Seller Rating

Service

catalogue service

Promise Service

Availability Service

Offer

Service

Pricing

Service

Product aka SKU

Listings

16 of 70

  • Data pipeline
    • Source Of Truth Search Index
    • Streaming Updates aka PUSH
    • Kafka + Storm
  • Caches
    • Search Index End User / Front End
    • Couchbase with TTL
    • Top URL Refresher

Standard Lambda architecture, Literature Available

16

17 of 70

Search INDEX

17

18 of 70

Demystifying Lucene

18

  1. Storage ⇒ Directory (RAM / File / HDFS / MMap)
  2. Data Serialization ⇒ Codecs
    1. BitSets
    2. Column oriented fields
    3. Stored fields
    4. Term vector etc etc

19 of 70

Lucene Resources

  1. Information Retrieval
    1. Book : https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
    2. Course : http://www.inf.ed.ac.uk/teaching/courses/tts/
  2. Tool : Luke https://github.com/DmitryKey/luke
  3. Talks : What is inside a Lucene Index by Adrien Grand
  4. Papers : https://wiki.apache.org/lucene-java/LucenePapers
  5. Doug Cutting
    • Lucene was created by Doug Cutting
    • Lucene is his grandmother's Middle Name :)
    • talk in PISA : http://lucene.sourceforge.net/talks/pisa/
    • Based on Paper : Space Optimization for Total Ranking http://lucene.sf.net/papers/riao97.ps
  6. Galene : LinkedIn’s new search architecture
  7. Earlybird: Real-Time Search at Twitter by Michael Busch

19

20 of 70

Inside a Lucene Index

20

21 of 70

Personal Experience

  1. @Verse Innovation (Daily Hunt)
    1. Vertical/Classifieds Search Engine for mobiles
    2. Lucene Based
    3. 2nd employee
    4. Content Integration / Core Search / Learning System / NER / Transliteration / Spam detection
  2. I have been Using LUKE to debug lucene since 2008
  3. We added Distributed Indexing/Search capabilities to Lucene in 2009
    • User Generated Content will be searchable within a min

21

22 of 70

Final Solution : Take Away

Search index

      • Base : Lucene Index
      • Custom NRT Index

Basically We built a

  1. TERM Sharded INDEX
  2. Integrated to Lucene Query Engine through callbacks
  3. Better liveliness than SolrCloud , an ID Sharded Solution.

22

23 of 70

Why Not use Source of Truth during Ranking

  1. Last request matched : 234K SKUs (parent documents)
  2. Lots of sources used in ranking/relevance computation
  3. Calls in Tight loop
  4. Network Latency
  5. Will kill downstream service
  6. Data has to reside in memory

23

24 of 70

SolrCloud : How it works

24

  • ID Based Sharding
  • Update = Delete + Add
    • Block Join Index ⇒ Update Whole Block (Product + Listings)
  • Updated Document gets streamed to all replicas in sync in solrcloud
    • Resource Contention between indexing & search

25 of 70

SolrCloud : Rejection

  1. Support independent field updates
  2. Remove Single point of failure/bottlenecks
  3. Support horizontally scalable cluster (for qps)

25

26 of 70

E-commerce Marketplace

Special Data Characteristics

26

27 of 70

E-commerce Document

27

  • Product/SKU [Parent Document]
    • Listing [Child Document]

  • Query = Mostly SKU Attributes [Free Text]
  • Filters = SKU + Listing Attributes [Drill Down]
  • Ranking = SKU + Listing Attributes [Explicit/Relevance]

28 of 70

Text Relevance vs Real time Attribute

[update rates comparison]

28

updates / sec

updates /hr

normal

Peak

text / catalogue

~10

~100

~100K

pricing

~100

~1K

~10 million

availability

~100

~10K

~10 million

offer

~100

~10K

~10 million

seller rating

~10

~1K

~1 million

signal 6

~10

~100

~1 million

signal 7

~100

~10K

~10 million

signal 8

~100

~10K

~10 million

29 of 70

Ingestion pipeline

Catalogue API

Pricing API

Availability API

Offers API

...

Document Builder

Change Propagation

Documents {L1,L2 … P1}

Updates Stream 1

Updates Stream 2

Updates Stream 3

Bottleneck 1 : Document Builder

Partial Data

30 of 70

Bottleneck 2 : Lucene Segment Merges

30

Credits : http://blog.mikemccandless.com/

31 of 70

What we tried [ & failed]

  1. Take Segment Mappings Out
    1. Segment ⇒ Immutable
    2. Post commit Hook [SolrEventListener]
    3. <PrimaryKey , docID> mapping
    4. Merges Killed it
  2. Lucene Codecs

31

32 of 70

QUOTE

32

33 of 70

Thomas Edison in Digital Age

33

34 of 70

Agile + Team

34

35 of 70

E-commerce Document

35

  • Document [Block Join]
    • Product/SKU [Catalogue]
    • Listing [Seller/Marketplace]
  • Query
    • Mostly SKU Attributes [Free Text]
  • Filters /Drill Down
    • SKU [Enum/Numeric/Boolean]
    • Listing [Enum/Numeric/Boolean]
  • Ranking
    • SKU [ Text relevance]
    • Listing [Explicit sorting]

36 of 70

Base Index

  • It is a normal Lucene Index
  • Functional
    • text relevance (Catalogue)
    • Meta Fields (Category / Filters etc )
    • Product Ratings
  • NFR
    • Fields have lower rate of change
    • Full Document Creation

36

37 of 70

BUILDING NRT STORE

37

38 of 70

NRT Store : Requirements

  • Streaming Updates
  • Serving Live traffic
  • Commitless
  • Remove Single point of failure/bottlenecks
  • Support Burst update rates. Example meta field changes
  • Optimized for Ranking

38

39 of 70

Let’s put some numbers

  1. Performance , Performance , Performance
    1. End to end lag < 10 sec
    2. Support 100K updates / sec
    3. Support 100s of independent sources/signals
    4. Support horizontally scalable cluster (1000s of replicas)
    5. 10K requests per second from a cluster
    6. 10 millions calls to different fields in 99% percentile

39

40 of 70

Why Commitless ?

  1. Because commit has non trivial overheads for e-commerce
    1. Lag as new segment gets created
    2. Churn in Solr caches
    3. Garbage creation
    4. Performance degradation as older data structures get purged and rebuilt

40

41 of 70

Demystifying NRT Store

  1. Substores
    1. NRT Forward Index [ ⇒ List<DocValues> ]
    2. NRT Inverted Index [ ⇒ List<Posting List> ]

41

42 of 70

Demystifying NRT Store

  1. It is just a columnar storage
    1. Optimized for High frequency Updates and really really high frequency reads
    2. Data structures are memory resident
  2. External resources

42

43 of 70

NRT Forward Index - Considerations

43

  • Lookup efficiency
    • 50th percentile : ~10K matches
    • 99th percentile : ~1 million matches
  • Data on Java heap
    • Memory efficiency

44 of 70

NRT Forward Index Design - V1

  • API
    • <T> get(String primary_key)
    • where <T> == return type . can be Numeric | Float | Double | String
  • Call Hierarchy
    • Nested Loop Calls.
    • For each matching document
      • for each field participating in scoring | ranking
        • get(String primary_key)
      • apply some function to compute score

44

45 of 70

HashMap based Implementation

45

NRT Forward Index

Lucene Segment

Lookup Engine

0

ProductB

1

ProductA

2

ProductC

3

ProductD

ProductD

ProductA

ProductB

ProductC

ProductD

True

False

False

True

100

150

200

250

ProductId(3)

<ProductD,price>

250

ProductId

Availability

Price

Latency : ~10 secs for ~1 Million lookups

DocId : 3

field : price

46 of 70

HashMap :BottleNecks

  1. Recap : 10 sec for 1 million lookups
  2. Lucene Layer :
    1. Integer Ord → String (primary key)
  3. Application Layer
    • String (primary key) → HashCode [ CPU ]
    • Hashcode to Bucket [ Random Bucket ]
    • < 1% of the hashmap will be referred
  4. Solution : Remove
    • Unnecessary Ord → String → hashcode conversion.

46

47 of 70

NRT Forward Index : V2

  1. Lookup Engine is also based on Ord
    1. Primary Key → Integer Ord (NRT Dictionary)
    2. Update Path
      1. Primary Key → Ord
      2. Update values for Ord
      3. Lucene does it already
    3. Lookup Path
      • Lucene Ord → NRT Ord [ Another array ]
      • NRT Ord → NRT Value
      • basically a 2D array lookup

47

48 of 70

ID Mappings

Lucene Segment

0

ProductB

1

ProductA

2

ProductC

3

ProductD

DocId - NrtId

0

1

2

3

3

0

1

2

NRT Forward Index (Segment Independent)

100

200

250

150

Price

0

ProductA

1

ProductC

2

ProductD

3

ProductB

Availability

T

F

F

T

Status

01

10

01

00

49 of 70

Data structures

  1. Lucene Index
    1. Term Dictionary
      1. Essentially FST
    2. Posting List
      • essentially Sparse Bitsets
      • One for each term ie millions of terms
    3. DocValues
      • essentially column oriented data structures
      • one for each field ie thousands of fields.

49

50 of 70

Type Tuned Data structures

  1. Read/write wrappers/helpers
  2. Boolean Fields
    1. FixedBitSet
    2. Random updatable and searchable
  3. Enumerable Fields with fixed cardinality
    • Low cardinality : List<FixedBitSet>
    • High cardinality : byte[]
  4. Numeric fields
    • Array
  5. Tag fields with dynamic values
    • 2D int array. tag → int, fieldname → int

50

51 of 70

Foreign Key + Array Based Implementation

51

NrtId(3)

2

Lookup Engine

250

Price(2)

Latency : ~100 ms for ~1 Million lookups

DocId : 3

field : price

52 of 70

NRT Inverted Index

52

53 of 70

Filters : Requirement

53

234K Products

54 of 70

NRT Forward Store ⇒ Posting List

54

NRT Filter

NRT Forward Store

NRT Inverter

Lucene Segment

0

ProductB

1

ProductA

2

ProductC

3

ProductD

Inverted Index | Posting List

Availability : T

0

3

Offer : O1

2

3

Offer:O1

DocIdSet

55 of 70

Final Solution

  1. Lucene Extensions
  2. Integrate with Custom Inverted Index
  3. Eventual consistency between replicas
  4. Target
    1. liveliness
    2. higher throughput (indexing as well query)
    3. Consistent latency

55

56 of 70

Solr Integration Points

56

  • ValueSources
  • Filtering
    • Custom Filter Implementation for cached DocIdSet
    • Custom PostFilter
  • Query
    • Wrapper over Filter
  • Custom FacetComponent

57 of 70

Near Real Time Solr Architecture

57

Solr

Kafka

Ingestion pipeline

NRT Forward Index

Ranking

Matching

Faceting

Redis

Bootstrap

NRT Inverted store

Solr Master

NRT Updates

Lucene Updates

Catalogue

Pricing

Availability

Offers

Seller Quality

Commit

+

Replicate

+

Reopen

Lucene

Others

58 of 70

58

800K active users

160K requests per sec

  • 40K service
  • 10k solr

median : 11 ms

99th perc: 1.1 sec

59 of 70

Accomplishments

  • Real time sorting
  • Real time filtering : PostFilter
    • Higher latency
  • Near real time filtering : cached DocIdSet
    • No consistency between lookup and filtering
  • Independent of lucene commits
  • Query latency comparable to DocValues
    • Consistent 99% performance

60 of 70

Accomplishments @ Flipkart

  • Real time consumption for ~150 Signals
  • Reduction in shown out of stock products by 2X
  • Production instances of ~50K updates/second real time

61 of 70

61

QUESTIONS?

62 of 70

Twitter

62

  1. Realtime Search at Twitter - Michael Busch : KeyNote @ Euro Con
  2. Search @ Twitter : Lucene Rev 2014

63 of 70

63

LinkedIn

64 of 70

Product /Listing: Attributes

64

Product aka SKU

Listings

65 of 70

SCALE @ FLIPKART

65

66 of 70

Root Cause

  1. Data Lag (Push)
    1. Source of Truth → Search
    2. Multiple intermediate services
  2. Response Lag (Pull)
    • Front end → Search
    • Multiple layers of caches
  3. During BBD 2014, Lag showed up

66

67 of 70

Comparison with SolrCloud

  1. SolrCloud is an ID sharded Index
  2. It doesn’t give scalable independent field updates
  3. Partial Update update
    1. Partial document update is a sugar coating
    2. Partial DocValue update doesn’t scale.
  4. Good for Slowish Update rates , large indexes
  5. Bad for Extremely high update rates, medium-large size indexes

67

68 of 70

Lucene Index

68

ProductA

brand : Apple

availability : T

price : 45000

ProductB

brand : Samsung

availability : T

price : 23000

ProductC

brand : Apple

availability : F

price : 5000

Document ID Mappings

Posting List

(Inverted Index)

DocValues

(columunar data)

Lucene Segment

0

ProductA

1

ProductB

2

ProductC

45000

23000

5000

Price

availability : T

brand : Samsung

brand : Apple

0 , 2

1

0 , 1

Terms

Sparse Bitsets

69 of 70

A Typical Search Flow

69

Query Rewrite

Results

Query

Matching

Ranking

Faceting

Stats

Posting List

Doc Values

Other Components

Lucene Segment

Inverted Index

Forward Index

NRT Store

samsung mobiles

Offer : exchange offer

price desc

category : mobiles

brand : samsung

Offer : exchange offer

70 of 70

NRT Store Filter - PostFilter

PostFilter(Price:[100 TO 150])

Lucene Segment

0

ProductB

1

ProductA

2

ProductC

3

ProductD

Don’t Delegate

DocId - NrtId

0

1

2

3

3

0

1

2

DocId : 3

NrtId(3)

2

Price(2)

NRT Forward Index (Segment Independent)

100

200

250

150

Price

0

ProductA

1

ProductC

2

ProductD

3

ProductB

Availability

T

F

F

T

Status

01

10

01

00

for d in [matched-docs]

collect d