FIFTH ELEPHANT
JULY 27th 2017
JULY 27-28, 2017 • BANGALORE, INDIA
Near Real-Time Indexing
Building a Real-Time Search Index for E-Commerce
Work Done @ Flipkart
Umesh Prasad,
Independent Consultant, Solr/Lucene Expert
Search & ML @Unbxd
About Myself
AGENDA
E-Commerce Marketplace
231 million docs
drill down filters
Top positions at premium*
EXPERIENCE
The Good
“Our teams and sellers worked days and nights to make this sale a success – and our efforts paid off. We got a billion hits on our site today and achieved our 24 hour sales target of $100 mn in GMV in just 10 hours”
!! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]
!! Steal Deals !!
Engineering 101
Normal Day - Ranking
Sales day Ranking
* But it is all essential for growth
Reduce Data Lag
Source of Truth → Search Index → Front End
Source of Truth
Seller Rating Service
Catalogue Service
Promise Service
Availability Service
Offer Service
Pricing Service
Product aka SKU
Listings
Standard Lambda architecture, Literature Available
Search INDEX
Demystifying Lucene
Lucene Resources
Inside a Lucene Index
Blog : https://lingpipe-blog.com/2012/07/24/using-luke-the-lucene-index-browser-to-develop-search-queries/
Michael McCandless’s blogs http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html
Video : https://www.youtube.com/watch?list=PLGeM09tlguZTaS5FNoJGYEohaubtIvErS&v=fQAAzpk4oQ4#t=392
Write-ups : https://github.com/DmitryKey/luke/wiki
GitHub : https://github.com/DmitryKey/luke
Personal Experience
Final Solution : Take Away
Search index
Basically We built a
Why Not use Source of Truth during Ranking
SolrCloud : How it works
SolrCloud : Rejection
E-commerce Marketplace
Special Data Characteristics
E-commerce Document
Text Relevance vs Real-Time Attributes
[update rates comparison]
| signal | updates / sec (normal) | updates / sec (peak) | updates / hr |
| text / catalogue | ~10 | ~100 | ~100K |
| pricing | ~100 | ~1K | ~10 million |
| availability | ~100 | ~10K | ~10 million |
| offer | ~100 | ~10K | ~10 million |
| seller rating | ~10 | ~1K | ~1 million |
| signal 6 | ~10 | ~100 | ~1 million |
| signal 7 | ~100 | ~10K | ~10 million |
| signal 8 | ~100 | ~10K | ~10 million |
Ingestion pipeline
Catalogue API
Pricing API
Availability API
Offers API
...
Document Builder
Change Propagation
Documents {L1,L2 … P1}
Updates Stream 1
Updates Stream 2
Updates Stream 3
Bottleneck 1 : Document Builder
Partial Data
Bottleneck 2 : Lucene Segment Merges
Credits : http://blog.mikemccandless.com/
What we tried [ & failed]
QUOTE
Thomas Edison in Digital Age
Agile + Team
E-commerce Document
Base Index
BUILDING NRT STORE
NRT Store : Requirements
Let’s put some numbers
Why Commitless ?
Demystifying NRT Store
Demystifying NRT Store
NRT Forward Index - Considerations
NRT Forward Index Design - V1
HashMap based Implementation
NRT Forward Index
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
HashMap:
| ProductId | Availability | Price |
| ProductA | True | 100 |
| ProductB | False | 150 |
| ProductC | False | 200 |
| ProductD | True | 250 |
Lookup Engine: DocId : 3, field : price → ProductId(3) = ProductD → <ProductD,price> → 250
Latency : ~10 secs for ~1 Million lookups
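The V1 lookup path on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (DocId → ProductId from the segment, then a hash lookup keyed by product and field), not the code used at Flipkart; the names `segment_product_ids`, `nrt_store`, `lookup` and `apply_update` are hypothetical.

```python
from typing import Dict, Tuple, Union

Value = Union[int, bool]

# Lucene segment: position = DocId, value = ProductId (as on the slide).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index as a hash map keyed by (ProductId, field).
nrt_store: Dict[Tuple[str, str], Value] = {
    ("ProductA", "availability"): True,  ("ProductA", "price"): 100,
    ("ProductB", "availability"): False, ("ProductB", "price"): 150,
    ("ProductC", "availability"): False, ("ProductC", "price"): 200,
    ("ProductD", "availability"): True,  ("ProductD", "price"): 250,
}


def lookup(doc_id: int, field: str) -> Value:
    """Resolve DocId -> ProductId, then fetch the field from the hash map."""
    product_id = segment_product_ids[doc_id]
    return nrt_store[(product_id, field)]


def apply_update(product_id: str, field: str, value: Value) -> None:
    """Apply an update in place; no Lucene commit is involved."""
    nrt_store[(product_id, field)] = value


print(lookup(3, "price"))  # 250
```

Every lookup here hashes a string-based key and chases object references, which is one plausible reason this design is slow when repeated for around a million matched documents per query.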
HashMap :BottleNecks
NRT Forward Index : V2
ID Mappings
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
NRT Forward Index (Segment Independent)
| NrtId | ProductId | Price | Availability | Status |
| 0 | ProductA | 100 | T | 01 |
| 1 | ProductC | 200 | F | 10 |
| 2 | ProductD | 250 | F | 01 |
| 3 | ProductB | 150 | T | 00 |
Data structures
Type Tuned Data structures
Foreign Key + Array Based Implementation
Lookup Engine: DocId : 3, field : price → NrtId(3) = 2 → Price(2) = 250
Latency : ~100 ms for ~1 Million lookups
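A minimal sketch of the foreign key plus array lookup, using the values from the slides. It assumes one DocId → NrtId array per Lucene segment and one primitive array per field indexed by NrtId; the names `doc_to_nrt`, `price`, `lookup_price` and `update_price` are hypothetical.

```python
from array import array

# Per-segment mapping: position = DocId, value = NrtId (the foreign key).
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent column: position = NrtId, value = price.
price = array("i", [100, 200, 250, 150])


def lookup_price(doc_id: int) -> int:
    """Two array reads: DocId -> NrtId, then NrtId -> price."""
    return price[doc_to_nrt[doc_id]]


def update_price(nrt_id: int, new_price: int) -> None:
    """Overwrite one slot; the Lucene segment itself is untouched."""
    price[nrt_id] = new_price


print(lookup_price(3))  # 250
```

Compared with the hash map version, a lookup is two index operations on primitive arrays with no hashing and no string keys, which fits the direction of the latency improvement reported on the slide.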
NRT Inverted Index
Filters : Requirement
234K Products
NRT Forward Store ⇒ Posting List
NRT Filter: Offer:O1 → DocIdSet
NRT Forward Store → NRT Inverter → Inverted Index | Posting List
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
Inverted Index | Posting List
Availability : T → 0, 3
Offer : O1 → 2, 3
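One way to picture the inverter step is to walk a segment's DocId → NrtId mapping and set a bit for every document whose forward-store value matches the filter term. The sketch below uses a Python integer as the bitset; the column values are illustrative and chosen only so that the resulting posting lists match the ones on the slide, and the names `invert` and `doc_ids` are hypothetical.

```python
from typing import List, Optional, Sequence

# Per-segment mapping: position = DocId, value = NrtId.
doc_to_nrt = [3, 0, 1, 2]

# Forward-store columns indexed by NrtId (illustrative values).
availability: List[bool] = [False, False, True, True]
offer: List[Optional[str]] = [None, "O1", "O1", None]


def invert(doc_to_nrt: Sequence[int], column: Sequence, wanted) -> int:
    """Build a DocId bitset for one term by scanning the forward column."""
    bits = 0
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        if column[nrt_id] == wanted:
            bits |= 1 << doc_id
    return bits


def doc_ids(bits: int) -> List[int]:
    """Expand a bitset into a sorted posting list of DocIds."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


available = invert(doc_to_nrt, availability, True)
has_offer = invert(doc_to_nrt, offer, "O1")
print(doc_ids(available))              # [0, 3]
print(doc_ids(has_offer))              # [2, 3]
print(doc_ids(available & has_offer))  # [3]
```

Because the result is a plain DocId set, it can be intersected with other filters in the same way as a posting list read from the Lucene segment.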
Final Solution
Solr Integration Points
Near Real Time Solr Architecture
Solr
Kafka
Ingestion pipeline
NRT Forward Index
Ranking
Matching
Faceting
Redis
Bootstrap
NRT Inverted store
Solr Master
NRT Updates
Lucene Updates
Catalogue
Pricing
Availability
Offers
Seller Quality
Commit
+
Replicate
+
Reopen
Lucene
Others
800K active users
160K requests per sec
median : 11 ms
99th perc: 1.1 sec
Accomplishments
Accomplishments @ Flipkart
QUESTIONS?
Product /Listing: Attributes
Product aka SKU
Listings
SCALE @ FLIPKART
Root Cause
Comparison with SolrCloud
Lucene Index
Documents
| ProductId | brand | availability | price |
| ProductA | Apple | T | 45000 |
| ProductB | Samsung | T | 23000 |
| ProductC | Apple | F | 5000 |
Lucene Segment
Document ID Mappings: 0 → ProductA, 1 → ProductB, 2 → ProductC
Posting List (Inverted Index): Terms → Sparse Bitsets
availability : T → 0 , 1
brand : Samsung → 1
brand : Apple → 0 , 2
DocValues (columnar data)
Price: 45000, 23000, 5000
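The two structures on this slide can be reproduced with a toy model: an inverted index maps each term to the DocIds that contain it, and a DocValues column stores one value per DocId. This is a simplified sketch of the concepts and not how Lucene encodes them on disk; the variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Documents in DocId order, as on the slide.
docs = [
    {"id": "ProductA", "brand": "Apple", "availability": "T", "price": 45000},
    {"id": "ProductB", "brand": "Samsung", "availability": "T", "price": 23000},
    {"id": "ProductC", "brand": "Apple", "availability": "F", "price": 5000},
]

# Inverted index: (field, term) -> sorted list of DocIds (posting list).
postings: Dict[Tuple[str, str], List[int]] = defaultdict(list)
for doc_id, doc in enumerate(docs):
    for field in ("brand", "availability"):
        postings[(field, doc[field])].append(doc_id)

# DocValues: one column per field, indexed by DocId.
price_doc_values = [doc["price"] for doc in docs]

print(postings[("brand", "Apple")])     # [0, 2]
print(postings[("availability", "T")])  # [0, 1]
print(price_doc_values[1])              # 23000
```

Matching reads the posting lists, while ranking, sorting and faceting read the per-document columns, which is why the talk treats the forward and inverted sides of the NRT store separately.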
A Typical Search Flow
Query Rewrite
Results
Query
Matching
Ranking
Faceting
Stats
Posting List
Doc Values
Other Components
Lucene Segment
Inverted Index
Forward Index
NRT Store
samsung mobiles
Offer : exchange offer
price desc
category : mobiles
brand : samsung
Offer : exchange offer
NRT Store Filter - PostFilter
PostFilter(Price:[100 TO 150])
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
NRT Forward Index (Segment Independent)
| NrtId | ProductId | Price | Availability | Status |
| 0 | ProductA | 100 | T | 01 |
| 1 | ProductC | 200 | F | 10 |
| 2 | ProductD | 250 | F | 01 |
| 3 | ProductB | 150 | T | 00 |
DocId : 3 → NrtId(3) = 2 → Price(2) = 250 → Don’t Delegate
for d in [matched-docs]
    if Price(NrtId(d)) in [100 TO 150]
        collect d
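The post-filter loop can be sketched as a generator that checks each matched document against the NRT forward index and only passes it on when the value falls inside the range. In Solr this role is played by the `PostFilter` interface with a `DelegatingCollector`; the Python below only mimics that control flow, and the function name `post_filter` is hypothetical.

```python
from typing import Iterable, Iterator, Sequence

# Per-segment mapping (DocId -> NrtId) and price column (indexed by NrtId).
doc_to_nrt = [3, 0, 1, 2]
price = [100, 200, 250, 150]


def post_filter(
    matched_docs: Iterable[int],
    doc_to_nrt: Sequence[int],
    column: Sequence[int],
    low: int,
    high: int,
) -> Iterator[int]:
    """Yield only the matched DocIds whose NRT value lies in [low, high]."""
    for doc_id in matched_docs:
        value = column[doc_to_nrt[doc_id]]
        if low <= value <= high:
            yield doc_id  # "collect": hand the doc to the next stage
        # otherwise: don't delegate, the doc is dropped here


# PostFilter(Price:[100 TO 150]) over all four docs of the segment.
print(list(post_filter([0, 1, 2, 3], doc_to_nrt, price, 100, 150)))  # [0, 1]
```

DocId 3 resolves to NrtId 2 and a price of 250, so it is not delegated, which matches the lookup chain shown on the slide.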