1 of 24

Things I Wish I'd Known About Elasticsearch When I Started Using It as NoSQL

Jimin Hsieh

2 of 24

Agenda

  • Summary
  • What is Elasticsearch?
  • What I wish I’d known from the beginning

3 of 24

Summary

4 of 24

What is Elasticsearch?

  • Distributed data store
  • NoSQL
  • JSON Document
    • Semi-structured and schema free

5 of 24

Architecture

6 of 24

Inverted Index

7 of 24

Analogy between ES and SQL

  • Index = database
  • Type = table
  • Document = row
  • Field = column
  • Mapping = schema

8 of 24

Document

{
  "_index": "hanmvngj",
  "_type": "stresstest",
  "_id": "7y2lLXkB3FI9wq1OkmCn",
  "_score": 1,
  "_source": {
    "f": "amhksvjeocyogkmlzibvmpakuc",
    "i": "xcwclqmpniijqgeebapkspmcfvzqoypitjvzbcqtba",
    "ak": "gtomqdwncpymlvfljsnhufojytjqouvxsuidoqwtttplvrm",
    "qnxxdsf": "gqrktshotoqijetacvbnqbpdhbumhomuutbiqdqjsfqros",
    "qhxcumzqc": "kuiwcgnrcpzggeldypdvnijq",
    "hwx": "cyexgvrrcdismskdcflwgytcijkoibuvsqxijlgwlxjv",
    "eih": "x"
  }
}

9 of 24

Mapping

{
  "hanmvngj": {
    "mappings": {
      "properties": {
        "afh": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

10 of 24

Data Type

  • JSON data type
    • String
    • Number
    • Boolean
    • Null
    • Array
    • Object
  • Elasticsearch data type
    • Number
      • Long
      • Integer
      • Short
      • Byte
    • Text
    • Keyword
    • …etc.

11 of 24

Mapping

  • Don’t use default dynamic mappings.

12 of 24

Mapping

  • Dynamic field mapping
    • Elasticsearch infers the field type from the data for you
  • Explicit mapping
    • You define the field types yourself
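
To make the difference concrete, here is a minimal sketch of an explicit mapping built as a request body for index creation. The helper name `explicit_mapping` and the field names are hypothetical; `"dynamic": "strict"` is the mapping setting that rejects documents containing unmapped fields instead of letting Elasticsearch guess their types.

```python
import json


def explicit_mapping(keyword_fields, text_fields):
    """Build an index-creation body with an explicit, strict mapping.

    Field names are supplied by the caller; nothing is inferred from data.
    """
    properties = {}
    for name in keyword_fields:
        properties[name] = {"type": "keyword"}
    for name in text_fields:
        properties[name] = {"type": "text"}
    # "strict" makes Elasticsearch reject documents with unmapped fields.
    return {"mappings": {"dynamic": "strict", "properties": properties}}


# Hypothetical field names, for illustration only.
body = explicit_mapping(["service_name", "log_level"], ["message"])
print(json.dumps(body, indent=2))
```

The resulting body would be sent when creating the index (for example with PUT on the index name), before any documents are written.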

13 of 24

Text vs Keyword

            Analyzer   Content                Example
  Keyword   No         Structured content     Service name, log level, IP, etc.
  Text      Yes        Unstructured content   Log
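
The distinction shows up in how each field type is usually queried. The sketch below builds two search bodies: a `term` query for exact matching on a keyword field and a `match` query for analyzed full-text search on a text field. The helper names and field names are hypothetical.

```python
def term_query(field, value):
    """Exact-value lookup; suited to keyword fields, which are not analyzed."""
    return {"query": {"term": {field: value}}}


def match_query(field, text):
    """Full-text lookup; the query text is analyzed like the text field."""
    return {"query": {"match": {field: text}}}


# Hypothetical fields: log_level is a keyword, message is text.
exact = term_query("log_level", "ERROR")
full_text = match_query("message", "connection timeout")
print(exact)
print(full_text)
```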

14 of 24

Analyzer

    • Default Analyzer
      • Standard analyzer
        • Standard tokenizer
        • Lowercase token filter
        • Stop token filter (disabled by default)
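
As a rough illustration of that pipeline, the toy function below tokenizes, lowercases, and optionally removes stop words. It is only an approximation: the real standard tokenizer follows Unicode text segmentation rules, while this sketch uses a simple `\w+` regex. The stop word set defaults to empty because the stop token filter is disabled by default in the standard analyzer. The function name `analyze` is hypothetical.

```python
import re


def analyze(text, stopwords=frozenset()):
    """Toy approximation of the standard analyzer pipeline."""
    tokens = re.findall(r"\w+", text)          # tokenizer (simplified)
    lowered = [t.lower() for t in tokens]      # lowercase token filter
    return [t for t in lowered if t not in stopwords]  # stop token filter


print(analyze("Connection TIMEOUT on node-1"))
# ['connection', 'timeout', 'on', 'node', '1']
print(analyze("Connection TIMEOUT on node-1", stopwords={"on"}))
# ['connection', 'timeout', 'node', '1']
```

This is why a text field matches "timeout" regardless of case, while a keyword field stores and matches the original string as a whole.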

15 of 24

Pagination

16 of 24

Must vs Filter

  • Filter clauses do not contribute to the relevance score.
  • Filter results can be cached.
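
A minimal sketch of a `bool` query that uses both clause types: the `must` clause contributes to the score, while the `filter` clauses only include or exclude documents. The helper name `bool_query` and the field names are hypothetical.

```python
def bool_query(must=None, filters=None):
    """Build a bool query body from scoring and non-scoring clauses."""
    bool_body = {}
    if must:
        bool_body["must"] = list(must)       # scored clauses
    if filters:
        bool_body["filter"] = list(filters)  # yes/no clauses, not scored
    return {"query": {"bool": bool_body}}


# Hypothetical fields: full-text relevance on message, exact filters elsewhere.
query = bool_query(
    must=[{"match": {"message": "timeout"}}],
    filters=[
        {"term": {"log_level": "ERROR"}},
        {"range": {"@timestamp": {"gte": "now-1d"}}},
    ],
)
print(query)
```

When relevance ranking is not needed at all, putting every clause under `filter` is usually the cheaper choice.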

17 of 24

Increase write throughput

  • Turn off replicas (index.number_of_replicas)
  • Increase index.refresh_interval
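
Both settings are dynamic index settings, so they can be changed before a bulk load and restored afterwards. The sketch below only builds the request bodies for the index `_settings` endpoint; the helper names are hypothetical, and the restore defaults (1 replica, `1s` refresh) are Elasticsearch's defaults, which may differ from what a given cluster uses.

```python
def bulk_load_settings():
    """Settings to apply before a heavy write job."""
    return {
        "index": {
            "number_of_replicas": 0,   # no replica copies while loading
            "refresh_interval": "-1",  # disable periodic refresh
        }
    }


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the load has finished."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


print(bulk_load_settings())
print(restore_settings())
```

With replicas at 0 there is no redundancy during the load, so this only makes sense when the data can be re-indexed from the source if a node is lost.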

18 of 24

Turn Off Replica

19 of 24

Index Refresh Interval

20 of 24

Mapping Explosion

  • NoSQL prefers a flattened data model.
  • When you have too many fields in your document
    • index.mapping.total_fields.limit = 1000 (default)
    • If your field mappings contain a large, arbitrary set of keys, consider using the flattened data type (per the official documentation).
      • But you have to pay for the commercial version.
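
To see how close a data set is to the limit before indexing it, one option is to count the distinct field paths across sample documents. The sketch below counts both object and leaf paths; it is a rough lower bound, since multi-fields (such as a `.keyword` sub-field) and field aliases also count toward `index.mapping.total_fields.limit`. The helper name `mapped_paths` and the sample documents are hypothetical.

```python
def mapped_paths(doc, prefix=""):
    """Collect dotted paths of all object and leaf fields in a document."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.add(path)
        # Arrays do not add a level; their elements map to the same path.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                paths |= mapped_paths(item, path)
    return paths


# Arbitrary keys under "tags" create a new mapped field per key.
docs = [
    {"user": {"id": 1, "tags": {"a": 1}}},
    {"user": {"tags": {"b": 2}}},
]
paths = set()
for d in docs:
    paths |= mapped_paths(d)

print(sorted(paths))
# ['user', 'user.id', 'user.tags', 'user.tags.a', 'user.tags.b']
print(len(paths))
# 5
```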

21 of 24

Spark Elasticsearch

  • Limit the number of tasks writing to Elasticsearch at the same time
    • coalesce
      • Be aware of imbalanced data across partitions
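
A minimal PySpark-style sketch of capping the number of writer tasks, assuming the elasticsearch-hadoop connector is on the Spark classpath. The function name `write_to_es` and its parameters are hypothetical; `df` is expected to be a Spark DataFrame. `coalesce` merges existing partitions without a shuffle, so the resulting partitions can be uneven; `repartition` would balance them at the cost of a shuffle.

```python
def write_to_es(df, resource, nodes, max_tasks):
    """Write a DataFrame to Elasticsearch with at most max_tasks writers."""
    (
        df.coalesce(max_tasks)                    # cap concurrent write tasks
        .write.format("org.elasticsearch.spark.sql")
        .option("es.nodes", nodes)                # e.g. "localhost:9200"
        .option("es.resource", resource)          # target index name
        .mode("append")
        .save()
    )


# Usage (requires a running Spark session and Elasticsearch cluster):
# write_to_es(events_df, "logs", "localhost:9200", max_tasks=8)
```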

22 of 24

Java API

23 of 24

Reference

24 of 24

FAQ

Thank you for your attention.