1 of 38

Log Analytics with ELK Stack

(Architecture for aggressive cost optimization and infinite data scale)

Denis D’Souza

DevOps Engineer

2 of 38

About me...

  • Currently a DevOps engineer at Moonfrog Labs

  • 6+ years working as a DevOps Engineer, SRE, and Linux administrator

Worked on a variety of technologies in both service-based and product-based organisations

  • How do I spend my free time?

Learning new technologies and playing PC games

www.linkedin.com/in/denis-dsouza

3 of 38

Who are we?

  • A mobile gaming company making mass-market social games
  • 5M+ Daily Active Users, 15M+ Weekly Active Users
  • Real-time, cross-platform games optimized for our primary market(s): India and the subcontinent
  • Profitable!

Current Scale

Peak Concurrent Playing Users: 500,000 (half a million)

4 of 38

Our problem statement

“We were looking for a log analytics platform that fits our requirements.

We evaluated, compared, and decided to build a self-managed ELK stack.”

  1. Our business requirements
  2. Choosing the right option
  3. ELK Stack overview
  4. Our ELK architecture
  5. Optimizations we did
  6. Backup and restore
  7. Disaster recovery
  8. Cost savings
  9. Key takeaways

5 of 38

Our business requirements

  • Log analytics platform (Web-Server, Application, Database logs)
  • Data Ingestion rate: ~300GB/day
  • Frequently accessed data: last 8 days
  • Infrequently accessed data: 82 days (90 - 8 days)
  • Uptime: 99.90%
  • Hot Retention period: 90 days
  • Cold Retention period: 90 days (with potential to increase)
  • Simple and Cost effective solution
  • Fairly predictable concurrent user-base
  • Not to be used for storing user/business data

6 of 38

Choosing the right option

|                | ELK stack            | Splunk                                           | Sumo logic                                      |
| Product        | Self managed         | Cloud                                            | Professional                                    |
| Pricing        | ~ $30 per GB / month | ~ $100 per GB / month *                          | ~ $108 per GB / month *                         |
| Data Ingestion | ~ 300 GB / day       | ~ 100 GB / day * (post-ingestion custom pricing) | ~ 20 GB / day * (post-ingestion custom pricing) |
| Retention      | ~ 90 days            | ~ 90 days *                                      | ~ 30 days *                                     |
| Cost/GB/day    | ~ $0.98 per GB / day | ~ $3.33 per GB / day *                           | ~ $3.60 per GB / day *                          |

* Values are estimates taken from the 'product pricing web-page' of the respective products; they may not represent the actual values and are meant for comparison only.

References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/

7 of 38

ELK Stack overview

8 of 38

ELK Stack overview: Terminologies

  • Index
  • Shard
    • Primary
    • Replica
  • Segment
  • Node

References:

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
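The terms above map directly onto the index APIs. A minimal sketch that creates a daily index with 5 primary shards and 1 replica per shard; the index name and counts are illustrative, and the <domain> placeholder follows the other snippets in this deck:

curl -XPUT "http://<domain>:9200/appserver-2019.01.01" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'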

9 of 38

Our ELK architecture

10 of 38

Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)

11 of 38

Our ELK architecture: Size and scale

|   | Service       | Number of Nodes | Total CPU Cores | Total RAM | Storage (EBS) |
| 1 | Elasticsearch | 7               | 28              | 141 GB    |               |
| 2 | Logstash      | 3               | 6               | 12 GB     |               |
| 3 | Kibana        | 1               | 1               | 4 GB      |               |
|   | Total         | 11              | 35              | 157 GB    | ~ 20 TB       |

Data-ingestion per day: ~ 300 GB
Hot Retention policy: 90 days
Docs/sec (at peak load): ~ 7K

12 of 38

Optimizations we did

Application Side

  • Logstash
  • Elasticsearch

Infrastructure Side

  • EC2
  • EBS
  • Data transfer

13 of 38

Optimizations we did: Application side

14 of 38

Optimizations we did: Logstash

Pipeline Workers:

  • Adjusted "pipeline.workers" to 4x the number of cores to improve CPU utilisation on the Logstash servers (threads may spend significant time in an I/O wait state)

### Core-count: 2 ###
...
pipeline.workers: 8
...

Configuration Optimizations

logstash.yml

(Chart: CPU utilisation with pipeline.workers: 2 vs pipeline.workers: 8)

References:

https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html

15 of 38

Optimizations we did: Logstash

'info' logs:

  • Separated application 'info' logs into their own index with a retention policy of fewer days (a retention sketch follows the snippets below)

'200' response-code logs:

  • Separated access logs with a '200' response code into their own index with a retention policy of fewer days

Filter Optimizations

filters.conf

if [sourcetype] == "app_logs" and [level] == "info" {
  elasticsearch {
    index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    ...

if [sourcetype] == "nginx" and [status] == "200" {
  elasticsearch {
    index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    ...
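The shorter retention for these separated indexes can be enforced with Curator (the same tool used later for Hot-Warm allocation). A minimal sketch; the 3-day unit_count and the index prefix are illustrative assumptions, not our exact values:

actions:
  1:
    action: delete_indices
    description: "Delete 'info' application-log indexes older than 3 days"
    options:
      ignore_empty_list: true
    filters:
    - filtertype: pattern
      kind: prefix
      value: app_logs-info-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 3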

References:

https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html

16 of 38

Optimizations we did: Logstash

Log ‘message’ field:

  • Removed the "message" field if there were no grok parse failures in Logstash while applying grok patterns (reduced storage footprint by ~30% per doc)

if "_grokparsefailure" not in [tags] {

mutate {

remove_field => ["message"]

}

}

Filter Optimizations

filters.conf

E.g.:

Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"

Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}

17 of 38

Optimizations we did: Application side

18 of 38

Optimizations we did: Elasticsearch

JVM heap vs non-heap memory:

  • Optimised the JVM heap size by monitoring the GC interval; this helped utilise system memory efficiently (~33% for the JVM heap, ~66% for non-heap) *

### Total system Memory 15GB ###

-Xms5g

-Xmx5g

Configuration Optimizations

jvm.options
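While tuning, heap usage and GC counts can be checked per node with the nodes-stats API; a minimal sketch (the <domain> placeholder follows the other snippets in this deck):

### check JVM heap usage and GC activity per node ###
curl -s "http://<domain>:9200/_nodes/stats/jvm?pretty"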

(Charts: GC behaviour with heap too small vs heap too large vs optimised heap)

* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html

19 of 38

Optimizations we did: Elasticsearch

Shards:

  • Created index templates with the number of shards as a multiple of the number of Elasticsearch nodes (helps fix shard-distribution imbalance, which had resulted in uneven disk and compute resource usage)

Replicas:

  • Removed replicas for the required indexes (50% savings on storage cost, ~30% reduction in compute resource utilization)

### Number of ES nodes: 5 ###
{
  ...
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0",
    ...
  }
}

Template configuration

Trade-offs:

  • Removing replicas makes search queries slower, as replicas are also used to serve search operations
  • It is not recommended to run production clusters without replicas
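A template like this is typically applied through the index-template API; a minimal sketch, assuming the template is named "appserver" (the name is illustrative):

curl -XPUT "http://<domain>:9200/_template/appserver" -H 'Content-Type: application/json' -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'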

20 of 38

Optimizations we did: Infrastructure side

AWS

  • EC2
  • EBS
  • Data transfer (Inter AZ)

The Spotinst platform allows users to reliably leverage excess (spot) capacity, simplify cloud operations, and save up to 80% on compute costs.

21 of 38

Optimizations we did: Infrastructure side

22 of 38

Optimizations we did: EC2 and spot

- Stateful EC2 Spot instances:

  • Moved all ELK nodes to run on spot instances (instances maintain their IP addresses and EBS volumes)

Recovery time: < 10 mins

Trade-offs:

  • Prefer using previous generation instance types to reduce frequent spot take-backs

23 of 38

Optimizations we did: EC2 and spot

Auto-Scaling:

  • Performance/time based auto-scaling for Logstash Instances

24 of 38

Optimizations we did: Infrastructure side

25 of 38

Optimizations we did: EBS

- "Hot-Warm" Architecture:

  • "Hot" nodes: store active indexes, use GP2 EBS-disks (General purpose SSD)
  • "Warm" nodes: store passive indexes, use SC1 EBS-disks (Cold storage)�(~69% savings on storage cost)

elasticsearch.yml

node.attr.box_type: hot
...

Template configuration

...
"template": "appserver-*",
"settings": {
  "index": {
    "routing": {
      "allocation": {
        "require": {
          "box_type": "hot"
        }
      }
    }
  }
},
...
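The "Warm" nodes carry the mirror-image attribute in their elasticsearch.yml; a minimal sketch:

node.attr.box_type: warm

Curator (next slide) later switches each index's allocation requirement from hot to warm.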

Trade-offs:

  • Since "Warm" nodes are using SC1 EBS-disks,�they have lower IOPS, throughput this will result in search operations being comparatively slower

References:

https://cinhtau.net/2017/06/14/hot-warm-architecture/

26 of 38

Optimizations we did: EBS

Moving indexes to "Warm" nodes:

  • Reallocated indexes older than 8 days to "Warm" nodes
  • Recommended to perform this operation during off-peak hours as it is I/O intensive

actions:
  1:
    action: allocation
    description: "Move index to Warm-nodes after 8 days"
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: true
      timeout_override:
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 8
    ...

Curator configuration
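An action file like this is typically run on a schedule with the Curator CLI; a minimal sketch (the file paths are illustrative):

curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml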

References:

https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

27 of 38

Optimizations we did: Inter-AZ data transfer

Single Availability Zone:

  • Migrated all ELK nodes to a single availability zone (reduced inter-AZ data transfer cost for ELK nodes by 100%)
  • Data transfer/day: ~700GB (Logstash to Elasticsearch: ~300GB, Elasticsearch inter-node communication: ~400GB)

Trade-offs:

  • It is not recommended to run production clusters in a single AZ as it will result in downtime and potential data loss in case of AZ failures

28 of 38

Data backup and restore

Using S3 for index Snapshots:

  • Take snapshots of indexes and store them in S3

Backup:

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
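The snapshot call above assumes the "s3_repository" repository has already been registered (this needs the repository-s3 plugin on the nodes); a minimal sketch with an illustrative bucket name:

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "elk-index-snapshots"
  }
}'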

References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html

https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f

29 of 38

Data backup and restore

Restore:

On-demand Elasticsearch cluster:

  • Launch an on-demand ES cluster and import the snapshots from S3

Existing Cluster:

  • Restore the required snapshots to the existing cluster

curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
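Before restoring, the snapshots available in the repository can be listed with the _cat API; a minimal sketch:

curl -s "http://<domain>:9200/_cat/snapshots/s3_repository?v"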

30 of 38

Disaster recovery

Data corruption:

  • List indexes with status 'Red' (e.g. via the _cat API shown below)
  • Delete the corrupted indexes
  • Restore the indexes from S3 snapshots
  • Recovery time: depends on the size of the data
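Red indexes can be listed with the _cat indices endpoint; a minimal sketch:

curl -s "http://<domain>:9200/_cat/indices?health=red&v"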

Node failure due to AZ going down:

  • Launch a new ELK cluster using AWS cloud formation templates
  • Make the necessary config changes in Filebeat, Logstash, etc.
  • Restore the required indexes from S3 snapshots
  • Recovery time: depends on provisioning time and size of data

Node failures due to underlying hardware issue:

  • Recycle the node in the Spotinst console (takes an AMI of the root volume, launches a new instance, re-attaches the EBS volumes, and keeps the private IP)
  • Recovery time: < 10 mins/node

Snapshot restore time (estimates):

  • < 4 mins for a 20GB snapshot (test cluster: 3 nodes, multiple indexes with 3 primary shards each, no replicas)

31 of 38

Cost savings: EC2

EC2

| Instance type              | Service       | Daily cost ($) |
| 5 x r5.xlarge (20C, 160GB) | Elasticsearch | 40.80          |
| 3 x c5.large (6C, 12GB)    | Logstash      | 7.17           |
| 1 x t3.medium (2C, 4GB)    | Kibana        | 1.29           |
| Total                      |               | ~ 49.26        |

EC2 (optimized)

Daily cost reflects 65% savings + Spotinst charges (20% of savings)

| Instance type             | Service            | Daily cost ($) | Total Savings |
| 5 x m4.xlarge (20C, 80GB) | Elasticsearch Hot  | 14.64          |               |
| 2 x r4.xlarge (8C, 61GB)  | Elasticsearch Warm | 7.50           |               |
| 3 x c4.large (6C, 12GB)   | Logstash           | 3.50           |               |
| 1 x t2.medium (2C, 4GB)   | Kibana             | 0.69           |               |
| Total                     |                    | ~ 26.33        | ~ 47%         |

32 of 38

Cost savings: Storage

Storage (Ingesting: 300GB/day, Retention: 90 days, Replica count: 1)

| Storage type | Retention | Daily cost ($) |
| ~ 54TB (GP2) | 90 days   | ~ 237.60       |

Storage (optimized) (Ingesting: 300GB/day, Retention: 90 days, Replica count: 0, Backups: daily S3 snapshots)

| Storage type       | Retention | Daily cost ($) | Total Savings |
| ~ 3TB (GP2) Hot    | 8 days    | 12.00          |               |
| ~ 24TB (SC1) Warm  | 82 days   | 24.00          |               |
| ~ 27TB (S3) Backup | 90 days   | 22.50          |               |
| Total              |           | ~ 58.50        | ~ 75%         |

33 of 38

Total Savings

|                    | ELK stack | ELK stack (optimized) | Savings |
| EC2                | 49.40     | 26.33                 | 47%     |
| Storage            | 237.60    | 58.50                 | 75%     |
| Data-transfer      | 7         | 0                     | 100%    |
| Total (daily cost) | ~ $294.00 | ~ $84.83              | ~ 71% * |
| Cost/GB (daily)    | ~ $0.98   | ~ $0.28               |         |

* Total savings are exclusive of some of the application-level optimizations done

34 of 38

Our Costs vs other Platforms

|                | ELK Stack (optimized) | ELK Stack          | Splunk                                         | Sumo logic                                    |
| Product        | Self managed          | Self managed       | Cloud                                          | Professional                                  |
| Data Ingestion | ~ 300GB/day           | ~ 300GB/day        | ~ 100 GB/day * (post-ingestion custom pricing) | ~ 20 GB/day * (post-ingestion custom pricing) |
| Retention      | ~ 90 days             | ~ 90 days          | ~ 90 days *                                    | ~ 30 days *                                   |
| Cost/GB/day    | ~ $0.28 per GB/day    | ~ $0.98 per GB/day | ~ $3.33 per GB/day *                           | ~ $3.60 per GB/day *                          |

Savings over traditional ELK stack: 71% *

* Values are estimates taken from the 'product pricing web-page' of the respective products; they may not represent the actual values and are meant for comparison only.

* Total savings are exclusive of some of the application-level optimizations done.

35 of 38

Key Takeaways

ELK Stack Scalability:

  • Logstash: auto-scaling
  • Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling

Handling potential data-loss while AZ is down:

  • DR mechanisms in place; daily/hourly backups stored in S3; potential data loss of about 1 hour
  • We do not store user data or business metrics in ELK, so users/business will not be impacted

Handling potential data-corruptions in Elasticsearch:

  • DR mechanisms in place, recover index from S3 index-snapshots

Managing downtime during spot take-backs:

  • Logstash: multiple nodes, minimal impact
  • Elasticsearch/Kibana: < 10min downtime per node
  • Use previous-generation instance types, as spot take-back chances are comparatively lower

36 of 38

Key Takeaways

Handling back-pressure when a node is down:

  • Filebeat: will auto-retry to send old logs
  • Logstash: use the 'date' filter for the document timestamp (see the sketch below), auto-scaling
  • Elasticsearch: overprovisioning
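The 'date' filter keeps the event's own time rather than the ingestion time, so logs that arrive late (e.g. Filebeat retries after an outage) still land on the right timeline. A minimal sketch, assuming the "timestamp" field extracted by the nginx grok pattern shown earlier:

filter {
  date {
    # parse the nginx access-log time, e.g. 26/Mar/2016:19:09:19 -0400
    match  => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}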

Other log analytics alternatives:

  • We have only evaluated ELK, Splunk and Sumo Logic

ELK stack upgrade path:

  • Blue-green deployment for major version upgrades

37 of 38

Reflection

  • We built a platform tailored to our requirements, yours might be different...

  • Building a log analytics platform is not rocket science, but it can be painfully iterative if you are not aware of the options

  • Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for your specific requirements

38 of 38

Thank you!

Denis D’Souza

DevOps Engineer

Happy to take your questions.

Copyright Disclaimer: All rights to the materials used for this presentation belong to their respective owners.

www.linkedin.com/in/denis-dsouza