1 of 38

Log Analytics with ELK Stack

(Architecture for aggressive cost optimization and infinite data scale)

Denis D’Souza

DevOps Engineer

2 of 38

About me...

  • Currently a DevOps engineer at Moonfrog Labs

  • 6+ years working as a DevOps Engineer, SRE, and Linux administrator

Worked on a variety of technologies in both service-based and product-based organisations

  • How do I spend my free time?

Learning new technologies and playing PC games

www.linkedin.com/in/denis-dsouza

3 of 38

Who are we?

  • A mobile gaming company making mass-market social games
  • 5M+ Daily Active Users, 15M+ Weekly Active Users
  • Real-time, cross-platform games optimized for our primary market(s): India and the subcontinent
  • Profitable!

Current Scale

Peak Concurrent Playing Users: 500,000 (half a million)

4 of 38

Our problem statement

“We were looking for a log analytics platform that fits our requirements.

We evaluated, compared, and decided to build a self-managed ELK stack.”

  1. Our business requirements
  2. Choosing the right option
  3. ELK Stack overview
  4. Our ELK architecture
  5. Optimizations we did
  6. Backup and restore
  7. Disaster recovery
  8. Cost savings
  9. Key takeaways

5 of 38

Our business requirements

  • Log analytics platform (Web-Server, Application, Database logs)
  • Data Ingestion rate: ~300GB/day
  • Frequently accessed data: last 8 days
  • Infrequently accessed data: 82 days (90 - 8 days)
  • Uptime: 99.90%
  • Hot Retention period: 90 days
  • Cold Retention period: 90 days (with potential to increase)
  • Simple and Cost effective solution
  • Fairly predictable concurrent user-base
  • Not to be used for storing user/business data

6 of 38

Choosing the right option

|                | ELK stack            | Splunk                                           | Sumo logic                                      |
| Product        | Self managed         | Cloud                                            | Professional                                    |
| Pricing        | ~ $30 per GB / month | ~ $100 per GB / month *                          | ~ $108 per GB / month *                         |
| Data Ingestion | ~ 300 GB / day       | ~ 100 GB / day * (post-ingestion custom pricing) | ~ 20 GB / day * (post-ingestion custom pricing) |
| Retention      | ~ 90 days            | ~ 90 days *                                      | ~ 30 days *                                     |
| Cost/GB/day    | ~ $0.98 per GB / day | ~ $3.33 per GB / day *                           | ~ $3.60 per GB / day *                          |

* Values are estimates taken from the 'product pricing web-page' of the respective products; they may not represent the actual values and are meant for comparison only.

References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/

7 of 38

ELK Stack overview

8 of 38

ELK Stack overview: Terminologies

  • Index
  • Shard
    • Primary
    • Replica
  • Segment
  • Node

References:

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
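The terms above map directly onto the index APIs. A minimal sketch that creates a daily index with 5 primary shards and 1 replica per shard; the index name and counts are illustrative, and the <domain> placeholder follows the other snippets in this deck:

curl -XPUT "http://<domain>:9200/appserver-2019.01.01" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'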

9 of 38

Our ELK architecture

10 of 38

Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)

11 of 38

Our ELK architecture: Size and scale

|   | Service       | Number of Nodes | Total CPU Cores | Total RAM | Storage (EBS) |
| 1 | Elasticsearch | 7               | 28              | 141 GB    |               |
| 2 | Logstash      | 3               | 6               | 12 GB     |               |
| 3 | Kibana        | 1               | 1               | 4 GB      |               |
|   | Total         | 11              | 35              | 157 GB    | ~ 20 TB       |

Data-ingestion per day: ~ 300 GB
Hot Retention policy: 90 days
Docs/sec (at peak load): ~ 7K

12 of 38

Optimizations we did

Application Side

  • Logstash
  • Elasticsearch

Infrastructure Side

  • EC2
  • EBS
  • Data transfer

13 of 38

Optimizations we did: Application side

14 of 38

Optimizations we did: Logstash

Pipeline Workers:

  • Adjusted "pipeline.workers" to 4x the number of cores to improve CPU utilisation on the Logstash servers (threads may spend significant time in an I/O wait state)

### Core-count: 2 ###
...
pipeline.workers: 8
...

Configuration Optimizations

logstash.yml

(Chart: CPU utilisation with pipeline.workers: 2 vs pipeline.workers: 8)

References:

https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html

15 of 38

Optimizations we did: Logstash

'info' logs:

  • Separated application 'info' logs into their own index with a retention policy of fewer days (a retention sketch follows the snippets below)

'200' response-code logs:

  • Separated access logs with a '200' response code into their own index with a retention policy of fewer days

Filter Optimizations

filters.conf

if [sourcetype] == "app_logs" and [level] == "info" {
  elasticsearch {
    index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    ...

if [sourcetype] == "nginx" and [status] == "200" {
  elasticsearch {
    index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    ...
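The shorter retention for these separated indexes can be enforced with Curator (the same tool used later for Hot-Warm allocation). A minimal sketch; the 3-day unit_count and the index prefix are illustrative assumptions, not our exact values:

actions:
  1:
    action: delete_indices
    description: "Delete 'info' application-log indexes older than 3 days"
    options:
      ignore_empty_list: true
    filters:
    - filtertype: pattern
      kind: prefix
      value: app_logs-info-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 3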

References:

https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html

16 of 38

Optimizations we did: Logstash

Log ‘message’ field:

  • Removed the "message" field if there were no grok parse failures in Logstash while applying grok patterns (reduced storage footprint by ~30% per doc)

if "_grokparsefailure" not in [tags] {

mutate {

remove_field => ["message"]

}

}

Filter Optimizations

filters.conf

E.g.:

Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"

Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}

17 of 38

Optimizations we did: Application side

18 of 38

Optimizations we did: Elasticsearch

JVM heap vs non-heap memory:

  • Optimised the JVM heap size by monitoring the GC interval; this helped utilise system memory efficiently (~33% for the JVM heap, ~66% for non-heap) *

### Total system Memory 15GB ###

-Xms5g

-Xmx5g

Configuration Optimizations

jvm.options
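While tuning, heap usage and GC counts can be checked per node with the nodes-stats API; a minimal sketch (the <domain> placeholder follows the other snippets in this deck):

### check JVM heap usage and GC activity per node ###
curl -s "http://<domain>:9200/_nodes/stats/jvm?pretty"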

(Charts: GC behaviour with heap too small vs heap too large vs optimised heap)

* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html

19 of 38

Optimizations we did: Elasticsearch

Shards:

  • Created index templates with the number of shards as a multiple of the number of Elasticsearch nodes (helps fix shard-distribution imbalance, which had resulted in uneven disk and compute resource usage)

Replicas:

  • Removed replicas for the required indexes (50% savings on storage cost, ~30% reduction in compute resource utilization)

### Number of ES nodes: 5 ###
{
  ...
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0",
    ...
  }
}

Template configuration

Trade-offs:

  • Removing replicas makes search queries slower, as replicas are also used to serve search operations
  • It is not recommended to run production clusters without replicas
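A template like this is typically applied through the index-template API; a minimal sketch, assuming the template is named "appserver" (the name is illustrative):

curl -XPUT "http://<domain>:9200/_template/appserver" -H 'Content-Type: application/json' -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'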

20 of 38

Optimizations we did: Infrastructure side

AWS

  • EC2
  • EBS
  • Data transfer (Inter AZ)

The Spotinst platform allows users to reliably leverage excess (spot) capacity, simplify cloud operations, and save up to 80% on compute costs.

21 of 38

Optimizations we did: Infrastructure side

22 of 38

Optimizations we did: EC2 and spot

- Stateful EC2 Spot instances:

  • Moved all ELK nodes to run on spot instances (instances maintain their IP addresses and EBS volumes)

Recovery time: < 10 mins

Trade-offs:

  • Prefer using previous generation instance types to reduce frequent spot take-backs

23 of 38

Optimizations we did: EC2 and spot

Auto-Scaling:

  • Performance/time based auto-scaling for Logstash Instances

24 of 38

Optimizations we did: Infrastructure side

25 of 38

Optimizations we did: EBS

- "Hot-Warm" Architecture:

  • "Hot" nodes: store active indexes, use GP2 EBS-disks (General purpose SSD)
  • "Warm" nodes: store passive indexes, use SC1 EBS-disks (Cold storage)�(~69% savings on storage cost)

elasticsearch.yml

node.attr.box_type: hot
...

Template configuration

...
"template": "appserver-*",
"settings": {
  "index": {
    "routing": {
      "allocation": {
        "require": {
          "box_type": "hot"
        }
      }
    }
  }
},
...
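The "Warm" nodes carry the mirror-image attribute in their elasticsearch.yml; a minimal sketch:

node.attr.box_type: warm

Curator (next slide) later switches each index's allocation requirement from hot to warm.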

Trade-offs:

  • Since "Warm" nodes are using SC1 EBS-disks,�they have lower IOPS, throughput this will result in search operations being comparatively slower

References:

https://cinhtau.net/2017/06/14/hot-warm-architecture/

26 of 38

Optimizations we did: EBS

Moving indexes to "Warm" nodes:

  • Reallocated indexes older than 8 days to "Warm" nodes
  • Recommended to perform this operation during off-peak hours as it is I/O intensive

actions:
  1:
    action: allocation
    description: "Move index to Warm-nodes after 8 days"
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: true
      timeout_override:
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 8
    ...

Curator configuration
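An action file like this is typically run on a schedule with the Curator CLI; a minimal sketch (the file paths are illustrative):

curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml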

References:

https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

27 of 38

Optimizations we did: Inter-AZ data transfer

Single Availability Zone:

  • Migrated all ELK nodes to a single availability zone (reduced inter-AZ data transfer cost for ELK nodes by 100%)
  • Data transfer/day: ~700GB (Logstash to Elasticsearch: ~300GB, Elasticsearch inter-node communication: ~400GB)

Trade-offs:

  • It is not recommended to run production clusters in a single AZ as it will result in downtime and potential data loss in case of AZ failures

28 of 38

Data backup and restore

Using S3 for index Snapshots:

  • Take snapshots of indexes and store them in S3

Backup:

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
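The snapshot call above assumes the "s3_repository" repository has already been registered (this needs the repository-s3 plugin on the nodes); a minimal sketch with an illustrative bucket name:

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "elk-index-snapshots"
  }
}'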

References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html

https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f

29 of 38

Data backup and restore

Restore:

On-demand Elasticsearch cluster:

  • Launch an on-demand ES cluster and import the snapshots from S3

Existing Cluster:

  • Restore the required snapshots to the existing cluster

curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
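Before restoring, the snapshots available in the repository can be listed with the _cat API; a minimal sketch:

curl -s "http://<domain>:9200/_cat/snapshots/s3_repository?v"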

30 of 38

Disaster recovery

Data corruption:

  • List indexes with status 'Red' (e.g. via the _cat API shown below)
  • Delete the corrupted indexes
  • Restore the indexes from S3 snapshots
  • Recovery time: depends on the size of the data
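Red indexes can be listed with the _cat indices endpoint; a minimal sketch:

curl -s "http://<domain>:9200/_cat/indices?health=red&v"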

Node failure due to AZ going down:

  • Launch a new ELK cluster using AWS cloud formation templates
  • Make the necessary config changes in Filebeat, Logstash, etc.
  • Restore the required indexes from S3 snapshots
  • Recovery time: depends on provisioning time and size of data

Node failures due to underlying hardware issue:

  • Recycle the node in the Spotinst console (takes an AMI of the root volume, launches a new instance, re-attaches the EBS volumes, and keeps the private IP)
  • Recovery time: < 10 mins/node

Snapshot restore time (estimates):

  • < 4 mins for a 20GB snapshot (test cluster: 3 nodes, multiple indexes with 3 primary shards each, no replicas)

31 of 38

Cost savings: EC2

EC2

| Instance type              | Service       | Daily cost ($) |
| 5 x r5.xlarge (20C, 160GB) | Elasticsearch | 40.80          |
| 3 x c5.large (6C, 12GB)    | Logstash      | 7.17           |
| 1 x t3.medium (2C, 4GB)    | Kibana        | 1.29           |
| Total                      |               | ~ 49.26        |

EC2 (optimized)

Daily cost reflects 65% savings + Spotinst charges (20% of savings)

| Instance type             | Service            | Daily cost ($) | Total Savings |
| 5 x m4.xlarge (20C, 80GB) | Elasticsearch Hot  | 14.64          |               |
| 2 x r4.xlarge (8C, 61GB)  | Elasticsearch Warm | 7.50           |               |
| 3 x c4.large (6C, 12GB)   | Logstash           | 3.50           |               |
| 1 x t2.medium (2C, 4GB)   | Kibana             | 0.69           |               |
| Total                     |                    | ~ 26.33        | ~ 47%         |

32 of 38

Cost savings: Storage

Storage (Ingesting: 300GB/day, Retention: 90 days, Replica count: 1)

| Storage type | Retention | Daily cost ($) |
| ~ 54TB (GP2) | 90 days   | ~ 237.60       |

Storage (optimized) (Ingesting: 300GB/day, Retention: 90 days, Replica count: 0, Backups: daily S3 snapshots)

| Storage type       | Retention | Daily cost ($) | Total Savings |
| ~ 3TB (GP2) Hot    | 8 days    | 12.00          |               |
| ~ 24TB (SC1) Warm  | 82 days   | 24.00          |               |
| ~ 27TB (S3) Backup | 90 days   | 22.50          |               |
| Total              |           | ~ 58.50        | ~ 75%         |

33 of 38

Total Savings

|                    | ELK stack | ELK stack (optimized) | Savings |
| EC2                | 49.40     | 26.33                 | 47%     |
| Storage            | 237.60    | 58.50                 | 75%     |
| Data-transfer      | 7         | 0                     | 100%    |
| Total (daily cost) | ~ $294.00 | ~ $84.83              | ~ 71% * |
| Cost/GB (daily)    | ~ $0.98   | ~ $0.28               |         |

* Total savings are exclusive of some of the application-level optimizations done

34 of 38

Our Costs vs other Platforms

|                | ELK Stack (optimized) | ELK Stack          | Splunk                                         | Sumo logic                                    |
| Product        | Self managed          | Self managed       | Cloud                                          | Professional                                  |
| Data Ingestion | ~ 300GB/day           | ~ 300GB/day        | ~ 100 GB/day * (post-ingestion custom pricing) | ~ 20 GB/day * (post-ingestion custom pricing) |
| Retention      | ~ 90 days             | ~ 90 days          | ~ 90 days *                                    | ~ 30 days *                                   |
| Cost/GB/day    | ~ $0.28 per GB/day    | ~ $0.98 per GB/day | ~ $3.33 per GB/day *                           | ~ $3.60 per GB/day *                          |

Savings over traditional ELK stack: 71% *

* Values are estimates taken from the 'product pricing web-page' of the respective products; they may not represent the actual values and are meant for comparison only.

* Total savings are exclusive of some of the application-level optimizations done.

35 of 38

Key Takeaways

ELK Stack Scalability:

  • Logstash: auto-scaling
  • Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling

Handling potential data-loss while AZ is down:

  • DR mechanisms in place; daily/hourly backups stored in S3; potential data loss of about 1 hour
  • We do not store user data or business metrics in ELK, so users/business will not be impacted

Handling potential data-corruptions in Elasticsearch:

  • DR mechanisms in place, recover index from S3 index-snapshots

Managing downtime during spot take-backs:

  • Logstash: multiple nodes, minimal impact
  • Elasticsearch/Kibana: < 10min downtime per node
  • Use previous-generation instance types, as spot take-back chances are comparatively lower

36 of 38

Key Takeaways

Handling back-pressure when a node is down:

  • Filebeat: will auto-retry to send old logs
  • Logstash: use the 'date' filter for the document timestamp (see the sketch below), auto-scaling
  • Elasticsearch: overprovisioning
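The 'date' filter keeps the event's own time rather than the ingestion time, so logs that arrive late (e.g. Filebeat retries after an outage) still land on the right timeline. A minimal sketch, assuming the "timestamp" field extracted by the nginx grok pattern shown earlier:

filter {
  date {
    # parse the nginx access-log time, e.g. 26/Mar/2016:19:09:19 -0400
    match  => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}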

Other log analytics alternatives:

  • We have only evaluated ELK, Splunk and Sumo Logic

ELK stack upgrade path:

  • Blue-green deployment for major version upgrades

37 of 38

Reflection

  • We built a platform tailored to our requirements, yours might be different...

  • Building a log analytics platform is not rocket science, but it can be painfully iterative if you are not aware of the options

  • Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for your specific requirements

38 of 38

Thank you!

Denis D’Souza

DevOps Engineer

Happy to take your questions.

Copyright Disclaimer: All rights to the materials used for this presentation belong to their respective owners.

www.linkedin.com/in/denis-dsouza