Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza
DevOps Engineer
About me...
Worked on a variety of technologies in both service-based and product-based organisations
Learning new technologies and playing PC games
www.linkedin.com/in/denis-dsouza
Who we are
Current Scale
Peak Concurrent Playing Users: 500,000 (half a million)
Our problem statement
“ We were looking for a log analytics platform that fit our requirements.
We evaluated, compared, and decided to build a self-managed ELK stack ”
Our business requirements
Choosing the right option
|  | ELK stack | Splunk | Sumo Logic |
| Product | Self-managed | Cloud | Professional |
| Pricing | ~ $30 per GB / month | ~ $100 per GB / month * | ~ $108 per GB / month * |
| Data Ingestion | ~ 300 GB / day | ~ 100 GB / day * (post-ingestion custom pricing) | ~ 20 GB / day * (post-ingestion custom pricing) |
| Retention | ~ 90 days | ~ 90 days * | ~ 30 days * |
| Cost/GB/day | ~ $0.98 per GB / day | ~ $3.33 per GB / day * | ~ $3.60 per GB / day * |
* Values are estimates taken from the product-pricing web pages of the respective products; they may not represent actual values and are intended for comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
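Note: the Cost/GB/day figures correspond to the per-GB monthly price spread over roughly 30 days (e.g. ~$100 / 30 ≈ $3.33 for Splunk, ~$108 / 30 ≈ $3.60 for Sumo Logic); for the self-managed ELK stack it is the daily infrastructure cost divided by daily ingestion (~$294 / 300 GB ≈ $0.98, detailed later in this deck).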
ELK Stack overview
ELK Stack overview: Terminologies
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)
Our ELK architecture: Size and scale
| # | Service | Number of Nodes | Total CPU Cores | Total RAM | Storage (EBS) |
| 1 | Elasticsearch | 7 | 28 | 141 GB | |
| 2 | Logstash | 3 | 6 | 12 GB | |
| 3 | Kibana | 1 | 1 | 4 GB | |
|  | Total | 11 | 35 | 157 GB | ~ 20 TB |
| Data ingestion per day | ~ 300 GB |
| Retention policy | 90 days |
| Docs/sec (at peak load) | ~ 7K |
Optimizations we did
Application Side
Infrastructure Side
Optimizations we did: Application side
Optimizations we did: Logstash
Pipeline Workers:
### Core-count: 2 ###
...
pipeline.workers: 8
...
Configuration Optimizations
logstash.yml
Workers: 8
Workers: 2
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
Optimizations we did: Logstash
'info' logs:
if [sourcetype] == "app_logs" and [level] == "info" {
  elasticsearch {
    index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    ...
'200' response-code logs:
if [sourcetype] == "nginx" and [status] == "200" {
  elasticsearch {
    index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    ...
Filter Optimizations
filters.conf
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
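For context, a minimal sketch of how these conditionals could sit in a complete Logstash output section (the hosts value and the default branch are assumptions, not from the slides); routing noisy 'info' and '200' events to their own daily indices lets them be retained or dropped independently of the rest:

output {
  if [sourcetype] == "app_logs" and [level] == "info" {
    elasticsearch {
      hosts => ["<es-domain>:9200"]   # assumed endpoint
      index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    }
  } else if [sourcetype] == "nginx" and [status] == "200" {
    elasticsearch {
      hosts => ["<es-domain>:9200"]
      index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    }
  } else {
    # default daily index for everything else (assumed naming)
    elasticsearch {
      hosts => ["<es-domain>:9200"]
      index => "%{sourcetype}-%{+YYYY.MM.dd}"
    }
  }
}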
Optimizations we did: Logstash
Log ‘message’ field:
if "_grokparsefailure" not in [tags] {
mutate {
remove_field => ["message"]
}
}
Filter Optimizations
filters.conf
E.g.:
Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"
Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}
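Putting the two pieces together, a sketch of how the grok parse and the conditional removal of the raw message could sit in the same filter file (the [sourcetype] guard is an assembly assumption; the pattern is the one shown above):

filter {
  if [sourcetype] == "nginx" {
    grok {
      # custom nginx access-log pattern from the example above
      match => { "message" => '%{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}' }
    }
  }
  # drop the raw 'message' only when parsing succeeded, so failures stay debuggable
  if "_grokparsefailure" not in [tags] {
    mutate {
      remove_field => ["message"]
    }
  }
}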
Optimizations we did: Application side
Optimizations we did: Elasticsearch
JVM heap vs non-heap memory:
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
Configuration Optimizations
jvm.options
Heap too small
Heap too large
Optimised Heap
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
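A quick way to sanity-check heap behaviour after applying the setting, using the standard Elasticsearch node-stats API (the grep is only for readability):

curl -s "http://<domain>:9200/_nodes/stats/jvm?pretty" | grep -E 'heap_used_percent|heap_max_in_bytes'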
Optimizations we did: Elasticsearch
Shards:
### Number of ES nodes: 5 ###
{
  ...
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0",
    ...
  }
}
Template configuration
Trade-offs:
Replicas:
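A minimal sketch of applying such a template through the template API (ES 5.x syntax, matching the version referenced earlier; the template name "appserver" and the endpoint are placeholders):

curl -XPUT "http://<domain>:9200/_template/appserver" -H 'Content-Type: application/json' -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'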
Optimizations we did: Infrastructure side
AWS
The Spotinst platform allows users to reliably leverage excess (spot) capacity, simplify cloud operations, and save up to 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: EC2 and spot
- Stateful EC2 Spot instances:
Recovery time: < 10 mins
Trade-offs:
Optimizations we did: EC2 and spot
Auto-Scaling:
Optimizations we did: Infrastructure side
Optimizations we did: EBS
- "Hot-Warm" Architecture:
node.attr.box_type: hot
...
"template": "appserver-*",�"settings": {
"index": {
"routing": {
"allocation": {
"require": {
"box_type": "hot"}
}
}
},
...�
elasticsearch.yml
Template configuration
Trade-offs:
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
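Curator automates the move (next slide), but the same relocation can be triggered manually on any existing index by updating its allocation filter; a sketch with placeholder index name and endpoint:

curl -XPUT "http://<domain>:9200/<index-name>/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.require.box_type": "warm"
}'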
Optimizations we did: EBS
Moving indexes to "Warm" nodes:
actions:
  1:
    action: allocation
    description: "Move index to Warm-nodes after 8 days"
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: true
      timeout_override:
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 8
...
Curator configuration
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
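A companion Curator action, sketched here as an assumption rather than taken from the slides, that would enforce the 90-day retention by deleting indices once they age out of the warm tier:

actions:
  1:
    action: delete_indices
    description: "Delete indices older than 90 days"
    options:
      ignore_empty_list: true
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 90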
Optimizations we did: Inter-AZ data transfer
Single Availability Zone:
Trade-offs:
Data backup and restore
Using S3 for index Snapshots:
curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Backup:
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
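The snapshot call above assumes the S3 repository has already been registered, a one-time step that looks roughly like this (requires the repository-s3 plugin; the bucket name is a placeholder):

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "<bucket-name>"
  }
}'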
Data backup and restore
curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
On-demand Elasticsearch cluster:
Existing Cluster:
Restore:
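Restore progress can be followed while it runs, for example with the cat recovery API (endpoint placeholder as above):

curl -s "http://<domain>:9200/_cat/recovery?v&active_only=true"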
Disaster recovery
Data corruption:
Node failure due to AZ going down:
Node failures due to underlying hardware issue:
Snapshot restore time (estimates):
Cost savings: EC2
EC2
| Instance type | Service | Daily cost |
| 5 x r5.xlarge (20C, 160 GB) | Elasticsearch | $40.80 |
| 3 x c5.large (6C, 12 GB) | Logstash | $7.17 |
| 1 x t3.medium (2C, 4 GB) | Kibana | $1.29 |
|  | Total | ~ $49.26 |

EC2 (optimized)
| Instance type | Service | Daily cost (65% savings, plus Spotinst charges of 20% of savings) | Total Savings |
| 5 x m4.xlarge (20C, 80 GB) | Elasticsearch Hot | $14.64 | |
| 2 x r4.xlarge (8C, 61 GB) | Elasticsearch Warm | $7.50 | |
| 3 x c4.large (6C, 12 GB) | Logstash | $3.50 | |
| 1 x t2.medium (2C, 4 GB) | Kibana | $0.69 | |
|  | Total | ~ $26.33 | ~ 47% |
Cost savings: Storage
Ingesting: 300 GB/day | Retention: 90 days | Replica count: 1

Storage
| Storage type | Retention | Daily cost |
| ~ 54 TB (GP2) | 90 days | ~ $237.60 |

Storage (optimized)
| Storage type | Retention | Daily cost | Total Savings |
| ~ 3 TB (GP2) Hot | 8 days | $12.00 | |
| ~ 24 TB (SC1) Warm | 82 days | $24.00 | |
| ~ 27 TB (S3) Backup | 90 days | $22.50 | |
| Total | | ~ $58.50 | ~ 75% |

Ingesting: 300 GB/day | Retention: 90 days | Replica count: 0 | Backups: daily S3 snapshots
Total Savings
|  | ELK stack | ELK stack (optimized) | Savings |
| EC2 | $49.40 | $26.33 | 47% |
| Storage | $237.60 | $58.50 | 75% |
| Data transfer | $7.00 | $0.00 | 100% |
| Total (daily cost) | ~ $294.00 | ~ $84.83 | ~ 71% * |
| Cost/GB (daily) | ~ $0.98 | ~ $0.28 | |
* Total savings are exclusive of some of the application-level optimizations done.
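Note: Cost/GB (daily) is simply the total daily cost divided by the ~300 GB ingested per day: $294.00 / 300 ≈ $0.98 before optimization and $84.83 / 300 ≈ $0.28 after.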
Our Costs vs other Platforms
|  | ELK Stack (optimized) | ELK Stack | Splunk | Sumo Logic |
| Product | Self-managed | Self-managed | Cloud | Professional |
| Data Ingestion | ~ 300 GB / day | ~ 300 GB / day | ~ 100 GB / day * (post-ingestion custom pricing) | ~ 20 GB / day * (post-ingestion custom pricing) |
| Retention | ~ 90 days | ~ 90 days | ~ 90 days * | ~ 30 days * |
| Cost/GB/day | ~ $0.28 per GB / day | ~ $0.98 per GB / day | ~ $3.33 per GB / day * | ~ $3.60 per GB / day * |
Savings over the traditional ELK stack: 71% *
* Values are estimates taken from the product-pricing web pages of the respective products; they may not represent actual values and are intended for comparison only.
* Total savings are exclusive of some of the application-level optimizations done.
Key Takeaways
ELK Stack Scalability:
Handling potential data-loss while AZ is down:
Handling potential data-corruptions in Elasticsearch:
Managing downtime during spot take-backs:
Key Takeaways
Handling back-pressure when a node is down:
Other log analytics alternatives:
ELK stack upgrade path:
Reflection
Thank you!
Denis D’Souza
DevOps Engineer
Happy to take your questions.
Copyright Disclaimer: All rights to the materials used in this presentation belong to their respective owners.
www.linkedin.com/in/denis-dsouza