Loki: like Prometheus, but for logs.
Tom Wilkie & David Kaltschmidt, March 2018
This document aims to explain the motivations for, and design of, the Grafana Loki service. This document does not attempt to describe in depth every possible detail of the design, but hopefully explains the key points and should allow us to spot any obvious mistakes ahead of time.
This document aims to answer questions not only about how we’re going to build this, but also why we’re building it, what it will be used for, and who will be using it.
Metrics are key to incident response; alerts are typically written as conditions over time series. But metrics can only expose anticipated behaviour, due to their nature (they need to be pre-declared and limited in cardinality). Therefore metrics can only tell half the story; to get a complete picture of the cause of an incident, engineers typically resort to logs (the textual output of a program) for more detailed information.
It follows that incident response commonly starts with an alert, then some dashboard consultation (perhaps evolving the query in an ad hoc fashion) to pinpoint the exact service, host or instance that the errors are coming from. The engineer will then attempt to find the logs for that service/host/instance and time range to find the root cause. As the current status quo is for metrics and logs to be stored in two disparate systems, it is up to the engineer to translate the query from one language and interface to another.
Therefore the first hypothesis of this design is that minimising the cost of the context switching between logs and metrics will help reduce incident response times and improve user experience.
Log aggregation is not a new idea; like time series monitoring, there are many SaaS vendors and open source projects competing for attention. Almost all existing solutions involve using full-text search systems to index logs; at first glance this seems like the obvious solution, with a rich and powerful feature set allowing for complex queries.
These existing solutions are complicated to scale, resource intensive and difficult to operate. As mentioned above, an increasingly common pattern is the use of time series monitoring in combination with log aggregation, so the flexibility and sophistication offered by the query facilities often go unused, with the majority of queries focusing only on a time range and some simple parameters (host, service etc). Using these systems for log aggregation is akin to using a sledgehammer to crack a nut.
The challenges and operational overheads of existing systems have led many buyers into the hands of SaaS operators. Therefore the second hypothesis of this design is that a different trade off can be struck between ease of operation and sophistication of the query language, one favouring ease of operation.
With the migration towards SaaS log aggregation, the excessive costs of such systems have become more obvious. This cost arises not just from the technology used to implement full-text search - scaling and sharding an inverted index is hard; either writes must touch each shard, or reads must - but also from the operational complexity.
A common experience amongst buyers looking for a log aggregation system, after receiving an initial quote for their existing logging load at an order-of-magnitude more than they are willing to spend, is to turn to the engineers and ask them to log less. As logging exists to cover the unanticipated errors (see above), a typical response from engineers is one of disbelief - “what’s the point in logging if I have to be conscious of what I log?”.
Some systems have recently emerged offering a different trade off here. The open source OK Log project by Peter Bourgon (now archived) eschews all forms of indexing beyond time-based, and adopts an eventually consistent, mesh-based distribution strategy. These two design decisions offer massive cost reductions and radically simpler operations, but in our opinion don’t meet our other design requirements - queries are not expressive enough and too expensive. We do however recognise this as an attractive on-premises solution.
Therefore the third hypothesis is that a significantly more cost effective solution, with a slightly different indexing trade off, would be a really big deal...
An interesting aside is to consider how logging has changed with modern cloud native, microservices and containerised workloads. The standard pattern is now for applications to simply write logs to STDOUT or STDERR. Platforms such as Kubernetes and Docker build on this to offer limited log aggregation features; logs are stored locally on nodes and can be fetched and aggregated on demand, using label selectors.
But with these simple systems, logs are often lost when a pod or node disappears. This is often one of the first triggers for a buyer to realise they need log aggregation - a pod or node mysteriously dies and no logs are available to diagnose why.
Lastly it is worth covering how Prometheus fits into this picture. Prometheus is a monitoring system centered around a time series database. The TSDB indexes collections of samples (a time series) using a set of key-value pairs. Queries over these time series can be made by specifying a subset of these labels (matchers), returning all time series which match these labels. This differentiates it from the hierarchical naming of legacy systems such as Graphite by making queries robust to the presence of new or changing labels.
In Prometheus (and Cortex), these labels are stored in an inverted index, making queries against these labels fast. This inverted index in Cortex exists in memory for recent data, and in a distributed KV store (BigTable, DynamoDB or Cassandra) for historic data. The Cortex index scales linearly with both retention and throughput, but by design is limited in the cardinality of any given label.
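To make the label-matching semantics concrete, here is a toy sketch of such an inverted index (illustrative Python only, not the actual Cortex implementation): a matcher specifying a subset of labels returns every series carrying those labels, regardless of any extra labels the series has.

```python
from collections import defaultdict

class LabelIndex:
    """Toy inverted index: (label, value) pair -> set of series IDs."""
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, series_id, labels):
        for pair in labels.items():
            self.index[pair].add(series_id)

    def query(self, matchers):
        """Return IDs of all series whose labels are a superset of the matchers."""
        sets = [self.index[pair] for pair in matchers.items()]
        return set.intersection(*sets) if sets else set()

idx = LabelIndex()
idx.add(1, {"job": "api", "instance": "host-1"})
idx.add(2, {"job": "api", "instance": "host-2"})
idx.add(3, {"job": "db", "instance": "host-1"})

# A subset of labels matches, robust to other labels being present:
assert idx.query({"job": "api"}) == {1, 2}
assert idx.query({"job": "api", "instance": "host-2"}) == {2}
```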
The Prometheus system contains many components, but one noteworthy component for this discussion is mtail (https://github.com/google/mtail). Mtail allows you to “extract whitebox monitoring data from application logs for collection in a time series database”. This allows you to build time series monitoring and alerts for applications which do not expose any metrics natively.
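The essence of an mtail-style extraction can be pictured as a regular expression over log lines feeding a counter, which a time series backend then ingests. The line format and pattern below are hypothetical, and mtail itself expresses this declaratively in its own program language:

```python
import re
from collections import Counter

# Hypothetical access-log format: "METHOD PATH STATUS".
LINE_RE = re.compile(r'(?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})')

def extract_metrics(lines):
    """Count requests by status code, as a metric derived purely from logs."""
    requests_total = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            requests_total[m.group("status")] += 1
    return requests_total

logs = [
    'GET /api/v1/query 200',
    'GET /api/v1/query 500',
    'POST /api/v1/push 200',
]
metrics = extract_metrics(logs)  # → Counter({'200': 2, '500': 1})
```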
One common use case for log aggregation systems is to store structured, event-based data - for instance emitting an event for every request to a system, and including all the request details and metadata. With these kinds of deployments comes the ability to ask questions like “show me the top 10 users with highest 99th percentile latency”, something that you typically cannot do with a time series metrics system due to the high cardinality of users. Whilst this use case is totally valid, it is not something we are targeting with this system.
We will build a hosted log aggregation system (the system) which indexes metadata associated with those log streams, rather than indexing the contents of the log streams themselves. This metadata will take the form of Prometheus-style multi-dimensional labels. These labels will be consistent with the labels associated with time series/metrics ingested from a job, such that the same labels can be used to find logs from a job as can be used to find time series from said job, enabling quick context switching in the UI.
The system will not solve many of the complicated distributed systems and storage challenges typically associated with log aggregation, but rather will offload them to existing distributed databases and object storage systems. This will reduce operational complexity by having the majority of the system’s services be stateless and ephemeral, and by allowing operators of the system to use the hosted services offered by cloud vendors.
By only indexing metadata associated with log streams, the system will reduce the load on the index by many orders of magnitude - I expect there to be ~1KB of metadata for potentially 100MBs of log data. The actual log data will be stored in a hosted object storage service (S3, GCS etc), which are experiencing massive downward cost pressure due to competition between vendors. We will be able to pass on these savings and offer the system at a price a few orders of magnitude lower than competitors. For example, GCS costs $0.026/GB/month, whereas Loggly costs ~$100/GB/month.
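A back-of-the-envelope version of the numbers above, for a hypothetical 100 GB/month logging workload:

```python
log_gb_per_month = 100  # example workload, chosen for illustration

gcs_cost = log_gb_per_month * 0.026     # $/GB/month, GCS list price quoted above
loggly_cost = log_gb_per_month * 100.0  # ~$/GB/month, hosted full-text search price quoted above

print(f"object storage: ${gcs_cost:.2f}/month vs full-text SaaS: ${loggly_cost:.0f}/month")
# The raw storage ratio is ~3,800x - "a few orders of magnitude" - before
# accounting for the (much smaller) index and compute costs of either system.
```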
By virtue of this being a hosted system, logs will be trivially available after a client’s hosts or pods fail. An agent will be deployed to each node in a client’s system to ship logs to our service, and ensure metadata is consistent with metrics.
This section is a work in progress.
The first challenge is obtaining reliable metadata that is consistent with the metadata associated with the time series / metrics. To achieve this we will use the same service discovery and label relabelling libraries as Prometheus. This will be packaged up in a daemon that discovers targets, produces metadata labels and tails log files to produce streams of logs, which will be momentarily buffered on the client side and then sent to the service. Given the requirement to have recent logs available after a node failure, there is a fundamental limit to the amount of batching it can do.
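The agent’s shipping loop can be sketched as follows. All names here are hypothetical, discovery and relabelling are elided, and the real daemon reuses Prometheus’s libraries for them; the point is the small, bounded buffer that trades batching efficiency against losing recent logs on node failure:

```python
import time

class LogShipper:
    """Toy log shipper: tail a stream, buffer briefly, flush labelled batches."""
    def __init__(self, labels, send, max_batch=4096, max_wait=1.0):
        self.labels = labels        # metadata from service discovery + relabelling
        self.send = send            # callable shipping one batch to the service
        self.max_batch = max_batch  # bytes buffered before a size-based flush
        self.max_wait = max_wait    # seconds; kept small so logs survive node loss
        self.buf, self.size, self.last_flush = [], 0, time.monotonic()

    def observe(self, line):
        self.buf.append(line)
        self.size += len(line)
        if self.size >= self.max_batch or time.monotonic() - self.last_flush >= self.max_wait:
            self.flush()

    def flush(self):
        if self.buf:
            self.send({"labels": self.labels, "entries": self.buf})
        self.buf, self.size, self.last_flush = [], 0, time.monotonic()

batches = []
shipper = LogShipper({"job": "api", "instance": "host-1"}, batches.append, max_batch=10)
for line in ["error: oh no", "recovered"]:
    shipper.observe(line)
shipper.flush()  # drain the remainder; each batch carries its stream's labels
```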
This component exists and is called Promtail.
The server-side components on the write path will mirror the Cortex architecture:
The chunk format will be important to the cost and performance of the system. A chunk is all logs for a given label set over a certain period. The chunks must support appends, seeks and streaming reads. Assuming an “average” node produces 10 GB of logs per day and runs on average 30 containers, each log stream will write at ~4 KB/s. Expected compression rates should be around 10x for log data.
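The per-stream write rate quoted above follows from simple arithmetic:

```python
node_logs_per_day = 10 * 1024**3  # 10 GB of logs per "average" node per day
containers_per_node = 30          # i.e. ~30 log streams per node
seconds_per_day = 24 * 60 * 60

per_stream_rate = node_logs_per_day / containers_per_node / seconds_per_day
print(f"{per_stream_rate / 1024:.1f} KB/s per log stream")  # ≈ 4 KB/s
```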
When picking the optimal chunk size we need to consider:
For example, 12 hours of log data will produce chunks around ~100 MB uncompressed, and ~10 MB compressed. 12 hours is also the upper limit for chunk length we use in Cortex. Given the memory requirements for building these, a chunk size closer to 1 MB (compressed) looks more likely.
The proposal is for a chunk to consist of a series of blocks; the first block is a gorilla-style encoded time index, and the subsequent blocks contain compressed log data. Once enough blocks have been produced to form a big enough chunk, they will be appended together to form a chunk. Some experimentation will be necessary to find the right chunk format, input here is welcome.
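One possible shape for this layout, as a sketch only: the field names are hypothetical, a real gorilla-style delta-of-delta time index is reduced here to a plain start timestamp per block, and the seek would binary-search the index block rather than scan.

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Block:
    start_ts: int      # first timestamp covered (the index block would delta-encode these)
    compressed: bytes  # compressed log lines for this time range

@dataclass
class Chunk:
    labels: dict
    blocks: list = field(default_factory=list)

    def append_block(self, start_ts, lines):
        """Append a sealed block; blocks accumulate until the chunk is big enough."""
        payload = "\n".join(lines).encode()
        self.blocks.append(Block(start_ts, zlib.compress(payload)))

    def seek(self, ts):
        """Yield the lines of blocks starting at or after ts, decompressing lazily."""
        for b in self.blocks:
            if b.start_ts >= ts:
                yield zlib.decompress(b.compressed).decode().splitlines()

c = Chunk({"job": "api"})
c.append_block(100, ["line a", "line b"])
c.append_block(200, ["line c"])
assert list(c.seek(150)) == [["line c"]]
```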
As chunks are many orders of magnitude larger than Prometheus/Cortex chunks (Cortex chunks are a maximum of 1KB in size), it won’t be possible to load and decompress them in their entirety. Therefore we will need to support streaming and iterating over them, only decompressing the parts we need as we go. Again, lots of details to work out here, but I suspect aggressive use of gRPC streaming and heaps will be in order (see the new Cortex iterating query PR).
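Merging several such lazy per-chunk iterators into one time-ordered result stream is exactly what a heap buys us; in Python the idea reduces to heapq.merge over generators (the generators here stand in for streamed, incrementally decompressed chunk data):

```python
import heapq

def chunk_iter(entries):
    """Lazily yield (timestamp, line) pairs from one chunk's stream."""
    for e in entries:
        yield e

chunk_a = chunk_iter([(1, "a1"), (4, "a2")])
chunk_b = chunk_iter([(2, "b1"), (3, "b2")])

# heapq.merge pulls one entry at a time from each iterator, so only the
# current head of each stream needs to be decompressed and in memory.
merged = list(heapq.merge(chunk_a, chunk_b))
# → [(1, 'a1'), (2, 'b1'), (3, 'b2'), (4, 'a2')]
```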
TBD how to implement mtail-like functionality - extra PromQL functions? The ruler?
This section is a work in progress. Feel free to add questions that are not yet answered.
This section is a work in progress.
We experimented with various compression schemes and block sizes on some example log data: