Caliper Design Document

What Caliper is and why it does what it does.

Table of Contents

The Goal: good data, little hassle

Command-line tool

Runner

Runner/worker communication

Example runner executions

A successful trial

A failed trial

Instruments & Workers

Microbenchmark Instrument

Performing measurements

Responding to messages

Options

Allocation Instrument

Performing measurements

Responding to messages

Options

Macrobenchmark Instrument

The Goal: good data, little hassle

Caliper is an entire toolchain for making performance-related decisions, whose components work in concert to help users get complete and accurate information about performance while minimizing the opportunities for misinformation.  To that end, each component plays a particular role.

Above all, Caliper aims to be as simple as possible while providing data that is more unambiguous, consistent, and comprehensive than that of equivalent benchmarks written without it.

Command-line tool

Runner

The runner is the parent process that spawns and manages individual worker processes that perform benchmarks and report results.  The runner itself does no measurement.  Its job is to take input from the user, derive experiments to run, and execute each of them in a separate JVM (a worker process).

While worker processes are executing, the runner monitors their output and verifies that the worker and its JVM are behaving in a way that is consistent with the measurement methodology of the given instrument.  The output that is parsed and interpreted can come from a variety of sources, including console output from the worker (System.out, System.err), a named pipe created specifically for communication between the runner and the worker, GC logs (-XX:+PrintGC), and compilation logs (-XX:+PrintCompilation).  Since many of the facilities used to monitor the worker aren’t available to the worker itself, it is necessarily the responsibility of the runner to detect and respond to any sequence of events that may produce invalid measurements.  The only mechanism by which the runner can communicate with the worker process is via signal, which the Process API limits to SIGTERM.
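To make the arrangement concrete, here is a minimal sketch of spawning and monitoring a worker, assuming a hypothetical worker entry point (com.example.FakeWorkerMain) and greatly simplified log parsing; the real runner’s handling is far more careful:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Illustrative sketch only; the entry point and log parsing below are
// simplified stand-ins for what the real runner does.
class RunnerSketch {
  public static void main(String[] args) throws IOException {
    Process worker = new ProcessBuilder(
        "java",
        "-XX:+PrintGC",               // GC events appear in the worker's output
        "-XX:+PrintCompilation",      // as do HotSpot compilation events
        "com.example.FakeWorkerMain") // hypothetical worker entry point
        .redirectErrorStream(true)
        .start();

    try (BufferedReader out = new BufferedReader(
        new InputStreamReader(worker.getInputStream()))) {
      String line;
      while ((line = out.readLine()) != null) {
        if (line.contains("[GC")) {
          // A GC event: invalidate the in-progress measurement.
        } else if (line.contains("::")) {
          // Crude stand-in for spotting a PrintCompilation line:
          // invalidate all prior measurements.
        }
        // ... also read messages from the worker's pipe, decide when to stop ...
      }
    }
    worker.destroy(); // SIGTERM: the only signal the Process API exposes
  }
}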

In the event that the runner does detect an issue, it can respond in one of the following ways, each of which is illustrated in the example executions below:

  1. Mark the in-progress measurement as invalid so that it is discarded when it completes.
  2. Discard all prior measurements in addition to the in-progress one.
  3. Kill the worker and report an error for the trial.

Runner/worker communication

The following is the list of the types of messages that are collected by the runner and made available to instruments:

Example runner executions

There are quite a few factors that impact how the runner responds to the worker, so two sample executions are provided to help clarify.  Note that the parameters specified in the examples are unrealistic and tuned to keep the examples brief.

A successful trial

This trial uses the microbenchmark instrument to measure runtime.  GC events invalidate an individual measurement and HotSpot compilation invalidates all prior measurements.  2 measurements per trial have been specified.  The maximum run time is disabled.  A sketch of the bookkeeping these rules imply follows the walkthrough.

  1. Measurement starts.
  2. The runner detects a compilation event in the worker. The measurement is marked as invalid.
  3. The measurement completes.  It is discarded.  There are no valid measurements.
  4. Measurement starts.
  5. The measurement completes.  It is retained.  There is 1 valid measurement.
  6. Measurement starts.
  7. The runner detects a GC event in the worker. The measurement is marked invalid.
  8. The measurement completes.  It is discarded.  There is 1 valid measurement.
  9. Measurement starts.
  10. The runner detects a compilation event in the worker.  The measurement is marked as invalid, and all prior measurements are discarded as well.
  11. The measurement completes.  It is discarded.  There are no valid measurements.
  12. Measurement starts.
  13. The measurement completes.  It is retained.  There is 1 valid measurement.
  14. Measurement starts.
  15. The measurement completes.  It is retained.  There are 2 valid measurements.
  16. The trial is completed.
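The bookkeeping this walkthrough implies can be sketched as follows; the class and method names here are illustrative, not the instrument’s actual API:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the invalidation rules from the walkthrough above.
class TrialBookkeepingSketch {
  private final List<Double> validMeasurements = new ArrayList<>();
  private boolean currentInvalid;

  void onMeasurementStarted() {
    currentInvalid = false;
  }

  void onGcEvent() {
    // A GC event invalidates only the in-progress measurement.
    currentInvalid = true;
  }

  void onCompilationEvent() {
    // Compilation invalidates the in-progress measurement and all priors.
    currentInvalid = true;
    validMeasurements.clear();
  }

  void onMeasurementCompleted(double value) {
    if (!currentInvalid) {
      validMeasurements.add(value); // retained only if still valid
    }
  }

  boolean trialComplete(int measurementsPerTrial) {
    return validMeasurements.size() >= measurementsPerTrial;
  }
}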

A failed trial

This trial also uses the microbenchmark instrument to measure runtime.  2 measurements per trial have been specified.  The maximum run time is 2 seconds.

  1. Measurement starts.
  2. The runner detects a GC event in the worker.
  3. The measurement completes in 750ms.  It is discarded.  There are no valid measurements.
  4. Measurement starts.
  5. The runner detects a GC event in the worker.
  6. The measurement completes in 750ms.  It is discarded.  There are no valid measurements.
  7. Measurement starts.
  8. 2 seconds have elapsed before 2 valid measurements have been collected.  The worker is killed and the runner reports an error.

Instruments & Workers

Instruments and Workers are the components that coordinate to measure a value.  An Instrument subclass is instantiated in the runner process to discover and validate benchmark methods, process options and gather measurements.  A Worker implementation is instantiated in the worker process to perform the actual measurement.

Since the JVM is a very complex and dynamic environment, it would be difficult to make any assertion about the behavior of any particular piece of code if it is running after or alongside other code.  The nature of the profiling compiler virtually guarantees that execution order will impact the compiled code in some way.  So, Caliper attempts to do the absolute bare minimum in a worker process to set up and execute the experiment, making the vast majority of the code executed that of the benchmark itself.  Each experiment always gets its own JVM.

The obvious counterpoint to this choice is that it is extremely dissimilar to any real application.  In “the real world”, code from a whole variety of libraries will be executing, so Caliper will always report results based on a hyper-optimized version of the user’s benchmark.  While this is true, Caliper is steadfast in its assertion that it will never be able to report data that can be interpreted as an absolute measure of performance.  Rather, it attempts to produce data that is mutually comparable, and pristine JVMs provide such a basis.

Worker processes are designed to be as lightweight as possible.  They should be single-threaded and focus as much of their execution as feasible on performing measurement.  They are also expected to run and measure indefinitely until explicitly killed by the runner process.  The accumulation of measurements and any evaluation of their validity is the responsibility of the instrument.

Caliper has some room for improvement in the code that is executed in the worker VM (especially for the sensitive microbenchmark instrument).  JSON parsing and serialization, reflection, etc. are all somewhat heavyweight.  Optimization, or maybe even a code-generation solution, is probably worth investigating.

Default instruments: the microbenchmark and allocation instruments (described below).

Experimental instruments: the macrobenchmark instrument (described below).

Proposed instruments:

Microbenchmark Instrument

The microbenchmark instrument is specifically tuned to accurately measure the runtime of operations that execute too quickly to be measured individually by the available clocks and timers.  This is accomplished by invoking a method parameterized by a number of reps (int or long) and measuring the total runtime of those reps - the benchmark method itself is responsible for implementing the repetition.

An example microbenchmark method looks like the following:

public int timeSomething(int reps) {
  // A dummy value to return to prevent dead code elimination
  int dummy = 0;
  for (int i = 0; i < reps; i++) {
    dummy |= invokeSomething();
  }
  return dummy;
}

A single measurement for an invocation of such a method is the difference between the values returned by System.nanoTime() immediately before and after the invocation.  The weight of the measurement is the number of reps.
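Expressed as code, a single measurement amounts to something like the following sketch; the Benchmark interface and measurePerRepNanos helper are illustrative, not the worker’s actual plumbing:

// Sketch: one timed invocation of a reps-parameterized benchmark method.
interface Benchmark {
  int timeSomething(int reps); // e.g. the example method above
}

class TimingSketch {
  static double measurePerRepNanos(Benchmark benchmark, int reps) {
    long startNanos = System.nanoTime();
    // The returned dummy keeps the benchmark's loop from being dead code.
    int dummy = benchmark.timeSomething(reps);
    long elapsedNanos = System.nanoTime() - startNanos;
    // The measurement's value is elapsedNanos and its weight is reps,
    // so the per-rep runtime is simply their ratio.
    return elapsedNanos / (double) reps;
  }
}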

Performing measurements

While benchmark methods are written in terms of a certain number of reps, measurement is performed by targeting a timing interval and invoking the benchmark method with a number of reps that has been estimated to run in approximately that amount of time.

The warm-up phase, not to be confused with invalid measurements as determined by the runner, is a brief execution of a small number of reps used to seed the algorithm for estimating reps for a timing interval.  This estimate is admittedly inaccurate, as HotSpot compilation will likely not have completed for the benchmarked code, but the estimate is continuously refined during the measurement phase.

Once warm-up has completed, the measurement phase begins.  The number of reps chosen for a measurement is calculated in a two-step process:

  1. Estimate the target number of reps that will exactly fill a timing interval.
  2. Add some variance to the length of the measurement by choosing a normally distributed random number of reps centered on that estimate.

Each measurement is immediately sent back to the runner upon completion where it is evaluated for validity.
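A sketch of this two-step choice, with illustrative names and an assumed 10% standard deviation (the worker’s actual estimator and constants may differ):

import java.util.Random;

// Illustrative sketch of choosing reps for the next measurement.
class RepEstimatorSketch {
  private final Random random = new Random();

  long nextReps(long lastReps, long lastElapsedNanos, long targetIntervalNanos) {
    // Step 1: scale the previous observation to estimate how many reps
    // would exactly fill the target timing interval.
    double estimate = lastReps * (double) targetIntervalNanos / lastElapsedNanos;
    // Step 2: add variance by drawing a normally distributed value
    // centered on that estimate (10% standard deviation is an assumption).
    double jittered = estimate + random.nextGaussian() * 0.1 * estimate;
    return Math.max(1L, Math.round(jittered));
  }
}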

Responding to messages

Options

Allocation Instrument

The allocation instrument is the complementary instrument to the microbenchmark instrument.  It measures the number of instances and the number of bytes allocated for a rep of a microbenchmark method.  The actual measurement is delegated to the Java Allocation Instrumenter.

The allocation instrument uses the same benchmark methods as the microbenchmark instrument.

The allocation instrument produces pairs of measurements.  One is the total number of bytes allocated and the other is the total number of Object instances.  The weight of the measurement is the number of reps.

Performing measurements

Since it is not guaranteed that each rep has a uniform allocation cost, measurement is slightly more complicated than just executing a rep and measuring the cost.

First, the allocation worker executes a single rep and discards the result.  This is to warm up the benchmark and ensure that any lazy initialization has occurred.

Next, the allocation worker measures between 1 and 5 reps.  This is used as a baseline to account for any allocation that occurs within the benchmark, but outside the timing loop. The small bit of variation is only to ensure that there isn’t some odd behavior outside the timing loop that would cause different behavior for different numbers of reps.

Finally, a number of reps between 1 and 100 above the baseline count is measured.  The difference between that measurement and the baseline is reported to the runner as the result.
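The procedure can be sketched as follows; countAllocations is a hypothetical stand-in for running a given number of reps under the Java Allocation Instrumenter and reporting the allocation observed:

import java.util.Random;
import java.util.function.IntToLongFunction;

// Illustrative sketch of the baseline-and-difference procedure above.
class AllocationWorkerSketch {
  private final Random random = new Random();

  // Returns {allocationDelta, weight}.
  long[] measure(IntToLongFunction countAllocations) {
    countAllocations.applyAsLong(1); // warm-up rep; result discarded

    int baselineReps = 1 + random.nextInt(5); // between 1 and 5 reps
    long baseline = countAllocations.applyAsLong(baselineReps);

    int extraReps = 1 + random.nextInt(100); // 1 to 100 reps above the baseline
    long measured = countAllocations.applyAsLong(baselineReps + extraReps);

    // Report the difference; the weight of this measurement is extraReps.
    return new long[] {measured - baseline, extraReps};
  }
}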

Responding to messages

The allocation instrument takes no particular actions for any of the messages other than to record the measurements.

Options

Macrobenchmark Instrument

This instrument is the experimental corollary to the microbenchmark instrument, designed to measure the runtime of operations that are sufficiently long as to not require the reps parameter; i.e., the granularity of the timer is insignificant compared to the runtime of the operation being benchmarked.

Consequently, the methodology for the macrobenchmark instrument is greatly simplified from that of the microbenchmark instrument.  Each invocation of the benchmark method is timed separately and the weight of the measurement is always 1.

The simplification also allows for an extra mechanism that wasn’t available to microbenchmarks: setup and teardown on a per-rep basis.  The @BeforeRep and @AfterRep annotations can be applied to methods that should be run before and after each measurement, but outside of timing.
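For illustration, a macrobenchmark with per-rep setup and teardown might look like the following sketch; the class, its method-naming convention, and the annotation import paths shown here are assumptions:

import com.google.caliper.api.AfterRep;  // import paths assumed
import com.google.caliper.api.BeforeRep;

public class CopyBenchmark {
  private byte[] scratch;
  private byte[] lastCopy;

  @BeforeRep public void setUp() {
    scratch = new byte[16 * 1024 * 1024]; // prepared outside of timing
  }

  public void timeCopy() {
    // The whole invocation is timed; the weight is always 1, so no reps
    // parameter is needed.
    lastCopy = scratch.clone();
  }

  @AfterRep public void tearDown() {
    scratch = null; // cleanup also happens outside of timing
    lastCopy = null;
  }
}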