Caliper Design Document

What Caliper is and why it does what it does.

Table of Contents

The Goal: good data, little hassle

Command-line tool

Runner

Runner/worker communication

Example runner executions

A successful trial

A failed trial

Instruments & Workers

Microbenchmark Instrument

Performing measurements

Responding to messages

Options

Allocation Instrument

Performing measurements

Responding to messages

Options

Macrobenchmark Instrument

The Goal: good data, little hassle

Caliper is an entire toolchain for making performance-related decisions, whose components work in concert to help users get complete and accurate information about performance while minimizing the opportunities for misinformation.  To that end, each component plays a particular role.

Above all, Caliper aims to be as simple as possible while providing data that is more unambiguous, consistent, and comprehensive than that of equivalent benchmarks written without it.

Command-line tool

Runner

The runner is the parent process that spawns and manages individual worker processes that perform benchmarks and report results.  The runner itself does no measurement.  Its job is to take input from the user, derive experiments to run, and execute each of them in a separate JVM (a worker process).

While worker processes are executing, the runner monitors their output and verifies that the worker and its JVM are behaving in a way that is consistent with the measurement methodology of the given instrument.  The output that is parsed and interpreted can come from a variety of sources, including console output from the worker (System.out, System.err), a named pipe created specifically for communication between the runner and the worker, GC logs (-XX:+PrintGC), and compilation logs (-XX:+PrintCompilation).  Since many of the facilities used to monitor the worker aren’t available to the worker itself, it is necessarily the responsibility of the runner to detect and respond to any sequence of events that may produce invalid measurements.  The only mechanism by which the runner can communicate with the worker process is via signal, which the Process API limits to SIGTERM.
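To make the arrangement concrete, here is a minimal sketch of spawning and monitoring a worker, assuming a hypothetical worker entry point (com.example.FakeWorkerMain) and greatly simplified log parsing; the real runner’s handling is far more careful:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Illustrative sketch only; the entry point and log parsing below are
// simplified stand-ins for what the real runner does.
class RunnerSketch {
  public static void main(String[] args) throws IOException {
    Process worker = new ProcessBuilder(
        "java",
        "-XX:+PrintGC",               // GC events appear in the worker's output
        "-XX:+PrintCompilation",      // as do HotSpot compilation events
        "com.example.FakeWorkerMain") // hypothetical worker entry point
        .redirectErrorStream(true)
        .start();

    try (BufferedReader out = new BufferedReader(
        new InputStreamReader(worker.getInputStream()))) {
      String line;
      while ((line = out.readLine()) != null) {
        if (line.contains("[GC")) {
          // A GC event: invalidate the in-progress measurement.
        } else if (line.contains("::")) {
          // Crude stand-in for spotting a PrintCompilation line:
          // invalidate all prior measurements.
        }
        // ... also read messages from the worker's pipe, decide when to stop ...
      }
    }
    worker.destroy(); // SIGTERM: the only signal the Process API exposes
  }
}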

In the event that the runner does detect an issue, it can respond in one of the following ways, each of which is illustrated in the example executions below:

  1. Mark the in-progress measurement as invalid so that it is discarded when it completes.
  2. Discard all prior measurements in addition to the in-progress one.
  3. Kill the worker and report an error for the trial.

Runner/worker communication

The following is the list of the types of messages that are collected by the runner and made available to instruments:

Example runner executions

There are quite a few factors that impact how the runner responds to the worker, so two sample executions are provided to help clarify.  Note that the parameters specified in the examples are unrealistic and tuned to keep the examples brief.

A successful trial

This trial uses the microbenchmark instrument to measure runtime.  GC events invalidate an individual measurement and HotSpot compilation invalidates all prior measurements.  2 measurements per trial have been specified.  The maximum run time is disabled.  A sketch of the bookkeeping these rules imply follows the walkthrough.

  1. Measurement starts.
  2. The runner detects a compilation event in the worker. The measurement is marked as invalid.
  3. The measurement completes.  It is discarded.  There are no valid measurements.
  4. Measurement starts.
  5. The measurement completes.  It is retained.  There is 1 valid measurement.
  6. Measurement starts.
  7. The runner detects a GC event in the worker. The measurement is marked invalid.
  8. The measurement completes.  It is discarded.  There is 1 valid measurement.
  9. Measurement starts.
  10. The runner detects a compilation event in the worker.  The measurement is marked as invalid, and all prior measurements are discarded as well.
  11. The measurement completes.  It is discarded.  There are no valid measurements.
  12. Measurement starts.
  13. The measurement completes.  It is retained.  There is 1 valid measurement.
  14. Measurement starts.
  15. The measurement completes.  It is retained.  There are 2 valid measurements.
  16. The trial is completed.
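The bookkeeping this walkthrough implies can be sketched as follows; the class and method names here are illustrative, not the instrument’s actual API:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the invalidation rules from the walkthrough above.
class TrialBookkeepingSketch {
  private final List<Double> validMeasurements = new ArrayList<>();
  private boolean currentInvalid;

  void onMeasurementStarted() {
    currentInvalid = false;
  }

  void onGcEvent() {
    // A GC event invalidates only the in-progress measurement.
    currentInvalid = true;
  }

  void onCompilationEvent() {
    // Compilation invalidates the in-progress measurement and all priors.
    currentInvalid = true;
    validMeasurements.clear();
  }

  void onMeasurementCompleted(double value) {
    if (!currentInvalid) {
      validMeasurements.add(value); // retained only if still valid
    }
  }

  boolean trialComplete(int measurementsPerTrial) {
    return validMeasurements.size() >= measurementsPerTrial;
  }
}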

A failed trial

This trial also uses the microbenchmark instrument to measure runtime.  2 measurements per trial have been specified.  The maximum run time is 2 seconds.

  1. Measurement starts.
  2. The runner detects a GC event in the worker.
  3. The measurement completes in 750ms.  It is discarded.  There are no valid measurements.
  4. Measurement starts.
  5. The runner detects a GC event in the worker.
  6. The measurement completes in 750ms.  It is discarded.  There are no valid measurements.
  7. Measurement starts.
  8. 2 seconds have elapsed before 2 valid measurements have been collected.  The worker is killed and the runner reports an error.

Instruments & Workers

Instruments and Workers are the components that coordinate to measure a value.  An Instrument subclass is instantiated in the runner process to discover and validate benchmark methods, process options and gather measurements.  A Worker implementation is instantiated in the worker process to perform the actual measurement.

Since the JVM is a very complex and dynamic environment, it would be difficult to make any assertion about the behavior of any particular piece of code if it is running after or alongside other code.  The nature of the profiling compiler virtually guarantees that execution order will impact the compiled code in some way.  So, Caliper attempts to do the absolute bare minimum in a worker process to set up and execute the experiment, making the vast majority of the code executed that of the benchmark itself.  Each experiment always gets its own JVM.

The obvious counterpoint to this choice is that it is extremely dissimilar to any real application.  In “the real world”, code from a whole variety of libraries will be executing, so Caliper will always report results based on a hyper-optimized version of the user’s benchmark.  While this is true, Caliper is steadfast in its assertion that it will never be able to report data that can be interpreted as an absolute measure of performance.  Rather, it attempts to produce data that is mutually comparable, and pristine JVMs provide such a basis.

Worker processes are designed to be as lightweight as possible.  They should be single-threaded and focus as much of their execution as feasible on performing measurement.  They are also expected to run and measure indefinitely until explicitly killed by the runner process.  The accumulation of measurements and any evaluation of their validity is the responsibility of the instrument.

Caliper has some room for improvement in the code that is executed in the worker VM (especially for the sensitive microbenchmark instrument).  JSON parsing and serialization, reflection, etc. are all somewhat heavyweight.  Optimization, or maybe even a code-generation solution, is probably worth investigating.

Default instruments: the microbenchmark and allocation instruments (described below).

Experimental instruments: the macrobenchmark instrument (described below).

Proposed instruments:

Microbenchmark Instrument

The microbenchmark instrument is specifically tuned to accurately measure the runtime of operations that execute too quickly to be measured individually by the available clocks and timers.  This is accomplished by invoking a method parameterized by a number of reps (int or long) and measuring the total runtime of those reps - the benchmark method itself is responsible for implementing the repetition.

An example microbenchmark method looks like the following:

public int timeSomething(int reps) {
  // A dummy value to return to prevent dead code elimination
  int dummy = 0;
  for (int i = 0; i < reps; i++) {
    dummy |= invokeSomething();
  }
  return dummy;
}

A single measurement for an invocation of such a method is the difference between the values returned by System.nanoTime() immediately before and after the invocation.  The weight of the measurement is the number of reps.
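Expressed as code, a single measurement amounts to something like the following sketch; the Benchmark interface and measurePerRepNanos helper are illustrative, not the worker’s actual plumbing:

// Sketch: one timed invocation of a reps-parameterized benchmark method.
interface Benchmark {
  int timeSomething(int reps); // e.g. the example method above
}

class TimingSketch {
  static double measurePerRepNanos(Benchmark benchmark, int reps) {
    long startNanos = System.nanoTime();
    // The returned dummy keeps the benchmark's loop from being dead code.
    int dummy = benchmark.timeSomething(reps);
    long elapsedNanos = System.nanoTime() - startNanos;
    // The measurement's value is elapsedNanos and its weight is reps,
    // so the per-rep runtime is simply their ratio.
    return elapsedNanos / (double) reps;
  }
}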

Performing measurements

While benchmark methods are written in terms of a certain number of reps, measurement is performed by targeting a timing interval and invoking the benchmark method with a number of reps that has been estimated to run in approximately that amount of time.

The warm-up phase, not to be confused with invalid measurements as determined by the runner, is a brief execution of a small number of reps used to seed the algorithm for estimating reps for a timing interval.  This estimate is admittedly inaccurate, as HotSpot compilation will likely not have completed for the benchmarked code, but the estimate is continuously refined during the measurement phase.

Once warm-up has completed, the measurement phase begins.  The number of reps chosen for a measurement is calculated in a two-step process:

  1. Estimate the target number of reps that will exactly fill a timing interval.
  2. Add some variance to the length of the measurement by choosing a normally distributed random number of reps centered on that estimate.

Each measurement is immediately sent back to the runner upon completion where it is evaluated for validity.
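A sketch of this two-step choice, with illustrative names and an assumed 10% standard deviation (the worker’s actual estimator and constants may differ):

import java.util.Random;

// Illustrative sketch of choosing reps for the next measurement.
class RepEstimatorSketch {
  private final Random random = new Random();

  long nextReps(long lastReps, long lastElapsedNanos, long targetIntervalNanos) {
    // Step 1: scale the previous observation to estimate how many reps
    // would exactly fill the target timing interval.
    double estimate = lastReps * (double) targetIntervalNanos / lastElapsedNanos;
    // Step 2: add variance by drawing a normally distributed value
    // centered on that estimate (10% standard deviation is an assumption).
    double jittered = estimate + random.nextGaussian() * 0.1 * estimate;
    return Math.max(1L, Math.round(jittered));
  }
}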

Responding to messages

Options

Allocation Instrument

The allocation instrument is the complementary instrument to the microbenchmark instrument.  It measures the number of instances and the number of bytes allocated for a rep of a microbenchmark method.  The actual measurement is delegated to the Java Allocation Instrumenter.

The allocation instrument uses the same benchmark methods as the microbenchmark instrument.

The allocation instrument produces pairs of measurements.  One is the total number of bytes allocated and the other is the total number of Object instances.  The weight of the measurement is the number of reps.

Performing measurements

Since it is not guaranteed that each rep has a uniform allocation cost, measurement is slightly more complicated than just executing a rep and measuring the cost.

First, the allocation worker executes a single rep and discards the result.  This is to warm up the benchmark and ensure that any lazy initialization has occurred.

Next, the allocation worker measures between 1 and 5 reps.  This is used as a baseline to account for any allocation that occurs within the benchmark, but outside the timing loop. The small bit of variation is only to ensure that there isn’t some odd behavior outside the timing loop that would cause different behavior for different numbers of reps.

Finally, a number of reps between 1 and 100 above the baseline count is measured.  The difference between that measurement and the baseline is reported to the runner as the result.
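The procedure can be sketched as follows; countAllocations is a hypothetical stand-in for running a given number of reps under the Java Allocation Instrumenter and reporting the allocation observed:

import java.util.Random;
import java.util.function.IntToLongFunction;

// Illustrative sketch of the baseline-and-difference procedure above.
class AllocationWorkerSketch {
  private final Random random = new Random();

  // Returns {allocationDelta, weight}.
  long[] measure(IntToLongFunction countAllocations) {
    countAllocations.applyAsLong(1); // warm-up rep; result discarded

    int baselineReps = 1 + random.nextInt(5); // between 1 and 5 reps
    long baseline = countAllocations.applyAsLong(baselineReps);

    int extraReps = 1 + random.nextInt(100); // 1 to 100 reps above the baseline
    long measured = countAllocations.applyAsLong(baselineReps + extraReps);

    // Report the difference; the weight of this measurement is extraReps.
    return new long[] {measured - baseline, extraReps};
  }
}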

Responding to messages

The allocation instrument takes no particular actions for any of the messages other than to record the measurements.

Options

Macrobenchmark Instrument

This instrument is the experimental corollary to the microbenchmark instrument, designed to measure the runtime of operations that are sufficiently long as to not require the reps parameter; i.e., the granularity of the timer is insignificant compared to the runtime of the operation being benchmarked.

Consequently, the methodology for the macrobenchmark instrument is greatly simplified from that of the microbenchmark instrument.  Each invocation of the benchmark method is timed separately and the weight of the measurement is always 1.

The simplification also allows for an extra mechanism that wasn’t available to microbenchmarks: setup and teardown on a per-rep basis.  The @BeforeRep and @AfterRep annotations can be applied to methods that should be run before and after each measurement, but outside of timing.
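For illustration, a macrobenchmark with per-rep setup and teardown might look like the following sketch; the class, its method-naming convention, and the annotation import paths shown here are assumptions:

import com.google.caliper.api.AfterRep;  // import paths assumed
import com.google.caliper.api.BeforeRep;

public class CopyBenchmark {
  private byte[] scratch;
  private byte[] lastCopy;

  @BeforeRep public void setUp() {
    scratch = new byte[16 * 1024 * 1024]; // prepared outside of timing
  }

  public void timeCopy() {
    // The whole invocation is timed; the weight is always 1, so no reps
    // parameter is needed.
    lastCopy = scratch.clone();
  }

  @AfterRep public void tearDown() {
    scratch = null; // cleanup also happens outside of timing
    lastCopy = null;
  }
}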