Compressed pointers in V8

Attention: Shared Google-externally

Authors: ishell@chromium.org
Status: FINAL

Last updated: 2019-09-11

This document outlines the plan to evaluate and implement pointer compression in V8.

Tracking bug: https://bugs.chromium.org/p/v8/issues/detail?id=7703 

Link to this doc: go/v8-pointer-compression

Shipping plan and implementation details are being discussed here.

Motivation

We expect about 35% V8 heap reduction on 64-bit platforms on real-world web sites (see the benefits section for details). The updated V8 heap stats collector and visualizer can show the potential savings of pointer compression: go/v8-heap-stats. However, because of the additional work required to compress and decompress values, some performance cost will have to be paid, and the generated code size may also increase.

The exact performance penalty is yet to be determined. However, the compression scheme can be chosen such that the compression/decompression code is branchless, which in turn should reduce the penalty on modern CPUs.

On the other hand, reducing memory may also improve performance because of better utilization of CPU caches, which will be able to fit twice as many compressed tagged pointers as uncompressed ones. Note that heap constants embedded into the code will also be 32-bit instead of 64-bit.

There are also other potential benefits that may come out of pointer compression, see the respective section below.

Update: we are implementing Variant 4 as it proved to be faster than the other variants.

The idea

Encoding

On 32-bit platforms V8 uses the lowest bit to distinguish small integers (Smis) from heap objects, and the second-lowest bit of a heap object address to distinguish strong references from weak ones:

  Smi:                     [xx...31...xx]0
  Strong HeapObject:       [xx...30...xx]01
  Weak HeapObject:         [xx...30...xx]11
  Cleared weak reference:  [00...30...00]11

So we have 30 bits that we can use as a heap object address payload. The supported allocation granularity is 4 bytes, which gives us 4Gbytes of addressable space.
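
For illustration, here is a minimal C++ sketch of the tag checks implied by this scheme. The constant and function names are modeled on V8's but simplified for this sketch; they are not the actual declarations.

  #include <cstdint>

  // Illustrative tag constants for the 32-bit scheme above.
  constexpr uint32_t kSmiTagMask        = 0b01;  // lowest bit 0 => Smi
  constexpr uint32_t kHeapObjectTag     = 0b01;  // ...01 => strong reference
  constexpr uint32_t kWeakHeapObjectTag = 0b11;  // ...11 => weak reference
  constexpr uint32_t kHeapObjectTagMask = 0b11;

  inline bool IsSmi(uint32_t tagged) { return (tagged & kSmiTagMask) == 0; }
  inline bool IsStrongHeapObject(uint32_t tagged) {
    return (tagged & kHeapObjectTagMask) == kHeapObjectTag;
  }
  inline bool IsWeakHeapObject(uint32_t tagged) {
    return (tagged & kHeapObjectTagMask) == kWeakHeapObjectTag;
  }
  inline bool IsClearedWeakReference(uint32_t tagged) {
    return tagged == kWeakHeapObjectTag;  // all payload bits are zero
  }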

On 64-bit architectures we may use a similar approach: change the representation of all tagged values stored in the heap to be 32-bit instead of 64-bit. V8 in Chrome already has a 2Gbyte "soft" limit on the V8 heap (which may be exceeded in certain OOM situations when DevTools is open), so we don't need additional pointer multiplication and we may use 4-byte allocation alignment in the V8 heap. Other embedders, such as Node.js or Electron, or future versions of Chrome, may require bigger heaps, so the proposed pointer compression scheme will not be applicable there, and therefore we still need to support V8 with full pointers.

Note that once we make the simple compression scheme work reasonably well, and once we introduce proper bottlenecks for compression/decompression of values, we may experiment with other compression schemes that allow bigger heaps.

64-bit V8 already uses a dedicated root register, so we may use it as a base register. The root list can still be accessible via an additional indirection through the V8 address space header. An even better option would be to move the whole root list to the V8 heap header. That way we may still access root entries via the base register without additional indirection.

Heap layout

V8 heap header

64-bit architectures usually have a lot of registers, and V8 reserves one as a "root" register which points to the array of V8 roots (v8::internal::Heap::root_[]). In addition, V8 uses it for accessing the external reference table and the builtins table. All these tables and the root list are "close" enough to the "root" base that we can use the [root, #constant_offset] addressing mode on all architectures to access the values.

Since the plan is to use the "root" register as a V8 heap base register, we have to ensure that all the values previously accessible via the root register are still accessible via the V8 heap base register. We can achieve that by allocating a "V8 heap header" page of C++ memory at the base address and putting all the mentioned data there. In addition, we will store there a pointer to the v8::internal::Isolate object.
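
For a rough illustration, the header page could be laid out as sketched below. The field names, table sizes and the exact set of tables are assumptions made for this sketch, not V8's actual declarations.

  #include <cstdint>

  using Address = uintptr_t;
  class Isolate;  // v8::internal::Isolate

  // Hypothetical table sizes, for illustration only.
  constexpr int kRootListLength = 512;
  constexpr int kExternalReferenceCount = 1024;
  constexpr int kBuiltinCount = 1536;

  // Placed at the V8 heap base address, i.e. where the base register points.
  struct V8HeapHeader {
    Address  roots[kRootListLength];                        // root list
    Address  external_references[kExternalReferenceCount];  // external reference table
    Address  builtins[kBuiltinCount];                       // builtin entry points
    Isolate* isolate;                                       // back-pointer to the Isolate
  };

  // Any entry is then reachable with a single [base, #constant_offset] access,
  // e.g. offsetof(V8HeapHeader, builtins) + index * sizeof(Address).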

Variant 1 - base points to the beginning, page aligned

V8 should reserve a contiguous 4Gbyte region of address space with arbitrary alignment. The whole heap should not overlap with the first 4Gbytes of address space (the base must not be equal to 0).

With such a heap layout the compression will look like:

  int32_t compressed_tagged;
  int64_t uncompressed_tagged;

  // Compression
  if (uncompressed_tagged & 1) {
    compressed_tagged = int32_t(uncompressed_tagged - base);
  } else {
    compressed_tagged = int32_t(uncompressed_tagged);
  }

  // Decompression
  if (compressed_tagged & 1) {
    // Zero-extend the compressed offset before adding the base.
    uncompressed_tagged = uint64_t(uint32_t(compressed_tagged)) + base;
  } else {
    uncompressed_tagged = int64_t(compressed_tagged);
  }

Note:

  1. Neither compression nor decompression affects tagging (because the base is at least page-aligned).
  2. Compression and decompression of Smi and HeapObject values differ.
  3. GC can still distinguish word-size-compressed HeapObject addresses from fully decompressed ones by comparing the upper half-word of the value with zero (GC may encounter word-size-compressed HeapObject pointers on the optimized or interpreter stack); see the sketch below. Another approach would be to make GC always deal with the lower half of stack pointers - the upper half will either be zero or already contain the base value.
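
A minimal sketch of the upper half-word check from note 3 (illustrative only, not V8's actual code):

  #include <cstdint>

  // A word found on the stack is still in compressed form if its upper
  // half-word is zero; a fully decompressed pointer carries the base there.
  inline bool IsWordSizeCompressed(uint64_t value_on_stack) {
    return (value_on_stack >> 32) == 0;
  }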

Variant 2 - base points to the beginning, 4Gbyte-aligned

V8 should reserve a 4Gbyte-aligned contiguous 4Gbyte region of address space. The whole heap should not overlap with the first 4Gbytes of address space (the base must not be equal to 0).

With such a heap layout the compression will look like:

  int32_t compressed_tagged;
  int64_t uncompressed_tagged, selector_mask;
  int64_t smi_result, ho_result;

  // Compression
  compressed_tagged = int32_t(uncompressed_tagged);

  // Branchful decompression
  if (compressed_tagged & 1) {
    // Zero-extend the compressed offset before adding the base.
    uncompressed_tagged = uint64_t(uint32_t(compressed_tagged)) + base;
  } else {
    uncompressed_tagged = int64_t(compressed_tagged);
  }

  // Branchless decompression
  smi_result = int64_t(compressed_tagged);
  ho_result = base | uint64_t(uint32_t(compressed_tagged));
  selector_mask = (int64_t(compressed_tagged & 1) << 63) >> 63;
  uncompressed_tagged = (smi_result & ~selector_mask) |
                        (ho_result & selector_mask);

  // Getting base from HeapObject
  base = heap_object & 0xffffffff00000000;

Note:

  1. Neither compression nor decompression affects tagging.
  2. Compression of both Smi and HeapObject values is trivial (it is just a truncation).
  3. Branchless decompression consists of the following steps:
     a. compute a selector mask that is (-1) for HeapObjects and (0) for Smis
     b. compute the result for the Smi case
     c. compute the result for the HeapObject case
     d. bitwise OR the intermediate Smi and HeapObject results masked with the selector mask
  4. GC can still distinguish word-size-compressed HeapObject addresses from fully decompressed ones by comparing the upper half-word of the value with zero (GC may encounter word-size-compressed HeapObject pointers on the optimized or interpreter stack). Another approach would be to make GC always deal with the lower half of stack pointers - the upper half will either be zero or already contain the base value.
  5. Computing the base value from a HeapObject is trivial.

The problem with this approach is that in order to properly decompress a tagged value which is a Smi, we have to check the SmiTag bit first and then do either a sign extension or a zero extension of the value. This makes branchless decompression quite complicated. Variant 3 does not have this drawback.

Variant 3 - base points to the middle, 4Gbyte-aligned (preferred one)

V8 should reserve a contiguous 4Gbyte region of address space so that its middle is 4Gbyte-aligned. The whole heap should not overlap with the first and last 4Gbytes of address space (the base must not be equal to 0 or to 0xffffffff00000000).

With such a heap layout the compression will look like:

  int32_t compressed_tagged;
  int64_t uncompressed_tagged, selector_mask, sign_extended_tagged;

  // Compression
  compressed_tagged = int32_t(uncompressed_tagged);

  // Branchful decompression
  sign_extended_tagged = int64_t(compressed_tagged);
  if (sign_extended_tagged & 1) {
    uncompressed_tagged = sign_extended_tagged + base;
  } else {
    uncompressed_tagged = sign_extended_tagged;
  }

  // Branchless decompression
  sign_extended_tagged = int64_t(compressed_tagged);
  selector_mask = ((sign_extended_tagged & 1) << 63) >> 63;
  uncompressed_tagged = sign_extended_tagged + (base & selector_mask);

  // Getting base from HeapObject
  base = (heap_object + 0x80000000) & 0xffffffff00000000;

Note:

  1. Neither compression nor decompression affects tagging.
  2. Compression of both Smi and HeapObject values is trivial (it is just a truncation).
  3. Branchless decompression consists of the following steps:
     a. sign-extend the compressed 32-bit value to 64 bits
     b. compute a selector mask that is (-1) for HeapObjects and (0) for Smis
     c. bitwise AND the selector mask with the base (the result is either the base if the tagged value was a HeapObject, or 0 if it was a Smi)
     d. add the result of the base masking to the sign-extended tagged value
  4. GC can still distinguish word-size-compressed HeapObject addresses from fully decompressed ones by comparing the upper half-word of the value with zero or 0xffffffff (GC may encounter word-size-compressed HeapObject pointers on the optimized or interpreter stack).
  5. Computing the base value from a HeapObject is simple.

Variant 4 - base points to the beginning, 4Gbyte-aligned, smi-corrupting

The heap layout is the same as in variant 2: V8 should reserve a 4Gbyte-aligned contiguous 4Gbyte region of address space. The whole heap should not overlap with the first 4Gbytes of address space (the base must not be equal to 0).

If we change V8 to treat the upper 32 bits of Smi values as undefined, by making it look only at the lower 32 bits, then we may simplify decompression in the following way:

  uint32_t compressed_tagged;
  uint64_t uncompressed_tagged;

  // Compression
  compressed_tagged = uint32_t(uncompressed_tagged);

  // Decompression
  uncompressed_tagged = uint64_t(compressed_tagged) + base;

  // Getting base from HeapObject
  base = heap_object & 0xffffffff00000000;

Note:

  1. Neither compression nor decompression affects tagging.
  2. Compression of both Smi and HeapObject values is trivial (it is just a truncation).
  3. Decompression is just a simple addition.
  4. Decompression "corrupts" the upper 32 bits of uncompressed Smi values, which may end up holding either zero or the base value (see the round-trip sketch below).
  5. Computing the base value from a HeapObject is trivial.
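
To make the scheme concrete, here is a small self-contained round-trip sketch of Variant 4. The base value is an arbitrary 4Gbyte-aligned assumption chosen for illustration, not the address V8 actually reserves.

  #include <cassert>
  #include <cstdint>

  constexpr uint64_t kBase = 0x0000200000000000;  // assumed 4Gbyte-aligned, non-zero

  uint32_t Compress(uint64_t tagged) { return static_cast<uint32_t>(tagged); }
  uint64_t Decompress(uint32_t compressed) { return kBase + compressed; }

  int main() {
    // Strong HeapObject: lower two bits are 01.
    uint64_t heap_object = kBase + 0x12345678 + 1;
    assert(Decompress(Compress(heap_object)) == heap_object);

    // 31-bit Smi with the payload in the lower half-word: 42 << 1.
    uint64_t smi = uint64_t{42} << 1;
    uint64_t decompressed_smi = Decompress(Compress(smi));
    // Decompression "corrupts" the upper half of the Smi with the base value...
    assert((decompressed_smi >> 32) != 0);
    // ...but the value is still recoverable from the lower 32 bits alone.
    int32_t value = static_cast<int32_t>(static_cast<uint32_t>(decompressed_smi)) >> 1;
    assert(value == 42);
    return 0;
  }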

The C++ code is not affected by the undefined state of the upper Smi bits because Smi::value() already looks only at the lower 32 bits. The complications with this approach come from CSA and Torque code: we can no longer combine Smi index untagging with the multiplication in addressing modes on x64:

  // Let rbx be an array and rcx be a Smi index
  movq rax,[rbx+rcx*2+0x7]

The Smi index must be normalized (sign- or zero-extended) before it can be used in such an addressing mode (see the sketch after the list below). The solution would be to stop using Smi indices in builtins; however, that requires:

  1. Refactoring of existing builtins.
  2. Teaching Torque to generate custom call interface descriptors for helper builtins, so that untagged values can be passed as arguments.
  3. Implementation of decompression elimination in TurboFan can be simplified. See the design doc.
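
For illustration, one hypothetical way to keep such a memory operand is to re-canonicalize the Smi index first by sign-extending its lower 32 bits; the mnemonic below follows V8's x64 macro-assembler naming, and the exact sequence the implementation uses may differ.

  // rcx holds a decompressed Smi index whose upper 32 bits are undefined.
  movsxlq rcx,rcx            // sign-extend the lower half, restoring a canonical Smi
  movq rax,[rbx+rcx*2+0x7]   // the scaled addressing mode is safe again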

Benefits

The main benefit is the reduction of memory occupied by tagged pointers in the V8 heap, however there are opportunities for further improvements.

Real-world memory savings

v8.browsing_desktop

v8.browsing_desktop story          Average V8 heap size (KB)   Average ptr compression saving
media_flickr_infinite_scroll               10175.25                     34.64%
media_pinterest                            14065.82                     34.13%
media_youtube                               8900.08                     33.30%
news_cnn                                   18664.18                     34.31%
news_flipboard                              7585.29                     33.77%
news_hackernews                        (no gc stats)
news_nytimes                               15683.61                     34.09%
news_reddit                            (no gc stats)
search_google                              11476.61                     34.36%
search_google_india                         3767.88                     33.19%
social_facebook_infinite_scroll             6884.87                     33.87%
social_tumblr_infinite_scroll               8630.76                     35.94%
social_twitter                              7006.31                     35.19%
social_twitter_infinite_scroll              8049.68                     34.31%
tech_discourse_infinite_scroll              9040.64                     34.77%
tools_earth                                24108.84                     35.47%
tools_maps                                 54244.61                     37.15%
gmail.google.com                           69720.32                     34.83%
inbox.google.com                           38591.47                     35.19%

Faster V8 instantiation opportunity

V8 uses a heap snapshot to reduce the time it takes to create a new isolate (and possibly a context). At build time, a snapshot of the heap is created using the serializer and stored in a raw data file called snapshot_blob.bin that is distributed with the V8 binary. When a new isolate is created, the snapshot file is loaded into memory and deserialized to recreate an identical heap state, saving time on startup.

During deserialization V8 "allocates" objects in the V8 heap. V8 implements ASLR, and each new memory chunk is usually allocated at a random place in the address space.

As another security measure we recompute the hashes of strings using a new random hash seed and rehash internal containers (various caches, dictionaries, descriptor arrays, transition arrays, etc.).

If we decide not to apply ASLR to the snapshot data and to avoid rehashing, then the snapshot blob may be just a copy of memory chunks (already position-independent), and deserialization would only require loading certain pages from the snapshot at the same offset from the V8 heap base. We may even consider mmap-ing the snapshot blob into the V8 heap in a copy-on-write manner (this may be related to the read-only pages effort [2]), which may speed up instantiation of a new V8 instance.
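
A minimal POSIX sketch of the copy-on-write mapping idea, assuming the snapshot pages are position-independent and page-aligned in the blob file; the function name and parameters are invented for illustration.

  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdint>

  // Map one snapshot page copy-on-write at a fixed offset from the V8 heap base.
  void* MapSnapshotPage(int snapshot_fd, uint64_t heap_base,
                        uint64_t page_offset, size_t page_size) {
    void* target = reinterpret_cast<void*>(heap_base + page_offset);
    // MAP_PRIVATE gives copy-on-write semantics; MAP_FIXED places the page at
    // exactly the offset the serializer assumed.
    return mmap(target, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_FIXED, snapshot_fd,
                static_cast<off_t>(page_offset));
  }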

Note that avoiding rehashing does not seem to be an option for Node.js because of security concerns, and startup time is also not that much of an issue there.

Obstacles

Raw pointers in the heap

The existing tagging scheme allows storing 2-byte-aligned C++ pointers as Smis in the V8 heap, and the V8 API provides the respective accessor methods. Here is where the V8 API may store such values:

  1. In so called “embedder” fields of V8 objects created via API.
  2. In the “embedder data” FixedArray stored in respective slot of NativeContext.

Note that the V8 API allows storing both raw pointers and V8 heap objects in the mentioned places. This works fine when the platform word size matches the size of a tagged field, but it will no longer work with compressed pointers.

We may address (1) by reserving two consecutive 32-bit slots for each embedder field and letting GC deal only with the "lower" part of each embedder field. GC can't simply ignore these fields since they may also contain heap pointers. Given the proposed compression scheme, the lower half-word of an aligned raw pointer will still look like a Smi.
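
A sketch of option (1) with invented names (not V8's actual layout):

  #include <cstdint>

  // Each embedder field occupies two consecutive 32-bit slots. GC visits only
  // the lower slot, which holds either a compressed tagged value or the lower
  // half of an aligned raw pointer (which looks like a Smi).
  struct EmbedderFieldSlot {
    uint32_t tagged_or_raw_low;  // visited by GC
    uint32_t raw_high;           // upper half of a raw pointer; ignored by GC
  };

  inline void* ReadRawPointer(const EmbedderFieldSlot& slot) {
    return reinterpret_cast<void*>((uint64_t{slot.raw_high} << 32) |
                                   slot.tagged_or_raw_low);
  }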

We may address (2) by introducing a new kind of FixedArray that stores pairs of <raw 32-bit, tagged 32-bit> values. Another way of handling this is to use a pair of arrays: one storing the tagged values and the lower half-words of raw pointers, and another storing the upper half-words of raw pointers.

Note that the usual amount of embedder fields in the V8 heap is about 0.5%, so we can harmlessly leave them in "extended" form.

Hand-written platform stubs and builtins

We still have a number of code stubs and builtins that have not been ported to CSA yet.

Platform stubs:

  • JSEntryStub
  • ProfileEntryHookStub
  • InternalArrayConstructorStub
  • CallApiCallbackStub
  • CallApiGetterStub

Builtins:

  • Adaptor (trivial, does not require porting)
  • AdaptorWith[Builtin]ExitFrame (cl)
  • JSConstructStubGenericRestrictedReturn
  • JSConstructStubGenericUnrestrictedReturn
  • JSBuiltinsConstructStub
  • ConstructedNonConstructable
  • JSEntryTrampoline
  • JSConstructEntryTrampoline
  • ResumeGeneratorTrampoline
  • InterpreterEntryTrampoline
  • InterpreterPushArgsThenCallImpl
  • InterpreterPushArgsThenConstructImpl
  • InterpreterEnterBytecodeAdvance
  • InterpreterEnterBytecodeDispatch
  • CompileLazy[DeoptimizedCode] (cl)
  • DeserializeLazy (cl)
  • InstantiateAsmJs
  • ContinueToCodeStubBuiltin
  • ContinueToCodeStubBuiltinWithResult
  • ContinueToJavaScriptBuiltin
  • ContinueToJavaScriptBuiltinWithResult
  • NotifyDeoptimized
  • FunctionPrototypeApply
  • FunctionPrototypeCall
  • ReflectApply
  • ReflectConstruct
  • InternalArrayConstructor
  • ArrayConstructor (cl1, cl2, cl3)
  • AllocateIn[New,Old]Space (cl)
  • Abort (cl)
  • ArgumentsAdaptorTrampoline
  • CallOrConstructVarargs
  • CallOrConstructForwardVarargs
  • CallFunction
  • CallBoundFunctionImpl
  • Call
  • ConstructFunction
  • ConstructBoundFunction
  • Construct
  • OnStackReplacement
  • InterpreterOnStackReplacement
  • WasmCompileLazy
  • CEntry
  • DoubleToI
  • MathPowInternal

Some of these (the entry points and trampolines) can't be ported to CSA, but it makes sense to port at least the non-entry and non-trampoline ones to CSA/Torque to ease the process of supporting compressed pointers. The candidates are Array*, Reflect* and those builtins which just tail-call runtime functions. Ideally, in the end only the trampoline code should be left in hand-written assembly form.

32-bit vs. 31-bit Smis

The current Smi tagging scheme encodes a 32-bit value in the upper part of the 64-bit word. Once we start compressing pointers it will no longer be possible to access Smi values stored in memory directly (by reading or writing only the upper 32 bits of the tagged word). This may also cost us some performance.
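
For illustration, here is a sketch of the kind of direct access that gets lost, assuming the current upper-half-word encoding and a little-endian layout; the helper names are invented.

  #include <cstdint>

  // Current 64-bit encoding: Smi == int64_t(value) << 32; the lower 32 bits hold the tag (0).
  inline int32_t DirectSmiLoad(const uint64_t* field) {
    // Read only the upper half-word; no shift/untag instruction is needed.
    return *(reinterpret_cast<const int32_t*>(field) + 1);
  }

  inline void DirectSmiStore(uint64_t* field, int32_t value) {
    // If the field already holds a Smi (lower half is 0), only the upper
    // half-word needs to be written.
    *(reinterpret_cast<int32_t*>(field) + 1) = value;
  }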

Complicated unboxed doubles support

It would be hard to support unboxed double fields because object fields shift when the representation changes from Smi to Double or from Double to Tagged. However, it is still doable.

Alternatively, we may use a completely different approach to double unboxing which would also work on 32-bit architectures: using a FixedDoubleArray as a separate backing store for all double properties. This would be a slightly less efficient but more GC-friendly approach than what we have right now.

NaN tagging

NaN tagging aims at reducing the cost of updating metadata upon field representation changes, so that Smi->Double->Tagged transitions may be done by just updating the hidden class in place and maybe deoptimizing the optimized code that relies on the exact representation. Given that the size of a tagged field will no longer be equal to the size of a double field, NaN tagging would not be an option anymore with pointer compression enabled.

However, given that full-pointer mode will stay in V8, we may still implement NaN tagging there.

Testability and finchability

It looks like pointer compression will add a new dimension (or at least half of one) to the configuration matrix, because the full-pointer version will still be needed for Node.js and probably for other high-performance, big-memory devices given the "memory is cheap" tendency. So probably even low-end devices will get "huge" amounts of memory at some point, and people tend to write ever bigger web applications for Chrome.

Unfortunately, pointer compression can't be implemented as a finch experiment because it cannot be a runtime flag. The reason is that the feature affects core constants such as the pointer size, V8 object layout constants and the Smi tagging scheme, which are used extremely widely across all V8 code (including the runtime, builtins, interpreter and compilers); turning all those constants into variables would be a huge complexity effort and a performance regression. Moreover, these internal constants are also exposed via the V8 API, which imposes a dependency on the embedder.

Probably the most we can do to make this finchable is to hide all the internal constants from the V8 API, ship two V8 binaries with Chrome, and make it use the proper one depending on the finch switch. With this approach it would not be possible to compress V8 values exposed via the V8 API - they would have to be at least pointer-sized.

Note, however, that we may still use a finch flag between steps 2 (switch Smis to 31 bits) and 9 (change the size of tagged fields) for testing the implementation (see the plan below).

Incremental implementation plan (x64)

  1. Add a dedicated V8 build mode.
  2. Support 31-bit Smis on 64-bit platforms with the payload in the lower half-word (sign-extended).
  3. Make GC allocate pages in a properly aligned contiguous "heap" region of address space.
  4. Turn the root register into a base register. Introduce the V8 header page and move the following data there:
     • Root list from v8::internal::Heap
     • External reference table from v8::internal::Heap
     • Builtin table from v8::internal::Builtins
  5. Teach GC and the runtime to deal with embedder fields in API objects.
  6. Support embedder data in NativeContext.
  7. Naively support compressed pointers: decompress on every load, compress on every store, while keeping the compressed tagged fields in 64-bit form and accessing only the lower half. Substeps:
     a. Teach GC to deal with compressed pointers stored in the heap.
     b. Support compressed pointers in optimized stack frames.
     c. Runtime and CSA support: store compressed values in heap objects (update the C++ macros used for accessing V8 object fields, and the respective CSA macros).
     d. Support compressed pointers in TF.
     e. Support compressed pointers in Liftoff.
     f. Support compressed pointers in Wasm.
     g. Support compressed pointers in hand-written stubs and builtins.
  8. Enable the compressed pointers implementation in Chrome via a finch experiment (all tagged values are accessed as 32-bit values but still occupy 64 bits in memory).
  9. Change the size of compressed tagged fields to 32 bits and fix all the places where we compute the offset of the index'th tagged value by multiplying the index by kPointerSize.
  10. Optimize away unnecessary decompression/compression rounds in the runtime, CSA and TF (this may require adding another value representation; however, we may still use the existing stack maps).
  11. Revisit the way we handle Smis in CSA and TF code. It may be beneficial in terms of code size to operate on 32-bit values directly instead of 64-bit values.
  12. Measure performance. Profit?
  13. Consider using the segment registers %fs or %gs on operating systems that support reconfiguring the segment base address. This may allow us to reduce register pressure in generated code that loads from or stores to the V8 heap. It is especially helpful when the JIT produces an instruction that already uses two registers for addressing: x86 CPUs do not support 3-register addressing, but a %fs or %gs segment prefix adds the base address without the need to split the instruction into two (see the sketch after this list). Slavamn@ claims that the %fs/%gs prefix does not affect latency on Intel CPUs. Note that instructions explicitly or implicitly accessing the stack on x64 silently ignore segment overrides.
  14. Once we can use the segment prefix in memory-accessing instructions, we may generate x32-style code. In such code we can avoid a lot of 64-bit sign- and zero-extension instructions and 64-bit instruction prefixes, and in the end we may even get a reduction of generated code size.
  15. Consider storing compressed pointers in V8 API values, which may also decrease the Blink heap size.
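
For illustration, here is a hypothetical instruction sequence for step 13, showing how a %gs segment base could stand in for a heap base register; the syntax mirrors the earlier snippets, the register choice is arbitrary, and whether the implementation ends up doing this is an open question.

  // Without a segment base (r13 holds the heap base): x86 addressing cannot
  // combine three registers, so the base has to be added in a separate instruction.
  leaq rax,[rbx+r13]
  movq rax,[rax+rcx*8+0xf]

  // With the gs segment base configured to the heap base, the prefix adds it
  // implicitly and a single instruction suffices.
  movq rax,gs:[rbx+rcx*8+0xf]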

Rejected ideas

The obvious solution - running 32-bit V8 everywhere - fails for several reasons:

  • Chrome has already switched to 64 bit and runs V8 in a 64-bit renderer process, which forces V8 to be 64-bit too. We do not consider establishing an IPC boundary between Chrome and V8 in this document because of the overhead of accessing V8 objects from Blink and vice versa.
  • Android is moving to 64 bit too (faster for Nexus/Pixel than OEMs).

Similarly, the ART runtime's solution of plain 32-bit heap pointers in a 64-bit VM would not work for us:

  • MacOSX does not allow allocation of memory in the low 4Gbytes at all.
  • Fuchsia processes may not have the proper access rights to allocate virtual memory at a specific address.
  • Precisely because ART does this, the low 4Gbytes (the 32-bit range) of virtual memory are already very crowded on Android.
  • Although each V8 instance's heap is limited to 2Gbytes, there might be multiple V8 instances in one process (for example, each Web Worker has its own V8 heap).

Related work

[1] Compressed pointers in the Dart VM

[2] A Read-only space in V8

[3] Compressed oops in the Hotspot JVM

[4] Compressed pointers support in IBM J9 VM. Whitepaper.

[5] Object-Relative Addressing: Compressed Pointers in 64-bit Java Virtual Machines