Compressed pointers in V8
Attention: Shared Google-externally
Authors: ishell@chromium.org
Status: FINAL
Last updated: 2019-09-11
This document outlines the plan to evaluate and implement pointer compression in V8.
Tracking bug: https://bugs.chromium.org/p/v8/issues/detail?id=7703
Link to this doc: go/v8-pointer-compression
Shipping plan and implementation details are being discussed here.
We expect a ~35% reduction of the V8 heap on 64-bit platforms for real-world web sites (see the benefits section for details). The updated V8 heap stats collector and visualizer can show the potential savings of pointer compression: go/v8-heap-stats. However, because of the additional work required for compressing and decompressing values, some performance cost will have to be paid, and the generated code size may also increase.
The exact performance penalty is yet to be measured. However, the compression scheme can be chosen such that the compression/decompression code is branchless, which in turn should reduce the penalty on modern CPUs.
On the other hand, reducing memory may also improve performance because of better utilization of CPU caches, which will be able to fit twice as many compressed tagged pointers as uncompressed ones. Note that the heap constants embedded into generated code will also be 32-bit instead of 64-bit.
There are also other potential benefits that may come out of pointer compression; see the respective section below.
Update: we are implementing Variant 4 as it proved to be faster than the other variants.
On 32-bit platforms V8 uses the lowest bit to distinguish small integers (Smis) from heap objects, and the second-lowest bit of a heap object address to distinguish strong references from weak ones:
Smi: [xx...31...xxx]0
Strong HeapObject: [xx...30...xx]01
Weak HeapObject: [xx...30...xx]11
Cleared weak reference: [00...30...00]11
So we have 30 bits that we can use as a heap object address payload. The supported allocation granularity is 4 bytes, which gives us 4 GB of addressable space.
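For illustration, the checks implied by this tagging scheme can be written as follows (a standalone sketch; the constant names mirror V8's conventions but the definitions below are spelled out here for illustration, not copied from V8 sources):

#include <cstdint>

constexpr uint32_t kSmiTagMask = 0x1;            // the lowest bit
constexpr uint32_t kHeapObjectTag = 0x1;
constexpr uint32_t kWeakHeapObjectTag = 0x3;     // the two lowest bits
constexpr uint32_t kHeapObjectTagMask = 0x3;
constexpr uint32_t kClearedWeakReference = 0x3;  // zero payload, weak tag

bool IsSmi(uint32_t tagged) { return (tagged & kSmiTagMask) == 0; }
bool IsStrongHeapObject(uint32_t tagged) {
  return (tagged & kHeapObjectTagMask) == kHeapObjectTag;
}
bool IsWeakHeapObject(uint32_t tagged) {
  return (tagged & kHeapObjectTagMask) == kWeakHeapObjectTag;
}
bool IsClearedWeakReference(uint32_t tagged) {
  return tagged == kClearedWeakReference;
}
int32_t SmiValue(uint32_t tagged) {
  return int32_t(tagged) >> 1;  // the 31-bit payload sits above the tag bit
}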
On 64-bit architectures we may use a similar approach: change the representation of all tagged values stored in the heap to be 32-bit instead of 64-bit. V8 in Chrome already has a 2 GB “soft” limit on the V8 heap (which may be exceeded in certain OOM situations when DevTools is open), so we don’t need additional pointer multiplication and we may use 4-byte allocation alignment in the V8 heap. Other embedders, such as Node.js, Electron or future Chromes, may require bigger heaps, so the proposed pointer compression scheme will not be applicable there and therefore we still need to support V8 with full pointers.
Note that once we make the simple compression scheme work reasonably well, and once we introduce proper bottlenecks for compression/decompression of values, we may experiment with other compression schemes that allow bigger heaps.
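For illustration only, a multiplication-based scheme in the spirit of compressed oops [3] could trade an extra shift for a larger addressable heap. The sketch below is not the scheme proposed in this document and ignores the interaction with the Smi tag bit; the 8-byte granularity is an assumption:

#include <cstdint>

constexpr unsigned kCompressionShift = 3;  // log2(allocation granularity)

// With 8-byte allocation granularity, 32 bits of payload can address 32 GB.
uint32_t Compress(uint64_t base, uint64_t uncompressed) {
  return uint32_t((uncompressed - base) >> kCompressionShift);
}
uint64_t Decompress(uint64_t base, uint32_t compressed) {
  return base + (uint64_t{compressed} << kCompressionShift);
}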
64-bit V8 already uses a dedicated root register, so we may reuse it as the base register. The root list can still be accessible via an additional indirection through the V8 address space header. An even better option would be to move the whole root list into the V8 heap header; that way we could still access root entries via the base register without additional indirection.
64-bit architectures usually have plenty of registers, and V8 reserves one as a “root” register which points to the array of V8 roots (v8::internal::Heap::root_[]). In addition, V8 uses it for accessing the external reference table and the builtins table. All these tables and the root list are “close” enough to the “root” base that we can use the [root, #constant_offset] addressing mode on all architectures to access their values.
Since the plan is to use the “root” register as the V8 heap base register, we have to ensure that all the values previously accessible via the root register remain accessible via the V8 heap base register. We can achieve that by allocating a “V8 heap header” page of C++ memory at the base address and putting all the mentioned data there. In addition, we will store there a pointer to the v8::internal::Isolate object.
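A rough sketch of such a header page is given below; the field order and the table sizes are placeholders, not values taken from V8:

#include <cstdint>

namespace v8 { namespace internal { class Isolate; } }
using Address = uintptr_t;

// Placeholder table sizes, for illustration only.
constexpr int kRootListLength = 512;
constexpr int kExternalReferenceCount = 1024;
constexpr int kBuiltinCount = 1536;

// The page allocated at the heap base address. With the base register
// pointing here, every entry stays reachable via the
// [base, #constant_offset] addressing mode.
struct V8HeapHeader {
  Address roots[kRootListLength];  // v8::internal::Heap root list
  Address external_references[kExternalReferenceCount];
  Address builtins[kBuiltinCount];
  v8::internal::Isolate* isolate;  // back-pointer to the Isolate
};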
Variant 1
V8 should reserve a contiguous 4 GB region of address space with arbitrary alignment. The whole heap should not overlap with the first 4 GB of address space (the base must not be equal to 0).
With such a heap layout the compression will look like:
int32_t compressed_tagged;
int64_t uncompressed_tagged;
int64_t base;  // start of the 4 GB heap reservation
// Compression (assuming 31-bit Smis)
if (uncompressed_tagged & 1) {
  // Heap object: store the 32-bit offset from the base.
  compressed_tagged = int32_t(uncompressed_tagged - base);
} else {
  // Smi: just truncate.
  compressed_tagged = int32_t(uncompressed_tagged);
}
// Decompression
if (compressed_tagged & 1) {
  // Heap object: zero-extend the offset and add the base back.
  uncompressed_tagged = int64_t(uint32_t(compressed_tagged)) + base;
} else {
  // Smi: sign-extend.
  uncompressed_tagged = int64_t(compressed_tagged);
}
Variant 2
V8 should reserve a 4 GB-aligned contiguous 4 GB region of address space. The whole heap should not overlap with the first 4 GB of address space (the base must not be equal to 0).
With such a heap layout the compression will look like:
int32_t compressed_tagged;
int64_t uncompressed_tagged, selector_mask;
int64_t smi_result, ho_result;
int64_t base;         // 4 GB-aligned start of the heap reservation
int64_t heap_object;  // any uncompressed HeapObject pointer
// Compression (assuming 31-bit Smis)
compressed_tagged = int32_t(uncompressed_tagged);
// Branchful decompression
if (compressed_tagged & 1) {
  // Heap object: zero-extend the offset and add the base back.
  uncompressed_tagged = int64_t(uint32_t(compressed_tagged)) + base;
} else {
  // Smi: sign-extend.
  uncompressed_tagged = int64_t(compressed_tagged);
}
// Branchless decompression
smi_result = int64_t(compressed_tagged);                  // sign-extended
ho_result = base | int64_t(uint32_t(compressed_tagged));  // zero-extended; base is 4 GB-aligned
selector_mask = (int64_t(compressed_tagged & 1) << 63) >> 63;  // 0 for Smi, all ones for HeapObject
uncompressed_tagged = (smi_result & ~selector_mask) |
                      (ho_result & selector_mask);
// Getting the base from a HeapObject
base = heap_object & 0xffffffff00000000;
Note:
The problem with such an approach is that in order to properly decompress a tagged value which is a Smi we have to check the Smi tag bit first and then do either a sign extension or a zero extension of the value. This makes branchless decompression quite complicated. Variant 3 does not have this drawback.
Variant 3
V8 should reserve a contiguous 4 GB region of address space such that its middle is 4 GB-aligned. The whole heap should not overlap with the first and last 4 GB of address space (the base must not be equal to 0 or to 0xffffffff00000000).
With such a heap layout the compression will look like:
int32_t compressed_tagged;
int64_t uncompressed_tagged, selector_mask, sign_extended_tagged;
int64_t base;         // 4 GB-aligned middle of the heap reservation
int64_t heap_object;  // any uncompressed HeapObject pointer
// Compression (assuming 31-bit Smis)
compressed_tagged = int32_t(uncompressed_tagged);
// Branchful decompression
sign_extended_tagged = int64_t(compressed_tagged);
if (sign_extended_tagged & 1) {
  // Heap object: the offset from the base lies within [-2 GB, +2 GB),
  // so sign extension is correct here.
  uncompressed_tagged = sign_extended_tagged + base;
} else {
  // Smi: already sign-extended.
  uncompressed_tagged = sign_extended_tagged;
}
// Branchless decompression
sign_extended_tagged = int64_t(compressed_tagged);
selector_mask = ((sign_extended_tagged & 1) << 63) >> 63;  // 0 for Smi, all ones for HeapObject
uncompressed_tagged = sign_extended_tagged + (base & selector_mask);
// Getting the base from a HeapObject
base = (heap_object + 0x80000000) & 0xffffffff00000000;
Variant 4
The heap layout is the same as in Variant 2: V8 should reserve a 4 GB-aligned contiguous 4 GB region of address space. The whole heap should not overlap with the first 4 GB of address space (the base must not be equal to 0).
If we change V8 to treat the upper 32 bits of Smi values as undefined, by making it look only at the lower 32 bits, then we can simplify decompression in the following way:
uint32_t compressed_tagged;
uint64_t uncompressed_tagged;
uint64_t base;         // 4 GB-aligned start of the heap reservation
uint64_t heap_object;  // any uncompressed HeapObject pointer
// Compression (assuming 31-bit Smis)
compressed_tagged = uint32_t(uncompressed_tagged);
// Decompression: unconditionally add the base; for Smis this leaves garbage
// in the upper 32 bits, which is fine since those bits are treated as undefined.
uncompressed_tagged = uint64_t(compressed_tagged) + base;
// Getting the base from a HeapObject
base = heap_object & 0xffffffff00000000;
Note:
The C++ code is not affected by the undefined state of the upper Smi bits because Smi::value() already looks only at the lower 32 bits. The complications with such an approach come from the CSA and Torque code: we can no longer combine Smi index untagging with the multiplication in addressing modes on x64:
// Let rbx be an array and rcx be a Smi index
movq rax,[rbx+rcx*2+0x7]
The Smi index must be normalized (sign- or zero-extended) before it can be used in such an addressing mode. The solution would be to stop using Smi indices in builtins, however that requires
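To illustrate the normalization step, here is a minimal C++ sketch (a hypothetical helper, not CSA/Torque code, assuming 31-bit Smis and the 0x7 displacement from the example above):

#include <cstdint>

// Computes the address of a 4-byte element indexed by a 31-bit Smi. With
// well-defined upper bits the *2 scaling could be folded into the x64
// addressing mode; with Variant 4 the upper 32 bits must be discarded first.
uint64_t ElementAddress(uint64_t array, uint64_t smi_index) {
  uint64_t normalized = uint64_t(uint32_t(smi_index));  // keep only the low half
  return array + normalized * 2 + 0x7;  // Smi is value << 1, element size is 4
}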
The main benefit is the reduction of memory occupied by tagged pointers in the V8 heap; however, there are opportunities for further improvements.
v8.browsing_desktop story | Average V8 heap size | Average pointer compression saving |
media_flickr_infinite_scroll | 10175.25 KB | 34.64% |
media_pinterest | 14065.82 KB | 34.13% |
media_youtube | 8900.08 KB | 33.30% |
news_cnn | 18664.18 KB | 34.31% |
news_flipboard | 7585.29 KB | 33.77% |
news_hackernews | (no gc stats) | |
news_nytimes | 15683.61 KB | 34.09% |
news_reddit | (no gc stats) | |
search_google | 11476.61 KB | 34.36% |
search_google_india | 3767.88 KB | 33.19% |
social_facebook_infinite_scroll | 6884.87 KB | 33.87% |
social_tumblr_infinite_scroll | 8630.76 KB | 35.94% |
social_twitter | 7006.31 KB | 35.19% |
social_twitter_infinite_scroll | 8049.68 KB | 34.31% |
tech_discourse_infinite_scroll | 9040.64 KB | 34.77% |
tools_earth | 24108.84 KB | 35.47% |
tools_maps | 54244.61 KB | 37.15% |
gmail.google.com | 69720.32 KB | 34.83% |
inbox.google.com | 38591.47 KB | 35.19% |
V8 uses a heap snapshot to reduce the time it takes to create a new isolate (and possibly a context). At build time, a snapshot of the heap is created using the serializer and stored in a raw data file, snapshot_blob.bin, which is distributed with the V8 binary. When a new isolate is created, the snapshot file is loaded into memory and deserialized to recreate an identical heap state, saving time on startup.
During deserialization V8 “allocates” objects in the V8 heap. V8 implements ASLR, and each new memory chunk is usually allocated at a random place in the address space.
As another security measure we recompute the hashes of strings using a new random hash seed and rehash internal containers (various caches, dictionaries, descriptor arrays, transition arrays, etc.).
If we decide not to ASLR the snapshot data and to avoid rehashing, then the snapshot blob can be just a copy of memory chunks (already position-independent), and deserialization would just require loading certain pages from the snapshot at the same offset from the V8 heap base. We may even consider mmap-ing the snapshot blob into the V8 heap in a copy-on-write manner (this may be related to the read-only pages effort [2]), which may speed up instantiation of a new V8 instance.
Note that avoiding rehashing does not seem to be an option for Node.js because of security concerns, and startup time is not that much of an issue there anyway.
The existing tagging scheme allows storing 2-byte-aligned C++ pointers as Smis in the V8 heap, and the V8 API provides the corresponding accessor methods. The V8 API may store such values in:
(1) embedder (internal) fields of API objects;
(2) embedder data of contexts, which is backed by a FixedArray-like array in the V8 heap.
Note that the V8 API allows storing both such raw pointers and V8 heap objects in the mentioned places. This works fine when the size of the platform word matches the size of a tagged field, but will no longer work with compressed pointers.
We may address (1) by reserving two consecutive 32-bit slots for each embedder field and letting the GC deal only with the “lower” part of each embedder field. The GC can’t ignore these fields entirely, since they may also contain heap pointers. Taking into account the proposed compression scheme, the lower half-word of an aligned raw pointer will still look like a Smi.
We may address (2) by introducing a new kind of FixedArray for storing pairs of <raw 32-bit, tagged 32-bit> values. Another way of handling this is to establish a pair of arrays: one for storing tagged values and the lower half-words of raw pointers, and another for storing the upper half-words of raw pointers.
Note that the usual amount of embedder fields in the V8 heap is about 0.5%, so we can harmlessly leave them in “extended” form.
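A sketch of option (1) is shown below; the names are illustrative and not part of the V8 API:

#include <cstdint>

// Each embedder field occupies two consecutive 32-bit slots. The GC visits
// only the first slot; a 2-byte-aligned raw pointer has its lowest bit clear,
// so its lower half still looks like a Smi to the GC.
struct EmbedderFieldSlot {
  uint32_t tagged_or_low;  // compressed tagged value or low half of a raw ptr
  uint32_t raw_high;       // high half of a raw pointer, ignored by the GC
};

void StoreAlignedPointer(EmbedderFieldSlot* slot, void* ptr) {
  uintptr_t raw = reinterpret_cast<uintptr_t>(ptr);  // lowest bit is 0
  slot->tagged_or_low = uint32_t(raw);
  slot->raw_high = uint32_t(uint64_t(raw) >> 32);
}

void* LoadAlignedPointer(const EmbedderFieldSlot* slot) {
  uint64_t raw = (uint64_t{slot->raw_high} << 32) | slot->tagged_or_low;
  return reinterpret_cast<void*>(uintptr_t(raw));
}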
We still have a number of code stubs and builtins which have not been ported to CSA yet.
Platform stubs:
Builtins:
Builtins which can’t be ported to CSA:
It makes sense to port at least some of them to CSA/Torque (the non-entry and non-trampoline ones) to ease the process of supporting compressed pointers. The candidates are Array*, Reflect* and those builtins which just tail-call runtime functions. Ideally, in the end only the trampoline code should be left in hand-written assembly form.
The current Smi tagging scheme encodes a 32-bit value in the upper part of the 64-bit word. Once we start compressing pointers it will no longer be possible to access Smi values stored in memory directly (by reading/writing only the upper 32 bits of the tagged pointer). This may also cost us some performance.
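The difference between the two encodings can be sketched as follows (the constants are spelled out for clarity, not copied from V8 sources):

#include <cstdint>

// Today on 64-bit: the 32-bit payload lives in the upper half of the word,
// so it can be read or written directly as a raw int32 at the upper half of
// the field (on little-endian platforms, field address + 4).
uint64_t EncodeSmi64(int32_t value) { return uint64_t(uint32_t(value)) << 32; }
int32_t DecodeSmi64(uint64_t smi) { return int32_t(smi >> 32); }

// With compressed 32-bit tagged fields we need 31-bit Smis: the payload
// lives in bits [1..31], so there is no aligned sub-word holding the value
// and an untagging shift is always required.
uint32_t EncodeSmi31(int32_t value) { return uint32_t(value) << 1; }
int32_t DecodeSmi31(uint32_t smi) { return int32_t(smi) >> 1; }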
It would be hard to support unboxed double fields because of object field shifting when changing representation from Smi to Double or from Double to Tagged. However, it is still doable.
Alternatively, we may decide to use a completely different approach to double unboxing which would also work on 32-bit architectures: using a FixedDoubleArray as a separate backing store for all the double properties. This would be a bit less efficient but a more GC-friendly approach than the one we have right now.
NaN tagging aims at reducing the cost of updating metadata upon field representation changes, so that Smi->Double->Tagged transitions can be done by just updating the hidden class in place and maybe deoptimizing the optimized code that relies on the exact representation. Given that the size of a tagged field will no longer be equal to the size of a double field, NaN tagging would not be an option with pointer compression enabled.
However, given that full-pointer mode support will stay in V8, we may still implement NaN tagging there.
It looks like pointer compression will add a new dimension (or at least half of one) to the configuration matrix, because the full-pointer version will be needed for Node.js and probably for other high-performance, big-memory devices given the “memory is cheap” tendency. So probably even low-end devices will get “huge” amounts of memory at some point, and people tend to write ever bigger web applications for Chrome.
Unfortunately, pointer compression can’t be implemented as a finch experiment because it will not be a runtime flag. The reason is that the feature affects such core things as the pointer-size constant, V8 object layout constants and the Smi tagging scheme, which are used extremely widely across all V8 code (including the runtime, builtins, interpreter and compilers), and it would be a huge complexity effort and a performance regression to turn all those constants into variables. Moreover, these internal constants are also exposed via the V8 API, which imposes a dependency on the embedder.
Probably the most we can do to make this finchable is to hide all the internal constants from the V8 API, ship two V8 binaries with Chrome and make it pick the proper one depending on the finch switch. With this approach it would not be possible to compress V8 values exposed via the V8 API - they would have to be at least pointer-sized.
Note, however, that we may still use a finch flag between steps 2 (switch Smis to 31-bit) and 9 (change the size of tagged fields) for testing the implementation (see the plan below).
The obvious solution - running the 32-bit V8 everywhere - fails for several reasons:
Similarly, the ART runtime solution of plain 32-bit heap pointers in a 64-bit VM would not work for us:
[1] Compressed pointers in the Dart VM
[3] Compressed oops in the Hotspot JVM
[4] Compressed pointers support in IBM J9 VM. Whitepaper.
[5] Object-Relative Addressing: Compressed Pointers in 64-bit Java Virtual Machines