V8 CFI Plans

Last update: 05/2022

Status: implementation in progress[a][b]

Visibility: SHARED EXTERNALLY

Outline

This document describes a proposal to add control-flow integrity to V8. Our goal is to prevent arbitrary shellcode execution.

Besides the common forward- and backward-edge CFI, we also need to protect the output of the JIT compiler.

We’re planning to use a verifier that checks the integrity of the assembler output before making it executable[d][e][f][g][h][i][j][k][l][m][n]. Integrity in this case means that only a subset of the instructions can be emitted (e.g. no syscall instruction) and that direct and indirect jumps end up at allowed targets.

We assume that the attacker has a concurrent arbitrary read/write primitive. Since the verifier is running in the same process, it needs to protect its data using per-thread memory permissions.

Hardware requirements[o][p][q][r][s][t][u][v][w]:

  • Backward-edge CFI: ideally a shadow stack (CET[x][y][z]) or separate control stack, alternatively signed return addresses (PAC)
  • Per-thread memory permissions with fast switching[aa][ab][ac]
      ◦ E.g. memory protection keys (PKU/PKEYS) or APRR (will use PKEY as the example going forward)
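
For reference, a minimal sketch of what the PKEY interface looks like on Linux. pkey_set is a userspace wrapper around the WRPKRU instruction, so enabling/disabling access is a register write rather than a syscall, which is what makes the fast-switching requirement plausible:

    #define _GNU_SOURCE  // for the pkey_* wrappers (glibc >= 2.27)
    #include <sys/mman.h>

    int main() {
      // Allocate a key; the calling thread initially has no access to
      // memory tagged with it.
      int pkey = pkey_alloc(/*flags=*/0, /*access_rights=*/PKEY_DISABLE_ACCESS);

      void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      // Tag the mapping with the key; the prot bits still apply on top.
      pkey_mprotect(mem, 4096, PROT_READ | PROT_WRITE, pkey);

      pkey_set(pkey, 0);  // enable access for this thread (plain WRPKRU)
      // ... critical section: this thread can read/write `mem` ...
      pkey_set(pkey, PKEY_DISABLE_ACCESS);  // drop access again
      return 0;
    }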

OS requirements:

  • PKEY[ad][ae] support in the OS
  • Memory sealing: prevent unmapping pkey-protected memory[af] (details)
      ◦ This could be avoided if the key is part of the mapping/address.[ag][ah]
  • PKEY-protected signal/exception handlers (details)

The rest of the document will discuss the challenges we’re aware of, i.e. how an attacker can gain native code execution, and how we’re planning to address them.

Design

Return address corruption

Attackers can overwrite return addresses on the stack and do ROP.

Mitigation:

We’re planning to rely on hardware features to protect against these attacks:

  • Intel CET (shadow stack)
  • PAC on ARM

Forward-edge protection

Attackers can overwrite function pointers or vtables to gain $pc control.

Mitigation:

We will use forward-edge CFI to limit valid targets for indirect calls. We will start with a coarse-grained CFI scheme, i.e. we will allow all address-taken functions as targets and not perform function signature checks. The reasoning behind this is:

  • Coarse-grained CFI will have lower performance overhead.
  • In our model, the attacker can overwrite any function pointer. It’s possible that they will find one with the right signature.
  • Since the attacker is concurrent, they can overwrite the stack from other running threads. Instead of calling the gadget, they can wait for another thread to execute the target function and mess with local variables.

While fine-grained CFI[ai][aj][ak][al][am][an] will impose stricter constraints, it’s not clear yet if it’s worth[ao] the extra performance overhead and we will reassess this at a later point in time.

Note that we have to update valid call targets at runtime for JIT compiled code.

The particular CFI scheme to use is hardware and OS dependent:

  • Windows: system DLLs already use CFG, so adopting it as well will be the easiest option for compatibility.
      ◦ When allowlisting JIT compiled code as valid function entry points, we need to ensure that the attacker can’t mess with the data. For that, we need to make sure to only perform this operation while our thread’s memory is protected with pkeys. (more details later)
  • Other OSes: everywhere else, we can use hardware support like Intel CET’s ENDBRANCH or ARM BTI. With these features, any indirect call needs to end at an endbranch instruction, i.e. valid targets need to be specifically marked as such. There are some extra challenges with this approach:
      ◦ We need to ensure that JIT code doesn’t include (unaligned) endbranch instructions. This means we will have to mask constants in code to prevent emitting this byte pattern (see the sketch after this list).
      ◦ System libraries need to be compiled with endbranch support. Note that we explicitly don’t want all function entry points marked with endbranch instructions, only address-taken functions. We will need to detect at runtime if the system libraries have endbranch support.
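
To illustrate the constant-masking problem on x64: ENDBR64 encodes as F3 0F 1E FA, so a 64-bit immediate can embed a valid landing pad at an unaligned offset. A minimal sketch of the check plus one possible mov/xor rewrite (illustrative, not necessarily the scheme we will ship):

    #include <cstdint>
    #include <cstdio>

    // ENDBR64 is the byte sequence F3 0F 1E FA; read as a little-endian
    // 32-bit value inside an immediate, that is 0xFA1E0FF3.
    constexpr uint32_t kEndbr64 = 0xFA1E0FF3;

    // Does the 4-byte pattern occur at any byte offset of the immediate?
    bool ContainsEndbr(uint64_t imm) {
      for (int shift = 0; shift <= 32; shift += 8) {
        if (static_cast<uint32_t>(imm >> shift) == kEndbr64) return true;
      }
      return false;
    }

    // Pick an XOR mask such that the masked immediate is clean. The assembler
    // then emits "mov reg, imm ^ mask; mov scratch, mask; xor reg, scratch"
    // instead of a single mov, so the raw pattern never appears in the code.
    uint64_t PickMask(uint64_t imm) {
      for (uint64_t k = 1;; ++k) {
        // Repeated-byte masks can never contain the pattern themselves, and
        // at most a handful of them collide with any given immediate.
        uint64_t mask = k * 0x0101010101010101ULL;
        if (!ContainsEndbr(imm ^ mask)) return mask;
      }
    }

    int main() {
      uint64_t imm = 0x1234FA1E0FF35678ULL;  // embeds ENDBR64 unaligned
      uint64_t mask = PickMask(imm);
      std::printf("emit: mov reg, %#llx ; xor reg, %#llx\n",
                  static_cast<unsigned long long>(imm ^ mask),
                  static_cast<unsigned long long>(mask));
      return 0;
    }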

JIT compilation

The attacker can overwrite the machine code generated by the JIT compiler. This doesn’t just include the final code, but also intermediate buffers and local variables used by the assembler if they’re stored on the stack.

Mitigation:

We can ensure that the generated machine code has certain properties by building a verifier that checks what kind of instructions the code will execute. The verifier itself needs to run in a way that the attacker can’t overwrite its data: the stack and any data it reads need to be protected with per-thread memory permissions (i.e. pkeys).

This could look roughly like:
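
As a rough sketch of the flow (all helper names here are hypothetical, and locking and error handling are glossed over):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical helpers; everything they touch lives in pkey-protected
    // memory, out of reach of the attacker's concurrent write primitive.
    void SwitchToProtectedStack();    // see "pkey protected stack" below
    void SwitchBackToCallerStack();
    uint8_t* CopyToProtectedBuffer(const uint8_t* code, size_t size);
    bool Verify(const uint8_t* code, size_t size);  // checks listed below
    void PublishToCodeSpace(const uint8_t* code, size_t size, uint8_t* target);

    bool InstallJitCode(const uint8_t* unsafe_code, size_t size,
                        uint8_t* target) {
      SwitchToProtectedStack();  // locals are now out of the attacker's reach
      // Copy before verifying: checking the attacker-writable buffer in place
      // would be a TOCTOU race, since it could be patched after the checks.
      uint8_t* code = CopyToProtectedBuffer(unsafe_code, size);
      bool ok = Verify(code, size);
      // PublishToCodeSpace must also validate `target` against the code
      // space metadata described below.
      if (ok) PublishToCodeSpace(code, size, target);
      SwitchBackToCallerStack();
      return ok;
    }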

Before we talk about the verification steps, there’s a bunch of data we need to protect.

  • pkey protected stack:
      ◦ Create a stack at process initialization time, pkey protect it and store it in a global variable. That variable needs to be pkey protected too. We can do this by putting it into a page-aligned struct and protecting it at startup.
      ◦ Since multiple threads can try to run the verification at the same time, we need a mutex for the stack pointer; we can map a new stack after switching so that we hold the lock for as short as possible. Alternatively, we can keep more than one secure stack to reduce contention.
      ◦ Note that we don’t want to allocate a new stack without switching to another secure stack first, since we can’t guarantee that the allocation doesn’t spill the new pointer onto the stack. This could be solved on Linux by doing raw syscalls ourselves.
  • Global variables:
      ◦ We need to assume all global variables are untrusted. If we need to access some, we need to protect them during process initialization as above. This includes, for example, the pkey id we’re using for protected memory.
  • Dynamic memory:
      ◦ Any data we dynamically allocate needs to be pkey protected, for example the mapping into which we copy the JIT code before doing our verification. We can write a std::allocator that only hands out pkey-protected memory (see the sketch after this list).
  • Code space metadata:
      ◦ When copying the verified code into the executable region, we need to know that the target address is a valid code space address (we don’t want to overwrite our stack, for example) and that there’s no existing code that we might partially overwrite.
      ◦ Thus, we need to track all code mappings and free/used memory. We already have to perform all writes to the code space from a pkey-protected thread. In addition to that, we need to perform all code space allocations from a pkey-protected thread too and keep the metadata somewhere secure.
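
For the dynamic memory point above, a minimal sketch of an allocator that only hands out pkey-tagged pages. g_verifier_pkey stands in for the key allocated at startup, and a real implementation would pool pages instead of calling mmap per allocation:

    #include <sys/mman.h>
    #include <cstddef>
    #include <new>

    extern int g_verifier_pkey;  // placeholder: allocated once at startup

    template <typename T>
    struct PkeyAllocator {
      using value_type = T;
      PkeyAllocator() = default;
      template <typename U>
      PkeyAllocator(const PkeyAllocator<U>&) {}

      T* allocate(std::size_t n) {
        void* p = mmap(nullptr, n * sizeof(T), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) throw std::bad_alloc();
        // Tag the pages; only threads that enabled the key can touch them.
        if (pkey_mprotect(p, n * sizeof(T), PROT_READ | PROT_WRITE,
                          g_verifier_pkey) != 0) {
          munmap(p, n * sizeof(T));
          throw std::bad_alloc();
        }
        return static_cast<T*>(p);
      }
      void deallocate(T* p, std::size_t n) { munmap(p, n * sizeof(T)); }
    };

    // Usage: std::vector<uint8_t, PkeyAllocator<uint8_t>> protected_buf;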

What does the verification need to do?

  • Only a subset of machine instructions can be emitted (e.g. no syscalls)
  • All direct branches need to go to aligned instructions. For this, we need to track valid instruction starts for this code object (or take them as input from the assembler).
  • All direct branches need to go to either the same code object, a valid entry of another code object or a runtime/builtin function.
  • If we use indirect branches in the code, we need to verify that they conform with our CFI scheme. In particular, with CFG you need to call a verification function first.
  • For CFI schemes that use endbranch instructions, we need to check that these byte patterns don’t exist aligned or unaligned in the code. To prevent overlaps with the next code block, we can pad the code with any byte not in the pattern, e.g. 0xCC (int3). (See the sketch after this list.)
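
The endbranch check is a plain byte scan, since the dangerous case is exactly the raw pattern at any offset. A sketch for x64’s ENDBR64:

    #include <cstddef>
    #include <cstdint>

    // Reject code containing ENDBR64 (F3 0F 1E FA) at *any* byte offset:
    // x86 has no alignment requirement, so an embedded occurrence would be
    // a valid landing pad for an attacker-controlled indirect branch.
    bool ContainsLandingPad(const uint8_t* code, size_t size) {
      for (size_t i = 0; i + 4 <= size; i++) {
        if (code[i] == 0xF3 && code[i + 1] == 0x0F &&
            code[i + 2] == 0x1E && code[i + 3] == 0xFA) {
          return true;
        }
      }
      return false;
    }

    // Padding with int3 (0xCC) works: it traps if executed, and 0xCC does
    // not occur in the ENDBR64 encoding, so padding can never complete a
    // pattern that straddles two code blocks.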

Finally, we need to allow this new function in our CFI scheme, e.g. by emitting an endbranch instruction at the function start or by calling the right system library function to allow it with CFG.

Variant

If the validation turns out to have too much performance impact, we can also experiment with executing the assembler or the whole compiler in the protected state. The downside is that there will be more data to protect and we expose a bigger attack surface.

Code Space Compaction[bd][be][bf]

The garbage collector moves code objects around to prevent fragmentation. To do this, it needs to perform two dangerous operations:

  • Write to the code space
  • Perform relocations on the code object and all code objects referencing it

An attacker could mess with the data of the GC thread and make it corrupt the code space.

Mitigation[bg][bh]:

We can again run the critical section of this code in a pkey-protected state and ensure that all data we need to trust is also pkey-protected. From the previous section, we already need to track where code objects live. This will allow us to avoid overwriting existing code and to perform the relocations in a secure way.

In addition, we need to make sure to invalidate the CFG entry for the old function.

Signal handlers / Exception handlers

Signal handlers / exception handlers push the CPU’s register state into user memory (e.g. the stack) and load it again when returning from the handler. A concurrent attacker could overwrite the saved register state and gain code execution by overwriting the $pc value, or bypass our mitigations by changing the state of the pkey register.

Mitigation:

On Linux, we can use sigaltstack + pkeys to protect this data. At program start, we set up a sigaltstack that is pkey protected. The signal handler can then enable the pkey as its first instruction before handling the signal. We need to be careful with all data accesses inside the signal handler though since any non-stack data can be attacker controlled and lead to memory corruption.
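
A sketch of the setup (g_sigstack_pkey is a placeholder; note this relies on the OS requirement above that the kernel can deliver signals onto a pkey-protected stack):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <sys/mman.h>

    extern int g_sigstack_pkey;  // placeholder: allocated once at startup

    void ProtectedHandler(int, siginfo_t*, void*) {
      // In the real version, access to the pkey must be enabled before
      // anything touches the stack, i.e. in an assembly trampoline that runs
      // ahead of the compiler-generated prologue. From here on, all
      // non-stack data has to be treated as attacker-controlled.
    }

    void InstallProtectedSignalStack() {
      // Dedicated signal stack, tagged with our pkey so concurrent threads
      // can't modify the saved register state while the handler runs.
      void* stack = mmap(nullptr, SIGSTKSZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      pkey_mprotect(stack, SIGSTKSZ, PROT_READ | PROT_WRITE, g_sigstack_pkey);

      stack_t ss = {};
      ss.ss_sp = stack;
      ss.ss_size = SIGSTKSZ;
      sigaltstack(&ss, nullptr);

      struct sigaction sa = {};
      sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  // deliver on the alternate stack
      sa.sa_sigaction = ProtectedHandler;
      sigaction(SIGSEGV, &sa, nullptr);
    }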

We currently don’t know how to solve this on Windows. Maybe we could ask for OS help[bi][bj], e.g. sigaltstack-like support.

PC Fixups

We have cases in which V8 needs to update the $pc to point to a new location:

  • During optimization/deoptimization we need to switch control flow between interpreter and optimized code.
  • The wasm trap handler updates the $pc inside a signal handler.

Mitigation:

In both cases, we can verify the new $pc values at creation time and store them in pkey-protected memory. For example, when verifying generated wasm machine code, we can remember the fixup values and store them in pkey-protected metadata.

Similarly, when verifying the optimized JavaScript code, we can store the valid entry points in pkey-protected metadata.
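
The check itself could be as simple as a membership test in protected metadata (a sketch; PkeyAllocator refers to the allocator sketched in the JIT section, and all other names are hypothetical):

    #include <cstdint>
    #include <functional>
    #include <set>

    // Valid fixup targets, recorded while verifying the generated code. Both
    // the set's nodes and the pointer to it live in pkey-protected memory,
    // so the attacker cannot add entries.
    using ProtectedPcSet =
        std::set<uintptr_t, std::less<uintptr_t>, PkeyAllocator<uintptr_t>>;
    extern ProtectedPcSet* g_valid_pc_fixups;

    // Called in the pkey-enabled state before redirecting $pc
    // (deopt/tier-up, wasm trap handler).
    bool IsValidPcFixup(uintptr_t target) {
      return g_valid_pc_fixups->count(target) != 0;
    }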

Corrupt data of running threads

As mentioned before, we assume the attacker can overwrite any data from concurrent threads. In particular, many code paths lead to bad results if the attacker overwrites their local variables on the stack while they run.

For example, imagine the garbage collector wants to unmap some page on the heap and the attacker can replace the address with the address of the pkey-protected stack. The kernel would then unmap the stack and could re-use the memory on the next mmap call, effectively clearing the pkey protection.

Mitigation:

Protecting against these attacks will need to focus on the worst cases we can find. E.g. to protect against the munmap example above, we could ask OS vendors to give us a way to seal these mappings or only allow unmapping them if the current thread has pkey write permissions to it.

This could also be avoided if the pkey was part of the address or if we can disable read/write access to the default key in our verifier.

Another bad example would be if the syscall number of a syscall is ever in attacker-writable memory. We’ll need to find such cases and handle them one-by-one.

Command-line Flags

Security features can be disabled with command-line flags. An attacker can overwrite these flags in memory.

Mitigation:

We need to make the security critical flags read-only. To prevent the munmap attack, we can mark the page with a pkey and rely on the solution of the previous section.

Alternatively, we could JIT compile flags into small functions that return the value.
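
A sketch of the read-only variant (the flag layout here is made up; the point is that security-critical flags share a dedicated page that gets frozen after startup):

    #include <sys/mman.h>

    // Hypothetical flag storage: page-aligned so a single mprotect covers
    // exactly the flags and nothing else.
    struct alignas(4096) SecurityFlags {
      bool jit_write_protect = true;
      bool cfi_enforced = true;
    };
    static SecurityFlags g_flags;

    void FreezeFlags() {
      // After command-line parsing, drop write access. In the pkey variant
      // this would be pkey_mprotect instead, so that mprotect/munmap gadgets
      // are covered by the sealing mechanism from the previous section.
      mprotect(&g_flags, sizeof(g_flags), PROT_READ);
    }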

Kernel callbacks

Some syscalls have function pointers as arguments that are later used as callbacks. Signal handlers are one example of this. If the attacker can reach the signal handler registration and replace the function pointer argument, they can gain $pc control.

Mitigation:

This is an actual problem in the V8 code since the signal handler registration is guarded by a global variable and can be reached via a virtual function call. We need to look for similar cases and prevent this pattern through code changes.

Hardware Support

x64

  • Intel Tiger Lake and later support PKU and CET (shadow stacks + landing pads).
  • AMD Zen3 supports PKU and shadow stacks, but afaik no landing pads.

ARM

Chromebook Hardware support

[a]@sroettger@google.com what is the status of this work?  I'd be very interested in helping with this, e.g., by sharing our experience from implementing this approach a long time ago (see https://research.google/pubs/language-independent-sandboxing-of-just-in-time-compilation-and-self-modifying-code/).  CC

@rcvalle@google.com @keescook@google.com

_Assigned to sroettger@google.com_

[b]Protections for JIT and code metadata are implemented and I'm currently working on JIT specific forward-edge CFI.

JIT validation (or trusted translation) is planned for next Q. I'd love to chat with you about it!

[d]smells like NACL. Do we want to mention it here and/or discuss why NACL didn't fly and why the same approach will work here?

[e]What was the issue that nacl was hitting? Since we're talking about code V8 generates from JS/Wasm input perhaps that reduced scope helps?

[f]I don't think there was one issue. And I don't know the details. Perhaps, we may want to consult with the nacl folks to avoid repeating their mistakes?

[g]Yep, would be worth following up with rsc@

I know that, among various problems with NaCl, one was that writing a disassembler that understands *all* semantics of the instruction stream even against adversarial input is incredibly hard.

[h]I worked on NaCl for many years, including writing an x86 NaCl validator. Writing an instruction parser that handles adversarial input is indeed difficult.

One of the conclusions I reached was that it would be better to have a "validating assembler" than a validator: something that takes an instruction stream in a form that's easier to parse than real instructions and that outputs real instructions (and enforces that this output is safe). This avoids some tricky decoding, so it could run faster and be easier to make secure than a validator. This approach might be applicable to V8 CFI.

[i]That sounds like a good idea to me

[j]_Marked as resolved_

[k]_Re-opened_

This isn't covered in the doc so I don't think it should be marked as resolved.

[l]I second what @mseaborn@google.com said here.  I have also written several machine-code validators, and they have got really impractical interactions with software development processes. Both eBPF and Wasm have been very successful verifying "generic assembly" that is lowered to real machine code only after verification.  This approach is highly compatible with software-engineering practices and is not all that much less secure in practice.

[m]_Marked as resolved_

[n]_Re-opened_

[o]Any sense for what portion of Chrome deployments meet these requirements today?

[p]for example, the story told here suggests PKU is a *very* exotic feature in the wild

https://bugs.chromium.org/p/v8/issues/detail?id=11763

[q]The number is definitely very low. There are some ChromeOS devices that support PKU and also recent devices have CET support. I'm still trying to figure out if the newer ones support both.

@clemensb@google.com I think we had some metrics about PKU support in the wild?

[r]We are currently at ~20% on Linux and ~1% on CrOS: go/pku-support

[s]20% seems astonishingly high, but there might be sampling bias for where people are running Chrome on Linux. Any plans to collect data on Windows or Mac?

[t]Mac and Windows do not currently support PKU. Thus we would need to rely on CPU information, which is more difficult to implement.

[u]Seems like we're going to need that information one way or another -- especially if we want to convince platform vendors of the value we can provide if they expose PKU.

Have those discussions started already?

[v]_Marked as resolved_

[w]_Re-opened_

[x]This needs OS support too, and Linux is very far behind on this, unfortunately. Intel has been churning away at patches for literally years now. :|

[y]_Marked as resolved_

[z]_Re-opened_

[aa]How fast is the switching you get with the memory protection mechanisms you're proposing (in nanoseconds), and how fast do you need the switching to be?

[ab]We haven't done any benchmarking on this yet. The speed might influence the design however, i.e. if the switching is too slow, we might have to do verification in batches.

[ac]Do you have a rough idea of the order of magnitude, though?  Do you want to be able to switch in 10ns, 100ns, 1000ns, etc.?

The alternative to using per-thread memory permissions is to use separate processes.  Presumably using per-thread memory permissions (which requires fancy new CPU features that are not yet widely available or widely supported by OSes) only makes sense if it's significantly faster than using separate processes (which doesn't have that requirement).

It seems like this is the assumption behind using per-thread memory permissions that isn't explicitly stated in the doc.

[ad]I do not believe that Windows has PKEY support, yet it has the most users on desktop. Does mandating this requirement mean the impact of the proposal might be less than desired?

[ae]This is definitely a big drawback. But unfortunately we don't have an alternative plan without access to some per-thread memory permissions.

[af]We might be able to do without this if we remove access to the default key.

[ag]errr. is this true? If the attacker has access to syscalls, they could unmap, map shared, and create aliases at the same virtual address. I can't imagine any protection from anything if the attacker has arbitrary access to syscalls.

[ah]We assume that the attacker doesn't have access to arbitrary syscalls but can mess with some syscall arguments of legitimate code (e.g. munmap on a pointer taken from the heap).

There are definitely certain code patterns that we need to search for statically and need to avoid, otherwise the attacker will be able to break the CFI guarantees with the arbitrary read/write.

Some examples what would be game over in our scheme:

* syscall with syscall number taken from memory

* mmap with protections taken from memory

* mprotect +x with an address from memory

etc.

[ai]are you aware of https://docs.google.com/document/d/1UZsQrIMoGRPNCGLZ90SETn_-716WkA-lSIu6NdJwZN4/edit#heading=h.krxf0uaiwgpk ?

(which has tagged landing pads)

[aj]I wasn't. I like the idea of the Landing point label register.

Related, do I understand it correctly that risc-v supports jumping into unaligned instructions? Enforcing instruction alignment would make it easier to check that we never emit these landing pads as part of a constant for example.

I guess we can always enforce alignment in the emitted code though, i.e. before an indirect jmp, do an `& 0xff..ffc` or so.

[ak](not an expert). IIRC, RISC-V has different power-of-2 instruction sizes. @mmaas@google.com ?

[al]RISC-V has an ISA parameter (IALIGN) that specifies the required instruction alignment. This parameter is 32 by default and 16 with the compressed ISA. The expected behavior without the C ("compressed") extension is that anything that is not 32bit-aligned will cause a misaligned instruction exception. The C extension removes these exceptions and I am not 100% sure how it is supposed to behave when trying to jump into a non-aligned instruction; but I would assume it is disallowed.

[am]_Marked as resolved_

[an]_Re-opened_

[ao]I haven't looked at Chrome exploits, but Linux kernel indirect call exploits have tended to call into either fully-controlled memory contents (which an attacker can prepare with, e.g., ENDBR instructions), or into existing functions. Coarse-grained CFI would have had basically zero impact on any of those exploits, so we must be using fine-grained CFI.

[ap]The number of the pkey needs to be only read-only outside of the secure state right? Should this be added under "bunch of data we need to protect"?

You could mprotect(PROT_READ) the page that stores the pkey but then preventing mprotect(PROT_WRITE) becomes another gadget you have to prevent against

[aq]> The number of the pkey needs to be only read-only outside of the secure state right? Should this be added under "bunch of data we need to protect"?

Yeah, good point. I added it to the global variables section.

> You could mprotect(PROT_READ) the page that stores the pkey but then preventing mprotect(PROT_WRITE) becomes another gadget you have to prevent against

We definitely need a way to protect read-only memory against modification. I wrote a bit about that in the "Corrupt data of running threads" section, i.e. we do assume that the attacker will be able to modify some syscall arguments. We might be able to defend mprotect(PROT_WRITE) in code, but others like munmap(controlled) will be even harder.

There are some ideas how to protect against this in userspace, but our preferred solution is to have some kind of memory sealing in the kernel: https://docs.google.com/document/d/1qqVoVfRiF2nRylL3yjZyCQvzQaej1HRPh3f5wj1AS9I/edit

[ar]do you need to copy it again?

[as]You want to switch state from not-executable to executable at this point. So that would either be an mprotect, a memcpy to a different region, or you could copy it bit by bit while you're doing the verification.

I would expect the memcpy to be the fastest since CPUs like linear memory accesses. But that's something to benchmark in practice.

[at]ok, sounds good.

[au]As defense in-depth, can we use seccomp to say "only allow mprotect/mmap with PROT_EXEC if the thread is in the secure state (has RW access to the pkey of the secure state)"? This would require allocating the "verifier" pkey ahead of time and using it in the seccomp BPF code

[av]Great idea! We're also looking into if we can use seccomp to help us prevent munmap(pkey_mem) if you don't have the right pkey

[aw]That's where the trouble is. I don't know how to allow the good actor to leave the secure state while not allowing this to the bad actor. 

my mental model is that the thread's secure state never changes after init. Each thread is dedicated to do just one thing, and threads talk to each other.

[ax]Can you elaborate on this?

Our threat model is that the attacker has an arbitrary read/write primitive but can't execute arbitrary code. If it works, it shouldn't be possible for the attacker to reach code that changes the pkey state and does something nefarious.

[ay]At the very least, it'll violate the defense-in-depth principle. Arbitrary read/write has been well known to lead to code execution. CFI is not going to be perfect, at least forward CFI, and some level of remote execution may remain (e.g. a gadget into code that calls WRPKRU).

[az]right, but at that point there will be more similarly powerful gadgets that the attacker could reach that we can't prevent this way. I.e. the attacker might find a gadget that calls mprotect +x.

I think we'll end up having to do some static analysis to prevent certain classes of powerful gadgets from being reachable, and the wrpkru would be one of them.

[ba]wrpkru is a 3-byte instruction. Last time I checked, there were 2-3 such gadgets in the Chrome binary (in the middle of other instructions). We can ensure they are not present, but it will take some non-trivial compiler/linker work.

[bb]I didn't mean preventing the byte pattern from being present in the binary. Rather we need to make sure that there's no indirect control flow transfer that ends up calling wrpkru.

If the attacker found a way to jump to the middle of existing instructions, CFI is completely broken and I don't think there's anything we can do at this point.

[bc]ack!

[bd]@mlippautz@google.com Hey Michael, are there other cases where the GC needs to write to code pages besides moving code and performing relocations?

Do you think the proposals from this and the previous section are feasible?

_Assigned to mlippautz@google.com_

[be](Will check the details in a bit.)

There are definitely still cases where we write to the page header of a code page. Off the top of my head:

1) markbits

2) slots maintenance (slot sets are externally allocated but lazily allocated, so the slot set pointer would get written)

3) some other metadata we keep around, basically a subset of the fields [1]

I think 2) and 3) could probably easily be avoided (at some cost) with another indirection. For 1) we already have a mechanism for external mark bitmaps which we only used in prototypes. We would need to see how expensive the indirection there is.

As for writing into the code object, besides relocation that you have mentioned, there's one case left:

4) Maintaining the free list in payload. We only do this on the main thread these days (concurrent sweeper is disabled for code space). Arguably we don't write into a code object here (it's free memory) but we write into memory that used to be a code object.

[1] https://source.chromium.org/chromium/chromium/src/+/main:v8/src/heap/memory-chunk-layout.h;l=41;drc=3f264e6e605b35f5bcd64cb6070d70c70c7236a7

[bf]Is the page header part of the RWX mapping? I.e. the first page could just be marked RW-, in which case we don't need the pkey protection, right?

> 4) Maintaining the free list in payload.

Ah interesting. I think mitigating an attack on this is probably similar to the relocation. I.e. we need to have the metadata protected that tell us that this is actually free.

One thing that might be tricky: if we use endbranch instructions for CFI then we can't allow the endbranch byte pattern to show up in pointers. Maybe this is something that the allocator could guarantee.

[bg]Another option is to turn off compaction. @mlippautz@google.com we recently evaluated the impact of code compaction, right?

[bh]Missed that part. Indeed, if this is just about GC when we have non-trivial stack, then yes, we can disable compaction for code space. We already discussed this previously.

A Finch experiment (Canary/Dev) didn't show any significant impact on high-level metrics: https://uma.googleplex.com/p/chrome/variations?sid=ce468887eb6f629df34b0005303c41eb

[bi]I wonder if the old proposed sigreturn work might be interesting here:

https://github.com/KSPP/linux/issues/34

[bj]The attack vector is a bit different here. The SROP mitigation is trying to prevent calling sigreturn if you're not in the context of a signal handler. However, we do assume that a valid signal handler is running and that a second thread can corrupt the signal context before the legitimate sigreturn call.

In our model, the SROP exploitation technique should be blocked by the CFI as it requires the attacker to have RIP control first.