1 of 43

KernelThreadSanitizer (KTSAN)

a data race detector for the Linux kernel

Andrey Konovalov

November 23rd 2015

2 of 43

Userspace tools

  • AddressSanitizer (ASan)
    • detects use-after-free and out-of-bounds accesses

  • ThreadSanitizer (TSan)
    • detects data races and deadlocks

  • MemorySanitizer (MSan)
    • detects uninitialized memory uses

  • UndefinedBehaviorSanitizer (UBSan)
    • detects undefined behaviors in C/C++

3 of 43

Userspace TSan

ThreadSanitizer - a fast data race detector for C/C++ and Go

  • Status
    • C++: Linux / FreeBSD
    • Go: Linux / Mac / Windows / FreeBSD
    • The average slowdown is ~5x
    • The average memory overhead is ~5-10x
    • 1000+ bugs found

  • Easy to use
    • $ gcc -fsanitize=thread main.c
    • $ clang -fsanitize=thread main.c
    • $ go run -race main.go

4 of 43

KernelThreadSanitizer (KTSAN)

5 of 43

Data race

  • A data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write

  • Data races cause undefined behavior
    • According to the C11 / C++11 standard
    • Very hard to debug
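For illustration, a minimal userspace data race that the TSan from slide 3 would report (build with gcc -fsanitize=thread; the program itself is made up):

#include <pthread.h>
#include <stdio.h>

int counter;  /* shared and unprotected */

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++)
        counter++;  /* racy read-modify-write from two threads */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", counter);  /* usually less than 200000 */
    return 0;
}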

6 of 43

Two parts

  • Compiler module
    • Instruments memory accesses

  • Runtime part
    • Bug detection algorithm

7 of 43

Compiler instrumentation

Original code:

void foo(int *p) {
    *p = 42;
}

After instrumentation:

void foo(int *p) {
    __tsan_func_entry(__builtin_return_address(0));
    __tsan_write4(p);
    *p = 42;
    __tsan_func_exit();
}

8 of 43

Runtime part

  • Internals
    • shadow memory
    • memory accesses
    • function entry / exit
    • reporting data races

  • Annotations
    • synchronization primitives (mutexes, atomics, …)
    • threads (create, destroy, start, stop)

9 of 43

Race detection algorithm

  • Happens-Before race detector

  • Based on Vector Clocks

10 of 43

Thread time

  • Per-thread
  • Abstract
  • Incremented on every event

void foo(int *x) {          // time = 1
    mutex_lock(&lock);      // time = 2
    *x = 1;                 // time = 3
    mutex_unlock(&lock);    // time = 4
}                           // time = 5

11 of 43

Vector clocks

  • Each thread has an associated VC:
    • T_i[i] - own current time of T_i
    • T_i[j] - the time of T_j when T_i last synchronized with it

  • Each mutex (or atomic var) has an associated VC:
    • T[i] - the time when thread T_i made its last unlock (or atomic release)

(Diagram: a thread's vector clock T_i[0] ... T_i[N] and a mutex's vector clock T[0] ... T[N])

12 of 43

Vector clocks update: memory access

  • On memory access:

thr->vc[thr->id]++

13 of 43

Vector clocks update: mutex lock

  • On mutex lock (atomic acquire), the thread's vector clock acquires the mutex's vector clock:

for (i = 0; i <= N; i++)
    thr->vc[i] = max(thr->vc[i], mtx->vc[i])

(Diagram: element-wise max flows from the Mutex clock into the Thread clock)

14 of 43

Vector clocks update: mutex unlock

  • On mutex unlock (atomic release), the thread's vector clock releases into the mutex's vector clock:

for (i = 0; i <= N; i++)
    mtx->vc[i] = max(mtx->vc[i], thr->vc[i])

(Diagram: element-wise max flows from the Thread clock into the Mutex clock)
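Putting slides 12-14 together, a minimal sketch of the vector-clock updates (illustrative C, not KTSAN's actual code; the thread limit N and all names are assumptions):

#define N 64  /* illustrative thread limit, not KTSAN's real bound */

typedef unsigned long vc_t[N];

struct tsan_thread { int id; vc_t vc; };
struct tsan_mutex  { vc_t vc; };

/* Memory access: the thread advances its own time. */
static void on_access(struct tsan_thread *thr)
{
    thr->vc[thr->id]++;
}

/* Mutex lock (acquire): the thread clock absorbs the mutex clock. */
static void on_lock(struct tsan_thread *thr, struct tsan_mutex *mtx)
{
    for (int i = 0; i < N; i++)
        if (mtx->vc[i] > thr->vc[i])
            thr->vc[i] = mtx->vc[i];
}

/* Mutex unlock (release): the mutex clock absorbs the thread clock. */
static void on_unlock(struct tsan_thread *thr, struct tsan_mutex *mtx)
{
    for (int i = 0; i < N; i++)
        if (thr->vc[i] > mtx->vc[i])
            mtx->vc[i] = thr->vc[i];
}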

15 of 43

Shadow state

(Diagram) Each 8-byte memory cell maps to several shadow cells (four are shown), and each shadow cell records one access: Thread ID, Timestamp, Position, IsWrite.

16 of 43

Shadow cell

A shadow cell represents one access to some of the bytes in an 8-byte word:

  • 16 bits: Thread ID
  • 42 bits: Timestamp
  • 3 bits: Offset in 8-byte word
  • 2 bits: Size
  • 1 bit: IsWrite
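A sketch of how these 64 bits could be packed (16 + 42 + 3 + 2 + 1 = 64); the bit-field layout and the size encoding as log2 of 1/2/4/8 bytes are assumptions, not KTSAN's literal definition:

#include <stdint.h>

struct shadow_cell {
    uint64_t tid      : 16;  /* thread ID of the accessing thread */
    uint64_t ts       : 42;  /* thread's timestamp at the time of the access */
    uint64_t offset   : 3;   /* offset of the access within the 8-byte word */
    uint64_t size_log : 2;   /* assumed: log2 of the access size (1/2/4/8 bytes) */
    uint64_t is_write : 1;   /* 1 for write, 0 for read */
};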


17 of 43

Example: first access

Write in T1 (offset 0, size 2) is recorded in the first shadow cell:

[ T1 | TS1 | 0:2 | W ]

18 of 43

Example: second access

Read in T2 (offset 4, size 4) is recorded in the next shadow cell:

[ T1 | TS1 | 0:2 | W ]
[ T2 | TS2 | 4:4 | R ]

19 of 43

Example: third access

Read in T3 (offset 0, size 4) is recorded in another shadow cell:

[ T1 | TS1 | 0:2 | W ]
[ T2 | TS2 | 4:4 | R ]
[ T3 | TS3 | 0:4 | R ]

20 of 43

Example: race?

Shadow state after the three accesses:

[ T1 | TS1 | 0:2 | W ]
[ T2 | TS2 | 4:4 | R ]
[ T3 | TS3 | 0:4 | R ]

The new access is compared with each stored shadow cell:

  • Overlap?

  • Different threads?

  • At least one is write?

  • Synchronized?

21 of 43

Example: synchronized?

  • Previous access by T1 at TS1
  • Current access by T3

  • T3.vc[T1] > TS1 => no race
  • T3.vc[T1] < TS1 => RACE

  • T3.vc[T1] is the time of T1 when T3 last synchronized with it
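Combining the last two slides, a sketch of the per-cell check (illustrative; the parameter names are made up, and the >= in the synchronization test assumes the boundary case counts as synchronized):

#include <stdbool.h>

/*
 * old_* describe an access already recorded in a shadow cell;
 * cur_* describe the access happening now; vc_of_old_tid is the
 * current thread's vector-clock entry for the old access's thread.
 */
static bool is_race(int old_tid, unsigned long old_ts,
                    int old_off, int old_size, bool old_write,
                    int cur_tid, unsigned long vc_of_old_tid,
                    int cur_off, int cur_size, bool cur_write)
{
    /* Overlap? The byte ranges inside the 8-byte word must intersect. */
    if (old_off + old_size <= cur_off || cur_off + cur_size <= old_off)
        return false;
    /* Different threads? The same thread never races with itself. */
    if (old_tid == cur_tid)
        return false;
    /* At least one write? Two reads never race. */
    if (!old_write && !cur_write)
        return false;
    /* Synchronized? Happens-before via the vector clock means no race. */
    if (vc_of_old_tid >= old_ts)
        return false;
    return true;
}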

22 of 43

Stack traces

  • Current stack trace => piece of cake

  • Previous stack trace => ummm...

23 of 43

Trace

  • Per-thread cyclic buffer of events
  • Event types:
    • Memory accesses (for exact top stack trace frame)
    • Function entry / exit (for stack traces)
  • 64 bits per event:
    • 16 bits: event type
    • 48 bits: PC or address
  • "Replay" the trace to get stack trace for the interesting memory access
    • Slow to replay (irrelevant, only during report)
    • Fast to add
    • Information is lost after some time (64K events by default)
    • Store current stack trace in the trace header after overflow
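A sketch of the event encoding described above (16-bit type, 48-bit PC or address); the type values and names are illustrative:

#include <stdint.h>

enum event_type { EV_ACCESS = 1, EV_FUNC_ENTRY = 2, EV_FUNC_EXIT = 3 };  /* illustrative values */

#define TRACE_EVENTS (64 * 1024)  /* 64K events per thread */

struct trace {
    uint64_t pos;                      /* ever-growing position in the cyclic buffer */
    uint64_t events[TRACE_EVENTS];
};

/* Pack one event: event type in the top 16 bits, PC or address in the low 48. */
static void trace_add(struct trace *t, enum event_type type, uint64_t pc)
{
    uint64_t ev = ((uint64_t)type << 48) | (pc & ((1ULL << 48) - 1));
    t->events[t->pos++ % TRACE_EVENTS] = ev;  /* old events are overwritten after 64K */
}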

24 of 43

Shadow memory

  • Per-page shadow memory

  • Whenever the kernel allocates a memory page, KTSAN allocates 4 shadow pages to store the access info (four 8-byte shadow cells for every 8-byte word => 4x overhead)

25 of 43

Kernel synchronization primitives

  • spinlock, rwlock
  • mutex, rwsem, percpu_rwsem
  • sema, completion
  • per-cpu variables
  • atomics, memory barriers
  • seqlock
  • RCU
  • ...

26 of 43

Mutex annotations

void __sched mutex_lock(struct mutex *lock)
{
    might_sleep();
    ktsan_mtx_pre_lock(lock, true, false);
    __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
    mutex_set_owner(lock);
    // Current thread acquires mutex: thr->vc[i] = max(thr->vc[i], mtx->vc[i])
    ktsan_mtx_post_lock(lock, true, false, true);
}

void __sched mutex_unlock(struct mutex *lock)
{
    // Current thread releases mutex: mtx->vc[i] = max(mtx->vc[i], thr->vc[i])
    ktsan_mtx_pre_unlock(lock, true);
    __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
    ktsan_mtx_post_unlock(lock, true);
}

27 of 43

Per-cpu annotations

  • Different tasks can access the same per-cpu variable
  • Preemption (or interrupts) must be disabled while accessing it
  • The kernel uses this_cpu_var() (and friends) to access per-cpu variables

  • When preempt_disable() is called, thr->preempt_depth is incremented
  • When this_cpu_var() is called, the thread acquires the variable and puts it into a per-thread list of used per-cpu variables
  • When preempt_enable() is called, thr->preempt_depth is decremented
  • When thr->preempt_depth reaches zero, the thread releases and removes all per-cpu variables from its list
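A rough sketch of that bookkeeping (the hook names, the fixed-size list, and the ktsan_acquire/ktsan_release helpers are all assumptions for illustration):

#define MAX_PERCPU 16  /* assumed bound on per-cpu variables touched in one region */

struct thr_state {
    int preempt_depth;
    void *percpu_vars[MAX_PERCPU];  /* per-cpu variables acquired in this region */
    int n_percpu_vars;
};

/* Assumed hooks into the vector-clock machinery. */
void ktsan_acquire(struct thr_state *thr, void *sync_obj);
void ktsan_release(struct thr_state *thr, void *sync_obj);

void on_preempt_disable(struct thr_state *thr)
{
    thr->preempt_depth++;
}

void on_this_cpu_access(struct thr_state *thr, void *var)
{
    ktsan_acquire(thr, var);                           /* thread acquires the variable */
    if (thr->n_percpu_vars < MAX_PERCPU)
        thr->percpu_vars[thr->n_percpu_vars++] = var;  /* remember it for later release */
}

void on_preempt_enable(struct thr_state *thr)
{
    if (--thr->preempt_depth == 0) {
        /* Release every per-cpu variable used while preemption was disabled. */
        while (thr->n_percpu_vars > 0)
            ktsan_release(thr, thr->percpu_vars[--thr->n_percpu_vars]);
    }
}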

28 of 43

Kernel atomics API

  • atomic_set / atomic_load / atomic_add / atomic_inc / ...
    • relaxed
  • xchg / cmpxchg / atomic_inc_return / …
    • release-acquire
  • xadd
    • not documented at all
    • release-acquire based on the source code
  • WRITE_ONCE / READ_ONCE (ACCESS_ONCE)
    • documented as macros to prevent compiler reordering
    • used as relaxed stores / loads
  • smp_store_release / smp_load_acquire
    • used as store-release / load-acquire
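For context, the canonical publish/consume pairing these two are used for (kernel-style snippet; the struct and variable names are made up):

/* Writer: initialize the object, then publish the pointer with release
 * semantics so a reader that sees the pointer also sees the contents. */
struct item { int data; };
struct item *shared_item;

void publish(struct item *it)
{
    it->data = 42;
    smp_store_release(&shared_item, it);
}

/* Reader: load the pointer with acquire semantics before dereferencing. */
int consume(void)
{
    struct item *it = smp_load_acquire(&shared_item);
    return it ? it->data : -1;
}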

29 of 43

Atomic annotations

  • Easy for operations with release or acquire semantics
    • smp_store_release(): thr->clock releases atomic->clock
    • smp_load_acquire(): thr->clock acquires atomic->clock

30 of 43

Atomic annotations, continued

  • Not so easy for relaxed operations and standalone memory barriers
    • Each thread has additional release and acquire vector clocks
    • atomic_read(): thr->acquire_clock acquires atomic->clock
    • smp_rmb(): thr->clock acquires thr->acquire_clock
    • atomic_set(): thr->release_clock releases atomic->clock
    • smp_wmb(): thr->clock releases thr->release_clock

  • Effectively turns
    • atomic_read() + smp_rmb() into load-acquire
    • smp_wmb() + atomic_set() into store-release

  • Barrier effects are forgotten after some time (a few function entries / exits)
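A sketch of those extra clocks in action (illustrative; N, the type names, and vc_join are assumptions, following the acquire/release convention from slides 13-14):

#define N 64  /* illustrative thread limit */
typedef unsigned long vc_t[N];

/* dst = element-wise max(dst, src) */
static void vc_join(vc_t dst, const vc_t src)
{
    for (int i = 0; i < N; i++)
        if (src[i] > dst[i])
            dst[i] = src[i];
}

struct thr  { vc_t clock, acquire_clock, release_clock; };
struct sync { vc_t clock; };

/* atomic_read(): the pending acquire is remembered but not applied yet. */
void on_atomic_read(struct thr *t, struct sync *a) { vc_join(t->acquire_clock, a->clock); }

/* smp_rmb(): pending acquires take effect, so atomic_read + smp_rmb == load-acquire. */
void on_smp_rmb(struct thr *t) { vc_join(t->clock, t->acquire_clock); }

/* smp_wmb(): snapshot the thread clock for future releases. */
void on_smp_wmb(struct thr *t) { vc_join(t->release_clock, t->clock); }

/* atomic_set(): publish only what the last smp_wmb captured, so smp_wmb + atomic_set == store-release. */
void on_atomic_set(struct thr *t, struct sync *a) { vc_join(a->clock, t->release_clock); }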

31 of 43

Headaches

  • Scheduler
  • Interrupts

  • Memory consumption

  • Benign data races

  • Stubborn kernel developers

32 of 43

Scheduler

  • The scheduler is part of the kernel =>
    • Instrumented (or uses instrumented code)
    • Uses synchronization primitives (parasitic synchronization)

  • KTSAN ignores most of the events coming from the scheduler
    • Except for task-related events (start, stop, …)

33 of 43

Interrupts

  • Can’t ignore interrupts
    • Some synchronization happens through them

  • Currently each interrupt runs in a separate virtual thread
    • Allows detecting races between interrupt handlers and the interrupted code

34 of 43

Memory consumption

  • The kernel
    • simultaneously creates a large number of tasks (~1K during boot; many more can be created by starting processes in user space)
    • uses a large number of synchronization primitives (~60K during boot, ~1M under a moderate workload)
  • Each synchronization primitive has a vector clock of size N, where N is the number of tasks
  • As a result, KTSAN requires ~16 GB of RAM under a moderate workload

  • Observation: only a few tasks can run simultaneously (no more than the number of CPUs); most of them sleep
  • Idea: reuse KTSAN threads for different kernel tasks

35 of 43

Per-CPU mode

  • Create one KTSAN thread for each CPU

  • Pros
    • Doesn’t require much memory (~1 GB)
    • ~3 times faster
    • Possible to find data races in scheduler
  • Cons
    • Can miss some data races

36 of 43

Per-CPU mode

(Chart comparing the number of data-race reports: green = normal mode, red = per-cpu mode)

37 of 43

Benign data races in the kernel

  • GCC used to guarantee that machine-word-sized accesses would be atomic
    • not anymore
    • a lot of kernel code relies on this
    • a lot of benign races as a result

  • Benign data races
    • cause undefined behavior according to the new standard
    • do not allow any formal verification

  • LOTS of reports
  • Easy to miss real races

38 of 43

Stubborn kernel developers

“The Linux kernel largely ignores the C memory model definition, and relies on practical compiler behavior.

So-called 'data races' are common in kernel code.”

Peter Hurley (https://lkml.org/lkml/2015/8/25/707)

“Also if the following is true:

> As the consequence C compilers stopped guarantying that "word accesses are atomic".

a lot of stuff will break in the kernel. Maybe compilers should stop moving towards the lala land?”

Dmitry Torokhov (https://lkml.org/lkml/2015/9/4/547)

39 of 43

KTSAN status

  • Prototype available
    • Work in progress
    • x86-64 only
  • Kernel slowdown
    • ~6x for normal mode
    • ~2x for per-cpu mode
  • Kernel memory usage
    • ~16 GB for normal mode
    • ~1 GB for per-cpu mode

40 of 43

Trophies

  • 23 bugs found so far
    • Mostly non-atomic accesses to shared variables
    • Some missed barriers

  • CVE-2015-7613
    • Local Privilege Escalation

41 of 43

Report example

ThreadSanitizer: data-race in ipc_obtain_object_check

Read at 0xffff88047f810f68 of size 8 by thread 2749 on CPU 5:
 [<ffffffff8147d84d>] ipc_obtain_object_check+0x7d/0xd0 ipc/util.c:621
 [<     inline     >] msq_obtain_object_check ipc/msg.c:90
 [<ffffffff8147e708>] msgctl_nolock.constprop.9+0x208/0x430 ipc/msg.c:480
 [<     inline     >] SYSC_msgctl ipc/msg.c:538
 [<ffffffff8147f061>] SyS_msgctl+0xa1/0xb0 ipc/msg.c:522
 [<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188

Previous write at 0xffff88047f810f68 of size 8 by thread 2755 on CPU 4:
 [<ffffffff8147cf97>] ipc_addid+0x217/0x260 ipc/util.c:257
 [<ffffffff8147eb4c>] newque+0xac/0x240 ipc/msg.c:141
 [<     inline     >] ipcget_public ipc/util.c:355
 [<ffffffff8147daa2>] ipcget+0x202/0x280 ipc/util.c:646
 [<     inline     >] SYSC_msgget ipc/msg.c:255
 [<ffffffff8147efaa>] SyS_msgget+0x7a/0x90 ipc/msg.c:241
 [<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188

42 of 43

Report example, continued

220  int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size)
     ...
     // Publish new (make it accessible to other threads)
240  id = idr_alloc(&ids->ipcs_idr, new,
241                 (next_id < 0) ? 0 : ipcid_to_idx(next_id), 0,
242                 GFP_NOWAIT);
     ...
     // Initialize new
252  current_euid_egid(&euid, &egid);
253  new->cuid = new->uid = euid;
254  new->gid = new->cgid = egid;

  • The object is published before it is properly initialized
  • In that window another thread can reach the object and perform the uid check on uninitialized garbage, which can falsely grant access to the shared object to a process that should not have it
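One natural fix for this class of bug is to finish initializing the object before publishing it; a sketch based on the excerpt above (not necessarily the literal upstream patch):

// Initialize new first...
current_euid_egid(&euid, &egid);
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;

// ...and only then publish it (make it reachable by other threads).
id = idr_alloc(&ids->ipcs_idr, new,
               (next_id < 0) ? 0 : ipcid_to_idx(next_id), 0,
               GFP_NOWAIT);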

43 of 43

Thank you!

Questions?

Andrey Konovalov, andreyknvl@gmail.com

Dmitry Vyukov, dvyukov@google.com