Published using Google Docs
[PUBLIC] seccomp-bpf filter in Minijail v2
Updated automatically every 5 minutes

        Page  of

Minijail

[PUBLIC] seccomp-bpf filter in Minijail v2

Summary

Introduce a new version of the seccomp-bpf policy syntax with features to make it more usable, and move the compilation of the filters off-process.

Author: lhchavez

Contributors:

Intended audience: 

Status:[1]In Review
Created: 2018-11-30

Intentionally left blank, for spacing.

Objective

Now that Minijail can load arbitrary seccomp-bpf programs, the possibility to move the compilation of the policy to a separate process is now open. Without the limitations of writing an in-process compiler that needs to be fast, small, simple, and have no build or runtime dependencies, we are introducing a new syntax that will only be supported by a separate out-of-process compiler that is more usable, less error-prone. The out-of-process compiler can also perform further optimizations to the generated code.

Background

Minijail currently parses and emits the seccomp-bpf program from a policy file as part of libminijail. Since this is done as part of a library and usually as part of the process of launching a new container, it needs to be very fast and not use a lot of additional resources. The way this is achieved is by doing a simple two-pass compilation: in the first pass, the policy file is parsed while the complete BPF filter is created in a two-layered singly-linked list with most jump sources/targets being encoded within the filter nodes themselves, using only an external dictionary of jump labels to filter entries to aid in the jump address resolution later. Since BPF forbids loops (i.e., there are no back edges), once the filter entries are allocated, they are copied into a contiguous array in reverse order so that all the jump targets are in the right place by the time when the jump sources need to be emitted.

The parser also tries to avoid compile-time complexity, since the project needs to be built in both Android and Chrome OS, which have different build systems.

Detailed design

Policy syntax definition

This is the new syntax in EBNF:

policy-file = policy-line , { { '\n' } , policy-line }

            ;

policy-line = statement , [ comment ]

            | comment
           ;

statement = include-statement

          | frequency-statement

          | default-statement

          | filter-statement

          ;
include-statement = '@include' , posix-path

                  ;
frequency-statement = '@frequency' , posix-path

                    ;

default-statement = '@default' , action

                  ;
filter-statement = '{'
, syscall-descriptor , [ { ',', syscall-descriptor } ] , '}' ,

                   ':' , filter

                 | syscall-descriptor , ':' , filter

                 ;
syscall-descriptor = syscall-name , [ metadata ]

                   | libc-function , [ metadata ]
                  ;

syscall-name = identifier

             ;

libc-function = identifier , '@libc'

              ;
filter = '{' , single-filter , [ { ',' , single-filter } ] , '}'

       | single-filter

       ;

single-filter = action

              | filter-expression , [ ';' , action ]

              ;
filter-expression = clause , [ { '||' , clause } ]

                  ;

clause = atom , [ { '&&' , atom } ]

       ;

atom = argument , op , value

     ;

argument = 'arg' , digit

         ;

op = '=='

   | '!='

   | '<='

   | '<'

   | '>='

   | '>'

   | '&'

   | 'in'

   ;

value = constant , [ { '|' , constant } ]

      ;
constant = [ '~' ] , '(' , value , ')'

         | [ '~' ] , single-constant
        ;

single-constant = identifier

                | numeric-constant

                ;

action = 'allow' | '1'
      | 'kill'
      | 'trap'
      | 'return' , single-constant
      ;
metadata = '[' , key-value-pair , [ { ';' , key-value-pair } ] , ']'

         ;
key-value-pair = identifier , '=', identifier , [ { ',' , identifier } ]

               ;
comment = '#' { ? any character ? }

        ;

numeric-constant = ['-'] , { digit }
                | ['-'] , '0x' , { hex-digit }
                | ['-'] , '0o' , { octal-digit }

                 ;
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

      ;
octal-digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'

            ;

hex-digit = digit

          | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

          | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'

          ;

posix-path = ? a sequence of characters that can be interpreted as a path ?

           ;

Frequency file syntax definition

This is the syntax in EBNF for the frequency file:

frequency-file = frequency-line , { { '\n' } , frequency-line }

            ;

frequency-line = statement , [ comment ]

               | comment
              ;

statement = syscall-descriptor , ':' , number

          ;
syscall-descriptor = ( syscall-name

                     | libc-function

                     ) [ metadata ]

                   ;

syscall-name = identifier

             ;

libc-function = identifier , '@libc'

              ;

number = { digit }

       ;
comment = '#' { ? any character ? }

        ;
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

      ;

posix-path = ? a sequence of characters that can be interpreted as a path ?

           ;

 New semantics

ioctl: arg1 == TCGETS

ioctl: arg1 == 0xf00; return ENOSYS

is acceptable, but the following is not:

ioctl: allow

ioctl: arg1 == 0xf00; return ENOSYS

ioctl: { arg1 == TCGETS; allow, arg1 == 0xf00; return ENOSYS }

{ mmap[arch=x86,x86_64], mmap64[arch=arm,arm64] }: allow
# alternatively

mmap@libc: allow

Off-process compiler / optimizer

There is an implementation of an optimizing compiler written in Python. It needs to know the mapping of syscall names to numbers, as well as the libc function names to syscalls per-architecture and can be obtained by running parse-seccomp-policy --dump. Similar to the current implementation, the emitted code is structured into three contiguous sections:

The following optimizations are considered:

@default kill

ioctl: { arg1 == TCGETS; allow, arg1 == TCSETSF; return ENOSYS }
# is compiled to
 ld $data[24]                           # Load the upper 32 bits

                                         # of arg1
 jeq 0  true:lo false:kill

lo:

  ld $data[28]                           # Load the lower 32 bits
                                        # of arg1

  jeq 21505  true: allow false: tcsetsf

tcsetsf:

  jeq 21508  true: enosys false: kill    # No need to load or compare

                                         # the upper 32 bits again!
allow:

  ret ALLOW

enosys:

  ret ERRNO(38)
kill:

  ret KILL

Syscall decision tree cost model

The cost to evaluate a BPF program has been experimentally measured[2] to be proportional to the number of filter blocks in it. There are two ways to encode a decision tree in BPF: a linear comparison chain (which is just a series of equality comparisons between the register and syscall numbers), and a binary search tree. The intermediate nodes of the BST can also be modeled in the same way, so it is possible to embed a linear comparison chain into a BST node if it would perform fewer operations overall. The nodes of the BST can also be rotated so that syscalls that are called more frequently appear higher in the tree.

The compiler produces the syscall decision tree that minimizes the number of operations that would need to be done if each syscall in the policy was invoked, multiplied by the frequency of that syscall. The compiler does this the following way, given a list of syscalls as input:

This last step of choosing all of the possible cuts results in evaluating all interesting tree rotations, which translates to the compiler being able to choose the BST configuration that minimizes the overall cost, while preserving the BST property.

libc function-to-syscall mapping

This can be obtained by running static analysis over libc.so. There is a prototype written using llvm's MCDisassembler that is promising so far.

Alternatives considered

libseccomp

libseccomp has a concept of pseudo-syscalls, which are aliases for syscalls in each architecture, but it doesn't keep track of the latest libc implementation. This is more complicated for Android since it uses a different implementation of libc. libseccomp also does not perform optimizations on the generated code beyond what Minijail already does: ordering the syscalls by frequency.

clang's eBPF backend

clang can compile C code to eBPF programs, which is mostly used for Linux tracing. Unfortunately, both of the outputs that can be obtained by clang are not usable:

It also would not be able to perform optimizations w.r.t. the order of the syscalls since there is no way to express this simply in C unless the compiler would generate the code in an optimized fashion, which defeats most of the purpose of offloading the optimization to clang.


[1] Draft = Don’t use or rely on any of this material. WIP = Somewhat reliable content, but expect changes/deletions/additions. In Review = Reviewers are reviewing this document, authors are making suggested changes. Final = The content of this document will not change anymore. Approved by foo@: This document was approved by foo@, and its status is Final.

[2] [citation needed]. I did the measurement several years ago to be about 0.6ns, but need fresher data.