Eigenstate :: How to sandbox code under Linux with Seccomp

Denying Syscalls with Seccomp

The web is an ugly place, and anything that touches it in any significant capacity is likely to be attacked. Programs are full of bugs and exploits, and eliminating all compromises is unlikely to be within the reach of most mortals.

So, the problem at hand -- given a compromised program, how can we keep it from doing too much damage? What can we do to restrict the attack surface, and what mechanisms are available for that?

More specifically, imagine we had a program that looked something like this:

int main(int argc, char **argv) {
    /* the goal is to say hello world */
    printf("hello there!\n");

    /*
    but if you know the secret password, you can
    compromise the program, and we start sending
    spam.
    */
    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        /* ...start spamming */
    }
}

Traditionally, chroot has been used to lock down applications, but it's clearly not enough in this case: Sockets don't touch the file system, and there's a lot of malicious stuff you can do from inside a chroot.

OpenBSD's pledge has been getting lots of press lately, but it's OpenBSD-specific, and much of the world runs on Linux. Thankfully, Linux has seccomp, which is far more complicated, but also more powerful, as a tool for restricting which system calls a process can make.

Seccomp isn't a complete sandbox, but it works with many other parts that Linux provides -- rlimit, process and network namespaces, cgroups, and more -- as a building block for jailing a process.

Seccomp

Seccomp was initially added to Linux in 2005, but in its initial form, seccomp was extremely limited. It would restrict your program to exactly four system calls, killing it if it tried to do anything outside of those:

exit()
sigreturn()
read()
write()

This meant that the restricted process would need to ask a parent process to make any other system call on its behalf, and would have severe difficulty doing anything as simple as a dynamic memory allocation. As a sandboxing mechanism, this wouldn't fly, and soon enough, seccomp grew the ability to allow or deny system calls with relatively fine-grained control.

The initial mode is still available, and you can request it on your process using:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

After this call, your process will be restricted, and any system call other than the four listed above will lead to a premature death.
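
To watch strict mode at work without sacrificing the process you care about, one option is to fork a child and restrict only that. The sketch below is mine, not part of any API: run_strict_child and strict_mode_demo are hypothetical names. It assumes a kernel built with seccomp support, and that the process is not already running under a seccomp filter (as it may be inside some containers), in which case the prctl() call can fail and the helper reports -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, put it into strict mode, and report
 * how it ended. Returns 0 on a clean exit, the signal number if the child
 * was killed, and -1 if strict mode could not be enabled.
 */
int run_strict_child(int misbehave)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);               /* not restricted yet, so _exit() is fine */
        if (misbehave)
            syscall(SYS_getpid);    /* not one of the four: SIGKILL */
        if (write(STDOUT_FILENO, "still alive\n", 12) < 0)
            syscall(SYS_exit, 1);
        /* _exit() would call exit_group(), which is not allowed */
        syscall(SYS_exit, 0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Usage: either strict mode is unavailable (-1) or the outcomes below hold. */
void strict_mode_demo(void)
{
    int good = run_strict_child(0);
    int bad = run_strict_child(1);

    assert(good == 0 || good == -1);
    assert(bad == SIGKILL || bad == -1);
}
```

Note that the well-behaved path has to use the raw exit system call. The C library's _exit() goes through exit_group(), which is not one of the four allowed calls.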

The mode that is more flexible, and more useful, is filtered mode, which uses a Berkeley Packet Filter program to decide which system calls are allowed. The code to put the filter in place, assuming you've defined the filter program structure (called filterprog here) appropriately, is below:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog);

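Of course, a filter could also be used to confuse a more privileged program: an unprivileged process could install a filter and then exec a setuid binary, which would then run with system calls failing in ways it never expected. To prevent that, Linux refuses to install a filter unless the process either has CAP_SYS_ADMIN or has given up the ability to gain new privileges. (A filter can never be loosened afterwards; installing another filter only adds restrictions.) So before we set the filter, we need to drop the ability to gain privileges, one way or another. The simplest is with another prctl() call: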

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
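
As a small sanity check, the bit can be read back with PR_GET_NO_NEW_PRIVS. The helper below is a sketch with a name of my own choosing (lock_privileges); it only wraps the two prctl() calls and reports whether the bit is set.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the no_new_privs bit and confirm that it stuck.
 * Returns 0 on success, -1 on failure.
 */
int lock_privileges(void)
{
    int bit;

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("PR_SET_NO_NEW_PRIVS");
        return -1;
    }
    /* PR_GET_NO_NEW_PRIVS returns the current value of the bit */
    bit = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
    assert(bit == 1);
    return bit == 1 ? 0 : -1;
}
```

Once set, the bit cannot be cleared again, and it is inherited across fork() and preserved across execve(), so it is worth setting as late as the program's design allows.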

This leads us to the next question: What is the filter data structure, and how is it set up?

For this, we'll need to make a short diversion and look at the Berkeley Packet Filter.

An Aside: BPF

The Berkeley Packet Filter is a programmable packet filtering and classification system that runs within the kernel. It was initially put in place to handle packet filtering and monitoring, without a roundtrip to userspace for every packet that needed to be inspected.

BPF takes the form of a virtual machine in the kernel, processing a simple, restricted instruction set, and returning an integer to the kernel that describes what to do with it. The BPF instruction set includes most arithmetic operations, loads and stores, and forward jumps -- no backward jumps are allowed, in order to guarantee that filter programs will terminate.

The BPF instructions operate on the BPF virtual machine, which has four main elements: The accumulator register A, the index register X, the packet memory, and the scratch memory M[].

The full list of instructions is below:

Operator	Effect
Loads
ld	Load word into `A`
ldi	Load immediate word into `A`
ldh	Load half-word into `A`
ldb	Load byte into `A`
ldx	Load word into `X`
ldxi	Load immediate word into `X`
ldxb	Load byte into `X`
Stores
st	Store `A` into `M[]`
stx	Store `X` into `M[]`
Jumps
jmp	Jump to offset
ja	Jump to offset
jeq	Jump on `A == k`
jneq	Jump on `A != k`
jne	Jump on `A != k`
jlt	Jump on `A < k`
jle	Jump on `A <= k`
jgt	Jump on `A > k`
jge	Jump on `A >= k`
jset	Jump on `A & k`
Arithmetic
add	`A` + <x>
sub	`A` - <x>
mul	`A` * <x>
div	`A` / <x>
mod	`A` % <x>
neg	-`A`
and	`A` & <x>
or	`A` | <x>
xor	`A` ^ <x>
lsh	`A` << <x>
rsh	`A` >> <x>
Misc
tax	Copy `A` into `X`
txa	Copy `X` into `A`
ret	Return

The operand is treated in one of several ways, depending on the addressing mode:

BPF_IMM: Use the instruction's `k` field as an immediate value.
BPF_ABS: Use the instruction's `k` field as an index into packet memory.
BPF_IND: Use the instruction's `k` field as an index into packet memory, adding the contents of the `X` register to the offset.
BPF_MEM: Use the instruction's `k` field as an index into scratch memory `M[]`.
BPF_LEN: A special value that loads the size of the packet.
BPF_MSH: An efficient way to load the length of an IP header, computed as 4 * (pkt[k] & 0xf), into `X`.

The size of the loaded value can be one of:

Size	Bytes
W	4
H	2
B	1

The instructions themselves are encoded in the following structure:

struct sock_filter {
    uint16_t code;  /* the opcode */
    uint8_t jt; /* if true: jump displacement */
    uint8_t jf; /* if false: jump displacement */
    uint32_t k; /* immediate operand */
};

The opcode for each instruction is constructed by taking the base instruction class, adding in the specific operation or size, and finally adding in the addressing mode or operand source. For example, if I wanted to construct a division that divided the A register by the constant 24, I would take the base code for BPF_ALU, add BPF_DIV to mark the ALU operation as a division, and add BPF_K to say that the operand is the k field (BPF_X would pick the X register instead). ALU operations can't address packet or scratch memory directly, so a value that lives there has to be loaded first, for example with BPF_LD + BPF_W + BPF_ABS for a 4 byte word in packet memory. The jump displacements jt and jf are unused by ALU operations, so the last value, k, holds the constant.

struct sock_filter div_insn = {
    .code = BPF_ALU + BPF_DIV + BPF_K,
    .k = 24
};
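
To double-check how the pieces of an opcode add up, the kernel headers also provide masks for pulling an opcode apart again: BPF_CLASS, BPF_SIZE, BPF_MODE, BPF_OP and BPF_SRC. The sketch below uses them; the struct and function names (opcode_parts, split_opcode, opcode_examples) are mine. Keep in mind that which fields are meaningful depends on the instruction class, since size and mode apply to loads and stores while op and src apply to ALU operations and jumps.

```c
#include <assert.h>
#include <linux/filter.h>

/* The pieces of a classic BPF opcode, pulled apart with the kernel's masks. */
struct opcode_parts {
    unsigned cls;   /* BPF_LD, BPF_ALU, BPF_JMP, BPF_RET, ... */
    unsigned size;  /* BPF_W, BPF_H, BPF_B (loads and stores) */
    unsigned mode;  /* BPF_IMM, BPF_ABS, BPF_IND, BPF_MEM, ... (loads and stores) */
    unsigned op;    /* BPF_ADD, BPF_DIV, BPF_JEQ, ... (ALU and jumps) */
    unsigned src;   /* BPF_K or BPF_X (ALU and jumps) */
};

struct opcode_parts split_opcode(unsigned short code)
{
    struct opcode_parts p = {
        .cls = BPF_CLASS(code),
        .size = BPF_SIZE(code),
        .mode = BPF_MODE(code),
        .op = BPF_OP(code),
        .src = BPF_SRC(code),
    };
    return p;
}

/* Usage: a half-word load from packet memory, and a division by a constant. */
void opcode_examples(void)
{
    struct opcode_parts ld = split_opcode(BPF_LD + BPF_H + BPF_ABS);
    struct opcode_parts alu = split_opcode(BPF_ALU + BPF_DIV + BPF_K);

    assert(ld.cls == BPF_LD && ld.size == BPF_H && ld.mode == BPF_ABS);
    assert(alu.cls == BPF_ALU && alu.op == BPF_DIV && alu.src == BPF_K);
    assert(BPF_JMP + BPF_JEQ + BPF_K == 0x15);
}
```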

There are macros provided by the Linux kernel for producing these instructions -- one for statements, which ignore the jt and jf fields, and another for jumps.

#define BPF_STMT(code, k) \
    { (unsigned short)(code), 0, 0, k }
#define BPF_JUMP(code, k, jt, jf) \
    { (unsigned short)(code), jt, jf, k }

So, in order to construct a sequence of instructions, one would build an array statically, as follows:

struct sock_filter filter[] = {
    /* A <- pkt[666:666 + 4] */
    BPF_STMT(
        BPF_LD + BPF_ABS + BPF_W,   /* opcode */
        666),   /* k value */
    /* if A == 123: jump forward 7; else: jump forward 9 */
    BPF_JUMP(
        BPF_JMP + BPF_JEQ + BPF_K,  /* opcode */
        123,    /* constant to compare against */
        7,  /* jump displacement if true */
        9), /* jump displacement if false */
    /* ...the instructions being jumped over, ending in a return */
};

Finally, all BPF sequences must end with a return, which is used by the OS to decide what to do with the packet.

These filters are then encapsulated in a program header, which would be defined as follows:

struct sock_fprog filterprog = {
    .len = sizeof(filter)/sizeof(filter[0]),
    .filter = filter
};

The program is then passed to the kernel, which stores it, and executes it when needed.
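
To make the execution model concrete, here is a toy userspace interpreter for a handful of the opcodes above. It is a sketch for illustration only and not how the kernel implements BPF: it models just the accumulator, leaves out X and M[], and treats anything it does not recognize as "return 0". The names toy_bpf_run, first_word_is_123 and toy_bpf_check are my own.

```c
#include <assert.h>
#include <linux/filter.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy interpreter for a few classic BPF opcodes. Returns the value of the
 * ret instruction, or 0 on anything it does not understand.
 */
uint32_t toy_bpf_run(const struct sock_fprog *prog, const void *pkt, size_t len)
{
    uint32_t A = 0;

    for (size_t pc = 0; pc < prog->len; pc++) {
        const struct sock_filter *insn = &prog->filter[pc];

        switch (insn->code) {
        case BPF_LD + BPF_W + BPF_ABS:
            if (len < 4 || insn->k > len - 4)
                return 0;       /* load outside the packet */
            /* host byte order, as for seccomp data; socket filters use network order */
            memcpy(&A, (const uint8_t *)pkt + insn->k, 4);
            break;
        case BPF_ALU + BPF_ADD + BPF_K:
            A += insn->k;
            break;
        case BPF_JMP + BPF_JA:
            pc += insn->k;      /* displacements count from the next instruction */
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:
            pc += (A == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET + BPF_K:
            return insn->k;
        case BPF_RET + BPF_A:
            return A;
        default:
            return 0;
        }
    }
    return 0;   /* ran off the end; a real kernel rejects such programs up front */
}

/* Usage: returns 1 when the first word of the "packet" is 123, else 0. */
uint32_t first_word_is_123(uint32_t word)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, 0),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 123, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, 1),
        BPF_STMT(BPF_RET + BPF_K, 0),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    return toy_bpf_run(&prog, &word, sizeof(word));
}

void toy_bpf_check(void)
{
    assert(first_word_is_123(123) == 1);
    assert(first_word_is_123(7) == 0);
}
```

The jump handling is the part worth staring at: a displacement of 0 means "the next instruction", which is why jt and jf are added to pc before the loop increments it.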

Applying BPF

Now that we've covered how BPF works, we can talk about how it's used when filtering system calls. Once we install the filter for seccomp to use, seccomp will send it "packets" that represent system calls. Each packet passed to the filter looks like this:

struct seccomp_data {
    int nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};
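
Since BPF loads address this structure by byte offset, it helps to know where each field sits. The sketch below checks the layout with offsetof, and adds two hypothetical helpers of my own (arg_lo_offset and arg_hi_offset) for the syscall arguments: BPF loads are at most 32 bits wide, so inspecting a 64-bit argument takes two loads, and which half comes first depends on the byte order of the machine. It assumes a GCC or Clang style compiler that predefines __BYTE_ORDER__.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Position of the low and high 32-bit halves inside a 64-bit argument. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define LO_WORD 0
#define HI_WORD 4
#else
#define LO_WORD 4
#define HI_WORD 0
#endif

/* Hypothetical helpers: byte offsets of the two halves of args[i]. */
size_t arg_lo_offset(unsigned i)
{
    return offsetof(struct seccomp_data, args) + i * sizeof(__u64) + LO_WORD;
}

size_t arg_hi_offset(unsigned i)
{
    return offsetof(struct seccomp_data, args) + i * sizeof(__u64) + HI_WORD;
}

/* The layout a filter relies on when it uses BPF_ABS loads. */
void seccomp_data_layout_check(void)
{
    assert(offsetof(struct seccomp_data, nr) == 0);
    assert(offsetof(struct seccomp_data, arch) == 4);
    assert(offsetof(struct seccomp_data, instruction_pointer) == 8);
    assert(offsetof(struct seccomp_data, args) == 16);
    assert(sizeof(struct seccomp_data) == 64);
}
```

In a filter, these offsets would be used as the k value of a BPF_LD + BPF_W + BPF_ABS instruction, in the same way as offsetof(struct seccomp_data, nr).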

From there, we use BPF to pick out the values we would like to inspect from the seccomp data, analyze them, and then return one of the following five values:

SECCOMP_RET_KILL    /* kill the task immediately */
SECCOMP_RET_TRAP    /* disallow and force a SIGSYS */
SECCOMP_RET_ERRNO   /* returns an errno */
SECCOMP_RET_TRACE   /* pass to a tracer or disallow */
SECCOMP_RET_ALLOW   /* allow */

A word of caution -- before checking system call numbers, we must always verify the architecture. Since system call numbers in Linux may vary across architectures, neglecting to verify the architecture can lead to banning, or worse, allowing, the wrong system calls.

Generally, for unexpected system calls, we will want to return SECCOMP_RET_KILL. Letting the program carry on after a SIGSYS or an ENOSYS allows a malicious or compromised program to probe our defenses, and possibly find some loophole that can be exploited.
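
If a softer failure is wanted anyway, for instance while developing the filter, it is worth knowing how SECCOMP_RET_ERRNO picks its error: the low 16 bits of the return value (SECCOMP_RET_DATA) carry the errno that the system call will appear to fail with. A minimal sketch, with helper names (ret_errno, ret_errno_check) that are mine:

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* Hypothetical helper: build a filter return value that fails with err. */
uint32_t ret_errno(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
}

/* Usage: the action and the errno can be separated again with the masks. */
void ret_errno_check(void)
{
    uint32_t v = ret_errno(EPERM);

    assert((v & SECCOMP_RET_ACTION) == SECCOMP_RET_ERRNO);
    assert((v & SECCOMP_RET_DATA) == (uint32_t)EPERM);
}
```

In a statically initialized filter, the same expression can be written inline as the k value of a BPF_RET + BPF_K instruction.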

Implementing the filter

So, with all of this in mind, we will want to write a BPF filter that matches the pseudocode below:

if (pkt.arch != MY_ARCH)
    deny;
if (pkt.nr == SYS_read)
    allow;
if (pkt.nr == SYS_write)
    allow;
...
else
    deny;

Using the knowledge of BPF bytecode that we have, this implies that our instructions should be:

    ldw offsetof(pkt, arch)
    jeq MY_ARCH, ok
    ret SECCOMP_RET_KILL
ok:
    ldw offsetof(pkt, nr)
    jeq SYS_read, .L0, .L1
.L0:
    ret SECCOMP_RET_ALLOW
.L1:
    jeq SYS_write, .L2, .L3
.L2:
    ret SECCOMP_RET_ALLOW
.L3:
    ...
    ret SECCOMP_RET_KILL

Manually converting this to the macros in linux/filter.h, we can get something that looks like:

static struct sock_filter filter[] = {
    BPF_STMT(
        /* ldw from abs offset */
        BPF_LD+BPF_W+BPF_ABS,
        offsetof(struct seccomp_data, arch)
    ),
    BPF_JUMP(
        /* jeq instruction */
        BPF_JMP+BPF_JEQ+BPF_K,
        /* the value to test */
        AUDIT_ARCH_X86_64,
        /* jump distance if true: skip the kill below */
        1,
        /* jump distance if false: fall through to the kill */
        0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* load the syscall number */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* allow read() */
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_read, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
    /* deny anything else */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};

To improve readability, I usually define a macro named Allow, which wraps the compare-and-return pair of instructions for a single system call:

#define Allow(syscall) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_##syscall, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
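
If the list of allowed calls is only known at runtime, the same compare-and-return pairs can be generated in a loop instead of with a macro. The following is a sketch under my own naming (build_allowlist and allowlist_check are hypothetical); it writes the architecture check, one pair per system call, and a final kill into a caller-supplied buffer.

```c
#include <assert.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/*
 * Hypothetical helper: write an allow-list filter for the given syscall
 * numbers into out[]. Returns the number of instructions written, or 0
 * if out[] is too small or the program would exceed BPF_MAXINSNS.
 */
size_t build_allowlist(struct sock_filter *out, size_t cap, __u32 arch,
                       const int *nrs, size_t n)
{
    size_t need = 4 + 2 * n + 1;    /* prologue + a pair per syscall + final kill */
    size_t i = 0;

    if (n >= BPF_MAXINSNS || need > BPF_MAXINSNS || need > cap)
        return 0;

    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                                            offsetof(struct seccomp_data, arch));
    out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 1, 0);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                                            offsetof(struct seccomp_data, nr));
    for (size_t j = 0; j < n; j++) {
        /* equal: fall into the allow; not equal: skip it */
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,
                                                (__u32)nrs[j], 0, 1);
        out[i++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW);
    }
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL);
    return i;
}

/* Usage: two allowed calls give 4 + 2*2 + 1 = 9 instructions. */
void allowlist_check(void)
{
    struct sock_filter buf[16];
    const int nrs[] = { 1, 231 };   /* write and exit_group on x86-64 */
    size_t len = build_allowlist(buf, 16, AUDIT_ARCH_X86_64, nrs, 2);

    assert(len == 9);
    assert(buf[4].k == 1 && buf[4].jt == 0 && buf[4].jf == 1);
    assert(buf[len - 1].k == SECCOMP_RET_KILL);
}
```

The returned length goes into the len field of struct sock_fprog, with out as the filter pointer.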

If efficiency becomes a concern, the linear sequence of checks can be replaced by something more efficient, such as a binary search. That is left as an exercise.

Putting It All Together

Going back to our original program, how can it be set up such that once it is initialized, it can never harm us?

int main(int argc, char **argv) {
    /* the goal is to say hello world */
    printf("hello there!\n");

    /*
    but if you know the secret password, you can
    compromise the program, and we start sending
    spam.
    */
    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        /* ...and start sending spam */
    }
}

Let's put in a BPF filter that allows only the system calls we expect. We want the program to be able to call printf and exit, and nothing more, so we will allow only the system calls needed for those.

The exit_group() system call is used to exit a process under Linux (exit() only exits the current thread). write() and, as it turns out, fstat() are used directly by printf(). And finally, since the stdio API can potentially allocate memory, we will want to allow brk(), mmap(), and munmap().

So, all of those go into the filter, and we get this result:

struct sock_filter filter[] = {
    /* validate arch */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, ArchField),
    BPF_JUMP( BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),

    /* load syscall */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),

    /* list of allowed syscalls */
    Allow(exit_group),  /* exits a process */
    Allow(brk),     /* for malloc(), inside libc */
    Allow(mmap),        /* also for malloc() */
    Allow(munmap),      /* for free(), inside libc */
    Allow(write),       /* called by printf */
    Allow(fstat),       /* called by printf */

    /* and if we don't match above, die */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};

So, putting it into the program, we get a process that has restricted access to system calls, and which will be killed if it's exploited using our extreme cracking skills (i.e., typing ./a.out haxor at the command line).

#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/socket.h>

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/audit.h>

#define ArchField offsetof(struct seccomp_data, arch)

#define Allow(syscall) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_##syscall, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

struct sock_filter filter[] = {
    /* validate arch */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, ArchField),
    BPF_JUMP( BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),

    /* load syscall */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),

    /* list of allowed syscalls */
    Allow(exit_group),  /* exits a process */
    Allow(brk),     /* for malloc(), inside libc */
    Allow(mmap),        /* also for malloc() */
    Allow(munmap),      /* for free(), inside libc */
    Allow(write),       /* called by printf */
    Allow(fstat),       /* called by printf */

    /* and if we don't match above, die */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog filterprog = {
    .len = sizeof(filter)/sizeof(filter[0]),
    .filter = filter
};

int main(int argc, char **argv) {
    /* set up the restricted environment */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
        perror("Could not set no_new_privs");
        exit(1);
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog) == -1) {
        perror("Could not start seccomp");
        exit(1);
    }

    /* printf only writes to stdout, but stdio calls fstat() on it first. */
    printf("hello there!\n");

    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        /* socket() is not in the allowed list, so the process dies here */
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        (void)fd;
        /* ...and start sending spam */
    }
    return 0;
}

And indeed, that is what we get:

$ cc test.c
$ ./a.out 
hello there!

The program behaves as expected. But with a malicious input:

$ ./a.out haxor
hello there!
Bad system call

Tips on Debugging

The set of required system calls is often not entirely obvious, so when debugging, it often proves useful to return SECCOMP_RET_TRAP from the filter instead of SECCOMP_RET_KILL. When this is done, strace will show the syscall number that caused the signal, and cross referencing with the system call numbers in asm/unistd_64.h will show the call that was disallowed.

For example, when writing the example program, it died when trying to use printf. Setting it to trap gave me this output in strace:

<snip>
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 17, filter = 0x600b80}) = 0
syscall_18446744073709551615(0x1, 0x7ffc13c78390, 0x7ffc13c78390, 
    0x7ffc13c78280, 0x7f65984157a0, 0x7f6598427e30) = 0x5
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, 
    si_call_addr=0x7f6598148fa4, si_syscall=5, si_arch=3221225534} ---
+++ killed by SIGSYS +++
Bad system call

The SIGSYS line tells me that si_syscall=5, and looking in asm/unistd_64.h, we find:

#define __NR_fstat 5

So I added fstat to the allowed set, and printf worked fine.
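
When strace isn't at hand, another option is to catch the SIGSYS yourself and read si_syscall from the siginfo_t. The sketch below does this in a forked child so the parent survives. It is a demo under assumptions I should state loudly: the function names are mine, the architecture list only covers x86-64 and AArch64, and the filter traps a single call (getppid) and allows everything else, which is the opposite of what a real sandbox should do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* SIGSYS handler: exit 0 only if the trapped call is the one we expected. */
static void on_sigsys(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    _exit(info->si_syscall == SYS_getppid ? 0 : 1);
}

/*
 * Hypothetical demo: trap getppid() in a forked child and check what the
 * handler saw. Returns 1 if si_syscall matched, 0 if it did not, and -1
 * if the filter could not be installed.
 */
int trap_reports_syscall(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),   /* demo only */
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    struct sigaction sa;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_sigsys;
        sa.sa_flags = SA_SIGINFO;
        if (sigaction(SIGSYS, &sa, NULL) != 0 ||
            prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        syscall(SYS_getppid);   /* trapped: the handler exits for us */
        _exit(1);               /* only reached if no SIGSYS arrived */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}

/* Usage: either seccomp is unavailable (-1) or the handler saw getppid. */
void trap_demo(void)
{
    int r = trap_reports_syscall();

    assert(r == 1 || r == -1);
}
```

In a real program the handler would more likely log si_syscall and then abort, so that the missing call can be added to the allowed set.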
