Denying Syscalls with Seccomp
The web is an ugly place, and anything that touches it in any significant capacity is likely to be attacked. Programs are full of bugs and exploits, and eliminating all compromises is unlikely to be within the reach of most mortals.
So, the problem at hand -- given a compromised program, how can we keep it from doing too much damage? What can we do to restrict the attack surface, and what mechanisms are available for that?
More specifically, imagine we had a program that looked something like this:
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(int argc, char **argv) {
    /* the goal is to say hello world */
    printf("hello there!\n");
    /*
       but if you know the secret password, you can
       compromise the program, and we start sending
       spam.
    */
    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        /* ...start spamming */
        (void)fd;
    }
    return 0;
}
Traditionally, chroot has been used to lock down applications, but it's clearly not enough in this case: Sockets don't touch the file system, and there's a lot of malicious stuff you can do from inside a chroot.
OpenBSD's pledge has been getting lots of press lately, but it's OpenBSD-specific, and much of the world runs on Linux. Thankfully, Linux has seccomp, a far more complicated, but also more powerful, tool for restricting which system calls a process can make.
Seccomp isn't a complete sandbox, but it works with many other parts that Linux provides -- rlimit, process and network namespaces, cgroups, and more -- as a building block for jailing a process.
Seccomp
Seccomp was initially added to Linux in 2005, but in its initial form, seccomp was extremely limited. It would restrict your program to exactly four system calls, killing it if it tried to do anything outside of those:
exit()
sigreturn()
read()
write()
This meant that the restricted process had to ask a parent process to make any other system call on its behalf, and would have severe difficulty doing anything as simple as a dynamic memory allocation. As a general sandboxing mechanism, this wouldn't fly, and seccomp later grew the ability to control, in a relatively fine-grained way, which system calls are allowed.
The initial mode is still available, and you can request it on your process using:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
After this call, your process will be restricted, and any system call other than the four listed above will lead to a premature death.
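To watch strict mode at work without sacrificing the main process, one option is to enable it in a forked child and inspect how the child ended. The sketch below assumes Linux with seccomp available. `strict_mode_demo` is a made-up helper name, and the raw `syscall(SYS_exit, 0)` is there because the C library's own exit path normally goes through exit_group(), which strict mode does not permit.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fork a child, put it into strict mode, and report
 * how it ended. Returns the fatal signal number, 0 for a clean exit, or -1
 * if strict mode could not be enabled (or fork/wait failed).
 */
static int strict_mode_demo(int misbehave)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42);            /* seccomp not usable here */
        if (misbehave)
            syscall(SYS_getppid); /* not one of the four: fatal */
        syscall(SYS_exit, 0);     /* raw exit(), which is permitted */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

Calling `strict_mode_demo(1)` should report SIGKILL on a system where strict mode is available, while `strict_mode_demo(0)` should report a clean exit.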
The mode that is more flexible, and more useful, is filtered mode, which uses a Berkeley Packet Filter to set up the list of system calls. The code to put the filter in place, assuming you've defined the filter data structure appropriately, is below:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog);
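Once a filter is installed, it cannot be removed, and it is inherited across fork() and execve(); filters added later can only restrict the process further. That inheritance is a problem of its own: an unprivileged process could install a filter and then exec a setuid program under it, tricking that program into misbehaving. So Linux won't allow us to install a filter unless the process is privileged (CAP_SYS_ADMIN) or has given up the ability to gain new privileges. The simplest way to do the latter is with another prctl() call: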
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
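A small wrapper can set the flag and read it back, which makes a failure easier to spot before the filter is installed. This is only a sketch: `lock_privileges` is a made-up name, and it assumes a kernel new enough to know PR_SET_NO_NEW_PRIVS and PR_GET_NO_NEW_PRIVS.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set no_new_privs and confirm that it stuck.
 * Returns 0 on success, -1 otherwise. The flag cannot be cleared again,
 * and it is inherited by children.
 */
static int lock_privileges(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* PR_GET_NO_NEW_PRIVS returns 1 when the flag is set. */
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 0 : -1;
}
```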
This leads us to the next question: What is the filter data structure, and how is it set up?
For this, we'll need to take a short diversion and look at the Berkeley Packet Filter.
An Aside: BPF
The Berkeley Packet Filter is a programmable packet filtering and classification system that runs within the kernel. It was initially put in place to handle packet filtering and monitoring, without a roundtrip to userspace for every packet that needed to be inspected.
BPF takes the form of a virtual machine in the kernel, processing a simple, restricted instruction set, and returning an integer to the kernel that describes what to do with the packet. The BPF instruction set includes most arithmetic operations, loads and stores, and forward jumps -- no backward jumps are allowed, in order to guarantee that filter programs will terminate.
The BPF instructions operate on the BPF virtual machine, which has four main elements: the accumulator register A, the index register X, the packet memory, and the scratch memory M[].
The full list of instructions is below:
Operator | Effect |
---|---|
Loads | |
ld | Load word into A |
ldi | Load immediate word into A |
ldh | Load half-word into A |
ldb | Load byte into A |
ldx | Load word into X |
ldxi | Load immediate word into X |
ldxb | Load byte into X |
Stores | |
st | Store A into M[] |
stx | Store X into M[] |
Jumps | |
jmp | Jump to offset |
ja | Jump to offset |
jeq | Jump on A == k |
jneq | Jump on A != k |
jne | Jump on A != k |
jlt | Jump on A < k |
jle | Jump on A <= k |
jgt | Jump on A > k |
jge | Jump on A >= k |
jset | Jump on A & k |
Arithmetic | |
add | A + <x> |
sub | A - <x> |
mul | A * <x> |
div | A / <x> |
mod | A % <x> |
neg | -A |
and | A & <x> |
or | A \| <x> |
xor | A ^ <x> |
lsh | A << <x> |
rsh | A >> <x> |
Misc | |
tax | Copy A into X |
txa | Copy X into A |
ret | Return |
The operand is treated in one of several ways, depending on the addressing mode:
- BPF_IMM: Use the instruction's `k` field as an immediate value.
- BPF_ABS: Use the instruction's `k` field as an index into packet memory.
- BPF_IND: Use the instruction's `k` field as an index into packet memory, adding the contents of the `X` register to the offset.
- BPF_MEM: Use the instruction's `k` field as an index into scratch memory `M[]`.
- BPF_LEN: A special value that loads the size of the packet.
- BPF_MSH: An efficient way to load the header length from an IP header into `X`.
The size of the value can be one of:
Size | Bytes |
---|---|
W | 4 |
H | 2 |
B | 1 |
The instructions themselves are encoded in the following structure:
struct sock_filter {
uint16_t code; /* the opcode */
uint8_t jt; /* if true: jump displacement */
uint8_t jf; /* if false: jump displacement */
uint32_t k; /* immediate operand */
};
The opcode for each instruction is constructed by taking the base opcode class, adding in the specific operation type, and finally adding in the mode or operand source. For example, to construct a division of the A register by the constant 24, I would take the base code for BPF_ALU, add BPF_DIV to mark the ALU operation as a division, and add BPF_K to mark the operand as the immediate value in k. (ALU operations take their operand either from k or, with BPF_X, from the X register; they cannot address scratch memory directly, so a value stored in M[] has to be loaded into X first.) The jump displacements jt and jf are unused by ALU operations, so the last value, k, holds the divisor.
struct sock_filter div_insn = {
    .code = BPF_ALU + BPF_DIV + BPF_K,
    .k = 24
};
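Because each part of an opcode occupies its own bit field, the sum can be taken apart again with the accessor macros that come with linux/filter.h. The sketch below builds the three opcodes that seccomp filters lean on most and decodes them; the `OP_*` constants and `opcode_*` helpers are made-up names for illustration.

```c
#include <stdint.h>
#include <linux/filter.h>

/* Opcodes built by adding class, size or operation, and mode or source. */
static const uint16_t OP_LOAD_WORD_ABS = BPF_LD + BPF_W + BPF_ABS;  /* 0x20 */
static const uint16_t OP_JUMP_EQ_IMM   = BPF_JMP + BPF_JEQ + BPF_K; /* 0x15 */
static const uint16_t OP_RETURN_IMM    = BPF_RET + BPF_K;           /* 0x06 */

/* Hypothetical helpers that recover the individual fields of an opcode. */
static uint16_t opcode_class(uint16_t code) { return BPF_CLASS(code); }
static uint16_t opcode_size(uint16_t code)  { return BPF_SIZE(code); }
static uint16_t opcode_mode(uint16_t code)  { return BPF_MODE(code); }
static uint16_t opcode_op(uint16_t code)    { return BPF_OP(code); }
```

For instance, `opcode_class(OP_JUMP_EQ_IMM)` gives back BPF_JMP and `opcode_op(OP_JUMP_EQ_IMM)` gives back BPF_JEQ, which is handy when reading a hex dump of a filter.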
There are macros provided by the Linux kernel for producing these instructions -- one for statements, which ignore the jt and jf fields, and another for jumps.
#define BPF_STMT(code, k) \
{ (unsigned short)(code), 0, 0, k }
#define BPF_JUMP(code, k, jt, jf) \
{ (unsigned short)(code), jt, jf, k }
So, in order to construct a sequence of instructions, one would build an array statically, as follows:
struct sock_filter filter[] = {
    /* A <- pkt[666:666 + 4] */
    BPF_STMT(
        BPF_LD + BPF_ABS + BPF_W,  /* opcode */
        666),                      /* k value */
    /* if A == 123: jump forward 7; else: jump forward 9 */
    BPF_JUMP(
        BPF_JMP + BPF_JEQ + BPF_K, /* opcode */
        123,                       /* constant to compare against */
        7,                         /* jump distance if true */
        9),                        /* jump distance if false */
    /* ...remaining instructions, ending in a ret */
};
Finally, all BPF sequences must end with a return, which is used by the OS to decide what to do with the packet.
These filters are then encapsulated in a program header, which would be defined as follows:
struct sock_fprog filterprog = {
.len = sizeof(filter)/sizeof(filter[0]),
.filter = filter
};
The program is then passed to the kernel, which stores it, and executes it when needed.
Applying BPF
Now that we've covered how BPF works, we can talk about how it's used when filtering system calls. Once we install the filter for seccomp to use, seccomp will pass it "packets" that represent system calls. Each packet passed to the filter looks like this:
struct seccomp_data {
int nr;
__u32 arch;
__u64 instruction_pointer;
__u64 args[6];
};
From there, we use BPF to pick out the values we would like to inspect from the seccomp data, analyze them, and then return one of the following five values:
SECCOMP_RET_KILL /* kill the task immediately */
SECCOMP_RET_TRAP /* disallow and force a SIGSYS */
SECCOMP_RET_ERRNO /* returns an errno */
SECCOMP_RET_TRACE /* pass to a tracer or disallow */
SECCOMP_RET_ALLOW /* allow */
A word of caution -- before checking system call numbers, we must always verify the architecture. Since system call numbers in Linux may vary across architectures, neglecting to verify the architecture can lead to banning, or worse, allowing, the wrong system calls.
Generally, for unexpected system calls, we will want to return SECCOMP_RET_KILL, since letting the program carry on after a SIGSYS or an ENOSYS allows a malicious or compromised program to probe our defenses, and possibly find some loophole that can be exploited.
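The value a filter returns is a single 32-bit word: the upper bits select the action, and the low 16 bits carry data for it, such as the errno to hand back with SECCOMP_RET_ERRNO. A minimal sketch of packing and unpacking that word, using the masks from linux/seccomp.h, is below; the `ret_*` helper names are made up for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Hypothetical helper: build a "fail with this errno" return value. */
static uint32_t ret_errno(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
}

/* Hypothetical helper: the action part of a filter return value. */
static uint32_t ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION;
}

/* Hypothetical helper: the 16-bit data part of a filter return value. */
static uint32_t ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

In a filter, such a value would go into the `k` field of a return instruction, for example `BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA))`.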
Implementing the filter
So, with all of this in mind, we will want to write a BPF filter that matches the pseudocode below:
if (pkt.arch != MY_ARCH)
    deny;
if (pkt.nr == SYS_read)
    allow;
if (pkt.nr == SYS_write)
    allow;
...
else
    deny;
Using the knowledge of BPF bytecode that we have, this implies that our instructions should be:
    ldw offsetof(pkt, arch)
    jeq MY_ARCH, ok
    ret SECCOMP_RET_KILL
ok:
    ldw offsetof(pkt, nr)
    jeq SYS_read, .L0, .L1
.L0:
    ret SECCOMP_RET_ALLOW
.L1:
    jeq SYS_write, .L2, .L3
.L2:
    ret SECCOMP_RET_ALLOW
.L3:
    ...
    ret SECCOMP_RET_KILL
Manually converting this to the macros in linux/filter.h, we get something that looks like:
static struct sock_filter filter[] = {
    BPF_STMT(
        /* ldw from abs offset */
        BPF_LD+BPF_W+BPF_ABS,
        offsetof(struct seccomp_data, arch)
    ),
    BPF_JUMP(
        /* jeq instruction */
        BPF_JMP+BPF_JEQ+BPF_K,
        /* the value to test */
        AUDIT_ARCH_X86_64,
        /* jump distance if true */
        1,
        /* jump distance if false */
        0),
    /* wrong architecture: kill */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* load the syscall number */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* allow read() */
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_read, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
    /* deny anything else */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
};
To improve readability, I usually define a macro named Allow, which wraps the jump-and-return pair used for each allowed system call:
#define Allow(syscall) \
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##syscall, 0, 1), \
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
If efficiency becomes a concern, the linear sequence of checks can be replaced by something more efficient, such as a binary search. That is left as an exercise.
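Before installing a filter, it can help to dry-run it in user space against hand-made seccomp_data values. The sketch below is a deliberately tiny evaluator, not the kernel's interpreter: it understands only the three opcodes used in this article, and it treats anything unexpected as a kill. `demo_filter`, `eval_filter`, and `verdict` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* A copy of the hand-written filter: x86-64 only, read() only. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, SYS_read, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
};

/* Run a filter over one fake "packet" and return the filter's verdict. */
static uint32_t eval_filter(const struct sock_filter *prog, size_t len,
                            const struct seccomp_data *data)
{
    uint32_t A = 0; /* the accumulator */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD + BPF_W + BPF_ABS:
            /* only aligned 4-byte loads inside the struct are accepted */
            if (insn->k % 4 != 0 || (size_t)insn->k + 4 > sizeof(*data))
                return SECCOMP_RET_KILL;
            memcpy(&A, (const unsigned char *)data + insn->k, sizeof(A));
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:
            /* displacements are relative to the next instruction */
            pc += (A == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET + BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL; /* opcode this sketch does not know */
        }
    }
    return SECCOMP_RET_KILL; /* ran off the end without a ret */
}

/* Convenience wrapper: what would demo_filter say about this call? */
static uint32_t verdict(int nr, uint32_t arch)
{
    struct seccomp_data data;

    memset(&data, 0, sizeof(data));
    data.nr = nr;
    data.arch = arch;
    return eval_filter(demo_filter,
                       sizeof(demo_filter) / sizeof(demo_filter[0]), &data);
}
```

With this in place, `verdict(SYS_read, AUDIT_ARCH_X86_64)` yields SECCOMP_RET_ALLOW, while a different system call or a different architecture yields SECCOMP_RET_KILL, which is a cheap way to catch an off-by-one in the jump displacements.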
Putting It All Together
Going back to our original program, how can it be set up such that once it is initialized, it can never harm us?
int main(int argc, char **argv) {
    /* the goal is to say hello world */
    printf("hello there!\n");
    /*
       but if you know the secret password, you can
       compromise the program, and we start sending
       spam.
    */
    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        /* ...and start sending spam */
    }
}
Let's put in a BPF filter that allows only the system calls we expect. We want the program to be able to call printf and exit, and nothing more, so we will allow only the system calls needed for those.
The exit_group() system call is used to exit a process under Linux (exit() only exits the current thread). write() and, as it turns out, fstat() are used directly by printf(). And finally, since the stdio API can potentially allocate memory, we will want to allow brk(), mmap(), and munmap().
So, all of those go into the filter, and we get this result:
struct sock_filter filter[] = {
    /* validate arch */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, ArchField),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* load syscall */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* list of allowed syscalls */
    Allow(exit_group), /* exits a process */
    Allow(brk),        /* for malloc(), inside libc */
    Allow(mmap),       /* also for malloc() */
    Allow(munmap),     /* for free(), inside libc */
    Allow(write),      /* called by printf */
    Allow(fstat),      /* called by printf */
    /* and if we don't match above, die */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
Putting it into the program, we get a process that has restricted access to system calls, and which will be killed if it's exploited using our extreme cracking skills (i.e., typing ./a.out haxor at the command line).
#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/audit.h>
#define ArchField offsetof(struct seccomp_data, arch)
#define Allow(syscall) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_##syscall, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

struct sock_filter filter[] = {
    /* validate arch */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, ArchField),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* load syscall */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* list of allowed syscalls */
    Allow(exit_group), /* exits a process */
    Allow(brk),        /* for malloc(), inside libc */
    Allow(mmap),       /* also for malloc() */
    Allow(munmap),     /* for free(), inside libc */
    Allow(write),      /* called by printf */
    Allow(fstat),      /* called by printf; newer glibc may use newfstatat instead */
    /* and if we don't match above, die */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};

struct sock_fprog filterprog = {
    .len = sizeof(filter)/sizeof(filter[0]),
    .filter = filter
};

int main(int argc, char **argv) {
    /* set up the restricted environment */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
        perror("Could not set no_new_privs");
        exit(1);
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog) == -1) {
        perror("Could not start seccomp");
        exit(1);
    }
    /* printf only writes to stdout, but for some reason it stats it. */
    printf("hello there!\n");
    if (argc > 1 && strcmp(argv[1], "haxor") == 0) {
        int fd = socket(AF_INET6, SOCK_STREAM, 0);
        /* ...and start sending spam */
        (void)fd;
    }
    return 0;
}
And indeed, that is what we get:
$ cc test.c
$ ./a.out
hello there!
The program behaves as expected. But with a malicious input:
$ ./a.out haxor
hello there!
Bad system call
Tips on Debugging
The set of required system calls is often not entirely obvious, so when debugging, it often proves useful to return SECCOMP_RET_TRAP from the filter instead of SECCOMP_RET_KILL. When this is done, strace will show the syscall number that caused the signal, and cross-referencing with the system call numbers in asm/unistd_64.h will show the call that was disallowed.
For example, when writing the example program, it died when trying to use printf. Setting it to trap gave me this output in strace:
<snip>
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 17, filter = 0x600b80}) = 0
syscall_18446744073709551615(0x1, 0x7ffc13c78390, 0x7ffc13c78390,
0x7ffc13c78280, 0x7f65984157a0, 0x7f6598427e30) = 0x5
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP,
si_call_addr=0x7f6598148fa4, si_syscall=5, si_arch=3221225534} ---
+++ killed by SIGSYS +++
Bad system call
The SIGSYS tells me that si_syscall=5, and looking in asm/unistd_64.h, we find:
#define __NR_fstat 5
So I added fstat to the allowed set, and printf worked fine.
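When strace is not convenient, another option with SECCOMP_RET_TRAP is to catch the SIGSYS inside the process and record the offending system call number there. The sketch below is one way to do that, under some assumptions: `last_trapped_syscall` and `install_sigsys_recorder` are made-up names, and the fallback definition of SYS_SECCOMP is only for C library headers that do not provide it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1 /* assumption: si_code used for seccomp-generated SIGSYS */
#endif

/* Number of the last system call trapped by seccomp, or -1 if none yet. */
static volatile sig_atomic_t last_trapped_syscall = -1;

/* SIGSYS handler: remember which system call the filter trapped. */
static void on_sigsys(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    /* Ignore SIGSYS that did not come from a seccomp filter. */
    if (info->si_code == SYS_SECCOMP)
        last_trapped_syscall = info->si_syscall;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int install_sigsys_recorder(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

The handler has to be installed before the filter, since sigaction() is itself a system call, and the filter has to allow rt_sigreturn() so that the handler can return. After a trapped call, `last_trapped_syscall` holds the number to look up.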
Further Reading
Linux Documentation on BPF Filters
BPF Manpage
BPF Paper
Linux Kernel Documentation
LWN Post