		Seccomp filtering
		=================

Introduction
------------

A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated.  A
certain subset of userland applications benefit by having a reduced set
of available system calls.  The resulting set reduces the total kernel
surface exposed to the application.  System call filtering is meant for
use with those applications.

The implementation currently leverages both the existing seccomp
infrastructure and the kernel tracing infrastructure.  By centralizing
hooks for attack surface reduction in seccomp, it is possible to assure
attention to security that is less relevant in normal ftrace scenarios,
such as time-of-check, time-of-use attacks.  However, ftrace provides a
rich, human-friendly environment for interfacing with system call
specific arguments.  (As such, this requires FTRACE_SYSCALLS for any
introspective filtering support.)


What it isn't
-------------

System call filtering isn't a sandbox.  It provides a clearly defined
mechanism for minimizing the exposed kernel surface.  Beyond that,
policy for logical behavior and information flow should be managed with
a combinations of other system hardening techniques and, potentially, a
LSM of your choosing.  Expressive, dynamic filters based on the ftrace
filter engine provide further options down this path (avoiding
pathological sizes or selecting which of the multiplexed system calls in
socketcall() is allowed, for instance) which could be construed,
incorrectly, as a more complete sandboxing solution.


Usage
-----

An additional seccomp mode is exposed through mode '13'.
This mode depends on CONFIG_SECCOMP_FILTER.  By default, it provides
only the most trivial of filter support "1" or cleared.  However, if
CONFIG_FTRACE_SYSCALLS is enabled, the ftrace filter engine may be used
for more expressive filters.

A collection of filters may be supplied via prctl, and the current set
of filters is exposed in /proc/<pid>/seccomp_filter.

Interacting with seccomp filters can be done through three new prctl calls
and one existing one.

PR_SET_SECCOMP:
	A pre-existing option for enabling strict seccomp mode (1) or
	filtering seccomp (13).

	Usage:
		prctl(PR_SET_SECCOMP, 1);  /* strict */
		prctl(PR_SET_SECCOMP, 13);  /* filters */

PR_SET_SECCOMP_FILTER:
	Allows the specification of a new filter for a given system
	call, by number, and filter string.  By default, the filter
	string may only be "1".  However, if CONFIG_FTRACE_SYSCALLS is
	supported, the filter string may make use of the ftrace
	filtering language's awareness of system call arguments.

	In addition, the event id for the system call entry may be
	specified in lieu of the system call number itself, as
	determined by the 'type' argument.  This allows for the future
	addition of seccomp-based filtering on other registered,
	relevant ftrace events.

	All calls to PR_SET_SECCOMP_FILTER for a given system
	call will append the supplied string to any existing filters.
	Filter construction looks as follows:
		(Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2
		... + "fd != 2" => (fd == 1 || fd == 2) && fd != 2
		... + "size < 100" =>
			((fd == 1 || fd == 2) && fd != 2) && size < 100
	If there is no filter and the seccomp mode has already
	transitioned to filtering, additions cannot be made.  Filters
	may only be added that reduce the available kernel surface.

	Usage (per the construction example above):
		unsigned long type = PR_SECCOMP_FILTER_SYSCALL;
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"fd == 1 || fd == 2");
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"fd != 2");
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"size < 100");

	The 'type' argument may be one of PR_SECCOMP_FILTER_SYSCALL or
	PR_SECCOMP_FILTER_EVENT.

PR_CLEAR_SECCOMP_FILTER:
	Removes all filter entries for a given system call number or
	event id.  When called prior to entering seccomp filtering mode,
	it allows for new filters to be applied to the same system call.
	After transition, however, it completely drops access to the
	call.

	Usage:
		prctl(PR_CLEAR_SECCOMP_FILTER,
			PR_SECCOMP_FILTER_SYSCALL, __NR_open);

PR_GET_SECCOMP_FILTER:
	Returns the aggregated filter string for a system call into a
	user-supplied buffer of a given length.

	Usage:
		prctl(PR_GET_SECCOMP_FILTER,
			PR_SECCOMP_FILTER_SYSCALL, __NR_write, buf,
			sizeof(buf));

All of the above calls return 0 on success and non-zero on error.  If
CONFIG_FTRACE_SYSCALLS is not supported and a rich-filter was specified,
the caller may check the errno for -ENOSYS.  The same is true if
specifying an filter by the event id fails to discover any relevant
event entries.


Example
-------

Assume a process would like to cleanly read and write to stdin/out/err
as well as access its filters after seccomp enforcement begins.  This
may be done as follows:

  int filter_syscall(int nr, char *buf) {
    return prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL,
                 nr, buf);
  }

  filter_syscall(__NR_read, "fd == 0");
  filter_syscall(_NR_write, "fd == 1 || fd == 2");
  filter_syscall(__NR_exit, "1");
  filter_syscall(__NR_prctl, "1");
  prctl(PR_SET_SECCOMP, 13);

  /* Do stuff with fdset . . .*/

  /* Drop read access and keep only write access to fd 1. */
  prctl(PR_CLEAR_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL, __NR_read);
  filter_syscall(__NR_write, "fd != 2");

  /* Perform any final processing . . . */
  syscall(__NR_exit, 0);


Caveats
-------

- Avoid using a filter of "0" to disable a filter.  Always favor calling
  prctl(PR_CLEAR_SECCOMP_FILTER, ...).  Otherwise the behavior may vary
  depending on if CONFIG_FTRACE_SYSCALLS support exists -- though an
  error will be returned if the support is missing.

- execve is always blocked.  seccomp filters may not cross that boundary.

- Filters can be inherited across fork/clone but only when they are
  active (e.g., PR_SET_SECCOMP has been set to 13), but not prior to use.
  This stops the parent process from adding filters that may undermine
  the child process security or create unexpected behavior after an
  execve.

- Some platforms support a 32-bit userspace with 64-bit kernels.  In
  these cases (CONFIG_COMPAT), system call numbers may not match across
  64-bit and 32-bit system calls. When the first PRCTL_SET_SECCOMP_FILTER
  is called, the in-memory filters state is annotated with whether the
  call has been made via the compat interface.  All subsequent calls will
  be checked for compat call mismatch.  In the long run, it may make sense
  to store compat and non-compat filters separately, but that is not
  supported at present. Once one type of system call interface has been
  used, it must be continued to be used.


Adding architecture support
-----------------------

Any platform with seccomp support should be able to support the bare
minimum of seccomp filter features.  However, since seccomp_filter
requires that execve be blocked, it expects the architecture to expose a
__NR_seccomp_execve define that maps to the execve system call number.
On platforms where CONFIG_COMPAT applies, __NR_seccomp_execve_32 must
also be provided.  Once those macros exist, "select HAVE_SECCOMP_FILTER"
support may be added to the architectures Kconfig.
