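Seccomp security profiles for Docker
Secure computing mode (seccomp) is a Linux kernel feature that you can use to restrict the actions available inside a container. The seccomp() system call operates on the seccomp state of the calling process, which lets you limit what your application can ask the kernel to do.
This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check whether your kernel supports seccomp: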
$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
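The check above assumes the kernel config is installed under /boot. As a minimal sketch, the helper below applies the same test to whichever config file is present, including /proc/config.gz, which exists only on kernels built with CONFIG_IKCONFIG_PROC. The function name has_seccomp_config is hypothetical and not part of Docker or the kernel tooling.

```shell
#!/bin/sh
# Sketch: report whether a kernel config file enables seccomp.
# has_seccomp_config is a hypothetical helper name used for illustration only.
has_seccomp_config() {
    config_file=$1
    [ -r "$config_file" ] || return 2    # file missing or unreadable
    case $config_file in
        *.gz) gzip -dc "$config_file" | grep -q '^CONFIG_SECCOMP=y' ;;
        *)    grep -q '^CONFIG_SECCOMP=y' "$config_file" ;;
    esac
}

# Try two common locations; not every distribution provides either one.
for candidate in "/boot/config-$(uname -r)" /proc/config.gz; do
    if has_seccomp_config "$candidate"; then
        echo "seccomp enabled according to $candidate"
        break
    fi
done
```

If neither file exists the loop prints nothing, which means the check was inconclusive rather than that seccomp is missing.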
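Pass a profile for a container
The default seccomp profile provides a sane default for running containers and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile is published alongside the Docker Engine source code.
In effect, the profile is an allowlist: it denies access to system calls by default and then allows specific ones. It works by defining a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. The effect of SCMP_ACT_ERRNO is that the call fails with a Permission Denied error (in practice, typically EPERM, "Operation not permitted"). Next, the profile defines a list of system calls that are fully allowed because their action is overridden to SCMP_ACT_ALLOW. Finally, some rules apply to individual system calls, such as personality, and allow only variants of those calls with specific arguments.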
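The three parts described above (a denying defaultAction, a list of fully allowed calls, and an argument-restricted rule) can be sketched as a tiny profile. This is an illustration only: it is not Docker's default profile, the syscall selection is arbitrary, and it is far too small to start a real container. The temporary path is a placeholder.

```shell
#!/bin/sh
# Illustrative allowlist profile. NOT Docker's default; too small for real workloads.
profile=$(mktemp)    # placeholder location; any readable path works

cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 0, "value": 0, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
EOF

echo "profile written to $profile"
```

Every call not named in the syscalls array falls through to defaultAction and fails. The personality entry is allowed only when its first argument (index 0) equals 0, which is the kind of per-argument rule the paragraph refers to.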
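seccomp is instrumental for running Docker containers with least privilege. Changing the default seccomp profile is not recommended.
When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a policy: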
$ docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
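Significant syscalls blocked by the default profile
Docker's default seccomp profile is an allowlist that specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the allowlist. It also gives the reason each syscall is blocked rather than allowlisted.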
| Syscall | Description |
|---|---|
| acct | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT. |
| add_key | Prevent containers from using the kernel keyring, which is not namespaced. |
| bpf | Deny loading potentially persistent BPF programs into kernel, already gated by CAP_SYS_ADMIN. |
| clock_adjtime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| clock_settime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| clone | Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER. |
| create_module | Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE. |
| delete_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| finit_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| get_kernel_syms | Deny retrieval of exported kernel and module symbols. Obsolete. |
| get_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| init_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| ioperm | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
| iopl | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
| kcmp | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| kexec_file_load | Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT. |
| kexec_load | Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT. |
| keyctl | Prevent containers from using the kernel keyring, which is not namespaced. |
| lookup_dcookie | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN. |
| mbind | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| mount | Deny mounting, already gated by CAP_SYS_ADMIN. |
| move_pages | Syscall that modifies kernel memory and NUMA settings. |
| nfsservctl | Deny interaction with the kernel NFS daemon. Obsolete since Linux 3.1. |
| open_by_handle_at | Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH. |
| perf_event_open | Tracing/profiling syscall, which could leak a lot of information on the host. |
| personality | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulnerabilities. |
| pivot_root | Deny pivot_root, should be privileged operation. |
| process_vm_readv | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| process_vm_writev | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| ptrace | Tracing/profiling syscall. Blocked in Linux kernel versions before 4.8 to avoid seccomp bypass. Tracing/profiling arbitrary processes is already blocked by dropping CAP_SYS_PTRACE, because it could leak a lot of information on the host. |
| query_module | Deny manipulation and functions on kernel modules. Obsolete. |
| quotactl | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN. |
| reboot | Don't let containers reboot the host. Also gated by CAP_SYS_BOOT. |
| request_key | Prevent containers from using the kernel keyring, which is not namespaced. |
| set_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| setns | Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN. |
| settimeofday | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| stime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| swapon | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
| swapoff | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
| sysfs | Obsolete syscall. |
| _sysctl | Obsolete, replaced by /proc/sys. |
| umount | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
| umount2 | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
| unshare | Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user. |
| uselib | Older syscall related to shared libraries, unused for a long time. |
| userfaultfd | Userspace page fault handling, largely needed for process migration. |
| ustat | Obsolete syscall. |
| vm86 | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
| vm86old | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
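To see whether a given profile file names one of the syscalls in the table, a crude text search can serve as a first pass. The sketch below assumes the profile path is supplied in a PROFILE_PATH environment variable; both that variable and the helper name profile_mentions_syscall are hypothetical. The check only looks for the quoted name anywhere in the file. A real profile may list a syscall under conditions (for example, only when a capability is granted), so a match does not prove the call is unconditionally allowed.

```shell
#!/bin/sh
# Sketch: crude textual check for a syscall name in a seccomp profile file.
# It does not evaluate actions, argument rules, or capability conditions.
profile_mentions_syscall() {
    profile_path=$1
    syscall_name=$2
    grep -qF -- "\"$syscall_name\"" "$profile_path"
}

# Usage: set PROFILE_PATH (hypothetical variable) to a profile file to scan.
if [ -n "${PROFILE_PATH:-}" ]; then
    for name in mount ptrace reboot; do
        if profile_mentions_syscall "$PROFILE_PATH" "$name"; then
            echo "$name: mentioned in profile"
        else
            echo "$name: not mentioned"
        fi
    done
fi
```

A name that is not mentioned at all falls through to the profile's defaultAction, which is how the syscalls in the table end up blocked.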
Run without the default seccomp profile
You can pass unconfined to run a container without the default seccomp profile.
$ docker run --rm -it --security-opt seccomp=unconfined debian:latest \
unshare --map-root-user --user sh -c whoami
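One way to confirm which mode a process is actually running under is the Seccomp field in /proc/self/status, where the kernel reports 0 (disabled), 1 (strict) or 2 (filter). The following is a minimal sketch; the helper name seccomp_mode_name is hypothetical, and the field is absent on kernels built without seccomp.

```shell
#!/bin/sh
# Sketch: translate the Seccomp field of /proc/<pid>/status into a readable name.
seccomp_mode_name() {
    case $1 in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Read the mode of the current process; fall back to "unknown" if the field is missing.
mode=$(awk '/^Seccomp:/ { print $2 }' /proc/self/status 2>/dev/null || true)
echo "seccomp mode: $(seccomp_mode_name "${mode:-}")"
```

Run inside a container started with the default profile, this should report filter; with --security-opt seccomp=unconfined it should report disabled.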