This content originally appeared on Level Up Coding - Medium and was authored by Mohammed Shamim
Seccomp — Secure Computing Mode | Kubernetes | Docker
Seccomp for Docker and Kubernetes
In this article, we will discuss seccomp. Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox a process by restricting the calls it is able to make from user space into kernel space. Kubernetes lets us automatically apply seccomp profiles loaded onto a node to Pods and containers. Before jumping into seccomp, let’s discuss container isolation, user space, kernel space, and system calls.
Container Isolation
A container is nothing but a process. Containers are isolated from the host operating system and other processes running on the host by using namespaces and cgroups.
Namespaces restrict what a process can see, such as users, the filesystem, and other processes. For instance: which processes can a given process see?
cgroups restrict the resource usage of processes (CPU, RAM, disk). For instance: how much CPU can a process use?
User Space and Kernel Space
Linux divides its memory into two distinct spaces:
User space is the virtual memory region where user applications run.
Kernel space is the virtual memory region where the core of the operating system (the kernel) runs.
Containers run in user space. So how do they communicate with kernel space to mount volumes or read files from the filesystem?
The answer is by using system calls.
When an application or process running on Linux wants to use a resource managed by the kernel, such as reading a file or creating a process, it makes a system call to the kernel. The kernel then performs the requested operation and returns control to the calling program.
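To make this concrete, here is a minimal sketch in Python. The functions in the os module are thin wrappers around the Linux system calls named in the comments; the temporary file and the message are only illustrative.

```python
import os
import tempfile

# Create and open a temporary file; mkstemp returns a raw file descriptor.
fd, path = tempfile.mkstemp()

# Each call below crosses from user space into kernel space.
os.write(fd, b"hello from user space\n")  # write(2)
os.lseek(fd, 0, os.SEEK_SET)              # lseek(2): rewind to the start
data = os.read(fd, 100)                   # read(2)
os.close(fd)                              # close(2)
os.unlink(path)                           # unlink(2): remove the file

print(data.decode(), end="")
```

A seccomp profile that blocked any of these calls would make the corresponding line fail with an error.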

Containers share the same kernel with each other and with the host. So it would be possible for a container to use system calls to delete filesystems or write to files that require privileges. This makes containers less secure than virtual machines, because each virtual machine has its own kernel.
So the question is: how can we restrict the system calls a container makes?
To restrict system calls from containers, we can use seccomp (secure computing mode). With seccomp we can limit the syscalls a process or container can make to the Linux kernel.
Check whether seccomp is enabled in the kernel:
$ grep -i seccomp /boot/config-$(uname -r)
--------------------------------------------------------------------
CONFIG_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y #'y' Indicates that the Seccomp feature is enabled
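The kernel config tells us seccomp is available. A process can also check which seccomp mode it is currently running under by reading the Seccomp field of /proc/self/status (0 is disabled, 1 is strict, 2 is filter). The sketch below assumes a Linux host; the helper names are hypothetical, and on other systems the lookup simply returns None.

```python
from typing import Optional

SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text: str) -> Optional[int]:
    """Return the numeric value of the 'Seccomp:' field, or None if absent."""
    for line in status_text.splitlines():
        # 'Seccomp_filters:' is a different field and must not match.
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


def current_seccomp_mode() -> Optional[int]:
    """Read the seccomp mode of this process, or None if it is unavailable."""
    try:
        with open("/proc/self/status") as f:
            return parse_seccomp_mode(f.read())
    except OSError:
        return None


mode = current_seccomp_mode()
print(SECCOMP_MODES.get(mode, "unknown"))
```

Run inside a container started with a seccomp profile, this would be expected to report the filter mode.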
Seccomp Profile
When creating a container or pod, we can pass a seccomp profile that determines which system calls the container or pod can make.
Custom seccomp profiles are written in JSON. A basic seccomp profile has three main elements: defaultAction, architectures, and syscalls:
{
  "defaultAction": "",
  "architectures": [],
  "syscalls": [
    {
      "names": [],
      "action": ""
    }
  ]
}
In the syscalls section, we list system calls under the "names" array. They are allowed or blocked depending on what is set as "action".
In the architectures section, we define which architectures we are targeting. This is essential because the seccomp filter operates at the kernel level, where filtering uses syscall numbers rather than the names we list in syscalls.names, and those numbers differ between architectures.
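To see why the architecture matters, the small table below lists the syscall numbers of a few calls on x86_64 and on 32-bit x86, taken from the Linux syscall tables. The dictionary and helper are only for illustration and are not part of any seccomp tooling.

```python
# Syscall numbers for a few calls on two architectures.
SYSCALL_NUMBERS = {
    "SCMP_ARCH_X86_64": {"read": 0, "write": 1, "open": 2, "execve": 59, "exit": 60},
    "SCMP_ARCH_X86": {"read": 3, "write": 4, "open": 5, "execve": 11, "exit": 1},
}


def numbers_for(name):
    """Return the number of one syscall on each architecture in the table."""
    return {arch: table[name] for arch, table in SYSCALL_NUMBERS.items()}


for name in ("write", "execve", "exit"):
    print(name, numbers_for(name))
```

Number 1 means write on x86_64 but exit on 32-bit x86, so a filter built for the wrong architecture would match the wrong calls.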
defaultAction defines what will happen if no matching system call is found inside the syscalls list.
There are several patterns for creating a seccomp profile. Let’s discuss some of them briefly:
Whitelisting System Calls
Using the following pattern we can whitelist only those system calls we want to allow from a process.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "pselect6",
        "getsockname",
        ..
        ..
        "execve",
        "exit"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
In the profile above, several system calls are listed under syscalls.names, and syscalls.action is set to "SCMP_ACT_ALLOW", which means only the listed system calls are allowed to execute.
But what happens if no match is found? Because "defaultAction" is set to "SCMP_ACT_ERRNO", any system call that is not in the syscalls.names list is blocked: the call fails with an error instead of executing.
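The lookup described above can be modeled in a few lines of Python. This is a simplified sketch, not how the kernel implements it: real filtering runs in the kernel on syscall numbers, and profiles can also match on arguments. The function name action_for is hypothetical, and the profile contains only the four names visible in the example, without the elided ones.

```python
def action_for(profile, syscall_name):
    """Return the action a profile assigns to a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule["names"]:
            return rule["action"]
    # No rule lists this syscall, so the default applies.
    return profile["defaultAction"]


allowlist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["pselect6", "getsockname", "execve", "exit"],
            "action": "SCMP_ACT_ALLOW",
        }
    ],
}

print(action_for(allowlist, "execve"))  # listed, so allowed
print(action_for(allowlist, "ptrace"))  # not listed, so the default applies
```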
Blacklisting System Calls
In contrast, a seccomp profile that follows the pattern below blacklists the system calls we want to restrict and allows all others.
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "pselect6",
        "getsockname",
        ..
        ..
        ..
        "execve",
        "exit"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
Now "defaultAction" is set to "SCMP_ACT_ALLOW" and syscalls.action is set to "SCMP_ACT_ERRNO", which means all system calls listed in syscalls are blacklisted and all others are allowed.
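The two patterns differ only in which action goes on the list and which becomes the default. The sketch below generates either kind of profile from a list of names; make_profile is a hypothetical helper, and the two syscall names are just sample input.

```python
import json

ARCHITECTURES = ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"]


def make_profile(names, allow_listed):
    """Build a whitelist profile (allow_listed=True) or a blacklist profile."""
    if allow_listed:
        listed_action, default_action = "SCMP_ACT_ALLOW", "SCMP_ACT_ERRNO"
    else:
        listed_action, default_action = "SCMP_ACT_ERRNO", "SCMP_ACT_ALLOW"
    return {
        "defaultAction": default_action,
        "architectures": ARCHITECTURES,
        "syscalls": [{"names": sorted(names), "action": listed_action}],
    }


blacklist = make_profile(["ptrace", "keyctl"], allow_listed=False)
print(json.dumps(blacklist, indent=2))
```

Writing the result with json.dump produces a file in the same format as the hand-written profiles.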
Auditing System Calls
To audit the system calls, we can use the following seccomp profile:
{
  "defaultAction": "SCMP_ACT_LOG"
}
With the profile above, the seccomp filter does not block any syscalls, but every syscall is logged on the host, for example in the /var/log/syslog file.
If required, we can use "SCMP_ACT_LOG", "SCMP_ACT_ALLOW", and "SCMP_ACT_ERRNO" together. The following profile allows certain syscalls, blocks others, and logs every syscall that is in neither the allowed list nor the blocked list.
{
  "defaultAction": "SCMP_ACT_LOG",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "mmap",
        "gettid",
        "tgkill",
        "rt_sigaction"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": [
        "keyctl",
        "ptrace"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
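Once a profile has more than one rule, it is easy to list the same syscall under two different actions by mistake. The sketch below is a small lint for that case. find_conflicts is a hypothetical helper, and it does not model how a container runtime would actually resolve such a duplicate.

```python
from collections import defaultdict


def find_conflicts(profile):
    """Return syscall names that appear under more than one action."""
    actions_by_name = defaultdict(set)
    for rule in profile.get("syscalls", []):
        for name in rule["names"]:
            actions_by_name[name].add(rule["action"])
    return {n: sorted(a) for n, a in actions_by_name.items() if len(a) > 1}


mixed = {
    "defaultAction": "SCMP_ACT_LOG",
    "syscalls": [
        {"names": ["mmap", "gettid", "tgkill", "rt_sigaction"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["keyctl", "ptrace"], "action": "SCMP_ACT_ERRNO"},
    ],
}

print(find_conflicts(mixed))  # the profile from the article has no overlaps
```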
Seccomp profile for Docker containers
By default, a Docker container runs with Docker's default seccomp profile, which is described in Docker's seccomp documentation.
If we want, we can replace the default profile by passing a custom profile with the --security-opt option when creating a container:
$ docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/custom.json \
hello-world
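Before passing a custom profile to docker run, it can help to confirm that the file is valid JSON and contains the elements described earlier. The sketch below does only that. check_profile is a hypothetical helper rather than a Docker feature, and it does not verify that action or syscall names are real.

```python
import json


def check_profile(text):
    """Return a list of problems found in a seccomp profile given as JSON text."""
    try:
        profile = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(profile, dict):
        return ["top level must be a JSON object"]

    problems = []
    if not profile.get("defaultAction"):
        problems.append("defaultAction is missing or empty")
    for i, rule in enumerate(profile.get("syscalls", [])):
        if not rule.get("names"):
            problems.append(f"syscalls[{i}] has no names")
        if not rule.get("action"):
            problems.append(f"syscalls[{i}] has no action")
    return problems


print(check_profile('{"defaultAction": "SCMP_ACT_LOG"}'))
print(check_profile('{"defaultAction": "", "syscalls": [{"names": [], "action": ""}]}'))
```

Note that a profile copied with the ".." placeholders from the abbreviated examples would be rejected here as invalid JSON.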
Seccomp for Kubernetes
To assign a seccomp profile to a pod, we have to place the profile's JSON file in a directory on the node where the kubelet can read it when the pod is started on that node.
As per the v1.25 documentation, the default root directory of the kubelet is /var/lib/kubelet, and Localhost profiles are looked up under its seccomp subdirectory.
Let’s say we want to attach the following seccomp profile, named custom.json, to new pods.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept4",
        "epoll_wait",
        "pselect6",
        "futex",
        "madvise",
        "epoll_ctl",
        "getsockname",
        "setsockopt",
        "vfork",
        "mmap",
        "read",
        "write",
        "close",
        "arch_prctl",
        "sched_getaffinity",
        "munmap",
        "brk",
        "rt_sigaction",
        "rt_sigprocmask",
        "sigaltstack",
        "gettid",
        "clone",
        "bind",
        "socket",
        "openat",
        "readlinkat",
        "exit_group",
        "epoll_create1",
        "listen",
        "rt_sigreturn",
        "sched_yield",
        "clock_gettime",
        "connect",
        "dup2",
        "epoll_pwait",
        "execve",
        "exit",
        "fcntl",
        "getpid",
        "getuid",
        "ioctl",
        "mprotect",
        "nanosleep",
        "open",
        "poll",
        "recvfrom",
        "sendto",
        "set_tid_address",
        "setitimer",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Move the custom.json file into the seccomp directory under the kubelet root directory so that the kubelet can access it directly.
# create new directory under kubelet root directory
$ mkdir -p /var/lib/kubelet/seccomp/profiles
# move "custom.json"
$ mv custom.json /var/lib/kubelet/seccomp/profiles/
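The same step can be scripted. The sketch below copies a profile into the seccomp directory under a given kubelet root and returns the path relative to that seccomp directory, which is the value a pod manifest uses for localhostProfile. install_profile is a hypothetical helper, it copies rather than moves, and the demo runs against a temporary directory standing in for /var/lib/kubelet.

```python
import shutil
import tempfile
from pathlib import Path


def install_profile(src, kubelet_root, subdir="profiles"):
    """Copy a profile under <kubelet_root>/seccomp/<subdir>; return its relative path."""
    seccomp_root = Path(kubelet_root) / "seccomp"
    target_dir = seccomp_root / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(src).name
    shutil.copy(src, target)
    return target.relative_to(seccomp_root).as_posix()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "custom.json"
    src.write_text('{"defaultAction": "SCMP_ACT_LOG"}')
    fake_root = Path(tmp) / "kubelet"  # stand-in for /var/lib/kubelet
    localhost_profile = install_profile(src, fake_root)
    installed = (fake_root / "seccomp" / localhost_profile).exists()

print(localhost_profile)
```

On a real node the script would need write access to the kubelet directory, and the file has to exist on every node where the pod may run.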
Attach a seccompProfile to a pod
To set the seccomp profile for a pod or container, include the seccompProfile field in the securityContext section of the Pod or container manifest.
There are three types of seccompProfile:
Localhost: a seccomp profile defined in a file on the node where the pod will be scheduled.
RuntimeDefault: the container runtime's default profile should be used.
Unconfined: no profile should be applied (the default if no profile is defined).
Seccomp Profile — Localhost
# type "localhost"
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: my-profiles/profile-allow.json
seccompProfile.type indicates which kind of seccomp profile will be applied.
seccompProfile.localhostProfile points to a seccomp profile defined in a file on the node where the pod will run. The profile must be preconfigured on the node to work. The path must be relative to the kubelet’s seccomp profile directory (by default /var/lib/kubelet/seccomp), and the field must only be set if the type is Localhost.
Seccomp Profile — RuntimeDefault
# type "RuntimeDefault"
securityContext:
  seccompProfile:
    type: RuntimeDefault
As we discussed, container runtimes come with a default seccomp profile. If we set seccompProfile.type to RuntimeDefault, the pod will use the container runtime's default profile.
Now, create a new pod named pod-1 that attaches the custom.json file as its seccompProfile under the pod’s securityContext section.
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  labels:
    app: pod-1
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/custom.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
Setting allowPrivilegeEscalation to false on the container ensures that its processes cannot gain more privileges than their parent process.
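If manifests are generated rather than written by hand, the same pod can be built as a plain dictionary and written out as JSON, which kubectl also accepts. This is a sketch under that assumption. pod_with_seccomp is a hypothetical helper, and the values are the ones used in the manifest above.

```python
import json


def pod_with_seccomp(name, image, localhost_profile, args=None):
    """Build a Pod manifest that applies a Localhost seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # Pod-level securityContext: applies the profile to all containers.
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    "localhostProfile": localhost_profile,
                }
            },
            "containers": [
                {
                    "name": "test-container",
                    "image": image,
                    "args": args or [],
                    "securityContext": {"allowPrivilegeEscalation": False},
                }
            ],
        },
    }


manifest = pod_with_seccomp(
    "pod-1",
    "hashicorp/http-echo:0.2.3",
    "profiles/custom.json",
    ["-text=just made some syscalls!"],
)
print(json.dumps(manifest, indent=2))
```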
And finally, create the pod:
$ kubectl create -f pod-1.yaml
# list the pod-1
$ kubectl get pods
---------------------------------------------------------------------------
NAME    READY   STATUS    RESTARTS      AGE
pod-1   1/1     Running   5 (90s ago)   3m3s
As we can see, pod-1 is in the Running state. This indicates that the syscalls permitted for pod-1 are sufficient for it to operate.
References
- Restrict a Container's Syscalls with seccomp
- Seccomp in Kubernetes — Part I: 7 things you should know before you even start!

Mohammed Shamim | Sciencx (2022-11-16T03:16:54+00:00) Seccomp — Secure Computing Mode | Kubernetes | Docker. Retrieved from https://www.scien.cx/2022/11/16/seccomp-secure-computing-mode-kubernetes-docker/