Linux Jail Security Model

On Linux, capsa-vmm jails itself for defense-in-depth isolation. This document explains the threat model, implementation, and configuration.

Security Layers

Capsa applies multiple independent security layers. Each layer is applied if possible, even if other layers fail:

| Layer | When Applied | Requires |
| --- | --- | --- |
| Hardware isolation | Always | KVM (/dev/kvm) |
| vCPU thread seccomp | Always (per vCPU thread) | None |
| PR_SET_NO_NEW_PRIVS | Always (if jail config present) | None |
| Capability dropping | Always (if jail config present) | None |
| VMM thread seccomp | Always (if jail config present) | None |
| Namespaces (mount, IPC) | If unshare succeeds | Kernel support, sometimes root |
| Pivot root | If namespaces succeed | Mount namespace |

This means that even if namespace creation fails (EPERM on unprivileged systems), you still get:

Hardware virtualization isolation (guest cannot access host memory)
vCPU seccomp (limits what a VM escape could do)
VMM seccomp (limits VMM syscalls to a safe subset)
No new privileges (prevents setuid escalation)
No capabilities (prevents privileged operations)
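
The fallback behavior can be sketched as a small planning function. This is only an illustration of the table above, not capsa's actual code: `LayerInputs`, `planned_layers`, and the layer name strings are hypothetical, and the sketch assumes namespaces are only attempted when a jail config is present.

```rust
/// Hypothetical inputs describing what the host allows.
#[derive(Clone, Copy)]
struct LayerInputs {
    jail_config_present: bool,
    unshare_succeeds: bool,
}

/// Returns the layers that would end up active, following the table above.
fn planned_layers(inputs: LayerInputs) -> Vec<&'static str> {
    // Hardware isolation and per-vCPU seccomp do not depend on the jail config.
    let mut layers = vec!["hardware isolation", "vCPU thread seccomp"];

    if inputs.jail_config_present {
        layers.extend_from_slice(&[
            "PR_SET_NO_NEW_PRIVS",
            "capability dropping",
            "VMM thread seccomp",
        ]);

        // Assumption: namespaces are best-effort, and pivot_root is only
        // attempted once a mount namespace exists.
        if inputs.unshare_succeeds {
            layers.extend_from_slice(&["namespaces (mount, IPC)", "pivot root"]);
        }
    }
    layers
}
```

Calling `planned_layers` with `jail_config_present: true` and `unshare_succeeds: false` yields the five layers listed above, which is the EPERM case on unprivileged systems.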

Threat Model

Guest code is assumed untrusted. The jail protects against VMM compromise leading to host access. If an attacker exploits a bug in the VMM process (device emulation, virtio handling, etc.), the jail limits what they can do:

| Attack Vector | Mitigation |
| --- | --- |
| Arbitrary file read/write | Pivot root to empty tmpfs; only bind-mounted paths accessible |
| Shared directory escape | `openat2(RESOLVE_BENEATH)` confines access; absolute symlinks rejected |
| Process escape via ptrace | Seccomp blocks ptrace, process_vm_* syscalls |
| Privilege escalation | All capabilities dropped; user namespace maps root to an unprivileged user |
| Network access | No network namespace escape; only established connections |
| Kernel exploitation | Seccomp reduces attack surface by blocking unnecessary syscalls |
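
To give a feel for what the shared-directory row means by "beneath", here is a purely lexical check in std-only Rust. It is not how capsa or the kernel enforces confinement: `openat2(RESOLVE_BENEATH)` is enforced by the kernel during path resolution and also covers symlinks, which a string-level check like this cannot see. The function name `stays_beneath` is hypothetical.

```rust
use std::path::{Component, Path};

/// Returns true if `rel`, interpreted relative to some base directory,
/// never climbs above that base when its components are walked in order.
/// Lexical only: symlinks are NOT considered.
fn stays_beneath(rel: &Path) -> bool {
    let mut depth: usize = 0;
    for component in rel.components() {
        match component {
            // Absolute paths restart resolution outside the base.
            Component::RootDir | Component::Prefix(_) => return false,
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return false; // ".." would step out of the base directory
                }
                depth -= 1;
            }
            Component::Normal(_) => depth += 1,
        }
    }
    true
}
```

A path such as `a/b/../c` passes, while `../etc/passwd`, `a/../../b`, and `/etc/passwd` are rejected.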

What the Jail Does NOT Protect Against

The jail is not a VM escape prevention mechanism. These threats are outside its scope:

Kernel bugs: A kernel vulnerability could bypass all userspace isolation
Side channels: Spectre/Meltdown-style attacks operate below the jail layer
KVM vulnerabilities: Bugs in the hypervisor itself bypass all userspace protection
Hardware attacks: DMA, firmware, or physical access attacks
External filesystem modification: Changes to shared directories from the host side are trusted; the jail only protects against guest-initiated escapes

The jail is defense-in-depth: it raises the bar for exploitation but doesn't guarantee containment.

Implementation

The jail applies isolation layers in a specific order, with fallback behavior when layers fail:

1. Always-Applied Layers (No Special Privileges Required)

These layers are applied unconditionally when a jail config is present:

PR_SET_NO_NEW_PRIVS: Prevents privilege escalation via setuid binaries. Always succeeds.

Capability dropping: All Linux capabilities are cleared. The VMM runs with CapEff: 0000000000000000.

VMM thread seccomp: Restricts syscalls for the main VMM thread. Applied after the Tokio runtime starts.

vCPU thread seccomp: Each vCPU thread applies its own strict filter before entering the KVM_RUN loop.

2. Namespace Isolation (May Require Privileges)

These layers require kernel support and may need root or user namespace support:

Mount namespace: Isolates filesystem view. Required for pivot_root.

IPC namespace: Isolates System V IPC and POSIX message queues.

User namespace (optional): Maps root in the namespace to unprivileged user outside.

If namespace creation fails (EPERM), the VMM logs a warning and continues with the always-applied layers.

3. Filesystem Isolation (Requires Mount Namespace)

Creates a minimal tmpfs root
Bind-mounts only required paths (e.g., /dev/kvm)
Uses pivot_root to change the root filesystem
Unmounts old root to prevent access

Seccomp Filter Details

Two seccomp profiles are applied to different threads:

VMM thread filter: Allows syscalls needed for Tokio async runtime, file I/O, and RPC handling. Blocks dangerous syscalls like ptrace, execve, mount.

vCPU thread filter: Minimal syscall allowlist for KVM ioctl operations only. Much more restrictive than the VMM filter.

Blocked syscalls return EPERM. The calling code receives an error and typically logs it.
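
The allowlist idea behind the two profiles can be modeled in a few lines. This is a toy model under stated assumptions: real seccomp filters are BPF programs that match syscall numbers and arguments in the kernel, and the syscall lists below are hypothetical placeholders, not capsa's actual profiles.

```rust
use std::collections::HashSet;

/// errno value of EPERM on Linux.
const EPERM: i32 = 1;

/// Toy allowlist filter: anything not listed is rejected with EPERM.
struct Filter {
    allowed: HashSet<&'static str>,
}

impl Filter {
    fn new(allowed: &[&'static str]) -> Self {
        Filter { allowed: allowed.iter().copied().collect() }
    }

    /// Ok(()) if the syscall passes the filter, Err(EPERM) otherwise.
    fn check(&self, syscall: &str) -> Result<(), i32> {
        if self.allowed.contains(syscall) {
            Ok(())
        } else {
            Err(EPERM)
        }
    }
}

/// Hypothetical VMM profile: async runtime and file I/O, no ptrace/execve/mount.
fn vmm_filter() -> Filter {
    Filter::new(&["read", "write", "openat", "close", "epoll_wait", "futex", "ioctl"])
}

/// Hypothetical vCPU profile: a much smaller list centered on KVM ioctls.
fn vcpu_filter() -> Filter {
    Filter::new(&["ioctl", "futex"])
}
```

In this model `vmm_filter().check("ptrace")` returns `Err(EPERM)`, and the vCPU profile additionally rejects file I/O calls such as `openat` that the VMM profile allows.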

Configuration

When using capsa through the standard API (e.g., capsa::sandbox() or capsa::vm()), a JailConfig is automatically provided to the VMM subprocess with sensible defaults. The "no jail config" case only applies to:

Direct capsa-vmm invocation without fd 5
Custom spawners that don't use SpawnConfig::with_jail_config()

Disabling the Jail

For debugging or when jail features are unavailable:

```bash
# Via command-line flag
capsa-vmm --no-jail

# Via environment variable (for tests)
CAPSA_NO_JAIL=1 cargo test
```

Bind Mounts

The spawner configures bind mounts via JailConfig:

```rust,ignore
let mut config = JailConfig::new();
config.add_bind_mount("/dev/kvm", "/dev/kvm");
config.add_readonly_bind_mount("/etc/resolv.conf", "/etc/resolv.conf");

let spawn_config = SpawnConfig::new().with_jail_config(config);
```

Default bind mounts include:

/dev/kvm (required for KVM)
/dev/null, /dev/zero, /dev/urandom (standard devices)

Troubleshooting

User Namespaces Disabled

If you see:

```text
Warning: user namespace creation failed (Operation not permitted)
```

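User namespaces may be disabled on your system. The sysctl below is a distribution patch (Debian, Ubuntu, and some others), so the file may not exist on other kernels. Where it exists, check:

```bash
cat /proc/sys/kernel/unprivileged_userns_clone
# 0 = disabled, 1 = enabled
```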

To enable (requires root):

```bash
sudo sysctl kernel.unprivileged_userns_clone=1
```

Without user namespaces, the jail falls back to privileged mode (requires root) or continues without full isolation.
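
The same check can be made from Rust before spawning the VMM, for example to print a clearer diagnostic. This is a minimal sketch, not part of capsa: the function names are hypothetical, and it assumes a kernel that exposes the `unprivileged_userns_clone` sysctl.

```rust
use std::fs;

/// Interprets the sysctl file contents: "1" means enabled, "0" disabled.
/// Any other content is reported as unknown (None).
fn parse_userns_clone(contents: &str) -> Option<bool> {
    match contents.trim() {
        "0" => Some(false),
        "1" => Some(true),
        _ => None,
    }
}

/// Returns None when the file is absent (kernels without this sysctl)
/// or cannot be read.
fn unprivileged_userns_enabled() -> Option<bool> {
    let raw = fs::read_to_string("/proc/sys/kernel/unprivileged_userns_clone").ok()?;
    parse_userns_clone(&raw)
}
```

A `None` result should be treated as "unknown" rather than "disabled", since many kernels do not have this file at all.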

Permission Errors

If you see filesystem setup errors:

```text
Warning: failed to enter jail: filesystem setup failed
```

Common causes:

/tmp is mounted noexec or nosuid
Insufficient permissions to create directories
Another VMM instance is using the same jail root

The VMM logs a warning and continues without jailing. This is intentional: in development, availability is prioritized over strict isolation.

Seccomp Blocking Legitimate Syscalls

If the VMM hangs or crashes unexpectedly, seccomp may be blocking a required syscall. Run with debug logging:

```bash
RUST_LOG=capsa_vmm=debug capsa-vmm
```

If you identify a missing syscall, report it as a bug.

Verification

After jail setup, the VMM verifies isolation:

Capabilities: Reads /proc/self/status to confirm CapEff: 0
Working directory: Confirms cwd is /
Mount namespace: Confirms namespace link is readable
Seccomp: Confirms seccomp mode is 2 (filter mode, actively filtering syscalls)

Verification failures are logged as warnings but don't abort the VMM.
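
The capability and seccomp checks amount to reading two fields from `/proc/self/status`. The sketch below shows one way to do that with the standard library only. The function names are hypothetical and this is not capsa's verification code. It takes the status text as a parameter so it can be exercised without a jailed process.

```rust
/// Returns the value of `field` from text in /proc/<pid>/status format,
/// where each line looks like "Name:\tvalue".
fn status_field<'a>(status: &'a str, field: &str) -> Option<&'a str> {
    status.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name == field {
            Some(value.trim())
        } else {
            None
        }
    })
}

/// True if the effective capability mask (hexadecimal) is zero.
fn caps_dropped(status: &str) -> bool {
    status_field(status, "CapEff")
        .and_then(|hex| u64::from_str_radix(hex, 16).ok())
        == Some(0)
}

/// True if seccomp is in mode 2 (filter mode).
fn seccomp_filter_active(status: &str) -> bool {
    status_field(status, "Seccomp") == Some("2")
}
```

To check a live process, pass the result of `std::fs::read_to_string("/proc/self/status")` to these functions. A missing or unparsable field counts as a failed check.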
