Capsa is experimental software. APIs may change without notice.
Linux Jail Security Model

On Linux, capsa-vmm jails itself for defense-in-depth isolation. This document explains the threat model, implementation, and configuration.

Security Layers

Capsa applies multiple independent security layers. Each layer is applied if possible, even if other layers fail:

| Layer | When Applied | Requires |
| --- | --- | --- |
| Hardware isolation | Always | KVM (/dev/kvm) |
| vCPU thread seccomp | Always (per vCPU thread) | None |
| PR_SET_NO_NEW_PRIVS | Always (if jail config present) | None |
| Capability dropping | Always (if jail config present) | None |
| VMM thread seccomp | Always (if jail config present) | None |
| Namespaces (mount, IPC) | If unshare succeeds | Kernel support, sometimes root |
| Pivot root | If namespaces succeed | Mount namespace |

This means that even if namespace creation fails (for example, with EPERM when the process is unprivileged), you still get:

  • Hardware virtualization isolation (guest cannot access host memory)
  • vCPU seccomp (limits what a VM escape could do)
  • VMM seccomp (limits VMM syscalls to a safe subset)
  • No new privileges (prevents setuid escalation)
  • No capabilities (prevents privileged operations)
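The "apply what you can" behavior can be sketched as follows. This is a minimal model, not capsa-vmm's code: the `Layer` and `Outcome` types, the layer names, and the order of the list are all assumptions made for illustration, and the stand-in functions make no syscalls.

```rust
use std::collections::HashSet;

/// Outcome of attempting one isolation layer (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Applied,
    Failed(String),
    /// Not attempted because the named prerequisite layer did not apply.
    Skipped(&'static str),
}

/// One layer: a name, an optional prerequisite layer, and the action to run.
pub struct Layer {
    pub name: &'static str,
    pub requires: Option<&'static str>,
    pub apply: fn() -> Result<(), String>,
}

/// Attempts every layer in order. A failure never stops later layers;
/// a layer is skipped only when its own prerequisite did not apply.
pub fn apply_layers(layers: &[Layer]) -> Vec<(&'static str, Outcome)> {
    let mut applied: HashSet<&'static str> = HashSet::new();
    let mut report = Vec::new();
    for layer in layers {
        let outcome = match layer.requires {
            Some(dep) if !applied.contains(dep) => Outcome::Skipped(dep),
            _ => match (layer.apply)() {
                Ok(()) => {
                    applied.insert(layer.name);
                    Outcome::Applied
                }
                Err(e) => Outcome::Failed(e),
            },
        };
        report.push((layer.name, outcome));
    }
    report
}

// Stand-ins for the real operations (assumed for the example).
fn ok() -> Result<(), String> {
    Ok(())
}
fn eperm() -> Result<(), String> {
    Err("EPERM".to_string())
}

/// A layer list modeled on the table above, with namespace creation failing.
pub fn unprivileged_example() -> Vec<Layer> {
    vec![
        Layer { name: "no_new_privs", requires: None, apply: ok },
        Layer { name: "drop_capabilities", requires: None, apply: ok },
        Layer { name: "namespaces", requires: None, apply: eperm },
        Layer { name: "pivot_root", requires: Some("namespaces"), apply: ok },
        Layer { name: "vmm_seccomp", requires: None, apply: ok },
    ]
}
```

Running `apply_layers(&unprivileged_example())` reports `namespaces` as failed and `pivot_root` as skipped, while the remaining layers are still applied.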

Threat Model

Guest code is assumed untrusted. The jail protects against VMM compromise leading to host access. If an attacker exploits a bug in the VMM process (device emulation, virtio handling, etc.), the jail limits what they can do:

| Attack Vector | Mitigation |
| --- | --- |
| Arbitrary file read/write | Pivot root to empty tmpfs; only bind-mounted paths accessible |
| Shared directory escape | openat2(RESOLVE_BENEATH) confines access; absolute symlinks rejected |
| Process escape via ptrace | Seccomp blocks ptrace, process_vm_* syscalls |
| Privilege escalation | All capabilities dropped; user namespace maps root to an unprivileged user |
| Network access | No network namespace escape; only established connections |
| Kernel exploitation | Seccomp reduces attack surface by blocking unnecessary syscalls |
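To make the shared-directory row concrete, here is a purely lexical pre-check written with the standard library. It is not openat2(RESOLVE_BENEATH) and it is not how Capsa confines access: the kernel flag also follows symlinks during resolution, which a lexical check cannot do, and it permits `..` components that stay beneath the root, which this sketch rejects outright. The function names are hypothetical.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `requested` onto `root`, refusing anything that could lexically
/// leave `root`. Returns None for absolute paths and for any `..` component.
pub fn confine(root: &Path, requested: &Path) -> Option<PathBuf> {
    let mut out = root.to_path_buf();
    for component in requested.components() {
        match component {
            Component::Normal(part) => out.push(part),
            Component::CurDir => {}
            // RootDir or Prefix means an absolute path; ParentDir could climb out.
            Component::RootDir | Component::Prefix(_) | Component::ParentDir => {
                return None;
            }
        }
    }
    Some(out)
}

/// Models the "absolute symlinks rejected" rule for a symlink target.
pub fn symlink_target_allowed(target: &Path) -> bool {
    !target.is_absolute()
}
```

For example, `confine(Path::new("/srv/share"), Path::new("../etc/passwd"))` returns `None`, while a relative path such as `a/b.txt` is joined under the root. The `/srv/share` path is only a placeholder.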

What the Jail Does NOT Protect Against

The jail is not a VM escape prevention mechanism. These threats are outside its scope:

  • Kernel bugs: A kernel vulnerability could bypass all userspace isolation
  • Side channels: Spectre/Meltdown-style attacks operate below the jail layer
  • KVM vulnerabilities: Bugs in the hypervisor itself bypass all userspace protection
  • Hardware attacks: DMA, firmware, or physical access attacks
  • External filesystem modification: Changes to shared directories from the host side are trusted; the jail only protects against guest-initiated escapes

The jail is defense-in-depth: it raises the bar for exploitation but doesn't guarantee containment.

Implementation

The jail applies isolation layers in a specific order, with fallback behavior when layers fail:

1. Always-Applied Layers (No Special Privileges Required)

These layers need no special privileges. They are applied whenever a jail config is present, regardless of whether the later layers succeed:

PR_SET_NO_NEW_PRIVS: Prevents privilege escalation via setuid binaries. Always succeeds.

Capability dropping: All Linux capabilities are cleared. The VMM runs with CapEff: 0000000000000000.

VMM thread seccomp: Restricts syscalls for the main VMM thread. Applied after the Tokio runtime starts.

vCPU thread seccomp: Each vCPU thread applies its own strict filter before entering the KVM_RUN loop.

2. Namespace Isolation (May Require Privileges)

These layers require kernel support and may need root or user namespace support:

Mount namespace: Isolates filesystem view. Required for pivot_root.

IPC namespace: Isolates System V IPC and POSIX message queues.

User namespace (optional): Maps root in the namespace to unprivileged user outside.

If namespace creation fails (EPERM), the VMM logs a warning and continues with the always-applied layers.

3. Filesystem Isolation (Requires Mount Namespace)

  • Creates a minimal tmpfs root
  • Bind-mounts only required paths (e.g., /dev/kvm)
  • Uses pivot_root to change the root filesystem
  • Unmounts old root to prevent access
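The ordering of these four steps can be written down as data. The sketch below only builds a plan and performs no mounts; the `Step` and `BindMount` types and the jail root path used in the usage note are assumptions for illustration, not Capsa's internal types.

```rust
use std::path::PathBuf;

/// One bind mount to expose inside the jail (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct BindMount {
    pub source: PathBuf,
    pub target: PathBuf,
    pub read_only: bool,
}

/// The filesystem isolation steps, in the order listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    MountTmpfsRoot { new_root: PathBuf },
    Bind(BindMount),
    PivotRoot { new_root: PathBuf },
    UnmountOldRoot,
}

/// Builds the ordered plan: tmpfs root, then every bind mount,
/// then pivot_root, then unmounting the old root.
pub fn plan(new_root: &str, binds: &[BindMount]) -> Vec<Step> {
    let root = PathBuf::from(new_root);
    let mut steps = vec![Step::MountTmpfsRoot { new_root: root.clone() }];
    steps.extend(binds.iter().cloned().map(Step::Bind));
    steps.push(Step::PivotRoot { new_root: root });
    steps.push(Step::UnmountOldRoot);
    steps
}
```

Calling `plan("/tmp/jail-root", &binds)` with a single `/dev/kvm` bind mount yields four steps. The bind mounts have to come before the pivot because the host paths are no longer reachable once the old root is unmounted.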

Seccomp Filter Details

Two seccomp profiles are applied to different threads:

VMM thread filter: Allows syscalls needed for Tokio async runtime, file I/O, and RPC handling. Blocks dangerous syscalls like ptrace, execve, mount.

vCPU thread filter: Minimal syscall allowlist for KVM ioctl operations only. Much more restrictive than the VMM filter.

Blocked syscalls return EPERM. The calling code receives an error and typically logs it.
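The allowlist behavior can be modeled in a few lines. This is a toy model, not a real seccomp-BPF filter, and the syscall lists are illustrative assumptions rather than the sets capsa-vmm actually permits; only the EPERM result for blocked calls comes from the text above.

```rust
/// errno value for EPERM on Linux.
pub const EPERM: i32 = 1;

/// What the filter does with a syscall.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Allow,
    Errno(i32),
}

/// A named allowlist (hypothetical type).
pub struct Profile {
    pub name: &'static str,
    pub allowed: &'static [&'static str],
}

// Illustrative allowlists only: NOT the syscalls capsa-vmm actually permits.
pub const VMM: Profile = Profile {
    name: "vmm",
    allowed: &["read", "write", "openat", "epoll_wait", "futex", "ioctl"],
};
pub const VCPU: Profile = Profile {
    name: "vcpu",
    allowed: &["ioctl", "futex"],
};

impl Profile {
    /// Anything not on the allowlist fails with EPERM instead of killing
    /// the thread, so the caller sees an ordinary error.
    pub fn check(&self, syscall: &str) -> Verdict {
        if self.allowed.iter().any(|&name| name == syscall) {
            Verdict::Allow
        } else {
            Verdict::Errno(EPERM)
        }
    }
}
```

In this model `VMM.check("ptrace")` returns `Verdict::Errno(EPERM)`, and `VCPU.check("openat")` is also denied because the vCPU list is the narrower of the two.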

Configuration

When using capsa through the standard API (e.g., capsa::sandbox() or capsa::vm()), a JailConfig is automatically provided to the VMM subprocess with sensible defaults. The "no jail config" case only applies to:

  • Direct capsa-vmm invocation without fd 5
  • Custom spawners that don't use SpawnConfig::with_jail_config()

Disabling the Jail

For debugging or when jail features are unavailable:

```bash
# Via command-line flag
capsa-vmm --no-jail

# Via environment variable (for tests)
CAPSA_NO_JAIL=1 cargo test
```

Bind Mounts

The spawner configures bind mounts via JailConfig:

```rust,ignore
let mut config = JailConfig::new();
config.add_bind_mount("/dev/kvm", "/dev/kvm");
config.add_readonly_bind_mount("/etc/resolv.conf", "/etc/resolv.conf");

let spawn_config = SpawnConfig::new().with_jail_config(config);
```

Default bind mounts include:

  • /dev/kvm (required for KVM)
  • /dev/null, /dev/zero, /dev/urandom (standard devices)

Troubleshooting

User Namespaces Disabled

If you see:

```text
Warning: user namespace creation failed (Operation not permitted)
```

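User namespaces may be disabled on your system. The sysctl below is a distribution patch found on Debian and Ubuntu kernels, so the file may not exist elsewhere. Where it does exist, check: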

```bash
cat /proc/sys/kernel/unprivileged_userns_clone
# 0 = disabled, 1 = enabled
```

To enable (requires root):

```bash
sudo sysctl kernel.unprivileged_userns_clone=1
```

Without user namespaces, the jail falls back to privileged mode (requires root) or continues without full isolation.
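The same check can be done from Rust, for instance in a test harness that wants to decide up front whether full isolation is likely to work. This is a sketch under assumptions: the `UsernsStatus` type and function names are made up for the example, and a missing file is reported as unknown rather than disabled because the sysctl is not present on every kernel.

```rust
use std::fs;

/// Result of probing the unprivileged user namespace sysctl (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum UsernsStatus {
    Enabled,
    Disabled,
    /// The sysctl file is missing, unreadable, or holds an unexpected value.
    Unknown,
}

/// Interprets the sysctl contents: "1" means enabled, "0" means disabled.
pub fn interpret(contents: Option<&str>) -> UsernsStatus {
    match contents.map(str::trim) {
        Some("1") => UsernsStatus::Enabled,
        Some("0") => UsernsStatus::Disabled,
        _ => UsernsStatus::Unknown,
    }
}

/// Reads the sysctl from procfs and interprets it.
pub fn check_unprivileged_userns() -> UsernsStatus {
    match fs::read_to_string("/proc/sys/kernel/unprivileged_userns_clone") {
        Ok(text) => interpret(Some(&text)),
        Err(_) => interpret(None),
    }
}
```

An `Unknown` result does not mean user namespaces are unavailable; attempting namespace creation and handling EPERM remains the authoritative test.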

Permission Errors

If you see filesystem setup errors:

```text
Warning: failed to enter jail: filesystem setup failed
```

Common causes:

  • /tmp is mounted noexec or nosuid
  • Insufficient permissions to create directories
  • Another VMM instance is using the same jail root

The VMM logs a warning and continues without jailing. This is intentional: availability is prioritized over strict isolation in development.

Seccomp Blocking Legitimate Syscalls

If the VMM hangs or crashes unexpectedly, seccomp may be blocking a required syscall. Run with debug logging:

```bash
RUST_LOG=capsa_vmm=debug capsa-vmm
```

If you identify a missing syscall, report it as a bug.

Verification

After jail setup, the VMM verifies isolation:

  • Capabilities: Reads /proc/self/status to confirm CapEff: 0
  • Working directory: Confirms cwd is /
  • Mount namespace: Confirms namespace link is readable
  • Seccomp: Confirms seccomp mode is 2 (filter mode, actively filtering syscalls)

Verification failures are logged as warnings but don't abort the VMM.
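Two of these checks read fields from /proc/self/status, and the parsing can be sketched as below. The function names are hypothetical and this is not the VMM's verification code; it only shows how the `CapEff` and `Seccomp` lines could be interpreted, returning warnings instead of failing so that it matches the log-and-continue behavior described above.

```rust
/// Returns the value of `field` (for example "CapEff") from the text of
/// /proc/<pid>/status, where each line has the form "Key:\tvalue".
pub fn status_field<'a>(status: &'a str, field: &str) -> Option<&'a str> {
    status.lines().find_map(|line| {
        let (key, value) = line.split_once(':')?;
        if key == field {
            Some(value.trim())
        } else {
            None
        }
    })
}

/// True when the effective capability mask parses as hex zero.
pub fn caps_cleared(status: &str) -> bool {
    status_field(status, "CapEff").and_then(|hex| u64::from_str_radix(hex, 16).ok()) == Some(0)
}

/// True when the seccomp mode is 2 (filter mode).
pub fn seccomp_filtering(status: &str) -> bool {
    status_field(status, "Seccomp") == Some("2")
}

/// Collects warnings rather than aborting on the first failed check.
pub fn verify(status: &str) -> Vec<&'static str> {
    let mut warnings = Vec::new();
    if !caps_cleared(status) {
        warnings.push("effective capabilities are not empty");
    }
    if !seccomp_filtering(status) {
        warnings.push("seccomp is not in filter mode");
    }
    warnings
}
```

A caller would pass in the contents of the file, for example `verify(&std::fs::read_to_string("/proc/self/status")?)`, and log each returned warning.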

Released under the MIT License.