Linux Jail Security Model
On Linux, capsa-vmm jails itself for defense-in-depth isolation. This document explains the threat model, implementation, and configuration.
Security Layers
Capsa applies multiple independent security layers. Each layer is applied if possible, even if other layers fail:
| Layer | When Applied | Requires |
|---|---|---|
| Hardware isolation | Always | KVM (/dev/kvm) |
| vCPU thread seccomp | Always (per vCPU thread) | None |
| PR_SET_NO_NEW_PRIVS | Always (if jail config present) | None |
| Capability dropping | Always (if jail config present) | None |
| VMM thread seccomp | Always (if jail config present) | None |
| Namespaces (mount, IPC) | If unshare succeeds | Kernel support, sometimes root |
| Pivot root | If namespaces succeed | Mount namespace |
This means that even if namespace creation fails (EPERM on unprivileged systems), you still get:
- Hardware virtualization isolation (guest cannot access host memory)
- vCPU seccomp (limits what a VM escape could do)
- VMM seccomp (limits VMM syscalls to a safe subset)
- No new privileges (prevents setuid escalation)
- No capabilities (prevents privileged operations)
Threat Model
Guest code is assumed untrusted. The jail protects against VMM compromise leading to host access. If an attacker exploits a bug in the VMM process (device emulation, virtio handling, etc.), the jail limits what they can do:
| Attack Vector | Mitigation |
|---|---|
| Arbitrary file read/write | Pivot root to empty tmpfs; only bind-mounted paths accessible |
| Shared directory escape | openat2(RESOLVE_BENEATH) confines access; absolute symlinks rejected |
| Process escape via ptrace | Seccomp blocks ptrace, process_vm_* syscalls |
| Privilege escalation | All capabilities dropped; user namespace maps root to an unprivileged host user |
| Network access | No network namespace escape; only established connections |
| Kernel exploitation | Seccomp reduces attack surface by blocking unnecessary syscalls |
What the Jail Does NOT Protect Against
The jail is not a VM escape prevention mechanism. These threats are outside its scope:
- Kernel bugs: A kernel vulnerability could bypass all userspace isolation
- Side channels: Spectre/Meltdown-style attacks operate below the jail layer
- KVM vulnerabilities: Bugs in the hypervisor itself bypass all userspace protection
- Hardware attacks: DMA, firmware, or physical access attacks
- External filesystem modification: Changes to shared directories from the host side are trusted; the jail only protects against guest-initiated escapes
The jail is defense-in-depth: it raises the bar for exploitation but doesn't guarantee containment.
Implementation
The jail applies isolation layers in a specific order, with fallback behavior when layers fail:
1. Always-Applied Layers (No Special Privileges Required)
These layers are applied unconditionally when a jail config is present:
PR_SET_NO_NEW_PRIVS: Prevents privilege escalation via setuid binaries. Always succeeds.
Capability dropping: All Linux capabilities are cleared. The VMM runs with CapEff: 0000000000000000.
VMM thread seccomp: Restricts syscalls for the main VMM thread. Applied after the Tokio runtime starts.
vCPU thread seccomp: Each vCPU thread applies its own strict filter before entering the KVM_RUN loop.
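As an illustration of the first layer, the following is a minimal sketch of setting and reading back the no_new_privs bit with prctl(2). The helper names `set_no_new_privs` and `no_new_privs_is_set` are hypothetical and are not taken from capsa; the sketch declares `prctl` directly instead of using a bindings crate, and assumes Linux with a toolchain that accepts `unsafe extern` blocks (Rust 1.82 or later).

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};

// Option values from <linux/prctl.h>.
const PR_SET_NO_NEW_PRIVS: c_int = 38;
const PR_GET_NO_NEW_PRIVS: c_int = 39;

unsafe extern "C" {
    // prctl(2) comes from the C library that Rust's std already links on Linux.
    fn prctl(option: c_int, ...) -> c_int;
}

/// Sets no_new_privs for the calling thread (hypothetical helper).
/// Threads and processes created afterwards inherit the bit, and it cannot be cleared.
pub fn set_no_new_privs() -> io::Result<()> {
    let (one, zero): (c_ulong, c_ulong) = (1, 0);
    // SAFETY: this prctl option only flips a per-thread flag and takes no pointers.
    let rc = unsafe { prctl(PR_SET_NO_NEW_PRIVS, one, zero, zero, zero) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

/// Reads the bit back for the calling thread (hypothetical helper).
pub fn no_new_privs_is_set() -> io::Result<bool> {
    let zero: c_ulong = 0;
    // SAFETY: same as above; the call only reads a per-thread flag.
    let rc = unsafe { prctl(PR_GET_NO_NEW_PRIVS, zero, zero, zero, zero) };
    match rc {
        1 => Ok(true),
        0 => Ok(false),
        _ => Err(io::Error::last_os_error()),
    }
}
```

Because the bit is inherited only by threads created after the call, a sketch like this would need to run before any worker threads are spawned to cover the whole process.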
2. Namespace Isolation (May Require Privileges)
These layers require kernel support and may need root or user namespace support:
Mount namespace: Isolates filesystem view. Required for pivot_root.
IPC namespace: Isolates System V IPC and POSIX message queues.
User namespace (optional): Maps root in the namespace to unprivileged user outside.
If namespace creation fails (EPERM), the VMM logs a warning and continues with the always-applied layers.
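The fallback behavior described above can be modeled as a loop that tries each layer, records failures, and keeps going. This is a sketch only: the `Layer` enum, `JailReport`, `apply_layers`, and the exact ordering are assumptions made for illustration, not capsa's actual types or sequence.

```rust
/// Layers named in this document; the ordering used below is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    NoNewPrivs,
    DropCapabilities,
    Namespaces,
    PivotRoot,
    VmmSeccomp,
}

#[derive(Debug, PartialEq, Eq)]
pub struct LayerError(pub &'static str);

/// Which layers ended up active and which were skipped, with the reason.
#[derive(Debug, Default)]
pub struct JailReport {
    pub applied: Vec<Layer>,
    pub skipped: Vec<(Layer, LayerError)>,
}

/// Tries every layer in order. A failing layer is logged and recorded,
/// but never prevents later, independent layers from being applied.
pub fn apply_layers<F>(mut apply: F) -> JailReport
where
    F: FnMut(Layer) -> Result<(), LayerError>,
{
    let order = [
        Layer::NoNewPrivs,
        Layer::DropCapabilities,
        Layer::Namespaces,
        Layer::PivotRoot,
        Layer::VmmSeccomp,
    ];
    let mut report = JailReport::default();
    for &layer in order.iter() {
        // Pivot root depends on the mount namespace, so it is skipped
        // without being attempted when namespace creation failed.
        if layer == Layer::PivotRoot && !report.applied.contains(&Layer::Namespaces) {
            report
                .skipped
                .push((layer, LayerError("mount namespace unavailable")));
            continue;
        }
        match apply(layer) {
            Ok(()) => report.applied.push(layer),
            Err(err) => {
                eprintln!("warning: {:?} not applied: {}", layer, err.0);
                report.skipped.push((layer, err));
            }
        }
    }
    report
}
```

Passing a closure that returns `Err(LayerError("EPERM"))` for `Layer::Namespaces` yields a report in which the namespace and pivot-root layers are skipped while the remaining three are still applied, which mirrors the unprivileged case.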
3. Filesystem Isolation (Requires Mount Namespace)
- Creates a minimal tmpfs root
- Bind-mounts only required paths (e.g., /dev/kvm)
- Uses pivot_root to change the root filesystem
- Unmounts old root to prevent access
Seccomp Filter Details
Two seccomp profiles are applied to different threads:
VMM thread filter: Allows syscalls needed for Tokio async runtime, file I/O, and RPC handling. Blocks dangerous syscalls like ptrace, execve, mount.
vCPU thread filter: Minimal syscall allowlist for KVM ioctl operations only. Much more restrictive than the VMM filter.
Blocked syscalls return EPERM. The calling code receives an error and typically logs it.
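The relationship between the two profiles can be pictured with a small model. This is not a real seccomp filter (that would be a BPF program installed in the kernel) and the syscall lists are illustrative assumptions, not capsa's actual allowlists; the model only shows the allowlist idea and the EPERM result for anything outside it.

```rust
use std::collections::HashSet;

/// Linux errno value for "Operation not permitted".
pub const EPERM: i32 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Errno(i32),
}

/// A toy allowlist profile: anything not listed is refused with EPERM.
pub struct Profile {
    allowed: HashSet<&'static str>,
}

impl Profile {
    pub fn new(allowed: &[&'static str]) -> Self {
        Profile {
            allowed: allowed.iter().copied().collect(),
        }
    }

    pub fn evaluate(&self, syscall: &str) -> Verdict {
        if self.allowed.contains(syscall) {
            Verdict::Allow
        } else {
            Verdict::Errno(EPERM)
        }
    }
}

/// Illustrative VMM-thread profile: async runtime, file I/O, KVM ioctls.
pub fn vmm_profile() -> Profile {
    Profile::new(&[
        "read", "write", "openat", "close", "epoll_wait", "futex", "ioctl",
    ])
}

/// Illustrative vCPU-thread profile: far smaller, centered on KVM ioctls.
pub fn vcpu_profile() -> Profile {
    Profile::new(&["ioctl", "futex", "exit"])
}
```

In this model `vmm_profile().evaluate("ptrace")` yields `Verdict::Errno(EPERM)`, and the vCPU profile additionally refuses file I/O calls such as `openat` that the VMM profile permits.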
Configuration
When using capsa through the standard API (e.g., capsa::sandbox() or capsa::vm()), a JailConfig is automatically provided to the VMM subprocess with sensible defaults. The "no jail config" case only applies to:
- Direct capsa-vmm invocation without fd 5
- Custom spawners that don't use SpawnConfig::with_jail_config()
Disabling the Jail
For debugging or when jail features are unavailable:
# Via command-line flag
capsa-vmm --no-jail
# Via environment variable (for tests)
CAPSA_NO_JAIL=1 cargo test
Bind Mounts
The spawner configures bind mounts via JailConfig:
let mut config = JailConfig::new();
config.add_bind_mount("/dev/kvm", "/dev/kvm");
config.add_readonly_bind_mount("/etc/resolv.conf", "/etc/resolv.conf");
let spawn_config = SpawnConfig::new().with_jail_config(config);
Default bind mounts include:
- /dev/kvm (required for KVM)
- /dev/null, /dev/zero, /dev/urandom (standard devices)
Troubleshooting
User Namespaces Disabled
If you see:
Warning: user namespace creation failed (Operation not permitted)
User namespaces may be disabled on your system. Check:
cat /proc/sys/kernel/unprivileged_userns_clone
# 0 = disabled, 1 = enabled
To enable (requires root):
sudo sysctl kernel.unprivileged_userns_clone=1
Without user namespaces, the jail falls back to privileged mode (requires root) or continues without full isolation.
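The sysctl check can also be done from code, for example in a test harness that wants to decide up front whether full isolation is expected. The sketch below is an assumption-laden illustration: `UsernsStatus`, `parse_userns_clone`, and `unprivileged_userns_status` are hypothetical names, and a missing file is reported as `Unknown` because this sysctl is not present on every kernel.

```rust
use std::fs;

#[derive(Debug, PartialEq, Eq)]
pub enum UsernsStatus {
    Enabled,
    Disabled,
    /// The file is missing or holds an unexpected value.
    Unknown,
}

/// Interprets the contents of the sysctl file: "1" is enabled, "0" is disabled.
pub fn parse_userns_clone(contents: &str) -> UsernsStatus {
    match contents.trim() {
        "1" => UsernsStatus::Enabled,
        "0" => UsernsStatus::Disabled,
        _ => UsernsStatus::Unknown,
    }
}

/// Reads the sysctl from procfs. Any read error maps to Unknown, since the
/// absence of the file says nothing about whether user namespaces work.
pub fn unprivileged_userns_status() -> UsernsStatus {
    match fs::read_to_string("/proc/sys/kernel/unprivileged_userns_clone") {
        Ok(contents) => parse_userns_clone(&contents),
        Err(_) => UsernsStatus::Unknown,
    }
}
```

Treating `Unknown` as "try anyway and rely on the fallback" keeps the check advisory and leaves the real decision to the namespace creation attempt itself.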
Permission Errors
If you see filesystem setup errors:
Warning: failed to enter jail: filesystem setup failed
Common causes:
- /tmp is mounted noexec or nosuid
- Insufficient permissions to create directories
- Another VMM instance is using the same jail root
The VMM logs a warning and continues without jailing. This is intentional: availability is prioritized over strict isolation in development.
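To check the first cause, the mount options that govern a path can be read from /proc/self/mounts. The following sketch parses text in that format and reports whether the mount covering a path carries noexec or nosuid. `MountFlags` and `mount_flags_for` are hypothetical names, and the parser ignores the octal escapes that procfs uses for mount points containing spaces.

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub struct MountFlags {
    pub noexec: bool,
    pub nosuid: bool,
}

/// Finds the mount that covers `target` in /proc/self/mounts-style text and
/// returns its noexec/nosuid flags. The deepest matching mount point wins;
/// among equally deep entries the later one wins, since it shadows earlier ones.
pub fn mount_flags_for(mounts: &str, target: &str) -> Option<MountFlags> {
    let target = Path::new(target);
    let mut best: Option<(usize, MountFlags)> = None;
    for line in mounts.lines() {
        // Line format: device mount_point fstype options dump pass
        let mut fields = line.split_whitespace();
        let (mount_point, options) = match (fields.nth(1), fields.nth(1)) {
            (Some(mount_point), Some(options)) => (mount_point, options),
            _ => continue, // malformed or empty line
        };
        // Path::starts_with compares whole components, so "/tmpfoo" does not match "/tmp".
        if !target.starts_with(mount_point) {
            continue;
        }
        let depth = Path::new(mount_point).components().count();
        let flags = MountFlags {
            noexec: options.split(',').any(|opt| opt == "noexec"),
            nosuid: options.split(',').any(|opt| opt == "nosuid"),
        };
        if best.as_ref().map_or(true, |(best_depth, _)| depth >= *best_depth) {
            best = Some((depth, flags));
        }
    }
    best.map(|(_, flags)| flags)
}
```

A caller would typically pass the result of `std::fs::read_to_string("/proc/self/mounts")` together with the directory in which the jail root is created.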
Seccomp Blocking Legitimate Syscalls
If the VMM hangs or crashes unexpectedly, seccomp may be blocking a required syscall. Run with debug logging:
RUST_LOG=capsa_vmm=debug capsa-vmm
If you identify a missing syscall, report it as a bug.
Verification
After jail setup, the VMM verifies isolation:
- Capabilities: Reads /proc/self/status to confirm CapEff: 0
- Working directory: Confirms cwd is /
- Mount namespace: Confirms namespace link is readable
- Seccomp: Confirms seccomp mode is 2 (filter mode, actively filtering syscalls)
Verification failures are logged as warnings but don't abort the VMM.
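The capability and seccomp checks both come down to reading fields from /proc/self/status. The sketch below shows one way to parse them; `StatusCheck`, `check_status`, and `check_self` are hypothetical names and this is not capsa's verification code. `CapEff` is parsed as a hexadecimal mask, and `Seccomp` is compared against mode 2 (filter).

```rust
use std::fs;
use std::io;

#[derive(Debug, PartialEq, Eq)]
pub struct StatusCheck {
    /// True when the effective capability mask is all zeroes.
    pub cap_eff_zero: bool,
    /// True when the Seccomp field reports mode 2 (filter).
    pub seccomp_filter: bool,
}

/// Returns the trimmed value of a "Key:\tvalue" line with an exact key match,
/// so "Seccomp" does not accidentally match "Seccomp_filters".
fn field<'a>(status: &'a str, name: &str) -> Option<&'a str> {
    status.lines().find_map(|line| {
        let (key, value) = line.split_once(':')?;
        if key == name {
            Some(value.trim())
        } else {
            None
        }
    })
}

/// Evaluates the two checks against text in /proc/<pid>/status format.
/// A missing or unparsable field counts as a failed check.
pub fn check_status(status: &str) -> StatusCheck {
    let cap_eff_zero = field(status, "CapEff")
        .and_then(|value| u64::from_str_radix(value, 16).ok())
        .map_or(false, |caps| caps == 0);
    let seccomp_filter = field(status, "Seccomp") == Some("2");
    StatusCheck {
        cap_eff_zero,
        seccomp_filter,
    }
}

/// Runs the checks against the current process.
pub fn check_self() -> io::Result<StatusCheck> {
    Ok(check_status(&fs::read_to_string("/proc/self/status")?))
}
```

Consistent with the behavior described above, a caller would log a warning for any field that comes back `false` and continue running.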