Making Machines Move
Source: fly.io Date: 2025-02-12 URL: https://fly.io/blog/machine-migrations/
Summary
Engineering writeup on live migration of stateful Fly Machines (with attached NVMe volumes) across physical servers with minimal downtime and zero data loss. The solution combines dm-clone (a Linux device mapper target) for lazy block-level copying, iSCSI as the network block protocol across the fleet, and orchestration in flyd. A “stop-clone-boot” sequence replaces “stop-copy-boot”: because cloning is asynchronous, the interruption lasts seconds rather than minutes. The post is candid about complications: encryption keys, LUKS2 header issues, IPv6 6PN routing, Corrosion consistency.
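The lazy-copy idea summarized above can be illustrated with a toy model. This is only a sketch of the general technique, not the dm-clone API or Fly's implementation: the `LazyClone` class and its method names are hypothetical, each list element stands in for one block, and real dm-clone works on fixed-size regions and keeps its hydration bitmap on a separate metadata device.

```python
class LazyClone:
    """Toy model of a lazily cloned block device (hypothetical, not dm-clone's API)."""

    def __init__(self, source):
        self.source = source                    # blocks still living on the old host
        self.dest = [None] * len(source)        # local blocks on the new host
        self.hydrated = [False] * len(source)   # per-block "already copied" bitmap

    def read(self, i):
        # Hydrated blocks are served locally; the rest fall through to the source.
        return self.dest[i] if self.hydrated[i] else self.source[i]

    def write(self, i, data):
        # A whole-block write lands on the destination and marks the block
        # hydrated, so the stale source copy is never consulted again.
        self.dest[i] = data
        self.hydrated[i] = True

    def hydrate_one(self):
        # Background copy: move one not-yet-copied block, if any remain.
        for i, done in enumerate(self.hydrated):
            if not done:
                self.dest[i] = self.source[i]
                self.hydrated[i] = True
                return True
        return False

    def fully_hydrated(self):
        return all(self.hydrated)


source = [b"a", b"b", b"c", b"d"]
clone = LazyClone(source)
first_read = clone.read(2)       # served from the source; nothing copied yet
clone.write(1, b"B")             # new data goes to the destination only
while clone.hydrate_one():       # background hydration finishes the copy
    pass
```

The point of the model is that `read` and `write` work immediately after the clone is created, which is why the Machine can boot on the destination before the copy finishes. Once `fully_hydrated()` returns true, the source is no longer needed.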
Implications
Edge deployment economics / Machines API as agent-runtime substrate. Live migration is critical infrastructure for the Sprites/Machines model: if agents run persistent VMs, those VMs need to move between physical hosts for maintenance without data loss. The dm-clone approach is elegant: because blocks are copied lazily, the VM can boot on the destination before the copy is complete. This is what makes Fly’s VM economics viable at scale, since servers can be drained and reallocated without customer impact. For agent workloads specifically, an agent’s stateful environment (filesystem, scratch databases, in-progress work) survives host migrations transparently.