Corrosion
read at source ↗ fly.io
Corrosion
Source: fly.io Date: 2025-12-10 URL: https://fly.io/blog/corrosion/
Summary
Engineering writeup and open-source announcement for Corrosion, Fly.io’s distributed state synchronization system. Corrosion abandons consensus protocols in favor of gossip-based propagation of a SQLite database across thousands of edge servers, using SWIM membership, QUIC transport, and cr-sqlite (CRDT-based conflict resolution) for last-write-wins semantics. The post is candid about severe incidents: a deadlock that locked the entire proxy fleet, a nullable column migration that triggered cluster-wide reconciliation storms. Mitigations include watchdog timers, checkpoint backups, eliminating partial updates, and regionalization.
Implications
Infrastructure substrate / edge deployment economics. Corrosion is the distributed nervous system of Fly’s fleet — it’s how every Fly Machine knows where every other machine is in real-time, globally, exposed as a SQLite interface. The open-source release turns a proprietary advantage into a community resource, which is Fly’s consistent pattern (Litestream, LiteFS, Corrosion). For the radar, the cr-sqlite/CRDT approach to conflict resolution is worth tracking as a pattern for agent-to-agent shared state — if multiple agents need to coordinate on shared data without a central coordinator, this is a viable architecture. The failure modes documented (deadlocks, reconciliation storms) are the same failure modes any sufficiently large agent fleet will encounter.