parking_lot: ffffffffffffffff...
Source: fly.io Date: 2025-06-09 URL: https://fly.io/blog/parking-lot-ffffffffffffffff/
Summary
Engineering writeup documenting a critical concurrency bug in the RwLock implementation of parking_lot, a widely used Rust synchronization library. When a write-lock timeout coincided with a reader unlock under specific timing, lock state bits were released twice (a bitwise “double free”), corrupting the lock word to 0xFFFFFFFFFFFFFFFF and deadlocking the affected proxy. Fly found the bug while refactoring fly-proxy to regionalize its Anycast router. The parking_lot maintainers confirmed the bug and released a fix quickly; Fly added improved lock instrumentation to detect future waker and concurrency issues.
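As a toy illustration of how releasing the same state bit twice can leave a word of all ones, the sketch below clears a flag by subtraction from two "parties" in sequence. The names `WRITER_PARKED`, `clear_by_sub`, `clear_by_mask`, and `double_clear` are hypothetical, and the single-bit layout is an assumption made for the example; this is not parking_lot's actual code or its actual patch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical flag bit for this sketch; not parking_lot's real bit layout.
const WRITER_PARKED: u64 = 0b1;

/// Clears the flag by subtraction. Only correct if the bit is currently set.
/// Returns the value stored after the operation.
fn clear_by_sub(state: &AtomicU64) -> u64 {
    // fetch_sub wraps on underflow and returns the previous value.
    state
        .fetch_sub(WRITER_PARKED, Ordering::AcqRel)
        .wrapping_sub(WRITER_PARKED)
}

/// Clears the flag by masking. Idempotent: clearing twice is harmless.
/// Returns the value stored after the operation.
fn clear_by_mask(state: &AtomicU64) -> u64 {
    state.fetch_and(!WRITER_PARKED, Ordering::AcqRel) & !WRITER_PARKED
}

/// Simulates two parties (say, a timed-out writer and an unlocking reader)
/// that each believe they are responsible for releasing the flag.
fn double_clear(clear: fn(&AtomicU64) -> u64) -> u64 {
    let state = AtomicU64::new(WRITER_PARKED);
    clear(&state); // first party releases the bit
    clear(&state) // second party releases it again; this is the final word
}
```

Calling `double_clear(clear_by_sub)` leaves the word at `u64::MAX`, while `double_clear(clear_by_mask)` leaves it at zero. The masked variant is shown only as one idempotent alternative, not as a description of the real fix.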
Implications
Infrastructure substrate / edge deployment economics. This is a signal about the failure modes of async Rust at Fly’s scale: timing-dependent corruption of a lock word in a widely used locking library caused production outages before it was caught. The discovery path (symptom → regionalization → isolation → root cause) is worth noting: global state masked the bug until Fly started isolating regions. For teams building high-throughput async Rust systems, the lesson is that, in pre-fix versions of parking_lot, RwLock write-lock timeouts racing with reader unlocks can trigger this bug, and that better waker and lock instrumentation plus watchdog timers are a sound mitigation posture. The fix has landed in parking_lot; verify that your version includes it.
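The watchdog idea can be sketched with a heartbeat that a worker bumps whenever it makes progress, so that a supervisor can notice a task stuck behind a wedged lock. This is a minimal sketch using only the standard library; the `Heartbeat` type and its methods are hypothetical and are not taken from fly-proxy or parking_lot.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical heartbeat shared between a worker and a watchdog.
struct Heartbeat {
    start: Instant,
    /// Milliseconds since `start` at the most recent beat.
    last_ms: AtomicU64,
}

impl Heartbeat {
    fn new() -> Arc<Self> {
        Arc::new(Heartbeat {
            start: Instant::now(),
            last_ms: AtomicU64::new(0),
        })
    }

    /// Called by the worker each time it makes progress,
    /// for example right after acquiring and releasing a lock.
    fn beat(&self) {
        let now_ms = self.start.elapsed().as_millis() as u64;
        self.last_ms.store(now_ms, Ordering::Release);
    }

    /// Called by the watchdog: true if no beat arrived within `limit`.
    fn is_stalled(&self, limit: Duration) -> bool {
        let last = Duration::from_millis(self.last_ms.load(Ordering::Acquire));
        self.start.elapsed().saturating_sub(last) > limit
    }
}
```

A supervisor thread would poll `is_stalled` on a timer and, when it returns true, dump diagnostics or restart the process. What to do on a stall is a policy choice that this sketch leaves open.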