Distributed Coordination and Locking
Distributed locks coordinate processes accessing shared resources across multiple machines. They’re fundamental for preventing concurrent writes to the same record, ensuring single-leader behavior, and guarding non-idempotent operations.
The hard part: distributed environments don’t have shared memory or synchronized clocks. Network failures partition nodes without warning. Long GC pauses can make a healthy node look dead. Building a correct distributed lock is genuinely tricky.
Why It’s Hard
- Partial failures: a node can crash, hang, or drop off the network at any point, and the others can’t tell which happened.
- Clock skew: machines don’t share a clock, so TTLs and timeouts measured on different nodes disagree.
- Network partitions: the network can split without warning, leaving both sides believing they hold the lock.
- GC pauses: a long pause can make a healthy node look dead, or let a lock expire while its holder is still running.
- Atomicity: “check, then set” must be a single atomic step, or two clients can both acquire.
- Fencing: without a way for the protected resource to reject stale holders, expiry alone is not enough.
Common Pitfalls
- No TTL: if the holder crashes, the lock is never released.
- No fencing token: a holder that outlives its lease can keep writing to the protected resource.
- Single point of failure: one Redis instance or one database makes the lock service only as available as that node.
- Non-atomic acquire: a GET followed by a SET, instead of one atomic operation, lets two clients acquire at once.
The Implementation Options
RDBMS locks
Relational databases offer two approaches you can use without any extra infrastructure: advisory locks (`pg_advisory_lock` in Postgres, `GET_LOCK` in MySQL) and row-level locks taken with `SELECT ... FOR UPDATE`.
Best for: scenarios where all clients use the same database, no need for cross-DB coordination, simple use cases like “guard one scheduled job.”
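A minimal sketch of the advisory-lock approach, assuming Go’s database/sql with a Postgres driver; the `runWithJobLock` helper and the lock ID are illustrative names, not part of any library:

```go
package joblock

import (
    "context"
    "database/sql"
    // plus a blank import of a Postgres driver, e.g. _ "github.com/lib/pq"
)

// runWithJobLock guards a job with a session-level Postgres advisory lock.
// lockID is an arbitrary application-chosen 64-bit key.
func runWithJobLock(ctx context.Context, db *sql.DB, lockID int64, job func(context.Context) error) error {
    // Advisory locks belong to a session, so pin one connection from the pool.
    conn, err := db.Conn(ctx)
    if err != nil {
        return err
    }
    defer conn.Close()

    var acquired bool
    // pg_try_advisory_lock returns immediately: true only if this session got the lock.
    if err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", lockID).Scan(&acquired); err != nil {
        return err
    }
    if !acquired {
        return nil // another instance is already running the job
    }
    defer conn.ExecContext(context.Background(), "SELECT pg_advisory_unlock($1)", lockID)

    return job(ctx)
}
```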
Redis simple lock
The canonical Redis lock pattern:
SET resource_key my_random_value NX PX 30000
NX = set only if key doesn’t exist. PX 30000 = expire in 30 seconds. my_random_value = unique token chosen by client.
To release, use a Lua script that checks the value matches before deleting:
if redis.call("get", KEYS[1]) == ARGV[1]
then return redis.call("del", KEYS[1])
else return 0
end
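Glued together on the client side, a minimal sketch assuming the go-redis v9 client (the `Acquire`/`Release` helpers are illustrative names, not library functions):

```go
package redislock

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "time"

    "github.com/redis/go-redis/v9"
)

// The release script from above: delete only if the stored token is ours.
const releaseScript = `if redis.call("get", KEYS[1]) == ARGV[1]
then return redis.call("del", KEYS[1])
else return 0
end`

// Acquire runs SET key token NX PX ttl and reports whether we got the lock.
func Acquire(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (token string, ok bool, err error) {
    buf := make([]byte, 16)
    if _, err = rand.Read(buf); err != nil {
        return "", false, err
    }
    token = hex.EncodeToString(buf) // the per-client "my_random_value"
    ok, err = rdb.SetNX(ctx, key, token, ttl).Result()
    return token, ok, err
}

// Release deletes the key only if this client still owns it.
func Release(ctx context.Context, rdb *redis.Client, key, token string) error {
    return rdb.Eval(ctx, releaseScript, []string{key}, token).Err()
}
```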
Pros: very fast and simple, and the lock auto-expires if the client crashes. Cons: a single Redis instance is a SPOF; replication is asynchronous, so a failover can leave two clients each believing they hold the lock; there is no fencing token and no safety guarantee beyond the TTL.
Redlock
Redis’s multi-node lock algorithm. It uses N independent Redis masters (typically 5); a sketch of the acquisition loop follows the list.
- Client attempts `SET NX PX` on each Redis master, recording how long each attempt takes.
- If the set succeeds on a majority (3 of 5) and the total elapsed time is less than the TTL, the lock is acquired.
- Effective validity = TTL − elapsed time.
- On failure, delete any partial locks and retry later.
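For concreteness, a sketch of the acquisition phase only, reusing the go-redis client and the single-node `SET` from the previous sketch; per-node timeouts and cleanup of partial locks are omitted, so this is not a complete Redlock implementation:

```go
// acquireRedlock attempts SET NX PX on every independent master and keeps
// the lock only with a majority of grants and time still left on the TTL.
func acquireRedlock(ctx context.Context, masters []*redis.Client, key, token string, ttl time.Duration) (time.Duration, bool) {
    start := time.Now()
    granted := 0
    for _, m := range masters {
        if got, err := m.SetNX(ctx, key, token, ttl).Result(); err == nil && got {
            granted++
        }
    }
    validity := ttl - time.Since(start) // effective validity = TTL - elapsed
    if granted >= len(masters)/2+1 && validity > 0 {
        return validity, true
    }
    // Failure: the caller should delete any partial locks and retry later.
    return 0, false
}
```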
ZooKeeper locks
ZooKeeper’s classic lock recipe uses ephemeral sequential znodes:
- Client creates `/locks/mylock/lock-NNNNN` with the `EPHEMERAL` + `SEQUENTIAL` flags.
- The sequential flag appends a monotonically-increasing suffix.
- Client lists the children and checks whether its znode has the lowest number.
- If yes → it holds the lock.
- If no → it sets a watch on the immediate predecessor and waits for it to be deleted.
- Release = delete the znode (or it auto-deletes when the client disconnects).
The cost: you need to run and maintain a ZooKeeper ensemble.
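A sketch of the recipe above, assuming the github.com/go-zookeeper/zk client and an existing `/locks/mylock` parent znode; herd-effect and session-loss handling are simplified:

```go
package zklock

import (
    "path"
    "sort"

    "github.com/go-zookeeper/zk"
)

// Lock blocks until this client's znode has the lowest sequence number.
// It returns the znode path; deleting that znode releases the lock.
func Lock(conn *zk.Conn) (string, error) {
    // 1. Create an ephemeral, sequential znode under the lock parent.
    me, err := conn.Create("/locks/mylock/lock-", nil,
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        return "", err
    }
    mine := path.Base(me)

    for {
        // 2. List the children and check whether ours is the lowest.
        children, _, err := conn.Children("/locks/mylock")
        if err != nil {
            return "", err
        }
        sort.Strings(children)
        if children[0] == mine {
            return me, nil // lowest sequence number: we hold the lock
        }

        // 3. Watch the immediate predecessor and wait for it to disappear.
        prev := children[0]
        for _, c := range children {
            if c >= mine {
                break
            }
            prev = c
        }
        exists, _, events, err := conn.ExistsW("/locks/mylock/" + prev)
        if err != nil {
            return "", err
        }
        if exists {
            <-events // fires when the predecessor znode is deleted (or changes)
        }
    }
}
```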
etcd locks
etcd is a Raft-based KV store with built-in concurrency primitives. Its lock pattern combines a lease (TTL) with an atomic compare-and-swap transaction:
// assumes cli is an established *clientv3.Client (go.etcd.io/etcd/client/v3)
lease, err := cli.Grant(ctx, 10) // lease with a 10s TTL
resp, err := cli.Txn(ctx).
    If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key must not exist yet
    Then(clientv3.OpPut(key, "locked", clientv3.WithLease(lease.ID))).
    Commit()
// resp.Succeeded reports whether the lock was acquired
This atomically creates the key under the lease only if it does not already exist; if the transaction fails, another client holds the lock. The lock is held until it is explicitly released or the lease expires.
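etcd’s client also wraps this pattern in a higher-level concurrency package; a sketch, assuming an already-connected client (the `WithLock` helper is an illustrative name):

```go
package etcdlock

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

// WithLock runs fn while holding a distributed mutex under the given key prefix.
func WithLock(ctx context.Context, cli *clientv3.Client, prefix string, fn func(context.Context) error) error {
    // The session creates and keeps alive the lease that backs the lock.
    sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
    if err != nil {
        return err
    }
    defer sess.Close()

    mu := concurrency.NewMutex(sess, prefix)
    if err := mu.Lock(ctx); err != nil { // blocks until acquired
        return err
    }
    defer mu.Unlock(context.Background()) // best-effort release; lease expiry is the backstop

    return fn(ctx)
}
```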
Fencing Tokens: The Safety Mechanism
A fencing token is a monotonically-increasing number issued with the lock. The protected resource (database, storage system) checks the token on every operation and rejects operations with stale tokens.
1. Acquire lock → get token N
2. Operation: write_data(payload, token=N)
3. Storage compares N against the highest token it has seen and rejects the operation if N is smaller.
This protects against the “paused holder” problem. Even if your client paused for an hour past TTL, when it resumes and tries to write with token N, the storage rejects because someone else has token N+1.
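What the storage-side check might look like, as a sketch; `fencedStore` and its fields are hypothetical stand-ins for whatever system sits behind the lock:

```go
package fencing

import (
    "errors"
    "sync"
)

var ErrStaleToken = errors.New("fencing token is stale: a newer lock holder exists")

// fencedStore is a toy protected resource that remembers the highest token seen.
type fencedStore struct {
    mu          sync.Mutex
    latestToken uint64
    data        []byte
}

// Write applies the payload only if the token is at least as new as any seen so far.
func (s *fencedStore) Write(payload []byte, token uint64) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if token < s.latestToken {
        return ErrStaleToken // the paused holder's token N loses to token N+1
    }
    s.latestToken = token
    s.data = payload
    return nil
}
```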
Choosing the Right Tool
| Use case | Recommendation |
|---|---|
| Simple mutex inside one Postgres app | Postgres advisory locks |
| Cache invalidation, transient locks | Redis simple lock with TTL |
| Multi-DB coordination, ok with rare anomalies | Redlock (with caveats) |
| Correctness-critical (financial, leader election) | ZooKeeper or etcd, with fencing tokens |
| Already running K8s | etcd Lock API (you have it anyway) |
Recap
- Local locks coordinate one process. Distributed locks coordinate machines that don’t share memory.
- The hard parts: partial failures, clock skew, network partitions, GC pauses, atomicity, fencing.
- Common pitfalls: no TTL, no fencing tokens, single point of failure, non-atomic acquire.
- Tools: RDBMS advisory locks (simple), Redis (fast but unsafe under adversarial conditions), Redlock (controversial), ZooKeeper / etcd (strongly consistent, with fencing).
- Fencing tokens are the difference between “looks safe” and “actually safe” — but require storage cooperation.
- For correctness-critical use cases, prefer ZooKeeper or etcd. For convenience, Redis is fine.