Published February 2026 · 11 min read

CRDT Gossip: How NANDA Nodes Stay in Sync

A decentralized agent registry can't depend on a single source of truth. We use Last-Writer-Wins CRDTs and gossip protocols to synchronize AgentAddr records across the NANDA mesh — no consensus needed.


The Sync Problem

The NANDA Index isn't a single centralized registry. As described in the Registry Quilt architecture, it's a federation of independent nodes — each operated by a different organization, each authoritative for its own agents, but collectively forming a global discovery layer.

This creates a distributed systems problem: when Nexartis registers an agent, how does MIT's node know about it? When MIT updates an agent's metadata, how does the rest of the mesh see the change? Traditional solutions — Raft consensus, Paxos, primary/replica — require tight coordination and don't scale across independent organizations with different availability guarantees.

Design constraint. NANDA nodes are operated by independent organizations. They go offline independently, restart independently, and may have network partitions between them. The sync protocol must handle all of this gracefully — no split-brain, no data loss, no coordination overhead.

LWW-Register CRDTs

CRDTs (Conflict-free Replicated Data Types) are data structures that can be replicated across multiple nodes and merged without coordination — mathematically guaranteed to converge to the same state. No consensus protocol needed.

We use the Last-Writer-Wins Register (LWW-Register) variant. Each AgentAddr record carries a timestamp, and when two nodes have conflicting values for the same agent, the one with the later timestamp wins. Simple, deterministic, and partition-tolerant.

LWW-Register Merge

function merge(local: CrdtEntry, remote: CrdtEntry): CrdtEntry {
  // Later timestamp always wins
  if (remote.timestamp > local.timestamp) return remote;
  // Tie-break: higher agent_id wins (deterministic)
  if (remote.timestamp === local.timestamp
      && remote.agent_id > local.agent_id) return remote;
  return local;
}

The tie-breaking rule (higher agent_id wins on equal timestamps) ensures that even in the pathological case of simultaneous writes, all nodes converge to the same value. The merge function is commutative, associative, and idempotent — the three properties that make CRDTs work.
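These three properties can be checked directly. A minimal sketch, using a simplified `CrdtEntry` shape (the real record carries more fields) and the merge rule above:

```typescript
// Simplified entry shape for illustration only.
interface CrdtEntry {
  agent_id: string;
  timestamp: number;
  value: string;
}

// Same rule as above: later timestamp wins, higher agent_id breaks ties.
function merge(local: CrdtEntry, remote: CrdtEntry): CrdtEntry {
  if (remote.timestamp > local.timestamp) return remote;
  if (remote.timestamp === local.timestamp
      && remote.agent_id > local.agent_id) return remote;
  return local;
}

const a: CrdtEntry = { agent_id: "agent-a", timestamp: 100, value: "addr-1" };
const b: CrdtEntry = { agent_id: "agent-b", timestamp: 100, value: "addr-2" };
const c: CrdtEntry = { agent_id: "agent-c", timestamp: 99, value: "addr-3" };

// Commutative: merge order does not matter.
console.assert(merge(a, b) === merge(b, a));
// Idempotent: merging an entry with itself changes nothing.
console.assert(merge(a, a) === a);
// Associative: grouping does not matter.
console.assert(merge(merge(a, b), c) === merge(a, merge(b, c)));
```

Because merges commute and are idempotent, nodes can receive the same updates in any order, any number of times, and still end up identical.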

This is a perfect fit for AgentAddr records: they're small (≤120 bytes), each agent has exactly one authoritative owner, and the most recent registration is always the correct one. More complex CRDT types (G-Counter, OR-Set) would add unnecessary complexity for this use case.
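To make the size claim concrete, here is a hypothetical `AgentAddr` shape; the field names are assumptions, since the article only guarantees that a record is small and maps one agent to one authoritative location:

```typescript
// Hypothetical AgentAddr shape; the real schema may differ.
interface AgentAddr {
  agent_id: string;   // globally unique agent identifier
  facts_url: string;  // where the agent's metadata can be resolved
  timestamp: number;  // LWW timestamp used by the merge rule
}

const record: AgentAddr = {
  agent_id: "agent-7f3a",
  facts_url: "https://example.org/agents/7f3a",
  timestamp: 1760000000,
};

// Even as plain JSON, a record like this stays under the 120-byte budget.
const bytes = Buffer.byteLength(JSON.stringify(record), "utf8");
console.assert(bytes <= 120);
```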

The Gossip Protocol

CRDTs tell us how to merge. The gossip protocol tells us when to exchange data. Our gossip implementation follows an anti-entropy model: periodically, each node selects a peer and exchanges its recent changes.

Gossip Round

// Every gossip interval (default: 60 seconds per peer)
1. Select random peer from known peer list
2. Compute delta: records changed since last sync with this peer
3. Send delta to peer via POST /federation/gossip
4. Receive peer's delta in response
5. Merge received records using LWW-Register merge
6. Update local state (stale KV cache entries expire via TTL)
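The round above can be sketched end to end as an in-memory model. The per-peer `lastSync` watermark used for delta computation is an assumption; the real implementation exchanges deltas over HTTP and persists winners to D1:

```typescript
interface CrdtEntry {
  agent_id: string;
  timestamp: number;
  value: string;
}

// LWW merge, as defined earlier in the article.
function merge(local: CrdtEntry, remote: CrdtEntry): CrdtEntry {
  if (remote.timestamp > local.timestamp) return remote;
  if (remote.timestamp === local.timestamp
      && remote.agent_id > local.agent_id) return remote;
  return local;
}

class MeshNode {
  store = new Map<string, CrdtEntry>();
  lastSync = new Map<string, number>(); // peer id -> watermark timestamp

  put(entry: CrdtEntry) {
    const local = this.store.get(entry.agent_id);
    this.store.set(entry.agent_id, local ? merge(local, entry) : entry);
  }

  // Step 2: records changed since the last sync with this peer.
  delta(peerId: string): CrdtEntry[] {
    const since = this.lastSync.get(peerId) ?? 0;
    return [...this.store.values()].filter(e => e.timestamp > since);
  }

  // Steps 3-5: exchange deltas with a peer and merge both ways.
  gossipWith(peerId: string, peer: MeshNode, selfId: string) {
    const outbound = this.delta(peerId);
    const inbound = peer.delta(selfId);
    for (const e of inbound) this.put(e);
    for (const e of outbound) peer.put(e);
    const now = Math.max(0, ...[...this.store.values()].map(e => e.timestamp));
    this.lastSync.set(peerId, now);
    peer.lastSync.set(selfId, now);
  }
}

// Two nodes register different agents; one round converges them.
const mit = new MeshNode();
const nexartis = new MeshNode();
mit.put({ agent_id: "agent-x", timestamp: 10, value: "addr-x" });
nexartis.put({ agent_id: "agent-y", timestamp: 12, value: "addr-y" });
mit.gossipWith("nexartis", nexartis, "mit");
console.assert(mit.store.size === 2 && nexartis.store.size === 2);
```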

The gossip interval balances freshness against bandwidth. With a 60-second per-peer rate limit, a new registration propagates to all nodes within a few rounds: typically under two minutes for a mesh of 10 nodes. Reducing the interval improves freshness at the cost of more network traffic. The interval is configurable per-node.

Each gossip message includes the sender's node_id, a vector clock for consistency tracking, and the batch of changed CRDT entries. The receiver merges each entry using the LWW-Register merge function, updating its local D1 store for any entries where the remote value wins.
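The message shape and the clock merge can be sketched as follows; the field names are assumptions based on the description above, and the standard way to combine vector clocks is an element-wise maximum:

```typescript
// A vector clock maps node_id -> highest sequence number seen from that node.
type VectorClock = Record<string, number>;

// Hypothetical gossip message shape, per the description above.
interface GossipMessage {
  node_id: string;
  clock: VectorClock;
  entries: { agent_id: string; timestamp: number; value: string }[];
}

// Element-wise maximum: the merged clock dominates both inputs.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const out: VectorClock = { ...a };
  for (const [node, seq] of Object.entries(b)) {
    out[node] = Math.max(out[node] ?? 0, seq);
  }
  return out;
}

const merged = mergeClocks({ mit: 5, nexartis: 2 }, { nexartis: 7, acme: 1 });
// The merged clock covers everything either side has seen.
console.assert(merged.mit === 5 && merged.nexartis === 7 && merged.acme === 1);
```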

Federation & the Registry Quilt

The gossip protocol is the engine that powers the Registry Quilt — our model for federated agent discovery where each node is a "patch" that stitches together with peers to form a seamless global registry.

Federation adds an authentication layer on top of gossip. Each peer-to-peer connection is authenticated using the NANDA_FEDERATION_ADMIN_KEY shared secret via Authorization: Bearer headers. This prevents unauthorized nodes from injecting records into the mesh while keeping the protocol simple — no PKI infrastructure needed for the initial deployment.
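A sketch of what the authenticated gossip POST might look like; the endpoint path comes from the round description above, but the client code and body shape are assumptions:

```typescript
// Build the authenticated gossip request. The admin key is the shared
// NANDA_FEDERATION_ADMIN_KEY secret; the body shape is illustrative.
function buildGossipRequest(peerUrl: string, adminKey: string, body: unknown) {
  return {
    url: `${peerUrl}/federation/gossip`,
    method: "POST" as const,
    headers: {
      "Authorization": `Bearer ${adminKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  };
}

const req = buildGossipRequest("https://index.example.edu", "s3cret", {
  node_id: "node-a",
  entries: [],
});
console.assert(req.headers.Authorization === "Bearer s3cret");
// The request can then be sent with fetch(req.url, req); a peer that
// receives a non-matching bearer token rejects the round.
```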

When a gossip merge updates a local record, the resolver's KV cache is immediately invalidated for that agent. This means a query that arrives after a gossip update will get fresh data from D1, which is then cached in KV. The result: eventual consistency with a convergence window of gossip_interval + cache_TTL in the worst case.

Propagation time (10 nodes): < 2 min
Push interval per peer: 60 s
Coordination overhead: 0

Quilt Routing & SafeSearch

The Quilt architecture doesn't just replicate data — it routes queries intelligently. When a node receives a resolution request for an agent it doesn't hold locally, it checks its peer registry. If a peer is known to be authoritative for that agent's namespace, the query is forwarded to that peer and the result is cached locally.
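That routing decision can be sketched as below; the namespace format (`org/agent`), the peer-registry lookup, and the forwarding interface are all assumptions:

```typescript
interface AgentRecord { agent_id: string; addr: string; }
type PeerResolver = (id: string) => AgentRecord | undefined;

class QuiltNode {
  local = new Map<string, AgentRecord>();
  cache = new Map<string, AgentRecord>();
  peers = new Map<string, PeerResolver>(); // namespace -> authoritative peer

  resolve(agentId: string): AgentRecord | undefined {
    // 1. Try locally held records and the resolver cache first.
    const hit = this.local.get(agentId) ?? this.cache.get(agentId);
    if (hit) return hit;
    // 2. Forward to the peer authoritative for this namespace, if known.
    const namespace = agentId.split("/")[0];
    const result = this.peers.get(namespace)?.(agentId);
    // 3. Cache the forwarded result for subsequent queries.
    if (result) this.cache.set(agentId, result);
    return result;
  }
}

const node = new QuiltNode();
node.peers.set("mit", id => ({ agent_id: id, addr: "https://mit.example/agents" }));
const r = node.resolve("mit/tutor-01");
console.assert(r !== undefined && node.cache.has("mit/tutor-01"));
```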

This routing layer also enables Agentic SafeSearch. Each AgentFacts document includes content flags (financial_advice, medical_content, adult_content) that flow through the gossip mesh. Orchestrators can filter discovery results by content policy — a children's education platform can request only kid-safe agents, and the NANDA Index enforces this at the resolution layer.
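Filtering discovery results by content policy could look like this; the flag names come from the article, but the `AgentFacts` shape and policy semantics are otherwise assumptions:

```typescript
// Content flags as described above, part of each AgentFacts document.
interface ContentFlags {
  financial_advice: boolean;
  medical_content: boolean;
  adult_content: boolean;
}

interface AgentFacts {
  agent_id: string;
  flags: ContentFlags;
}

// Keep only agents whose flags the caller's policy allows:
// a flag set to false in the policy means "not allowed".
function safeSearch(results: AgentFacts[], policy: ContentFlags): AgentFacts[] {
  return results.filter(a =>
    (Object.keys(policy) as (keyof ContentFlags)[])
      .every(flag => policy[flag] || !a.flags[flag]));
}

const agents: AgentFacts[] = [
  { agent_id: "math-tutor",
    flags: { financial_advice: false, medical_content: false, adult_content: false } },
  { agent_id: "stock-bot",
    flags: { financial_advice: true, medical_content: false, adult_content: false } },
];

// A children's education platform allows none of the flagged categories.
const kidSafe = safeSearch(agents, {
  financial_advice: false, medical_content: false, adult_content: false,
});
console.assert(kidSafe.length === 1 && kidSafe[0].agent_id === "math-tutor");
```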

SafeSearch isn't just a tag filter. It integrates with NANDA's Zero-Trust Agent Architecture (ZTAA) — content flags are part of the verifiable credential chain, meaning they're cryptographically attested, not self-declared. A node can reject agents whose content flags fail verification, adding a trust layer to content filtering that traditional safe search can't provide.

Why Not Consensus?

A natural question: why CRDTs instead of Raft or Paxos? The answer is operational independence. NANDA nodes are operated by different organizations. MIT, Nexartis, enterprise customers — each runs their own node with their own availability SLAs. A consensus protocol would require a majority of nodes to be online for any write to succeed. CRDTs let each node operate independently, accepting writes locally and syncing asynchronously.

The tradeoff is eventual consistency rather than strong consistency. For agent discovery, this is the right tradeoff. An agent that was registered 60 seconds ago but hasn't propagated to all nodes yet is a minor inconvenience. An agent registry that goes down because half the nodes are offline is a catastrophic failure. CRDTs choose availability over consistency — exactly what the CAP theorem says we must choose for a partition-tolerant distributed system.

For the formal foundations, see Shapiro et al.'s original CRDT paper and the NANDA Index paper's discussion of federation consistency models.
