AODV Routing in an ESP-NOW Mesh: How Packets Find Their Way
Put two BLEShark Nano nodes next to each other and they communicate directly. Put 8 nodes across two floors of a building, where some nodes are out of direct range of others, and you need routing. Some nodes have to relay messages on behalf of others - the mesh needs to know which path to take.
Shiver uses AODV - Ad-hoc On-demand Distance Vector routing - to handle this. AODV is a well-established protocol designed specifically for mobile ad-hoc networks (MANETs): situations where devices may join, leave, or move, and where you don't want to maintain a full routing table for every possible destination all the time.
The "on-demand" part is key. AODV doesn't proactively compute routes to everywhere. It discovers routes when they're needed, caches them briefly, and lets them expire. In a mesh where nodes might be running battery-powered with intermittent radio availability, on-demand routing is more efficient than proactive routing (which would require constantly flooding the network with routing updates).
graph TD
A["Node A"] -->|"Data"| B["Node B"]
B -->|"Data"| C["Node C"]
C -.->|"Link broken"| D["Node D"]
C -->|"RERR"| B
B -->|"RERR"| A
A -->|"New RREQ"| RREQ["Route rediscovery"]
sequenceDiagram
participant A as Node A (source)
participant B as Node B
participant C as Node C
participant D as Node D (destination)
A->>B: RREQ broadcast
A->>C: RREQ broadcast
B->>C: RREQ forward
B->>D: RREQ forward
Note over D: Route found!
D->>B: RREP (unicast back)
B->>A: RREP forwarded
Note over A: Route to D via B cached
Table of Contents
- Why On-Demand Routing
- Route Request: RREQ
- Expanding Ring Search
- Jitter Forwarding
- Route Reply: RREP
- Route Error: RERR
- The Route Table
- Passive Route Learning
- Rate Limits and Safety Valves
Why On-Demand Routing
The alternative to on-demand routing is proactive routing: protocols like OLSR (Optimized Link State Routing) that constantly broadcast link state updates so every node knows the full topology. Proactive routing gives you instantly available routes, but at the cost of constant routing traffic.
In a Shiver mesh, proactive routing would be wasteful. Most nodes don't need to reach most other nodes most of the time. A node running a captive portal only needs to send credential events to the initiator. A node doing a WiFi scan only needs to send results back to whoever requested them. The mesh doesn't benefit from maintaining 16x16 routing tables at all times.
AODV discovers a route when the first packet needs it, caches it for 90 seconds of use (or 180 seconds hard limit), and then lets it expire. If the route is needed again after expiry, discovery runs again. In practice, during an active assessment when commands and results are flowing, routes stay alive. During idle periods between operations, the expired routes are fine - they'll be rediscovered when the next operation starts.
Route Request: RREQ
When Node A needs to reach Node D and has no cached route, it initiates route discovery by broadcasting a Route Request (RREQ).
An RREQ contains:
- Origin eFuse: who is looking for a route (Node A's hardware ID)
- Target eFuse: who we're trying to reach (Node D's hardware ID)
- RREQ ID: a per-origin sequence number that uniquely identifies this particular route request
- TTL: how many hops this RREQ is allowed to travel
- Hop count: how many hops it has already traveled
The RREQ is broadcast to all direct neighbors of Node A. Each neighbor that receives it does two things:
- Installs a reverse route: "to reach Node A, go through the node who just sent this to me." This reverse route is what allows the Route Reply to travel back to the origin.
- If this node is not Node D and TTL permits: forward the RREQ toward Node D's neighbors.
If a node receives an RREQ it has already processed (tracked by origin eFuse + RREQ ID), it drops the duplicate immediately. This prevents the flood from spiraling into an infinite loop.
Expanding Ring Search
Rather than flooding the entire mesh with a high-TTL RREQ immediately, Shiver uses expanding ring search. The discovery starts with TTL=1 (only direct neighbors), waits for a reply, and if none comes within 3 seconds, tries TTL=2 (neighbors of neighbors), and so on.
The sequence:
- TTL=1, wait 3 seconds
- TTL=2, wait 3 seconds
- TTL=3, wait 3 seconds
- TTL=4, wait 3 seconds
- After TTL=4: escalate to full flood (max TTL=15, 20-second timeout)
For a typical 3-4 node mesh in a building, discovery usually succeeds at TTL=1 or TTL=2. The escalation to full flood is a safety net for larger, more spread-out deployments.
Maximum TTL is 15 hops. In practice, a 16-node mesh with reasonable density would rarely need more than 5-6 hops to reach any node. The 15-hop limit exists as an upper bound to prevent runaway flooding in pathological topologies.
Jitter Forwarding
When multiple nodes receive the same RREQ and all try to forward it simultaneously, you get a broadcast storm - multiple copies of the same frame collide on the air and interfere with each other.
Shiver prevents this with jitter forwarding. When a node receives an RREQ to forward, it doesn't forward immediately. Instead, it schedules the forward with a random delay between 20ms and 70ms. The parent table entry records who the RREQ came from (for the reverse route), and a periodic tick function sends the forwarded RREQ when the timer fires.
The jitter spreads out the forwarding across time, reducing collisions. Different nodes pick different random delays, so their forwarded copies of the RREQ arrive at the next hop staggered over a 50ms window rather than all at once.
This is a well-known technique in wireless mesh routing. Without it, RREQs in dense meshes frequently collide and get lost, causing discovery to fail at low hop counts and forcing expensive escalations to higher TTLs or full flood.
Route Reply: RREP
When the RREQ reaches Node D (the target), Node D sends a Route Reply (RREP) back toward Node A. The RREP travels the reverse path established when intermediate nodes processed the RREQ - each node recorded "to reach Node A, go this way," so the RREP follows that path backwards.
Each node that receives and forwards the RREP installs a forward route: "to reach Node D, go through the node that sent me this RREP." By the time the RREP reaches Node A, every intermediate node along the path has a route entry for Node D.
Node A receives the RREP, installs the complete route, and fires a hook callback that resumes whatever was waiting for the route. The delivery that triggered the route discovery proceeds.
Route installation along the path is a standard AODV optimization. If Node B is on the path from A to D, and later Node C (who isn't on the path) needs to reach D, Node C can ask Node B to forward on its behalf without triggering a new route discovery - Node B already has a valid route to D.
Route Error: RERR
Routes become invalid when a neighbor is lost. If Node B is forwarding traffic from A to D, and Node B loses its direct connection to Node C (the next hop toward D), Node B can't forward anymore. It sends a Route Error (RERR) back toward Node A, indicating that Node D is no longer reachable via this path.
The RERR contains the unreachable destination's eFuse and the reporter's eFuse. When Node A receives the RERR, it invalidates its cached route to D and can trigger a new discovery.
Shiver's RERR is 1-hop only - it's sent to the immediate upstream neighbor, not propagated further. In a small mesh, this is sufficient: the originating node A will simply do a new RREQ if it needs to reach D again. Full RERR propagation in larger networks adds complexity and traffic that isn't necessary at Shiver's scale.
RERRs are also generated on neighbor eviction. If Node B's neighbor table evicts Node C (too many missed keepalives, stale), Node B sends a RERR for any routes that used Node C as next hop.
The Route Table
Shiver's route table holds 16 entries. Each entry tracks:
- Destination eFuse: who this route reaches
- Next hop eFuse: which direct neighbor to use
- Hop count: total path length to destination
- Flags: whether the route is valid, stale, or local
- Last used timestamp: for idle expiry
- Created timestamp: for hard expiry
Expiry rules:
- Idle expiry: 90 seconds without use. Routes are cached for active use, not forever.
- Hard expiry: 180 seconds regardless of use. Prevents stale paths from persisting in long-running meshes.
- Local routes (direct neighbors): never evicted automatically. You always know how to reach your direct neighbors.
When the route table fills up (all 16 entries occupied), new route discoveries can still run - they'll just need to find a slot. The table eviction policy prefers stale entries. Valid, recently-used routes are protected.
Passive Route Learning
Shiver installs routes passively from every received data frame, not just from RREQ/RREP. When Node B receives a forwarded data frame from Node A (even if the frame is addressed to Node D and just passing through B), Node B installs a reverse route: "to reach the original sender, use the node who gave me this frame."
This eliminates the need for explicit route discovery for reply paths. If Node A sends a command to Node D via intermediate nodes, Node D already has a route back to Node A by the time the command arrives. The result (success/failure response from Node D) can be delivered immediately without triggering a new RREQ for the return path.
Passive learning significantly reduces routing traffic in practice. Commands and results flow in pairs. With passive learning, only the outbound command triggers discovery (if needed). The return path is free.
Rate Limits and Safety Valves
Uncontrolled route discovery in a mesh can generate significant traffic. If every node simultaneously decides it needs to find routes to all other nodes, the resulting flood of RREQs overwhelms the channel. Shiver applies several rate limits:
Global RREQ interval: At least 1 second between any two RREQ transmissions, regardless of destination. This hard limit prevents a single node from flooding the mesh with discovery traffic.
Per-minute cap: Maximum 60 RREQs per minute globally. If a mesh is generating this many route requests, something is wrong with topology stability - the cap prevents it from getting worse.
No concurrent duplicate discoveries: Only one route discovery per destination can be active at a time. If discovery for Node D is already in progress, a second request for Node D queues up and waits for the first to complete (or fail).
Max concurrent discoveries: Up to 15 simultaneous route discoveries for different destinations. Beyond that, new discoveries queue.
On the retry side, the reliability layer triggers a new route discovery if a retransmission returns "no route." The pending ACK entry stays alive through the new discovery attempt - if a route is found within the backoff window, the retry proceeds. If discovery fails entirely, the reliability layer calls the delivery callback with a failure status.
AODV at Scale
AODV was designed for mobile ad-hoc networks that might have hundreds of nodes in field deployments. In Shiver's context - up to 16 nodes in a building - the protocol is running well below its designed scale. The route table (16 entries) can hold routes to all possible nodes simultaneously. Discovery rarely needs more than TTL=3. The RERR propagation simplification (1-hop only) is entirely adequate.
What AODV gives Shiver at this scale is clean semantics: routes exist when needed, don't exist when not needed, and failures are reported clearly. The alternative (a static routing table configured at setup time) would break every time a node moved or went offline. AODV handles the dynamic reality of battery-powered nodes in physical spaces without requiring manual reconfiguration.