X Tutup
Skip to content

fix(agent): serialize pings in stats collection#22868

Open
mafredri wants to merge 1 commit intomainfrom
fix/cap-concurrent-pings
Open

fix(agent): serialize pings in stats collection#22868
mafredri wants to merge 1 commit intomainfrom
fix/cap-concurrent-pings

Conversation

@mafredri
Copy link
Member

@mafredri mafredri commented Mar 9, 2026

Problem

During network disruption, agent.Collect() spawns one goroutine per
active peer to call Ping(). All N goroutines attempt to acquire
magicsock.Conn.mu simultaneously. The 5-second pingCtx timeout is
ineffective because pingWithType spawns a fire-and-forget goroutine
that remains blocked on the lock after the context expires, still
tracked by the wgengine watchdog. With 4 peers, this creates ~80s of
blocked goroutines, exceeding the 45-second watchdog threshold and
killing the agent via log.Fatalf.

Solution

Add a capacity-1 channel semaphore so at most one Ping is in-flight at
a time. Each goroutine acquires the semaphore before calling Ping (with
a pingCtx.Done() fallback) and releases it after. This reduces
watchdog exposure from N * timeout to 1 * timeout.

The peer-ping loop is extracted into collectPeerLatencies() behind a
peerPinger interface to enable direct testing of the concurrency
constraint.

Testing

  • New TestCollectPeerLatencies_SerializedPings: creates 5 fake peers
    with a 10ms ping delay, asserts max concurrent pings never exceeds 1
    using an atomic high-water mark counter.
  • Verified the test fails without the semaphore (concurrent count = 5).
  • Existing TestStatsReporter and TestAgent_Stats_* tests pass.

Refs #22864

Cap concurrent Ping calls in Collect() to one at a time using a
channel semaphore. During network disruption, N parallel Pings
pile up on magicsock.Conn.mu, and orphaned goroutines from expired
pingCtx still trigger the wgengine watchdog. With 4 peers this
meant ~80s of blocked goroutines vs the 45s watchdog threshold.

Extracts collectPeerLatencies() behind a peerPinger interface for
direct testing of the concurrency constraint.

Refs #22864
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

X Tutup