fix(agent): serialize pings in stats collection#22868
Open
Conversation
Cap concurrent Ping calls in Collect() to one at a time using a channel semaphore. During network disruption, N parallel Pings pile up on magicsock.Conn.mu, and orphaned goroutines from expired pingCtx still trigger the wgengine watchdog. With 4 peers this meant ~80s of blocked goroutines vs the 45s watchdog threshold. Extracts collectPeerLatencies() behind a peerPinger interface for direct testing of the concurrency constraint. Refs #22864
9a32ccc to
4e551ef
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
During network disruption,
agent.Collect()spawns one goroutine peractive peer to call
Ping(). All N goroutines attempt to acquiremagicsock.Conn.musimultaneously. The 5-secondpingCtxtimeout isineffective because
pingWithTypespawns a fire-and-forget goroutinethat remains blocked on the lock after the context expires, still
tracked by the wgengine watchdog. With 4 peers, this creates ~80s of
blocked goroutines, exceeding the 45-second watchdog threshold and
killing the agent via
log.Fatalf.Solution
Add a capacity-1 channel semaphore so at most one Ping is in-flight at
a time. Each goroutine acquires the semaphore before calling Ping (with
a
pingCtx.Done()fallback) and releases it after. This reduceswatchdog exposure from N * timeout to 1 * timeout.
The peer-ping loop is extracted into
collectPeerLatencies()behind apeerPingerinterface to enable direct testing of the concurrencyconstraint.
Testing
TestCollectPeerLatencies_SerializedPings: creates 5 fake peerswith a 10ms ping delay, asserts max concurrent pings never exceeds 1
using an atomic high-water mark counter.
TestStatsReporterandTestAgent_Stats_*tests pass.Refs #22864