
fix(agent): make stats and metadata RPC errors non-fatal #22869

Open
mafredri wants to merge 1 commit into main from fix/non-fatal-stats-metadata


@mafredri commented Mar 9, 2026

Problem

The apiConnRoutineManager uses a single errgroup.Group. When
reportMetadata or statsReporter.reportLoop encounters a transient
RPC error (e.g. from network disruption), the error propagates through
the errgroup, canceling its context and tearing down ALL routines,
including critical ones like coordination and DERP map subscription.

During network disruption, these periodic, non-critical RPCs fail first,
forcing a full API reconnection cycle that compounds the disruption.
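
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the agent's code: `group`, `newGroup`, and `runFatal` are hypothetical names, and `group` is a stdlib-only stand-in that assumes the same semantics as an `errgroup.Group` created with a context (the first non-nil error cancels the shared context).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// group is a hypothetical stand-in for an errgroup.Group bound to a context:
// the first routine that returns a non-nil error cancels the shared context.
type group struct {
	wg     sync.WaitGroup
	once   sync.Once
	err    error
	cancel context.CancelFunc
}

func newGroup(parent context.Context) (*group, context.Context) {
	ctx, cancel := context.WithCancel(parent)
	return &group{cancel: cancel}, ctx
}

// Go runs f in its own goroutine and records the first error it sees.
func (g *group) Go(f func() error) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if err := f(); err != nil {
			g.once.Do(func() {
				g.err = err
				g.cancel() // tears down every routine watching ctx
			})
		}
	}()
}

// Wait blocks until all routines have returned and yields the first error.
func (g *group) Wait() error {
	g.wg.Wait()
	g.cancel()
	return g.err
}

// runFatal starts one "critical" routine that runs until its context is
// canceled and one "non-critical" routine that returns a transient error.
// It reports whether the critical routine was torn down as a side effect.
func runFatal() (criticalStopped bool, err error) {
	g, ctx := newGroup(context.Background())
	stopped := make(chan struct{})

	g.Go(func() error { // stands in for coordination or DERP map subscription
		<-ctx.Done()
		close(stopped)
		return nil
	})
	g.Go(func() error { // stands in for a periodic stats or metadata report
		return errors.New("transient RPC error")
	})

	err = g.Wait()
	select {
	case <-stopped:
		criticalStopped = true
	default:
	}
	return criticalStopped, err
}

func main() {
	stopped, err := runFatal()
	fmt.Println("critical routine stopped:", stopped, "error:", err)
}
```

In this sketch the critical routine never fails on its own; it stops only because the non-critical routine returned its error to the group.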

Solution

Both routines now log the error as a warning and continue the loop
instead of returning the error to the errgroup:

  • reportMetadata: BatchUpdateMetadata errors are logged, and
    reportInFlight is reset so the next tick can retry. Metadata will be
    re-collected and sent on the next tick.
  • statsReporter.reportLoop: UpdateStats errors in the main loop
    are logged and the loop continues. Stats will be re-sent on the next
    callback. The initial UpdateStats call (to get the report interval)
    remains fatal since the interval is needed to configure the callback.
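
The log-and-continue behavior described in the list above might look roughly like the following. This is a sketch under assumptions rather than the code in this PR: `updateFunc`, `reportLoop`, `runDemo`, the `ticks` channel, and the `warn` callback are hypothetical stand-ins for the real RPC client, stats callback, and logger.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// updateFunc is a hypothetical stand-in for an RPC such as UpdateStats.
type updateFunc func(ctx context.Context) error

// reportLoop makes one initial call, which stays fatal, and then calls
// update once per tick. Errors inside the loop are passed to warn and the
// loop continues; it exits only on context cancellation or closed ticks.
func reportLoop(ctx context.Context, ticks <-chan struct{}, update updateFunc, warn func(msg string)) error {
	if err := update(ctx); err != nil {
		return fmt.Errorf("initial update: %w", err)
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case _, ok := <-ticks:
			if !ok {
				return nil
			}
			if err := update(ctx); err != nil {
				// Non-fatal: log and wait for the next tick to retry.
				warn("report failed, will retry on next tick: " + err.Error())
				continue
			}
		}
	}
}

// runDemo drives the loop with three ticks and an update function whose
// second call (the first one inside the loop) fails with a transient error.
func runDemo() (calls int, warnings []string, err error) {
	ticks := make(chan struct{}, 3)
	for i := 0; i < 3; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	update := func(context.Context) error {
		calls++
		if calls == 2 {
			return errors.New("transient RPC error")
		}
		return nil
	}
	warn := func(msg string) { warnings = append(warnings, msg) }

	err = reportLoop(context.Background(), ticks, update, warn)
	return calls, warnings, err
}

func main() {
	calls, warnings, err := runDemo()
	fmt.Println(calls, len(warnings), err)
}
```

Because the loop returns nil or a context error rather than the RPC error, a caller that runs it inside an errgroup would not see its context canceled by a single failed report.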

Testing

  • New TestStatsReporter_NonFatalRPCError: injects a transient error on
    UpdateStats, verifies the loop does NOT exit, then verifies the next
    stats report is sent successfully.
  • Existing TestStatsReporter and TestAgent_Metadata tests pass.

Refs #22864

Transient RPC failures in reportMetadata and reportLoop now log a
warning and continue instead of returning an error. Returning the error
tore down the entire apiConnRoutineManager errgroup and, with it,
coordination, DERP map subscription, and other critical routines.

Refs #22864