X Tutup
Skip to content

fix(profiler): Prevent buffer race condition during rapid start/stop cycles#5622

Draft
ericapisani wants to merge 3 commits intomasterfrom
ep/investigate-flaky-gevent-tests
Draft

fix(profiler): Prevent buffer race condition during rapid start/stop cycles#5622
ericapisani wants to merge 3 commits intomasterfrom
ep/investigate-flaky-gevent-tests

Conversation

@ericapisani
Copy link
Member

@ericapisani ericapisani commented Mar 9, 2026

Fixes #5621 and PY-2127

Fix a race condition in ContinuousScheduler where the profiler thread's cleanup could destroy a buffer created by a concurrent ensure_running() call.

The Problem

During rapid start/stop cycles, a race condition could occur:

  1. stop_profiler() sets self.running = False
  2. The profiler thread's run() loop exits, but hasn't reached cleanup yet
  3. Another thread calls start_profiler() → ensure_running()
  4. ensure_running() acquires the lock, sets self.running = True, and calls reset_buffer() to create a new buffer
  5. The old profiler thread finally reaches cleanup and executes self.buffer = None
  6. This destruction of the new buffer means the new profiler thread's sampler sees self.buffer is None and silently drops all samples

This race is a latent bug on all Python versions, but manifests as flaky tests on Python 3.6 with gevent due to timing and thread scheduling differences.

The Solution

Two changes prevent the race:

  1. In profiler_id property: Add not self.running check to avoid accessing a buffer during shutdown
  2. In run() cleanup: Capture self.buffer in a local variable for flushing and never set self.buffer = None. This ensures concurrent ensure_running() calls can safely create new buffers without them being destroyed.

Testing

Added a test that simulates rapid start/stop cycles and verifies run() doesn't null out self.buffer after exiting the sampling loop.

ericapisani and others added 3 commits February 23, 2026 15:34
…cycles

Fix a race condition in ContinuousScheduler.run() where setting self.buffer = None
after the sampling loop would destroy buffers created by concurrent ensure_running()
calls during rapid stop/start cycles. This caused the profiler to silently drop all
samples when the new thread's buffer was destroyed by the old thread's cleanup.

The fix uses a local buffer reference for flushing and never sets self.buffer = None
from run(). Also updated the profiler_id property to check not self.running first.

Co-Authored-By: Claude <noreply@anthropic.com>
Remove the no-op `buffer = None` assignment on a local variable before
the function returns. Setting a local variable to None provides no GC
benefit since it goes out of scope immediately after.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 9, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


Bug Fixes 🐛

  • (celery) Propagate user-set headers by sentrivana in #5581
  • (profiler) Prevent buffer race condition during rapid start/stop cycles by ericapisani in #5622
  • (utils) Avoid double serialization of strings in safe_serialize by ericapisani in #5587

Documentation 📚

  • (openai-agents) Remove inapplicable comment by alexander-alderman-webb in #5495
  • Add AGENTS.md by sentrivana in #5579
  • Add set_attribute example to changelog by sentrivana in #5578

Internal Changes 🔧

Openai Agents

  • Set streaming header when library uses with_streaming_response() by alexander-alderman-webb in #5583
  • Replace mocks with httpx for streamed responses by alexander-alderman-webb in #5580
  • Replace mocks with httpx in non-MCP tool tests by alexander-alderman-webb in #5602
  • Replace mocks with httpx in MCP tool tests by alexander-alderman-webb in #5605
  • Replace mocks with httpx in handoff tests by alexander-alderman-webb in #5604
  • Replace mocks with httpx in API error test by alexander-alderman-webb in #5601
  • Replace mocks with httpx in non-error single-response tests by alexander-alderman-webb in #5600
  • Remove test for unreachable state by alexander-alderman-webb in #5584
  • Expect namespace tool field for new openai versions by alexander-alderman-webb in #5599

Other

  • Do not run actions on potel-base by sentrivana in #5614

🤖 This preview updates automatically when you update the PR.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 9, 2026

Codecov Results 📊

13 passed | Total: 13 | Pass Rate: 100% | Execution Time: 8.15s

All tests are passing successfully.

❌ Patch coverage is 25.00%. Project has 13866 uncovered lines.

Files with missing lines (1)
File Patch % Lines
continuous_profiler.py 43.45% ⚠️ 177 Missing and 17 partials

Generated by Codecov Action

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate flaky gevent test

1 participant

X Tutup