
ci: remove nick-fields/retry wrapper and add shared sinfo-based GPU partition selection#1299

Open
sbryngelson wants to merge 16 commits into MFlowCode:master from sbryngelson:retryci
Conversation


@sbryngelson sbryngelson commented Mar 10, 2026

Summary

  • Replace nick-fields/retry JS action with plain run: steps for Frontier builds (test.yml) and bench builds (bench.yml). The JS action wrapper was getting SIGKILL'd on Frontier login nodes after the build completed successfully, causing false build failures. Retry logic is handled by retry_build() in .github/scripts/retry-build.sh, which all cluster build.sh scripts already call.
  • Unified CI scripts — replace per-cluster submit/test/bench scripts with parameterized versions:
    • submit-slurm-job.sh: single submit+monitor script for all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh, frontier/submit.sh). Cluster config (account, QOS, partitions, time limits) selected via case block. Idempotent stale-job cancellation now applies to all clusters.
    • common/test.sh: unified test script with conditional build, cluster-aware GPU detection, thread counts, RDMA, and sharding.
    • common/bench.sh: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags.
  • Shared select-gpu-partition.sh script for sinfo-based GPU partition selection, used by both test and benchmark jobs. GPU partition priority: gpu-rtx6000 → gpu-l40s → gpu-v100 → gpu-h200 → gpu-h100 → gpu-a100.
  • Parallel benchmark jobs require 2 idle/mix nodes (GPU_PARTITION_MIN_NODES=2) before selecting a partition, since PR and master benchmark jobs run concurrently.
  • Exclude dead GPU node atl1-1-03-002-29-0 (persistent cuInit error 999).
  • Delete dead code: run-tests-with-retry.sh (never called).
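
The retry flow the first bullet describes can be sketched as follows. This is a simplified illustration of the `retry_build()` contract, not the verbatim contents of `.github/scripts/retry-build.sh` (which additionally supports an optional `RETRY_VALIDATE_CMD` hook, quoted later in this thread):

```shell
# Simplified sketch of retry_build(): 2 attempts, with the build tree
# nuked between attempts so attempt 2 starts from a clean slate.
retry_build() {
    local max_attempts=2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
            echo "Build succeeded on attempt $attempt."
            return 0
        fi
        # First attempt failed: remove the build directory before retrying.
        rm -rf build 2>/dev/null || true
        sleep 1
        attempt=$((attempt + 1))
    done
    echo "Build failed after $max_attempts attempts."
    return 1
}
```

Because this runs as a plain shell function inside the SLURM job (rather than a JS action wrapper on the login node), there is no external process for the OOM killer to SIGKILL after the build itself has succeeded.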

Workflow simplification

  • test.yml self job: 5 conditional steps → 2 (Build + Test)
  • test.yml case-opt job: 5 conditional steps → 3
  • 11 per-cluster scripts deleted, 3 unified scripts added

Test plan

  • Trigger CI on a PR and verify Frontier build steps pass without false failures
  • Verify Phoenix test jobs land on an available GPU partition via sinfo selection
  • Verify Phoenix benchmark jobs (PR + master) land on the same partition with 2 available nodes
  • Verify all cluster/device/interface combinations produce correct SLURM submissions

Spencer Bryngelson added 3 commits March 9, 2026 23:36
…nodes

The JS action wrapper gets SIGKILL'd on Frontier login nodes under memory
pressure, falsely failing the Build step even when build.sh succeeds.
retry_build() inside build.sh already handles 2-attempt retry with
rm -rf build between attempts.

Also move gpu-v100 to last in Phoenix GPU partition priority so SLURM
prefers newer GPU nodes (a100/h100/l40s/h200) over the aging V100s that
have had recurring driver issues.
Extract partition selection into select-gpu-partition.sh so both test
jobs (submit-job.sh) and benchmark jobs (run_parallel_benchmarks.sh)
use the same sinfo-based logic with a consistent priority order:
  gpu-rtx6000 -> gpu-l40s -> gpu-v100 -> gpu-h200 -> gpu-h100 -> gpu-a100

Tests now dynamically pick the best available partition rather than
submitting to a static multi-partition list, matching the benchmark
approach. Bench still exports BENCH_GPU_PARTITION so PR and master
land on the same GPU type for fair comparisons.
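
The selection loop this commit describes could look roughly like the sketch below. The priority order is from the commit message; the exact `sinfo` flags and output parsing are assumptions about `select-gpu-partition.sh`, not its verbatim contents:

```shell
# Sketch of sinfo-based partition selection: walk the priority list and
# pick the first partition with enough idle/mix nodes available.
select_gpu_partition() {
    local min_nodes="${GPU_PARTITION_MIN_NODES:-1}"
    local part count
    for part in gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100; do
        # Sum the node counts sinfo reports for idle/mix nodes in this partition.
        count=$(sinfo -h -p "$part" -t idle,mix -o '%D' 2>/dev/null \
                | awk '{s += $1} END {print s + 0}')
        if [ "${count:-0}" -ge "$min_nodes" ]; then
            SELECTED_GPU_PARTITION="$part"
            echo "Selected GPU partition: $part ($count idle/mix nodes)"
            return 0
        fi
    done
    echo "No GPU partition with at least $min_nodes idle/mix node(s)" >&2
    return 1
}
```

Exporting the result (as `BENCH_GPU_PARTITION` for bench jobs) is what keeps PR and master on the same GPU type for a fair comparison.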
Copilot AI review requested due to automatic review settings March 10, 2026 03:45

Copilot AI left a comment

Pull request overview

This PR updates CI/SLURM automation to centralize Phoenix GPU partition selection logic and simplify the non-Phoenix build step in GitHub Actions.

Changes:

  • Replace the nick-fields/retry wrapper in test.yml with a direct run step + timeout-minutes.
  • Introduce a reusable .github/scripts/select-gpu-partition.sh and use it from both Phoenix benchmark and test submission paths.
  • Simplify .github/scripts/retry-build.sh by removing support for a post-build validation hook.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • .github/workflows/test.yml: Removes external retry wrapper around non-Phoenix build step.
  • .github/workflows/phoenix/submit-job.sh: Uses shared GPU partition selection for non-benchmark Phoenix submissions.
  • .github/scripts/select-gpu-partition.sh: New shared helper to pick an available Phoenix GPU partition via sinfo.
  • .github/scripts/run_parallel_benchmarks.sh: Refactors inline partition selection into the shared helper script.
  • .github/scripts/retry-build.sh: Removes RETRY_VALIDATE_CMD-based post-build validation behavior from retry loop.

Comment on lines 2 to 14
# Provides retry_build(): 2-attempt loop.
# On failure of attempt 1, nukes the entire build directory before attempt 2.
# Set RETRY_VALIDATE_CMD to run a post-build validation; failure triggers a retry.
# Usage: source .github/scripts/retry-build.sh
#        retry_build ./mfc.sh build -j 8 --gpu acc

retry_build() {
    local validate_cmd="${RETRY_VALIDATE_CMD:-}"
    local max_attempts=2
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
            if [ -n "$validate_cmd" ]; then
                if ! eval "$validate_cmd"; then
                    echo "Post-build validation failed on attempt $attempt."
                    if [ $attempt -lt $max_attempts ]; then
                        echo "  Nuking build directory before retry..."
                        rm -rf build 2>/dev/null || true
                        sleep 5
                        attempt=$((attempt + 1))
                        continue
                    else
                        echo "Validation still failing after $max_attempts attempts."
                        return 1
                    fi
                fi
            fi
            echo "Build succeeded on attempt $attempt."
            return 0


Comment on lines +37 to +46

 if [ "$job_type" = "bench" ]; then
-    bench_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
-    echo "Submitting bench GPU job to partition: $bench_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
-    sbatch_gpu_opts="\
-#SBATCH -p $bench_partition
-#SBATCH --ntasks-per-node=4    # Number of cores per node required
-#SBATCH -G2\
-"
+    # BENCH_GPU_PARTITION is pre-selected by run_parallel_benchmarks.sh so both
+    # PR and master jobs land on the same GPU type for a fair comparison.
+    gpu_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
+    echo "Submitting bench GPU job to partition: $gpu_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
     sbatch_time="#SBATCH -t 04:00:00"
 else
-    sbatch_gpu_opts="\
-#SBATCH -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s,gpu-h200
+    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
.github/workflows/phoenix/submit-job.sh (1)

44-44: Consider using a more robust path resolution.

The relative path ../../scripts/select-gpu-partition.sh works correctly but is fragile if the script hierarchy changes. Consider extracting the repository root and using an absolute path pattern:

♻️ Optional: Use repo-root-relative path
 else
-    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    _repo_root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
+    source "${_repo_root}/.github/scripts/select-gpu-partition.sh"
     gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"
 fi


📥 Commits

Reviewing files that changed from the base of the PR and between edff972 and 24185f8.

📒 Files selected for processing (5)
  • .github/scripts/retry-build.sh
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/select-gpu-partition.sh
  • .github/workflows/phoenix/submit-job.sh
  • .github/workflows/test.yml
💤 Files with no reviewable changes (1)
  • .github/scripts/retry-build.sh


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


# retry_build ./mfc.sh build -j 8 --gpu acc

retry_build() {
    local validate_cmd="${RETRY_VALIDATE_CMD:-}"


Spencer Bryngelson and others added 2 commits March 9, 2026 23:51
Make bench jobs use sinfo-based GPU partition selection (via
select-gpu-partition.sh) as a baseline, then override with
BENCH_GPU_PARTITION only when run_parallel_benchmarks.sh has
pre-selected a partition for PR/master consistency. Previously
bench jobs fell back to a hardcoded gpu-rtx6000 when
BENCH_GPU_PARTITION was unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
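
The precedence this commit establishes can be sketched as below. Variable names follow the PR; the wrapper function itself is illustrative, not the script's verbatim shape:

```shell
# Sketch of bench-partition precedence: honor a pre-selected
# BENCH_GPU_PARTITION (set by run_parallel_benchmarks.sh for PR/master
# consistency), otherwise fall back to the dynamic sinfo-based baseline.
pick_bench_partition() {
    if [ -n "${BENCH_GPU_PARTITION:-}" ]; then
        gpu_partition="$BENCH_GPU_PARTITION"
    else
        # SELECTED_GPU_PARTITION comes from sourcing select-gpu-partition.sh.
        gpu_partition="${SELECTED_GPU_PARTITION:?select-gpu-partition.sh must run first}"
    fi
    echo "$gpu_partition"
}
```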
…selection

For parallel benchmarks (PR + master), both jobs need a GPU node
concurrently, so require at least 2 idle/mix nodes before selecting
a partition. Add GPU_PARTITION_MIN_NODES parameter to
select-gpu-partition.sh (defaults to 1 for single-job test use).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@sbryngelson sbryngelson marked this pull request as draft March 10, 2026 03:56

@sbryngelson sbryngelson marked this pull request as ready for review March 10, 2026 03:57

phoenix/test.sh relies on RETRY_VALIDATE_CMD to smoke-test the
freshly built syscheck binary and trigger a rebuild on failure,
catching architecture mismatches (SIGILL) from binaries compiled
on a different compute node. Mistakenly removed in the previous
commit as 'unused'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
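
The hook this commit restores has roughly the following shape. The `find` path is an assumption about the install layout, not a quote from phoenix/test.sh:

```shell
# Hypothetical shape of the RETRY_VALIDATE_CMD smoke test: locate the
# freshly installed syscheck binary and run it. If it dies (e.g. SIGILL
# from an ISA mismatch), retry_build nukes build/ and rebuilds.
export RETRY_VALIDATE_CMD='
  syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -n 1)
  [ -z "$syscheck_bin" ] || "$syscheck_bin"
'
```

Note the deliberate no-op when `syscheck_bin` is empty: a missing binary is not treated as a validation failure (a point the Claude review below also flags).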

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace per-cluster submit/test/bench scripts with unified versions:
- submit-slurm-job.sh: single parameterized submit+monitor script for
  all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh,
  frontier/submit.sh). Cluster config (account, QOS, partitions, time
  limits) is selected via a case block. Idempotent stale-job cancellation
  now applies to all clusters, not just Phoenix.
- common/test.sh: unified test script with conditional build (skips if
  build/ exists from Frontier login-node build), cluster-aware GPU
  detection, thread counts, RDMA, and sharding.
- common/bench.sh: unified bench script with conditional build, TMPDIR
  management (Phoenix-only), and cluster-aware bench flags.

Also removes nick-fields/retry from bench.yml (frontier build.sh
already uses retry_build internally) and deletes dead code
(run-tests-with-retry.sh).

test.yml self job: 5 conditional steps -> 2 steps (Build + Test).
test.yml case-opt job: 5 conditional steps -> 3 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
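
The cluster-config case block described above might look like the sketch below. The account, QOS, and time values are placeholders, not the real configuration; the `*)` guard reflects the reviewer suggestion later in this thread to fail loudly on an unrecognized cluster:

```shell
# Illustrative per-cluster case block for submit-slurm-job.sh.
cluster_config() {
    local cluster="$1"
    case "$cluster" in
        phoenix)
            sbatch_account="PLACEHOLDER_PHOENIX_ACCOUNT"
            sbatch_time="03:00:00"
            ;;
        frontier|frontier_amd)
            sbatch_account="PLACEHOLDER_FRONTIER_ACCOUNT"
            sbatch_time="02:00:00"
            ;;
        *)
            # Fail loudly instead of emitting a malformed SBATCH header.
            echo "ERROR: no SLURM config for cluster '$cluster'" >&2
            return 1
            ;;
    esac
}
```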

- Remove no-op 'rm -rf build' inside 'if [ ! -d build ]' guard
  in common/test.sh and common/bench.sh.
- Default gpu_partition to 'batch' before dynamic selection to
  prevent unbound variable error if a new cluster is added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spencer Bryngelson and others added 2 commits March 10, 2026 01:34
On Phoenix, test.yml uses clean:false so build/ can persist across
reruns. If the prior run built on a different CPU microarchitecture,
the stale binaries would SIGILL. Run syscheck on any existing build
and nuke build/ on failure so the rebuild block fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
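
The stale-build guard this commit describes can be sketched as follows; the install path and binary name follow the PR's discussion, but exact details in common/test.sh may differ:

```shell
# Sketch of the ISA-mismatch guard: if a cached build/ exists, smoke-test
# its syscheck binary; on failure, remove build/ so the rebuild block fires.
validate_existing_build() {
    [ -d build ] || return 0   # no cached build: normal build path
    local syscheck_bin
    syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -n 1)
    if [ -n "$syscheck_bin" ] && ! "$syscheck_bin" >/dev/null 2>&1; then
        echo "Cached build failed syscheck (likely ISA mismatch); removing build/"
        rm -rf build
    fi
}
```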

Same ISA mismatch fix as test.sh: always rm -rf build on Phoenix.
Also add trap EXIT for TMPDIR cleanup so early failures don't leak
temp directories under /storage/project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
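
A minimal sketch of the trap-based cleanup this commit adds; the /storage/project root is from the commit message, and the `mktemp` usage here is an assumption about how the temp directory is created:

```shell
# Create a private TMPDIR and guarantee its removal on exit, including
# early failures, via an EXIT trap.
tmpbuild=$(mktemp -d)          # real script would root this under /storage/project
cleanup_tmpdir() { rm -rf "$tmpbuild"; }
trap cleanup_tmpdir EXIT       # fires on normal exit AND on errors/early returns
export TMPDIR="$tmpbuild"
```

Because the EXIT trap runs regardless of how the script terminates, a failed bench run no longer leaks temp directories the way an explicit `rm -rf` at the end of the script would.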

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spencer Bryngelson and others added 2 commits March 10, 2026 01:52
bench.py spawns ./mfc.sh run as a subprocess without forwarding -j,
so the flag was silently ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@github-actions

Claude Code Review

Head SHA: b7e8e75

Files changed: 20

  • .github/scripts/retry-build.sh (modified)
  • .github/scripts/run-tests-with-retry.sh (deleted)
  • .github/scripts/run_parallel_benchmarks.sh (modified)
  • .github/scripts/select-gpu-partition.sh (new)
  • .github/scripts/submit-slurm-job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh (modified)
  • .github/workflows/bench.yml (modified)
  • .github/workflows/common/bench.sh (new)
  • .github/workflows/common/test.sh (new)
  • .github/workflows/frontier/bench.sh, frontier/submit.sh, frontier/test.sh, frontier_amd/*, phoenix/* (deleted, 9 files)

Summary

  • Removes nick-fields/retry JS wrapper (SIGKILL root cause fix) in favour of plain run: steps; retry logic for builds moves into retry_build() inside SLURM jobs.
  • Replaces 9 per-cluster scripts with 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh).
  • Extracts shared sinfo-based GPU partition selection into select-gpu-partition.sh with a configurable minimum-node threshold for concurrent bench jobs.
  • Stale-job idempotency (was Phoenix-only) now applies to all clusters via the unified submit script.
  • Dead code deleted: run-tests-with-retry.sh, phoenix/submit-job.sh, cluster-specific test/bench stubs.

Findings

1. common/bench.sh drops -j $n_jobs from the bench run command (behavioural change)
Old phoenix/bench.sh: ./mfc.sh bench $bench_opts -j $n_jobs -o ... (parallelism = capped nproc, up to 64).
New common/bench.sh (lines 45–48): ./mfc.sh bench --mem 4 -o ... -- -c $bench_cluster ...; the -j flag is absent.
If ./mfc.sh bench uses -j to run benchmark cases in parallel, this silently serialises them and will inflate wall-clock times. Please confirm whether the flag is intentionally dropped or should be -j $n_jobs.

2. submit-slurm-job.sh: no catch-all in device-opts case blocks (lines 85–115)
Both the cpu and gpu device branches contain case "$cluster" in phoenix) ... frontier|frontier_amd) ... esac with no *) fallback. If a new cluster is added to the case list without updating this file, sbatch_device_opts is left unset and the generated SBATCH script will be silently malformed. A *) echo "ERROR: no sbatch_device_opts for cluster '$cluster'" ; exit 1 ;; guard would make this fail loudly.

3. common/test.sh Phoenix validate-cmd is a no-op when syscheck is absent (line 27)

validate_cmd='syscheck_bin=$(find build/install -name syscheck ...); [ -z "$syscheck_bin" ] || "$syscheck_bin" ...'

If syscheck isn't installed (e.g., clean CI environment), the validate command silently succeeds ([ -z "" ] is true) and the architecture-mismatch check is skipped entirely. This is documented by the comment, so it's intentional, but worth verifying the binary actually exists in the Phoenix install tree so the guard fires when needed.

4. bench.yml: build-failure cleanup path removed (intentional but worth confirming)
The old nick-fields/retry had on_retry_command: rm -rf pr/build master/build. The new plain run: step has no cleanup on failure (lines 108–115 of bench.yml). For Frontier login-node pre-builds that fail, a subsequent re-run of the workflow will encounter a stale build/ directory. The Phoenix path is fine (build directory is always nuked inside SLURM by common/bench.sh). For Frontier, build/ persistence across runs is the intended behaviour (cache), so this is likely fine — just confirming it's a conscious decision.

5. Minor: select-gpu-partition.sh node exclusion not reflected in sinfo count
The dead node atl1-1-03-002-29-0 is excluded via --exclude= in the SBATCH header (submit-slurm-job.sh line ~105), but sinfo in select-gpu-partition.sh does not filter it out. A node with cuInit error 999 is almost certainly in down or drain state (not idle/mix), so the idle count should be accurate in practice — no immediate fix required.


Improvement opportunities

  • Consider quoting the mkdir -p "$tmpbuild" and mkdir -p "$currentdir" calls in common/bench.sh (lines 19–20); unquoted expansions are subject to word splitting and globbing if those paths ever contain special characters.
  • select-gpu-partition.sh (line 23): add sinfo -p "$_part" validation to handle partitions that no longer exist on the cluster (the || true suppresses this already, so it degrades gracefully — just noting it).
