OCPBUGS-45921: Use HighlyAvailable infra policy for HyperShift serial conformance tests by alebedev87 · Pull Request #75813 · openshift/release

alebedev87 · 2026-03-06T14:35:57Z

The e2e-aws-ovn-conformance-serial test creates a HyperShift hosted cluster with 3 worker nodes but SingleReplica infrastructure topology (the default). This causes the ingress controller to run with only 1 replica, making it vulnerable to NoExecute taint eviction from serial conformance tests like kubectl taint [1]. Switching to HighlyAvailable ensures 2 router replicas so a single-node taint doesn't cause full ingress unavailability.

[1] https://github.com/kubernetes/kubernetes/blob/8911a2d/test/e2e/kubectl/kubectl.go#L1772

Investigation details: link.
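
The reasoning above can be illustrated with a small model. This is a sketch, not the cluster-ingress-operator's code: it assumes the router replica count follows the infrastructure topology as described in this PR (1 for SingleReplica, 2 for HighlyAvailable) and that replicas land on distinct worker nodes. The function names and node names are hypothetical.

```python
# Simplified model (assumption): replica count follows the infrastructure
# topology as described in this PR, and replicas are spread across nodes.
REPLICAS_BY_TOPOLOGY = {"SingleReplica": 1, "HighlyAvailable": 2}


def place_routers(topology, nodes):
    """Return the nodes hosting router pods, one pod per distinct node."""
    replicas = REPLICAS_BY_TOPOLOGY[topology]
    return list(nodes[:replicas])


def routers_left_after_taint(topology, nodes, tainted_node):
    """Count router pods that survive a NoExecute taint on one node."""
    placement = place_routers(topology, nodes)
    return sum(1 for node in placement if node != tainted_node)


workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical node names
single = routers_left_after_taint("SingleReplica", workers, "worker-a")
ha = routers_left_after_taint("HighlyAvailable", workers, "worker-a")
print(single, ha)  # prints "0 1"
```

With SingleReplica, tainting the one node that hosts the router leaves no router running; with HighlyAvailable, one replica survives.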


openshift-ci-robot · 2026-03-06T14:36:03Z

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


alebedev87 · 2026-03-06T14:36:41Z

/jira refresh

openshift-ci-robot · 2026-03-06T14:36:49Z

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


openshift-ci-robot · 2026-03-06T14:37:39Z

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


openshift-ci-robot · 2026-03-06T14:41:54Z

[REHEARSALNOTIFIER]
@alebedev87: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed |

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

alebedev87 · 2026-03-06T14:57:34Z

/pj-rehearse

openshift-ci-robot · 2026-03-06T14:57:38Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

jparrill · 2026-03-06T15:07:22Z

/lgtm

openshift-ci · 2026-03-06T15:09:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, jparrill


Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/hypershift/OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alebedev87 · 2026-03-08T13:00:45Z

4.22 cluster

Tests passed.

Infrastructure CR:

status:
  controlPlaneTopology: External
  cpuPartitioning: None
  infrastructureName: c7ed7a3834177cec7638
  infrastructureTopology: HighlyAvailable

Router deployment:

    name: router-default
    namespace: openshift-ingress
  spec:
    replicas: 2

4.21 cluster

The same - tests passed, router deployment has 2 replicas.

4.20 and 4.19 clusters

The router deployment is HA (2 replicas); however, some tests failed. They are not the ones this PR aims to fix, though:

: [sig-auth][Feature:OpenShiftAuthorization][Serial] authorization TestAuthorizationResourceAccessReview should succeed [apigroup:authorization.openshift.io] [Suite:openshift/conformance/serial] (18s)
: [Serial] [sig-auth][Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [apigroup:config.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/serial] (2s)
: [sig-api-machinery][Feature:APIServer][Late] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients [Suite:openshift/conformance/parallel] (35s)
: Run multi-stage test e2e-aws-ovn-conformance-serial - e2e-aws-ovn-conformance-serial-conformance-tests container test (1h40m53s)
: Run multi-stage test test phase

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial
/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

openshift-ci-robot · 2026-03-08T13:00:48Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-03-08T13:07:55Z

@alebedev87: requesting more than one rehearsal in one comment is not supported. If you would like to rehearse multiple specific jobs, please separate the job names by a space in a single command.
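
As the bot notes, several specific jobs go into one command, separated by spaces. A minimal sketch of composing such a comment follows; the helper name `rehearse_command` is hypothetical and is not part of pj-rehearse.

```python
def rehearse_command(job_names):
    """Join job names into a single /pj-rehearse comment line."""
    jobs = [name.strip() for name in job_names if name.strip()]
    if not jobs:
        raise ValueError("at least one job name is required")
    # One command, job names separated by single spaces.
    return "/pj-rehearse " + " ".join(jobs)


command = rehearse_command([
    "periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial",
    "periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial",
])
print(command)
```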

alebedev87 · 2026-03-08T14:00:08Z

/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

openshift-ci-robot · 2026-03-08T14:00:10Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

alebedev87 · 2026-03-08T17:53:38Z

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

openshift-ci-robot · 2026-03-08T17:53:40Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

alebedev87 · 2026-03-09T08:50:54Z

: [Serial] [sig-auth][Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [apigroup:config.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/serial]
{  fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:548]: Unexpected error:
    <*errors.StatusError | 0xc007419360>: 
    clusteroperators.config.openshift.io "authentication" not found
    {
        ErrStatus: 
            code: 404
            details:
              group: config.openshift.io
              kind: clusteroperators
              name: authentication
            message: clusteroperators.config.openshift.io "authentication" not found
            metadata: {}
            reason: NotFound
            status: Failure,
    }
occurred}

This test is skipped on 4.22 and 4.21; however, on 4.20 and 4.19 the status of the authentication operator is probed before the skip kicks in. Judging by the error, the authentication operator is not present. Should the skip for 4.20 and 4.19 be moved to the top of the test case (related PR)?
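
To make the ordering issue concrete, here is a minimal Python model of the two orderings. It is not the origin test code: the function names are hypothetical, and the skip condition (control plane topology equal to External) is an assumption based on the title of the related follow-up PR.

```python
class NotFound(Exception):
    """Stands in for the 404 StatusError shown in the failure above."""


def get_cluster_operator(cluster_operators, name):
    """Look up a ClusterOperator by name, raising NotFound if absent."""
    if name not in cluster_operators:
        raise NotFound(f'clusteroperators.config.openshift.io "{name}" not found')
    return cluster_operators[name]


def run_test_probe_first(control_plane_topology, cluster_operators):
    # Probing happens before the skip, so a missing operator fails the test.
    get_cluster_operator(cluster_operators, "authentication")
    if control_plane_topology == "External":
        return "skipped"
    return "ran"


def run_test_skip_first(control_plane_topology, cluster_operators):
    # The skip is evaluated first, so the probe is never reached.
    if control_plane_topology == "External":
        return "skipped"
    get_cluster_operator(cluster_operators, "authentication")
    return "ran"


hosted = {}  # no "authentication" ClusterOperator in the hosted cluster
try:
    probe_first = run_test_probe_first("External", hosted)
except NotFound:
    probe_first = "failed"
skip_first = run_test_skip_first("External", hosted)
print(probe_first, skip_first)  # prints "failed skipped"
```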

alebedev87 · 2026-03-09T09:26:24Z

: [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available (1h59m39s)
{  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Mar 08 18:23:14.202 E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment
Mar 08 18:23:14.202 - 16s   E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment

1 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Mar 08 18:23:31.028 W clusteroperator/kube-storage-version-migrator condition/Available reason/AsExpected status/True All is well (exception: Available=True is the happy case)
}

It seems KSVM hits the same problem as the router: the [sig-cli] Kubectl taint - should remove all the taints with the same key off a node test taints a node, which can result in eviction. However, unlike the cluster ingress operator, the KSVM operator does not appear to read the Infrastructure CR to set the KSVM deployment to highly available mode:

1. 18:23:10: [sig-cli] Kubectl taint - should remove all the taints with the same key off a node started (test #2 of 63)
2. 18:23:14: That test passed (took 3.7s)
3. 18:23:14: At the exact same second, mass disruption began on ip-10-0-142-7: readiness probe failures across many pods, TaintManagerEviction of the KSVM migrator pod

# Migrator deployment:
    name: migrator
    namespace: openshift-kube-storage-version-migrator
  spec:
    replicas: 1

cc @sanchezl
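
The interval line in the failure output above encodes a start time and a duration. The sketch below parses it to compute when the Available=False window ended, which lines up with the Available=True event at 18:23:31.028. The regular expression is written only for the line format shown in this comment, and the year (2026) is an assumption taken from the comment timestamps, since the log line carries none.

```python
import re
from datetime import datetime, timedelta

LINE = (
    "Mar 08 18:23:14.202 - 16s   E clusteroperator/kube-storage-version-migrator "
    "condition/Available reason/KubeStorageVersionMigrator_Deploying status/False "
    "KubeStorageVersionMigratorAvailable: Waiting for Deployment"
)

# Matches only the "<timestamp> - <N>s <level> clusteroperator/..." shape above.
PATTERN = re.compile(
    r"^(?P<start>\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<secs>\d+)s\s+"
    r"(?P<level>[A-Z]) clusteroperator/(?P<operator>\S+) "
    r"condition/(?P<condition>\S+) reason/(?P<reason>\S+) status/(?P<status>\S+)"
)


def parse_interval(line, year=2026):
    """Return (operator, status, start, end) for one interval line."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized interval line")
    start = datetime.strptime(f"{year} {match['start']}", "%Y %b %d %H:%M:%S.%f")
    end = start + timedelta(seconds=int(match["secs"]))
    return match["operator"], match["status"], start, end


operator, status, start, end = parse_interval(LINE)
print(operator, status, start.time(), end.time())
```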

alebedev87 · 2026-03-09T09:40:02Z

4.20 cluster

Neither of the blocking failures on 4.20 seems to be related to this change.

alebedev87 · 2026-03-09T10:01:54Z

[sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be updated [Skipped:Network/OpenShiftSDN] [Serial] [Suite:openshift/conformance/serial] (4m5s)
{  fail [github.com/openshift/origin/test/extended/networking/egressip.go:360]: Unexpected error:
    <*errors.errorString | 0xc002c64000>: 
    Daemonset still not ready after 48 tries: ready=1, scheduled=2, desired=2
    {
        s: "Daemonset still not ready after 48 tries: ready=1, scheduled=2, desired=2",
    }
occurred
Ginkgo exit error 1: exit with code 1}

According to a Claude Code analysis, this test failed due to rate limiting of image pulls from the registry.ci.openshift.org image registry:

  What happened step by step:

  1. The test created a DaemonSet e2e-test-egressip-2hsmd-packet-sniffer that needed to run on 2 EgressIP-assignable worker nodes:
    - ip-10-0-136-208.ec2.internal — pod 72f2v
    - ip-10-0-137-113.ec2.internal — pod tvtwf
  2. Pod 72f2v (on ip-10-0-136-208): Successfully pulled the image in ~18.8 seconds (19:51:16 → 19:51:35), started the tcpdump container, and became Ready. Image size was ~1.8 GB.
  3. Pod tvtwf (on ip-10-0-137-113): Started pulling the same image at 19:51:16 but the image pull never completed. The container stayed in ContainerCreating state from 19:51:16 until 19:55:49 (when the pod was gracefully deleted after test failure). There is no Pulled, Created container, or Started event for this pod — the image pull simply hung or was extremely slow on this node.
  4. After 48 retry checks (~4 minutes), the JustBeforeEach at egressip.go:360 timed out and failed the test.

Why the image pull stalled on ip-10-0-137-113:

The cluster was experiencing image pull QPS throttling at the time. Other pods in the cluster (e.g., in namespace e2e-job-7003) were hitting ErrImagePull: pull QPS exceeded errors. The large image size (~1.8 GB from registry.ci.openshift.org) combined with container runtime pull QPS limits likely caused the pull on node ip-10-0-137-113 to be throttled or queued behind other pulls, preventing it from completing within the DaemonSet readiness timeout.
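
The "48 tries, roughly 4 minutes" figure is consistent with a fixed polling interval of about 5 seconds (48 × 5 s = 240 s). A minimal sketch of such a readiness poll follows; it is not the origin test code, the function name is hypothetical, and the 5-second interval is an assumption inferred from those two numbers.

```python
import time


def wait_for_daemonset(get_status, tries=48, interval_seconds=5, sleep=time.sleep):
    """Poll until ready == desired; raise after `tries` unsuccessful checks."""
    ready = scheduled = desired = 0
    for attempt in range(tries):
        ready, scheduled, desired = get_status()
        if desired > 0 and ready == desired:
            return attempt + 1  # number of checks it took
        sleep(interval_seconds)
    raise TimeoutError(
        f"Daemonset still not ready after {tries} tries: "
        f"ready={ready}, scheduled={scheduled}, desired={desired}"
    )


# Reproduce the reported state: one of two pods never becomes Ready.
# The sleep is replaced by a recorder so the example finishes instantly.
waited = []
try:
    wait_for_daemonset(lambda: (1, 2, 2), sleep=waited.append)
    message = ""
except TimeoutError as err:
    message = str(err)

print(message)
print(sum(waited))  # 240 seconds under the assumed 5 s interval
```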

alebedev87 · 2026-03-09T10:06:38Z

Conclusion: I don't see any evident link between the failed tests and this PR. I will follow up on some of them (e.g. openshift/origin#30848), but overall I acknowledge the rehearsals.

/pj-rehearse ack

openshift-ci-robot · 2026-03-09T10:06:41Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

alebedev87 · 2026-03-09T10:14:51Z

/pj-rehearse ack

openshift-ci-robot · 2026-03-09T10:14:54Z

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2026-03-09T11:56:46Z

@alebedev87: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial | `2e8014c` | link | unknown | `/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial` |
| ci/rehearse/periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial | `2e8014c` | link | unknown | `/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial` |


openshift-ci-robot · 2026-03-09T11:58:12Z

@alebedev87: Jira Issue OCPBUGS-45921: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

openshift/cluster-ingress-operator#1377 is open

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-45921 has not been moved to the MODIFIED state.


openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Mar 6, 2026

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Mar 6, 2026

openshift-ci bot requested review from enxebre, melvinjoseph86 and sjenning March 6, 2026 14:36

openshift-ci bot assigned jparrill Mar 6, 2026

openshift-ci bot added the lgtm label Mar 6, 2026

openshift-ci bot added the approved label Mar 6, 2026

openshift-ci-robot added the rehearsals-ack label Mar 9, 2026

openshift-merge-bot bot merged commit 9d47214 into openshift:main Mar 9, 2026
15 of 17 checks passed

alebedev87 mentioned this pull request Mar 9, 2026

[release-4.20] OCPBUGS-78025: Skip oauth test for external control plane topology openshift/origin#30848

Open
