
OCPBUGS-45921: Use HighlyAvailable infra policy for HyperShift serial conformance tests #75813

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from alebedev87:OCPBUGS-45921-ha-info-hypershift-conformance-serial on Mar 9, 2026

Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Mar 6, 2026

The e2e-aws-ovn-conformance-serial test creates a HyperShift hosted cluster with 3 worker nodes but SingleReplica infrastructure topology (the default). This causes the ingress controller to run with only 1 replica, making it vulnerable to NoExecute taint eviction from serial conformance tests like kubectl taint [1]. Switching to HighlyAvailable ensures 2 router replicas so a single-node taint doesn't cause full ingress unavailability.

[1] https://github.com/kubernetes/kubernetes/blob/8911a2d/test/e2e/kubectl/kubectl.go#L1772

Investigation details: link.
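
The reasoning above can be illustrated with a small model. This is a minimal sketch, not the cluster ingress operator's code: the replica counts per topology (1 for SingleReplica, 2 for HighlyAvailable) come from the description, while the function names, the node names, and the assumption that replicas land on distinct nodes are hypothetical.

```python
# Simplified model: how many router replicas survive a NoExecute taint on one node.
# Replica counts per topology follow the PR description; the rest is illustrative.
REPLICAS_BY_TOPOLOGY = {"SingleReplica": 1, "HighlyAvailable": 2}


def place_replicas(topology, nodes):
    """Place one replica per node, assuming replicas are spread across distinct nodes."""
    count = REPLICAS_BY_TOPOLOGY[topology]
    if count > len(nodes):
        raise ValueError("not enough nodes to spread replicas")
    return nodes[:count]


def surviving_replicas(topology, nodes, tainted_node):
    """Count replicas left after pods on the tainted node are evicted."""
    placement = place_replicas(topology, nodes)
    return sum(1 for node in placement if node != tainted_node)


workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical node names
single = surviving_replicas("SingleReplica", workers, "worker-a")
ha = surviving_replicas("HighlyAvailable", workers, "worker-a")
print(single, ha)
```

In this model the single replica is lost when its node is tainted, while one of the two HighlyAvailable replicas keeps serving.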

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 6, 2026
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@alebedev87
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 6, 2026
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-45921, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@alebedev87: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name | Repo | Type | Reason
periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed
periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial | N/A | periodic | Ci-operator config changed

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@alebedev87
Contributor Author

/pj-rehearse

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@jparrill
Contributor

jparrill commented Mar 6, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2026
@alebedev87
Contributor Author

alebedev87 commented Mar 8, 2026

4.22 cluster

Tests passed.

Infrastructure CR:

status:
  controlPlaneTopology: External
  cpuPartitioning: None
  infrastructureName: c7ed7a3834177cec7638
  infrastructureTopology: HighlyAvailable

Router deployment:

    name: router-default
    namespace: openshift-ingress
  spec:
    replicas: 2
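
The two checks above (topology reported by the Infrastructure CR, replica count on the router deployment) could be automated along these lines. This is a sketch under assumptions: the function name `ha_problems` is hypothetical, and the inputs are plain dictionaries standing in for the parsed resources rather than objects from any real client library.

```python
def ha_problems(infrastructure, router_deployment):
    """Return a list of findings; an empty list means the cluster looks HA."""
    problems = []
    topology = infrastructure.get("status", {}).get("infrastructureTopology")
    if topology != "HighlyAvailable":
        problems.append(f"infrastructureTopology is {topology!r}")
    replicas = router_deployment.get("spec", {}).get("replicas", 0)
    if replicas < 2:
        problems.append(f"router has {replicas} replica(s)")
    return problems


# Values copied from the 4.22 rehearsal output above.
infrastructure = {
    "status": {
        "controlPlaneTopology": "External",
        "cpuPartitioning": "None",
        "infrastructureName": "c7ed7a3834177cec7638",
        "infrastructureTopology": "HighlyAvailable",
    }
}
router = {"spec": {"replicas": 2}}  # router-default in openshift-ingress
print(ha_problems(infrastructure, router))
```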

4.21 cluster

The same - tests passed, router deployment has 2 replicas.

4.20 and 4.19 clusters

The router deployment is HA (2 replicas), but some tests failed, though not the ones this PR aims to fix:

  • [sig-auth][Feature:OpenShiftAuthorization][Serial] authorization TestAuthorizationResourceAccessReview should succeed [apigroup:authorization.openshift.io] [Suite:openshift/conformance/serial] (18s)
  • [Serial] [sig-auth][Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [apigroup:config.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/serial] (2s)
  • [sig-api-machinery][Feature:APIServer][Late] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients [Suite:openshift/conformance/parallel] (35s)
  • Run multi-stage test e2e-aws-ovn-conformance-serial - e2e-aws-ovn-conformance-serial-conformance-tests container test (1h40m53s)
  • Run multi-stage test test phase

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial
/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Contributor

@alebedev87: requesting more than one rehearsal in one comment is not supported. If you would like to rehearse multiple specific jobs, please separate the job names by a space in a single command.
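
As the bot explains, several specific rehearsals must go into one command, with job names separated by spaces. A small helper could build that comment; this is only a sketch, and the function name `rehearse_command` is hypothetical (it is not part of pj-rehearse).

```python
def rehearse_command(job_names):
    """Build a single /pj-rehearse comment for one or more specific jobs."""
    jobs = [name.strip() for name in job_names if name.strip()]
    if not jobs:
        raise ValueError("at least one job name is required")
    # One command, job names separated by a single space.
    return "/pj-rehearse " + " ".join(jobs)


command = rehearse_command([
    "periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial",
    "periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial",
])
print(command)
```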

@alebedev87
Contributor Author

/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@alebedev87
Contributor Author

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@alebedev87
Contributor Author

: [Serial] [sig-auth][Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [apigroup:config.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/serial]
{  fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:548]: Unexpected error:
    <*errors.StatusError | 0xc007419360>: 
    clusteroperators.config.openshift.io "authentication" not found
    {
        ErrStatus: 
            code: 404
            details:
              group: config.openshift.io
              kind: clusteroperators
              name: authentication
            message: clusteroperators.config.openshift.io "authentication" not found
            metadata: {}
            reason: NotFound
            status: Failure,
    }
occurred}

This test is skipped on 4.22 and 4.21. On 4.20 and 4.19, however, the status of the authentication operator is probed before the skip kicks in. Judging by the error, the authentication operator is not present. Should the skip for 4.20 and 4.19 be moved to the top of the test case (related PR)?
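
The ordering question can be shown with a toy model. The real test lives in openshift/origin and is written in Go; the sketch below is Python and purely illustrative. The `skip_oauth_tests` flag, the cluster dictionary, and all function names are hypothetical stand-ins, since the actual skip condition is not quoted in this thread.

```python
class SkipTest(Exception):
    """Raised to signal that a test case does not apply to this cluster."""


def get_operator(cluster, name):
    """Look up a cluster operator; raise if it does not exist."""
    if name not in cluster["operators"]:
        raise LookupError(f'clusteroperators "{name}" not found')
    return name


def probe_then_skip(cluster):
    # Probing first fails on clusters that have no authentication operator.
    get_operator(cluster, "authentication")
    if cluster["skip_oauth_tests"]:
        raise SkipTest("not applicable")


def skip_then_probe(cluster):
    # Checking the skip condition first avoids touching the missing operator.
    if cluster["skip_oauth_tests"]:
        raise SkipTest("not applicable")
    get_operator(cluster, "authentication")


def outcome(test, cluster):
    """Run a test function and classify the result."""
    try:
        test(cluster)
    except SkipTest:
        return "skipped"
    except LookupError:
        return "failed"
    return "passed"


hosted = {"operators": set(), "skip_oauth_tests": True}  # hypothetical cluster model
print(outcome(probe_then_skip, hosted), outcome(skip_then_probe, hosted))
```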

@alebedev87
Contributor Author

alebedev87 commented Mar 9, 2026

[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available (1h59m39s)
{  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Mar 08 18:23:14.202 E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment
Mar 08 18:23:14.202 - 16s   E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment

1 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Mar 08 18:23:31.028 W clusteroperator/kube-storage-version-migrator condition/Available reason/AsExpected status/True All is well (exception: Available=True is the happy case)
}

It seems that KSVM hits the same problem as the router: the "[sig-cli] Kubectl taint - should remove all the taints with the same key off a node" test taints a node, which can result in eviction. However, unlike the cluster ingress operator, the KSVM operator does not appear to read the Infrastructure CR to set the KSVM deployment to highly available mode:

  1. 18:23:10 - [sig-cli] Kubectl taint - should remove all the taints with the same key off a node started (test #2 of 63)
  2. 18:23:14 - That test passed (took 3.7s)
  3. 18:23:14 - At the exact same second, mass disruption began on ip-10-0-142-7: readiness probe failures across many pods, TaintManagerEviction of the KSVM migrator pod

Migrator deployment:
    name: migrator
    namespace: openshift-kube-storage-version-migrator
  spec:
    replicas: 1
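
The timing correlation above can be checked with simple arithmetic on the timestamps. This is a minimal sketch: the sub-second part of the test start time and the 5-second correlation threshold are assumptions chosen for illustration, and the helper name is hypothetical.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(start, end):
    """Return the number of seconds from start to end (same day assumed)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


taint_test_start = "18:23:10.000"   # sub-second part is assumed
unavailable_start = "18:23:14.202"  # Available=False event from the log above

gap = seconds_between(taint_test_start, unavailable_start)
# Treat the events as correlated if the outage starts shortly after the test begins.
correlated = 0 <= gap <= 5.0  # threshold is arbitrary
print(f"{gap:.3f}s after test start, correlated={correlated}")
```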

cc @sanchezl

@alebedev87
Contributor Author

alebedev87 commented Mar 9, 2026

4.20 cluster

Neither of the blocking failures for 4.20 seems to be related to this change.

@alebedev87
Contributor Author

[sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be updated [Skipped:Network/OpenShiftSDN] [Serial] [Suite:openshift/conformance/serial] (4m5s)
{  fail [github.com/openshift/origin/test/extended/networking/egressip.go:360]: Unexpected error:
    <*errors.errorString | 0xc002c64000>: 
    Daemonset still not ready after 48 tries: ready=1, scheduled=2, desired=2
    {
        s: "Daemonset still not ready after 48 tries: ready=1, scheduled=2, desired=2",
    }
occurred
Ginkgo exit error 1: exit with code 1}

According to a Claude Code analysis, this test failed due to rate limiting from the registry.ci.openshift.org image registry:

  What happened step by step:

  1. The test created a DaemonSet e2e-test-egressip-2hsmd-packet-sniffer that needed to run on 2 EgressIP-assignable worker nodes:
    - ip-10-0-136-208.ec2.internal — pod 72f2v
    - ip-10-0-137-113.ec2.internal — pod tvtwf
  2. Pod 72f2v (on ip-10-0-136-208): Successfully pulled the image in ~18.8 seconds (19:51:16 → 19:51:35), started the tcpdump container, and became Ready. Image size was ~1.8 GB.
  3. Pod tvtwf (on ip-10-0-137-113): Started pulling the same image at 19:51:16 but the image pull never completed. The container stayed in ContainerCreating state from 19:51:16 until 19:55:49 (when the pod was gracefully deleted after test failure). There is no Pulled, Created container, or Started event for this pod — the image pull simply hung or was extremely slow on this node.
  4. After 48 retry checks (~4 minutes), the JustBeforeEach at egressip.go:360 timed out and failed the test.

Why the image pull stalled on ip-10-0-137-113:

The cluster was experiencing image pull QPS throttling at the time. Other pods in the cluster (e.g., in namespace e2e-job-7003) were hitting ErrImagePull: pull QPS exceeded errors. The large image size (~1.8 GB from registry.ci.openshift.org) combined with container runtime pull QPS limits likely caused the pull on node ip-10-0-137-113 to be throttled or queued behind other pulls, preventing it from completing within the DaemonSet readiness timeout.
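
The "48 tries, about 4 minutes" figure is consistent with a fixed polling loop. The sketch below is not the code at egressip.go:360; it is a Python illustration in which the 5-second interval is an assumption inferred from 48 tries taking roughly 4 minutes, and the function and parameter names are hypothetical.

```python
import time


def wait_until_ready(get_status, tries=48, interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until ready == desired or the retry budget is spent."""
    ready, desired = 0, 0
    for attempt in range(1, tries + 1):
        ready, desired = get_status()
        if ready == desired:
            return attempt
        sleep(interval_s)
    raise TimeoutError(
        f"Daemonset still not ready after {tries} tries: ready={ready}, desired={desired}"
    )


# Simulate the failing case: one of two pods never becomes ready.
slept = []
try:
    wait_until_ready(lambda: (1, 2), sleep=slept.append)
    timed_out = False
except TimeoutError as err:
    timed_out = True
    print(err)
print(f"waited {sum(slept):.0f}s in total")
```

Injecting `sleep` keeps the simulation instant while still showing that 48 polls at the assumed interval add up to 240 seconds.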

@alebedev87
Contributor Author

Conclusion: I don't see any evident link between the failed tests and this PR. I will follow up on some items (e.g. openshift/origin#30848), but overall I acknowledge the rehearsal.

/pj-rehearse ack

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Mar 9, 2026
@alebedev87
Contributor Author

/pj-rehearse ack

@openshift-ci-robot
Contributor

@alebedev87: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2026

@alebedev87: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/rehearse/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial | 2e8014c | link | unknown | /pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance-serial
ci/rehearse/periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial | 2e8014c | link | unknown | /pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aws-ovn-conformance-serial


@openshift-merge-bot openshift-merge-bot bot merged commit 9d47214 into openshift:main Mar 9, 2026
15 of 17 checks passed
@openshift-ci-robot
Contributor

@alebedev87: Jira Issue OCPBUGS-45921: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-45921 has not been moved to the MODIFIED state.


davdhacs pushed a commit to stackrox/openshift-release that referenced this pull request Mar 9, 2026
… conformance tests (openshift#75813)
SeanZhao-redhat pushed a commit to SeanZhao-redhat/openshift-release that referenced this pull request Mar 10, 2026
… conformance tests (openshift#75813)

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged
