X Tutup
Skip to content

feat: Add non-entity retrieval support for ClickHouse offline store#6066

Merged
ntkathole merged 1 commit intofeast-dev:masterfrom
YassinNouh21:feat/clickhouse-non-entity-retrieval
Mar 10, 2026
Merged

feat: Add non-entity retrieval support for ClickHouse offline store#6066
ntkathole merged 1 commit intofeast-dev:masterfrom
YassinNouh21:feat/clickhouse-non-entity-retrieval

Conversation

@YassinNouh21
Copy link
Collaborator

@YassinNouh21 YassinNouh21 commented Mar 5, 2026

What this PR does / why we need it:

Adds support for non-entity historical retrieval (entity_df=None) in the ClickHouse offline store, bringing it to parity with the PostgreSQL offline store.

Changes:

  • Updated ClickhouseOfflineStore.get_historical_features() to accept entity_df=None with optional start_date/end_date kwargs
  • When entity_df is None, a synthetic single-row DataFrame is created using the provided date range (or sensible defaults: end_date=now, start_date derived from max TTL or 30 days)
  • Added 3 unit tests covering: both dates provided, end_date only (start from TTL), and no dates (defaults to now)

Usage:

fs.get_historical_features(
    features=["driver_stats:conv_rate"],
    entity_df=None,
    start_date=datetime(2023, 1, 1),
    end_date=datetime(2023, 6, 1),
)

Which issue(s) this PR fixes:

Fixes #5835

Test plan

  • Unit tests pass (pytest tests/unit/infra/offline_stores/test_clickhouse.py)
  • Ruff lint and format checks pass
  • E2E verification of full FeatureStore -> Provider -> ClickhouseOfflineStore chain
  • No regression on existing entity_df mode
  • DCO sign-off included

Open with Devin

@YassinNouh21 YassinNouh21 requested a review from a team as a code owner March 5, 2026 01:34
@YassinNouh21 YassinNouh21 self-assigned this Mar 5, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

@YassinNouh21 YassinNouh21 force-pushed the feat/clickhouse-non-entity-retrieval branch 3 times, most recently from b3f62c1 to 1e4cd7c Compare March 5, 2026 01:55
devin-ai-integration[bot]

This comment was marked as resolved.

)


class TestNonEntityRetrieval:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests seems over-mocked

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, removed the heavy mocking. For proper coverage, an integration test against a real ClickHouse instance would be more valuable than over-mocked unit tests — the unit tests here may not be necessary at all.

@franciscojavierarceo
Copy link
Member

Why not have an integration test?

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 7 additional findings in Devin Review.

Open in Devin Review

Comment on lines +57 to +65
# Handle non-entity retrieval mode
if entity_df is None:
end_date = kwargs.get("end_date", None)
if end_date is None:
end_date = _utc_now()
else:
end_date = make_tzaware(end_date)

entity_df = pd.DataFrame({"event_timestamp": [end_date]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Non-entity retrieval silently ignores start_date kwarg unlike Postgres counterpart

When entity_df is None, the Clickhouse get_historical_features only reads end_date from kwargs (clickhouse.py:59) and completely ignores start_date. The caller feature_store.py:1369-1370 passes start_date as a kwarg when the user provides it. The Postgres implementation (postgres.py:132-160), which this code is modeled after, uses start_date to compute the entity_df timestamp and to bound the TTL-based data scan window. In the Clickhouse version, a user-provided start_date is silently dropped, meaning the point-in-time join will use end_date as the sole entity timestamp regardless of the user's intent — potentially returning different (and unexpected) feature data compared to the Postgres offline store for the same inputs.

Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/clickhouse_offline_store/clickhouse.py, lines 57-65, add handling for the start_date kwarg to match the Postgres implementation at sdk/python/feast/infra/offline_stores/contrib/postgres_offline_store/postgres.py lines 132-168. Specifically:

1. Before the `if entity_df is None:` block, extract start_date from kwargs: `start_date = kwargs.get("start_date", None)`
2. Inside the block, after computing end_date, add logic to compute start_date from TTL if not provided (matching postgres.py lines 145-160):
   - If start_date is None, find the max TTL across feature_views and set start_date = end_date - max_ttl (or default to 30 days)
   - If start_date is provided, make it tz-aware with make_tzaware(start_date)
3. You will also need to import timedelta from datetime at the top of the file.
4. Consider whether the entity_df should use start_date or end_date as the event_timestamp (the Postgres version uses start_date via pd.date_range[:1], while the current Clickhouse version uses end_date).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YassinNouh21 This seems critical issue.

Copy link
Collaborator Author

@YassinNouh21 YassinNouh21 Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bug — this is intentional. The PIT join uses MAX(entity_timestamp) as the upper bound, so the timestamp in the synthetic entity_df IS the query upper bound. Using [end_date] gives the window [end_date - TTL, end_date], which is correct. The Postgres implementation using pd.date_range(start=start_date, ...)[:1] actually has the bug — it takes start_date as the sole timestamp, making end_date unreachable. Our implementation matches Dask and is the correct behavior.
@ntkathole

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ntkathole sounds like there's a bug in postgres???

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I noticed that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will get resolved via #6057 cc @Vperiodt

@YassinNouh21 YassinNouh21 force-pushed the feat/clickhouse-non-entity-retrieval branch from 8fd4f11 to 8dd703c Compare March 5, 2026 13:34
@YassinNouh21
Copy link
Collaborator Author

@franciscojavierarceo @ntkathole take another look !

Enable get_historical_features() to be called with entity_df=None
by passing end_date kwarg instead. When entity_df is None, a synthetic
single-row DataFrame is created using end_date (defaults to now).
The PIT join window is controlled by end_date and TTL.

Includes integration test against a real ClickHouse container.

Fixes feast-dev#5835

Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>
@ntkathole ntkathole force-pushed the feat/clickhouse-non-entity-retrieval branch from 4662b91 to ee93086 Compare March 10, 2026 02:56
@ntkathole ntkathole merged commit 4d08ddc into feast-dev:master Mar 10, 2026
21 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ClickHouse - historical retrieval without entity dataframe

3 participants

X Tutup