feat: Add non-entity retrieval support for ClickHouse offline store#6066
Conversation
b3f62c1 to
1e4cd7c
Compare
| ) | ||
|
|
||
|
|
||
| class TestNonEntityRetrieval: |
There was a problem hiding this comment.
You're right, removed the heavy mocking. For proper coverage, an integration test against a real ClickHouse instance would be more valuable than over-mocked unit tests — the unit tests here may not be necessary at all.
|
Why not have an integration test? |
| # Handle non-entity retrieval mode | ||
| if entity_df is None: | ||
| end_date = kwargs.get("end_date", None) | ||
| if end_date is None: | ||
| end_date = _utc_now() | ||
| else: | ||
| end_date = make_tzaware(end_date) | ||
|
|
||
| entity_df = pd.DataFrame({"event_timestamp": [end_date]}) |
There was a problem hiding this comment.
🔴 Non-entity retrieval silently ignores start_date kwarg unlike Postgres counterpart
When entity_df is None, the Clickhouse get_historical_features only reads end_date from kwargs (clickhouse.py:59) and completely ignores start_date. The caller feature_store.py:1369-1370 passes start_date as a kwarg when the user provides it. The Postgres implementation (postgres.py:132-160), which this code is modeled after, uses start_date to compute the entity_df timestamp and to bound the TTL-based data scan window. In the Clickhouse version, a user-provided start_date is silently dropped, meaning the point-in-time join will use end_date as the sole entity timestamp regardless of the user's intent — potentially returning different (and unexpected) feature data compared to the Postgres offline store for the same inputs.
Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/clickhouse_offline_store/clickhouse.py, lines 57-65, add handling for the start_date kwarg to match the Postgres implementation at sdk/python/feast/infra/offline_stores/contrib/postgres_offline_store/postgres.py lines 132-168. Specifically:
1. Before the `if entity_df is None:` block, extract start_date from kwargs: `start_date = kwargs.get("start_date", None)`
2. Inside the block, after computing end_date, add logic to compute start_date from TTL if not provided (matching postgres.py lines 145-160):
- If start_date is None, find the max TTL across feature_views and set start_date = end_date - max_ttl (or default to 30 days)
- If start_date is provided, make it tz-aware with make_tzaware(start_date)
3. You will also need to import timedelta from datetime at the top of the file.
4. Consider whether the entity_df should use start_date or end_date as the event_timestamp (the Postgres version uses start_date via pd.date_range[:1], while the current Clickhouse version uses end_date).
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Not a bug — this is intentional. The PIT join uses MAX(entity_timestamp) as the upper bound, so the timestamp in the synthetic entity_df IS the query upper bound. Using [end_date] gives the window [end_date - TTL, end_date], which is correct. The Postgres implementation using pd.date_range(start=start_date, ...)[:1] actually has the bug — it takes start_date as the sole timestamp, making end_date unreachable. Our implementation matches Dask and is the correct behavior.
@ntkathole
There was a problem hiding this comment.
Yes I noticed that
8fd4f11 to
8dd703c
Compare
|
@franciscojavierarceo @ntkathole take another look ! |
Enable get_historical_features() to be called with entity_df=None by passing end_date kwarg instead. When entity_df is None, a synthetic single-row DataFrame is created using end_date (defaults to now). The PIT join window is controlled by end_date and TTL. Includes integration test against a real ClickHouse container. Fixes feast-dev#5835 Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>
4662b91 to
ee93086
Compare
What this PR does / why we need it:
Adds support for non-entity historical retrieval (
entity_df=None) in the ClickHouse offline store, bringing it to parity with the PostgreSQL offline store.Changes:
ClickhouseOfflineStore.get_historical_features()to acceptentity_df=Nonewith optionalstart_date/end_datekwargsentity_dfisNone, a synthetic single-row DataFrame is created using the provided date range (or sensible defaults:end_date=now,start_datederived from max TTL or 30 days)Usage:
Which issue(s) this PR fixes:
Fixes #5835
Test plan
pytest tests/unit/infra/offline_stores/test_clickhouse.py)