feat: add BigLake Iceberg support for BigQuery analytics plugin #4750
caohy1988 wants to merge 21 commits into google:main from
Conversation
Add `biglake_storage_uri` config option that, when set alongside `connection_id`, automatically creates BigLake managed Iceberg tables and replaces JSON schema fields with STRING (since BigLake Iceberg does not support JSON type). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Normalize connection_id to full resource path for BigLakeConfiguration
(projects/{project}/locations/{loc}/connections/{name}).
2. Skip time partitioning for BigLake Iceberg by default (preview feature);
add biglake_time_partitioning opt-in flag.
3. Document Storage Write API latency caveat for Iceberg metadata refresh
(~90 min for open-source engine visibility).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…connection_id normalization _normalize_biglake_connection_id() now correctly parses "project.location.connection" (e.g. "myproj.us.my-conn") in addition to the two-part "location.connection" and full resource path forms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
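The parsing behavior described in the commit message can be sketched as follows. This is a hypothetical reimplementation for illustration, not the PR's actual `_normalize_biglake_connection_id()`; the function name and signature here are assumptions.

```python
# Illustrative sketch of the normalization described above; the real
# helper in the PR may differ in naming and error handling.
def normalize_biglake_connection_id(connection_id: str, default_project: str) -> str:
  """Normalize a connection ID to a full BigQuery connection resource path.

  Accepts three forms:
    - "location.connection"          (project taken from the client default)
    - "project.location.connection"  (e.g. "myproj.us.my-conn")
    - "projects/{p}/locations/{l}/connections/{c}" (already normalized)
  """
  if connection_id.startswith("projects/"):
    return connection_id  # Already a full resource path.
  parts = connection_id.split(".")
  if len(parts) == 2:
    location, name = parts
    return f"projects/{default_project}/locations/{location}/connections/{name}"
  if len(parts) == 3:
    project, location, name = parts
    return f"projects/{project}/locations/{location}/connections/{name}"
  raise ValueError(f"Unrecognized connection_id format: {connection_id!r}")
```

The three-part form ignores the client's default project, matching the commit's example of `"myproj.us.my-conn"` resolving to project `myproj`.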
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View the failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for BigLake managed Iceberg tables in the BigQuery analytics plugin. It lets users write analytics data in Iceberg format to Google Cloud Storage while retaining BigQuery's query capabilities. The changes include schema adjustments for Iceberg compatibility, standardized connection handling, and flexible partitioning options. Highlights
Activity
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! To proceed with the review, could you please address the following points from our contribution guidelines:
Completing these steps will help us move forward with the review process. Thanks!
Code Review
This pull request introduces support for BigLake Iceberg tables in the BigQuery analytics plugin, which is a valuable enhancement. The changes are well-implemented, including new configuration options, schema adjustments for Iceberg compatibility, and robust connection ID normalization. The accompanying unit tests are thorough and cover the new functionality comprehensively. I have one minor suggestion to remove a redundant validation check to improve code clarity. Overall, this is a high-quality contribution.
```python
tbl.clustering_fields = self.config.clustering_fields
tbl.labels = {_SCHEMA_VERSION_LABEL_KEY: _SCHEMA_VERSION}
if self.is_biglake:
  from google.cloud.bigquery.table import BigLakeConfiguration
```
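For orientation, the BigLake-specific table settings this PR applies can be summarized in one place. This sketch uses a plain dict whose keys mirror the fields the PR summary lists for `BigLakeConfiguration` (`connection_id`, `storage_uri`, `file_format`, `table_format`); it is an illustration of the configuration shape, not the plugin's actual code.

```python
# Sketch of the BigLake Iceberg table settings described in this PR.
# The keys mirror the BigLakeConfiguration fields named in the summary;
# treat this as documentation of the shape, not the real implementation.
def biglake_table_options(storage_uri: str, connection_id: str) -> dict:
  return {
      "connection_id": connection_id,  # Full resource path after normalization.
      "storage_uri": storage_uri,      # e.g. a gs:// prefix for Iceberg data.
      "file_format": "PARQUET",        # Data file format for BigLake Iceberg.
      "table_format": "ICEBERG",       # Open table format for cross-engine reads.
  }
```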
This validation check for connection_id is redundant. An equivalent check is already performed in the __init__ method (lines 1956-1959). Failing early during plugin instantiation is preferable to failing during lazy setup, as it's easier to debug. Removing this duplicate check will make the code cleaner.
I looked into the Spark BigQuery connector path as a way to validate the “BigLake Iceberg supports high throughput streaming using the Storage Write API” claim. Short conclusion: yes, the Spark connector is a valid proof path for Storage Write API -> BigLake Iceberg, but it is not a good in-process replacement for the current Python plugin writer. Why:
Example shape:

```python
(
    df.write
    .format("bigquery")
    .option("writeMethod", "direct")
    .option("writeAtLeastOnce", "true")
    .mode("append")
    .save("project.dataset.biglake_iceberg_table")
)
```

Recommendation:
One additional caveat from the docs: even when streamed writes succeed, Iceberg metadata visibility for open-source engines may lag by up to ~90 minutes, so this should not be treated as immediate cross-engine freshness. Given that, I would not change the plugin implementation to Spark. I would treat Spark/Dataflow as:
After looking at the documented support surface and the current E2E result, my recommendation is to keep this PR as a minimal MVP for BigLake support. Recommended default behavior:
Why I think this is the right scope for this PR:
Why this split makes sense technically:
So for this PR, I would explicitly avoid expanding scope into:
Those can all be follow-up work if needed. For MVP, the cleanest path is:
That gives users a working feature now, keeps the PR minimal, and avoids overfitting to an undocumented / currently failing backend path. |
I think this should be tracked as a Google-internal / product bug, separate from this PR. Reason:
I would recommend filing a product bug with a minimal repro along these lines:

Title:

Repro summary:
Questions for product team:
Given that, I would not block this PR on raw Storage Write API. I would keep the PR minimal and use:
That gives users a working MVP now, while the raw Storage Write API issue is tracked as a separate product bug.
The Storage Write API v2 (Arrow format) cannot write to BigLake Iceberg tables due to internal _colidentifier_iceberg_1 columns. Route BigLake writes to the legacy streaming API (insert_rows_json), which handles these transparently.

- Add LegacyStreamingBatchProcessor with the same queue/batch interface
- BigLake: create LegacyStreamingBatchProcessor in _get_loop_state()
- Non-BigLake: unchanged, uses Storage Write API (BatchProcessor)
- Skip Arrow schema creation for BigLake (not needed)
- Update _LoopState to accept a Union processor type
- Add 5 tests for the legacy streaming processor and routing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BigLake Iceberg tables cannot handle nested RECORD fields via any streaming API (both the Storage Write API and legacy streaming fail with _colidentifier_iceberg errors on RECORD positions).

Changes:
- _replace_json_with_string now also flattens RECORD/STRUCT fields to STRING (JSON-serialized) for BigLake Iceberg
- LegacyStreamingBatchProcessor._prepare_rows_json serializes dict/list values to JSON strings
- Updated E2E test scripts to verify the flattened schema
- Local E2E test passes: 44 events, all 9 event types, all checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
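The row preparation the commit describes can be sketched as follows: nested dict/list values are JSON-serialized to strings so they fit the flattened STRING columns before being handed to `insert_rows_json`. The function name here is illustrative, not the PR's exact `_prepare_rows_json` helper.

```python
import json

# Minimal sketch of the BigLake row preparation described above: nested
# RECORD/STRUCT (dict) and repeated (list) values become JSON strings.
def prepare_rows_json(rows: list[dict]) -> list[dict]:
  prepared = []
  for row in rows:
    out = {}
    for key, value in row.items():
      if isinstance(value, (dict, list)):
        # Flattened-to-STRING columns receive the JSON-serialized value.
        out[key] = json.dumps(value, separators=(",", ":"))
      else:
        out[key] = value
    prepared.append(out)
  return prepared
```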
Add dynamic wheel discovery and local wheel deployment support so the deploy script works with any ADK version. Also verify content_parts field is STRING in the BigLake verification checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The biglake_storage_uri docstring incorrectly stated data is written via the Storage Write API. Update to reflect the actual implementation which uses the legacy streaming API (insert_rows_json) since the Storage Write API does not yet support BigLake Iceberg tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract conditional is_biglake branching into AnalyticsTableBackend, NativeBigQueryBackend, and BigLakeIcebergBackend classes per approved RFC. This is a no-behavior-change refactor that routes schema creation, Arrow schema creation, loop-state creation, and table creation through backend classes instead of inline conditionals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _backend to __getstate__() (reset to None) and __setstate__() (backfill via setdefault) so unpickling a pre-refactor serialized plugin does not raise AttributeError when the lazy backend property is accessed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
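The pickle-compatibility fix described here follows a standard pattern: drop the lazily constructed attribute on pickling and backfill it with `setdefault` on unpickling so pre-refactor pickles still load. The class below is a toy stand-in for the plugin, not its real state.

```python
# Sketch of the __getstate__/__setstate__ pattern described above, using
# a toy class; the real plugin carries much more state.
class PluginStateDemo:
  def __init__(self):
    self._backend = None  # Lazily constructed; never meaningfully pickled.
    self.table = "events"

  def __getstate__(self):
    state = self.__dict__.copy()
    state["_backend"] = None  # Reset so the backend is rebuilt after load.
    return state

  def __setstate__(self, state):
    # setdefault backfills the attribute when loading a pre-refactor
    # pickle that never contained "_backend", avoiding AttributeError.
    state.setdefault("_backend", None)
    self.__dict__.update(state)
```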
Phase 2 of the BigLake backend refactor. Introduces EventWriter ABC with StorageWriteApiWriter and LegacyStreamingWriter implementations, hiding the write-path details behind a unified interface. Simplifies _LoopState to a single writer field and routes all append/flush/shutdown/close operations through the writer. Compatibility properties (batch_processor, write_client, write_stream) preserved via __getattribute__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
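The interface this commit introduces can be sketched as an ABC with the named operations plus a toy implementation showing how a backend plugs in. Method names come from the commit message; the bodies and the `InMemoryWriter` class are illustrative, not the PR's code.

```python
import abc

# Sketch of the EventWriter interface described above.
class EventWriter(abc.ABC):
  """Unified write-path interface hiding backend-specific details."""

  @abc.abstractmethod
  async def append(self, rows: list[dict]) -> None:
    """Queue rows for eventual delivery to BigQuery."""

  @abc.abstractmethod
  async def flush(self) -> None:
    """Force delivery of any queued rows."""

  @abc.abstractmethod
  async def shutdown(self) -> None:
    """Stop accepting rows and drain the queue."""

  @abc.abstractmethod
  def close(self) -> None:
    """Release network resources (e.g. gRPC transports)."""


class InMemoryWriter(EventWriter):
  """Toy implementation showing how a backend satisfies the interface."""

  def __init__(self):
    self.rows: list[dict] = []
    self.closed = False

  async def append(self, rows):
    self.rows.extend(rows)

  async def flush(self):
    pass  # Nothing buffered beyond self.rows in this toy.

  async def shutdown(self):
    await self.flush()

  def close(self):
    self.closed = True
```

With this shape, `_LoopState` only needs to hold one `writer` and route append/flush/shutdown/close through it, which is the simplification the commit describes.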
…ose() Wrap _batch_processor.close() in try/finally so the gRPC transport is always closed even if the batch processor teardown fails. This prevents leaked connections when cross-loop shutdown partially fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
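The hardened close path reduces to a small `try/finally` shape: transport teardown runs even when batch-processor teardown raises. The class and attribute names below are illustrative stand-ins for the writer described in the commit.

```python
# Sketch of the try/finally close described above; names are illustrative.
class WriterCloseDemo:
  def __init__(self, batch_processor, transport):
    self._batch_processor = batch_processor
    self._transport = transport

  def close(self):
    try:
      self._batch_processor.close()  # May raise during cross-loop shutdown.
    finally:
      self._transport.close()  # Always release the gRPC connection.
```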
Add bq_plugin_test_local and bq_plugin_test_agent_engine sample directories for native BigQuery e2e testing. Update the native Agent Engine deploy script to use a local wheel when available (matching the BigLake deploy script) and add BQ verification.

All 4 e2e test suites verified against real data:
- Local native BQ: 44 events, 15 views, all checks pass
- Local BigLake Iceberg: 44 events, PARQUET/ICEBERG config, all checks pass
- Agent Engine native BQ: 33 events, time partitioning, all checks pass
- Agent Engine BigLake: 33 events, BigLakeConfiguration, all checks pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…#4746) CREATE OR REPLACE VIEW can throw 409 Conflict when multiple processes race to create the same view. Catch cloud_exceptions.Conflict in _create_analytics_views() and log at DEBUG level instead of ERROR, since the view was successfully created by the other process. Fixes google#4746 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
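The race-tolerant view creation reduces to catching the 409 Conflict and logging at DEBUG. In the sketch below, `Conflict` is a local stand-in for `google.cloud.exceptions.Conflict`, and `create_view` is a hypothetical callable that issues the `CREATE OR REPLACE VIEW`; neither is the PR's actual code.

```python
import logging

logger = logging.getLogger("bq_analytics_demo")

class Conflict(Exception):
  """Stand-in for google.cloud.exceptions.Conflict (HTTP 409)."""

# Sketch of the fix described above: a concurrent 409 means another
# process already created the view, so the desired end state exists.
def create_view_tolerating_races(create_view, view_name: str) -> None:
  try:
    create_view(view_name)
  except Conflict:
    logger.debug("View %s already created by a concurrent process", view_name)
```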
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! To proceed with the review, could you please address the following point from our contribution guidelines:
Completing this step will help us move forward with the review process. Thanks!
Updated backend gap summary after the latest BigLake

The earlier comment I left is now partially outdated on one point: the BigLake view support gap is no longer just an unvalidated concern. The latest E2E update demonstrates that the auto-created analytics views do work on the current BigLake schema, including extraction from

This remains a code-level comparison of the current plugin implementation, not a broader product statement about BigQuery vs BigLake in general.

Current backend comparison
Important update from latest E2E validation

The BigLake backend now has evidence that the current auto-created analytics views are usable:
So the previous "BigLake views may not work" concern should now be treated as resolved for the current implementation, assuming the current view SQL patterns.

Remaining confirmed gaps vs native backend

The biggest remaining BigLake backend gaps are now:
Recommended next tasks to close the BigLake gap, ordered by priority
Bottom line

With the latest view validation, the BigLake backend is in better shape than before: it now has working ingestion, working analytics views, and working E2E verification. The remaining gap is no longer "can BigLake basically work?" It is now mainly about parity:
If the PR goal remains a minimal working BigLake MVP, the current implementation is reasonable. The next phase should focus on reducing the parity gap rather than questioning whether the current BigLake path is viable at all.
JSON_VALUE() and JSON_QUERY() are polymorphic in BigQuery and work on both JSON and STRING columns. The previous create_views=False was based on a wrong assumption. Enable views for BigLake samples and add view validation to the BigLake e2e test (15/15 views pass). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update docstring and comments to reflect that BigLake Iceberg uses legacy streaming (insert_rows_json), not Storage Write API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds BigLake Iceberg support to the BigQuery Agent Analytics Plugin, with a clean backend/writer abstraction layer and several bug fixes.
Phase 0: BigLake Iceberg Core (commits 723477ba–0a99c045)

Schema & Table Creation:
- New `biglake_storage_uri` option on `BigQueryLoggerConfig`: when set (together with `connection_id`), the plugin creates and configures a BigLake Iceberg table with `file_format=PARQUET`, `table_format=ICEBERG`, the given `storage_uri`, and a normalized `connection_id`
- `connection_id` normalization: accepts `location.connection`, `project.location.connection`, or the full resource path
- Time partitioning is skipped by default for BigLake Iceberg; opt in with `biglake_time_partitioning=True`

Write Path:
- BigLake writes go through a `LegacyStreamingBatchProcessor` using `insert_rows_json()`

Phase 1: Backend Extraction (commits 2cfe9de3–ea8cddd9)

Refactors table creation and loop-state setup behind abstract backend classes:
- `AnalyticsTableBackend` (ABC): `build_schema()`, `maybe_build_arrow_schema()`, `prepare_table_for_create()`, `create_loop_state()`
- `NativeBigQueryBackend`: Storage Write API path
- `BigLakeIcebergBackend`: legacy streaming path
- Pickling compatibility preserved (`__getstate__`/`__setstate__`)

Phase 2: Writer Extraction (commits 9ba7382c–288fbfc9)

Extracts the write path behind a unified `EventWriter` interface:
- `EventWriter` (ABC): `append()`, `flush()`, `shutdown()`, `close()`, `write_stream`, `atexit_processor()`
- `StorageWriteApiWriter`: wraps `BigQueryWriteAsyncClient` + `BatchProcessor`, closes the gRPC transport
- `LegacyStreamingWriter`: wraps `LegacyStreamingBatchProcessor`
- Simplifies `_LoopState` to a single `writer` field
- Compatibility properties (`batch_processor`, `write_client`, `write_stream`) preserved via `__getattribute__`
- `StorageWriteApiWriter.close()` hardened via `try/finally`

Bug Fix: Concurrent View Creation (commit 67b76d82)
- `CREATE OR REPLACE VIEW` throws 409 Conflict when multiple processes race to create the same view
- Now catches `cloud_exceptions.Conflict` in `_create_analytics_views()` and logs at DEBUG level

E2E Test Samples (commit a13a30c5)

Adds 4 end-to-end test suites under `contributing/samples/`:
- `bq_plugin_test_local/` — Local native BigQuery test (run_and_verify, validate_all_fixes, validate_issue_4694, test_fork_safety)
- `bq_plugin_test_agent_engine/` — Agent Engine native BigQuery deploy + verify
- `bq_plugin_test_biglake_local/` — Local BigLake Iceberg test
- `bq_plugin_test_biglake_agent_engine/` — Agent Engine BigLake Iceberg deploy + verify

E2E Test Results (all 4 passing)
Design Documents
- `docs/design/rfc_biglake_backend_phase1.md` — Backend extraction RFC
- `docs/design/rfc_biglake_writer_phase2.md` — Writer extraction RFC

Test plan
Related
🤖 Generated with Claude Code