feat: add BigLake Iceberg support for BigQuery analytics plugin #4750
caohy1988 wants to merge 21 commits into google:main from
Conversation
Add `biglake_storage_uri` config option that, when set alongside `connection_id`, automatically creates BigLake managed Iceberg tables and replaces JSON schema fields with STRING (since BigLake Iceberg does not support JSON type). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Normalize connection_id to full resource path for BigLakeConfiguration
(projects/{project}/locations/{loc}/connections/{name}).
2. Skip time partitioning for BigLake Iceberg by default (preview feature);
add biglake_time_partitioning opt-in flag.
3. Document Storage Write API latency caveat for Iceberg metadata refresh
(~90 min for open-source engine visibility).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…connection_id normalization _normalize_biglake_connection_id() now correctly parses "project.location.connection" (e.g. "myproj.us.my-conn") in addition to the two-part "location.connection" and full resource path forms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
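The parsing behavior described in the commit message can be sketched as follows. This is a hypothetical reimplementation for illustration, not the PR's actual `_normalize_biglake_connection_id()`; the function name and signature here are assumptions.

```python
# Illustrative sketch of the normalization described above; the real
# helper in the PR may differ in naming and error handling.
def normalize_biglake_connection_id(connection_id: str, default_project: str) -> str:
  """Normalize a connection ID to a full BigQuery connection resource path.

  Accepts three forms:
    - "location.connection"          (project taken from the client default)
    - "project.location.connection"  (e.g. "myproj.us.my-conn")
    - "projects/{p}/locations/{l}/connections/{c}" (already normalized)
  """
  if connection_id.startswith("projects/"):
    return connection_id  # Already a full resource path.
  parts = connection_id.split(".")
  if len(parts) == 2:
    location, name = parts
    return f"projects/{default_project}/locations/{location}/connections/{name}"
  if len(parts) == 3:
    project, location, name = parts
    return f"projects/{project}/locations/{location}/connections/{name}"
  raise ValueError(f"Unrecognized connection_id format: {connection_id!r}")
```

The three-part form ignores the client's default project, matching the commit's example of `"myproj.us.my-conn"` resolving to project `myproj`.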
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View the failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for BigLake managed Iceberg tables in the BigQuery analytics plugin. It lets users write analytics data in Iceberg format to Google Cloud Storage while retaining BigQuery's query capabilities. The changes include schema adjustments for Iceberg compatibility, standardized connection handling, and flexible partitioning options. Highlights
Activity
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! To proceed with the review, could you please address the following points from our contribution guidelines:
Completing these steps will help us move forward with the review process. Thanks!
Code Review
This pull request introduces support for BigLake Iceberg tables in the BigQuery analytics plugin, which is a valuable enhancement. The changes are well-implemented, including new configuration options, schema adjustments for Iceberg compatibility, and robust connection ID normalization. The accompanying unit tests are thorough and cover the new functionality comprehensively. I have one minor suggestion to remove a redundant validation check to improve code clarity. Overall, this is a high-quality contribution.
```python
tbl.clustering_fields = self.config.clustering_fields
tbl.labels = {_SCHEMA_VERSION_LABEL_KEY: _SCHEMA_VERSION}
if self.is_biglake:
  from google.cloud.bigquery.table import BigLakeConfiguration
```
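For orientation, the BigLake-specific table settings this PR applies can be summarized in one place. This sketch uses a plain dict whose keys mirror the fields the PR summary lists for `BigLakeConfiguration` (`connection_id`, `storage_uri`, `file_format`, `table_format`); it is an illustration of the configuration shape, not the plugin's actual code.

```python
# Sketch of the BigLake Iceberg table settings described in this PR.
# The keys mirror the BigLakeConfiguration fields named in the summary;
# treat this as documentation of the shape, not the real implementation.
def biglake_table_options(storage_uri: str, connection_id: str) -> dict:
  return {
      "connection_id": connection_id,  # Full resource path after normalization.
      "storage_uri": storage_uri,      # e.g. a gs:// prefix for Iceberg data.
      "file_format": "PARQUET",        # Data file format for BigLake Iceberg.
      "table_format": "ICEBERG",       # Open table format for cross-engine reads.
  }
```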
This validation check for connection_id is redundant. An equivalent check is already performed in the __init__ method (lines 1956-1959). Failing early during plugin instantiation is preferable to failing during lazy setup, as it's easier to debug. Removing this duplicate check will make the code cleaner.
I looked into the Spark BigQuery connector path as a way to validate the “BigLake Iceberg supports high throughput streaming using the Storage Write API” claim. Short conclusion: yes, the Spark connector is a valid proof path for Storage Write API -> BigLake Iceberg, but it is not a good in-process replacement for the current Python plugin writer. Why:
Example shape:

```python
(
    df.write
    .format("bigquery")
    .option("writeMethod", "direct")
    .option("writeAtLeastOnce", "true")
    .mode("append")
    .save("project.dataset.biglake_iceberg_table")
)
```

Recommendation:
One additional caveat from the docs: even when streamed writes succeed, Iceberg metadata visibility for open-source engines may lag by up to ~90 minutes, so this should not be treated as immediate cross-engine freshness. Given that, I would not change the plugin implementation to Spark. I would treat Spark/Dataflow as:
After looking at the documented support surface and the current E2E result, my recommendation is to keep this PR as a minimal MVP for BigLake support. Recommended default behavior:
Why I think this is the right scope for this PR:
Why this split makes sense technically:
So for this PR, I would explicitly avoid expanding scope into:
Those can all be follow-up work if needed. For MVP, the cleanest path is:
That gives users a working feature now, keeps the PR minimal, and avoids overfitting to an undocumented / currently failing backend path. |
I think this should be tracked as a Google-internal / product bug, separate from this PR. Reason:
I would recommend filing a product bug with a minimal repro along these lines:

Title:

Repro summary:
Questions for product team:
Given that, I would not block this PR on raw Storage Write API. I would keep the PR minimal and use:
That gives users a working MVP now, while the raw Storage Write API issue is tracked as a separate product bug.
The Storage Write API v2 (Arrow format) cannot write to BigLake Iceberg tables due to internal _colidentifier_iceberg_1 columns. Route BigLake writes to the legacy streaming API (insert_rows_json), which handles these transparently.

- Add LegacyStreamingBatchProcessor with the same queue/batch interface
- BigLake: create LegacyStreamingBatchProcessor in _get_loop_state()
- Non-BigLake: unchanged, uses Storage Write API (BatchProcessor)
- Skip Arrow schema creation for BigLake (not needed)
- Update _LoopState to accept a Union processor type
- Add 5 tests for the legacy streaming processor and routing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BigLake Iceberg tables cannot handle nested RECORD fields via any streaming API (both the Storage Write API and legacy streaming fail with _colidentifier_iceberg errors on RECORD positions).

Changes:
- _replace_json_with_string now also flattens RECORD/STRUCT fields to STRING (JSON-serialized) for BigLake Iceberg
- LegacyStreamingBatchProcessor._prepare_rows_json serializes dict/list values to JSON strings
- Updated E2E test scripts to verify the flattened schema
- Local E2E test passes: 44 events, all 9 event types, all checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
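The row preparation the commit describes can be sketched as follows: nested dict/list values are JSON-serialized to strings so they fit the flattened STRING columns before being handed to `insert_rows_json`. The function name here is illustrative, not the PR's exact `_prepare_rows_json` helper.

```python
import json

# Minimal sketch of the BigLake row preparation described above: nested
# RECORD/STRUCT (dict) and repeated (list) values become JSON strings.
def prepare_rows_json(rows: list[dict]) -> list[dict]:
  prepared = []
  for row in rows:
    out = {}
    for key, value in row.items():
      if isinstance(value, (dict, list)):
        # Flattened-to-STRING columns receive the JSON-serialized value.
        out[key] = json.dumps(value, separators=(",", ":"))
      else:
        out[key] = value
    prepared.append(out)
  return prepared
```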
Add dynamic wheel discovery and local wheel deployment support so the deploy script works with any ADK version. Also verify content_parts field is STRING in the BigLake verification checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The biglake_storage_uri docstring incorrectly stated data is written via the Storage Write API. Update to reflect the actual implementation which uses the legacy streaming API (insert_rows_json) since the Storage Write API does not yet support BigLake Iceberg tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract conditional is_biglake branching into AnalyticsTableBackend, NativeBigQueryBackend, and BigLakeIcebergBackend classes per approved RFC. This is a no-behavior-change refactor that routes schema creation, Arrow schema creation, loop-state creation, and table creation through backend classes instead of inline conditionals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _backend to __getstate__() (reset to None) and __setstate__() (backfill via setdefault) so unpickling a pre-refactor serialized plugin does not raise AttributeError when the lazy backend property is accessed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
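The pickle-compatibility fix described here follows a standard pattern: drop the lazily constructed attribute on pickling and backfill it with `setdefault` on unpickling so pre-refactor pickles still load. The class below is a toy stand-in for the plugin, not its real state.

```python
# Sketch of the __getstate__/__setstate__ pattern described above, using
# a toy class; the real plugin carries much more state.
class PluginStateDemo:
  def __init__(self):
    self._backend = None  # Lazily constructed; never meaningfully pickled.
    self.table = "events"

  def __getstate__(self):
    state = self.__dict__.copy()
    state["_backend"] = None  # Reset so the backend is rebuilt after load.
    return state

  def __setstate__(self, state):
    # setdefault backfills the attribute when loading a pre-refactor
    # pickle that never contained "_backend", avoiding AttributeError.
    state.setdefault("_backend", None)
    self.__dict__.update(state)
```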
Phase 2 of the BigLake backend refactor. Introduces EventWriter ABC with StorageWriteApiWriter and LegacyStreamingWriter implementations, hiding the write-path details behind a unified interface. Simplifies _LoopState to a single writer field and routes all append/flush/shutdown/close operations through the writer. Compatibility properties (batch_processor, write_client, write_stream) preserved via __getattribute__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
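The interface this commit introduces can be sketched as an ABC with the named operations plus a toy implementation showing how a backend plugs in. Method names come from the commit message; the bodies and the `InMemoryWriter` class are illustrative, not the PR's code.

```python
import abc

# Sketch of the EventWriter interface described above.
class EventWriter(abc.ABC):
  """Unified write-path interface hiding backend-specific details."""

  @abc.abstractmethod
  async def append(self, rows: list[dict]) -> None:
    """Queue rows for eventual delivery to BigQuery."""

  @abc.abstractmethod
  async def flush(self) -> None:
    """Force delivery of any queued rows."""

  @abc.abstractmethod
  async def shutdown(self) -> None:
    """Stop accepting rows and drain the queue."""

  @abc.abstractmethod
  def close(self) -> None:
    """Release network resources (e.g. gRPC transports)."""


class InMemoryWriter(EventWriter):
  """Toy implementation showing how a backend satisfies the interface."""

  def __init__(self):
    self.rows: list[dict] = []
    self.closed = False

  async def append(self, rows):
    self.rows.extend(rows)

  async def flush(self):
    pass  # Nothing buffered beyond self.rows in this toy.

  async def shutdown(self):
    await self.flush()

  def close(self):
    self.closed = True
```

With this shape, `_LoopState` only needs to hold one `writer` and route append/flush/shutdown/close through it, which is the simplification the commit describes.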
…ose() Wrap _batch_processor.close() in try/finally so the gRPC transport is always closed even if the batch processor teardown fails. This prevents leaked connections when cross-loop shutdown partially fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
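The hardened close path reduces to a small `try/finally` shape: transport teardown runs even when batch-processor teardown raises. The class and attribute names below are illustrative stand-ins for the writer described in the commit.

```python
# Sketch of the try/finally close described above; names are illustrative.
class WriterCloseDemo:
  def __init__(self, batch_processor, transport):
    self._batch_processor = batch_processor
    self._transport = transport

  def close(self):
    try:
      self._batch_processor.close()  # May raise during cross-loop shutdown.
    finally:
      self._transport.close()  # Always release the gRPC connection.
```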
Add bq_plugin_test_local and bq_plugin_test_agent_engine sample directories for native BigQuery e2e testing. Update the native Agent Engine deploy script to use a local wheel when available (matching the BigLake deploy script) and add BQ verification.

All 4 e2e test suites verified against real data:
- Local native BQ: 44 events, 15 views, all checks pass
- Local BigLake Iceberg: 44 events, PARQUET/ICEBERG config, all checks pass
- Agent Engine native BQ: 33 events, time partitioning, all checks pass
- Agent Engine BigLake: 33 events, BigLakeConfiguration, all checks pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…#4746) CREATE OR REPLACE VIEW can throw 409 Conflict when multiple processes race to create the same view. Catch cloud_exceptions.Conflict in _create_analytics_views() and log at DEBUG level instead of ERROR, since the view was successfully created by the other process. Fixes google#4746 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
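The race-tolerant view creation reduces to catching the 409 Conflict and logging at DEBUG. In the sketch below, `Conflict` is a local stand-in for `google.cloud.exceptions.Conflict`, and `create_view` is a hypothetical callable that issues the `CREATE OR REPLACE VIEW`; neither is the PR's actual code.

```python
import logging

logger = logging.getLogger("bq_analytics_demo")

class Conflict(Exception):
  """Stand-in for google.cloud.exceptions.Conflict (HTTP 409)."""

# Sketch of the fix described above: a concurrent 409 means another
# process already created the view, so the desired end state exists.
def create_view_tolerating_races(create_view, view_name: str) -> None:
  try:
    create_view(view_name)
  except Conflict:
    logger.debug("View %s already created by a concurrent process", view_name)
```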
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! To proceed with the review, could you please address the following point from our contribution guidelines:
Completing this step will help us move forward with the review process. Thanks!
Updated backend gap summary after the latest BigLake

The earlier comment I left is now partially outdated on one point: the BigLake view support gap is no longer just an unvalidated concern. The latest E2E update demonstrates that the auto-created analytics views do work on the current BigLake schema, including extraction from

This remains a code-level comparison of the current plugin implementation, not a broader product statement about BigQuery vs BigLake in general.

Current backend comparison
Important update from latest E2E validation

The BigLake backend now has evidence that the current auto-created analytics views are usable:
So the previous "BigLake views may not work" concern should now be treated as resolved for the current implementation, assuming the current view SQL patterns.

Remaining confirmed gaps vs native backend

The biggest remaining BigLake backend gaps are now:
Recommended next tasks to close the BigLake gap, ordered by priority
Bottom line

With the latest view validation, the BigLake backend is in better shape than before: it now has working ingestion, working analytics views, and working E2E verification. The remaining gap is no longer "can BigLake basically work?" It is now mainly about parity:
If the PR goal remains a minimal working BigLake MVP, the current implementation is reasonable. The next phase should focus on reducing the parity gap rather than questioning whether the current BigLake path is viable at all.
JSON_VALUE() and JSON_QUERY() are polymorphic in BigQuery and work on both JSON and STRING columns. The previous create_views=False was based on a wrong assumption. Enable views for BigLake samples and add view validation to the BigLake e2e test (15/15 views pass). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update docstring and comments to reflect that BigLake Iceberg uses legacy streaming (insert_rows_json), not Storage Write API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds BigLake Iceberg support to the BigQuery Agent Analytics Plugin, with a clean backend/writer abstraction layer and several bug fixes.
Phase 0: BigLake Iceberg Core (commits 723477ba–0a99c045)

Schema & Table Creation:
- New `biglake_storage_uri` option on `BigQueryLoggerConfig`: when set (together with `connection_id`), the plugin creates and configures a BigLake Iceberg table with `file_format=PARQUET`, `table_format=ICEBERG`, the given `storage_uri`, and a normalized `connection_id`
- `connection_id` normalization: accepts `location.connection`, `project.location.connection`, or the full resource path
- Time partitioning is skipped by default for BigLake Iceberg; opt in with `biglake_time_partitioning=True`

Write Path:
- BigLake writes go through a `LegacyStreamingBatchProcessor` using `insert_rows_json()`

Phase 1: Backend Extraction (commits 2cfe9de3–ea8cddd9)

Refactors table creation and loop-state setup behind abstract backend classes:
- `AnalyticsTableBackend` (ABC): `build_schema()`, `maybe_build_arrow_schema()`, `prepare_table_for_create()`, `create_loop_state()`
- `NativeBigQueryBackend`: Storage Write API path
- `BigLakeIcebergBackend`: legacy streaming path
- Pickling compatibility preserved (`__getstate__`/`__setstate__`)

Phase 2: Writer Extraction (commits 9ba7382c–288fbfc9)

Extracts the write path behind a unified `EventWriter` interface:
- `EventWriter` (ABC): `append()`, `flush()`, `shutdown()`, `close()`, `write_stream`, `atexit_processor()`
- `StorageWriteApiWriter`: wraps `BigQueryWriteAsyncClient` + `BatchProcessor`, closes the gRPC transport
- `LegacyStreamingWriter`: wraps `LegacyStreamingBatchProcessor`
- Simplifies `_LoopState` to a single `writer` field
- Compatibility properties (`batch_processor`, `write_client`, `write_stream`) preserved via `__getattribute__`
- `StorageWriteApiWriter.close()` hardened via `try/finally`

Bug Fix: Concurrent View Creation (commit 67b76d82)
- `CREATE OR REPLACE VIEW` throws 409 Conflict when multiple processes race to create the same view
- Now catches `cloud_exceptions.Conflict` in `_create_analytics_views()` and logs at DEBUG level

E2E Test Samples (commit a13a30c5)

Adds 4 end-to-end test suites under `contributing/samples/`:
- `bq_plugin_test_local/` — Local native BigQuery test (run_and_verify, validate_all_fixes, validate_issue_4694, test_fork_safety)
- `bq_plugin_test_agent_engine/` — Agent Engine native BigQuery deploy + verify
- `bq_plugin_test_biglake_local/` — Local BigLake Iceberg test
- `bq_plugin_test_biglake_agent_engine/` — Agent Engine BigLake Iceberg deploy + verify

E2E Test Results (all 4 passing)
Design Documents
- `docs/design/rfc_biglake_backend_phase1.md` — Backend extraction RFC
- `docs/design/rfc_biglake_writer_phase2.md` — Writer extraction RFC

Test plan
Related
🤖 Generated with Claude Code