
feat: add BigLake Iceberg support for BigQuery analytics plugin#4750

Draft
caohy1988 wants to merge 21 commits into google:main from caohy1988:feat/biglake-iceberg-support

Conversation

@caohy1988 caohy1988 commented Mar 7, 2026

Summary

Adds BigLake Iceberg support to the BigQuery Agent Analytics Plugin, with a clean backend/writer abstraction layer and several bug fixes.

Phase 0: BigLake Iceberg Core (commits 723477ba–0a99c045)

Schema & Table Creation:

  • New config option biglake_storage_uri on BigQueryLoggerConfig: when set (together with connection_id), the plugin creates and configures a BigLake Iceberg table
  • Schema flattening for BigLake: JSON fields → STRING, RECORD/STRUCT fields → STRING (JSON-serialized)
  • BigLakeConfiguration on table creation: file_format=PARQUET, table_format=ICEBERG, storage_uri, normalized connection_id
  • connection_id normalization: Accepts location.connection, project.location.connection, or full resource path
  • Time partitioning: Skipped by default for BigLake Iceberg; opt-in via biglake_time_partitioning=True
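The three accepted `connection_id` forms can be sketched as a small normalizer. This is a hypothetical standalone version of the plugin's `_normalize_biglake_connection_id`; the real helper's signature and error handling may differ:

```python
def normalize_biglake_connection_id(
    connection_id: str, default_project: str
) -> str:
    """Expand a short-form connection id to a full resource path.

    Accepts, per the PR description:
      - "location.connection"          (project taken from the client)
      - "project.location.connection"
      - "projects/{p}/locations/{l}/connections/{c}"  (already full)
    """
    if connection_id.startswith("projects/"):
        return connection_id  # already a full resource path
    parts = connection_id.split(".")
    if len(parts) == 2:
        location, name = parts
        return (
            f"projects/{default_project}/locations/{location}"
            f"/connections/{name}"
        )
    if len(parts) == 3:
        project, location, name = parts
        return f"projects/{project}/locations/{location}/connections/{name}"
    raise ValueError(f"Unrecognized connection_id format: {connection_id!r}")
```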

Write Path:

  • Native BigQuery → Storage Write API (unchanged)
  • BigLake Iceberg → Legacy Streaming (LegacyStreamingBatchProcessor using insert_rows_json())

Phase 1: Backend Extraction (commits 2cfe9de3–ea8cddd9)

Refactors table creation and loop-state setup behind abstract backend classes:

  • AnalyticsTableBackend (ABC): build_schema(), maybe_build_arrow_schema(), prepare_table_for_create(), create_loop_state()
  • NativeBigQueryBackend: Storage Write API path
  • BigLakeIcebergBackend: Legacy streaming path
  • Switches 4 call sites (schema, arrow schema, loop-state, table creation) to use backend
  • Preserves pickle backward compatibility (__getstate__/__setstate__)
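The backend seam above can be sketched as follows. The stand-in return values are placeholders for illustration only; the real classes operate on BigQuery `SchemaField`, `pyarrow.Schema`, and `Table` objects:

```python
import abc
from typing import Any, Optional


class AnalyticsTableBackend(abc.ABC):
    """Each abstract method matches one of the 4 switched call sites."""

    @abc.abstractmethod
    def build_schema(self) -> list:
        """Return the events schema (flattened for BigLake)."""

    @abc.abstractmethod
    def maybe_build_arrow_schema(self) -> Optional[Any]:
        """Arrow schema for the Storage Write API; None for BigLake."""

    @abc.abstractmethod
    def prepare_table_for_create(self, table: dict) -> dict:
        """Apply backend-specific options before table creation."""


class NativeBigQueryBackend(AnalyticsTableBackend):
    def build_schema(self) -> list:
        return ["json_field:JSON"]  # native keeps JSON/RECORD types

    def maybe_build_arrow_schema(self):
        return object()  # stands in for a real pyarrow.Schema

    def prepare_table_for_create(self, table: dict) -> dict:
        return table  # no extra configuration needed


class BigLakeIcebergBackend(AnalyticsTableBackend):
    def build_schema(self) -> list:
        return ["json_field:STRING"]  # JSON flattened to STRING

    def maybe_build_arrow_schema(self):
        return None  # legacy streaming needs no Arrow schema

    def prepare_table_for_create(self, table: dict) -> dict:
        table["biglake"] = True  # stands in for BigLakeConfiguration
        return table
```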

Phase 2: Writer Extraction (commits 9ba7382c–288fbfc9)

Extracts the write path behind a unified EventWriter interface:

  • EventWriter (ABC): append(), flush(), shutdown(), close(), write_stream, atexit_processor()
  • StorageWriteApiWriter: wraps BigQueryWriteAsyncClient + BatchProcessor, closes gRPC transport
  • LegacyStreamingWriter: wraps LegacyStreamingBatchProcessor
  • Simplifies _LoopState to a single writer field
  • Routes all append/flush/shutdown/close through the writer
  • Compatibility properties (batch_processor, write_client, write_stream) preserved via __getattribute__
  • Guarantees transport cleanup in StorageWriteApiWriter.close() via try/finally
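A condensed sketch of the writer interface and the try/finally cleanup guarantee (a simplified subset of the PR's `EventWriter` surface; `shutdown()` and the compatibility properties are omitted, and the client/processor here are stubs, not the real gRPC objects):

```python
import abc


class EventWriter(abc.ABC):
    """Unified write-path interface from Phase 2."""

    @abc.abstractmethod
    def append(self, rows): ...

    @abc.abstractmethod
    def flush(self): ...

    @abc.abstractmethod
    def close(self): ...


class StorageWriteApiWriter(EventWriter):
    """Wraps a write client + batch processor for the native path."""

    def __init__(self, write_client, batch_processor):
        self._write_client = write_client
        self._batch_processor = batch_processor

    def append(self, rows):
        self._batch_processor.enqueue(rows)

    def flush(self):
        self._batch_processor.flush()

    def close(self):
        # Mirrors the PR's guarantee: even if batch-processor teardown
        # raises, the gRPC transport is still closed, so no connection
        # leaks on a partially failed cross-loop shutdown.
        try:
            self._batch_processor.close()
        finally:
            self._write_client.transport.close()
```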

Bug Fix: Concurrent View Creation (commit 67b76d82)

  • Fixes #4746 ([BUG] BQ AA Plugin create view with error): CREATE OR REPLACE VIEW throws 409 Conflict when multiple processes race to create the same view
  • Now catches cloud_exceptions.Conflict in _create_analytics_views() and logs at DEBUG level
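The fix boils down to treating a lost creation race as success. A minimal sketch, with a stand-in `Conflict` class so the snippet is self-contained (the real code catches `google.cloud.exceptions.Conflict` and uses the plugin's logger):

```python
class Conflict(Exception):
    """Stand-in for google.cloud.exceptions.Conflict (HTTP 409)."""


def create_analytics_view(client, view_sql: str, log: list) -> None:
    """CREATE OR REPLACE VIEW can 409 when two processes race; the
    loser's Conflict means the view already exists, so demote it to
    a DEBUG-level message instead of an error."""
    try:
        client.query(view_sql)
    except Conflict as exc:
        # Another process created the view first: not a failure.
        log.append(f"DEBUG: view already exists: {exc}")
```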

E2E Test Samples (commit a13a30c5)

Adds 4 end-to-end test suites under contributing/samples/:

  • bq_plugin_test_local/ — Local native BigQuery test (run_and_verify, validate_all_fixes, validate_issue_4694, test_fork_safety)
  • bq_plugin_test_agent_engine/ — Agent Engine native BigQuery deploy + verify
  • bq_plugin_test_biglake_local/ — Local BigLake Iceberg test
  • bq_plugin_test_biglake_agent_engine/ — Agent Engine BigLake Iceberg deploy + verify

E2E Test Results (all 4 passing)

| Test | Events | Key Validations |
| --- | --- | --- |
| Local native BQ | 44 | 9 event types, 15 views OK, DAY time partitioning |
| Local BigLake Iceberg | 44 | PARQUET/ICEBERG config, no JSON fields, SAFE.PARSE_JSON works |
| Agent Engine native BQ | 33 | No BigLakeConfiguration, DAY time partitioning |
| Agent Engine BigLake | 33 | BigLakeConfiguration, correct connection_id & storage_uri |

Design Documents

  • docs/design/rfc_biglake_backend_phase1.md — Backend extraction RFC
  • docs/design/rfc_biglake_writer_phase2.md — Writer extraction RFC

Test plan

  • 230 unit tests pass (208 original + 8 backend + 12 writer + 1 view conflict + 1 pickle)
  • 4 e2e test suites verified against real BigQuery data
  • Autoformatting applied
  • No behavioral changes to existing native BigQuery path

Related

🤖 Generated with Claude Code

caohy1988 and others added 3 commits March 6, 2026 23:48
Add `biglake_storage_uri` config option that, when set alongside
`connection_id`, automatically creates BigLake managed Iceberg tables
and replaces JSON schema fields with STRING (since BigLake Iceberg
does not support JSON type).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
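The JSON-to-STRING replacement described in the commit above can be sketched on plain dicts (the real `_replace_json_with_string` operates on BigQuery `SchemaField` objects and, per a later commit, also flattens RECORD/STRUCT fields):

```python
def replace_json_with_string(schema: list) -> list:
    """Return a copy of a schema with JSON (and RECORD/STRUCT) field
    types replaced by STRING, since BigLake Iceberg supports neither."""
    flattened = []
    for field in schema:
        field = dict(field)
        if field["type"] in ("JSON", "RECORD", "STRUCT"):
            field["type"] = "STRING"
            # Nested sub-fields are JSON-serialized into the string value.
            field.pop("fields", None)
        flattened.append(field)
    return flattened
```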
1. Normalize connection_id to full resource path for BigLakeConfiguration
   (projects/{project}/locations/{loc}/connections/{name}).
2. Skip time partitioning for BigLake Iceberg by default (preview feature);
   add biglake_time_partitioning opt-in flag.
3. Document Storage Write API latency caveat for Iceberg metadata refresh
   (~90 min for open-source engine visibility).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…connection_id normalization

_normalize_biglake_connection_id() now correctly parses
"project.location.connection" (e.g. "myproj.us.my-conn") in addition
to the two-part "location.connection" and full resource path forms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

google-cla bot commented Mar 7, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist bot commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces robust support for BigLake managed Iceberg tables within the BigQuery analytics plugin. It enables users to configure their analytics to write data to Iceberg format in Google Cloud Storage, leveraging BigQuery's capabilities. The changes include necessary schema adjustments for Iceberg compatibility, standardized connection handling, and flexible partitioning options, ensuring seamless integration and data consistency across different analytics engines.

Highlights

  • BigLake Iceberg Support: Added a new biglake_storage_uri configuration option to BigQueryLoggerConfig to enable the creation of BigLake managed Iceberg tables.
  • Schema Transformation: Implemented automatic replacement of JSON schema fields with STRING types, as BigLake Iceberg does not support the JSON data type.
  • BigLake Configuration: Ensured that BigLakeConfiguration (including connection_id, storage_uri, file_format=PARQUET, and table_format=ICEBERG) is correctly set during BigLake table creation.
  • Connection ID Normalization: Introduced logic to normalize connection_id to its full resource path format (projects/{p}/locations/{l}/connections/{c}), accepting various short forms.
  • Time Partitioning Control: Configured BigLake Iceberg tables to skip time partitioning by default (as it's a preview feature), with an opt-in mechanism via biglake_time_partitioning.
  • Documentation Update: Added a note documenting the Storage Write API latency caveat for Iceberg metadata refresh, advising on potential delays for cross-engine visibility.


Changelog
  • src/google/adk/plugins/bigquery_agent_analytics_plugin.py
    • Added biglake_storage_uri and biglake_time_partitioning fields to BigQueryLoggerConfig.
    • Implemented _replace_json_with_string to convert JSON schema fields to STRING for BigLake compatibility.
    • Created _normalize_biglake_connection_id to standardize BigLake connection IDs to full resource paths.
    • Modified _get_events_schema to conditionally apply schema transformations for BigLake tables.
    • Added validation to ensure connection_id is provided when biglake_storage_uri is set.
    • Introduced an is_biglake property to easily check if BigLake functionality is enabled.
    • Updated table creation logic in _ensure_schema_exists to apply BigLakeConfiguration and handle conditional time partitioning.
  • tests/unittests/plugins/test_bigquery_agent_analytics_plugin.py
    • Added a new test suite TestBigLakeIceberg to cover all new BigLake Iceberg functionalities.
    • Included tests for biglake_storage_uri configuration and the is_biglake property.
    • Verified the requirement of connection_id when biglake_storage_uri is used.
    • Tested the _replace_json_with_string function and its effect on BigLake schemas, ensuring no JSON fields remain.
    • Validated that BigLake Arrow schemas do not contain google:sqlType:json metadata.
    • Confirmed that BigLakeConfiguration is correctly set during BigLake table creation, including connection ID normalization.
    • Tested the default skipping of time partitioning for BigLake tables and the opt-in mechanism.
    • Added comprehensive tests for _normalize_biglake_connection_id covering various input formats and error cases.
    • Ensured that non-BigLake schemas remain unchanged, preserving JSON fields.


adk-bot commented Mar 7, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

To proceed with the review, could you please address the following points from our contribution guidelines:

  1. Sign the Contributor License Agreement (CLA): It looks like the CLA check has failed. Please visit https://cla.developers.google.com/ to sign it.
  2. Associate an Issue: For new features, we require an associated GitHub issue. Could you please create one that describes this feature and link it to this PR?

Completing these steps will help us move forward with the review process. Thanks!

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Mar 7, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for BigLake Iceberg tables in the BigQuery analytics plugin, which is a valuable enhancement. The changes are well-implemented, including new configuration options, schema adjustments for Iceberg compatibility, and robust connection ID normalization. The accompanying unit tests are thorough and cover the new functionality comprehensively. I have one minor suggestion to remove a redundant validation check to improve code clarity. Overall, this is a high-quality contribution.

Comment on lines +2248 to +2252
    tbl.clustering_fields = self.config.clustering_fields
    tbl.labels = {_SCHEMA_VERSION_LABEL_KEY: _SCHEMA_VERSION}
    if self.is_biglake:
      from google.cloud.bigquery.table import BigLakeConfiguration

Severity: medium

This validation check for connection_id is redundant. An equivalent check is already performed in the __init__ method (lines 1956-1959). Failing early during plugin instantiation is preferable to failing during lazy setup, as it's easier to debug. Removing this duplicate check will make the code cleaner.

caohy1988 (author) commented:

I looked into the Spark BigQuery connector path as a way to validate the “BigLake Iceberg supports high throughput streaming using the Storage Write API” claim.

Short conclusion: yes, the Spark connector is a valid proof path for Storage Write API -> BigLake Iceberg, but it is not a good in-process replacement for the current Python plugin writer.

Why:

Example shape:

(
    df.write
    .format("bigquery")
    .option("writeMethod", "direct")
    .option("writeAtLeastOnce", "true")
    .mode("append")
    .save("project.dataset.biglake_iceberg_table")
)

Recommendation:

  1. Use Spark/Dataflow as a standalone validation path or external ingestion backend.
  2. Do not try to swap the current BigQueryAgentAnalyticsPlugin writer to Spark in-process. That would be a major architecture change: JVM/Spark runtime, connector jar management, process startup cost, and a very different operational model from the current lightweight Python callback path.
  3. If we want to prove whether Storage Write API can work for this use case, the next practical step is a small standalone Spark or Dataflow repro that writes rows shaped like the plugin events into an existing BigLake Iceberg table.
  4. If that repro succeeds while the current raw Python Storage Write API path still fails with _colidentifier_iceberg_1, then the correct conclusion is likely: connector-supported path works, but the current low-level direct client path used by the plugin is unsupported or at least undocumented.

One additional caveat from the docs: even when streamed writes succeed, Iceberg metadata visibility for open-source engines may lag by up to ~90 minutes, so this should not be treated as immediate cross-engine freshness.

Given that, I would not change the plugin implementation to Spark. I would treat Spark/Dataflow as:

  • a validation harness for the product capability, and
  • potentially a separate ingestion architecture if BigLake Iceberg is a hard requirement.

caohy1988 (author) commented:

After looking at the documented support surface and the current E2E result, my recommendation is to keep this PR as a minimal MVP for BigLake support.

Recommended default behavior:

  • native BigQuery table -> storage_write_api
  • BigLake Iceberg table -> legacy_streaming

Why I think this is the right scope for this PR:

  1. It solves the actual user problem in #4747 ([Feature Request] Support writing analytics events to BigLake tables in BigQuery Agent Analytics plugin).
  2. It avoids blocking the feature on the current raw Python Storage Write API failure (_colidentifier_iceberg_1).
  3. It keeps the change small and easy to reason about instead of introducing multiple new backends / batching strategies / ingestion modes in one PR.
  4. It preserves the current best path for native BigQuery tables, where storage_write_api is already the right default.

Why this split makes sense technically:

  • For native BigQuery tables, storage_write_api is the existing fast path and should remain the default.
  • For BigLake Iceberg, we have evidence that some supported write paths work (legacy_streaming, DML, load jobs), while the current low-level direct Storage Write API path does not.
  • The BigLake docs mention Storage Write API support through connectors like Spark/Dataflow, but that is not the same thing as the current raw Python AppendRows path used by this plugin.

So for this PR, I would explicitly avoid expanding scope into:

  • multiple BigLake write backends
  • load-job batching
  • DML fallback logic
  • trying to force raw Storage Write API support to work

Those can all be follow-up work if needed. For MVP, the cleanest path is:

  • add BigLake Iceberg table support
  • route BigLake writes to legacy_streaming
  • document that raw Python Storage Write API remains unresolved for BigLake Iceberg

That gives users a working feature now, keeps the PR minimal, and avoids overfitting to an undocumented / currently failing backend path.

caohy1988 (author) commented:

I think this should be tracked as a Google-internal / product bug, separate from this PR.

Reason:

  • On the same BigLake Iceberg table, the following write paths work: legacy_streaming, DML INSERT, and batch load / LOAD DATA.
  • The raw Python Storage Write API path used by this plugin fails with:
    INVALID_ARGUMENT: Input schema has missing required field: _colidentifier_iceberg_1
  • That strongly suggests the issue is not "BigLake Iceberg cannot be written", but specifically that the low-level direct AppendRows path is either unsupported, partially supported, or currently broken for managed BigLake Iceberg tables.

I would recommend filing a product bug with a minimal repro like this:

Title:
Raw BigQuery Storage Write API AppendRows to managed BigLake Iceberg table fails with _colidentifier_iceberg_1

Repro summary:

  1. Create a managed BigLake Iceberg table in BigQuery with BigLakeConfiguration:
    • storage_uri=gs://...
    • file_format=PARQUET
    • table_format=ICEBERG
    • valid BigLake connection_id
  2. Use google-cloud-bigquery-storage Python client (BigQueryWriteAsyncClient) to append rows to:
    projects/{project}/datasets/{dataset}/tables/{table}/_default
  3. Send a standard Arrow AppendRowsRequest with a schema matching the user-visible table schema.
  4. Observe failure:
    INVALID_ARGUMENT: Input schema has missing required field: _colidentifier_iceberg_1
  5. Compare against the same table using:
    • legacy streaming inserts
    • DML INSERT
    • batch load / LOAD DATA
      All of those succeed.

Questions for product team:

  1. Is raw direct Storage Write API AppendRows to managed BigLake Iceberg tables expected to be supported?
  2. If yes, is _colidentifier_iceberg_1 a backend bug or is there an undocumented client requirement?
  3. If no, can docs be clarified to distinguish connector-mediated Storage Write API support (Spark/Dataflow) from raw direct client support?

Given that, I would not block this PR on raw Storage Write API. I would keep the PR minimal and use:

  • native BigQuery table -> storage_write_api
  • BigLake Iceberg table -> legacy_streaming

That gives users a working MVP now, while the raw AppendRows behavior is investigated with the product team.

caohy1988 and others added 13 commits March 7, 2026 01:19
The Storage Write API v2 (Arrow format) cannot write to BigLake
Iceberg tables due to internal _colidentifier_iceberg_1 columns.
Route BigLake writes to the legacy streaming API
(insert_rows_json) which handles these transparently.

- Add LegacyStreamingBatchProcessor with same queue/batch interface
- BigLake: create LegacyStreamingBatchProcessor in _get_loop_state()
- Non-BigLake: unchanged, uses Storage Write API (BatchProcessor)
- Skip Arrow schema creation for BigLake (not needed)
- Update _LoopState to accept Union processor type
- Add 5 tests for legacy streaming processor and routing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BigLake Iceberg tables cannot handle nested RECORD fields via any
streaming API (both Storage Write API and legacy streaming fail
with _colidentifier_iceberg errors on RECORD positions).

Changes:
- _replace_json_with_string now also flattens RECORD/STRUCT fields
  to STRING (JSON-serialized) for BigLake Iceberg
- LegacyStreamingBatchProcessor._prepare_rows_json serializes
  dict/list values to JSON strings
- Updated E2E test scripts to verify flattened schema
- Local E2E test passes: 44 events, all 9 event types, all checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
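The `_prepare_rows_json` serialization described above can be sketched with the standard library (a hypothetical standalone version; the real method lives on `LegacyStreamingBatchProcessor` and may handle more types):

```python
import datetime
import json


def prepare_row_json(row: dict) -> dict:
    """Serialize values for insert_rows_json(): dict/list values become
    JSON strings, datetimes become ISO-8601 strings, scalars pass
    through unchanged."""
    prepared = {}
    for key, value in row.items():
        if isinstance(value, (dict, list)):
            prepared[key] = json.dumps(value)
        elif isinstance(value, datetime.datetime):
            prepared[key] = value.isoformat()
        else:
            prepared[key] = value
    return prepared
```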
Add dynamic wheel discovery and local wheel deployment support so the
deploy script works with any ADK version. Also verify content_parts
field is STRING in the BigLake verification checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The biglake_storage_uri docstring incorrectly stated data is written
via the Storage Write API. Update to reflect the actual implementation
which uses the legacy streaming API (insert_rows_json) since the
Storage Write API does not yet support BigLake Iceberg tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract conditional is_biglake branching into AnalyticsTableBackend,
NativeBigQueryBackend, and BigLakeIcebergBackend classes per approved
RFC. This is a no-behavior-change refactor that routes schema creation,
Arrow schema creation, loop-state creation, and table creation through
backend classes instead of inline conditionals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _backend to __getstate__() (reset to None) and __setstate__()
(backfill via setdefault) so unpickling a pre-refactor serialized
plugin does not raise AttributeError when the lazy backend property
is accessed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
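The pickle-compatibility pattern in the commit above can be sketched as follows (a simplified stand-in for the plugin class, showing only the `_backend` handling):

```python
class PluginLike:
    """The lazy _backend is dropped on pickle and backfilled on
    unpickle, so pre-refactor pickles (which never had the attribute)
    restore without AttributeError."""

    def __init__(self):
        self._backend = None  # lazily constructed, never serialized
        self.config = {"dataset": "analytics"}

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_backend"] = None  # reset the live backend
        return state

    def __setstate__(self, state):
        # Backfill for states serialized before the backend existed.
        state.setdefault("_backend", None)
        self.__dict__.update(state)
```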
Phase 2 of the BigLake backend refactor. Introduces EventWriter ABC with
StorageWriteApiWriter and LegacyStreamingWriter implementations, hiding
the write-path details behind a unified interface. Simplifies _LoopState
to a single writer field and routes all append/flush/shutdown/close
operations through the writer. Compatibility properties (batch_processor,
write_client, write_stream) preserved via __getattribute__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ose()

Wrap _batch_processor.close() in try/finally so the gRPC transport is
always closed even if the batch processor teardown fails. This prevents
leaked connections when cross-loop shutdown partially fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 caohy1988 closed this Mar 8, 2026
Add bq_plugin_test_local and bq_plugin_test_agent_engine sample
directories for native BigQuery e2e testing. Update the native Agent
Engine deploy script to use local wheel when available (matching the
BigLake deploy script) and add BQ verification.

All 4 e2e test suites verified against real data:
- Local native BQ: 44 events, 15 views, all checks pass
- Local BigLake Iceberg: 44 events, PARQUET/ICEBERG config, all checks pass
- Agent Engine native BQ: 33 events, time partitioning, all checks pass
- Agent Engine BigLake: 33 events, BigLakeConfiguration, all checks pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 caohy1988 reopened this Mar 8, 2026
…#4746)

CREATE OR REPLACE VIEW can throw 409 Conflict when multiple processes
race to create the same view. Catch cloud_exceptions.Conflict in
_create_analytics_views() and log at DEBUG level instead of ERROR,
since the view was successfully created by the other process.

Fixes google#4746

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

adk-bot commented Mar 8, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

To proceed with the review, could you please address the following point from our contribution guidelines:

Completing this step will help us move forward with the review process. Thanks!


caohy1988 commented Mar 8, 2026

Updated backend gap summary after the latest BigLake create_views=True validation.

The earlier comment I left is now partially outdated on one point: the BigLake view support gap is no longer just an unvalidated concern. The latest E2E update demonstrates that the auto-created analytics views do work on the current BigLake schema, including extraction from STRING columns using JSON_VALUE(...) / JSON_QUERY(...).

This remains a code-level comparison of the current plugin implementation, not a broader product statement about BigQuery vs BigLake in general.

Current backend comparison

| Area | Native BigQuery backend | BigLake Iceberg backend | Current status / impact |
| --- | --- | --- | --- |
| Write transport | Storage Write API via BigQueryWriteAsyncClient + _default stream | legacy streaming via insert_rows_json() | BigLake still does not get the Storage Write API path |
| Writer resources | StorageWriteApiWriter(write_client, BatchProcessor) | LegacyStreamingWriter(LegacyStreamingBatchProcessor) | BigLake has no write client and no write stream |
| plugin.write_client compatibility property | returns real BigQueryWriteAsyncClient | returns None | native-only capability |
| plugin.write_stream compatibility property | returns _default stream name | returns None | native-only capability |
| Arrow schema | built via to_arrow_schema(...) | not built | BigLake still has no Arrow serialization path |
| Physical schema fidelity | keeps JSON / RECORD / STRUCT types | flattens JSON and RECORD / STRUCT to STRING | BigLake still loses structured schema fidelity |
| Multi-modal / nested payload storage | nested / repeated structures remain typed | nested / repeated structures serialized into strings | BigLake stores lower-fidelity physical schema |
| Row preparation | typed Arrow batch write | dict/list serialized to JSON strings, datetime to ISO string | BigLake write path remains stringified / lossy relative to native typing |
| Default partitioning behavior | daily time partitioning on timestamp by default | partitioning skipped by default; opt-in via biglake_time_partitioning=True | default table layout still differs |
| Extra required config | no BigLake-specific config required | requires biglake_storage_uri + connection_id | BigLake setup remains stricter |
| Table type | native BigQuery table | BigLake managed Iceberg table with BigLakeConfiguration | different storage model and creation path |
| BigQuery analytics views | supported | now validated as working on BigLake STRING columns | this is no longer a gap |
| Cross-engine freshness caveat | not called out in native path | Iceberg metadata for open-source engines may lag up to ~90 minutes | BigLake still has additional freshness caveat |

Important update from latest E2E validation

The BigLake backend now has evidence that the current auto-created analytics views are usable:

  • create_views=True enabled in the BigLake sample configs
  • local BigLake E2E validates all auto-created views are queryable
  • spot-check confirms typed extraction like tool_name from STRING columns works

So the previous "BigLake views may not work" concern should now be treated as resolved for the current implementation, assuming the current view SQL patterns.

Remaining confirmed gaps vs native backend

The biggest remaining BigLake backend gaps are now:

  1. no Storage Write API path
  2. no structured JSON / RECORD physical schema parity
  3. no write_client / write_stream compatibility parity
  4. different default partitioning behavior
  5. lower-fidelity physical storage for nested / multi-modal payloads

Recommended next tasks to close the BigLake gap, ordered by priority

  1. Resolve the write-transport gap

    • Goal: determine whether BigLake can ever get parity with the native Storage Write API path.
    • Practical next step: keep the product bug / repro active for raw AppendRows against BigLake Iceberg.
    • Why this is P0: this is the largest architectural gap between the two backends and the main reason BigLake is on a distinct code path.
  2. Define the target for schema-fidelity parity

    • Goal: decide whether BigLake should stay as "logical parity, lower physical fidelity" or whether we want a higher-fidelity representation for JSON / RECORD payloads.
    • Practical next step: write a small design on options for preserving more structure without breaking BigLake ingestion.
    • Why this is P1: today BigLake works, but the physical schema is materially less expressive than native.
  3. Characterize throughput / latency tradeoffs of legacy streaming vs native Storage Write API

    • Goal: quantify the operational cost of the BigLake fallback path.
    • Practical next step: benchmark representative event volume and flush settings on both backends.
    • Why this is P1: if BigLake is going to be production-worthy, we should know the ingestion cost/perf envelope.
  4. Decide whether compatibility properties should be upgraded or explicitly documented as native-only

    • Scope: plugin.write_client and plugin.write_stream.
    • Practical next step: either keep them intentionally native-only and document that, or introduce a BigLake-specific equivalent abstraction later.
    • Why this is P2: not a functional blocker, but still a visible behavioral difference.
  5. Revisit default partitioning policy for BigLake once preview/feature maturity is clearer

    • Goal: decide whether BigLake should eventually converge toward native default partitioning behavior.
    • Practical next step: keep current opt-in behavior for now, but treat it as a documented difference rather than a final design.
    • Why this is P2: important for table layout and cost, but not the main blocker to BigLake support.

Bottom line

With the latest view validation, the BigLake backend is in better shape than before: it now has working ingestion, working analytics views, and working E2E verification.

The remaining gap is no longer "can BigLake basically work?" It is now mainly about parity:

  • transport parity
  • schema fidelity parity
  • operational/performance parity

If the PR goal remains a minimal working BigLake MVP, the current implementation is reasonable. The next phase should focus on reducing the parity gap rather than questioning whether the current BigLake path is viable at all.

caohy1988 and others added 2 commits March 8, 2026 11:41
JSON_VALUE() and JSON_QUERY() are polymorphic in BigQuery and work on
both JSON and STRING columns. The previous create_views=False was based
on a wrong assumption. Enable views for BigLake samples and add view
validation to the BigLake e2e test (15/15 views pass).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update docstring and comments to reflect that BigLake Iceberg uses
legacy streaming (insert_rows_json), not Storage Write API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Labels

services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc

Development

Successfully merging this pull request may close these issues.

  • [Feature Request] Support writing analytics events to BigLake tables in BigQuery Agent Analytics plugin (#4747)
  • [BUG] BQ AA Plugin create view with error (#4746)
