feat: add e2e workspace build duration metric #21739

Merged
sreya merged 22 commits into main from feat/workspace-build-e2e-metric
Feb 6, 2026

Conversation

@sreya
Collaborator

@sreya sreya commented Jan 28, 2026

Adds coderd_template_workspace_build_duration_seconds histogram that tracks the full duration from workspace build creation to agent ready. This captures the complete user-perceived build time including provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the lifecycle API, ensuring each build is counted exactly once per replica.

Labels: template_name, organization_name, workspace_transition, status

Fixes #21621

@coder-tasks
Contributor

coder-tasks bot commented Feb 3, 2026

Documentation Check

New Documentation Needed

  • docs/admin/integrations/prometheus.md - Add the new coderd_template_workspace_build_duration_seconds metric to the "Available metrics" section (line 208) and to the "Note on Prometheus native histogram support" section (lines 210-222)

The new metric should be documented with:

  • Description: Duration from workspace build creation to agent ready, by template. Includes labels for template_name, organization_name, workspace_transition, status, and prebuild.
  • Native histogram note: This metric supports native histograms (like the existing workspace creation metrics) and should be listed in the native histogram section.
  • Multi-replica guidance: The inline code comment in coderd/agentapi/metrics.go:18-22 provides excellent context about multi-replica deployments and query aggregation that could be valuable in the docs.

The metric table is auto-generated by make docs/admin/integrations/prometheus.md, so the table entry will be added automatically. However, the native histogram section should be manually updated to include this new metric.


Automated review via Coder Tasks

@sreya sreya force-pushed the feat/workspace-build-e2e-metric branch from b7f14da to 2a4b136 Compare February 3, 2026 03:03
@coder-tasks
Contributor

coder-tasks bot commented Feb 3, 2026

Documentation Check

New Documentation Needed

  • docs/admin/integrations/prometheus.md - Add coderd_template_workspace_build_duration_seconds to the Available Metrics table (this is auto-generated, so the underlying source needs updating)

Details

This PR adds a new Prometheus histogram metric that tracks the complete user-perceived workspace build time from creation to agent ready. The metric should appear in the Available Metrics table with:

  • Name: coderd_template_workspace_build_duration_seconds
  • Type: histogram
  • Description: Duration from workspace build creation to agent ready, by template. Tracks the complete user-perceived build time including provisioning and agent startup.
  • Labels: template_name, organization_name, workspace_transition, status, prebuild

This metric also supports native histograms (as configured in coderd/agentapi/metrics.go), so it should be added to the "Note on Prometheus native histogram support" section alongside the existing native histogram metrics.

The documentation at line 105 indicates this table is auto-generated by make docs/admin/integrations/prometheus.md, so the metric should be included automatically when that generation runs.


Automated review via Coder Tasks

sreya added 7 commits February 5, 2026 02:57
Adds coderd_template_workspace_build_duration_seconds histogram that tracks
the full duration from workspace build creation to agent ready. This captures
the complete user-perceived build time including provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.

Labels: template_name, organization_name, workspace_transition, status

Fixes #21621
…tric

- Add 'prebuild' label to distinguish prebuild creation from user builds
- Emit metric only when all agents are ready (multi-agent workspaces)
- Use worst status across all agents (error > timeout > success)
- Use MAX(ready_at) for accurate duration calculation
- Add EnableOpenMetrics to promhttp.HandlerOpts for protobuf scraping
- Remove debug logging from lifecycle.go
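
The multi-agent rules in the list above (emit only when every agent is ready, take the worst status, measure to the latest ready_at) can be sketched in plain Go. This is a minimal illustration, not the code from this PR: the `agent` type, the `severity` map, and the `summarize` and `demo` functions are hypothetical names, and the real implementation computes these values in SQL.

```go
package main

import (
	"fmt"
	"time"
)

// agent is a hypothetical, simplified view of a workspace agent.
type agent struct {
	status  string     // "success", "timeout", or "error"
	readyAt *time.Time // nil until the agent reaches a terminal startup state
}

// severity ranks statuses so that error > timeout > success.
var severity = map[string]int{"success": 0, "timeout": 1, "error": 2}

// summarize reports whether all agents are ready, the worst status among
// ready agents, and the latest ready time (the MAX(ready_at) equivalent).
func summarize(agents []agent) (allReady bool, worst string, lastReady time.Time) {
	allReady, worst = true, "success"
	for _, a := range agents {
		if a.readyAt == nil {
			allReady = false
			continue
		}
		if severity[a.status] > severity[worst] {
			worst = a.status
		}
		if a.readyAt.After(lastReady) {
			lastReady = *a.readyAt
		}
	}
	return allReady, worst, lastReady
}

// demo builds two agents that become ready 60s and 90s after build creation
// and returns the summary plus the end-to-end duration in seconds.
func demo() (bool, string, float64) {
	created := time.Date(2026, 2, 6, 12, 0, 0, 0, time.UTC)
	r1 := created.Add(60 * time.Second)
	r2 := created.Add(90 * time.Second)
	agents := []agent{
		{status: "success", readyAt: &r1},
		{status: "timeout", readyAt: &r2},
	}
	allReady, worst, last := summarize(agents)
	return allReady, worst, last.Sub(created).Seconds()
}

func main() {
	allReady, worst, secs := demo()
	fmt.Println(allReady, worst, secs)
}
```

Measuring to the latest ready time means the duration reflects the slowest agent, which matches the "user-perceived" framing of the metric.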
Removes the self-join by querying workspace_resources directly instead of
starting from workspace_agents. The agent's ResourceID is already available
at the call site, making this the more efficient approach.
Also adds dbauthz test for GetWorkspaceBuildMetricsByResourceID.
@sreya sreya force-pushed the feat/workspace-build-e2e-metric branch from af281e7 to 8d3fb82 Compare February 5, 2026 02:57
@sreya sreya requested a review from dannykopping February 5, 2026 02:58
@coder-tasks
Contributor

coder-tasks bot commented Feb 5, 2026

Documentation Check

Previous Feedback

Partially addressed - Code changes are complete, but documentation generation is pending.

Updates Needed

  • Run make docs/admin/integrations/prometheus.md to auto-generate the metrics table entry for coderd_template_workspace_build_duration_seconds
  • Manually update the "Note on Prometheus native histogram support" section (lines 210-222) to include the new metric alongside coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds

The new metric is properly registered with native histogram support in coderd/agentapi/metrics.go with:

  • NativeHistogramBucketFactor: 1.1
  • NativeHistogramMaxBucketNumber: 100
  • NativeHistogramMinResetDuration: time.Hour

This matches the configuration of the existing native histogram metrics, so it should be listed in the same documentation section.


Automated review via Coder Tasks

Contributor

@dannykopping dannykopping left a comment

Generally looks good but needs tests for the emitted metrics please.

// The "prebuild" label distinguishes prebuild creation (background, no user
// waiting) from user-initiated builds (regular workspace creation or prebuild
// claims).
var WorkspaceBuildDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Contributor

We should consider deprecating coderd_workspace_creation_duration_seconds from coderd/provisionerdserver/metrics.go.

})
require.NoError(t, err)
require.Equal(t, lifecycle, resp)
// Metric should be emitted with status="error" label.
Contributor

These tests are incomplete since we're not actually gathering the produced metrics and validating that they match the expected behaviour.

// The "prebuild" label distinguishes prebuild creation (background, no user
// waiting) from user-initiated builds (regular workspace creation or prebuild
// claims).
var WorkspaceBuildDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Contributor

Making this a package-level var is going to make testing difficult.

cli/server.go Outdated
ctx, logger, promhttp.InstrumentMetricHandler(
- options.PrometheusRegistry, promhttp.HandlerFor(options.PrometheusRegistry, promhttp.HandlerOpts{}),
+ options.PrometheusRegistry, promhttp.HandlerFor(options.PrometheusRegistry, promhttp.HandlerOpts{
+ EnableOpenMetrics: true,
Contributor

Was this necessary to get the metric working?

-- All agents must have ready_at set (terminal startup state)
COUNT(*) FILTER (WHERE wa.ready_at IS NULL) = 0 AS all_agents_ready,
-- Latest ready_at across all agents (for duration calculation)
MAX(wa.ready_at) AS last_agent_ready_at,
Contributor

If you do this you won't need to do the type check and you can just use .IsZero().

Suggested change
- MAX(wa.ready_at) AS last_agent_ready_at,
+ MAX(wa.ready_at)::timestamptz AS last_agent_ready_at,

Comment on lines +61 to +62
lastReadyAt, ok := buildInfo.LastAgentReadyAt.(time.Time)
if !ok {
Contributor

Suggested change
- lastReadyAt, ok := buildInfo.LastAgentReadyAt.(time.Time)
- if !ok {
+ if buildInfo.LastAgentReadyAt.IsZero() {

(when paired with my suggestion about a cast in the return column)
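
To illustrate why the cast matters, here is a small sketch contrasting the two shapes of the generated row. It assumes the uncast aggregate surfaces in Go as an `interface{}` field and the cast column as a `time.Time`; the `untypedRow` and `typedRow` types, the two helper functions, and `exampleReadyAt` are hypothetical and only stand in for the generated query types.

```go
package main

import (
	"fmt"
	"time"
)

// untypedRow mimics a row whose aggregate column has no inferred SQL type,
// so the field is an empty interface (assumption for illustration).
type untypedRow struct{ LastAgentReadyAt interface{} }

// typedRow mimics the row after the column is cast to timestamptz,
// so the field is a concrete time.Time (assumption for illustration).
type typedRow struct{ LastAgentReadyAt time.Time }

// exampleReadyAt is a fixed timestamp used only by this sketch.
var exampleReadyAt = time.Date(2026, 2, 6, 12, 1, 30, 0, time.UTC)

// readyUntyped needs a type assertion before the value can be used.
func readyUntyped(r untypedRow) bool {
	t, ok := r.LastAgentReadyAt.(time.Time)
	return ok && !t.IsZero()
}

// readyTyped can check the zero value directly.
func readyTyped(r typedRow) bool {
	return !r.LastAgentReadyAt.IsZero()
}

func main() {
	fmt.Println(readyUntyped(untypedRow{}))
	fmt.Println(readyTyped(typedRow{LastAgentReadyAt: exampleReadyAt}))
}
```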

Co-authored-by: Danny Kopping <danny@coder.com>
@coder-tasks
Contributor

coder-tasks bot commented Feb 5, 2026

Documentation Check

Updates Needed

  • docs/admin/integrations/prometheus.md - Add the new metric to the "Available metrics" table and update the native histogram section

The new coderd_template_workspace_build_duration_seconds histogram metric should be documented in the metrics table with:

  • Description: "Duration from workspace build creation to agent ready, by template."
  • Labels: template_name organization_name transition status is_prebuild
  • Type: histogram

Additionally, this metric supports native histograms (per coderd/agentapi/metrics.go:31-37), so it should be added to the "Note on Prometheus native histogram support" section (lines 208-223) alongside the existing metrics that support this feature.

The table at lines 107-206 is auto-generated (see comment at line 105), so the actual update should be made to the source that generates this documentation. The native histogram note should be manually updated.


Automated review via Coder Tasks

@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

New Documentation Needed

The new metric coderd_template_workspace_build_duration_seconds needs to be added to the auto-generated metrics table in docs/admin/integrations/prometheus.md.

What needs to be documented:

  • docs/admin/integrations/prometheus.md - Add the new metric to the auto-generated table (lines 105-206)
    • Metric name: coderd_template_workspace_build_duration_seconds
    • Type: histogram (with native histogram support)
    • Description: Duration from workspace build creation to agent ready, by template.
    • Labels: template_name, organization_name, transition, status, is_prebuild
    • Note: The metric should also be added to the native histogram section (after line 213) along with coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds, as the code in coderd/agentapi/metrics.go shows native histogram configuration with NativeHistogramBucketFactor, NativeHistogramMaxBucketNumber, and NativeHistogramMinResetDuration.

Additional considerations:

Per the comment in coderd/agentapi/metrics.go (lines 18-27), the metric documentation should note:

  • This metric is recorded by the coderd replica handling the agent's connection
  • In multi-replica deployments, each replica only has observations for agents it handles
  • Prometheus should scrape all replicas and aggregate across instances using queries like:
    histogram_quantile(0.95,
      sum(rate(coderd_template_workspace_build_duration_seconds_bucket[5m])) by (le, template_name)
    )
    

However, since the existing metrics table doesn't include usage examples or aggregation guidance, this detail might be better suited for a separate guide or inline code comment rather than the metrics table itself.


Automated review via Coder Tasks

Makes WorkspaceBuildDurationHistogram injectable on LifecycleAPI so
tests can use a per-test prometheus.Registry. Each metric-related test
now gathers metrics and asserts:

- Correct label values (template_name, org, transition, status, is_prebuild)
- Exactly one observation with positive duration
- No observations when AllAgentsReady is false
@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

Updates Needed

  • docs/admin/integrations/prometheus.md - Add the new coderd_template_workspace_build_duration_seconds metric to the Available Metrics table (auto-generated section). Run make docs/admin/integrations/prometheus.md to regenerate.

Details

This PR adds a new Prometheus histogram metric coderd_template_workspace_build_duration_seconds that tracks end-to-end workspace build duration from build creation to agent ready. Key characteristics:

  • Type: histogram (supports both classic and native histograms)
  • Description: Duration from workspace build creation to agent ready, by template
  • Labels: template_name, organization_name, transition, status, is_prebuild
  • Buckets: Custom buckets from 1s to 1hr, plus native histogram support
  • Multi-replica note: Each coderd replica only records metrics for agents it handles; Prometheus should scrape all replicas

The metric is emitted when the agent reports ready/error/timeout, ensuring each build is counted exactly once per replica.

Action required: Run make docs/admin/integrations/prometheus.md to update the auto-generated metrics table. The comment in coderd/agentapi/metrics.go already documents the metric clearly with an example PromQL query.


Automated review via Coder Tasks

Replace custom helper functions with the existing promhelp package
(coderd/coderdtest/promhelp) which is the codebase standard for
Prometheus metric testing.

Each test now uses:
- promhelp.HistogramValue() to validate labels, sample count, and sum
- promhelp.MetricValue() to assert metric absence
- Per-test prometheus.Registry for isolation
@sreya sreya force-pushed the feat/workspace-build-e2e-metric branch from 73a8563 to 01ea121 Compare February 6, 2026 03:16
Move metric name to an exported constant so tests reference it
instead of duplicating the string.
@sreya sreya force-pushed the feat/workspace-build-e2e-metric branch from 01ea121 to b0267f4 Compare February 6, 2026 03:20
sreya added 2 commits February 6, 2026 04:08
Use fixed buildCreatedAt and agentReadyAt times (90s apart) so tests
can assert the exact GetSampleSum() rather than just > 0.
Required for Prometheus to scrape native histograms via protobuf
format, used by the workspace build duration metric.
@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

Previous Feedback

Not yet addressed - Documentation still needs updating for the new metric.

Updates Needed

  • Run make docs/admin/integrations/prometheus.md to regenerate the metrics table with coderd_template_workspace_build_duration_seconds
  • Manually update the "Note on Prometheus native histogram support" section (lines 210-214) to include the new metric alongside coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds

The metric is properly configured with native histogram support in coderd/agentapi/metrics.go (lines 552-554) with:

  • NativeHistogramBucketFactor: 1.1
  • NativeHistogramMaxBucketNumber: 100
  • NativeHistogramMinResetDuration: time.Hour

This configuration matches the existing native histogram metrics, so it should be documented in the same section.


Automated review via Coder Tasks

Not required for native histograms - protobuf scraping is controlled
by Prometheus scrape_protocols config, not the server-side handler.
@sreya sreya requested a review from dannykopping February 6, 2026 05:02
Contributor

@dannykopping dannykopping left a comment

Almost there 👍

Replace the global WorkspaceBuildDurationSeconds var with
NewBuildDurationHistogram(reg) so tests use the real histogram
definition (same buckets, native histogram config, labels) instead
of a separate test-only histogram.

The histogram is created in coderd.go when Prometheus is enabled,
stored on the API struct, and threaded through to the agent API
via Options.
@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

Updates Needed

  • docs/admin/integrations/prometheus.md - Add new coderd_template_workspace_build_duration_seconds metric to the Available metrics table (auto-generated section)

The metric table in prometheus.md is auto-generated via make docs/admin/integrations/prometheus.md. After merging this PR, run the docs generation to include the new histogram metric with its labels (template_name, organization_name, transition, status, is_prebuild).

Note on Native Histogram Support

Consider adding this metric to the native histogram support section (lines 208-222) if it should support native histograms like coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds. The implementation in coderd/agentapi/metrics.go:537-539 includes native histogram configuration:

  • NativeHistogramBucketFactor: 1.1
  • NativeHistogramMaxBucketNumber: 100
  • NativeHistogramMinResetDuration: time.Hour

Automated review via Coder Tasks

sreya added 3 commits February 6, 2026 06:42
- Replace package-level var with LifecycleMetrics struct and
  NewLifecycleMetrics constructor
- Make emitBuildDurationMetric a receiver on LifecycleAPI
- Unexport emitMetricsOnce
- Thread LifecycleMetrics through coderd -> agentapi Options
- Tests use NewLifecycleMetrics with per-test registries
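
One way to get the "counted exactly once per replica" behaviour is a once-guard on the per-connection handler. The sketch below assumes `emitMetricsOnce` behaves like a `sync.Once`, which this conversation does not confirm; the `lifecycleAPI` type, its `observe` field, and the `countObservations` helper are hypothetical stand-ins rather than the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycleAPI is a hypothetical stand-in for the per-agent-connection handler.
type lifecycleAPI struct {
	emitOnce sync.Once
	// observe is a hypothetical sink, for example a histogram's Observe method.
	observe func(seconds float64)
}

// reportTerminalState records the build duration the first time it is called
// and ignores every later call on the same handler.
func (a *lifecycleAPI) reportTerminalState(seconds float64) {
	a.emitOnce.Do(func() { a.observe(seconds) })
}

// countObservations sends the given number of terminal reports to one handler
// and returns how many observations were actually recorded.
func countObservations(reports int) int {
	n := 0
	api := &lifecycleAPI{observe: func(float64) { n++ }}
	for i := 0; i < reports; i++ {
		api.reportTerminalState(90)
	}
	return n
}

func main() {
	fmt.Println(countObservations(3))
}
```

A guard like this only deduplicates within a single handler instance, which is consistent with the "once per replica" wording: a different replica would hold its own guard.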
@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

Updates Needed

  • docs/admin/integrations/prometheus.md - Add coderd_template_workspace_build_duration_seconds to the Available metrics table with description, labels (template_name, organization_name, transition, status, is_prebuild), and note native histogram support

Context

This PR adds a new Prometheus histogram metric coderd_template_workspace_build_duration_seconds that tracks end-to-end workspace build duration from creation to agent ready.

The metric should be documented in the auto-generated metrics table at lines 107-206 of docs/admin/integrations/prometheus.md. Based on the code:

  • Type: histogram
  • Description: "Duration from workspace build creation to agent ready, by template."
  • Labels: template_name, organization_name, transition, status, is_prebuild
  • Native histogram support: Yes (configured with NativeHistogramBucketFactor, NativeHistogramMaxBucketNumber, NativeHistogramMinResetDuration in coderd/agentapi/metrics.go:503-505)

The metric should also be added to the native histogram support section (lines 208-222) alongside the existing coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds metrics.


Automated review via Coder Tasks

@sreya sreya requested a review from dannykopping February 6, 2026 07:01
Contributor

@dannykopping dannykopping left a comment

LGTM

@coder-tasks
Contributor

coder-tasks bot commented Feb 6, 2026

Documentation Check

Updates Needed

  • docs/admin/integrations/prometheus.md - Add coderd_template_workspace_build_duration_seconds to the Available metrics table with description, labels (template_name, organization_name, transition, status, is_prebuild), and note native histogram support

    ✅ Metric added to table at line 165 in commit 954b5fe

  • docs/admin/integrations/prometheus.md - Fix typo in native histogram section (line 213): change coderd_template_coderd_template_workspace_build_duration_seconds to coderd_template_workspace_build_duration_seconds

    ⚠️ The metric name has a duplication error

Context

This PR adds a new Prometheus histogram metric coderd_template_workspace_build_duration_seconds that tracks end-to-end workspace build duration from creation to agent ready.

The metric was successfully added to the auto-generated metrics table in commit 954b5fe. The metric is properly configured with native histogram support (NativeHistogramBucketFactor, NativeHistogramMaxBucketNumber, NativeHistogramMinResetDuration in coderd/agentapi/metrics.go).

However, there's a typo in the native histogram support section where the metric name is listed with "coderd_template" duplicated: coderd_template_coderd_template_workspace_build_duration_seconds instead of coderd_template_workspace_build_duration_seconds.


Automated review via Coder Tasks

@sreya sreya merged commit 6035e45 into main Feb 6, 2026
34 checks passed
@sreya sreya deleted the feat/workspace-build-e2e-metric branch February 6, 2026 22:26
@github-actions github-actions bot locked and limited conversation to collaborators Feb 6, 2026
Development

Successfully merging this pull request may close these issues.

feat: provide an e2e metric tracking workspace build time

2 participants
