feat: add e2e workspace build duration metric #21739
Documentation Check
New Documentation Needed
The new metric should be documented with:
The metric table is auto-generated by
Automated review via Coder Tasks
b7f14da to 2a4b136
Documentation Check
New Documentation Needed
Details
This PR adds a new Prometheus histogram metric that tracks the complete user-perceived workspace build time from creation to agent ready. The metric should appear in the Available Metrics table with:
This metric also supports native histograms (as configured in
The documentation at line 105 indicates this table is auto-generated by
Automated review via Coder Tasks
Adds coderd_template_workspace_build_duration_seconds histogram that tracks the full duration from workspace build creation to agent ready. This captures the complete user-perceived build time including provisioning and agent startup.
The metric is emitted when the agent reports ready/error/timeout via the lifecycle API, ensuring each build is counted exactly once per replica.
Labels: template_name, organization_name, workspace_transition, status
Fixes #21621
…tric
- Add 'prebuild' label to distinguish prebuild creation from user builds
- Emit metric only when all agents are ready (multi-agent workspaces)
- Use worst status across all agents (error > timeout > success)
- Use MAX(ready_at) for accurate duration calculation
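The multi-agent rules in this commit can be illustrated with a minimal Go sketch. It is not the PR's implementation: `AgentState`, `Summarize`, `severity`, and `at` are hypothetical names, and treating a build with no agents as "not ready" is an assumption of the sketch. It only shows how "all agents ready", "worst status wins", and "latest ready_at" fit together.

```go
package main

import "time"

// AgentState is a hypothetical, simplified view of one workspace agent.
type AgentState struct {
	Status  string     // "success", "timeout", or "error"
	ReadyAt *time.Time // nil until the agent reaches a terminal startup state
}

// severity ranks statuses so that error > timeout > success.
var severity = map[string]int{"success": 0, "timeout": 1, "error": 2}

// Summarize reports whether every agent is ready, the worst status among
// ready agents, and the latest ReadyAt (the analogue of MAX(ready_at)).
func Summarize(agents []AgentState) (allReady bool, worst string, lastReady time.Time) {
	if len(agents) == 0 {
		// Assumption of this sketch: a build with no agents is never "ready".
		return false, "", time.Time{}
	}
	allReady, worst = true, "success"
	for _, a := range agents {
		if a.ReadyAt == nil {
			allReady = false
			continue
		}
		if a.ReadyAt.After(lastReady) {
			lastReady = *a.ReadyAt
		}
		if severity[a.Status] > severity[worst] {
			worst = a.Status
		}
	}
	return allReady, worst, lastReady
}

// at is a small helper that returns a pointer to a Unix timestamp.
func at(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}
```

A caller would observe the histogram only when `allReady` is true, using `worst` as the status label and `lastReady` minus the build creation time as the duration.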
- Add EnableOpenMetrics to promhttp.HandlerOpts for protobuf scraping
- Remove debug logging from lifecycle.go
Removes the self-join by querying workspace_resources directly instead of starting from workspace_agents. The agent's ResourceID is already available at the call site, making this the more efficient approach.
Also adds dbauthz test for GetWorkspaceBuildMetricsByResourceID.
af281e7 to 8d3fb82
Documentation Check
Previous Feedback
Partially addressed - Code changes are complete, but documentation generation is pending.
Updates Needed
The new metric is properly registered with native histogram support in
This matches the configuration of the existing native histogram metrics, so it should be listed in the same documentation section.
Automated review via Coder Tasks
dannykopping left a comment
Generally looks good but needs tests for the emitted metrics please.
coderd/agentapi/metrics.go
Outdated
// The "prebuild" label distinguishes prebuild creation (background, no user
// waiting) from user-initiated builds (regular workspace creation or prebuild
// claims).
var WorkspaceBuildDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
We should consider deprecating coderd_workspace_creation_duration_seconds from coderd/provisionerdserver/metrics.go.
coderd/agentapi/lifecycle_test.go
Outdated
})
require.NoError(t, err)
require.Equal(t, lifecycle, resp)
// Metric should be emitted with status="error" label.
These tests are incomplete since we're not actually gathering the produced metrics and validating that they match the expected behaviour.
coderd/agentapi/metrics.go
Outdated
// The "prebuild" label distinguishes prebuild creation (background, no user
// waiting) from user-initiated builds (regular workspace creation or prebuild
// claims).
var WorkspaceBuildDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Making this a package-level var is going to make testing difficult.
cli/server.go
Outdated
  ctx, logger, promhttp.InstrumentMetricHandler(
- options.PrometheusRegistry, promhttp.HandlerFor(options.PrometheusRegistry, promhttp.HandlerOpts{}),
+ options.PrometheusRegistry, promhttp.HandlerFor(options.PrometheusRegistry, promhttp.HandlerOpts{
+ EnableOpenMetrics: true,
Was this necessary to get the metric working?
-- All agents must have ready_at set (terminal startup state)
COUNT(*) FILTER (WHERE wa.ready_at IS NULL) = 0 AS all_agents_ready,
-- Latest ready_at across all agents (for duration calculation)
MAX(wa.ready_at) AS last_agent_ready_at,
If you do this you won't need to do the type check and you can just use .IsZero().
- MAX(wa.ready_at) AS last_agent_ready_at,
+ MAX(wa.ready_at)::timestamptz AS last_agent_ready_at,
coderd/agentapi/metrics.go
Outdated
lastReadyAt, ok := buildInfo.LastAgentReadyAt.(time.Time)
if !ok {
- lastReadyAt, ok := buildInfo.LastAgentReadyAt.(time.Time)
- if !ok {
+ if buildInfo.LastAgentReadyAt.IsZero() {
(when paired with my suggestion about a cast in the return column)
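The difference between the two forms in this suggestion can be sketched in isolation. The row types below are hypothetical stand-ins for the generated query struct, and the sketch assumes, as the review comment implies, that the `::timestamptz` cast makes the field a concrete `time.Time` instead of an untyped value.

```go
package main

import "time"

// untypedRow mimics a generated row whose aggregate column has no concrete Go type.
type untypedRow struct{ LastAgentReadyAt interface{} }

// typedRow mimics the row after the ::timestamptz cast (assumed to yield time.Time).
type typedRow struct{ LastAgentReadyAt time.Time }

// readyFromUntyped needs a type assertion before the value can be used.
func readyFromUntyped(r untypedRow) (time.Time, bool) {
	t, ok := r.LastAgentReadyAt.(time.Time)
	if !ok {
		return time.Time{}, false
	}
	return t, !t.IsZero()
}

// readyFromTyped only has to check for the zero time.
func readyFromTyped(r typedRow) (time.Time, bool) {
	if r.LastAgentReadyAt.IsZero() {
		return time.Time{}, false
	}
	return r.LastAgentReadyAt, true
}

// unixTime is a helper for building example timestamps.
func unixTime(sec int64) time.Time { return time.Unix(sec, 0) }
```

With the typed field, the caller no longer has to handle a failed assertion as a separate case from "no agent is ready yet".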
Co-authored-by: Danny Kopping <danny@coder.com>
Documentation Check
Updates Needed
The new
Additionally, this metric supports native histograms (per
The table at lines 107-206 is auto-generated (see comment at line 105), so the actual update should be made to the source that generates this documentation. The native histogram note should be manually updated.
Automated review via Coder Tasks
Documentation Check
New Documentation Needed
The new metric
What needs to be documented:
Additional considerations:
Per the comment in
However, since the existing metrics table doesn't include usage examples or aggregation guidance, this detail might be better suited for a separate guide or inline code comment rather than the metrics table itself.
Automated review via Coder Tasks
Makes WorkspaceBuildDurationHistogram injectable on LifecycleAPI so tests can use a per-test prometheus.Registry. Each metric-related test now gathers metrics and asserts:
- Correct label values (template_name, org, transition, status, is_prebuild)
- Exactly one observation with positive duration
- No observations when AllAgentsReady is false
Documentation Check
Updates Needed
Details
This PR adds a new Prometheus histogram metric
The metric is emitted when the agent reports ready/error/timeout, ensuring each build is counted exactly once per replica.
Action required: Run
Automated review via Coder Tasks
Replace custom helper functions with the existing promhelp package (coderd/coderdtest/promhelp) which is the codebase standard for Prometheus metric testing. Each test now uses:
- promhelp.HistogramValue() to validate labels, sample count, and sum
- promhelp.MetricValue() to assert metric absence
- Per-test prometheus.Registry for isolation
73a8563 to 01ea121
Move metric name to an exported constant so tests reference it instead of duplicating the string.
01ea121 to b0267f4
Use fixed buildCreatedAt and agentReadyAt times (90s apart) so tests can assert the exact GetSampleSum() rather than just > 0.
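Why fixed timestamps allow an exact assertion can be shown with a small, dependency-free Go sketch. It does not use the real histogram or the promhelp package: `recorder` is a hand-rolled stand-in that only tracks count and sum, and `observer`, `observeBuildDuration`, and `fixedTimes` are hypothetical names. The date is arbitrary.

```go
package main

import "time"

// observer is the single method of a Prometheus histogram this sketch relies on.
type observer interface{ Observe(float64) }

// recorder is a hand-rolled stand-in for a histogram; it keeps count and sum.
type recorder struct {
	count int
	sum   float64
}

// Observe records one sample.
func (r *recorder) Observe(v float64) {
	r.count++
	r.sum += v
}

// observeBuildDuration records the seconds between build creation and agent ready.
func observeBuildDuration(o observer, createdAt, readyAt time.Time) {
	o.Observe(readyAt.Sub(createdAt).Seconds())
}

// fixedTimes returns two timestamps exactly 90 seconds apart.
func fixedTimes() (createdAt, readyAt time.Time) {
	createdAt = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return createdAt, createdAt.Add(90 * time.Second)
}
```

Because both inputs are fixed, a test can require a sample count of 1 and a sum of exactly 90, which is stricter than checking that the sum is greater than zero.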
Required for Prometheus to scrape native histograms via protobuf format, used by the workspace build duration metric.
Documentation Check
Previous Feedback
Not yet addressed - Documentation still needs updating for the new metric.
Updates Needed
The metric is properly configured with native histogram support in
This configuration matches the existing native histogram metrics, so it should be documented in the same section.
Automated review via Coder Tasks
Not required for native histograms - protobuf scraping is controlled by Prometheus scrape_protocols config, not the server-side handler.
Replace the global WorkspaceBuildDurationSeconds var with NewBuildDurationHistogram(reg) so tests use the real histogram definition (same buckets, native histogram config, labels) instead of a separate test-only histogram. The histogram is created in coderd.go when Prometheus is enabled, stored on the API struct, and threaded through to the agent API via Options.
Documentation Check
Updates Needed
The metric table in prometheus.md is auto-generated via
Note on Native Histogram Support
Consider adding this metric to the native histogram support section (lines 208-222) if it should support native histograms like
Automated review via Coder Tasks
- Replace package-level var with LifecycleMetrics struct and NewLifecycleMetrics constructor
- Make emitBuildDurationMetric a receiver on LifecycleAPI
- Unexport emitMetricsOnce
- Thread LifecycleMetrics through coderd -> agentapi Options
- Tests use NewLifecycleMetrics with per-test registries
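The PR does not show how emitMetricsOnce works, so the following is only one plausible shape for an "emit at most once" guard, written as a hedged sketch. `buildEmitter` and its fields are hypothetical and are not taken from the codebase. The sketch uses `sync.Once` from the standard library.

```go
package main

import "sync"

// buildEmitter is a hypothetical guard around a single build's metric emission.
type buildEmitter struct {
	once    sync.Once
	emitted int
}

// emit runs fn at most once, no matter how many lifecycle reports arrive.
func (b *buildEmitter) emit(fn func()) {
	b.once.Do(func() {
		fn()
		b.emitted++
	})
}
```

A guard like this only deduplicates within one process, which is consistent with the description's wording that each build is counted exactly once per replica.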
Documentation Check
Updates Needed
Context
This PR adds a new Prometheus histogram metric
The metric should be documented in the auto-generated metrics table at lines 107-206 of
The metric should also be added to the native histogram support section (lines 208-222) alongside the existing
Automated review via Coder Tasks
Documentation Check
Updates Needed
Context
This PR adds a new Prometheus histogram metric
The metric was successfully added to the auto-generated metrics table in commit 954b5fe. The metric is properly configured with native histogram support (NativeHistogramBucketFactor, NativeHistogramMaxBucketNumber, NativeHistogramMinResetDuration in
However, there's a typo in the native histogram support section where the metric name is listed with "coderd_template" duplicated:
Automated review via Coder Tasks
Adds coderd_template_workspace_build_duration_seconds histogram that tracks the full duration from workspace build creation to agent ready. This captures the complete user-perceived build time including provisioning and agent startup.
The metric is emitted when the agent reports ready/error/timeout via the lifecycle API, ensuring each build is counted exactly once per replica.
Labels: template_name, organization_name, workspace_transition, status
Fixes #21621