Merged
67 commits
2c07fa9
feat: ReconcileUtils for strongly consistent updates (#3106)
csviri Jan 15, 2026
699c493
feat: observability with otel and default grafana dashboard
csviri Feb 4, 2026
22609cc
wip
csviri Feb 4, 2026
0987e91
wip
csviri Feb 4, 2026
148fd79
wip
csviri Feb 4, 2026
8eb5710
wip
csviri Feb 8, 2026
98f1199
wip
csviri Feb 8, 2026
699062c
wip
csviri Feb 8, 2026
ab6264b
wip
csviri Feb 9, 2026
0e359bd
wip
csviri Feb 9, 2026
041a0d2
wip
csviri Feb 9, 2026
3d3b6ea
wip
csviri Feb 9, 2026
dc0e0bb
wip
csviri Feb 9, 2026
4ab5e49
wip
csviri Feb 9, 2026
cd951d4
wip
csviri Feb 9, 2026
209aa6f
wip
csviri Feb 9, 2026
27daaf8
wip
csviri Feb 9, 2026
7036f99
improve: micrometer metrics improvements
csviri Feb 9, 2026
4698f39
wip
csviri Feb 10, 2026
c18b633
wip
csviri Feb 10, 2026
c70ed7c
wip
csviri Feb 10, 2026
f54bbb8
wip
csviri Feb 10, 2026
3607459
wip
csviri Feb 10, 2026
30003c5
wip
csviri Feb 10, 2026
3378ff1
wip
csviri Feb 11, 2026
10fc023
wip
csviri Feb 11, 2026
09c0e4d
wip
csviri Feb 11, 2026
9946130
wip
csviri Feb 11, 2026
85ce0f0
e2e test skeleton
csviri Feb 11, 2026
c7f516f
wip
csviri Feb 12, 2026
703a517
wip
csviri Feb 17, 2026
af635fb
wip
csviri Feb 21, 2026
1786efc
wip
csviri Feb 27, 2026
cd0cc3b
wip
csviri Feb 27, 2026
1f44e86
wip
csviri Feb 27, 2026
4f1ba17
wip
csviri Feb 27, 2026
9c4bfff
wip
csviri Feb 27, 2026
55d62ad
wip
csviri Feb 27, 2026
7b2a8c3
wip
csviri Feb 28, 2026
861494b
wip
csviri Mar 1, 2026
de159ac
documentation update
csviri Mar 1, 2026
e5798e8
wip
csviri Mar 1, 2026
68eb71b
logging
csviri Mar 1, 2026
b08d781
Update sample-operators/metrics-processing/src/main/java/io/javaopera…
csviri Mar 1, 2026
0b7d5b4
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
9fa3a76
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
d23e27b
Update operator-framework-core/src/main/java/io/javaoperatorsdk/opera…
csviri Mar 1, 2026
59340cd
Update observability/install-observability.sh
csviri Mar 1, 2026
4bdb4ee
wip
csviri Mar 1, 2026
ccbee9c
Update sample-operators/metrics-processing/src/main/resources/io/java…
csviri Mar 1, 2026
2836046
wip
csviri Mar 1, 2026
81fadbe
wip
csviri Mar 1, 2026
4bff248
wip
csviri Mar 1, 2026
88386c3
wip
csviri Mar 1, 2026
c5a01f2
wip
csviri Mar 1, 2026
9cbb6f3
wip
csviri Mar 1, 2026
97cc20c
Refinements on metrics
csviri Mar 3, 2026
7f6cc3e
wip
csviri Mar 3, 2026
cff32d9
docs improvement
csviri Mar 3, 2026
a445035
fix: add deprecation information
metacosm Mar 4, 2026
f6c1cdc
refactor: consistent constant definition
metacosm Mar 4, 2026
e299854
refactor: reuse available methods to help inlining
metacosm Mar 4, 2026
810ec80
refactor: avoid creating intermediate collection
metacosm Mar 4, 2026
8d9692f
refactor: remove unused constant
metacosm Mar 4, 2026
bde2e17
fixed from code review
csviri Mar 4, 2026
974e0bf
wip
csviri Mar 4, 2026
4149d7f
wip
csviri Mar 5, 2026
1 change: 1 addition & 0 deletions .github/workflows/e2e-test.yml
@@ -25,6 +25,7 @@ jobs:
- "sample-operators/tomcat-operator"
- "sample-operators/webpage"
- "sample-operators/leader-election"
- "sample-operators/metrics-processing"
runs-on: ubuntu-latest
steps:
- name: Checkout
1 change: 1 addition & 0 deletions .github/workflows/pr.yml
@@ -11,6 +11,7 @@ on:
paths-ignore:
- 'docs/**'
- 'adr/**'
- 'observability/**'
workflow_dispatch:
jobs:
check_format_and_unit_tests:
125 changes: 101 additions & 24 deletions docs/content/en/docs/documentation/observability.md
@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
```

### Micrometer implementation
### MicrometerMetricsV2

The micrometer implementation is typically created using one of the provided factory methods which, depending on which
is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
could lead to performance issues.
[`MicrometerMetricsV2`](http://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java)
is the recommended micrometer-based implementation. It is designed with low cardinality in mind:
all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
resources come and go.

To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
instance via:
The simplest way to create an instance:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry).build();
```

Optionally, include a `namespace` tag on per-reconciliation counters (disabled by default to avoid unexpected
cardinality increases in existing deployments):

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withNamespaceAsTag()
.build();
```

You can also supply a custom timer configuration for `reconciliations.execution.duration`:

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
.build();
```

The class provides factory methods which either return a fully pre-configured instance or a builder object that will
allow you to configure more easily how the instance will behave. You can, for example, configure whether the
implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
#### MicrometerMetricsV2 metrics

All meters use `controller.name` as their primary tag. Counters optionally carry a `namespace` tag when
`withNamespaceAsTag()` is enabled.

| Meter name (Micrometer) | Type | Tags | Description |
|--------------------------------------|---------|---------------------------------------------------|------------------------------------------------------------------|
| `reconciliations.active` | gauge | `controller.name` | Number of reconciler executions currently executing |
| `reconciliations.queue` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
| `custom_resources` | gauge | `controller.name` | Number of custom resources tracked by the controller |
| `reconciliations.execution.duration` | timer | `controller.name` | Reconciliation execution duration with explicit bucket histogram |
| `reconciliations.started.total` | counter | `controller.name`, `namespace`* | Number of reconciliations started (including retries) |
| `reconciliations.success.total` | counter | `controller.name`, `namespace`* | Number of successfully finished reconciliations |
| `reconciliations.failure.total` | counter | `controller.name`, `namespace`* | Number of failed reconciliations |
| `reconciliations.retries.total` | counter | `controller.name`, `namespace`* | Number of reconciliation retries |
| `events.received` | counter | `controller.name`, `event`, `action`, `namespace` | Number of Kubernetes events received by the controller |

\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.

The execution timer uses explicit boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OpenTelemetry Protocol (OTLP) registry, where
`publishPercentileHistogram()` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
`_bucket` queries.
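
To make the role of the explicit boundaries concrete, here is a minimal, dependency-free sketch of how cumulative `le` buckets are filled and how a quantile can be estimated from them by linear interpolation, in the spirit of `histogram_quantile()`. The class and method names are hypothetical and are not part of JOSDK or Micrometer; only the boundaries (in milliseconds) come from the text above.

```java
import java.util.Arrays;

public class ExplicitBucketSketch {

  // Upper bounds in milliseconds, matching the boundaries listed above.
  static final long[] BOUNDS_MS = {10, 50, 100, 250, 500, 1_000, 2_000, 5_000, 10_000, 30_000};

  // cumulative[i] counts observations <= BOUNDS_MS[i]; the last slot is the +Inf bucket.
  final long[] cumulative = new long[BOUNDS_MS.length + 1];

  void record(long durationMs) {
    for (int i = 0; i < BOUNDS_MS.length; i++) {
      if (durationMs <= BOUNDS_MS[i]) {
        cumulative[i]++;
      }
    }
    cumulative[BOUNDS_MS.length]++; // the +Inf bucket counts every observation
  }

  // Estimates the q-quantile (0 < q <= 1) by linear interpolation inside the matching bucket.
  double quantile(double q) {
    long total = cumulative[BOUNDS_MS.length];
    if (total == 0) {
      return Double.NaN;
    }
    double rank = q * total;
    for (int i = 0; i < BOUNDS_MS.length; i++) {
      if (cumulative[i] >= rank) {
        double lower = i == 0 ? 0 : BOUNDS_MS[i - 1];
        long below = i == 0 ? 0 : cumulative[i - 1];
        long inBucket = cumulative[i] - below;
        return lower + (BOUNDS_MS[i] - lower) * (rank - below) / inBucket;
      }
    }
    // The rank falls into the +Inf bucket: report the highest finite boundary.
    return BOUNDS_MS[BOUNDS_MS.length - 1];
  }

  public static void main(String[] args) {
    ExplicitBucketSketch histogram = new ExplicitBucketSketch();
    for (long d : new long[] {5, 20, 30, 80, 120, 300, 700, 1_500}) {
      histogram.record(d);
    }
    System.out.println(Arrays.toString(histogram.cumulative));
    System.out.println(histogram.quantile(0.5));
  }
}
```

Because the boundaries are fixed and known in advance, the estimate only needs the cumulative counts per boundary, which is the information that classic `_bucket` series carry.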

> **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
> For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
> `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
> `reconciliations_execution_duration_milliseconds_*`.
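
The renaming described in the note can be approximated as follows. This is a simplified, hypothetical illustration of the observable result (dots become underscores and a base time unit is appended); it is not the naming-convention code of either registry, and `exposedTimerPrefix` is a name made up for this sketch.

```java
public class TimerNameSketch {

  // Simplified assumption: replace dots with underscores and append the base time unit.
  static String exposedTimerPrefix(String meterName, String baseUnit) {
    return meterName.replace('.', '_') + "_" + baseUnit;
  }

  public static void main(String[] args) {
    String meter = "reconciliations.execution.duration";
    // Unit used when scraping a PrometheusMeterRegistry, per the note above.
    System.out.println(exposedTimerPrefix(meter, "seconds"));
    // Unit seen when exporting through OtlpMeterRegistry, per the note above.
    System.out.println(exposedTimerPrefix(meter, "milliseconds"));
  }
}
```

When writing queries or dashboards, it helps to check which of the two prefixes your setup actually produces before hard-coding a metric name.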

#### Grafana Dashboard

A ready-to-use Grafana dashboard is available at
[`observability/josdk-operator-metrics-dashboard.json`](http://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
executions, resource counts, and execution duration histograms and heatmaps.

The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
observability sample (see below).

#### Exploring metrics end-to-end

The
[`metrics-processing` sample operator](http://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
includes a full end-to-end test,
[`MetricsHandlingE2E`](http://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
that:

1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
`observability/install-observability.sh`, which also imports the Grafana dashboards.
2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period
3. Verifies that the expected metrics appear in Prometheus

This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
having to deploy your own operator.

### MicrometerMetrics (Deprecated)

> **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
> V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
> cardinality growth and can lead to performance issues in your metrics backend.

The legacy `MicrometerMetrics` implementation is still available. To create an instance that behaves as it historically
has:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
```

For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
using up to 2 threads:

```java
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
.build();
```

### Operator SDK metrics
#### Operator SDK metrics (V1)

The micrometer implementation records the following metrics:
The V1 micrometer implementation records the following metrics:

| Meter name | Type | Tag names | Description |
|-------------------------------------------------------------|----------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |

As you can see all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
depends on the considered metric and how the implementation is configured: `group?, version, kind, [name, namespace?],
scope` where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
are prefixed with `resource.`.
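
The tag pattern above can be read as a small set of rules. The following sketch spells them out in plain Java; it is a hypothetical helper written for illustration (it is not the `addMetadataTags` method of `MicrometerMetrics`), and it only produces tag names, not values.

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceTagSketch {

  // Builds tag names following the pattern: group?, version, kind, [name, namespace?], scope.
  static List<String> tagNames(
      String group, String namespace, boolean perResource, boolean prefixed) {
    String prefix = prefixed ? "resource." : "";
    List<String> tags = new ArrayList<>();
    if (group != null && !group.isEmpty()) {
      tags.add(prefix + "group"); // omitted when the group is empty
    }
    tags.add(prefix + "version");
    tags.add(prefix + "kind");
    if (perResource) {
      // only present when per-resource collection is enabled
      tags.add(prefix + "name");
      if (namespace != null && !namespace.isEmpty()) {
        tags.add(prefix + "namespace"); // omitted when the namespace is empty
      }
    }
    tags.add(prefix + "scope");
    return tags;
  }

  public static void main(String[] args) {
    // Controller execution context, per-resource collection enabled.
    System.out.println(tagNames("example.com", "default", true, true));
    // Empty group, per-resource collection disabled, no prefix.
    System.out.println(tagNames("", null, false, false));
  }
}
```

The per-resource branch is where the cardinality problem comes from: every distinct resource name produces a new tag value.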

### Aggregated Metrics

26 changes: 24 additions & 2 deletions docs/content/en/docs/migration/v5-3-migration.md
@@ -4,7 +4,7 @@ description: Migrating from v5.2 to v5.3
---


## Renamed JUnit Module
## Rename of JUnit module

If you use the JUnit extension in your tests, just rename it from:

@@ -26,4 +26,26 @@ to
<version>5.3.0</version>
<scope>test</scope>
</dependency>
```
```

## Metrics interface changes
Collaborator


This would technically be an API break and would require a new major version.

Collaborator Author


Strictly speaking yes, but we do make such minor API changes sometimes (see the migration document), as other frameworks sometimes do. I think it is the better tradeoff: we don't want to increase the major version that often, and on the other hand we have quite a number of APIs, so it is sometimes better to evolve this way, IMO.

I also tried to make this backwards compatible, and we still could. But in the end it looked like that would be more confusing than just having a table that makes it easy to migrate from the current implementation, if that makes sense.


The [Metrics](http://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/monitoring/Metrics.java)
interface changed in a non-backwards-compatible way in order to make the API cleaner.

The following table shows the relevant method renames:

| v5.2 method | v5.3 method |
|------------------------------------|------------------------------|
| `reconcileCustomResource` | `reconciliationSubmitted` |
| `reconciliationExecutionStarted` | `reconciliationStarted` |
| `reconciliationExecutionFinished` | `reconciliationFinished` |
| `failedReconciliation` | `reconciliationFailed` |
| `finishedReconciliation` | `reconciliationSucceeded` |
| `cleanupDoneFor` | `cleanupDone` |
| `receivedEvent` | `eventReceived` |


Other changes:
- `reconciliationFinished(..)` and `reconciliationFailed(..)` methods are extended with `RetryInfo`
- `monitorSizeOf(..)` method is removed.
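
For projects with a custom `Metrics` implementation, the renames can be applied mechanically. The sketch below is a hypothetical, dependency-free helper (not a tool shipped with JOSDK) that rewrites method names in a source string. The pairs follow the renames visible in the `MicrometerMetrics` changes of this PR; changed parameter lists, such as the added `RetryInfo`, still need manual review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetricsRenameSketch {

  // v5.2 -> v5.3 method names, as they appear in the MicrometerMetrics changes of this PR.
  static final Map<String, String> RENAMES = new LinkedHashMap<>();

  static {
    RENAMES.put("reconcileCustomResource", "reconciliationSubmitted");
    RENAMES.put("reconciliationExecutionStarted", "reconciliationStarted");
    RENAMES.put("reconciliationExecutionFinished", "reconciliationFinished");
    RENAMES.put("failedReconciliation", "reconciliationFailed");
    RENAMES.put("finishedReconciliation", "reconciliationSucceeded");
    RENAMES.put("cleanupDoneFor", "cleanupDone");
    RENAMES.put("receivedEvent", "eventReceived");
  }

  // Rewrites " oldName(" occurrences; this is plain text replacement, not a Java parser.
  static String migrate(String source) {
    String result = source;
    for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
      result = result.replace(" " + rename.getKey() + "(", " " + rename.getValue() + "(");
    }
    return result;
  }

  public static void main(String[] args) {
    String old = "public void receivedEvent(Event event, Map<String, Object> metadata) {";
    System.out.println(migrate(old));
  }
}
```

No old name is also a new name, so the order of the replacements does not matter.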
@@ -39,6 +39,10 @@

import static io.javaoperatorsdk.operator.api.reconciler.Constants.CONTROLLER_NAME;

/**
* @deprecated Use {@link MicrometerMetricsV2} instead
*/
@Deprecated(forRemoval = true)
public class MicrometerMetrics implements Metrics {

private static final String PREFIX = "operator.sdk.";
@@ -68,7 +72,6 @@ public class MicrometerMetrics implements Metrics {
private static final String EVENTS_RECEIVED = "events.received";
private static final String EVENTS_DELETE = "events.delete";
private static final String CLUSTER = "cluster";
private static final String SIZE_SUFFIX = ".size";
private static final String UNKNOWN_ACTION = "UNKNOWN";
private final boolean collectPerResourceMetrics;
private final MeterRegistry registry;
@@ -182,7 +185,7 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
}

@Override
public void receivedEvent(Event event, Map<String, Object> metadata) {
public void eventReceived(Event event, Map<String, Object> metadata) {
if (event instanceof ResourceEvent) {
incrementCounter(
event.getRelatedCustomResourceID(),
@@ -201,14 +204,14 @@ public void receivedEvent(Event event, Map<String, Object> metadata) {
}

@Override
public void cleanupDoneFor(ResourceID resourceID, Map<String, Object> metadata) {
public void cleanupDone(ResourceID resourceID, Map<String, Object> metadata) {
incrementCounter(resourceID, EVENTS_DELETE, metadata);

cleaner.removeMetersFor(resourceID);
}

@Override
public void reconcileCustomResource(
public void reconciliationSubmitted(
HasMetadata resource, RetryInfo retryInfoNullable, Map<String, Object> metadata) {
Optional<RetryInfo> retryInfo = Optional.ofNullable(retryInfoNullable);
incrementCounter(
@@ -228,19 +231,20 @@ public void reconcileCustomResource(
}

@Override
public void finishedReconciliation(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationSucceeded(HasMetadata resource, Map<String, Object> metadata) {
incrementCounter(ResourceID.fromResource(resource), RECONCILIATIONS_SUCCESS, metadata);
}

@Override
public void reconciliationExecutionStarted(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationStarted(HasMetadata resource, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.incrementAndGet();
}

@Override
public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationFinished(
HasMetadata resource, RetryInfo retryInfo, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.decrementAndGet();
@@ -251,8 +255,8 @@ public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Ob
}

@Override
public void failedReconciliation(
HasMetadata resource, Exception exception, Map<String, Object> metadata) {
public void reconciliationFailed(
HasMetadata resource, RetryInfo retry, Exception exception, Map<String, Object> metadata) {
var cause = exception.getCause();
if (cause == null) {
cause = exception;
@@ -266,11 +270,6 @@ public void failedReconciliation(
Tag.of(EXCEPTION, cause.getClass().getSimpleName()));
}

@Override
public <T extends Map<?, ?>> T monitorSizeOf(T map, String name) {
return registry.gaugeMapSize(PREFIX + name + SIZE_SUFFIX, Collections.emptyList(), map);
}

private void addMetadataTags(
ResourceID resourceID, Map<String, Object> metadata, List<Tag> tags, boolean prefixed) {
if (collectPerResourceMetrics) {