Skip to content

Instrument both workercache and valkey store with metrics.#347

Open
Julian Gutierrez Oschmann (juli4n) wants to merge 1 commit into
agent-substrate:mainfrom
juli4n:workercache_metrics
Open

Instrument both workercache and valkey store with metrics.#347
Julian Gutierrez Oschmann (juli4n) wants to merge 1 commit into
agent-substrate:mainfrom
juli4n:workercache_metrics

Conversation

@juli4n

Copy link
Copy Markdown
Collaborator

Add metrics to track the caching layer. Emit metrics to count pub/sub PUBLISH events (and failures), number of cached workers, amount of time the cache is disconnected, total resyncs (caused by a watch disconnection) and total relists.

Add metrics to track the caching layer. Emit metrics to count pub/sub
`PUBLISH` events (and failures), number of cached workers, amount of time
the cache is disconnected, total resyncs (caused by a watch disconnection)
and total relists.
persistence := ateredis.NewPersistence(rdb)
persistence, err := ateredis.NewPersistence(rdb)
if err != nil {
mr.Close()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to do testing.Defer or something similar so we don't need to remember to do this every time?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea. We are already using this pattern throughout this file so I would probably tackle that as a separate PR?

}
if err := s.rdb.Publish(ctx, workerPubSubChannel, payload).Err(); err != nil {
slog.ErrorContext(ctx, "worker event publish failed", slog.Any("err", err))
s.metricWorkerPubsubMessages.Add(ctx, 1, metric.WithAttributes(attribute.String("error.type", "_OTHER")))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused on this. Is it a typical pattern to increment the same metric with an error attribute, or have a separate metric for failures?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an otel expert, but I think this is the recommended way of reporting errors (https://opentelemetry.io/docs/specs/semconv/general/recording-errors/#recording-errors-on-metrics).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants