In the microservices world, monitoring is crucial. Fortunately, many libraries gather metrics for us, and usually it requires minimal work to achieve our goal. The most popular, ready-to-use metrics found in most libraries are JVM memory usage, HTTP requests, or Hikari pool metrics. We can also find ready-to-use Grafana dashboards. Putting everything together, we can set up the whole monitoring stack in an hour. Because of that, we don't always understand how it works, and it is hard to define custom technical or domain-specific metrics. What can we measure? For example, how fast we process events, how often we hit an edge case, or how many cars are on the road.
How to expose metrics?
Exposing Micrometer metrics in Prometheus format is pretty easy. All you need to do is define an endpoint to return data from the registry. Example code using Ktor:
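A minimal sketch, assuming Ktor 2.x with the ktor-server-core and micrometer-registry-prometheus dependencies; the registry name, module name, and route path are illustrative:

```kotlin
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

// A single shared registry for the whole application (illustrative name).
val appMicrometerRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

fun Application.configureMetrics() {
    routing {
        // Prometheus scrapes this endpoint; scrape() renders every
        // registered meter in the Prometheus text exposition format.
        get("/metrics") {
            call.respondText(appMicrometerRegistry.scrape())
        }
    }
}
```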
How to count?
The counter is the most popular Micrometer metric.
Count it if you want to know how fast an application sends events.
Count it to know how many times something happened in a given time window.
Don't use it to measure a value at a single point in time, such as memory usage; that is a job for a gauge.
It is effortless to count. You can define a helper function to do it:
Then you can wrap any code fragment with the count function. Note that the process lambda has its receiver (this) bound to ResourceSample. Thanks to that, the provided function has access to the ResourceSample methods (e.g., it is easy to define tags). As an example, we can stub sending an event to a topic:
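A sketch of such a helper plus a stubbed producer, assuming Micrometer 1.10+ (which provides Timer.resource) and a Prometheus registry; the Event class, meter name, and println stand-in are illustrative:

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry: MeterRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

// Wraps any code fragment in a timed sample; the sample records the rate
// and duration under the given meter name when it is closed. The lambda
// receiver is Timer.ResourceSample, so the wrapped code can call its
// methods (e.g. tag()) directly.
fun <T> count(name: String, process: Timer.ResourceSample.() -> T): T =
    Timer.resource(registry, name).use { sample -> sample.process() }

data class Event(val label: String, val payload: String)

// Stubbed producer: pretends to send an event to a topic while recording
// it under the "events.sent" meter, tagged per label.
fun sendEvent(event: Event) = count("events.sent") {
    tag("label", event.label) // ResourceSample method, available via the receiver
    println("sending ${event.payload} to topic") // stand-in for a real producer call
}
```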
Then you can query it:
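For example, assuming the meter is named events.sent and the Prometheus registry exports it as events_sent_seconds_count (the exported name depends on your registry's naming conventions), a per-label rate query might look like:

```promql
sum by (label) (rate(events_sent_seconds_count[5m]))
```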
With the result presented in Grafana:
Note that if you’re running multiple pods, it may be necessary to sum the data by an application tag (defined at the deployment level). Tags make it easy to group values; in my example, it is possible to measure the event rate per label.
How to gauge?
What about monitoring our application state? For example, what if we want to know how many cars are on the road? With counters, we can measure how often we start or end a trip, but it is hard to say how many trips are currently in progress. Yet we’d like to create alerts for exceptional business situations, for example, when no trips are in progress. We can query the data in our database and expose the result with Micrometer. The gauge is perfect for that case, but this metric type is more complicated to define. Let’s take a look at the code:
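A sketch of such a gauge, assuming the micrometer-registry-prometheus dependency; the meter name and the refresh function are illustrative, and the database call is left as a parameter:

```kotlin
import java.util.concurrent.atomic.AtomicLong
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry: MeterRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

// Strong reference to the gauged state: Micrometer only keeps a weak
// reference, so without this the value would become NaN after GC.
val tripsInProgress = AtomicLong(0)

val tripsGauge: Gauge = Gauge
    .builder("trips.in.progress", tripsInProgress) { it.get().toDouble() }
    .description("Number of trips currently in progress")
    .register(registry)

// Called periodically by a scheduler with the current result of a
// database query; never invoked from the metrics endpoint itself.
fun refreshTripsInProgress(countFromDatabase: Long) {
    tripsInProgress.set(countFromDatabase)
}
```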
First of all, to avoid NaN values, we have to keep a strong reference to our result to protect it from the garbage collector. We can find the details in the official docs.
Another problem is that we must run this code periodically with a scheduler. We should not execute database queries in a metrics endpoint to avoid random failures and to keep a constant response time.
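One way to run the refresh periodically, sketched with a plain JDK scheduler (the interval and the refresh callback are illustrative; a framework scheduler or coroutine would work just as well):

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Refresh the gauge state on a background thread so the /metrics
// endpoint stays free of database calls and keeps a constant response time.
val scheduler = Executors.newSingleThreadScheduledExecutor()

fun startGaugeRefresh(refresh: () -> Unit) {
    // First run immediately, then once a minute.
    scheduler.scheduleAtFixedRate({ refresh() }, 0, 60, TimeUnit.SECONDS)
}
```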
The last problem is that the gauge should run on only one pod of your application; otherwise, the query will execute on all your pods. You can either define a separate deployment responsible for monitoring purposes (e.g., via configurable profiles or properties) or create an individual service just for monitoring and keep only one instance of it running.
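A tiny sketch of the profile-based approach; the flag and its wiring are assumptions, not a specific library API:

```kotlin
// Gate database-backed gauge registration behind a flag so that only the
// instance (or deployment profile) dedicated to monitoring registers them.
fun registerMonitoringGauges(monitoringInstance: Boolean, register: () -> Unit) {
    if (monitoringInstance) {
        register()
    }
}
```

At startup you could call it with, for example, `System.getenv("MONITORING_ENABLED") == "true"` as the flag.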
The counter is easy to use and a potent tool. The gauge is more useful when the state lives in a database, but it requires more effort to configure properly. Considering this, it’s worth looking at alternative solutions, for example, Loki-based alerting.