promql/counter
This check will find rules with invalid use of counters. Counters track the number of events over time and so the value of a counter can only grow and never decrease. This means that the absolute value of a counter doesn’t matter, it will be a random number that depends on the number of events that happened since your application was started. To use the value of a counter in PromQL you most likely want to calculate the rate of events using the rate() function, or any other function that is safe to use with counters. Once you calculate the rate you can use that result in other functions or aggregations that are not counter safe, like sum().`
Here’s an example of invalid alerting rules that uses a counter metric called errors_total
. This metric will be incremented every time there’s an error.
A bad rule could look like this:
- alert: Too many errors
expr: errors_total > 10
The problem here is that a counter like errors_total
will only go up in value until:
- the value overflows the maximum value for a float
- your service restarts and resets the value of
errors_total
to zero - so it starts counting again
Once there are 11 errors observed since your application started Too many errors
alert will fire and will keep firing until your application restarts. This kind of alerts is usually unhelpful and what you really want to track is the health of your application right now. This alert should be triggered if, for example, in the last 1 hour there were more than N errors. If there’s a spike of errors but then errors stop, then the alert should stop firing.
Example of a better rule:
- alert: Too many errors
expr: rate(errors_total[1h]) > 10
Common problems
Metadata mismatch
Metric type checks are using metadata API. Metadata is aggregated from all scraped metrics.
This can cause a few potential problems:
- You might have the same metric reported with multiple different types and Prometheus or pint won’t know which time series is which type, because all we have to match a metric to a type is its name. Best solution here is to never export same name as multiple metrics with different types.
- If you change the typo of some exported metric then the old type will still show up in metadata, plus the new one, as long as there’s at least one target still exporting old metric type. If you accidentally exported some metric with wrong type, then fixed it, but pint is still complaining, then it’s very likely that you didn’t release your fix to all targets yet.
Configuration
This check doesn’t have any configuration options.
How to enable it
This check is enabled by default for all configured Prometheus servers.
Example:
prometheus "prod" {
uri = "https://prometheus-prod.example.com"
timeout = "60s"
include = [
"rules/prod/.*",
"rules/common/.*",
]
}
prometheus "dev" {
uri = "https://prometheus-dev.example.com"
timeout = "30s"
include = [
"rules/dev/.*",
"rules/common/.*",
]
}
How to disable it
You can disable this check globally by adding this config block:
checks {
disabled = ["promql/counter"]
}
You can also disable it for all rules inside given file by adding a comment anywhere in that file. Example:
# pint file/disable promql/counter
Or you can disable it per rule by adding a comment to it. Example:
# pint disable promql/counter
If you want to disable only individual instances of this check you can add a more specific comment.
# pint disable promql/counter($prometheus)
Where $prometheus
is the name of Prometheus server to disable.
Example:
# pint disable promql/counter(prod)
How to snooze it
You can disable this check until given time by adding a comment to it. Example:
# pint snooze $TIMESTAMP promql/counter
Where $TIMESTAMP
is either use RFC3339 formatted or YYYY-MM-DD
. Adding this comment will disable promql/counter
until $TIMESTAMP
, after that check will be re-enabled.