query/cost
This check calculates the cost of a query and optionally reports an issue if that cost is too high. It runs the expr query from every rule against the selected Prometheus servers and reports the results. It can be used for both recording and alerting rules, but is mostly useful for recording rules.
Query evaluation duration
The total duration of a query comes from the Prometheus query stats included in the API response when ?stats=1 is passed. When enabled, pint can report if evalTotalTime is higher than the configured limit, which can be used either for informational purposes or to fail checks on queries that are too expensive (depending on the configured severity).
Query evaluation samples
Similar to evaluation duration, this information comes from Prometheus query stats. Two different stats tell us about the number of samples used by a given query:
- totalQueryableSamples - the total number of samples read during the query execution.
- peakSamples - the maximum number of samples kept in memory during the query execution. It shows how close the query was to reaching the --query.max-samples limit.
In general, a higher totalQueryableSamples means that a query reads a lot of time series, queries a large time range, or both, and each of these translates into longer query execution times. Looking at peakSamples, on the other hand, can be useful for finding queries that are complex and perform some operation on a large number of time series, for example when you run max(...) on a query that returns a huge number of results.
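As a rough illustration of where these numbers live, the sketch below parses the stats object from a Prometheus query API response and compares each value against a limit. The response body is a made-up sample with illustrative numbers, and check_stats and its parameters are hypothetical names chosen for this example. It is not how pint itself is implemented.

```python
import json

# Made-up sample of a Prometheus /api/v1/query response requested with stats enabled.
# All numbers are illustrative only.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [],
    "stats": {
      "timings": {"evalTotalTime": 0.42},
      "samples": {"totalQueryableSamples": 250000, "peakSamples": 8000}
    }
  }
}
"""


def check_stats(body, max_peak=0, max_total=0, max_duration=0.0):
    """Return one message for every stat that is above its limit.

    A limit of 0 means "not set", so nothing is reported for that stat.
    Durations are in seconds.
    """
    stats = json.loads(body)["data"]["stats"]
    observed = {
        "peakSamples": (stats["samples"]["peakSamples"], max_peak),
        "totalQueryableSamples": (stats["samples"]["totalQueryableSamples"], max_total),
        "evalTotalTime": (stats["timings"]["evalTotalTime"], max_duration),
    }
    return [
        f"{name} is {value}, limit is {limit}"
        for name, (value, limit) in observed.items()
        if limit and value > limit
    ]


problems = check_stats(SAMPLE_RESPONSE, max_peak=10000, max_total=200000, max_duration=60.0)
print(problems)
```

With the sample above only totalQueryableSamples exceeds its limit, so a single message is returned. Calling check_stats with no limits returns an empty list.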
Series returned by the query
For recording rules, anything returned by the query will be saved into Prometheus as new time series. Checking how many time series a rule returns allows us to estimate how much extra memory will be needed. pint will try to estimate the number of bytes needed per single time series and use that to estimate the amount of memory needed to store all the time series returned by a given query. The bytes per time series number is calculated using this query:
avg(avg_over_time(go_memstats_alloc_bytes[2h]) / avg_over_time(prometheus_tsdb_head_series[2h]))
Since Go uses a garbage collector, total Prometheus process memory will be more than the sum of all memory allocations, depending on many factors like memory pressure, Go version, GOGC settings etc. The estimate pint gives you should be considered a best case scenario.
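To make the arithmetic concrete, here is a small sketch of the estimate described above: bytes per series is the average allocated bytes divided by the average number of head series, and the result is multiplied by the number of series a rule returns. The function name and all input values are hypothetical; real numbers would come from the query shown earlier.

```python
def estimate_memory_bytes(series_count, alloc_bytes_avg, head_series_avg):
    """Estimate memory for new series as series_count * (alloc bytes / head series)."""
    if head_series_avg <= 0:
        raise ValueError("head_series_avg must be positive")
    bytes_per_series = alloc_bytes_avg / head_series_avg
    return series_count * bytes_per_series


# Hypothetical inputs: 8 GiB allocated on average, 2 million head series,
# and a recording rule that returns 5000 series.
estimate = estimate_memory_bytes(5000, 8 * 1024**3, 2_000_000)
print(f"{estimate / 1024**2:.1f} MiB")
```

With these assumed inputs each series costs roughly 4.3 KB, so 5000 new series come to about 20 MiB before any garbage collector overhead.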
Configuration
Syntax:
cost {
  comment               = "..."
  severity              = "bug|warning|info"
  maxSeries             = 5000
  maxPeakSamples        = 10000
  maxTotalSamples       = 200000
  maxEvaluationDuration = "1m"
}
- comment - set a custom comment that will be added to reported problems.
- severity - set custom severity for reported issues, defaults to a warning. This is only used when the number of series returned by the query exceeds the maxSeries value (if set). If maxSeries is not set, or when the results count is below it, pint will still report it as information.
- maxSeries - if set and the number of results for a given query exceeds this value, it will be reported as a bug (or custom severity if severity is set).
- maxPeakSamples - setting this to a non-zero value will tell pint to report any query that has a higher peakSamples value than the one configured here. Nothing will be reported if this option is not set.
- maxTotalSamples - setting this to a non-zero value will tell pint to report any query that has a higher totalQueryableSamples value than the one configured here. Nothing will be reported if this option is not set.
- maxEvaluationDuration - setting this to a non-zero value will tell pint to report any query that has a higher evalTotalTime value than the one configured here. Nothing will be reported if this option is not set.
How to enable it
This check is not enabled by default as it requires explicit configuration to work. To enable it, add one or more prometheus {...} blocks and a rule {...} block with this check's config.
Examples:
All rules from files matching the rules/dev/.+ pattern will be tested against the dev server. Everything will be reported as information, whatever the query returns.
prometheus "dev" {
  uri     = "https://prometheus-dev.example.com"
  timeout = "30s"
  include = ["rules/dev/.+"]
}
rule {
  cost {}
}
Fail checks if any recording rule is using more than 300000 peak samples or if it’s taking more than 30 seconds to evaluate.
rule {
  match {
    kind = "recording"
  }
  cost {
    maxPeakSamples        = 300000
    maxEvaluationDuration = "30s"
    severity              = "bug"
    comment               = "This query is too expensive to run"
  }
}
How to disable it
You can disable this check globally by adding this config block:
checks {
  disabled = ["query/cost"]
}
You can also disable it for all rules inside a given file by adding a comment anywhere in that file. Example:
# pint file/disable query/cost
Or you can disable it per rule by adding a comment to it. Example:
# pint disable query/cost
If you want to disable only individual instances of this check you can add a more specific comment.
If maxSeries is set:
# pint disable query/cost($prometheus:$maxSeries)
Where $prometheus is the name of the Prometheus server to disable and $maxSeries is the configured maxSeries value.
Example:
# pint disable query/cost(dev:5000)
If maxSeries is NOT set:
# pint disable query/cost($prometheus)
Where $prometheus is the name of the Prometheus server to disable.
Example:
# pint disable query/cost(dev)
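If these comments are generated by tooling, the two forms above can be produced by a small helper. This is only a sketch: disable_comment is a hypothetical function name, and the only thing taken from this page is the comment syntax itself.

```python
def disable_comment(prometheus, max_series=None):
    """Build a comment that disables query/cost for one Prometheus server.

    Pass max_series only when maxSeries is set in the check configuration.
    """
    if max_series is None:
        return f"# pint disable query/cost({prometheus})"
    return f"# pint disable query/cost({prometheus}:{max_series})"


print(disable_comment("dev"))
print(disable_comment("dev", 5000))
```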
How to snooze it
You can disable this check until a given time by adding a comment to it. Example:
# pint snooze $TIMESTAMP query/cost
Where $TIMESTAMP is either an RFC3339 formatted timestamp or a YYYY-MM-DD date. Adding this comment will disable query/cost until $TIMESTAMP; after that the check will be re-enabled.
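As an illustration of the two accepted timestamp shapes, the sketch below builds a snooze comment from either a plain date or a timezone-aware datetime. The helper name snooze_comment is hypothetical, and the dates are arbitrary examples.

```python
from datetime import date, datetime, timezone


def snooze_comment(until, check="query/cost"):
    """Build a snooze comment from a date (YYYY-MM-DD) or an aware datetime (RFC3339)."""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(until, datetime):
        if until.tzinfo is None:
            raise ValueError("datetime must be timezone-aware")
        stamp = until.isoformat(timespec="seconds")
    else:
        stamp = until.isoformat()
    return f"# pint snooze {stamp} {check}"


print(snooze_comment(date(2030, 1, 15)))
print(snooze_comment(datetime(2030, 1, 15, 12, 0, tzinfo=timezone.utc)))
```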