promql/fragile
This check will try to find rules with queries that can be rewritten in a way which makes them more resilient to label changes.
Example:
Let’s assume we have these metrics:
errors{cluster="prod", instance="server1", job="node_exporter"} 5
requests{cluster="prod", instance="server1", job="node_exporter", rack="a"} 10
requests{cluster="prod", instance="server1", job="node_exporter", rack="b"} 30
requests{cluster="prod", instance="server1", job="node_exporter", rack="c"} 25
If we want to calculate the ratio of errors to requests we can use this query:
errors / sum(requests) without(rack)
sum(requests) without(rack)
will produce this result:
requests{cluster="prod", instance="server1", job="node_exporter"} 65
Both sides of the query will have exact same set of labels:
{cluster="prod", instance="server1", job="node_exporter"}`
which is needed to be able to use a binary expression here, and so this query will work correctly.
But the risk here is that if at any point we change labels on those metrics we might end up with left and right hand sides having different set of labels. Let’s see what happens if we add an extra label to requests
metric.
errors{cluster="prod", instance="server1", job="node_exporter"} 5
requests{cluster="prod", instance="server1", job="node_exporter", rack="a", path="/"} 3
requests{cluster="prod", instance="server1", job="node_exporter", rack="a", path="/users"} 7
requests{cluster="prod", instance="server1", job="node_exporter", rack="b", path="/"} 10
requests{cluster="prod", instance="server1", job="node_exporter", rack="b", path="/login"} 1
requests{cluster="prod", instance="server1", job="node_exporter", rack="b", path="/users"} 19
requests{cluster="prod", instance="server1", job="node_exporter", rack="c", path="/"} 25
Our left hand side (errors
metric) still has the same set of labels:
{cluster="prod", instance="server1", job="node_exporter"}
But sum(requests) without(rack)
will now return a different result:
requests{cluster="prod", instance="server1", job="node_exporter", path="/"} 38
requests{cluster="prod", instance="server1", job="node_exporter", path="/users"} 26
requests{cluster="prod", instance="server1", job="node_exporter", path="/login"} 1
We no longer get a single result because we only aggregate by removing rack
label. Newly added path
label is not being aggregated away so it splits our results into multiple series. Since our left hand side doesn’t have any path
label it won’t match any of the right hand side result and this query won’t produce anything.
One solution here is to add path
to without()
to remove this label when aggregating, but this requires updating all queries that use this metric every time labels change.
Another solution is to rewrite this query with by()
instead of without()
which will ensure that extra labels will be aggregated away automatically:
errors / sum(requests) by(cluster, instance, job)
The list of labels we aggregate by doesn’t have to match exactly with the list of labels on the left hand side, we can use on()
to instruct Prometheus which labels should be used to match both sides. For example if we would remove job
label during aggregation we would once again have different sets of labels on both side, but we can fix that by adding labels we use in by()
to on()
:
errors / on(cluster, instance) sum(requests) by(cluster, instance)
Configuration
This check doesn’t have any configuration options.
How to enable it
This check is enabled by default.
How to disable it
You can disable this check globally by adding this config block:
checks {
disabled = ["promql/fragile"]
}
You can also disable it for all rules inside given file by adding a comment anywhere in that file. Example:
# pint file/disable promql/fragile
Or you can disable it per rule by adding a comment to it. Example:
# pint disable promql/fragile
How to snooze it
You can disable this check until given time by adding a comment to it. Example:
# pint snooze $TIMESTAMP promql/fragile
Where $TIMESTAMP
is either use RFC3339 formatted or YYYY-MM-DD
. Adding this comment will disable promql/fragile
until $TIMESTAMP
, after that check will be re-enabled.