Why is Load Average shown in the Dashboard Metrics but not included in Metric Drains?

Load average can be a useful diagnostic metric, but our assessment is that the downsides of how it has to be implemented at the container level outweigh the benefits of providing it in Metric Drains.

How Load Average is Calculated

When load average is measured at the host level, it is computed by the kernel, which keeps track of it whenever it updates the state of a process. In particular, this means the load average provided by the kernel is not subject to sampling bias, since it is kept up to date every time a task changes state.

Unfortunately, the same is not true of cgroup (container-level) load average, since the kernel does not track load average on a per-cgroup basis. Instead, we use the Control Groupstats netlink API, which reports the number of tasks in each state (sleeping, running, and so on) that we need to compute load average. Aptible Deploy polls this API periodically to calculate a load average for the Dashboard that is comparable to the load average you would expect from the kernel.
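To illustrate how a load average can be derived from polled task counts, here is a minimal Python sketch. It assumes the same exponentially damped moving average the Linux kernel uses for host-level load average; the function name, the sample values, and the 5-second polling interval are hypothetical and are not taken from Aptible Deploy's implementation.

```python
import math

def update_load_average(previous, active_tasks, interval_s, window_s=60.0):
    """Advance an exponentially damped moving average by one polled sample.

    active_tasks is the number of running plus uninterruptible tasks
    observed at the moment of the poll (hypothetical input).
    """
    decay = math.exp(-interval_s / window_s)
    return previous * decay + active_tasks * (1.0 - decay)

# Hypothetical task counts, one per poll; not real measurements.
samples = [0, 0, 3, 0, 0]

load = 0.0
for count in samples:
    load = update_load_average(load, count, interval_s=5.0)

print(f"1-minute load average after {len(samples)} polls: {load:.3f}")
```

Each poll contributes as if the observed task count had held for the entire polling interval, which is the property the next section is concerned with.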

Sampling Bias

Since we released load average in the Dashboard, we have come to realize that this approach is very much subject to sampling bias. If Deploy happens to poll the API while a task is in uninterruptible sleep, load average will jump even if the task stayed in that state for only a microsecond. Unless Deploy were to poll very, very frequently (which would ultimately have a performance impact), load average ends up being a very noisy and fairly misleading metric. It’s worth noting that it’s only noisy in one direction: if your container is actually very busy, the load average will properly reflect this, but at other times the load average will jump for no reason.
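The size of this effect can be estimated with a short Python sketch. It assumes an exponentially damped 1-minute average and a hypothetical 5-second polling interval (neither is a documented Deploy setting), and compares the jump caused by one unlucky sample against what the same task would contribute if it were weighted by the time it actually spent in that state.

```python
import math

def spike_from_one_sample(active_tasks, interval_s, window_s=60.0):
    """Increase in a damped load average caused by a single polled sample."""
    return active_tasks * (1.0 - math.exp(-interval_s / window_s))

def time_weighted_contribution(active_tasks, busy_s, interval_s, window_s=60.0):
    """Same increase, scaled by the fraction of the interval actually spent busy."""
    return spike_from_one_sample(active_tasks, interval_s, window_s) * (busy_s / interval_s)

interval = 5.0   # hypothetical polling interval, in seconds
busy = 1e-6      # the task was in uninterruptible sleep for one microsecond

sampled = spike_from_one_sample(1, interval)
weighted = time_weighted_contribution(1, busy, interval)

print(f"jump from one polled sample: {sampled:.4f}")
print(f"time-weighted contribution:  {weighted:.2e}")
```

Under these assumptions, a single sample that catches the task adds roughly 0.08 to the 1-minute load average, while a time-weighted measurement would attribute many orders of magnitude less to the same microsecond.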

This sampling bias, unfortunately, makes load average fairly useless (if not actively misleading) from an alerting perspective and only mildly useful from an investigative perspective. We’ve seen many cases where load average was high in a container where everything was actually perfectly fine.

Security Concerns

Finally, using the aforementioned netlink API would require running Metric Drains as privileged containers, which we avoid doing for security reasons.