
I Kept Getting Alerted for a Job That Was Fine

How CronDoctor's own alerts taught me about percentiles.

Early on, before product launch, I decided that CronDoctor should monitor itself. If you're building a cron job monitoring tool and you don't point it at your own infrastructure, what are you even doing? So I set up three internal monitors: one for the hourly cleanup job, one for the demo simulator, and one for the compliance check. The compliance check is the important one. It evaluates every monitor, every 60 seconds.

Those last two jobs, both on 60-second schedules, have wild variance. Sometimes they finish in 100 milliseconds. Sometimes they take 6 or 8 seconds. Occasionally 10. Nothing is wrong in any of these cases. The compliance check is doing real work. It queries every monitor in the database, compares schedules, calculates states. How long that takes depends on what's happening at that exact moment, how many monitors exist, how busy the database connection pool is. It's all normal. It's just not consistent.

I started out using simple averages and an average-based threshold. The average run time across the last bunch of pings was about 1.5 seconds, so I set the alert at something like 2 seconds over that. And then my own product started alerting me. Two, three, four times an hour. Every time one of those completely normal 6-second runs came in, it blew past the average-based line and fired an alert. The job was fine. It had always done this. But the math said it was slow, so I got an email.

I bumped the threshold. Still too many alerts. Bumped it again. Now it was too loose to catch actual problems. I was sitting there tuning a number on my own monitoring tool, which is the exact situation the monitoring tool was supposed to prevent for other people.

The average was the problem

The issue wasn't the threshold number. The issue was using an average at all. An average takes all your run times, includes the fast ones and the slow ones, and gives you a number in the middle that doesn't describe any actual run.

My compliance check runs were mostly 100-300ms. Fast. But a few times an hour, a run would take 6 to 10 seconds. Totally normal. Maybe a cold start, maybe the database was busy. Those slow-but-fine runs pulled the average up to about 1.5 seconds. So when the next slow-but-fine run came in at 7 seconds, it looked like an anomaly relative to that 1.5-second average, even though 7-second runs happened all the time. The average was telling me my job's “normal” was 1.5 seconds, which wasn't true. No individual run looked like that. Runs were either fast (sub-300ms) or occasionally slow (6-10s), and the average matched neither group.

One outlier run, say a 15-second cold start, would drag the average even higher and make even more of the normal runs look abnormal by comparison. The average doesn't filter out noise. It eats it.
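To put rough numbers on it, here's a toy sketch. The durations below are invented, just shaped like the behavior I described: mostly sub-300ms runs with a few 6-9 second ones mixed in. The mean lands in the gap between the two groups and describes neither.

```python
# Invented durations, shaped roughly like the compliance check:
# mostly fast runs with a few slow-but-fine ones mixed in.
durations = [0.15, 0.2, 0.25, 0.1, 0.3, 0.2, 0.15, 7.0, 0.25, 0.2,
             0.1, 0.3, 6.5, 0.2, 0.15, 0.25, 0.2, 9.0, 0.3, 0.1]

mean = sum(durations) / len(durations)
print(f"mean = {mean:.2f}s")  # ~1.3s: slower than every fast run,
                              # faster than every slow one, equal to none
```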

P95

I went looking for what other people do. In the SRE and infrastructure monitoring world (Prometheus, Datadog, the Google SRE book), percentiles are everywhere. P95 and P99 latency charts are standard for API monitoring. The concept has been around for years. But when I looked at cron job monitoring tools, nobody was doing it. Everything was still binary: did the job run, yes or no. Maybe a fixed timeout. The percentile approach that had become standard practice for APIs never crossed over to scheduled jobs.

P95 is simple. You take the last batch of runs (CronDoctor uses 50), sort them by duration, and find the value that 95% of them finish under. That's your baseline for what “normal” looks like, including the slow-but-fine runs. The top 5% get ignored. They're outliers.
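Here's a minimal sketch of that calculation using the nearest-rank method. The function and its name are mine, not CronDoctor's actual implementation; it just shows the mechanics: keep the last 50 durations, sort them, and take the value that 95% of runs finish at or under.

```python
import math

def p95(durations: list[float], window_size: int = 50) -> float:
    """Nearest-rank 95th percentile over the most recent runs
    (a sketch of the idea, not CronDoctor's actual code)."""
    window = sorted(durations[-window_size:])  # last 50 runs, slowest last
    rank = math.ceil(0.95 * len(window))       # 48th value out of 50
    return window[rank - 1]                    # 95% of runs finish at or under this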

For my compliance check, P95 landed around 8 or 9 seconds. That accounted for the fast runs and the slow-but-normal runs. Unlike the average (which was 1.5 seconds and made everything over 3 seconds look like a problem), P95 actually reflected the reality of how the job behaved. A run had to be unusual, not just one of the regular slow cycles, before it crossed the line.

Then you add a buffer on top. CronDoctor uses 20%. So a P95 of 9 seconds becomes an adaptive threshold of about 10.8 seconds. A run has to clear that before it counts as slow. This absorbs the normal jitter around the edges so you're not alerting on runs that are barely above the 95th percentile. When an alert does fire, it means something actually different happened.
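Continuing the same hypothetical sketch, the adaptive threshold is just that P95 baseline times 1.2, and a run only counts as slow once it clears it:

```python
def is_slow(duration: float, recent_durations: list[float],
            buffer: float = 0.20) -> bool:
    """Alert only above P95 plus a buffer (a sketch, reusing p95() above)."""
    threshold = p95(recent_durations) * (1 + buffer)  # e.g. 9s -> 10.8s
    return duration > threshold
```

With the compliance check's numbers, a 7-second run stays under a roughly 10.8-second threshold, where the old average-plus-2-seconds line would have paged me.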

I implemented it and the false alerting stopped. Those 6-second runs that had been alerting me three times an hour were now below the P95 threshold.

The thing I didn't expect

Another benefit of P95 is the self-correction. Jobs change over time. Your database gets bigger, your queries take longer, what used to be a 2-second job becomes a 5-second job because you added a table or your data volume doubled. With a static threshold you'd have to go manually adjust the number. With a P95 threshold over a rolling window, you don't.

As the slower runs fill up the 50-run window, P95 rises to match. The first few might trigger an alert, which is actually what you want. A heads-up that something shifted. But once the new pace is the norm, the threshold catches up and stops bugging you.
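Here's a small, hypothetical simulation of that catch-up, reusing the p95() sketch from earlier: the job's normal drifts from about 2 seconds to about 5, the first few slower runs cross the old threshold, and then the 50-run window absorbs them and the alerts stop on their own.

```python
from collections import deque

window = deque(maxlen=50)                  # only the last 50 runs ever count
for run in range(200):
    duration = 2.0 if run < 100 else 5.0   # the job gets slower halfway through
    if len(window) >= 10:                  # wait for a minimal baseline first
        threshold = p95(list(window)) * 1.2
        if duration > threshold:
            print(f"run {run}: {duration}s > {threshold:.1f}s threshold")
    window.append(duration)
```

Only the first three slow runs print an alert; by the fourth, enough of them sit inside the window that the P95 has already moved up to meet them.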

It works the other direction too. If you optimize a job and it gets faster, the threshold tightens. What used to be a normal 8-second run starts looking slow once your P95 drops to 3 seconds.

I didn't set out to build an adaptive threshold system. I just wanted my own product to stop sending me emails for runs that were fine. P95 with a 20% buffer was the first thing I tried that actually worked, and it's been stable enough that I haven't needed to revisit it. It handles CronDoctor's own monitors, and it handles every customer's monitors the same way. Feed it timing data and, with as few as 10 data points, CronDoctor starts learning on its own what normal looks like for your jobs.

For most people that default is all they'll ever need. But some folks want to dial it in. On the Pro plan there are presets: Sensitive (P90 + 10%), Default (P95 + 20%), Tolerant (P99 + 30%), and a custom option if none of those fit. I added those because they could be useful for extreme cases, not because the default was wrong. P95 + 20% covers the vast majority of jobs. The presets are there for the power users who have a specific reason to tighten or loosen the timing.
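Written out as data, the presets look something like this. The percentiles and buffers are the ones above; the names and the shape of the mapping are just my shorthand, not CronDoctor's actual config format.

```python
PRESETS = {
    "sensitive": {"percentile": 0.90, "buffer": 0.10},
    "default":   {"percentile": 0.95, "buffer": 0.20},
    "tolerant":  {"percentile": 0.99, "buffer": 0.30},
}
```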

I looked and I could not find any other cron monitoring tools out there that do the P95 approach. Some store percentile data for dashboards, but their actual alerting is still fixed thresholds or simple binary checks. You can get adaptive anomaly detection from a full observability platform like Datadog if you emit custom metrics and wire it up yourself, but that's a different thing than having it built in. The dedicated cron monitors, the ones people actually use for their backups and billing jobs, are mostly still asking “did it run” and leaving it at that.

CronDoctor learns what normal looks like for each of your jobs, so it only alerts you when something actually changes. Five jobs free.

Written by Brad Wiederholt, founder of Huladyne Labs and builder of CronDoctor.