Your Cron Job Didn't Hurt You. It Hurt Everyone Else.
The person who owns the job is usually the last one to notice it stopped running.
I have a spreadsheet. It's not a good spreadsheet. It has about twenty rows and two columns: one for the name of a scheduled job that somebody else owns, and one for the date I last confirmed its output actually showed up where it was supposed to.
I keep this spreadsheet because I am, professionally, a downstream consumer of other people's cron jobs. Data feeds that populate reports I need. Nightly syncs that keep dashboards current. Weekly exports that land in shared drives. I don't own these jobs. I don't have access to the servers they run on. I just need them to work.
When they don't, I'm usually the one who notices. Not because I'm vigilant. Because I'm the one staring at a blank report on a Monday morning, trying to figure out whether the data is late or gone. I file a ticket. I ping someone on Slack. The response, almost always, is some version of: “Huh, looks like it didn't run. It'll go again tonight.”
And they're not wrong. It will go again tonight. The job owner didn't lose anything. Their system didn't crash. Their pager didn't go off. They shrugged because, from where they sit, nothing happened. The failure was invisible to them.
It was not invisible to me.
Right now, someone downstream is your monitoring system. They're the person who opens a dashboard and sees yesterday's numbers are missing. The person who gets a support ticket from a customer asking why their invoice never arrived. The partner whose integration shows stale data. They're doing the job that a monitoring check should be doing. They just don't know it.
We solved monitoring for everything that fails loudly
The infrastructure monitoring industry is mature and, in certain areas, genuinely impressive. We have Prometheus scraping metrics every 15 seconds. We have Datadog dashboards tracking latency at the 99th percentile. We have PagerDuty waking people up at 3 AM when an API error rate crosses a threshold.
All of this was built for systems that fail loudly. A web server goes down and requests start returning 500s. An API gets slow and latency spikes. A database runs out of connections and everything backed up behind it starts throwing errors. These failures produce signals. They show up in graphs. They trigger alerts.
Cron jobs don't do any of that. A cron job that fails at 3 AM produces no signal at all. There's no request to time out, no error rate to spike, no graph to flatline. The job just doesn't run. And “didn't run” looks exactly like “hasn't run yet” until someone downstream needs the output and it isn't there.
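To make that concrete, here is roughly what such a job looks like in a crontab. The script path is a hypothetical stand-in.

```
# If backup.sh exits non-zero at 3 AM, cron's default on most systems is to
# mail any output to the local user, which requires a working local MTA
# that most servers no longer have. Exit codes are never checked.
# A run that failed and a run that never happened look identical: silence.
0 3 * * * /usr/local/bin/backup.sh
```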
Google's SRE handbook defines four golden signals for monitoring: latency, traffic, errors, and saturation. All four assume a request-driven system. Something is sending traffic, and you're measuring what happens to it. A nightly backup job has no traffic. A weekly report generator has no latency curve. The framework that has become standard practice for monitoring services simply does not apply to scheduled tasks.
Tyler Treat wrote an essay called “Pain-Driven Development” that explains the underlying dynamic. Engineering teams, he argues, operate as greedy algorithms. They optimize locally, solving whatever pain is closest to them. The unintended consequence is pain displacement: the pain doesn't disappear, it moves to someone else. The ops team that doesn't monitor a cron job isn't malicious. The job doesn't cause them pain. So they work on things that do. And the pain quietly radiates outward to the finance team, the support team, the customer, the partner—the people least equipped to diagnose or fix it.
The parable of the empty bucket
On January 31, 2017, a GitLab engineer accidentally deleted a production database directory. It happens. That's what backups are for.
Except there were no backups.
GitLab's postmortem revealed that their pg_dump backup cron job had been failing silently. The job was running PostgreSQL 9.2 binaries against a 9.6 database, and the version mismatch caused it to error and exit. The error notifications were configured to go out by email, but DMARC wasn't enabled for the cronjob sender address, so the receiving server rejected them. The backups were failing. The failure notifications were failing. And nobody knew, because nobody was the downstream consumer of either process. The S3 backup bucket was completely empty. Five thousand projects, five thousand comments, and seven hundred user accounts were lost.
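The irony is how little code it would have taken to catch this. Here's a minimal sketch of being your own downstream consumer, checking the artifact instead of trusting the job; the bucket name and size threshold are hypothetical:

```bash
#!/usr/bin/env bash
# Verify that a backup actually landed, independently of whether the
# backup job reported success. Bucket and threshold are hypothetical.
set -euo pipefail

# `aws s3 ls` prints: date, time, size in bytes, key. The timestamps sort
# lexically, so the last line is the newest object. `|| true` keeps an
# empty listing from killing the script before we can report it.
latest=$(aws s3 ls s3://example-backups/nightly/ | sort | tail -n 1 || true)

if [ -z "$latest" ]; then
  echo "backup bucket is empty" >&2
  exit 1
fi

size=$(echo "$latest" | awk '{print $3}')
if [ "$size" -lt 1000000 ]; then
  echo "latest backup is only ${size} bytes" >&2
  exit 1
fi
```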
The engineer who deleted the directory felt the pain immediately. The team that owned the backup job didn't feel it until they reached for a backup that wasn't there. The GitLab users who lost their work never had a chance to feel anything at all until it was over.
GitLab's story is the famous one. But the pattern repeats constantly, less publicly.
A B2B SaaS platform's mysqldump backup job started failing on December 13, 2023, when the backup partition filled up. The script redirected stderr to /dev/null, so the errors vanished. Eighty-nine days later, a table corruption forced a restore. The most recent valid backup was from December. Fourteen thousand transactions across nearly three thousand customer records: gone.
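The characters that did the damage deserve a closer look. The difference between errors that vanish and errors that leave a trail is one redirect (paths hypothetical):

```
# The anti-pattern: everything the script says, including its dying words,
# goes to /dev/null.
0 2 * * * /opt/scripts/db-backup.sh > /dev/null 2>&1

# The same job, with failures leaving evidence somewhere durable.
0 2 * * * /opt/scripts/db-backup.sh >> /var/log/db-backup.log 2>&1
```

A log file nobody reads still isn't monitoring, but it turns a ninety-day mystery into a five-minute diagnosis once someone finally goes looking.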
A nightly Stripe sync job was killed by a permissions error after a deploy. Dead for eleven days. Discovery came when a customer reported missing data in their dashboard. The developer who wrote about it said the same thing had happened at three different companies he'd worked with.
The pattern is always the same. The job owner felt nothing. The downstream consumer discovered the failure during an emergency, when recovery was most expensive and least likely to succeed. Without monitoring, detection time is measured in days to months. With a simple heartbeat check, it's minutes.
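The heartbeat pattern is small enough to show in full. A sketch, assuming a hypothetical monitoring endpoint that expects a ping after every successful run and alerts when one doesn't arrive on schedule:

```bash
#!/usr/bin/env bash
# Ping the monitor only after the real work succeeds. If the job crashes,
# hangs, or never starts, no ping arrives, and the monitor notices the
# silence instead of a downstream human. Paths and URL are hypothetical.
set -euo pipefail

/usr/local/bin/generate-nightly-report   # the actual job

# Reached only on success. --fail makes curl exit non-zero on HTTP errors;
# --max-time keeps a slow monitor from hanging the job itself.
curl --fail --silent --max-time 10 https://monitor.example/ping/nightly-report
```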
What this actually costs
Think about your own scheduled jobs for a minute. Not in the abstract. Pick one.
Say it's a nightly report. It runs at 2 AM, pulls data from three sources, generates a PDF, drops it in a shared folder. Who reads that report? What happens to them on Tuesday morning when Monday's report isn't there? Do they wait? Do they email you? Do they make a decision based on last week's data and hope it's close enough?
That person emailed you about it last month, didn't they.
Now scale it. A backup job fails and the DBA doesn't find out until a disaster forces a restore. An email queue job stops and customers place orders, payments go through, but confirmation emails never go out. Support tickets spike. Chargebacks follow. A data feed to a partner goes stale and the partner's dashboard shows last week's numbers for a week before someone on their side calls to ask what's wrong. An invoicing job skips a cycle and accounts receivable spends three days reconciling.
In every case, the engineer who owns the cron job sleeps fine. The job didn't page them. Their monitoring dashboard is green, because their monitoring was built for the API and the database and the load balancer—the things that fail loudly. The cron job that quietly stopped running at 2 AM on Saturday won't be noticed until Monday morning, and it won't be noticed by them.
Why it's worth watching
Jeff Sussna, who coined the term “operational empathy,” framed the right question: what will this look like for the person downstream on Day 100? Not Day 1, when you wrote the job and watched it run successfully and moved on to the next ticket. Day 100, when something has quietly changed—a partition filled up, a credential expired, a schema migrated—and nobody is watching.
Monitoring your cron jobs is not about protecting yourself. Your job failing probably won't hurt you at all. It's about the people who depend on the output of that job and have no idea it exists, no access to your server, and no way to check whether it ran.
They will find out it didn't run. They always do. It'll just take days instead of minutes, and it'll happen at the worst possible time.
The question they'll ask won't be “why did it fail?” Every job fails eventually. The question will be “why wasn't anyone watching?”
That's the question I got tired of answering from the downstream side. It's why I built CronDoctor.
One curl command at the end of your job. If the ping doesn't arrive on time, CronDoctor alerts you and tells you what went wrong. Five jobs free.
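In a crontab, that looks like this (the ping URL format here is illustrative, not CronDoctor's actual API):

```
# The ping fires only if the job exits zero. If it doesn't arrive on time,
# the alert goes to you instead of whoever downstream needed the output.
0 2 * * * /usr/local/bin/nightly-report && curl -fsS --max-time 10 https://crondoctor.example/ping/abc123
```

That's the whole setup.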