June 4, 2026 · 6 min read

The cron job that cost three days: real BullMQ + Upstash edge cases

A specific BullMQ + Upstash Redis bug that took three days to find and half an hour to fix. The story and the lesson on why idempotent retries aren't optional.

stack-decisions · reclamaai

There are bugs that feel like a crossword: one clue leads to another, pieces fit, and at the end there's an elegant solution. There are bugs that feel like a piñata in the dark: you swing at everything and hit nothing. This was the second kind. Here's the story and the lesson.

The symptom

In ReclamaAI, a nightly job reviews documents generated in the last 30 days and emails their owners a reminder if they haven't downloaded them yet. It's simple: SELECT, loop, send email, mark as reminded. I wrote the function, tested it locally, and deployed it with BullMQ running on Upstash Redis.

It worked the first night. The second night, two users reported receiving the same email five times. The third night, another user reported getting it twelve times. The count kept growing.

The first thing I tried (and got wrong)

My first instinct was that the cron was firing multiple times. I checked the cron: it ran once a day at 3am. I checked BullMQ: the job was being enqueued once per night. I checked the worker logs: as far as I could tell, the worker processed the job once. But the emails still went out multiple times.

My second instinct: the SELECT must be returning the same document multiple times (a bad join). No. The query was correct and returned 47 distinct rows, one per document.

The clue

I spent a day chasing the wrong lead. What finally gave me the clue was logging the BullMQ job.id inside the worker. There I saw it: the same job ID was being processed multiple times, minutes apart, by different worker instances.

It turns out that in serverless (Vercel functions) there is no single long-running worker. Every time an invocation arrives, a new worker instance spins up, reads from the queue, processes what it can, and shuts down. If the job takes longer than the 60-second function limit on the Hobby plan, the function gets killed mid-job. The dead worker stops renewing its lock on the job, BullMQ marks the job as "stalled", and another instance picks it up. The job's internal loop (the one sending 47 emails) had already sent N of the 47 before dying, but because the function never acknowledged the job as complete, BullMQ ran it again from the start.
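A toy model makes the mechanism concrete. Everything in it is an assumption for illustration: the function name simulateRetries is hypothetical, the 2 seconds per email is a guess, and each attempt is assumed to restart from the first document because nothing durable records per-document progress.

```typescript
// Toy model of a job that is killed at the time limit and retried.
// Every number is an illustrative assumption, not a measurement.
function simulateRetries(
  docCount: number,
  secondsPerEmail: number,
  timeLimitSeconds: number,
  maxAttempts: number
): number[] {
  const emailsPerDoc: number[] = new Array<number>(docCount).fill(0);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let elapsed = 0;
    let finished = true;
    for (let doc = 0; doc < docCount; doc++) {
      if (elapsed + secondsPerEmail > timeLimitSeconds) {
        finished = false; // the platform kills the function here
        break;
      }
      emailsPerDoc[doc] += 1; // email sent, but the job is not yet acknowledged
      elapsed += secondsPerEmail;
    }
    if (finished) break; // job acknowledged, so the queue stops retrying
  }
  return emailsPerDoc;
}

// 47 documents, 2 s per email, 60 s limit, 3 attempts:
// the first 30 documents are emailed 3 times, the last 17 are never reached.
const counts = simulateRetries(47, 2, 60, 3);
```

Under these assumptions the job never finishes, so every retry adds one more duplicate for the same early documents, which is the pattern users were reporting.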

The fix

Three changes:

  1. The "remind 47 documents" job was decomposed into 47 individual jobs, one per document. Each runs in under 2 seconds. If one fails, BullMQ retries just that one, not all 47.
  2. Each individual job enforces idempotency: before sending the email, it runs UPDATE documents SET reminded_at = NOW() WHERE id = ? AND reminded_at IS NULL. If the update affects 0 rows, the reminder was already sent, so the job aborts. If it affects 1 row, it sends the email.
  3. We bumped the Vercel plan to Pro to have longer functions, and reduced BullMQ's per-instance batch size so no instance took more than it could process in its window.
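The first two changes can be sketched roughly as follows. This is a minimal sketch, not the actual ReclamaAI code: the Db interface, the InMemoryDb stand-in, the job name "remind-document", and the function names fanOut and remindOnce are all hypothetical. The { name, data } objects follow the shape that BullMQ's Queue.addBulk accepts, but the sketch does not call BullMQ so that it stays self-contained.

```typescript
// Hypothetical minimal database interface: resolves to the number of rows affected.
interface Db {
  execute(sql: string, params: unknown[]): Promise<number>;
}

interface ReminderJob {
  name: string;
  data: { documentId: string };
}

// Change 1: one small job per document instead of one job for all of them.
function fanOut(documentIds: string[]): ReminderJob[] {
  return documentIds.map((documentId) => ({
    name: "remind-document", // illustrative job name
    data: { documentId },
  }));
}

const CLAIM_SQL =
  "UPDATE documents SET reminded_at = NOW() WHERE id = ? AND reminded_at IS NULL";

// Change 2: claim the document first, send only if the claim succeeded.
async function remindOnce(
  db: Db,
  sendEmail: (documentId: string) => Promise<void>,
  documentId: string
): Promise<"sent" | "skipped"> {
  const affected = await db.execute(CLAIM_SQL, [documentId]);
  if (affected === 0) return "skipped"; // already reminded by an earlier attempt
  await sendEmail(documentId);
  return "sent";
}

// In-memory stand-in for the documents table, for local experiments only.
class InMemoryDb implements Db {
  private reminded = new Set<string>();
  async execute(_sql: string, params: unknown[]): Promise<number> {
    const id = String(params[0]);
    if (this.reminded.has(id)) return 0;
    this.reminded.add(id);
    return 1;
  }
}
```

One design consequence worth noting: because the row is claimed before the email goes out, a crash between the UPDATE and the send means that reminder is skipped rather than duplicated. For a reminder email that is usually the better failure mode.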

The general lesson

Every serverless function can die mid-execution. Every queue can retry. Every network can deliver a message twice. Idempotency isn't an optimization, it's a requirement. If your action isn't safe to run twice, eventually it will run twice, and eventually you'll get an email from an angry user.

The rule I now apply in any worker: before doing anything with side effects (sending email, charging money, calling an external API), pass through a guard that dedupes in the database with a unique constraint. If you can't dedupe, log the attempt before doing it, not after. And if your logic can't be split into small units that are each idempotent, split it further until it can.
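As a generic version of that rule, here is a small sketch of a guard that records an attempt under a unique key before running the side effect. The names AttemptLog, recordAttempt, and runOnce are hypothetical, and the in-memory Set only stands in for a database table with a unique constraint on the key.

```typescript
// Stand-in for a table with a UNIQUE constraint on the idempotency key.
// In a real database this would be an INSERT that does nothing on conflict.
class AttemptLog {
  private keys = new Set<string>();

  // Returns true only for the first caller to record a given key.
  recordAttempt(key: string): boolean {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}

// Record the attempt BEFORE the side effect, then run it at most once per key.
async function runOnce(
  log: AttemptLog,
  key: string,
  sideEffect: () => Promise<void>
): Promise<boolean> {
  if (!log.recordAttempt(key)) return false; // duplicate delivery or retry
  await sideEffect();
  return true;
}
```

Recording first gives at-most-once behaviour: a crash after the record but before the side effect loses the action instead of repeating it. Whether that is the right trade depends on the action; for charging money it usually is.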

What I learned about BullMQ defaults

BullMQ has a removeOnComplete option that defaults to false. That meant my Redis queue was growing indefinitely with completed jobs, filling Upstash's free plan without me noticing. I changed it to true, and added removeOnFail: { age: 86400 } to keep failed jobs for 24 hours (useful for debugging) and then clean them up. Three days to find the bug, thirty minutes to fix it, five minutes to configure it correctly from the start. Next time I'll know.
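For reference, the cleanup settings mentioned above could be written as a plain options object like this. The constant names are hypothetical; in BullMQ an object like this can be passed as defaultJobOptions when constructing a Queue, or as the options of an individual job. The sketch leaves out the Queue construction and Redis connection so it runs on its own.

```typescript
// Retention window for failed jobs, in seconds (BullMQ's `age` is in seconds).
const ONE_DAY_SECONDS = 24 * 60 * 60;

// Hypothetical name; intended to be used as a queue's default job options.
const reminderJobOptions = {
  removeOnComplete: true, // drop completed jobs instead of keeping them in Redis
  removeOnFail: { age: ONE_DAY_SECONDS }, // keep failed jobs for 24 hours, then clean up
};
```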