It Works Locally But Not on the Server — Fighting a Job-Scraping Bot Block

Intro: "wait, it works on my computer?"

JobRadar scrapes a job title, company, and description automatically when you paste in a job posting URL. Then one day, a card showed up in the list looking like this.

Scraping failed
No company name
Error: Error: Fetch failed: 403

403 Forbidden. "You're not allowed to access this." The URL in question was a job page from the German company trivago (careers.trivago.com). What was strange was that scraping the exact same thing from my laptop worked fine. This is the most headache-inducing category of bug when developing — the "works locally, fails on the server" kind.

This post tracks down what that 403 actually was, and what I did about it in code. Short answer: the cause was bot blocking (Akamai), and along the way I also found a separate bug that was destroying real data.

Step 1. Narrowing down the actual cause

"It's a 403, so let's try changing headers" is too hasty an approach. First you need to see why it's returning a 403. I fired off the same request via curl, varying only the headers.

URL="https://careers.trivago.com/job/r8424124002/?gh_src=..."

# A) The same simple headers as the existing scraper
curl -s -o /dev/null -w "status=%{http_code}\n" \
  -H 'User-Agent: Mozilla/5.0 ... Chrome/120 ...' \
  -H 'Accept: text/html,...' "$URL"
# → status=200

# What's in the response headers?
curl -s -D - -o /dev/null "$URL" | grep -iE "server:|set-cookie"

This is where a decisive clue turned up. The response carried cookies like this.

set-cookie: ak_bmsc=...
set-cookie: bm_mi=...

Both ak_bmsc and bm_mi are cookies set by Akamai Bot Manager, so this site sits behind Akamai's bot-blocking service. Bot-blocking services like this commonly operate as follows.

A request from a residential IP (my home internet) → likely a human → allowed (200)
A request from a datacenter IP (Vercel, AWS, etc.) → likely a bot → blocked (403)

JobRadar is deployed on Vercel. So scraping requests come out of a Vercel datacenter IP, and Akamai flagged it with "hey, you're a bot" and threw a 403. My laptop (a residential IP) passed through fine. That was the answer to the "works locally, fails on the server" mystery.

Lesson: when you hit a 403, check the response cookies and headers first. If you spot cookie prefixes like ak_ or bm_ (Akamai) or __cf_ (Cloudflare), it's not a simple permission issue but bot blocking, and the right response is completely different.
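
That check can also be automated so a failed scrape is logged with a hint about who blocked it. The following is a minimal sketch, not JobRadar's actual code: `detectBotWall`, `HeaderReader`, and `BotWall` are hypothetical names, and it only looks for the signals mentioned in this post (the `ak_bmsc` and `bm_mi` cookies, the `__cf_bm` cookie, and the `cf-ray` header).

```typescript
// Minimal read-only view of response headers. The Fetch API's Headers object
// has a compatible get() method, and so does a hand-written fake in tests.
interface HeaderReader {
  get(name: string): string | null
}

type BotWall = 'akamai' | 'cloudflare' | 'none-detected'

// Hypothetical helper: guess the bot-blocking vendor from response headers.
function detectBotWall(headers: HeaderReader): BotWall {
  const setCookie = headers.get('set-cookie') ?? ''

  // Match a cookie name at the start of the value or right after a separator,
  // because several Set-Cookie values may arrive joined into one string.
  const hasCookie = (name: string): boolean =>
    new RegExp(`(^|[,;\\s])${name}=`).test(setCookie)

  if (hasCookie('ak_bmsc') || hasCookie('bm_mi')) return 'akamai'
  if (hasCookie('__cf_bm') || headers.get('cf-ray') !== null) return 'cloudflare'
  return 'none-detected'
}
```

A result of 'none-detected' only means none of these particular signals were present. It does not prove the 403 is an ordinary permission error.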

Step 2. A key realization — this is an "intermittent" problem

Digging further revealed something interesting. The trivago posting had already been scraped successfully once before. The DB had a perfectly intact description. Same URL, but 200 some days, 403 others.

This is characteristic of bot blocking. The decision weighs IP reputation, request timing, traffic patterns, and so on together, so from the outside it looks probabilistic: the same request is blocked on some attempts and allowed on others. Sure enough, while I was debugging, retrying a "failed" posting succeeded right away.

This is where the response strategy took shape.

Absorb intermittent 403s with retries. Even if blocked once, retrying a bit later has a good chance of passing.
Make the headers look more like a real browser. Lower the probability of getting blocked in the first place, at least a little.
(And, soon to be discovered) Make sure a failure doesn't destroy existing data.

Step 3. Building a retry + header-hardening helper

The existing scrapers each called fetch directly, used bare-bones headers, and threw immediately on failure.

// Existing: fails in one shot
const res = await fetch(url, {
  headers: {
    'User-Agent': 'Mozilla/5.0 ... Chrome/120 ...',
    'Accept': 'text/html,...',
  },
})
if (!res.ok) throw new Error(`Fetch failed: ${res.status}`)

I bundled this into a shared helper, fetchHtml, adding browser-like headers + retries.

const BROWSER_HEADERS: Record<string, string> = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; ...) Chrome/126.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,...,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Sec-Fetch-Dest': 'document',
  'Sec-Fetch-Mode': 'navigate',
  'Sec-Fetch-Site': 'none',
  'Sec-Fetch-User': '?1',
  'Upgrade-Insecure-Requests': '1',
}

// Status codes considered 'temporary' and eligible for retry
const RETRYABLE = new Set([403, 429, 500, 502, 503, 504])
const sleep = (ms: number) => new Promise(r => setTimeout(r, ms))

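// Options are typed explicitly; a bare `opts = {}` would be inferred as `{}`
// and the destructuring below would not type-check.
interface FetchHtmlOptions {
  label?: string          // prefix for error messages
  acceptLanguage?: string // overrides the default Accept-Language
  retries?: number        // extra attempts after the first one
}

export async function fetchHtml(url: string, opts: FetchHtmlOptions = {}): Promise<string> {
  const { label = 'Fetch', acceptLanguage, retries = 2 } = opts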
  const headers = acceptLanguage
    ? { ...BROWSER_HEADERS, 'Accept-Language': acceptLanguage }
    : BROWSER_HEADERS

  let lastError = `${label} failed`

  for (let attempt = 0; attempt <= retries; attempt++) {
    if (attempt > 0) await sleep(400 * 2 ** (attempt - 1)) // 400ms → 800ms ...

    let res: Response
    try {
      res = await fetch(url, { headers })
    } catch (e) {
      lastError = `${label} failed: ${String(e)}`
      continue // network error → retry
    }

    if (res.ok) return res.text()

    // use the label here too, so site-specific scrapers report their own name
    lastError = `${label} failed: ${res.status}`
    // give up immediately on 'permanent' errors like 404, only retry temporary blocks
    if (!RETRYABLE.has(res.status)) break
  }

  throw new Error(lastError)
}

Three things I paid attention to while designing this.

Point	Reason
Exponential backoff (400ms → 800ms)	retrying immediately gets blocked the same way. A brief gap improves the odds of getting through
Distinguishing retryable codes	only retry 403/429/5xx. A 404 (page doesn't exist) won't change no matter how many times you try — give up immediately
`Sec-Fetch-*`, `Accept-Language`	headers a real browser sends on a normal page navigation. Missing these is a giveaway that you're a bot
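
To make the backoff arithmetic concrete, here is a small sketch that reproduces the delay formula from fetchHtml (`400 * 2 ** (attempt - 1)`) as standalone functions. The names `backoffDelayMs`, `backoffSchedule`, and `worstCaseWaitMs` are mine, introduced only for illustration.

```typescript
// Delay before a given attempt. Attempt 0 is the first try and never waits.
function backoffDelayMs(attempt: number, baseMs = 400): number {
  return attempt <= 0 ? 0 : baseMs * 2 ** (attempt - 1)
}

// The full list of waits for a given number of retries.
function backoffSchedule(retries: number, baseMs = 400): number[] {
  return Array.from({ length: retries }, (_, i) => backoffDelayMs(i + 1, baseMs))
}

// Total time spent sleeping if every attempt fails.
function worstCaseWaitMs(retries: number, baseMs = 400): number {
  return backoffSchedule(retries, baseMs).reduce((sum, ms) => sum + ms, 0)
}

const schedule = backoffSchedule(2)  // [400, 800]
const totalWait = worstCaseWaitMs(2) // 1200
```

With the default of two retries, the helper adds at most 1.2 seconds of sleeping on top of the three request round trips. That total is worth keeping in mind if the scraper runs inside a serverless function with an execution time limit.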

Now each scraper is much simpler.

// generic-url.ts
const html = await fetchHtml(url)

// seek-url.ts (only the site-specific language/label differs)
const html = await fetchHtml(url, { label: 'Seek', acceptLanguage: 'en-AU,en;q=0.9' })
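
One way to check the retry policy without touching the network is to inject the fetch function and the sleep function. The sketch below is a testable variant of the same loop, not the fetchHtml shown earlier: `fetchWithRetry`, `MiniResponse`, `Fetcher`, and `scriptedFetcher` are hypothetical names, and `MiniResponse` stands in for the small part of the Fetch API `Response` the loop actually uses.

```typescript
// The only parts of a response the retry loop needs.
interface MiniResponse {
  ok: boolean
  status: number
  text(): Promise<string>
}
type Fetcher = (url: string) => Promise<MiniResponse>
type Sleeper = (ms: number) => Promise<void>

const RETRYABLE_STATUS = new Set([403, 429, 500, 502, 503, 504])

// Same policy as fetchHtml, with fetch and sleep passed in by the caller.
async function fetchWithRetry(
  url: string,
  fetcher: Fetcher,
  sleep: Sleeper,
  retries = 2,
): Promise<string> {
  let lastError = 'Fetch failed'
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (attempt > 0) await sleep(400 * 2 ** (attempt - 1))

    let res: MiniResponse
    try {
      res = await fetcher(url)
    } catch (e) {
      lastError = `Fetch failed: ${String(e)}`
      continue // network error: try again
    }

    if (res.ok) return res.text()
    lastError = `Fetch failed: ${res.status}`
    if (!RETRYABLE_STATUS.has(res.status)) break // permanent error: give up
  }
  throw new Error(lastError)
}

// Fake fetcher that replays a fixed list of status codes and counts calls.
// Once the list is exhausted it keeps returning the last status.
function scriptedFetcher(statuses: number[]) {
  let calls = 0
  const fetcher: Fetcher = async () => {
    const status = statuses[Math.min(calls, statuses.length - 1)] ?? 500
    calls++
    return {
      ok: status >= 200 && status < 300,
      status,
      text: async () => '<html>ok</html>',
    }
  }
  return { fetcher, callCount: () => calls }
}
```

In a test, passing `async () => {}` as the sleeper keeps the run instant. A script of `[403, 200]` should succeed on the second call, while `[404]` should throw after a single call.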

Step 4. A real bug found while chasing the cause

While chasing down the root cause, I found an even more dangerous bug lurking. The code handling a scraping failure looked like this.

// On failure: overwrite the title and description with the error
catch (e) {
  await supabaseAdmin.from('jobs').update({
    title: 'Scraping failed',
    description: String(e),
  }).eq('id', job.id)
}

See the problem? Suppose a posting was already scraped successfully, a re-scrape gets triggered for whatever reason, and that re-scrape hits an intermittent 403. Its perfectly good title and description are then overwritten with 'Scraping failed' and the error message. Combined with the intermittent nature of bot blocking, this is the worst-case scenario: data that was working fine gets destroyed in an instant.

So I fixed it to "only mark as failed if it never succeeded before."

catch (e) {
  const errMsg = String(e)
  // If this succeeded before (= has a real title), preserve the existing data
  const neverScraped =
    job.title === 'Waiting to scrape...' || job.title === 'Scraping failed'

  if (neverScraped) {
    await supabaseAdmin.from('jobs')
      .update({ title: 'Scraping failed', description: errMsg })
      .eq('id', job.id)
  }
  // If data already exists? → don't overwrite anything, leave it alone
  return NextResponse.json({ error: errMsg }, { status: 500 })
}

The key question was how to know whether a posting had succeeded before. I leveraged the fact that a posting's title starts as 'Waiting to scrape...' when it is created and becomes the real title on success. So if the title is neither 'Waiting to scrape...' nor 'Scraping failed', the posting has been scraped successfully at least once, and its data is left untouched.
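
The sentinel check is easy to pull out into pure functions, which makes the "never overwrite good data" rule unit-testable without a database. This is a sketch under my own assumptions: `JobRow`, `neverScraped`, and `failureUpdateFor` are hypothetical names, and treating a null or empty title as "never scraped" is an extra guess that the original handler does not make.

```typescript
// Placeholder titles a posting carries before its first successful scrape.
const PLACEHOLDER_TITLES = new Set(['Waiting to scrape...', 'Scraping failed'])

// Only the fields this decision needs.
interface JobRow {
  id: string
  title: string | null
}

// True when the posting has no real scraped title yet.
// Assumption: a null or blank title also counts as never scraped.
function neverScraped(job: JobRow): boolean {
  const title = job.title?.trim() ?? ''
  return title === '' || PLACEHOLDER_TITLES.has(title)
}

// Returns the fields to write on failure, or null when existing data
// must be left alone.
function failureUpdateFor(
  job: JobRow,
  errMsg: string,
): { title: string; description: string } | null {
  return neverScraped(job)
    ? { title: 'Scraping failed', description: errMsg }
    : null
}
```

The catch block would then run the update only when `failureUpdateFor` returns a non-null value, and return the 500 response either way.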

This was actually the most important fix in this whole piece of work. A 403 itself can't be 100% prevented — it's the external site's call. But making sure a failure doesn't destroy good data is entirely my code's responsibility.

Troubleshooting: reading common blocking signals

When scraping gets blocked, spotting these signals in the response quickly narrows down the cause.

Signal	Meaning	Response
`set-cookie: ak_bmsc`	Akamai bot blocking	retries + header hardening, headless browser if that's still not enough
`set-cookie: __cf_bm`, `cf-ray` header	Cloudflare bot blocking	same as above
200 locally, 403 on the server	datacenter IP blocked	retries / consider a proxy
Always 404	a genuinely missing page	retrying is pointless, check the URL
429	too many requests (rate limit)	backoff and retry

Summary: separating what you can't prevent from what you can

Here's this work at a glance:

Root cause identification — confirmed Akamai bot blocking via response cookies (ak_bmsc), narrowed it down to "200 locally / 403 on server = datacenter IP blocking"
Recognizing intermittency — not always blocked, but probabilistic → decided on a retry strategy
Retry helper — browser-like headers + exponential backoff, distinguishing retryable codes (403/429/5xx) from give-up codes (404)
Fixed a data-preservation bug — made sure a failure doesn't overwrite a posting that was already successfully scraped

To be honest, this is not a perfect solution. If Akamai decides to aggressively block datacenter IPs, retries and headers alone have their limits. Truly getting through would need headless browser rendering (Playwright) or a proxy, which raises cost and complexity considerably. So this time, I stopped at "absorb most intermittent blocks with retries, and if it still fails, at least protect the existing data."

Engineering is often less about preventing every failure and more about separating what you can't prevent from what you can, and taking full responsibility for the part you can. A block from an external site is outside my control, but that failure destroying my data is entirely within it.