JobRadarPlaywrightSupabaseScraping

[JobRadar Part 2] Supabase Schema + Playwright Scraper — Debugging Notes

April 21, 20261 min read

While preparing for a job hunt in AU/NZ, I developed a daily routine.

  1. Open Indeed → search "React Native developer Sydney"
  2. Open Seek → same search
  3. Scroll past everything I already saw yesterday
  4. Can't remember if I've seen this one before, read it again
  5. Don't apply to anything and close the tabs

Doing this every day had efficiency approaching zero. So I thought:

"Why not automate this?"

The goal was simple: every morning, top 10 jobs matching my skills arrive in my inbox. And if Claude API could auto-generate cover letters too, even better.

This post covers the Supabase schema design, the Playwright scraper implementation, and all the pain points I ran into.


Tech Stack

AreaTech
FrameworkNext.js 14 App Router + TypeScript
DBSupabase (PostgreSQL)
AIClaude API (Anthropic)
ScrapingPlaywright + @sparticuz/chromium
EmailResend
DeployVercel + Vercel Cron

Step 1 — Supabase Schema Design

Table Structure

auth.users (managed by Supabase)
    ↓ 1:1
profiles (user profiles)
    ↓
matches (AI matching results) → jobs (full job listings pool)
    ↓
cover_letters (cover letters)

jobs is a shared pool of listings for all users. matches and cover_letters are separated per user.

profiles Table

CREATE TABLE profiles (
  id                UUID PRIMARY KEY REFERENCES auth.users(id),
  email             TEXT,
  name              TEXT,
  skills            TEXT[],
  desired_positions TEXT[],   -- ['React Native developer', 'Fullstack developer']
  desired_sources   TEXT[] DEFAULT ARRAY['indeed'],
  desired_locations TEXT[] DEFAULT ARRAY['Sydney NSW'],
  career_summary    TEXT,
  story             TEXT,     -- career story for cover letters
  resume_text       TEXT,
  preferences       JSONB
);

The key fields are desired_positions, desired_sources, and desired_locations. The scraper reads these to determine what to search for, where, and on which platforms.

Auto-create Profile on Signup

CREATE OR REPLACE FUNCTION handle_new_user()
RETURNS TRIGGER LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  INSERT INTO profiles (id, email)
  VALUES (NEW.id, NEW.email);
  RETURN NEW;
END;
$$;

CREATE TRIGGER on_auth_user_created
  AFTER INSERT ON auth.users
  FOR EACH ROW EXECUTE FUNCTION handle_new_user();

RLS (Row Level Security)

CREATE POLICY "profiles: own records only" ON profiles
  FOR SELECT USING (auth.uid() = id);

CREATE POLICY "jobs: authenticated users can read all" ON jobs
  FOR SELECT TO authenticated USING (true);

RLS ensures user A can't see user B's matches or cover letters. This is baseline for any multi-user SaaS.

Scrapers Need the Service Role Key

With RLS enabled, unauthenticated requests are blocked. The scraper running on Vercel Cron has no user session, so it needs the service role key.

// src/lib/supabase-admin.ts
export const supabaseAdmin = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,  // bypasses RLS
  { auth: { autoRefreshToken: false, persistSession: false } }
)

Step 2 — Building the Scraper

Original Plan vs Reality

The original plan was to scrape Indeed + Glassdoor.

Reality was different.

Attempt to access Glassdoor → "Just a moment..." → 10 min wait → CAPTCHA again
Apply playwright-extra stealth → still blocked
Try API endpoints directly → 403

Glassdoor was completely locked down by Cloudflare. I dropped it without hesitation and switched to Seek.com.au — Australia and New Zealand's #1 job board, with much more relaxed bot detection.

Indeed Scraper — Panel Click Approach

Indeed had one gotcha too.

// First attempt — blocked
await page.goto(`https://au.indeed.com/viewjob?jk=${jobId}`)
// → "We're sorry, this job is no longer available"

Directly navigating to a job URL via extraction gets blocked. The workaround is the panel click approach: click a job card from the listing page and extract the JD from the right-side panel that opens.

for (const card of jobCards) {
  await card.click()
  await page.waitForTimeout(1500)
  const description = await page.$eval(
    '#jobDescriptionText',
    el => el.textContent?.trim() ?? null
  ).catch(() => null)
}

Seek Scraper — URL Slug Conversion

Seek's URL format:

https://www.seek.com.au/react-native-developer-jobs/in-Sydney-NSW?daterange=7

Converting "React Native developer""react-native-developer" was needed.

function toSeekSlug(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '')
}

Dynamic Targets from Profile

No hardcoding — scrape targets are read directly from user profiles.

async function collectScrapeTargets(): Promise<ScrapeTarget[]> {
  const { data: profiles } = await supabaseAdmin
    .from('profiles')
    .select('desired_positions, desired_locations, desired_sources')
    .not('desired_positions', 'is', null)
    .contains('desired_sources', ['seek'])

  const seen = new Set<string>()
  const targets: ScrapeTarget[] = []

  for (const profile of profiles ?? []) {
    for (const keyword of profile.desired_positions) {
      for (const location of profile.desired_locations) {
        const key = `${keyword}|${location}`
        if (!seen.has(key)) {
          seen.add(key)
          targets.push({ keyword, location })
        }
      }
    }
  }
  return targets
}

Step 3 — Vercel Cron Automation

{
  "crons": [
    {
      "path": "/api/scrape",
      "schedule": "0 22 * * *"
    }
  ]
}

22:00 UTC = 08:00 AEST (Sydney). Jobs pile up before the morning commute.

// src/app/api/scrape/route.ts
export const maxDuration = 300

export async function GET(request: Request) {
  const authHeader = request.headers.get('authorization')
  if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
  }

  const { scrapeIndeed } = await import('@/lib/scrapers/indeed')
  const { scrapeSeek } = await import('@/lib/scrapers/seek')

  // sequential — concurrent execution causes /tmp/chromium ETXTBSY error
  const indeedResult = await scrapeIndeed().catch(e => ({ error: String(e) }))
  const seekResult = await scrapeSeek().catch(e => ({ error: String(e) }))

  return NextResponse.json({ ok: true, indeed: indeedResult, seek: seekResult })
}

Pain Points

1. Glassdoor Complete Block

Symptom: "Just a moment..." infinite loading even with playwright-extra + stealth
Cause: Cloudflare Enterprise-tier bot protection
Fix: Switched to Seek.com.au

2. Indeed viewjob URL Direct Access Blocked

Symptom: "Job no longer available" when accessing viewjob?jk=xxx
Cause: Direct access without going through the listing page is detected
Fix: Panel click approach from the listing page

3. Vercel Lambda — playwright-extra Module Missing

Symptom: Cannot find module 'is-plain-object'
Cause: Next.js File Tracer doesn't include dynamic require dependencies from playwright-extra
Fix: Removed playwright-extra entirely, using playwright-core directly

4. ETXTBSY — Chromium Binary Conflict

Symptom: spawn ETXTBSY
Cause: /tmp/chromium conflict when running Indeed and Seek concurrently
Fix: Changed to sequential execution

5. Vercel US Servers → AU Job Board Results: 0

Symptom: Scraper runs fine, targets: 12 but inserted: 0
Cause: Vercel Hobby plan runs from Washington DC. US IP can produce different results or trigger bot detection on AU job boards
Current status: Infrastructure is complete. Scraping quality improvement is a follow-up task


Troubleshooting Summary

ErrorCauseFix
Just a moment...Cloudflare blockSwitch job boards
viewjob blockedDirect URL detectionPanel click approach
Cannot find modulelazy-cache dynamic requireRemove playwright-extra
spawn ETXTBSYConcurrent Chromium binarySequential execution
inserted: 0US server locationImprovement pending

Summary — The Core Flow

Vercel Cron (daily 08:00 AEST)
    ↓
/api/scrape (Bearer auth)
    ↓
Read Supabase profiles (service role)
    → Generate desired_positions × desired_locations combinations
    ↓
Indeed scraper (panel click approach)
    ↓
Seek scraper
    ↓
jobs table upsert (deduplicated by URL)

The infrastructure is in place. Now it's time to build the real core: the AI matching engine. Continuing in Part 3.

PM

backtodev

A 40-something PM returns to code. Learning, failing, and growing.

[JobRadar Part 2] Supabase Schema + Playwright Scraper — Debugging Notes | backtodev