Job Board Scraper Pipeline — Implementation Plan (v3.2)

Context

Dmitri needs an automated system to find fractional executive job opportunities daily. The scraper will score listings with a hybrid approach (regex fast-pass + Claude deep analysis via claude CLI) and generate a daily markdown report with application briefs for top matches.

The pipeline forks battle-tested patterns from the fcto-pipeline (solanasis-scripts/fcto-pipeline/). All tools and boards have been independently validated through live testing and research; see Appendix A (Tool Evaluation) and Appendix B (Board Scrapability Audit) for full documentation.

Environment: Windows 11 with WSL2 available (Docker-capable). Claude Code subscription (no separate API key). Python 3.12.

Runtime: Pipeline runs from WSL2 to access Crawl4AI natively. Code lives on the Windows filesystem at solanasis-scripts/job-board-scraper/, accessed from WSL2 via /mnt/c/_my/_solanasis/....


Architecture

run_pipeline.py  (orchestrator)
  |
  +--> scrapers.py        Phase 1: Fetch listings (two-pass: index -> detail pages)
  |      trafilatura (static) + Crawl4AI (JS) + Jina (fallback)
  |
  +--> parsers.py         Phase 2: Normalize + dedup into SQLite
  |
  +--> score_regex.py     Phase 3: Fast regex scoring (all jobs, zero cost)
  |      Gate: score >= 15 passes to Claude
  |
  +--> analyze_claude.py  Phase 4: Claude CLI (subscription) deep analysis
  |      ~10-20 jobs/day, included in subscription
  |
  +--> generate_report.py Phase 5: Daily markdown report
         Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md

Storage: SQLite (data/jobs.db) — dedup, querying, lifecycle tracking, atomic writes. Zero deps (stdlib).


Tool Stack (Validated)

Only tools that passed independent evaluation. Full rationale in Appendix A.

| Tool | Role | Install | Status |
|---|---|---|---|
| httpx | HTTP client for static requests | Already installed (v0.28.1) | Proven in fcto-pipeline |
| trafilatura | Text extraction from static HTML | Already installed (v2.0.0) | Best benchmarked extractor (F1=0.945) |
| Crawl4AI | JS-rendered page scraping (primary) | pip install crawl4ai in WSL2 | Self-hosted, no rate limits, no external API. Runs natively in WSL2. |
| Jina Reader | JS fallback + quick page extraction | No install (API via httpx) | Fallback for Crawl4AI failures. Sign up for free key → 200 RPM. |
| Claude CLI | Job description analysis | Already installed (subscription) | claude -p "prompt" --model opus via subprocess. $0 additional cost. Configurable model. |
| sqlite3 | Data storage + dedup + tracking | Python stdlib | Zero install. WAL mode for crash recovery. |
| python-jobspy | Indeed job aggregation | pip install python-jobspy (optional) | Indeed works; LinkedIn fragile; ZipRecruiter broken. |

New installs needed:

  • WSL2: pip install crawl4ai + crawl4ai-setup (installs Playwright browser — works natively on Linux/WSL2)
  • WSL2 (optional): pip install python-jobspy
  • Windows: nothing new

Rejected tools (with rationale):

  • curl-cffi — Solves TLS fingerprinting anti-bot. Our niche boards have zero anti-bot. httpx with a User-Agent header is sufficient.
  • Anthropic Python SDK — Would require a separate API key and per-token billing. The claude CLI is already authenticated via subscription and costs nothing additional.
  • Playwright (standalone) — Crawl4AI wraps Playwright and adds content extraction + markdown output. Use Crawl4AI instead of raw Playwright.

Target Boards (Validated — Live Tested)

Every board below was fetched and audited on 2026-03-23. Full results in Appendix B.

Tier 1: Confirmed Scrapable (build these)

| # | Board | Method | Inventory | Key Finding |
|---|---|---|---|---|
| 1 | fractionaljobs.io | Sitemap + trafilatura | 700+ jobs | Webflow, no bot protection, sitemap has all URLs. Best source. |
| 2 | findfractionaljobs.com | WP REST API (JSON) | ~6 jobs | GET /wp-json/wp/v2/job-listings/ returns structured JSON with salary data. Tiny but zero-effort. |
| 3 | allfractionaljobs.com | Sitemap + trafilatura | 94 jobs | Jobboardly platform, open robots.txt, sitemap with all URLs. 9 free, 1 paywalled per page. |
| 4 | useshiny.com | HTML parsing | ~10+ jobs | WordPress, server-rendered, “Show More” button suggests AJAX endpoint (jm-ajax). |
| 5 | Indeed (via JobSpy) | python-jobspy | Large | Indeed scraper currently works with no rate limiting. Fragile long-term. |

Tier 2: Conditional / Blocked (defer)

| Board | Issue | Action |
|---|---|---|
| gofractional.com | Returns 403 on ALL automated requests. 251 curated listings behind aggressive bot protection. | Defer. Would need Playwright + stealth + proxy. High effort for uncertain return. |
| LinkedIn (via JobSpy) | Rate-limits after ~10 pages. Requires proxies. | Include if Indeed works, skip if it doesn’t. |
| chiefjobs.com | SSL certificate broken (chain missing intermediate cert). Browsers handle it; HTTP clients reject it. | Recheck in 3-5 days. May resolve (Namecheap had SSL issues 2026-03-22). |

Dropped (not job boards)

| Board | Reason |
|---|---|
| hirefractionaltalent.com | Not a job board. It’s a consultant showcase / lead-gen site on HubSpot. Zero job listings. |
| gigx.com | Executive profile directory, not a job board. Companies browse exec profiles. Search is JS-rendered and robots.txt blocks /search. |
| ZipRecruiter (via JobSpy) | Returns 403 errors. Broken in JobSpy since Sep 2025. |

Configurability

All tunable settings live in config.py — no code changes needed to adjust behavior. The executing session should ensure these are all constants at the top of the file, not buried in function logic.

| Setting | Default | Purpose |
|---|---|---|
| CLAUDE_MODEL | "opus" | Model for job analysis. Set to "sonnet" for faster/cheaper fallback. |
| CLAUDE_TIMEOUT | 60 | Seconds per Claude CLI call. Opus needs more time than Sonnet. |
| REGEX_GATE_THRESHOLD | 15 | Minimum regex score to pass to Claude. Lower = more Claude calls. |
| TIERS | {"A": 75, "B": 50, "C": 25} | Claude ai_fit_score thresholds for tiering. |
| SCRAPE_CONCURRENCY | 3 | Max concurrent board scrapes. |
| SCRAPE_DELAY | 1.5 | Seconds between requests to same board. |
| SCRAPE_TIMEOUT | 20 | HTTP request timeout in seconds. |
| JINA_RPM_LIMIT | 20 | Jina Reader rate limit (200 with free API key). |
| JOBSPY_SEARCH_TERMS | [list] | Search queries for Indeed via JobSpy. |
| JOBSPY_RESULTS_PER_TERM | 25 | Max results per search term from JobSpy. |
| BOARDS | {dict} | Board registry: enable/disable boards, change methods/priorities. |
| DMITRI_PROFILE | string | Profile text fed to Claude for fit analysis. Update as services evolve. |

All settings are also overridable via CLI flags where it makes sense (--model sonnet, --boards fractionaljobs, --limit 5, etc.).
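
As an illustration of how the two score thresholds above might be applied, here is a minimal sketch. The constant values come from the settings table; the helper names assign_tier and passes_regex_gate, and the "D" label for scores below the lowest threshold, are assumptions made for this example.

```python
# Values copied from the settings table; the helpers below are hypothetical.
TIERS = {"A": 75, "B": 50, "C": 25}
REGEX_GATE_THRESHOLD = 15


def assign_tier(ai_fit_score: int, tiers: dict[str, int] = TIERS) -> str:
    """Return the highest tier whose threshold the score meets."""
    ranked = sorted(tiers.items(), key=lambda item: item[1], reverse=True)
    for name, minimum in ranked:
        if ai_fit_score >= minimum:
            return name
    return "D"  # assumption: below the lowest threshold means "untiered"


def passes_regex_gate(regex_score: int, threshold: int = REGEX_GATE_THRESHOLD) -> bool:
    """Phase 3 gate: only jobs at or above the threshold go on to Claude."""
    return regex_score >= threshold
```

Keeping both helpers parameterised on the config values means a threshold change stays a one-line edit in config.py.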


Files to Create (9 files in solanasis-scripts/job-board-scraper/)

1. config.py (~200 lines)

Fork from: fcto-pipeline/config.py

  • Path setup: PIPELINE_DIR, DATA_DIR, DB_FILE, RAW_CACHE_DIR, DAILY_OUTREACH_DIR
  • BOARDS registry: 5 Tier-1 boards with url, method, priority, sitemap_url (where applicable)
    • fractionaljobs.io: sitemap_url: "https://www.fractionaljobs.io/sitemap.xml", method: "sitemap_trafilatura"
    • findfractionaljobs.com: method: "wp_rest_api", api_url: "https://findfractionaljobs.com/wp-json/wp/v2/job-listings/"
    • allfractionaljobs.com: sitemap_url, method: "sitemap_trafilatura"
    • useshiny.com: method: "html_parse"
    • indeed: method: "jobspy", site_name: "indeed"
  • Keyword regex lists (from continuation prompt lines 344-431)
  • SCORING dict, REGEX_GATE_THRESHOLD = 15
  • TIERS = {"A": 75, "B": 50, "C": 25} (Claude’s 0-100 scale)
  • Scraping constants: SCRAPE_CONCURRENCY=3, SCRAPE_DELAY=1.5, SCRAPE_TIMEOUT=20
  • JINA_READER_BASE = "https://r.jina.ai/", JINA_RPM_LIMIT = 20
  • JOBSPY_SEARCH_TERMS, JOBSPY_RESULTS_PER_TERM = 25
  • Compensation parsing regexes (hourly, monthly, annual shorthand)
  • DMITRI_PROFILE condensed string (~300 tokens for Claude prompts)
  • CLAUDE_MODEL = "opus" (default to Opus for best quality; configurable to "sonnet" as fallback)
  • CLAUDE_TIMEOUT = 60 (configurable timeout per call; Opus may be slower than Sonnet)

2. db.py (~120 lines)

SQLite database layer (stdlib sqlite3)

  • init_db() — create tables + indexes, enable WAL mode + busy_timeout=5000
  • Tables: jobs (main), scraper_health (monitoring)
  • CRUD:
    • upsert_job()
    • get_new_jobs(since_date): rows where date_first_seen >= since_date
    • get_unanalyzed_jobs(): rows where regex_pass = 1 AND ai_fit_score IS NULL
    • mark_applied(job_id), mark_skipped(job_id)
    • get_scraper_health(board, days=7)
    • get_stats()

3. scrapers.py (~400 lines)

Fork from: fcto-pipeline/enrich_websites.py

  • Three scraping tiers:
    1. trafilatura — static HTML pages (fast, reliable)
    2. Crawl4AI — JS-rendered pages (self-hosted in WSL2, no rate limits)
    3. Jina Reader — fallback if Crawl4AI fails on a specific page
  • Cache: per-board-per-day JSON files in data/raw/ (reuse cache_key/load_cache/save_cache from fcto-pipeline)
  • Sitemap-based scrapers (fractionaljobs.io, allfractionaljobs.com):
    • Pass 1: Fetch sitemap XML via httpx, extract job URLs
    • Pass 2: Fetch each job detail page via trafilatura (with delay)
  • WP REST API scraper (findfractionaljobs.com):
    • Single httpx API call returns structured JSON — no HTML parsing needed
  • Crawl4AI scraper (useshiny.com for AJAX content):
    • Use AsyncWebCrawler to render JS, get clean markdown output
    • Falls back to Jina Reader (scrape_with_jina()) if Crawl4AI fails
    • (Tier 2 boards like gofractional.com can reuse this code path later if unblocked)
  • HTML parse scraper (useshiny.com static fallback):
    • Fetch listing page, extract job URLs from HTML
    • Fetch each detail page via trafilatura
    • (chiefjobs.com can reuse this pattern when SSL resolves)
  • JobSpy scraper (Indeed):
    • try-import python-jobspy, search with JOBSPY_SEARCH_TERMS
    • DataFrame → list[dict]
  • Async orchestrator: scrape_all_boards(force, boards_filter) with semaphore
  • Health tracking: record result counts per board in SQLite
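
The semaphore-based orchestrator might look something like the sketch below. It uses a simplified, hypothetical signature (a dict of board name to URLs plus an injected `fetch` coroutine) rather than the planned scrape_all_boards(force, boards_filter), so that the concurrency and delay logic can be shown without any HTTP dependency.

```python
import asyncio

SCRAPE_CONCURRENCY = 3   # max boards scraped at once (from config)
SCRAPE_DELAY = 1.5       # seconds between requests to the same board (from config)


async def scrape_board(name, urls, fetch, delay=SCRAPE_DELAY):
    """Fetch one board's URLs sequentially, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:
            await asyncio.sleep(delay)
        results.append(await fetch(url))
    return results


async def scrape_all_boards(boards, fetch, concurrency=SCRAPE_CONCURRENCY,
                            delay=SCRAPE_DELAY):
    """Scrape boards concurrently, capped by a semaphore.

    Returns {board: results}; a failed board maps to the exception it raised
    so that one broken scraper does not abort the whole run.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(name, urls):
        async with semaphore:
            try:
                return name, await scrape_board(name, urls, fetch, delay)
            except Exception as exc:
                return name, exc

    pairs = await asyncio.gather(*(guarded(n, u) for n, u in boards.items()))
    return dict(pairs)
```

Storing the exception per board gives the health-tracking step something concrete to write into the scraper_health.error column.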

4. parsers.py (~180 lines)

  • Unified schema: job_id, title, company, location, is_remote, url, source_board, description, compensation_raw, compensation_hourly, date_posted, date_scraped
  • Per-board normalizers (one function each)
  • make_job_id(url, title, company) — MD5 of canonical URL (primary) or normalized title+company (cross-board dedup). Not fuzzy — normalized exact match (lowercase, strip whitespace, remove Inc/LLC).
  • parse_compensation(text) — regex extraction, normalize to hourly rate

5. score_regex.py (~180 lines)

Fork from: fcto-pipeline/score_prospects.py

  • check_keywords(text, patterns) — regex matching with human-readable output
  • score_job(job) → (score, breakdown_string) — all signals from continuation prompt
  • Gate: score >= 15 → regex_pass = True
  • score_all_jobs() — batch score, update SQLite, print tier distribution
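
A minimal sketch of the keyword check and scoring functions follows. The three patterns and their weights are placeholders invented for this example; the real signals are the ones listed in the continuation prompt.

```python
import re

REGEX_GATE_THRESHOLD = 15

# Hypothetical patterns and weights, for illustration only.
KEYWORDS = {
    "fractional": (r"\bfractional\b", 10),
    "exec_title": (r"\b(?:cto|cio|ciso|coo)\b", 10),
    "part_time": (r"\bpart[- ]time\b", 5),
}


def check_keywords(text: str, patterns: dict) -> list[str]:
    """Return the names of all patterns that match, in registry order."""
    return [
        name
        for name, (pattern, _weight) in patterns.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]


def score_job(job: dict) -> tuple[int, str]:
    """Return (score, human-readable breakdown) for one job dict."""
    text = f"{job.get('title', '')} {job.get('description', '')}"
    hits = check_keywords(text, KEYWORDS)
    score = sum(KEYWORDS[name][1] for name in hits)
    breakdown = ", ".join(f"{name}(+{KEYWORDS[name][1]})" for name in hits)
    return score, breakdown
```

For example, a job titled "Fractional CTO" with "Part-time role" in the description matches all three placeholder patterns and clears the gate of 15.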

6. analyze_claude.py (~250 lines)

Uses claude CLI via subprocess — authenticated through Claude subscription, no API key needed.

  • call_claude(prompt, model=None) — wrapper function:
    • Model defaults to config.CLAUDE_MODEL (Opus) but accepts override per call
    • Runs claude -p "<prompt>" --model {model} --output-format json via subprocess.run()
    • Sets encoding='utf-8' to avoid Windows cp1252 issues
    • Returns parsed JSON response
    • Timeout: configurable via config.CLAUDE_TIMEOUT (default 60s for Opus)
  • analyze_job(job) — single Claude call per job:
    • Prompt includes: DMITRI_PROFILE + job description + structured output instructions
    • Response: JSON with ai_fit_score (0-100), ai_tier, ai_reasoning, requirements_match, red_flags, referral_opportunity, estimated_seniority, engagement_type
  • generate_brief(job, analysis) — second call for A-tier only:
    • Why Dmitri is a fit, talking points, concerns to address, application approach
  • analyze_all_jobs() — reads unanalyzed from SQLite, calls Claude, updates DB
  • Results cached in SQLite (no re-analyzing same job on re-run)
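
The wrapper and the retry-once behaviour could be sketched as below. The command line is the one given above; the injectable `runner` parameter, the helper names extract_analysis and analyze_with_retry, and the assumption that the CLI's JSON envelope carries the model's text in a "result" field are illustrative and should be checked against real CLI output.

```python
import json
import subprocess

CLAUDE_MODEL = "opus"
CLAUDE_TIMEOUT = 60


def call_claude(prompt, model=None, timeout=CLAUDE_TIMEOUT, runner=subprocess.run):
    """Run the claude CLI once and return its parsed JSON envelope."""
    cmd = ["claude", "-p", prompt, "--model", model or CLAUDE_MODEL,
           "--output-format", "json"]
    proc = runner(cmd, capture_output=True, text=True,
                  encoding="utf-8", timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"claude exited {proc.returncode}: {proc.stderr[:200]}")
    return json.loads(proc.stdout)


def extract_analysis(envelope: dict) -> dict:
    """Pull the JSON object out of the model's text (assumed "result" field)."""
    text = envelope.get("result", "")
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in Claude response")
    return json.loads(text[start:end + 1])


def analyze_with_retry(prompt, attempts=2, **kwargs):
    """Try up to `attempts` times; return None so the caller can fall back to regex."""
    for _ in range(attempts):
        try:
            return extract_analysis(call_claude(prompt, **kwargs))
        except (ValueError, RuntimeError, subprocess.TimeoutExpired):
            continue   # json.JSONDecodeError is a ValueError subclass
    return None
```

Passing the prompt as a list element avoids shell quoting problems with job descriptions that contain quotes or newlines.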

Cost: $0 additional — covered by existing Claude subscription. No API key, no metered billing.

Fallback: If claude CLI is unavailable (e.g., scheduled task without Claude Code running), the pipeline works in --skip-claude mode using regex-only scoring.

7. generate_report.py (~280 lines)

Fork from: fcto-pipeline/generate_daily_outreach.py

  • Reads from SQLite: today’s analyzed jobs by tier
  • Markdown report:
    • Header: date, stats (new listings, A/B tier counts, total tracked)
    • A-Tier: full job cards with application brief, Claude reasoning, score breakdown
    • B-Tier: summary cards with reasoning
    • Referral opportunities (if any)
    • Board health warnings (if any scraper anomalously low)
    • Summary stats table (per-board)
    • Quick actions (CLI commands)
  • CLI: --mark-applied id, --mark-skipped id, --status, --tier A
  • Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md

8. run_pipeline.py (~100 lines)

Orchestrator:

  1. Init DB + create dirs
  2. scrapers.scrape_all_boards()
  3. parsers.normalize_and_dedup()
  4. score_regex.score_all_jobs()
  5. analyze_claude.analyze_all_jobs()
  6. generate_report.generate_daily_report()

CLI flags:

  • --force — re-scrape even if cached
  • --boards <names> — comma-separated board names (default: all enabled)
  • --limit N — max jobs per board (for testing)
  • --skip-claude — regex-only mode, no Claude analysis
  • --model <name> — override CLAUDE_MODEL (e.g., --model sonnet)
  • --dry-run — show what would be scraped, no HTTP requests
  • --recheck-scores — re-score all jobs against current regex/config without re-scraping
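  • Not a flag, but also handled here: set the Windows event loop policy for async code (only relevant if the script is run natively on Windows rather than in WSL2)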
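
The flags listed above map naturally onto argparse. This is a sketch only; the build_parser name and the choice to parse --boards into a list are assumptions.

```python
import argparse


def _board_list(value: str) -> list[str]:
    """Split a comma-separated --boards value into clean names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--force", action="store_true",
                        help="re-scrape even if cached")
    parser.add_argument("--boards", type=_board_list, default=None,
                        help="comma-separated board names (default: all enabled)")
    parser.add_argument("--limit", type=int, default=None,
                        help="max jobs per board (for testing)")
    parser.add_argument("--skip-claude", action="store_true",
                        help="regex-only mode, no Claude analysis")
    parser.add_argument("--model", default=None,
                        help="override CLAUDE_MODEL, e.g. sonnet")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be scraped, no HTTP requests")
    parser.add_argument("--recheck-scores", action="store_true",
                        help="re-score all jobs without re-scraping")
    return parser
```

Leaving --model and --boards as None by default lets the orchestrator fall back to the values in config.py when a flag is not given.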

9. requirements.txt

# Core
trafilatura>=2.0.0
httpx>=0.27.0
crawl4ai>=0.8.0       # JS-rendered page scraping (run in WSL2)

# Optional: Indeed scraping (fragile, not essential)
# python-jobspy>=1.1.0

Note: Claude analysis uses the claude CLI (subscription-authenticated, no SDK needed). SQLite is stdlib. Jina Reader is an API called via httpx (no package). Install all deps in WSL2’s Python environment.

Directory structure:

solanasis-scripts/job-board-scraper/
  config.py
  db.py
  scrapers.py
  parsers.py
  score_regex.py
  analyze_claude.py
  generate_report.py
  run_pipeline.py
  requirements.txt
  data/
    jobs.db        # SQLite (all data + tracker)
    raw/           # Per-board-per-day scrape cache (JSON)

SQLite Schema

-- Enable WAL mode and set busy timeout on connection
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
 
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    company TEXT,
    location TEXT,
    url TEXT,
    source_board TEXT NOT NULL,
    description TEXT,
    compensation_raw TEXT,
    compensation_hourly REAL,
    date_posted TEXT,
    date_first_seen TEXT NOT NULL,
    date_last_seen TEXT NOT NULL,
    is_remote INTEGER DEFAULT 0,
 
    -- Regex scoring
    regex_score INTEGER,
    regex_breakdown TEXT,
    regex_pass INTEGER DEFAULT 0,
 
    -- Claude analysis (NULL until analyzed)
    ai_fit_score INTEGER,
    ai_tier TEXT,
    ai_reasoning TEXT,
    requirements_match TEXT,    -- JSON
    red_flags TEXT,             -- JSON
    referral_opportunity INTEGER DEFAULT 0,
    referral_notes TEXT,
    application_brief TEXT,
 
    -- Pipeline state
    pipeline_status TEXT DEFAULT 'new',
    applied_date TEXT,
    notes TEXT,
    created_at TEXT DEFAULT (datetime('now')),
    updated_at TEXT DEFAULT (datetime('now'))
);
 
CREATE TABLE IF NOT EXISTS scraper_health (
    board TEXT NOT NULL,
    date TEXT NOT NULL,
    jobs_found INTEGER DEFAULT 0,
    jobs_new INTEGER DEFAULT 0,
    error TEXT,
    duration_seconds REAL,
    PRIMARY KEY (board, date)
);
 
CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs(pipeline_status);
CREATE INDEX IF NOT EXISTS idx_jobs_tier ON jobs(ai_tier);
CREATE INDEX IF NOT EXISTS idx_jobs_source ON jobs(source_board);
CREATE INDEX IF NOT EXISTS idx_jobs_first_seen ON jobs(date_first_seen);

Build Order

Step 0: Environment Setup (WSL2 + Jina)

  1. Set up Python environment in WSL2 (if not already): wsl -- pip install trafilatura httpx crawl4ai
  2. Run wsl -- crawl4ai-setup (installs Playwright Chromium in WSL2 — works natively on Linux)
  3. Dmitri: Sign up for free Jina API key at https://jina.ai/ (30 seconds, no credit card). Provide key to save in .env.
  4. Save JINA_API_KEY=<key> to solanasis-scripts/.env
  5. Optionally install python-jobspy for Indeed: wsl -- pip install python-jobspy

Step 1: Foundation + fractionaljobs.io (largest source, 700+ jobs)

  1. Create directory structure
  2. Write requirements.txt
  3. Write config.py
  4. Write db.py
  5. Write scrapers.py — fractionaljobs.io only (sitemap → detail pages via trafilatura)
  6. Write parsers.py — unified schema, normalize_fractionaljobs, make_job_id, parse_compensation
  7. Write score_regex.py
  8. Write run_pipeline.py (--skip-claude mode)
  9. Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs --skip-claude
  10. Spot-check 5 scraped listings against actual site

Step 2: Add remaining Tier-1 boards

  1. findfractionaljobs.com (WP REST API — easiest)
  2. allfractionaljobs.com (sitemap — same pattern as fractionaljobs.io)
  3. useshiny.com (Crawl4AI for AJAX + HTML fallback)
  4. Indeed via JobSpy (conditional, try-import)
  5. Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --skip-claude

Step 3: Claude analysis

  1. Write analyze_claude.py
  2. Test on 5-10 real listings — validate structured output
  3. Tune regex gate threshold based on real data
  4. Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs

Step 4: Daily report + scheduling

  1. Write generate_report.py
  2. Test: Full end-to-end pipeline
  3. Verify solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md output
  4. Test: --mark-applied, --mark-skipped, --status
  5. Set up Windows Task Scheduler (daily 7:00 AM)

Verification

  1. python run_pipeline.py --boards fractionaljobs --skip-claude — smoke test (run via WSL2)
  2. Spot-check 5 scraped listings against actual website
  3. Review regex scoring: do pass/reject decisions make sense?
  4. python run_pipeline.py --boards fractionaljobs — with Claude
  5. Review Claude output: are ai_fit_scores reasonable? Red flags accurate?
  6. Full pipeline: python run_pipeline.py
  7. Open solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md — verify format
  8. python run_pipeline.py --recheck-scores — re-score without re-scraping
  9. python generate_report.py --mark-applied <id> — tracker test
  10. python generate_report.py --status — stats test
  11. Re-run next day — verify dedup, cache, only new jobs appear

Risks & Mitigations

| Risk | Mitigation |
|---|---|
| trafilatura strips job listing tables on index pages | Use sitemap URLs to go direct to detail pages (bypass index). For useshiny.com, use raw HTML parsing for index. |
| Crawl4AI setup fails in WSL2 | Fall back to Jina Reader (free key, 200 RPM). Jina handles JS rendering server-side. |
| GoFractional 403 blocks even Crawl4AI | Board is deferred. May require residential proxy; evaluate if inventory (251 listings) justifies the effort. |
| JobSpy Indeed scraper breaks | Indeed is supplemental. Primary value is niche boards. Disable with --boards flag. |
| Board HTML changes | Scraper health monitoring in SQLite. Auto-warn in daily report if a board returns <50% of 7-day average. |
| Claude structured output malformed | Validate JSON, retry once, fall back to regex-only tier on failure. |
| Claude CLI unavailable (e.g., scheduled task) | --skip-claude flag runs regex-only mode. Pipeline works without Claude. |
| Two-pass scraping is slow | Cache detail pages per-job-id. Only scrape new/unseen URLs. fractionaljobs.io sitemap lets us check for new URLs before fetching. |
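
The board-health mitigation in the table above (warn when a board returns under 50% of its 7-day average) could be computed along these lines. The function name and the input shapes (a dict of recent jobs_found counts per board, and a dict of today's counts) are assumptions; in the pipeline the history would come from the scraper_health table.

```python
def health_warnings(history: dict[str, list[int]],
                    today_counts: dict[str, int],
                    ratio: float = 0.5) -> list[str]:
    """Return one warning line per board whose count fell below ratio x average."""
    warnings = []
    for board, counts in history.items():
        if not counts:
            continue                      # no baseline yet, nothing to compare
        average = sum(counts) / len(counts)
        found = today_counts.get(board, 0)
        if average > 0 and found < ratio * average:
            warnings.append(
                f"{board}: {found} jobs today vs 7-day average {average:.1f}"
            )
    return warnings
```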

Key Files to Read Before Implementation

FileWhat to Fork
fcto-pipeline/config.pyPath setup (L9-16), keyword regex structure (L18-84), scoring dict (L87-109), tiers (L112-117)
fcto-pipeline/enrich_websites.pyCache helpers (L101-124), scrape_with_trafilatura (L159-172), scrape_with_jina (L175-188), check_keywords (L195-206), async orchestrator (L249-309), Windows policy (L535-537)
fcto-pipeline/score_prospects.pyscore_prospect (L36-122), assign_tier (L125-134)
fcto-pipeline/generate_daily_outreach.pyTracker load/save (L38-50), mark operations (L53-93), CLI (L660-693), daily brief (L436-579)

Appendix A: Tool Evaluation (2026-03-23)

Approved Tools

httpx (v0.28.1) — YES

  • Pure Python async HTTP client. Already installed and proven in fcto-pipeline.
  • Sufficient for niche job boards with no anti-bot protection.
  • No compiled dependencies, no platform issues.

trafilatura (v2.0.0) — YES

  • Best benchmarked content extractor (F1=0.945, beating readability-lxml, newspaper3k, goose3).
  • Already installed. Pure Python, Windows-tested.
  • Gotcha: Optimized for “article” content — may aggressively strip job listing index pages (thinks card layout is navigation). Mitigated by scraping detail pages directly via sitemap URLs.
  • Use include_tables=True to prevent table stripping.

Jina Reader (r.jina.ai) — YES (fallback for Crawl4AI)

  • Zero-install JS rendering: prepend https://r.jina.ai/ to any URL, get clean markdown via HTTP GET.
  • Free tier: 20 RPM without key, 200 RPM with free key (signup takes 30 seconds).
  • Role: Fallback when Crawl4AI fails or for quick one-off page fetches.
  • Tested: Returns clean content for most JS-rendered pages. Fails gracefully (error or empty, no hangs).
  • Dmitri to sign up for free key → store as JINA_API_KEY in .env.

Claude CLI (subscription) — YES

  • Uses claude -p "<prompt>" --model {model} --output-format json via subprocess.
  • Already installed and authenticated through Dmitri’s Claude subscription. Zero additional cost.
  • Default model: Opus (best quality). Configurable to Sonnet as fallback (faster, lower quality).
  • All model/timeout settings live in config.py — no code changes needed to switch models.
  • Windows encoding gotcha: set encoding='utf-8' in subprocess.run() to avoid cp1252 issues.
  • Node.js process spawn overhead (~1-2s per call) is acceptable for 10-20 calls/day.
  • Fallback: pipeline has --skip-claude mode for regex-only scoring when CLI unavailable.

SQLite (stdlib sqlite3) — YES

  • Zero install. Python stdlib. Rock-solid on Windows.
  • Enable WAL mode (PRAGMA journal_mode=WAL) for concurrent readers and crash recovery.
  • Set PRAGMA busy_timeout=5000 to handle accidental concurrent pipeline runs.
  • Single jobs.db file beats JSON sprawl for dedup, querying, and lifecycle tracking.

Crawl4AI (v0.8.0) — YES (via WSL2)

  • Windows-native installation is documented as broken: GitHub issues #38, #949, #1705.
  • Runs natively in WSL2 — Linux is a first-class platform. pip install crawl4ai && crawl4ai-setup works cleanly.
  • Self-hosted JS rendering: no rate limits, no external API dependency, no cost.
  • Outputs clean Markdown (ideal for feeding to Claude for analysis).
  • “Adaptive Intelligence” feature learns selectors over time — useful for daily scraping of the same boards.
  • Primary tool for JS-rendered pages (useshiny.com AJAX content, GoFractional if unblocked).

python-jobspy (v1.1.82) — CONDITIONAL

  • Include as optional (try-import). Use for Indeed only.
  • Indeed scraper currently works with no rate limiting (per maintainer and recent issues).
  • LinkedIn: rate-limits after ~10 pages, requires proxies. Not worth the friction.
  • ZipRecruiter: broken (403/429 since Sep 2025). Google Jobs: returns 0 results. Glassdoor: 403.
  • Install: pip install python-jobspy — pure Python, no compiled deps.
  • If Indeed breaks in the future, disable and rely on niche boards.

Rejected Tools

curl-cffi (v0.14.0) — NO

  • TLS fingerprint impersonation for anti-bot bypass. Niche job boards don’t have TLS fingerprinting.
  • Adds a compiled C dependency (cffi + libcurl) with platform-specific wheels.
  • Solving a problem we don’t have. httpx with a User-Agent header is sufficient.
  • If a specific board later requires it, swap in for that one scraper only.

Anthropic Python SDK — NO

  • Would require a separate API key and per-token billing.
  • The claude CLI is already authenticated via Dmitri’s subscription. Zero additional cost.
  • SDK stays on the bench unless we need programmatic access without the CLI.

Playwright (standalone) — NO (use Crawl4AI instead)

  • Crawl4AI wraps Playwright and adds content extraction + markdown output.
  • No reason to use raw Playwright when Crawl4AI provides a better developer experience.
  • Playwright’s Chromium is installed automatically by crawl4ai-setup in WSL2.

Appendix B: Board Scrapability Audit (2026-03-23)

All boards were fetched and tested live on 2026-03-23 via WebFetch.

fractionaljobs.io — EXCELLENT

  • Platform: Webflow (server-rendered HTML + JS enhancements)
  • Bot protection: None. No robots.txt restrictions (only sitemap reference).
  • Inventory: 700+ job URLs in sitemap. ~40 visible on homepage with “View 24 more” link.
  • Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch each detail page via trafilatura. Classes: .job-item, .jobs-collection-list, .job-item_link-to-job.
  • Login required: No.

findfractionaljobs.com — BEST (has API)

  • Platform: WordPress 6.9.4 with WP Job Manager + Workscout theme.
  • Bot protection: None. Open robots.txt (Disallow: empty).
  • Inventory: ~6 listings via API (small board).
  • Scraping strategy: GET /wp-json/wp/v2/job-listings/?per_page=100 returns structured JSON with title, content, excerpt, link, _company_name, _job_location, _remote_position, _salary_min, _salary_max, _rate_min, _rate_max, job-categories, job-types. Zero HTML parsing needed.
  • Login required: No for browsing.
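
A normalizer for one item of this API response might look like the sketch below. Two things are assumptions: that text fields arrive as WordPress-style {"rendered": ...} objects, and that the underscore-prefixed fields may sit either at the top level or under a "meta" key (the helper checks both). The "rate:" / "salary:" labels in compensation_raw are also just one possible convention.

```python
import html
import re


def _field(item: dict, key: str):
    """Look for key at the top level, then under "meta" (location is an assumption)."""
    if key in item:
        return item[key]
    meta = item.get("meta")
    return meta.get(key) if isinstance(meta, dict) else None


def _text(value) -> str:
    """Flatten a {"rendered": html} field (or plain string) to clean text."""
    raw = value.get("rendered", "") if isinstance(value, dict) else (value or "")
    no_tags = re.sub(r"<[^>]+>", " ", str(raw))
    return " ".join(html.unescape(no_tags).split())


def _range(item: dict, low_key: str, high_key: str) -> str:
    values = [_field(item, low_key), _field(item, high_key)]
    return "-".join(str(v) for v in values if v not in (None, ""))


def normalize_findfractionaljobs(item: dict) -> dict:
    """Map one WP REST job-listing item onto the unified schema (subset)."""
    rate = _range(item, "_rate_min", "_rate_max")
    salary = _range(item, "_salary_min", "_salary_max")
    compensation = f"rate: {rate}" if rate else (f"salary: {salary}" if salary else "")
    remote = str(_field(item, "_remote_position") or "").lower()
    return {
        "title": _text(item.get("title")),
        "company": _field(item, "_company_name") or "",
        "location": _field(item, "_job_location") or "",
        "is_remote": remote in ("1", "true", "yes"),
        "url": item.get("link", ""),
        "source_board": "findfractionaljobs",
        "description": _text(item.get("content")),
        "compensation_raw": compensation,
    }
```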

allfractionaljobs.com — GOOD

  • Platform: Jobboardly (SaaS, appears Rails-based).
  • Bot protection: None. Open robots.txt (Allow: /).
  • Inventory: 94 job URLs in sitemap. 9 free per page, 1 behind $5/mo paywall.
  • Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch detail pages. Listings in <li> elements with company logo, title, type, location, compensation, hours.
  • Login required: No for free listings.

useshiny.com — MODERATE

  • Platform: WordPress + WooCommerce.
  • Bot protection: None. Open robots.txt (Yoast SEO).
  • Inventory: ~10 visible, “Show More Jobs” button suggests more via AJAX (jm-ajax endpoint pattern visible in source).
  • Scraping strategy: Fetch /job-postings page, parse <a> elements with <h4> titles and badges (type, location, compensation). May need to discover AJAX endpoint for full listing. No REST API exposed for jobs (unlike findfractionaljobs.com).
  • Login required: No for browsing.

gofractional.com — BLOCKED

  • Platform: Next.js (Vercel).
  • Bot protection: Aggressive. Returns 403 Forbidden on every automated request (homepage, /jobs, /job/*, sitemap).
  • Inventory: 251 jobs per Google index. Zero accessible via automated fetch.
  • robots.txt: Blocks /tag/, /skill/, /booking. Allows /jobs and /job/ — but the 403 blocks everything regardless.
  • Action: Deferred. Would require Playwright + stealth plugin + proxy rotation. High effort, ethically questionable given they’re clearly preventing automated access.

gigx.com — DROPPED

  • Platform: Drupal.
  • Issue: Executive profile directory, not a job board. Companies browse exec profiles. Zero job listings.
  • robots.txt: Blocks /search (the main way to find execs). TLS cert issues on bare domain (gigx.com fails; www.gigx.com works).
  • Search results: JS-rendered (spinner only without browser execution).
  • Action: Dropped. Wrong data model.

hirefractionaltalent.com — DROPPED

  • Platform: HubSpot CMS.
  • Issue: Not a job board. Consultant showcase / lead-gen site. Displays fractional exec profiles with “Free Consultation” CTA. Zero job listings.
  • Action: Dropped.

chiefjobs.com — TEMPORARILY UNAVAILABLE

  • Issue: SSL certificate chain broken (missing intermediate cert). All HTTP clients reject it. Browsers may work via cached intermediates.
  • Possible cause: Namecheap had SSL issuance delays on 2026-03-22.
  • Action: Recheck in 3-5 days. If SSL resolves, evaluate for C-suite job listings.

Indeed (via JobSpy) — WORKS

  • Indeed scraper in JobSpy v1.1.82 works with no rate limiting reported.
  • Anti-bot (DataDome) is handled by JobSpy internally.
  • Fragile long-term — depends on maintainer keeping up with Indeed’s changes.

LinkedIn (via JobSpy) — FRAGILE

  • Rate-limits after ~10 pages per IP. Requires proxies for meaningful volume.
  • Anti-bot escalation: canvas/WebGL/audio fingerprinting, ASN classification.
  • Not worth the friction for fractional job search volume.

ZipRecruiter (via JobSpy) — BROKEN

  • Returns 403/429 errors. Broken in JobSpy since September 2025 (issue #302, unresolved).
  • Dropped.

Open Items for Dmitri

  1. Jina API key (Step 0) — Sign up at https://jina.ai/ (free, 30 seconds, no credit card). Gets 200 RPM vs 20 RPM. I’ll save the key to .env once you provide it. (I can’t create web accounts on your behalf.)
  2. allfractionaljobs.com subscription — $5/month unlocks 86 paywalled listings (only 9 free). Worth evaluating after we see the quality of free listings.