Job Board Scraper Pipeline — Implementation Plan (v3.2)

Context

Dmitri needs an automated system to find fractional executive job opportunities daily. The scraper will score listings with a hybrid approach (regex fast-pass + Claude deep analysis via the claude CLI) and generate a daily markdown report with application briefs for top matches.

The pipeline forks battle-tested patterns from fcto-pipeline (solanasis-scripts/fcto-pipeline/). All tools and boards have been independently validated through live testing and research; see Appendix A (Tool Evaluation) and Appendix B (Board Scrapability Audit) for full documentation.

Environment: Windows 11 with WSL2 available (Docker-capable). Claude Code subscription (no separate API key). Python 3.12.

Runtime: Pipeline runs from WSL2 to access Crawl4AI natively. Code lives on the Windows filesystem at solanasis-scripts/job-board-scraper/, accessed from WSL2 via /mnt/c/_my/_solanasis/....

Architecture

run_pipeline.py  (orchestrator)
  |
  +--> scrapers.py        Phase 1: Fetch listings (two-pass: index -> detail pages)
  |      trafilatura (static) + Crawl4AI (JS) + Jina (fallback)
  |
  +--> parsers.py         Phase 2: Normalize + dedup into SQLite
  |
  +--> score_regex.py     Phase 3: Fast regex scoring (all jobs, zero cost)
  |      Gate: score >= 15 passes to Claude
  |
  +--> analyze_claude.py  Phase 4: Claude CLI (subscription) deep analysis
  |      ~10-20 jobs/day, included in subscription
  |
  +--> generate_report.py Phase 5: Daily markdown report
         Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md

Storage: SQLite (data/jobs.db) — dedup, querying, lifecycle tracking, atomic writes. Zero deps (stdlib).

Tool Stack (Validated)

Only tools that passed independent evaluation. Full rationale in Appendix A.

Tool	Role	Install	Status
httpx	HTTP client for static requests	Already installed (v0.28.1)	Proven in fcto-pipeline
trafilatura	Text extraction from static HTML	Already installed (v2.0.0)	Best benchmarked extractor (F1=0.945)
Crawl4AI	JS-rendered page scraping (primary)	`pip install crawl4ai` in WSL2	Self-hosted, no rate limits, no external API. Runs natively in WSL2.
Jina Reader	JS fallback + quick page extraction	No install (API via httpx)	Fallback for Crawl4AI failures. Sign up for free key → 200 RPM.
Claude CLI	Job description analysis	Already installed (subscription)	`claude -p "prompt" --model opus` via subprocess. $0 additional cost. Configurable model.
sqlite3	Data storage + dedup + tracking	Python stdlib	Zero install. WAL mode for crash recovery.
python-jobspy	Indeed job aggregation	`pip install python-jobspy` (optional)	Indeed works; LinkedIn fragile; ZipRecruiter broken.

New installs needed:

WSL2: pip install crawl4ai + crawl4ai-setup (installs Playwright browser — works natively on Linux/WSL2)
WSL2 (optional): pip install python-jobspy
Windows: nothing new

Rejected tools (with rationale):

curl-cffi — Solves TLS fingerprinting anti-bot. Our niche boards have zero anti-bot. httpx with a User-Agent header is sufficient.
Anthropic Python SDK — Would require a separate API key and per-token billing. The claude CLI is already authenticated via subscription and costs nothing additional.
Playwright (standalone) — Crawl4AI wraps Playwright and adds content extraction + markdown output. Use Crawl4AI instead of raw Playwright.

Target Boards (Validated — Live Tested)

Every board below was fetched and audited on 2026-03-23. Full results in Appendix B.

Tier 1: Confirmed Scrapable (build these)

#	Board	Method	Inventory	Key Finding
1	fractionaljobs.io	Sitemap + trafilatura	700+ jobs	Webflow, no bot protection, sitemap has all URLs. Best source.
2	findfractionaljobs.com	WP REST API (JSON)	~6 jobs	`GET /wp-json/wp/v2/job-listings/` returns structured JSON with salary data. Tiny but zero-effort.
3	allfractionaljobs.com	Sitemap + trafilatura	94 jobs	Jobboardly platform, open robots.txt, sitemap with all URLs. 9 free, 1 paywalled per page.
4	useshiny.com	HTML parsing	~10+ jobs	WordPress, server-rendered, “Show More” button suggests AJAX endpoint (`jm-ajax`).
5	Indeed (via JobSpy)	python-jobspy	Large	Indeed scraper currently works with no rate limiting. Fragile long-term.

Tier 2: Conditional / Blocked (defer)

Board	Issue	Action
gofractional.com	Returns 403 on ALL automated requests. 251 curated listings behind aggressive bot protection.	Defer. Would need Playwright + stealth + proxy. High effort for uncertain return.
LinkedIn (via JobSpy)	Rate-limits after ~10 pages. Requires proxies.	Include only if the JobSpy Indeed scraper works reliably; otherwise skip.
chiefjobs.com	SSL certificate broken (chain missing intermediate cert). Browsers handle it; HTTP clients reject it.	Recheck in 3-5 days. May resolve (Namecheap had SSL issues 2026-03-22).

Dropped (not job boards, or broken)

Board	Reason
hirefractionaltalent.com	Not a job board. It’s a consultant showcase / lead-gen site on HubSpot. Zero job listings.
gigx.com	Executive profile directory, not a job board. Companies browse exec profiles. Search is JS-rendered and robots.txt blocks `/search`.
ZipRecruiter (via JobSpy)	Returns 403/429 errors. Broken in JobSpy since Sep 2025.

Configurability

All tunable settings live in config.py — no code changes needed to adjust behavior. The executing session should ensure these are all constants at the top of the file, not buried in function logic.

Setting	Default	Purpose
`CLAUDE_MODEL`	`"opus"`	Model for job analysis. Set to `"sonnet"` for faster/cheaper fallback.
`CLAUDE_TIMEOUT`	`60`	Seconds per Claude CLI call. Opus needs more time than Sonnet.
`REGEX_GATE_THRESHOLD`	`15`	Minimum regex score to pass to Claude. Lower = more Claude calls.
`TIERS`	`{"A": 75, "B": 50, "C": 25}`	Claude ai_fit_score thresholds for tiering.
`SCRAPE_CONCURRENCY`	`3`	Max concurrent board scrapes.
`SCRAPE_DELAY`	`1.5`	Seconds between requests to same board.
`SCRAPE_TIMEOUT`	`20`	HTTP request timeout in seconds.
`JINA_RPM_LIMIT`	`20`	Jina Reader rate limit (200 with free API key).
`JOBSPY_SEARCH_TERMS`	`[list]`	Search queries for Indeed via JobSpy.
`JOBSPY_RESULTS_PER_TERM`	`25`	Max results per search term from JobSpy.
`BOARDS`	`{dict}`	Board registry — enable/disable boards, change methods/priorities.
`DMITRI_PROFILE`	`string`	Profile text fed to Claude for fit analysis. Update as services evolve.

All settings are also overridable via CLI flags where it makes sense (--model sonnet, --boards fractionaljobs, --limit 5, etc.).

Files to Create (9 files in `solanasis-scripts/job-board-scraper/`)

1. `config.py` (~200 lines)

Fork from: fcto-pipeline/config.py

Path setup: PIPELINE_DIR, DATA_DIR, DB_FILE, RAW_CACHE_DIR, DAILY_OUTREACH_DIR
BOARDS registry: 5 Tier-1 boards with url, method, priority, sitemap_url (where applicable)
- fractionaljobs.io: sitemap_url: "https://www.fractionaljobs.io/sitemap.xml", method: "sitemap_trafilatura"
- findfractionaljobs.com: method: "wp_rest_api", api_url: "https://findfractionaljobs.com/wp-json/wp/v2/job-listings/"
- allfractionaljobs.com: sitemap_url, method: "sitemap_trafilatura"
- useshiny.com: method: "html_parse"
- indeed: method: "jobspy", site_name: "indeed"
Keyword regex lists (from continuation prompt lines 344-431)
SCORING dict, REGEX_GATE_THRESHOLD = 15
TIERS = {"A": 75, "B": 50, "C": 25} (Claude’s 0-100 scale)
Scraping constants: SCRAPE_CONCURRENCY=3, SCRAPE_DELAY=1.5, SCRAPE_TIMEOUT=20
JINA_READER_BASE = "https://r.jina.ai/", JINA_RPM_LIMIT = 20
JOBSPY_SEARCH_TERMS, JOBSPY_RESULTS_PER_TERM = 25
Compensation parsing regexes (hourly, monthly, annual shorthand)
DMITRI_PROFILE condensed string (~300 tokens for Claude prompts)
CLAUDE_MODEL = "opus" — default to Opus for best quality; configurable to “sonnet” as fallback
CLAUDE_TIMEOUT = 60 — configurable timeout per call (Opus may be slower than Sonnet)
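As a rough illustration of the layout described above, the top of config.py might look like the sketch below. The `enabled` key and the helpers `enabled_boards()` and `assign_tier()` are assumptions made for this example (the plan only says boards can be enabled or disabled and that TIERS holds thresholds); only three of the five boards are shown, and Indeed is disabled purely to demonstrate filtering.

```python
# config.py sketch: tunable constants first, small helpers below.
CLAUDE_MODEL = "opus"
CLAUDE_TIMEOUT = 60
REGEX_GATE_THRESHOLD = 15
TIERS = {"A": 75, "B": 50, "C": 25}  # thresholds on Claude's 0-100 scale

BOARDS = {
    "fractionaljobs": {
        "method": "sitemap_trafilatura",
        "sitemap_url": "https://www.fractionaljobs.io/sitemap.xml",
        "enabled": True,  # "enabled" is an assumed key, not named in the plan
    },
    "findfractionaljobs": {
        "method": "wp_rest_api",
        "api_url": "https://findfractionaljobs.com/wp-json/wp/v2/job-listings/",
        "enabled": True,
    },
    "indeed": {"method": "jobspy", "site_name": "indeed", "enabled": False},
}


def enabled_boards(boards_filter=None):
    """Return enabled board names, optionally restricted to a --boards filter."""
    names = [name for name, cfg in BOARDS.items() if cfg.get("enabled", True)]
    if boards_filter:
        wanted = set(boards_filter)
        names = [name for name in names if name in wanted]
    return names


def assign_tier(ai_fit_score):
    """Map a 0-100 score to the highest tier whose threshold it meets."""
    for tier, threshold in sorted(TIERS.items(), key=lambda kv: kv[1], reverse=True):
        if ai_fit_score >= threshold:
            return tier
    return None  # below the C threshold
```

Keeping the thresholds in one dict means changing a tier boundary is a one-line edit, which matches the goal of adjusting behavior without touching function logic.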

2. `db.py` (~120 lines)

SQLite database layer (stdlib sqlite3)

init_db() — create tables + indexes, enable WAL mode + busy_timeout=5000
Tables: jobs (main), scraper_health (monitoring)
CRUD:
- upsert_job()
- get_new_jobs(since_date): rows where date_first_seen >= since_date
- get_unanalyzed_jobs(): rows where regex_pass = 1 AND ai_fit_score IS NULL
- mark_applied(job_id), mark_skipped(job_id)
- get_scraper_health(board, days=7)
- get_stats()
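The database layer could be sketched roughly as follows. This is a trimmed illustration, not the full schema: the table keeps only the columns needed to show how an upsert can preserve date_first_seen while refreshing date_last_seen. The `today` parameter is an assumption added to make the behavior easy to test.

```python
import sqlite3
from datetime import date


def init_db(path):
    """Open the database, apply the pragmas, and create a trimmed jobs table."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               source_board TEXT NOT NULL,
               date_first_seen TEXT NOT NULL,
               date_last_seen TEXT NOT NULL,
               regex_pass INTEGER DEFAULT 0,
               ai_fit_score INTEGER
           )"""
    )
    conn.commit()
    return conn


def upsert_job(conn, job, today=None):
    """Insert a new job, or only refresh date_last_seen if it already exists."""
    today = today or date.today().isoformat()
    conn.execute(
        """INSERT INTO jobs (job_id, title, source_board, date_first_seen, date_last_seen)
           VALUES (:job_id, :title, :source_board, :today, :today)
           ON CONFLICT(job_id) DO UPDATE SET date_last_seen = excluded.date_last_seen""",
        {**job, "today": today},
    )
    conn.commit()


def get_unanalyzed_jobs(conn):
    """Jobs that passed the regex gate but have no Claude score yet."""
    return conn.execute(
        "SELECT * FROM jobs WHERE regex_pass = 1 AND ai_fit_score IS NULL"
    ).fetchall()
```

Because the conflict clause touches only date_last_seen, re-running the pipeline on the same day or the next day does not reset first-seen dates or wipe earlier scores.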

3. `scrapers.py` (~400 lines)

Fork from: fcto-pipeline/enrich_websites.py

Three scraping tiers:
1. trafilatura — static HTML pages (fast, reliable)
2. Crawl4AI — JS-rendered pages (self-hosted in WSL2, no rate limits)
3. Jina Reader — fallback if Crawl4AI fails on a specific page
Cache: per-board-per-day JSON files in data/raw/ (reuse cache_key/load_cache/save_cache from fcto-pipeline)
Sitemap-based scrapers (fractionaljobs.io, allfractionaljobs.com):
- Pass 1: Fetch sitemap XML via httpx, extract job URLs
- Pass 2: Fetch each job detail page via trafilatura (with delay)
WP REST API scraper (findfractionaljobs.com):
- Single httpx API call returns structured JSON — no HTML parsing needed
Crawl4AI scraper (useshiny.com for AJAX content):
- Use AsyncWebCrawler to render JS, get clean markdown output
- Falls back to Jina Reader (scrape_with_jina()) if Crawl4AI fails
- (Tier 2 boards like gofractional.com can reuse this code path later if unblocked)
HTML parse scraper (useshiny.com static fallback):
- Fetch listing page, extract job URLs from HTML
- Fetch each detail page via trafilatura
- (chiefjobs.com can reuse this pattern when SSL resolves)
JobSpy scraper (Indeed):
- try-import python-jobspy, search with JOBSPY_SEARCH_TERMS
- DataFrame → list[dict]
Async orchestrator: scrape_all_boards(force, boards_filter) with semaphore
Health tracking: record result counts per board in SQLite
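The semaphore-based orchestration might look something like the sketch below. It is deliberately simplified: the plan's signature is scrape_all_boards(force, boards_filter), whereas this version takes a hypothetical `board_urls` mapping and an injected `fetch` coroutine so the concurrency logic can be shown without any network access.

```python
import asyncio

SCRAPE_CONCURRENCY = 3  # max concurrent board scrapes (from config.py)
SCRAPE_DELAY = 1.5      # seconds between requests to the same board


async def scrape_board(name, urls, fetch, delay=SCRAPE_DELAY):
    """Fetch one board's URLs sequentially, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no delay before the first request
            await asyncio.sleep(delay)
        results.append(await fetch(url))
    return name, results


async def scrape_all_boards(board_urls, fetch,
                            concurrency=SCRAPE_CONCURRENCY, delay=SCRAPE_DELAY):
    """Run boards concurrently, capped by a semaphore.

    A failing board returns its exception instead of cancelling the others,
    so the error can be recorded in scraper_health.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(name, urls):
        async with semaphore:
            try:
                return await scrape_board(name, urls, fetch, delay)
            except Exception as exc:
                return name, exc

    outcomes = await asyncio.gather(*(guarded(n, u) for n, u in board_urls.items()))
    return dict(outcomes)
```

Requests within a board stay sequential so SCRAPE_DELAY is respected per board, while the semaphore limits how many boards run at once.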

4. `parsers.py` (~180 lines)

Unified schema: job_id, title, company, location, is_remote, url, source_board, description, compensation_raw, compensation_hourly, date_posted, date_scraped
Per-board normalizers (one function each)
make_job_id(url, title, company) — MD5 of canonical URL (primary) or normalized title+company (cross-board dedup). Not fuzzy — normalized exact match (lowercase, strip whitespace, remove Inc/LLC).
parse_compensation(text) — regex extraction, normalize to hourly rate
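A minimal sketch of the two helpers follows. Several details are assumptions rather than decisions recorded in the plan: the canonical URL form (lowercase host, no query string, no trailing slash), the conversion factors of 160 hours per month and 2080 hours per year, and the three compensation patterns, which stand in for the regexes that config.py will define.

```python
import hashlib
import re
from urllib.parse import urlsplit

_SUFFIXES = re.compile(r"\b(inc|llc)\b\.?", re.IGNORECASE)
HOURS_PER_MONTH = 160  # assumed conversion factor
HOURS_PER_YEAR = 2080  # assumed conversion factor

_MONEY = r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(k)?"
_PERIODS = [  # (pattern, hours covered by one payment); illustrative only
    (re.compile(_MONEY + r"\s*(?:/|per\s+)\s*(?:hr|hour)", re.I), 1),
    (re.compile(_MONEY + r"\s*(?:/|per\s+)\s*(?:mo|month)", re.I), HOURS_PER_MONTH),
    (re.compile(_MONEY + r"\s*(?:/|per\s+)\s*(?:yr|year)", re.I), HOURS_PER_YEAR),
]


def _normalize(text):
    """Lowercase, drop Inc/LLC and punctuation, collapse whitespace."""
    text = _SUFFIXES.sub(" ", text or "")
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def make_job_id(url=None, title="", company=""):
    """MD5 of the canonical URL when present, else of normalized title+company."""
    if url:
        parts = urlsplit(url.strip())
        key = f"{parts.netloc.lower()}{parts.path.rstrip('/')}"
    else:
        key = f"{_normalize(title)}|{_normalize(company)}"
    return hashlib.md5(key.encode("utf-8"), usedforsecurity=False).hexdigest()


def parse_compensation(text):
    """Return an hourly rate from the first matching pattern, or None."""
    for pattern, hours in _PERIODS:
        match = pattern.search(text or "")
        if match:
            amount = float(match.group(1).replace(",", ""))
            if match.group(2):  # "k" shorthand, e.g. $120k
                amount *= 1000
            return round(amount / hours, 2)
    return None
```

For example, under these assumptions `parse_compensation("$8,000 per month")` yields 50.0, and two URLs that differ only by a tracking query string produce the same job_id.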

5. `score_regex.py` (~180 lines)

Fork from: fcto-pipeline/score_prospects.py

check_keywords(text, patterns) — regex matching with human-readable output
score_job(job) → (score, breakdown_string) — all signals from continuation prompt
Gate: score >= 15 → regex_pass = True
score_all_jobs() — batch score, update SQLite, print tier distribution
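The scoring flow could be sketched along these lines. The signal names, patterns, and weights in `SIGNALS` are hypothetical placeholders (the real lists come from the continuation prompt), and `passes_gate()` is a helper introduced only for this example.

```python
import re

REGEX_GATE_THRESHOLD = 15

# Hypothetical signals and weights, for illustration only.
SIGNALS = {
    "fractional": (re.compile(r"\bfractional\b", re.I), 15),
    "exec_title": (re.compile(r"\b(?:cto|cio|ciso|coo)\b", re.I), 10),
    "full_time_only": (re.compile(r"\bfull[- ]time only\b", re.I), -20),
}


def check_keywords(text, patterns):
    """Return the names of the patterns that match, for a readable breakdown."""
    return [name for name, (pattern, _) in patterns.items() if pattern.search(text)]


def score_job(job):
    """Sum the weights of matched signals; return (score, breakdown_string)."""
    text = f"{job.get('title', '')} {job.get('description', '')}"
    hits = check_keywords(text, SIGNALS)
    score = sum(SIGNALS[name][1] for name in hits)
    breakdown = ", ".join(f"{name} ({SIGNALS[name][1]:+d})" for name in hits)
    return score, breakdown


def passes_gate(score, threshold=REGEX_GATE_THRESHOLD):
    """True when the job should be handed to Claude."""
    return score >= threshold
```

Storing the breakdown string alongside the score makes it easy to review why a listing passed or failed when tuning the gate threshold.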

6. `analyze_claude.py` (~250 lines)

Uses claude CLI via subprocess — authenticated through Claude subscription, no API key needed.

call_claude(prompt, model=None) — wrapper function:
- Model defaults to config.CLAUDE_MODEL (Opus) but accepts override per call
- Runs claude -p "<prompt>" --model {model} --output-format json via subprocess.run()
- Sets encoding='utf-8' to avoid Windows cp1252 issues
- Returns parsed JSON response
- Timeout: configurable via config.CLAUDE_TIMEOUT (default 60s for Opus)
analyze_job(job) — single Claude call per job:
- Prompt includes: DMITRI_PROFILE + job description + structured output instructions
- Response: JSON with ai_fit_score (0-100), ai_tier, ai_reasoning, requirements_match, red_flags, referral_opportunity, estimated_seniority, engagement_type
generate_brief(job, analysis) — second call for A-tier only:
- Why Dmitri is a fit, talking points, concerns to address, application approach
analyze_all_jobs() — reads unanalyzed from SQLite, calls Claude, updates DB
Results cached in SQLite (no re-analyzing same job on re-run)
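A hedged sketch of the wrapper is shown below. The CLI flags are the ones named in this plan; the shape of the output (a JSON envelope whose "result" field holds the model's text) is an assumption that should be checked against real CLI output. The `runner` and `retries` parameters are additions for testability and for the retry-once behavior listed under Risks.

```python
import json
import subprocess

CLAUDE_MODEL = "opus"
CLAUDE_TIMEOUT = 60


def run_claude_cli(prompt, model, timeout):
    """Invoke the claude CLI with the flags named in the plan; return raw stdout."""
    proc = subprocess.run(
        ["claude", "-p", prompt, "--model", model, "--output-format", "json"],
        capture_output=True,
        text=True,
        encoding="utf-8",  # avoid cp1252 decoding on Windows
        timeout=timeout,
        check=True,
    )
    return proc.stdout


def extract_json(raw):
    """Parse CLI output into a dict.

    Assumption: if the output is an envelope with a string "result" field,
    that field holds the model's JSON answer, possibly inside a code fence.
    """
    data = json.loads(raw)
    if isinstance(data, dict) and isinstance(data.get("result"), str):
        text = data["result"].strip()
        if text.startswith("```"):
            text = text.strip("`").removeprefix("json").strip()
        data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def call_claude(prompt, model=None, runner=run_claude_cli, retries=1):
    """Call Claude, retrying on failure; None signals 'fall back to regex-only'."""
    for _ in range(retries + 1):
        try:
            return extract_json(runner(prompt, model or CLAUDE_MODEL, CLAUDE_TIMEOUT))
        except (ValueError, subprocess.SubprocessError, OSError):
            continue
    return None
```

Returning None rather than raising lets analyze_all_jobs() leave ai_fit_score as NULL, so the job is simply picked up again on the next run.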

Cost: $0 additional — covered by existing Claude subscription. No API key, no metered billing.

Fallback: If claude CLI is unavailable (e.g., scheduled task without Claude Code running), the pipeline works in --skip-claude mode using regex-only scoring.

7. `generate_report.py` (~280 lines)

Fork from: fcto-pipeline/generate_daily_outreach.py

Reads from SQLite: today’s analyzed jobs by tier
Markdown report:
- Header: date, stats (new listings, A/B tier counts, total tracked)
- A-Tier: full job cards with application brief, Claude reasoning, score breakdown
- B-Tier: summary cards with reasoning
- Referral opportunities (if any)
- Board health warnings (if any scraper anomalously low)
- Summary stats table (per-board)
- Quick actions (CLI commands)
CLI: --mark-applied id, --mark-skipped id, --status, --tier A
Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md
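One possible shape for the report builder is sketched here. The heading text and card layout are illustrative guesses, since the real format will be forked from generate_daily_outreach.py; only the output filename pattern and the A-tier/B-tier split come from the plan.

```python
from datetime import date
from pathlib import Path


def report_path(outreach_dir, day=None):
    """Build the YYYY-MM-DD-jobs.md path inside the daily-outreach directory."""
    day = day or date.today()
    return Path(outreach_dir) / f"{day.isoformat()}-jobs.md"


def render_report(day, jobs):
    """Render a header plus A-tier and B-tier sections as Markdown text."""
    counts = {t: sum(1 for j in jobs if j.get("ai_tier") == t) for t in ("A", "B")}
    lines = [
        f"# Job Report: {day.isoformat()}",
        "",
        f"New listings: {len(jobs)} | A-tier: {counts['A']} | B-tier: {counts['B']}",
        "",
    ]
    for tier in ("A", "B"):
        tier_jobs = sorted(
            (j for j in jobs if j.get("ai_tier") == tier),
            key=lambda j: j.get("ai_fit_score", 0),
            reverse=True,
        )
        if not tier_jobs:
            continue
        lines += [f"## {tier}-Tier", ""]
        for job in tier_jobs:
            lines.append(f"### {job['title']} ({job.get('company') or 'Unknown'})")
            lines.append(f"- Score: {job.get('ai_fit_score')} | Board: {job.get('source_board')}")
            lines.append(f"- Reasoning: {job.get('ai_reasoning', '')}")
            if tier == "A" and job.get("application_brief"):
                lines.append(f"- Brief: {job['application_brief']}")
            lines.append("")
    return "\n".join(lines)
```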

8. `run_pipeline.py` (~100 lines)

Orchestrator:

Init DB + create dirs
scrapers.scrape_all_boards()
parsers.normalize_and_dedup()
score_regex.score_all_jobs()
analyze_claude.analyze_all_jobs()
generate_report.generate_daily_report()

CLI flags:

--force — re-scrape even if cached
--boards <names> — comma-separated board names (default: all enabled)
--limit N — max jobs per board (for testing)
--skip-claude — regex-only mode, no Claude analysis
--model <name> — override CLAUDE_MODEL (e.g., --model sonnet)
--dry-run — show what would be scraped, no HTTP requests
--recheck-scores — re-score all jobs against current regex/config without re-scraping
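Note: run_pipeline.py also sets the Windows event loop policy for async code. This only matters if the pipeline is ever run natively on Windows instead of WSL2.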
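The flags listed above could be declared with argparse roughly as follows. The `build_parser()` name and the help strings are illustrative; the flag names themselves come from the list.

```python
import argparse


def build_parser():
    """Declare the pipeline's CLI flags; unset values fall back to config.py."""
    parser = argparse.ArgumentParser(description="Job board scraper pipeline")
    parser.add_argument("--force", action="store_true", help="re-scrape even if cached")
    parser.add_argument(
        "--boards",
        type=lambda s: [b.strip() for b in s.split(",") if b.strip()],
        default=None,
        help="comma-separated board names (default: all enabled)",
    )
    parser.add_argument("--limit", type=int, default=None, help="max jobs per board")
    parser.add_argument("--skip-claude", action="store_true", help="regex-only mode")
    parser.add_argument("--model", default=None, help="override CLAUDE_MODEL")
    parser.add_argument("--dry-run", action="store_true", help="list targets, no HTTP requests")
    parser.add_argument(
        "--recheck-scores", action="store_true", help="re-score without re-scraping"
    )
    return parser
```

Leaving `--model` and `--boards` as None by default lets the orchestrator distinguish "not specified" from an explicit override and fall back to the config.py values.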

9. `requirements.txt`

# Core
trafilatura>=2.0.0
httpx>=0.27.0
crawl4ai>=0.8.0       # JS-rendered page scraping (run in WSL2)

# Optional: Indeed scraping (fragile, not essential)
# python-jobspy>=1.1.0

Note: Claude analysis uses the claude CLI (subscription-authenticated, no SDK needed). SQLite is stdlib. Jina Reader is an API called via httpx (no package). Install all deps in WSL2’s Python environment.

Directory structure:

solanasis-scripts/job-board-scraper/
  config.py
  db.py
  scrapers.py
  parsers.py
  score_regex.py
  analyze_claude.py
  generate_report.py
  run_pipeline.py
  requirements.txt
  data/
    jobs.db        # SQLite (all data + tracker)
    raw/           # Per-board-per-day scrape cache (JSON)

SQLite Schema

-- Enable WAL mode and set busy timeout on connection
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
 
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    company TEXT,
    location TEXT,
    url TEXT,
    source_board TEXT NOT NULL,
    description TEXT,
    compensation_raw TEXT,
    compensation_hourly REAL,
    date_posted TEXT,
    date_first_seen TEXT NOT NULL,
    date_last_seen TEXT NOT NULL,
    is_remote INTEGER DEFAULT 0,
 
    -- Regex scoring
    regex_score INTEGER,
    regex_breakdown TEXT,
    regex_pass INTEGER DEFAULT 0,
 
    -- Claude analysis (NULL until analyzed)
    ai_fit_score INTEGER,
    ai_tier TEXT,
    ai_reasoning TEXT,
    requirements_match TEXT,    -- JSON
    red_flags TEXT,             -- JSON
    referral_opportunity INTEGER DEFAULT 0,
    referral_notes TEXT,
    application_brief TEXT,
 
    -- Pipeline state
    pipeline_status TEXT DEFAULT 'new',
    applied_date TEXT,
    notes TEXT,
    created_at TEXT DEFAULT (datetime('now')),
    updated_at TEXT DEFAULT (datetime('now'))
);
 
CREATE TABLE IF NOT EXISTS scraper_health (
    board TEXT NOT NULL,
    date TEXT NOT NULL,
    jobs_found INTEGER DEFAULT 0,
    jobs_new INTEGER DEFAULT 0,
    error TEXT,
    duration_seconds REAL,
    PRIMARY KEY (board, date)
);
 
CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs(pipeline_status);
CREATE INDEX IF NOT EXISTS idx_jobs_tier ON jobs(ai_tier);
CREATE INDEX IF NOT EXISTS idx_jobs_source ON jobs(source_board);
CREATE INDEX IF NOT EXISTS idx_jobs_first_seen ON jobs(date_first_seen);

Build Order

Step 0: Environment Setup (WSL2 + Jina)

Set up the Python environment in WSL2 (if not already done): wsl -- pip install trafilatura httpx crawl4ai
Run wsl -- crawl4ai-setup (installs Playwright Chromium in WSL2 — works natively on Linux)
Dmitri: Sign up for free Jina API key at https://jina.ai/ (30 seconds, no credit card). Provide key to save in .env.
Save JINA_API_KEY=<key> to solanasis-scripts/.env
Optionally install python-jobspy for Indeed: wsl -- pip install python-jobspy

Step 1: Foundation + fractionaljobs.io (largest source, 700+ jobs)

Create directory structure
Write requirements.txt
Write config.py
Write db.py
Write scrapers.py — fractionaljobs.io only (sitemap → detail pages via trafilatura)
Write parsers.py — unified schema, normalize_fractionaljobs, make_job_id, parse_compensation
Write score_regex.py
Write run_pipeline.py (--skip-claude mode)
Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs --skip-claude
Spot-check 5 scraped listings against actual site

Step 2: Add remaining Tier-1 boards

findfractionaljobs.com (WP REST API — easiest)
allfractionaljobs.com (sitemap — same pattern as fractionaljobs.io)
useshiny.com (Crawl4AI for AJAX + HTML fallback)
Indeed via JobSpy (conditional, try-import)
Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --skip-claude

Step 3: Claude analysis

Write analyze_claude.py
Test on 5-10 real listings — validate structured output
Tune regex gate threshold based on real data
Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs

Step 4: Daily report + scheduling

Write generate_report.py
Test: Full end-to-end pipeline
Verify solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md output
Test: --mark-applied, --mark-skipped, --status
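Set up Windows Task Scheduler (daily 7:00 AM) to launch the pipeline through wsl, using the same wsl -- python command as the tests above, since the pipeline runs in WSL2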

Verification

python run_pipeline.py --boards fractionaljobs --skip-claude — smoke test (run via WSL2)
Spot-check 5 scraped listings against actual website
Review regex scoring: do pass/reject decisions make sense?
python run_pipeline.py --boards fractionaljobs — with Claude
Review Claude output: are ai_fit_scores reasonable? Red flags accurate?
Full pipeline: python run_pipeline.py
Open solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md — verify format
python run_pipeline.py --recheck-scores — re-score without re-scraping
python generate_report.py --mark-applied <id> — tracker test
python generate_report.py --status — stats test
Re-run next day — verify dedup, cache, only new jobs appear

Risks & Mitigations

Risk	Mitigation
trafilatura strips job listing tables on index pages	Use sitemap URLs to go direct to detail pages (bypass index). For useshiny.com, use raw HTML parsing for index.
Crawl4AI setup fails in WSL2	Fall back to Jina Reader (free key, 200 RPM). Jina handles JS rendering server-side.
GoFractional 403 blocks even Crawl4AI	Board is deferred. May require residential proxy — evaluate if inventory (251 listings) justifies the effort.
JobSpy Indeed scraper breaks	Indeed is supplemental. Primary value is niche boards. Disable with `--boards` flag.
Board HTML changes	Scraper health monitoring in SQLite. Auto-warn in daily report if a board returns <50% of 7-day average.
Claude structured output malformed	Validate JSON, retry once, fall back to regex-only tier on failure.
Claude CLI unavailable (e.g., scheduled task)	`--skip-claude` flag runs regex-only mode. Pipeline works without Claude.
Two-pass scraping is slow	Cache detail pages per-job-id. Only scrape new/unseen URLs. fractionaljobs.io sitemap lets us check for new URLs before fetching.
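The board-health rule in the table above (warn when a board returns less than 50% of its 7-day average) might be implemented along these lines. The function names and the shape of the `history` mapping are assumptions; in the real pipeline the counts would come from the scraper_health table.

```python
def is_anomalously_low(today_count, recent_counts, ratio=0.5):
    """True if today's count is below `ratio` of the recent daily average."""
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    average = sum(recent_counts) / len(recent_counts)
    return today_count < ratio * average


def health_warnings(today, history):
    """Build warning lines for the daily report.

    today:   {board: jobs_found today}
    history: {board: [jobs_found for each of the previous days]}
    """
    warnings = []
    for board, count in today.items():
        recent = history.get(board, [])
        if is_anomalously_low(count, recent):
            average = sum(recent) / len(recent)
            warnings.append(f"{board}: {count} jobs today vs 7-day average {average:.1f}")
    return warnings
```

Skipping boards with no history avoids false alarms on a board's first day in the registry.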

Key Files to Read Before Implementation

File	What to Fork
`fcto-pipeline/config.py`	Path setup (L9-16), keyword regex structure (L18-84), scoring dict (L87-109), tiers (L112-117)
`fcto-pipeline/enrich_websites.py`	Cache helpers (L101-124), scrape_with_trafilatura (L159-172), scrape_with_jina (L175-188), check_keywords (L195-206), async orchestrator (L249-309), Windows policy (L535-537)
`fcto-pipeline/score_prospects.py`	score_prospect (L36-122), assign_tier (L125-134)
`fcto-pipeline/generate_daily_outreach.py`	Tracker load/save (L38-50), mark operations (L53-93), CLI (L660-693), daily brief (L436-579)

Appendix A: Tool Evaluation (2026-03-23)

Approved Tools

httpx (v0.28.1) — YES

Pure Python async HTTP client. Already installed and proven in fcto-pipeline.
Sufficient for niche job boards with no anti-bot protection.
No compiled dependencies, no platform issues.

trafilatura (v2.0.0) — YES

Best benchmarked content extractor (F1=0.945, beating readability-lxml, newspaper3k, goose3).
Already installed. Pure Python, Windows-tested.
Gotcha: Optimized for “article” content — may aggressively strip job listing index pages (thinks card layout is navigation). Mitigated by scraping detail pages directly via sitemap URLs.
Use include_tables=True to prevent table stripping.

Jina Reader (r.jina.ai) — YES (fallback for Crawl4AI)

Zero-install JS rendering: prepend https://r.jina.ai/ to any URL, get clean markdown via HTTP GET.
Free tier: 20 RPM without key, 200 RPM with free key (signup takes 30 seconds).
Role: Fallback when Crawl4AI fails or for quick one-off page fetches.
Tested: Returns clean content for most JS-rendered pages. Fails gracefully (error or empty, no hangs).
Dmitri to sign up for free key → store as JINA_API_KEY in .env.
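The points above can be sketched as a request builder plus a simple throttle. Sending the key as an `Authorization: Bearer` header is my understanding of Jina's documentation and should be verified before use; `jina_request()` and `Throttle` are hypothetical names, and the actual GET would be issued with httpx.

```python
import time

JINA_READER_BASE = "https://r.jina.ai/"


def jina_request(url, api_key=None):
    """Return (reader_url, headers) for a Jina Reader GET."""
    headers = {}
    if api_key:
        # Assumed auth scheme; confirm against Jina's current docs.
        headers["Authorization"] = f"Bearer {api_key}"
    return JINA_READER_BASE + url, headers


class Throttle:
    """Space calls evenly so they stay under a requests-per-minute limit."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self._clock()
        if now < self._next_ok:
            self._sleep(self._next_ok - now)
            now = self._next_ok
        self._next_ok = now + self.interval
```

With JINA_RPM_LIMIT = 20 the throttle spaces calls 3 seconds apart; raising the limit to 200 after adding the key shortens that to 0.3 seconds.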

Claude CLI (subscription) — YES

Uses claude -p "<prompt>" --model {model} --output-format json via subprocess.
Already installed and authenticated through Dmitri’s Claude subscription. Zero additional cost.
Default model: Opus (best quality). Configurable to Sonnet as fallback (faster, lower quality).
All model/timeout settings live in config.py — no code changes needed to switch models.
Windows encoding gotcha: set encoding='utf-8' in subprocess.run() to avoid cp1252 issues.
Node.js process spawn overhead (~1-2s per call) is acceptable for 10-20 calls/day.
Fallback: pipeline has --skip-claude mode for regex-only scoring when CLI unavailable.

SQLite (stdlib sqlite3) — YES

Zero install. Python stdlib. Rock-solid on Windows.
Enable WAL mode (PRAGMA journal_mode=WAL) for concurrent readers and crash recovery.
Set PRAGMA busy_timeout=5000 to handle accidental concurrent pipeline runs.
Single jobs.db file beats JSON sprawl for dedup, querying, and lifecycle tracking.

Crawl4AI (v0.8.0) — YES (via WSL2)

Windows-native installation is documented as broken: GitHub issues #38, #949, #1705.
Runs natively in WSL2 — Linux is a first-class platform. pip install crawl4ai && crawl4ai-setup works cleanly.
Self-hosted JS rendering: no rate limits, no external API dependency, no cost.
Outputs clean Markdown (ideal for feeding to Claude for analysis).
“Adaptive Intelligence” feature learns selectors over time — useful for daily scraping of the same boards.
Primary tool for JS-rendered pages (useshiny.com AJAX content, GoFractional if unblocked).

python-jobspy (v1.1.82) — CONDITIONAL

Include as optional (try-import). Use for Indeed only.
Indeed scraper currently works with no rate limiting (per maintainer and recent issues).
LinkedIn: rate-limits after ~10 pages, requires proxies. Not worth the friction.
ZipRecruiter: broken (403/429 since Sep 2025). Google Jobs: returns 0 results. Glassdoor: 403.
Install: pip install python-jobspy — pure Python, no compiled deps.
If Indeed breaks in the future, disable and rely on niche boards.

Rejected Tools

curl-cffi (v0.14.0) — NO

TLS fingerprint impersonation for anti-bot bypass. Niche job boards don’t have TLS fingerprinting.
Adds a compiled C dependency (cffi + libcurl) with platform-specific wheels.
Solving a problem we don’t have. httpx with a User-Agent header is sufficient.
If a specific board later requires it, swap in for that one scraper only.

Anthropic Python SDK — NO

Would require a separate API key and per-token billing.
The claude CLI is already authenticated via Dmitri’s subscription. Zero additional cost.
SDK stays on the bench unless we need programmatic access without the CLI.

Playwright (standalone) — NO (use Crawl4AI instead)

Crawl4AI wraps Playwright and adds content extraction + markdown output.
No reason to use raw Playwright when Crawl4AI provides a better developer experience.
Playwright’s Chromium is installed automatically by crawl4ai-setup in WSL2.

Appendix B: Board Scrapability Audit (2026-03-23)

All boards were fetched and tested live on 2026-03-23 via WebFetch.

fractionaljobs.io — EXCELLENT

Platform: Webflow (server-rendered HTML + JS enhancements)
Bot protection: None. No robots.txt restrictions (only sitemap reference).
Inventory: 700+ job URLs in sitemap. ~40 visible on homepage with “View 24 more” link.
Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch each detail page via trafilatura. Classes: .job-item, .jobs-collection-list, .job-item_link-to-job.
Login required: No.
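The first pass of the sitemap strategy could look like the sketch below, using only the standard library. It assumes the sitemap follows the standard sitemaps.org namespace and that job pages live under /jobs/, as the audit found; `extract_job_urls()` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_job_urls(sitemap_xml, path_prefix="/jobs/"):
    """Return de-duplicated <loc> URLs whose path starts with path_prefix.

    sitemap_xml is the raw response body (bytes or str) from /sitemap.xml.
    """
    root = ET.fromstring(sitemap_xml)
    seen, urls = set(), []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if urlsplit(url).path.startswith(path_prefix) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Comparing this list against URLs already stored in SQLite gives the set of new detail pages to fetch, which keeps the second pass small on daily runs.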

findfractionaljobs.com — BEST (has API)

Platform: WordPress 6.9.4 with WP Job Manager + Workscout theme.
Bot protection: None. Open robots.txt (Disallow: empty).
Inventory: ~6 listings via API (small board).
Scraping strategy: GET /wp-json/wp/v2/job-listings/?per_page=100 returns structured JSON with title, content, excerpt, link, _company_name, _job_location, _remote_position, _salary_min, _salary_max, _rate_min, _rate_max, job-categories, job-types. Zero HTML parsing needed.
Login required: No for browsing.
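A normalizer for this board's JSON might be sketched as follows. The `title.rendered`, `content.rendered`, and `link` fields are standard WordPress REST output; where the underscore-prefixed custom fields actually sit (top level or under "meta") is an assumption, so the helper checks both. The units of `_rate_min`/`_rate_max` were not confirmed by the audit, so the raw range is stored without a unit.

```python
import html
import re


def _strip_tags(markup):
    """Crude HTML-to-text: drop tags, unescape entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", markup or "")
    return " ".join(html.unescape(text).split())


def normalize_findfractionaljobs(item):
    """Map one WP REST API item onto a subset of the unified job schema."""
    meta = item.get("meta") or {}

    def field(name):
        # Assumption: custom fields may sit at top level or under "meta".
        return item.get(name, meta.get(name))

    low, high = field("_rate_min"), field("_rate_max")
    return {
        "title": _strip_tags(item.get("title", {}).get("rendered")),
        "company": field("_company_name"),
        "location": field("_job_location"),
        "is_remote": 1 if field("_remote_position") in (True, 1, "1") else 0,
        "url": item.get("link"),
        "source_board": "findfractionaljobs",
        "description": _strip_tags(item.get("content", {}).get("rendered")),
        "compensation_raw": f"{low}-{high}" if low and high else None,
    }
```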

allfractionaljobs.com — GOOD

Platform: Jobboardly (SaaS, appears Rails-based).
Bot protection: None. Open robots.txt (Allow: /).
Inventory: 94 job URLs in sitemap. Each listing page shows 9 free jobs and 1 behind a $5/mo paywall.
Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch detail pages. Listings in <li> elements with company logo, title, type, location, compensation, hours.
Login required: No for free listings.

useshiny.com — MODERATE

Platform: WordPress + WooCommerce.
Bot protection: None. Open robots.txt (Yoast SEO).
Inventory: ~10 visible, “Show More Jobs” button suggests more via AJAX (jm-ajax endpoint pattern visible in source).
Scraping strategy: Fetch /job-postings page, parse <a> elements with <h4> titles and badges (type, location, compensation). May need to discover AJAX endpoint for full listing. No REST API exposed for jobs (unlike findfractionaljobs.com).
Login required: No for browsing.

gofractional.com — BLOCKED

Platform: Next.js (Vercel).
Bot protection: Aggressive. Returns 403 Forbidden on every automated request (homepage, /jobs, /job/*, sitemap).
Inventory: 251 jobs per Google index. Zero accessible via automated fetch.
robots.txt: Blocks /tag/, /skill/, /booking. Allows /jobs and /job/ — but the 403 blocks everything regardless.
Action: Deferred. Would require Playwright + stealth plugin + proxy rotation. High effort, ethically questionable given they’re clearly preventing automated access.

gigx.com — DROPPED

Platform: Drupal.
Issue: Executive profile directory, not a job board. Companies browse exec profiles. Zero job listings.
robots.txt: Blocks /search (the main way to find execs). TLS cert issues on bare domain (gigx.com fails; www.gigx.com works).
Search results: JS-rendered (spinner only without browser execution).
Action: Dropped. Wrong data model.

hirefractionaltalent.com — DROPPED

Platform: HubSpot CMS.
Issue: Not a job board. Consultant showcase / lead-gen site. Displays fractional exec profiles with “Free Consultation” CTA. Zero job listings.
Action: Dropped.

chiefjobs.com — TEMPORARILY UNAVAILABLE

Issue: SSL certificate chain broken (missing intermediate cert). All HTTP clients reject it. Browsers may work via cached intermediates.
Possible cause: Namecheap had SSL issuance delays on 2026-03-22.
Action: Recheck in 3-5 days. If SSL resolves, evaluate for C-suite job listings.

Indeed (via JobSpy) — WORKS

Indeed scraper in JobSpy v1.1.82 works with no rate limiting reported.
Anti-bot (DataDome) is handled by JobSpy internally.
Fragile long-term — depends on maintainer keeping up with Indeed’s changes.

LinkedIn (via JobSpy) — FRAGILE

Rate-limits after ~10 pages per IP. Requires proxies for meaningful volume.
Anti-bot escalation: canvas/WebGL/audio fingerprinting, ASN classification.
Not worth the friction for fractional job search volume.

ZipRecruiter (via JobSpy) — BROKEN

Returns 403/429 errors. Broken in JobSpy since September 2025 (issue #302, unresolved).
Dropped.

Open Items for Dmitri

Jina API key (Step 0) — Sign up at https://jina.ai/ (free, 30 seconds, no credit card). Gets 200 RPM vs 20 RPM. I’ll save the key to .env once you provide it. (I can’t create web accounts on your behalf.)
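allfractionaljobs.com subscription: $5/month unlocks the paywalled listings. The audit found 94 job URLs in the sitemap, with 9 free and 1 paywalled listing per page, so the number of gated listings should be confirmed before subscribing. Worth evaluating after we see the quality of free listings.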
