Job Board Scraper Pipeline — Implementation Plan (v3.2)
Context
Dmitri needs an automated system to find fractional executive job opportunities daily. The scraper will score listings with a hybrid approach (regex fast-pass + Claude deep analysis via claude CLI) and generate a daily markdown report with application briefs for top matches.
Forks battle-tested patterns from the fcto-pipeline (solanasis-scripts/fcto-pipeline/). All tools and boards have been independently validated through live testing and research — see Appendix A (Tool Evaluation) and Appendix B (Board Scrapability Audit) for full documentation.
Environment: Windows 11 with WSL2 available (Docker-capable). Claude Code subscription (no separate API key). Python 3.12.
Runtime: Pipeline runs from WSL2 to access Crawl4AI natively. Code lives on the Windows filesystem at solanasis-scripts/job-board-scraper/, accessed from WSL2 via /mnt/c/_my/_solanasis/....
Architecture
run_pipeline.py (orchestrator)
|
+--> scrapers.py Phase 1: Fetch listings (two-pass: index -> detail pages)
| trafilatura (static) + Crawl4AI (JS) + Jina (fallback)
|
+--> parsers.py Phase 2: Normalize + dedup into SQLite
|
+--> score_regex.py Phase 3: Fast regex scoring (all jobs, zero cost)
| Gate: score >= 15 passes to Claude
|
+--> analyze_claude.py Phase 4: Claude CLI (subscription) deep analysis
| ~10-20 jobs/day, included in subscription
|
+--> generate_report.py Phase 5: Daily markdown report
Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md
Storage: SQLite (data/jobs.db) — dedup, querying, lifecycle tracking, atomic writes. Zero deps (stdlib).
Tool Stack (Validated)
Only tools that passed independent evaluation. Full rationale in Appendix A.
| Tool | Role | Install | Status |
|---|---|---|---|
| httpx | HTTP client for static requests | Already installed (v0.28.1) | Proven in fcto-pipeline |
| trafilatura | Text extraction from static HTML | Already installed (v2.0.0) | Best benchmarked extractor (F1=0.945) |
| Crawl4AI | JS-rendered page scraping (primary) | pip install crawl4ai in WSL2 | Self-hosted, no rate limits, no external API. Runs natively in WSL2. |
| Jina Reader | JS fallback + quick page extraction | No install (API via httpx) | Fallback for Crawl4AI failures. Sign up for free key → 200 RPM. |
| Claude CLI | Job description analysis | Already installed (subscription) | claude -p "prompt" --model opus via subprocess. $0 additional cost. Configurable model. |
| sqlite3 | Data storage + dedup + tracking | Python stdlib | Zero install. WAL mode for crash recovery. |
| python-jobspy | Indeed job aggregation | pip install python-jobspy (optional) | Indeed works; LinkedIn fragile; ZipRecruiter broken. |
New installs needed:
- WSL2: pip install crawl4ai + crawl4ai-setup (installs Playwright browser; works natively on Linux/WSL2)
- WSL2 (optional): pip install python-jobspy
- Windows: nothing new
Rejected tools (with rationale):
- curl-cffi: Solves TLS fingerprinting anti-bot. Our niche boards have zero anti-bot. httpx with a User-Agent header is sufficient.
- Anthropic Python SDK: Would require a separate API key and per-token billing. The claude CLI is already authenticated via subscription and costs nothing additional.
- Playwright (standalone): Crawl4AI wraps Playwright and adds content extraction + markdown output. Use Crawl4AI instead of raw Playwright.
Target Boards (Validated — Live Tested)
Every board below was fetched and audited on 2026-03-23. Full results in Appendix B.
Tier 1: Confirmed Scrapable (build these)
| # | Board | Method | Inventory | Key Finding |
|---|---|---|---|---|
| 1 | fractionaljobs.io | Sitemap + trafilatura | 700+ jobs | Webflow, no bot protection, sitemap has all URLs. Best source. |
| 2 | findfractionaljobs.com | WP REST API (JSON) | ~6 jobs | GET /wp-json/wp/v2/job-listings/ returns structured JSON with salary data. Tiny but zero-effort. |
| 3 | allfractionaljobs.com | Sitemap + trafilatura | 94 jobs | Jobboardly platform, open robots.txt, sitemap with all URLs. 9 free, 1 paywalled per page. |
| 4 | useshiny.com | HTML parsing | ~10+ jobs | WordPress, server-rendered, “Show More” button suggests AJAX endpoint (jm-ajax). |
| 5 | Indeed (via JobSpy) | python-jobspy | Large | Indeed scraper currently works with no rate limiting. Fragile long-term. |
Tier 2: Conditional / Blocked (defer)
| Board | Issue | Action |
|---|---|---|
| gofractional.com | Returns 403 on ALL automated requests. 251 curated listings behind aggressive bot protection. | Defer. Would need Playwright + stealth + proxy. High effort for uncertain return. |
| LinkedIn (via JobSpy) | Rate-limits after ~10 pages. Requires proxies. | Include if Indeed works, skip if it doesn’t. |
| chiefjobs.com | SSL certificate broken (chain missing intermediate cert). Browsers handle it; HTTP clients reject it. | Recheck in 3-5 days. May resolve (Namecheap had SSL issues 2026-03-22). |
Dropped (not job boards)
| Board | Reason |
|---|---|
| hirefractionaltalent.com | Not a job board. It’s a consultant showcase / lead-gen site on HubSpot. Zero job listings. |
| gigx.com | Executive profile directory, not a job board. Companies browse exec profiles. Search is JS-rendered and robots.txt blocks /search. |
| ZipRecruiter (via JobSpy) | Returns 403 errors. Broken in JobSpy since Sep 2025. |
Configurability
All tunable settings live in config.py — no code changes needed to adjust behavior. The executing session should ensure these are all constants at the top of the file, not buried in function logic.
| Setting | Default | Purpose |
|---|---|---|
| CLAUDE_MODEL | "opus" | Model for job analysis. Set to "sonnet" for faster/cheaper fallback. |
| CLAUDE_TIMEOUT | 60 | Seconds per Claude CLI call. Opus needs more time than Sonnet. |
| REGEX_GATE_THRESHOLD | 15 | Minimum regex score to pass to Claude. Lower = more Claude calls. |
| TIERS | {"A": 75, "B": 50, "C": 25} | Claude ai_fit_score thresholds for tiering. |
| SCRAPE_CONCURRENCY | 3 | Max concurrent board scrapes. |
| SCRAPE_DELAY | 1.5 | Seconds between requests to same board. |
| SCRAPE_TIMEOUT | 20 | HTTP request timeout in seconds. |
| JINA_RPM_LIMIT | 20 | Jina Reader rate limit (200 with free API key). |
| JOBSPY_SEARCH_TERMS | [list] | Search queries for Indeed via JobSpy. |
| JOBSPY_RESULTS_PER_TERM | 25 | Max results per search term from JobSpy. |
| BOARDS | {dict} | Board registry: enable/disable boards, change methods/priorities. |
| DMITRI_PROFILE | string | Profile text fed to Claude for fit analysis. Update as services evolve. |
All settings are also overridable via CLI flags where it makes sense (--model sonnet, --boards fractionaljobs, --limit 5, etc.).
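A minimal sketch of how the constants in the table above might sit at the top of config.py, with two small helpers showing how the gate and tier thresholds could be applied. The helper names assign_tier and passes_regex_gate are hypothetical and used only for illustration.

```python
# Hypothetical excerpt of config.py: every tunable is a module-level constant.
CLAUDE_MODEL = "opus"          # set to "sonnet" for a faster fallback
CLAUDE_TIMEOUT = 60            # seconds per Claude CLI call
REGEX_GATE_THRESHOLD = 15      # minimum regex score forwarded to Claude
TIERS = {"A": 75, "B": 50, "C": 25}  # ai_fit_score thresholds (0-100 scale)
SCRAPE_CONCURRENCY = 3         # max concurrent board scrapes
SCRAPE_DELAY = 1.5             # seconds between requests to the same board
SCRAPE_TIMEOUT = 20            # HTTP request timeout in seconds
JINA_RPM_LIMIT = 20            # 200 once a free API key is configured


def assign_tier(ai_fit_score, tiers=TIERS):
    """Return the highest tier whose threshold the score meets, else None."""
    ordered = sorted(tiers.items(), key=lambda item: item[1], reverse=True)
    for name, threshold in ordered:
        if ai_fit_score >= threshold:
            return name
    return None


def passes_regex_gate(regex_score, threshold=REGEX_GATE_THRESHOLD):
    """True when a job's regex score is high enough to be sent to Claude."""
    return regex_score >= threshold
```

Keeping the thresholds as default arguments means a CLI override can pass a different value without editing function bodies.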
Files to Create (9 files in solanasis-scripts/job-board-scraper/)
1. config.py (~200 lines)
Fork from: fcto-pipeline/config.py
- Path setup: PIPELINE_DIR, DATA_DIR, DB_FILE, RAW_CACHE_DIR, DAILY_OUTREACH_DIR
- BOARDS registry: 5 Tier-1 boards with url, method, priority, sitemap_url (where applicable)
  - fractionaljobs.io: sitemap_url: "https://www.fractionaljobs.io/sitemap.xml", method: "sitemap_trafilatura"
  - findfractionaljobs.com: method: "wp_rest_api", api_url: "https://findfractionaljobs.com/wp-json/wp/v2/job-listings/"
  - allfractionaljobs.com: sitemap_url, method: "sitemap_trafilatura"
  - useshiny.com: method: "html_parse"
  - indeed: method: "jobspy", site_name: "indeed"
- Keyword regex lists (from continuation prompt lines 344-431)
- SCORING dict, REGEX_GATE_THRESHOLD = 15
- TIERS = {"A": 75, "B": 50, "C": 25} (Claude’s 0-100 scale)
- Scraping constants: SCRAPE_CONCURRENCY=3, SCRAPE_DELAY=1.5, SCRAPE_TIMEOUT=20
- JINA_READER_BASE = "https://r.jina.ai/", JINA_RPM_LIMIT = 20
- JOBSPY_SEARCH_TERMS, JOBSPY_RESULTS_PER_TERM = 25
- Compensation parsing regexes (hourly, monthly, annual shorthand)
- DMITRI_PROFILE condensed string (~300 tokens for Claude prompts)
- CLAUDE_MODEL = "opus": default to Opus for best quality; configurable to “sonnet” as fallback
- CLAUDE_TIMEOUT = 60: configurable timeout per call (Opus may be slower than Sonnet)
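The BOARDS registry described above could look roughly like the sketch below. The method names and the two full URLs come from this plan; the registry keys, the priority numbers, the enabled flag, the bare-domain host for the allfractionaljobs.com sitemap, and the select_boards helper are all assumptions made for illustration.

```python
# Hypothetical sketch of the BOARDS registry in config.py.
# "priority" and "enabled" values are illustrative, not audited settings.
BOARDS = {
    "fractionaljobs": {
        "method": "sitemap_trafilatura",
        "sitemap_url": "https://www.fractionaljobs.io/sitemap.xml",
        "priority": 1,
        "enabled": True,
    },
    "findfractionaljobs": {
        "method": "wp_rest_api",
        "api_url": "https://findfractionaljobs.com/wp-json/wp/v2/job-listings/",
        "priority": 2,
        "enabled": True,
    },
    "allfractionaljobs": {
        "method": "sitemap_trafilatura",
        "sitemap_url": "https://allfractionaljobs.com/sitemap.xml",  # assumed host
        "priority": 3,
        "enabled": True,
    },
    "useshiny": {"method": "html_parse", "priority": 4, "enabled": True},
    # Assumption: Indeed stays off until python-jobspy is installed.
    "indeed": {"method": "jobspy", "site_name": "indeed", "priority": 5, "enabled": False},
}


def select_boards(boards=BOARDS, boards_filter=None):
    """Return enabled board names in priority order, optionally filtered."""
    names = [
        name
        for name, cfg in boards.items()
        if cfg.get("enabled") and (boards_filter is None or name in boards_filter)
    ]
    return sorted(names, key=lambda name: boards[name]["priority"])
```

A flag such as --boards fractionaljobs would then map to select_boards(boards_filter=["fractionaljobs"]).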
2. db.py (~120 lines)
SQLite database layer (stdlib sqlite3)
- init_db(): create tables + indexes, enable WAL mode + busy_timeout=5000
- Tables: jobs (main), scraper_health (monitoring)
- CRUD: upsert_job(), get_new_jobs(since_date) (where date_first_seen >= since_date), get_unanalyzed_jobs() (where regex_pass = 1 AND ai_fit_score IS NULL), mark_applied(job_id), mark_skipped(job_id), get_scraper_health(board, days=7), get_stats()
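A minimal sketch of the database layer, assuming a trimmed jobs table (the full column list is in the SQLite Schema section). The function signatures, including the extra conn and today parameters, are illustrative assumptions rather than the final db.py interface.

```python
import sqlite3

# Trimmed schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    company TEXT,
    url TEXT,
    source_board TEXT NOT NULL,
    date_first_seen TEXT NOT NULL,
    date_last_seen TEXT NOT NULL,
    regex_pass INTEGER DEFAULT 0,
    ai_fit_score INTEGER,
    updated_at TEXT DEFAULT (datetime('now'))
);
"""


def init_db(db_path):
    """Open the database, apply the pragmas from the plan, create tables."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.executescript(SCHEMA)
    return conn


def upsert_job(conn, job, today):
    """Insert a job or refresh date_last_seen; return True if it was new."""
    is_new = conn.execute(
        "SELECT 1 FROM jobs WHERE job_id = ?", (job["job_id"],)
    ).fetchone() is None
    conn.execute(
        """
        INSERT INTO jobs (job_id, title, company, url, source_board,
                          date_first_seen, date_last_seen)
        VALUES (:job_id, :title, :company, :url, :source_board, :today, :today)
        ON CONFLICT(job_id) DO UPDATE SET
            date_last_seen = excluded.date_last_seen,
            updated_at = datetime('now')
        """,
        {**job, "today": today},
    )
    conn.commit()
    return is_new


def get_unanalyzed_jobs(conn):
    """Jobs that passed the regex gate but have no Claude score yet."""
    return conn.execute(
        "SELECT * FROM jobs WHERE regex_pass = 1 AND ai_fit_score IS NULL"
    ).fetchall()
```

The upsert leaves date_first_seen untouched on conflict, so a listing seen on consecutive days keeps its original first-seen date and is not reported as new again.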
3. scrapers.py (~400 lines)
Fork from: fcto-pipeline/enrich_websites.py
- Three scraping tiers:
  - trafilatura: static HTML pages (fast, reliable)
  - Crawl4AI: JS-rendered pages (self-hosted in WSL2, no rate limits)
  - Jina Reader: fallback if Crawl4AI fails on a specific page
- Cache: per-board-per-day JSON files in data/raw/ (reuse cache_key/load_cache/save_cache from fcto-pipeline)
- Sitemap-based scrapers (fractionaljobs.io, allfractionaljobs.com):
  - Pass 1: Fetch sitemap XML via httpx, extract job URLs
  - Pass 2: Fetch each job detail page via trafilatura (with delay)
- WP REST API scraper (findfractionaljobs.com):
  - Single httpx API call returns structured JSON; no HTML parsing needed
- Crawl4AI scraper (useshiny.com for AJAX content):
  - Use AsyncWebCrawler to render JS, get clean markdown output
  - Falls back to Jina Reader (scrape_with_jina()) if Crawl4AI fails
  - (Tier 2 boards like gofractional.com can reuse this code path later if unblocked)
- HTML parse scraper (useshiny.com static fallback):
  - Fetch listing page, extract job URLs from HTML
  - Fetch each detail page via trafilatura
  - (chiefjobs.com can reuse this pattern when SSL resolves)
- JobSpy scraper (Indeed):
  - try-import python-jobspy, search with JOBSPY_SEARCH_TERMS
  - DataFrame → list[dict]
- Async orchestrator: scrape_all_boards(force, boards_filter) with semaphore
- Health tracking: record result counts per board in SQLite
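Pass 1 of the sitemap-based scrapers can be sketched with the standard library alone. The function names extract_job_urls and select_unseen are hypothetical, and the /jobs/ path marker follows the URL pattern noted in Appendix B; fetching the XML itself (via httpx) is left out so the example stays offline.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_job_urls(sitemap_xml, path_marker="/jobs/"):
    """Return every <loc> URL in the sitemap that contains path_marker.

    Accepts str or bytes, so an HTTP response body can be passed directly.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if path_marker in url:
            urls.append(url)
    return urls


def select_unseen(urls, seen_urls):
    """Keep only URLs not already stored, preserving sitemap order."""
    seen = set(seen_urls)
    return [url for url in urls if url not in seen]
```

Filtering against already-stored URLs before Pass 2 is what keeps daily runs cheap: only new detail pages are fetched.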
4. parsers.py (~180 lines)
- Unified schema: job_id, title, company, location, is_remote, url, source_board, description, compensation_raw, compensation_hourly, date_posted, date_scraped
- Per-board normalizers (one function each)
- make_job_id(url, title, company): MD5 of canonical URL (primary) or normalized title+company (cross-board dedup). Not fuzzy; normalized exact match (lowercase, strip whitespace, remove Inc/LLC).
- parse_compensation(text): regex extraction, normalize to hourly rate
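The two helpers above might be sketched as follows. Several details are assumptions not stated in the plan: the canonical URL drops the scheme, a leading www., the query string, and a trailing slash; normalization also strips punctuation; and annual or monthly pay is converted to hourly using a full-time 2080-hour year, which may not suit fractional roles.

```python
import hashlib
import re
from urllib.parse import urlsplit

HOURS_PER_YEAR = 2080                  # assumption: 40 h/week * 52 weeks
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # assumption derived from the above

_SUFFIX_RE = re.compile(r"\b(?:inc|llc)\b\.?")
_COMP_RE = re.compile(
    r"\$\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*(?P<k>k)?\s*"
    r"(?:/|per\s+|an?\s+)\s*(?P<unit>hr|hour|mo|month|yr|year)",
    re.IGNORECASE,
)


def canonical_url(url):
    """Host (without www.) plus path, ignoring scheme, query and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def normalize_text(text):
    """Lowercase, drop Inc/LLC, collapse punctuation and whitespace."""
    text = _SUFFIX_RE.sub("", (text or "").lower())
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def make_job_id(url, title, company):
    """MD5 of the canonical URL, or of normalized title+company if no URL."""
    if url:
        key = "url:" + canonical_url(url)
    else:
        key = "tc:" + normalize_text(title) + "|" + normalize_text(company)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def parse_compensation(text):
    """Return an hourly rate rounded to cents, or None if nothing matches."""
    match = _COMP_RE.search(text or "")
    if not match:
        return None
    amount = float(match.group("amount").replace(",", ""))
    if match.group("k"):
        amount *= 1000
    unit = match.group("unit")[0].lower()  # h, m or y
    divisor = {"h": 1, "m": HOURS_PER_MONTH, "y": HOURS_PER_YEAR}[unit]
    return round(amount / divisor, 2)
```

MD5 is used here purely as a stable fingerprint for dedup, not for security.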
5. score_regex.py (~180 lines)
Fork from: fcto-pipeline/score_prospects.py
- check_keywords(text, patterns): regex matching with human-readable output
- score_job(job) → (score, breakdown_string): all signals from continuation prompt
- Gate: score >= 15 → regex_pass = True
- score_all_jobs(): batch score, update SQLite, print tier distribution
6. analyze_claude.py (~250 lines)
Uses claude CLI via subprocess — authenticated through Claude subscription, no API key needed.
- call_claude(prompt, model=None) wrapper function:
  - Model defaults to config.CLAUDE_MODEL (Opus) but accepts override per call
  - Runs claude -p "<prompt>" --model {model} --output-format json via subprocess.run()
  - Sets encoding='utf-8' to avoid Windows cp1252 issues
  - Returns parsed JSON response
  - Timeout: configurable via config.CLAUDE_TIMEOUT (default 60s for Opus)
- analyze_job(job): single Claude call per job:
  - Prompt includes: DMITRI_PROFILE + job description + structured output instructions
  - Response: JSON with ai_fit_score (0-100), ai_tier, ai_reasoning, requirements_match, red_flags, referral_opportunity, estimated_seniority, engagement_type
- generate_brief(job, analysis): second call for A-tier only:
  - Why Dmitri is a fit, talking points, concerns to address, application approach
- analyze_all_jobs(): reads unanalyzed from SQLite, calls Claude, updates DB
- Results cached in SQLite (no re-analyzing same job on re-run)
Cost: $0 additional — covered by existing Claude subscription. No API key, no metered billing.
Fallback: If claude CLI is unavailable (e.g., scheduled task without Claude Code running), the pipeline works in --skip-claude mode using regex-only scoring.
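The call_claude wrapper might be sketched like this. Two things are assumptions: that the CLI's JSON output is an envelope whose "result" field holds the model's text, and that the model's structured answer can be recovered by slicing from the first "{" to the last "}". The run and attempts parameters are hypothetical additions that make the retry-once behaviour testable without the CLI.

```python
import json
import subprocess


def build_claude_cmd(prompt, model):
    """Argument list for the CLI invocation described in this plan."""
    return ["claude", "-p", prompt, "--model", model, "--output-format", "json"]


def extract_json_object(text):
    """Parse the outermost {...} span found in text."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in Claude response")
    return json.loads(text[start:end + 1])


def call_claude(prompt, model="opus", timeout=60, run=subprocess.run, attempts=2):
    """Run the CLI and return the parsed analysis, retrying on bad output."""
    last_error = None
    for _ in range(attempts):
        try:
            proc = run(
                build_claude_cmd(prompt, model),
                capture_output=True,
                text=True,
                encoding="utf-8",  # avoids cp1252 decoding on Windows
                timeout=timeout,
                check=True,
            )
            envelope = json.loads(proc.stdout)
            # Assumption: the model's text sits under the "result" key.
            return extract_json_object(envelope["result"])
        except (ValueError, KeyError, TypeError, subprocess.SubprocessError) as exc:
            last_error = exc
    raise RuntimeError(f"Claude analysis failed after {attempts} attempts") from last_error
```

A caller can catch RuntimeError and leave ai_fit_score as NULL, so the job falls back to its regex-only result and is picked up again on the next run.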
7. generate_report.py (~280 lines)
Fork from: fcto-pipeline/generate_daily_outreach.py
- Reads from SQLite: today’s analyzed jobs by tier
- Markdown report:
- Header: date, stats (new listings, A/B tier counts, total tracked)
- A-Tier: full job cards with application brief, Claude reasoning, score breakdown
- B-Tier: summary cards with reasoning
- Referral opportunities (if any)
- Board health warnings (if any scraper anomalously low)
- Summary stats table (per-board)
- Quick actions (CLI commands)
- CLI: --mark-applied id, --mark-skipped id, --status, --tier A
- Output: solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md
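As a rough sketch of the report step, the snippet below builds the dated output path and renders a stripped-down header plus A-tier and B-tier sections. The heading text, card layout, and function names are placeholders; the real report also carries referral, health, and stats sections.

```python
from datetime import date
from pathlib import Path


def report_path(outreach_dir, day):
    """Path of the daily report, e.g. .../2026-03-23-jobs.md."""
    return Path(outreach_dir) / f"{day.isoformat()}-jobs.md"


def render_report(day, jobs):
    """Render analyzed jobs (dicts) into a minimal markdown report."""
    by_tier = {"A": [], "B": []}
    for job in jobs:
        if job.get("ai_tier") in by_tier:
            by_tier[job["ai_tier"]].append(job)

    lines = [
        f"# Job Report {day.isoformat()}",
        "",
        f"New listings: {len(jobs)} | A-tier: {len(by_tier['A'])} | B-tier: {len(by_tier['B'])}",
    ]
    for tier in ("A", "B"):
        lines += ["", f"## {tier}-Tier"]
        ranked = sorted(by_tier[tier], key=lambda j: j["ai_fit_score"], reverse=True)
        for job in ranked:
            lines.append(
                f"- [{job['title']}]({job['url']}) at {job['company']} "
                f"(score {job['ai_fit_score']})"
            )
            # Only A-tier cards carry the application brief.
            if tier == "A" and job.get("application_brief"):
                lines.append(f"  - Brief: {job['application_brief']}")
    return "\n".join(lines) + "\n"
```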
8. run_pipeline.py (~100 lines)
Orchestrator:
- Init DB + create dirs
- scrapers.scrape_all_boards()
- parsers.normalize_and_dedup()
- score_regex.score_all_jobs()
- analyze_claude.analyze_all_jobs()
- generate_report.generate_daily_report()
CLI flags:
- --force: re-scrape even if cached
- --boards <names>: comma-separated board names (default: all enabled)
- --limit N: max jobs per board (for testing)
- --skip-claude: regex-only mode, no Claude analysis
- --model <name>: override CLAUDE_MODEL (e.g., --model sonnet)
- --dry-run: show what would be scraped, no HTTP requests
- --recheck-scores: re-score all jobs against current regex/config without re-scraping
- Windows event loop policy for async code
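The flags listed above map directly onto argparse. This is a sketch of one possible parser; the build_parser name and the choice to return None for unset options (so the caller falls back to config.py) are assumptions.

```python
import argparse


def _csv(value):
    """Split a comma-separated flag value into a clean list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--force", action="store_true",
                        help="re-scrape even if cached")
    parser.add_argument("--boards", type=_csv, default=None,
                        help="comma-separated board names (default: all enabled)")
    parser.add_argument("--limit", type=int, default=None,
                        help="max jobs per board (for testing)")
    parser.add_argument("--skip-claude", action="store_true",
                        help="regex-only mode, no Claude analysis")
    parser.add_argument("--model", default=None,
                        help="override CLAUDE_MODEL, e.g. sonnet")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be scraped, no HTTP requests")
    parser.add_argument("--recheck-scores", action="store_true",
                        help="re-score stored jobs without re-scraping")
    return parser
```

For example, build_parser().parse_args(["--boards", "fractionaljobs", "--skip-claude"]) yields boards=["fractionaljobs"] and skip_claude=True, matching the Step 1 smoke test.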
9. requirements.txt
# Core
trafilatura>=2.0.0
httpx>=0.27.0
crawl4ai>=0.8.0 # JS-rendered page scraping (run in WSL2)
# Optional: Indeed scraping (fragile, not essential)
# python-jobspy>=1.1.0
Note: Claude analysis uses the claude CLI (subscription-authenticated, no SDK needed). SQLite is stdlib. Jina Reader is an API called via httpx (no package). Install all deps in WSL2’s Python environment.
Directory structure:
solanasis-scripts/job-board-scraper/
config.py
db.py
scrapers.py
parsers.py
score_regex.py
analyze_claude.py
generate_report.py
run_pipeline.py
requirements.txt
data/
jobs.db # SQLite (all data + tracker)
raw/ # Per-board-per-day scrape cache (JSON)
SQLite Schema
-- Enable WAL mode and set busy timeout on connection
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
CREATE TABLE IF NOT EXISTS jobs (
job_id TEXT PRIMARY KEY,
title TEXT NOT NULL,
company TEXT,
location TEXT,
url TEXT,
source_board TEXT NOT NULL,
description TEXT,
compensation_raw TEXT,
compensation_hourly REAL,
date_posted TEXT,
date_first_seen TEXT NOT NULL,
date_last_seen TEXT NOT NULL,
is_remote INTEGER DEFAULT 0,
-- Regex scoring
regex_score INTEGER,
regex_breakdown TEXT,
regex_pass INTEGER DEFAULT 0,
-- Claude analysis (NULL until analyzed)
ai_fit_score INTEGER,
ai_tier TEXT,
ai_reasoning TEXT,
requirements_match TEXT, -- JSON
red_flags TEXT, -- JSON
referral_opportunity INTEGER DEFAULT 0,
referral_notes TEXT,
application_brief TEXT,
-- Pipeline state
pipeline_status TEXT DEFAULT 'new',
applied_date TEXT,
notes TEXT,
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS scraper_health (
board TEXT NOT NULL,
date TEXT NOT NULL,
jobs_found INTEGER DEFAULT 0,
jobs_new INTEGER DEFAULT 0,
error TEXT,
duration_seconds REAL,
PRIMARY KEY (board, date)
);
CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs(pipeline_status);
CREATE INDEX IF NOT EXISTS idx_jobs_tier ON jobs(ai_tier);
CREATE INDEX IF NOT EXISTS idx_jobs_source ON jobs(source_board);
CREATE INDEX IF NOT EXISTS idx_jobs_first_seen ON jobs(date_first_seen);
Build Order
Step 0: Environment Setup (WSL2 + Jina)
- Set up Python environment in WSL2 (if not already): wsl -- pip install trafilatura httpx crawl4ai
- Run wsl -- crawl4ai-setup (installs Playwright Chromium in WSL2; works natively on Linux)
- Dmitri: Sign up for free Jina API key at https://jina.ai/ (30 seconds, no credit card). Provide key to save in .env.
- Save JINA_API_KEY=<key> to solanasis-scripts/.env
- Optionally install python-jobspy for Indeed: wsl -- pip install python-jobspy
Step 1: Foundation + fractionaljobs.io (largest source, 700+ jobs)
- Create directory structure
- Write requirements.txt
- Write config.py
- Write db.py
- Write scrapers.py: fractionaljobs.io only (sitemap → detail pages via trafilatura)
- Write parsers.py: unified schema, normalize_fractionaljobs, make_job_id, parse_compensation
- Write score_regex.py
- Write run_pipeline.py (--skip-claude mode)
- Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs --skip-claude
- Spot-check 5 scraped listings against actual site
Step 2: Add remaining Tier-1 boards
- findfractionaljobs.com (WP REST API — easiest)
- allfractionaljobs.com (sitemap — same pattern as fractionaljobs.io)
- useshiny.com (Crawl4AI for AJAX + HTML fallback)
- Indeed via JobSpy (conditional, try-import)
- Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --skip-claude
Step 3: Claude analysis
- Write analyze_claude.py
- Test on 5-10 real listings: validate structured output
- Tune regex gate threshold based on real data
- Test: wsl -- python /mnt/c/_my/_solanasis/solanasis-scripts/job-board-scraper/run_pipeline.py --boards fractionaljobs
Step 4: Daily report + scheduling
- Write generate_report.py
- Test: Full end-to-end pipeline
- Verify solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md output
- Test: --mark-applied, --mark-skipped, --status
- Set up Windows Task Scheduler (daily 7:00 AM)
Verification
- python run_pipeline.py --boards fractionaljobs --skip-claude: smoke test (run via WSL2)
- Spot-check 5 scraped listings against actual website
- Review regex scoring: do pass/reject decisions make sense?
- python run_pipeline.py --boards fractionaljobs: with Claude
- Review Claude output: are ai_fit_scores reasonable? Red flags accurate?
- Full pipeline: python run_pipeline.py
- Open solanasis-docs/daily-outreach/YYYY-MM-DD-jobs.md and verify format
- python run_pipeline.py --recheck-scores: re-score without re-scraping
- python generate_report.py --mark-applied <id>: tracker test
- python generate_report.py --status: stats test
- Re-run next day: verify dedup, cache, only new jobs appear
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| trafilatura strips job listing tables on index pages | Use sitemap URLs to go direct to detail pages (bypass index). For useshiny.com, use raw HTML parsing for index. |
| Crawl4AI setup fails in WSL2 | Fall back to Jina Reader (free key, 200 RPM). Jina handles JS rendering server-side. |
| GoFractional 403 blocks even Crawl4AI | Board is deferred. May require residential proxy — evaluate if inventory (251 listings) justifies the effort. |
| JobSpy Indeed scraper breaks | Indeed is supplemental. Primary value is niche boards. Disable with --boards flag. |
| Board HTML changes | Scraper health monitoring in SQLite. Auto-warn in daily report if a board returns <50% of 7-day average. |
| Claude structured output malformed | Validate JSON, retry once, fall back to regex-only tier on failure. |
| Claude CLI unavailable (e.g., scheduled task) | --skip-claude flag runs regex-only mode. Pipeline works without Claude. |
| Two-pass scraping is slow | Cache detail pages per-job-id. Only scrape new/unseen URLs. fractionaljobs.io sitemap lets us check for new URLs before fetching. |
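The board-health mitigation in the table above (warn when a board returns less than 50% of its 7-day average) reduces to a small comparison. The function name and message format are hypothetical; previous_counts stands for the jobs_found values that get_scraper_health(board, days=7) would return.

```python
def health_warning(board, today_count, previous_counts, ratio=0.5):
    """Return a warning string if today's count is anomalously low, else None.

    previous_counts: jobs_found values for the trailing days (may be empty
    for a newly added board, in which case no warning is raised).
    """
    if not previous_counts:
        return None
    average = sum(previous_counts) / len(previous_counts)
    if today_count < ratio * average:
        return (
            f"{board}: {today_count} jobs today vs trailing average "
            f"{average:.1f}"
        )
    return None
```

A count exactly at 50% of the average does not trigger a warning in this sketch, matching the "<50%" wording.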
Key Files to Read Before Implementation
| File | What to Fork |
|---|---|
| fcto-pipeline/config.py | Path setup (L9-16), keyword regex structure (L18-84), scoring dict (L87-109), tiers (L112-117) |
| fcto-pipeline/enrich_websites.py | Cache helpers (L101-124), scrape_with_trafilatura (L159-172), scrape_with_jina (L175-188), check_keywords (L195-206), async orchestrator (L249-309), Windows policy (L535-537) |
| fcto-pipeline/score_prospects.py | score_prospect (L36-122), assign_tier (L125-134) |
| fcto-pipeline/generate_daily_outreach.py | Tracker load/save (L38-50), mark operations (L53-93), CLI (L660-693), daily brief (L436-579) |
Appendix A: Tool Evaluation (2026-03-23)
Approved Tools
httpx (v0.28.1) — YES
- Pure Python async HTTP client. Already installed and proven in fcto-pipeline.
- Sufficient for niche job boards with no anti-bot protection.
- No compiled dependencies, no platform issues.
trafilatura (v2.0.0) — YES
- Best benchmarked content extractor (F1=0.945, beating readability-lxml, newspaper3k, goose3).
- Already installed. Pure Python, Windows-tested.
- Gotcha: Optimized for “article” content — may aggressively strip job listing index pages (thinks card layout is navigation). Mitigated by scraping detail pages directly via sitemap URLs.
- Use include_tables=True to prevent table stripping.
Jina Reader (r.jina.ai) — YES (fallback for Crawl4AI)
- Zero-install JS rendering: prepend https://r.jina.ai/ to any URL, get clean markdown via HTTP GET.
- Free tier: 20 RPM without key, 200 RPM with free key (signup takes 30 seconds).
- Role: Fallback when Crawl4AI fails or for quick one-off page fetches.
- Tested: Returns clean content for most JS-rendered pages. Fails gracefully (error or empty, no hangs).
- Dmitri to sign up for free key → store as JINA_API_KEY in .env.
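The "prepend the reader URL" mechanic can be sketched without touching the network. The helper names are hypothetical, and sending the key as an Authorization: Bearer header is an assumption about how the Jina key is supplied; the spacing helper simply converts the RPM limits above into a minimum delay between calls.

```python
JINA_READER_BASE = "https://r.jina.ai/"


def build_jina_request(target_url, api_key=None):
    """Return (request_url, headers) for a Jina Reader GET.

    Assumption: the optional API key is sent as a Bearer token.
    """
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return JINA_READER_BASE + target_url, headers


def min_interval_seconds(rpm_limit):
    """Smallest delay between calls that stays within rpm_limit."""
    return 60.0 / rpm_limit
```

The returned pair can be passed to an HTTP client, for example httpx.get(request_url, headers=headers, timeout=20), with a sleep of min_interval_seconds(JINA_RPM_LIMIT) between pages.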
Claude CLI (subscription) — YES
- Uses claude -p "<prompt>" --model {model} --output-format json via subprocess.
- Already installed and authenticated through Dmitri’s Claude subscription. Zero additional cost.
- Default model: Opus (best quality). Configurable to Sonnet as fallback (faster, lower quality).
- All model/timeout settings live in config.py; no code changes needed to switch models.
- Windows encoding gotcha: set encoding='utf-8' in subprocess.run() to avoid cp1252 issues.
- Node.js process spawn overhead (~1-2s per call) is acceptable for 10-20 calls/day.
- Fallback: pipeline has --skip-claude mode for regex-only scoring when CLI unavailable.
SQLite (stdlib sqlite3) — YES
- Zero install. Python stdlib. Rock-solid on Windows.
- Enable WAL mode (PRAGMA journal_mode=WAL) for concurrent readers and crash recovery.
- Set PRAGMA busy_timeout=5000 to handle accidental concurrent pipeline runs.
- Single jobs.db file beats JSON sprawl for dedup, querying, and lifecycle tracking.
Crawl4AI (v0.8.0) — YES (via WSL2)
- Windows-native installation is documented as broken: GitHub issues #38, #949, #1705.
- Runs natively in WSL2; Linux is a first-class platform. pip install crawl4ai && crawl4ai-setup works cleanly.
- Outputs clean Markdown (ideal for feeding to Claude for analysis).
- “Adaptive Intelligence” feature learns selectors over time — useful for daily scraping of the same boards.
- Primary tool for JS-rendered pages (useshiny.com AJAX content, GoFractional if unblocked).
python-jobspy (v1.1.82) — CONDITIONAL
- Include as optional (try-import). Use for Indeed only.
- Indeed scraper currently works with no rate limiting (per maintainer and recent issues).
- LinkedIn: rate-limits after ~10 pages, requires proxies. Not worth the friction.
- ZipRecruiter: broken (403/429 since Sep 2025). Google Jobs: returns 0 results. Glassdoor: 403.
- Install: pip install python-jobspy (pure Python, no compiled deps).
- If Indeed breaks in the future, disable and rely on niche boards.
Rejected Tools
curl-cffi (v0.14.0) — NO
- TLS fingerprint impersonation for anti-bot bypass. Niche job boards don’t have TLS fingerprinting.
- Adds a compiled C dependency (cffi + libcurl) with platform-specific wheels.
- Solving a problem we don’t have. httpx with a User-Agent header is sufficient.
- If a specific board later requires it, swap in for that one scraper only.
Anthropic Python SDK — NO
- Would require a separate API key and per-token billing.
- The claude CLI is already authenticated via Dmitri’s subscription. Zero additional cost.
- SDK stays on the bench unless we need programmatic access without the CLI.
Playwright (standalone) — NO (use Crawl4AI instead)
- Crawl4AI wraps Playwright and adds content extraction + markdown output.
- No reason to use raw Playwright when Crawl4AI provides a better developer experience.
- Playwright’s Chromium is installed automatically by crawl4ai-setup in WSL2.
Appendix B: Board Scrapability Audit (2026-03-23)
All boards were fetched and tested live on 2026-03-23 via WebFetch.
fractionaljobs.io — EXCELLENT
- Platform: Webflow (server-rendered HTML + JS enhancements)
- Bot protection: None. No robots.txt restrictions (only sitemap reference).
- Inventory: 700+ job URLs in sitemap. ~40 visible on homepage with “View 24 more” link.
- Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch each detail page via trafilatura. Classes: .job-item, .jobs-collection-list, .job-item_link-to-job.
- Login required: No.
findfractionaljobs.com — BEST (has API)
- Platform: WordPress 6.9.4 with WP Job Manager + Workscout theme.
- Bot protection: None. Open robots.txt (Disallow: empty).
- Inventory: ~6 listings via API (small board).
- Scraping strategy: GET /wp-json/wp/v2/job-listings/?per_page=100 returns structured JSON with title, content, excerpt, link, _company_name, _job_location, _remote_position, _salary_min, _salary_max, _rate_min, _rate_max, job-categories, job-types. Zero HTML parsing needed.
- Login required: No for browsing.
allfractionaljobs.com — GOOD
- Platform: Jobboardly (SaaS, appears Rails-based).
- Bot protection: None. Open robots.txt (Allow: /).
- Inventory: 94 job URLs in sitemap. 9 free per page, 1 behind $5/mo paywall.
- Scraping strategy: Fetch /sitemap.xml → extract all /jobs/* URLs → fetch detail pages. Listings in <li> elements with company logo, title, type, location, compensation, hours.
- Login required: No for free listings.
useshiny.com — MODERATE
- Platform: WordPress + WooCommerce.
- Bot protection: None. Open robots.txt (Yoast SEO).
- Inventory: ~10 visible, “Show More Jobs” button suggests more via AJAX (jm-ajax endpoint pattern visible in source).
- Scraping strategy: Fetch /job-postings page, parse <a> elements with <h4> titles and badges (type, location, compensation). May need to discover AJAX endpoint for full listing. No REST API exposed for jobs (unlike findfractionaljobs.com).
- Login required: No for browsing.
gofractional.com — BLOCKED
- Platform: Next.js (Vercel).
- Bot protection: Aggressive. Returns 403 Forbidden on every automated request (homepage, /jobs, /job/*, sitemap).
- Inventory: 251 jobs per Google index. Zero accessible via automated fetch.
- robots.txt: Blocks /tag/, /skill/, /booking. Allows /jobs and /job/, but the 403 blocks everything regardless.
- Action: Deferred. Would require Playwright + stealth plugin + proxy rotation. High effort, ethically questionable given they’re clearly preventing automated access.
gigx.com — DROPPED
- Platform: Drupal.
- Issue: Executive profile directory, not a job board. Companies browse exec profiles. Zero job listings.
- robots.txt: Blocks /search (the main way to find execs). TLS cert issues on bare domain (gigx.com fails; www.gigx.com works).
- Search results: JS-rendered (spinner only without browser execution).
- Action: Dropped. Wrong data model.
hirefractionaltalent.com — DROPPED
- Platform: HubSpot CMS.
- Issue: Not a job board. Consultant showcase / lead-gen site. Displays fractional exec profiles with “Free Consultation” CTA. Zero job listings.
- Action: Dropped.
chiefjobs.com — TEMPORARILY UNAVAILABLE
- Issue: SSL certificate chain broken (missing intermediate cert). All HTTP clients reject it. Browsers may work via cached intermediates.
- Possible cause: Namecheap had SSL issuance delays on 2026-03-22.
- Action: Recheck in 3-5 days. If SSL resolves, evaluate for C-suite job listings.
Indeed (via JobSpy) — WORKS
- Indeed scraper in JobSpy v1.1.82 works with no rate limiting reported.
- Anti-bot (DataDome) is handled by JobSpy internally.
- Fragile long-term — depends on maintainer keeping up with Indeed’s changes.
LinkedIn (via JobSpy) — FRAGILE
- Rate-limits after ~10 pages per IP. Requires proxies for meaningful volume.
- Anti-bot escalation: canvas/WebGL/audio fingerprinting, ASN classification.
- Not worth the friction for fractional job search volume.
ZipRecruiter (via JobSpy) — BROKEN
- Returns 403/429 errors. Broken in JobSpy since September 2025 (issue #302, unresolved).
- Dropped.
Open Items for Dmitri
- Jina API key (Step 0): Sign up at https://jina.ai/ (free, 30 seconds, no credit card). Gets 200 RPM vs 20 RPM. I’ll save the key to .env once you provide it. (I can’t create web accounts on your behalf.)
- allfractionaljobs.com subscription: $5/month unlocks 86 paywalled listings (only 9 free). Worth evaluating after we see the quality of free listings.