LinkedIn Data Mining Pipeline — Implementation Prompt

Usage: Feed this entire file as context to a new Claude Code session on any machine that has access to the LinkedIn data export and the solanasis-scripts repo. This is fully self-contained; no external plan files or memory needed.

Created: 2026-03-22 Senior Reviewed: Yes (all findings addressed)


Task

Build a complete LinkedIn data mining pipeline: 14 Python scripts that parse a LinkedIn data export (37 CSV files) into clean CSV/JSON intermediaries, generate GTM intelligence and voice analysis reports, and produce CRM-ready import files.


Who You’re Working For

Dmitri Zasage (aka Dmitri Sunshine on LinkedIn), CEO of Solanasis LLC, a fractional CIO/CSIO/COO firm based in Boulder, Colorado. One-person operation with 1099 contractors, growth hacking mindset.

Solanasis offerings: Security Assessments, Disaster Recovery Verification, Data Migrations, CRM Setup, Systems Integration, Responsible AI Implementation

Target market: Nonprofits, private foundations (50M), growing organizations, wealth management firms

CRM systems: ERPNext at db.zasage.us (primary, custom), Baserow (temporary). Both accept standard CSV imports.


Three Goals

  1. CRM Population — Parse 2,500 connections + message history + invitations into scored, tiered contact records ready for ERPNext/Baserow import
  2. GTM Intelligence — Network composition analysis, ICP validation, attention audit, warm introduction mapping, referral partner identification
  3. Voice/Style Mining — Extract writing patterns from 225 comments, 258 shares, ~3,000 messages, 1,549 invitations to validate and extend Dmitri’s documented voice profile

The LinkedIn Data Export

Location: Update this path for your machine:

C:\Users\zasya\Documents\_zasage\data-export\Complete_LinkedInDataExport_03-13-2026.zip\

Already extracted. 37 CSV files. Dmitri’s LinkedIn profile URL: https://www.linkedin.com/in/dmitri-sunshine

File Inventory with Schemas

Connections.csv (2,500 rows)

  • IMPORTANT: 3-line preamble (notes about email privacy) before actual CSV columns at line 4. Skip first 3 lines.
  • Columns: First Name, Last Name, URL, Email Address, Company, Position, Connected On
  • Date format: “11 Mar 2026”
  • Emails: sparse (~10% have them; LinkedIn privacy opt-out)
  • Names may include credentials: e.g., "Beckmann, RFC®" (quoted, with comma inside)

messages.csv (~3,000 logical rows; 9,110 raw lines due to multiline quoted content)

  • Columns: CONVERSATION ID, CONVERSATION TITLE, FROM, SENDER PROFILE URL, TO, RECIPIENT PROFILE URLS, DATE, SUBJECT, CONTENT, FOLDER, ATTACHMENTS, IS MESSAGE DRAFT, IS CONVERSATION DRAFT
  • Date format: “2026-03-11 23:09:42 UTC”
  • Content field: multiline, properly quoted per RFC 4180
  • Date range: 2010-2026 (16 years of history)
  • Use csv.reader (NOT line counting) to get true logical row count

Invitations.csv (1,549 rows)

  • Columns: From, To, Sent At, Message, Direction, inviterProfileUrl, inviteeProfileUrl
  • Date format: “3/10/26, 1:02 PM”
  • Direction: OUTGOING or INCOMING
  • Dominant outgoing template: “Hey Brother, I came across your profile and would love to connect since I got the sense there may be a lot of alignment between us. I’d love to jump on a call with you sometime to see where our worlds overlap and how we might be able to help each other.”
  • Variations: “Hey Sister”, “Aloha [Name]”, “Hey [Name]”, “Hy [Name]” (typo variant)

Comments.csv (225 rows)

  • Columns: Date, Link, Message
  • Date format: “2026-03-11 02:14:50”
  • Date range: 2021-2026 (5 years)
  • This is the most authentic voice data. Examples:
    • “Wow! It must have been amazing to witness that kind of growth. What’s the main challenge to keep it growing at this rate?”
    • “Love it! Mindset is everything! I like to call this way of thinking ‘Pronoia’; the Universe is conspiring in your favor.”
    • “Claude bullshits me far more often so ‘are you sure? verify everything!’ has been one of my most used prompts!”

Shares.csv (258 rows)

  • Columns: Date, ShareLink, ShareCommentary, SharedUrl, MediaUrl, Visibility
  • Commentary field uses doubled double-quotes ("") for embedded quotes. Python’s csv.reader handles this natively per RFC 4180.
  • Topics: cybersecurity (password managers, zero-knowledge encryption), AI, community, career, philanthropy
  • Example commentary starts: “So it’s 2026 but I still see orgs and people who don’t use a password manager…”

Reactions.csv (602 rows)

  • Columns: Date, Type, Link
  • Distribution: LIKE 75%, EMPATHY 19%, PRAISE 4%, INTEREST 1.5%, ENTERTAINMENT 0.5%, APPRECIATION 0.2%
  • Date range: Apr 2022 - Mar 2026

Ad_Targeting.csv (1 data row, 34 columns, ~34 KB)

  • Each column cell contains semicolon-delimited lists. Parse by splitting on ; after CSV extraction.
  • Key columns and values:
    • Member Age: 25-34
    • Company Follower of: 100+ organizations including Rockefeller Foundation, Ford Foundation, Ewing Marion Kauffman Foundation, Y Combinator, OpenAI, The Nature Conservancy, Solanasis, Carson Wealth, TIFIN Wealth, Momentis Family Office, Altruist, Wealthbox, Nitrogen, DataKind
    • Member Groups: ~25 including Chronicle of Philanthropy, Emotional Intelligence Network, Philanthropy Network, Impact Investor networks, MAPS (psychedelics), Nonprofit fundraising
    • Member Interests: 500+ including Cloud Computing, Risk Management, Entrepreneurship, AI, Cybersecurity, Data Management, Venture Capital, Nonprofit sector, ESG, Impact Investing, Responsible AI, Social Enterprise
    • Member Traits: Active contributor/influencer
    • High Value Segments: Small Business Owners, Senior Tech Decision Makers, High Net Worth Individuals, Public Impact Leaders, Enterprise Builders, Millennials
    • Job Functions: Business Development
    • Company Revenue: Under $1M
    • Years of Experience: 12+

Inferences_about_you.csv (7 rows)

  • Columns: Category, Type of inference, Description, Inference
  • Inferences: HR professional=No, tech media interest=True, oil/energy interest=True, EV affinity=True, gender=Male, job seeking=True, active contributor/influencer=True

Company Follows.csv (299 rows)

  • Columns: Organization, Followed On
  • Strategically curated: wealth management firms, foundations, impact funds, philanthropy infrastructure, mental health orgs, community builders
  • Date range: Feb-Mar 2026 (recent bulk follows)

Member_Follows.csv (2,030 rows)

  • Columns: Date, Status, FullName
  • All “Active” status
  • C-suite coaches, impact investors, founders, nonprofit leaders, sustainability experts

Profile.csv (1 data row)

  • Columns: First Name, Last Name, Maiden Name, Address, Birth Date, Headline, Summary, Industry, Zip Code, Geo Location, Twitter Handles, Websites, Instant Messengers
  • Headline: “Less Luck. More Systems. | Founder @ Solanasis | Cybersecurity • Systems Integration • Responsible AI Implementation”
  • Industry: IT Services and IT Consulting
  • Location: Boulder, Colorado 80304

Positions.csv (7 positions)

  • Columns: Company Name, Title, Description, Location, Started On, Finished On
  • Career: InReach Solutions (Jul 2007-Nov 2021, 14yr, foster care/adoption) → Cohere COO (Nov 2021-Mar 2022) → Sage’d Founder (Mar 2022-Jan 2026) → Co-Nexus Community Co-op Founder (Nov 2023-present) → Wellnia Founder (Sep-Dec 2024) → Matchkeyz Founder (May 2025-Feb 2026) → Solanasis Founder & CEO (Feb 2026-present)

Skills.csv (11 skills): ERP Software, Cybersecurity, Healthcare IT, System Architecture, Project Management, Organizational Development, Operations Management, Software Architecture, Entrepreneurship, Nonprofit Organizations, Board of Directors

Other files:

  • Endorsement_Received_Info.csv (11 endorsements, mostly from Sarah Lena Hofer)
  • Events.csv (6: Networks Festival, Sociocracy Conference, gBETA Greeley, RegenWorld, Growing Cooperative Cultures, Self-Mastery Session)
  • Learning.csv (7 LinkedIn Learning courses: SOLIDWORKS, Inclusive Tech, Creating Maps with R, AI Agents for SMBs, Ageism, Visual Communication, Design Thinking/Social Innovation)
  • InstantReposts.csv (6 reposts)
  • Rich_Media.csv (6 photo uploads)
  • Ads Clicked.csv (5,852 ad clicks)
  • LAN Ads Engagement.csv (87 impressions on Roku, Pluto TV, etc.)
  • Logins.csv (22 login events)
  • Security Challenges.csv (117 challenges)
  • Registration.csv (account created Aug 17, 2007 — 18-year LinkedIn member)
  • Receipts_v2.csv (18 transactions, Sales Navigator Core @ $99.99/mo, Dec 2025-Mar 2026)
  • Email Addresses.csv (zasyatkin@gmail.com secondary, mr.sunshine@matchkeyz.com primary)
  • PhoneNumbers.csv (+1 503-544-2911)
  • SavedJobAlerts.csv (“founder” keyword, Denver, daily)
  • Jobs/Saved Jobs.csv (2 full-stack engineer roles at health tech companies)
  • Jobs/Job Seeker Preferences.csv (Portland, full-stack/software architect)

Research-Discovered Gold Nuggets

These are non-obvious insights from deep research that should inform the analysis scripts:

  1. ICP Validation via Company Follows: The 299 company follows heavily skew toward wealth management, foundations, impact funds, philanthropy infrastructure. This validates the declared ICP.
  2. Relationship Decay Detection: Cross-reference message dates with connection dates. Connections with past high engagement (5+ messages) that went silent (90+ days) are priority reactivation targets.
  3. Invitation Acceptance Rate: Cross-reference outgoing Invitations with Connections to calculate what % of invitations resulted in connections. Measures template effectiveness.
  4. Connection Growth Velocity: Plot connections by Connected On date to find networking surges (events, campaigns, job changes).
  5. Voice Evolution (5+ years): Comments span 2021-2026. Evolution: brief acknowledgments (2021) → curious questions (2023) → tech-forward/opinionated (2025) → consultative expert (2026).
  6. Ad Targeting Decode: LinkedIn classifies Dmitri as “Public Impact Leader,” “Small Business Owner,” “Senior Tech Decision Maker.” Member groups include Chronicle of Philanthropy, Impact Investor networks.
  7. Reaction Distribution: 75% Like, 19% Empathy — mostly observer posture with selective emotional engagement. Useful for voice mining (what triggers deeper reactions?).
  8. Job Seeker Signal: LinkedIn infers “active job seeker” (true) despite CEO role. Flag this in GTM report.

Existing Voice Profile (for validation)

Dmitri has a documented voice profile at solanasis-docs/solanasis-voice-profile.md. The voice analysis script should validate these patterns against actual LinkedIn data and discover NEW patterns:

Documented rules:

  • NO em dashes (—) — use semicolons, periods, parentheses, commas
  • NO “genuinely”, “seamless”, “frictionless”, “SMBs”, “audit”
  • NO “Furthermore”, “Moreover”, “Additionally” — use “So,” or jump-cut
  • NO “game-changer”, “next level”, “leverage” (verb), “synergy”
  • Semicolons as warm connectors (2+ per piece)
  • Parenthetical asides (1+ per piece)
  • Signature terms (2-3 per piece): “quiet failure”, “false comfort”, “drift”, “blind spot”, “risk debt”, “operational drag”, “hidden risk”, “brittle systems”, “messy handoffs”
  • Approved transitions: “So,”, “Which is why”, “Here’s the thing”, “Let’s face it”, “Imagine”
  • Max 3 sentences per paragraph

Voice learnings from outreach refinement (Appendix A of voice profile):

  • “Heya” is default casual greeting (not “Hey Brother” for all)
  • Extends warm openers with relatable cultural comments connected by semicolons
  • Soft hedges: “if you happen to know of” not “Do you know of”
  • Reciprocal closing: invites recipient to share expertise, not just do a favor
  • One ask per message
  • Autobiographical anchoring: “went back to my roots”
  • “By far” as authentic intensifier
  • No referral fees or “zone of genius” in warm outreach

Critical Technical Requirements

Encoding Gotcha

LinkedIn exports may use ANSI/Latin-1 encoding, NOT UTF-8. European characters, accented names, smart quotes will be garbled if read as UTF-8.

  • Use chardet library to auto-detect encoding on first read
  • Convert all outputs to plain UTF-8 (no BOM) for ERPNext/Baserow compatibility
  • For Excel-labeled CSVs, use utf-8-sig (BOM) separately

CSV Parsing

  • Use stdlib csv.reader for ALL Stage 1 parsers. Consistent, fewer dependencies. Always open with newline=''.
  • Pandas only for Stage 2 analysis scripts.
  • csv.reader handles doubled double-quotes ("") natively per RFC 4180.

Output CSV Robustness

  • All output CSVs: csv.writer with quoting=csv.QUOTE_ALL, plain UTF-8 encoding
  • Test that output opens cleanly in Excel AND imports to ERPNext without encoding errors

LinkedIn URL Normalization

This is the join key across ALL files. Normalize before any matching:

def normalize_linkedin_url(url: str) -> str:
    if not url:
        return ""
    url = url.strip().rstrip("/").lower()
    if not url.startswith("http"):
        url = "https://" + url
    url = url.split("?")[0].split("#")[0]
    return url

No External API Calls

Stages 1-2 are fully local/offline. No web requests, no API calls. All analysis uses the CSV files directly.


Existing Codebase Patterns

The project has 3 existing pipelines in solanasis-scripts/ with identical architecture. Use as templates:

  • foundation-pipeline/ (17 scripts, 62 tests) — most mature
  • msp-pipeline/ (13 scripts)
  • fcto-pipeline/ (9 scripts)

Pattern: config.py → parse → score → generate outreach → migrate to CRM

foundation-pipeline/config.py is the structural template. It defines:

  • Path constants (PIPELINE_DIR, DATA_DIR, RAW_DIR, INTERMEDIATE_DIR, OUTPUT_DIR)
  • Scoring model with positive/negative signals and point values
  • Tier thresholds (A: 30+, B: 20-29, C: 10-19, D: <10)
  • CRM output column definitions
  • Intermediate and output file path constants

All scripts use --plan (dry run, show what would happen) and --run (execute) modes where applicable.


Implementation Plan: 14 Scripts Across 3 Stages

Stage 1: Parse & Clean (CSV/JSON intermediaries)

All scripts go in solanasis-scripts/linkedin-pipeline/.

Script 1: config.py

  • LINKEDIN_EXPORT_DIR pointing to the Complete export
  • DMITRI_PROFILE_URL = "https://www.linkedin.com/in/dmitri-sunshine"
  • Path constants: DATA_DIR, INTERMEDIATE_DIR, OUTPUT_DIR, REPORT_DIR
  • Scoring weights for warmth and strategic value (see Script 8)
  • ICP keyword lists for segment tagging: foundation, nonprofit, wealth-mgmt, fractional-exec, tech, startup, coliving-community, boulder-local, connector, spiritual-wellness
  • Spam detection patterns (for flagging, not filtering)
  • Topic keyword lists for engagement tagging: ai/tech, community/coliving, spirituality/wellness, entrepreneurship, philanthropy/nonprofit, cybersecurity/ops
  • detect_encoding() helper using chardet

Script 2: privacy_filter.py

Utility module:

  • is_spam_message(content) — keyword heuristics (unsubscribe, free training, limited time, etc.). Flags but does NOT delete.
  • redact_pii_for_export(text) — optional redaction for any future external sharing (not used by default)
  • Full message text is preserved throughout the pipeline. Nothing is stripped or filtered out.

Script 3: parse_connections.py

  • Input: Connections.csv (skip first 3 lines, CSV header at line 4)
  • Output: output/connections_clean.csv
  • Logic: Normalize LinkedIn URLs, parse “11 Mar 2026” dates to ISO 8601, split name fields, handle credentials in names (e.g., "Beckmann, RFC®"), add connection_age_days computed column

Script 4: parse_messages.py

  • Input: messages.csv (use stdlib csv.reader, NOT pandas, NOT line counting)
  • Output: output/messages_parsed.json (grouped by conversation), output/message_stats_per_contact.csv
  • Logic:
    • Group by CONVERSATION ID, sort by date within each conversation
    • Identify Dmitri by SENDER PROFILE URL matching his profile URL
    • Per-contact stats: message count total/from-dmitri/to-dmitri, date range, direction ratio, response latency
    • Filter out drafts (IS MESSAGE DRAFT = “Yes”)
    • Flag spam messages (using privacy_filter.is_spam_message())
    • Extract greeting patterns from Dmitri’s sent messages (first line regex)

Script 5: parse_invitations.py

  • Input: Invitations.csv
  • Output: output/invitations_clean.csv
  • Logic:
    • Normalize URLs from both inviterProfileUrl and inviteeProfileUrl
    • Parse dates (“3/10/26, 1:02 PM” format) to ISO 8601
    • Classify direction (OUTGOING vs INCOMING)
    • Detect templates using difflib.SequenceMatcher (threshold 0.85)
    • The “Hey Brother” template is dominant; quantify usage count, variants, and personalization spots

Script 6: parse_engagement.py

  • Input: Comments.csv, Shares.csv, Reactions.csv, InstantReposts.csv
  • Output: output/engagement_activity.csv (unified timeline), output/engagement_by_topic.json
  • Logic:
    • Merge all engagement into single timeline with unified schema: date, activity_type, content_text, target_url, reaction_type, word_count, topic_tags
    • Extract text from comments and share commentary
    • Compute word count, sentence count for text content
    • Tag reaction types
    • Apply topic keyword lists to classify engagement by theme

Script 7: parse_profile_data.py

  • Input: Profile.csv, Positions.csv, Skills.csv, Endorsement_Received_Info.csv, Ad_Targeting.csv, Inferences_about_you.csv, Member_Follows.csv, Company Follows.csv, Events.csv, Learning.csv, Ads Clicked.csv, Registration.csv, Receipts_v2.csv
  • Output: output/profile_summary.json, output/ad_targeting_parsed.json, output/follows_and_interests.csv, output/career_timeline.json
  • Logic:
    • Parse Ad_Targeting’s 34 columns (semicolon-delimited lists within each cell)
    • Parse Inferences (7 rows)
    • Merge Member_Follows (2,030) + Company Follows (299) into single list with type column
    • Parse Positions into career timeline JSON
    • Parse Events and Learning for interest signals
    • Include Registration date for account maturity context

Script 8: build_unified_contacts.py (the critical merge)

  • Input: All Stage 1 outputs
  • Output: output/contacts_unified.csv, output/contacts_unified.json
  • Join key: normalized linkedin_url
  • Merge: Connections (base) + left-join Invitations + left-join Message stats + left-join Follows
  • Orphan handling: people in messages/invitations but NOT in connections → non_connection_interactions.csv

Warmth scoring (algorithmic):

  • +20 bidirectional messages exist
  • +15 if 5+ messages exchanged
  • +10 message exchange in last 30 days
  • +8 message exchange in last 90 days
  • +5 Dmitri sent invitation (proactive interest)
  • +5 they sent invitation (they sought Dmitri out)
  • +3 Dmitri follows them
  • -10 connected 1yr+ with zero messages
  • -5 only spam messages from this contact

Strategic value scoring:

  • +15 title matches ICP (CEO/COO/CTO/ED at nonprofits/foundations/growing orgs)
  • +10 company matches target vertical (foundation, nonprofit, wealth management, financial services)
  • +8 position is fractional C-suite (fCTO, fCOO, fCIO, fractional)
  • +5 company location includes Colorado/Boulder/Denver
  • +5 connector role (investor, advisor, community builder, board member)
  • +3 engaged recently (message or invitation in last 60 days)

Tiers: A (50+), B (30-49), C (15-29), D (<15)

Segment tags: nonprofit, foundation, wealth-mgmt, tech, startup, coliving-community, fractional-exec, boulder-local, connector, spiritual-wellness

Unified contact schema (one row per person, maximum data):

linkedin_url, first_name, last_name, full_name, email, company, position,
connected_on, connection_age_days,
invitation_direction, invitation_message, invitation_date, invitation_accepted,
message_count_total, message_count_from_dmitri, message_count_to_dmitri,
first_message_date, last_message_date, last_message_direction,
avg_response_time_hours, longest_conversation_length, conversation_count,
last_message_content_preview,
dmitri_follows_them, they_follow_dmitri,
warmth_score, strategic_value_score, relationship_tier,
relationship_decay_flag, days_since_last_contact,
segment_tags, icp_match_details, notes

Also produce per-contact conversation export (output/conversations/):

  • One JSON file per A/B-tier contact: full conversation threads with timestamps
  • For C/D-tier: summary JSON with metadata only + last 3 messages

Stage 2: Analysis Reports

Output to solanasis-docs/linkedin-analysis/ as markdown.

Script 9: analyze_network.py

Report: network-intelligence-report.md

  • Network composition: industry breakdown (keyword-based from company/position), role seniority distribution, geographic distribution
  • Top 25 most-engaged contacts (by warmth score) with commonalities
  • Network gaps vs. Solanasis target verticals: foundation execs, wealth management, MSPs, fractional CTOs. What % of network is in each vertical?
  • ICP validation: Cross-reference actual connection industries with declared ICP
  • Referral partner candidates: fractional execs/consultants/advisors/attorneys/accountants with high warmth + matching segments
  • Dormant gold: high strategic value but low warmth (connected, never messaged; target vertical people)
  • Relationship decay list: contacts with past high engagement (5+ messages) but no contact in 90+ days
  • Connection velocity: connections per week over time, identify networking surges
  • Invitation effectiveness: acceptance rate by template variant, by personalization level
  • Company Follows alignment: 299 followed companies mapped to Solanasis verticals

Script 10: analyze_gtm_alignment.py

Report: gtm-alignment-report.md

  • Attention audit: topic breakdown of comments/shares/reactions vs. Solanasis positioning needs
  • Ad Targeting decode: Parse all 34 columns; what LinkedIn thinks Dmitri is and what interests are inferred
  • Inferences analysis: Flag misalignment with Solanasis brand (e.g., job seeking=true)
  • Content engagement gap: quantify % of engagement on cybersecurity vs. community vs. AI vs. other
  • Warm introduction map: for each target vertical, list warmest contacts who could provide intros
  • Company Follows as ICP signal: 299 companies mapped to verticals
  • Member Groups inventory: list all ~25 groups, classify GTM relevance
  • Actionable recommendations: specific, numbered actions

Script 11: analyze_voice.py

Report: voice-analysis-report.md (Python-only, no LLM needed)

  • Comment voice profile (225 comments, 5 years):
    • Avg length, word count distribution, exclamation/question frequency
    • Top 30 words/phrases (TF-IDF), vocabulary richness (type-token ratio)
    • Intensifier usage: “by far”, “so”, “really”, “super”, “amazing”, “love it”
    • Softener usage: “happen to know”, “would love to”, “I think”, “I believe”
  • Share commentary voice (258 shares):
    • Similar metrics + topic framing analysis
    • Signature transitions: “Here’s the thing…”, “So…”, “Which is why…”
    • Pro-tip pattern detection
  • DM voice profile (~3,000 messages, Dmitri’s sent only):
    • Greeting distribution: “Hey Brother” X%, “Hey Sister” Y%, “Heya” Z%, “Aloha” W%
    • Sign-off patterns, meeting link inclusion rate
    • CTA patterns, response latency
  • Invitation template analysis (1,549 invitations):
    • Cluster using difflib.SequenceMatcher (0.85 threshold)
    • Each variant: verbatim text, usage count, date range, personalization spots
    • Acceptance rate per template (cross-ref Invitations → Connections)
    • Gender-specific greeting analysis
  • Voice validation against documented profile:
    • Em dash frequency (should be ~0), semicolon frequency, banned words scan
    • Compare against Appendix A findings
  • 5-year voice evolution (2021→2026):
    • 2021: Brief acknowledgments
    • 2023: Curious questions, learning posture
    • 2025: Tech-forward, opinionated
    • 2026: Consultative, expert-level, Solanasis-branded
    • Quantify: avg comment length, question ratio, vocabulary complexity by year
  • Sentiment analysis using VADER (nltk.sentiment.vader, purpose-built for short social text)

Script 12: generate_voice_enrichment_queue.py

  • Output: output/voice_llm_queue.json (prompt-ready samples for Claude analysis)
  • Selects: all 225 comments, all share commentaries, top 50 conversations (Dmitri’s messages only), all unique invitation templates
  • For future LLM-assisted analysis: tone classification, qualitative insights, voice profile extension

Stage 3: ERPNext Integration

Script 13: prepare_erpnext_import.py

  • Input: contacts_unified.csv
  • Output: output/erpnext_leads.csv, output/erpnext_contacts.csv
  • Mapping:
    • Connection with messages exchanged → Contact (linked to Lead)
    • Connection, no interaction → Lead (status: Open)
    • Connection in target segment → Lead (status: Qualified)
    • Non-connection → Lead (status: Open, lower priority)
  • --plan mode: print counts and samples, no file creation
  • --run mode: write ERPNext Data Import-ready CSVs (UTF-8, standard format per Frappe docs)
  • Full message text included for A/B-tier; metadata + last messages for C/D-tier

ERPNext import format (per Frappe docs):

  • Standard CSV upload via Data Import tool
  • Download template for Lead/Contact doctypes, fill in, upload
  • Must be UTF-8 encoded
  • ID column left blank for new records (auto-generated)
  • No hard row limit but recommend a few thousand at a time

Script 14: push_to_erpnext.py (stub only)

  • Placeholder with Frappe REST API patterns in comments
  • Auth env vars in .env.example: ERPNEXT_URL, ERPNEXT_API_KEY, ERPNEXT_API_SECRET
  • Will be fleshed out when ERPNext at db.zasage.us has API credentials confirmed

Directory Structure

solanasis-scripts/linkedin-pipeline/
    config.py
    privacy_filter.py
    parse_connections.py
    parse_messages.py
    parse_invitations.py
    parse_engagement.py
    parse_profile_data.py
    build_unified_contacts.py
    analyze_network.py
    analyze_gtm_alignment.py
    analyze_voice.py
    generate_voice_enrichment_queue.py
    prepare_erpnext_import.py
    push_to_erpnext.py
    requirements.txt
    .env.example
    data/
        intermediate/         # Working files (git-ignored)
        output/               # Final CSV/JSON (git-ignored)
    tests/
        test_parse_*.py
        test_privacy_filter.py
        test_scoring.py

solanasis-docs/linkedin-analysis/
    network-intelligence-report.md
    gtm-alignment-report.md
    voice-analysis-report.md

requirements.txt

chardet
python-dateutil
pandas>=2.0
nltk
httpx
python-dotenv

Implementation Order

Build in this order, testing each before proceeding:

  1. Phase A (foundation): config.py, privacy_filter.py, parse_connections.py
  2. Phase B (core parsers): parse_invitations.py, parse_engagement.py, parse_messages.py, parse_profile_data.py
  3. Phase C (the merge): build_unified_contacts.py — spot-check known contacts
  4. Phase D (analysis): analyze_network.py, analyze_gtm_alignment.py, analyze_voice.py, generate_voice_enrichment_queue.py
  5. Phase E (integration): prepare_erpnext_import.py, push_to_erpnext.py (stub)

Verification Checklist

  • parse_connections.py produces exactly 2,500 rows
  • parse_messages.py produces ~2,700-3,000 logical messages (NOT 9,110)
  • parse_invitations.py produces ~1,549 rows
  • parse_engagement.py produces ~1,085 rows (225 comments + 258 shares + 602 reactions merged)
  • build_unified_contacts.py spot-check: Ian Crafford, Tim Lockie, Kevin Roerty all present with correct merged data from connections + invitations + messages
  • Tier distribution is roughly Pareto (few A-tier, more B, most C/D)
  • Voice report shows “Hey Brother” as dominant invitation greeting
  • All output CSVs open cleanly in Excel and import to ERPNext without encoding errors
  • No PII (phone numbers, emails) leaked into git-tracked files
  • All intermediate/output directories are git-ignored