LinkedIn Data Mining Pipeline — Implementation Prompt

Usage: Feed this entire file as context to a new Claude Code session on any machine that has access to the LinkedIn data export and the solanasis-scripts repo. This is fully self-contained; no external plan files or memory needed.

Created: 2026-03-22 Senior Reviewed: Yes (all findings addressed)

Task

Build a complete LinkedIn data mining pipeline: 14 Python scripts that parse a LinkedIn data export (37 CSV files) into clean CSV/JSON intermediaries, generate GTM intelligence and voice analysis reports, and produce CRM-ready import files.

Who You’re Working For

Dmitri Zasage (aka Dmitri Sunshine on LinkedIn), CEO of Solanasis LLC, a fractional CIO/CSIO/COO firm based in Boulder, Colorado. One-person operation with 1099 contractors, growth hacking mindset.

Solanasis offerings: Security Assessments, Disaster Recovery Verification, Data Migrations, CRM Setup, Systems Integration, Responsible AI Implementation

Target market: Nonprofits, private foundations ( $5 M -$ 50M), growing organizations, wealth management firms

CRM systems: ERPNext at db.zasage.us (primary, custom), Baserow (temporary). Both accept standard CSV imports.

Three Goals

CRM Population — Parse 2,500 connections + message history + invitations into scored, tiered contact records ready for ERPNext/Baserow import
GTM Intelligence — Network composition analysis, ICP validation, attention audit, warm introduction mapping, referral partner identification
Voice/Style Mining — Extract writing patterns from 225 comments, 258 shares, ~3,000 messages, 1,549 invitations to validate and extend Dmitri’s documented voice profile

The LinkedIn Data Export

Location: Update this path for your machine:

C:\Users\zasya\Documents\_zasage\data-export\Complete_LinkedInDataExport_03-13-2026.zip\

Already extracted. 37 CSV files. Dmitri’s LinkedIn profile URL: https://www.linkedin.com/in/dmitri-sunshine

File Inventory with Schemas

Connections.csv (2,500 rows)

IMPORTANT: 3-line preamble (notes about email privacy) before actual CSV columns at line 4. Skip first 3 lines.
Columns: First Name, Last Name, URL, Email Address, Company, Position, Connected On
Date format: “11 Mar 2026”
Emails: sparse (~10% have them; LinkedIn privacy opt-out)
Names may include credentials: e.g., "Beckmann, RFC®" (quoted, with comma inside)

messages.csv (~3,000 logical rows; 9,110 raw lines due to multiline quoted content)

Columns: CONVERSATION ID, CONVERSATION TITLE, FROM, SENDER PROFILE URL, TO, RECIPIENT PROFILE URLS, DATE, SUBJECT, CONTENT, FOLDER, ATTACHMENTS, IS MESSAGE DRAFT, IS CONVERSATION DRAFT
Date format: “2026-03-11 23:09:42 UTC”
Content field: multiline, properly quoted per RFC 4180
Date range: 2010-2026 (16 years of history)
Use csv.reader (NOT line counting) to get true logical row count

Invitations.csv (1,549 rows)

Columns: From, To, Sent At, Message, Direction, inviterProfileUrl, inviteeProfileUrl
Date format: “3/10/26, 1:02 PM”
Direction: OUTGOING or INCOMING
Dominant outgoing template: “Hey Brother, I came across your profile and would love to connect since I got the sense there may be a lot of alignment between us. I’d love to jump on a call with you sometime to see where our worlds overlap and how we might be able to help each other.”
Variations: “Hey Sister”, “Aloha [Name]”, “Hey [Name]”, “Hy [Name]” (typo variant)

Comments.csv (225 rows)

Columns: Date, Link, Message
Date format: “2026-03-11 02:14:50”
Date range: 2021-2026 (5 years)
This is the most authentic voice data. Examples:
- “Wow! It must have been amazing to witness that kind of growth. What’s the main challenge to keep it growing at this rate?”
- “Love it! Mindset is everything! I like to call this way of thinking ‘Pronoia’; the Universe is conspiring in your favor.”
- “Claude bullshits me far more often so ‘are you sure? verify everything!’ has been one of my most used prompts!”

Shares.csv (258 rows)

Columns: Date, ShareLink, ShareCommentary, SharedUrl, MediaUrl, Visibility
Commentary field uses doubled double-quotes ("") for embedded quotes. Python’s csv.reader handles this natively per RFC 4180.
Topics: cybersecurity (password managers, zero-knowledge encryption), AI, community, career, philanthropy
Example commentary starts: “So it’s 2026 but I still see orgs and people who don’t use a password manager…”

Reactions.csv (602 rows)

Columns: Date, Type, Link
Distribution: LIKE 75%, EMPATHY 19%, PRAISE 4%, INTEREST 1.5%, ENTERTAINMENT 0.5%, APPRECIATION 0.2%
Date range: Apr 2022 - Mar 2026

Ad_Targeting.csv (1 data row, 34 columns, ~34 KB)

Each column cell contains semicolon-delimited lists. Parse by splitting on ; after CSV extraction.
Key columns and values:
- Member Age: 25-34
- Company Follower of: 100+ organizations including Rockefeller Foundation, Ford Foundation, Ewing Marion Kauffman Foundation, Y Combinator, OpenAI, The Nature Conservancy, Solanasis, Carson Wealth, TIFIN Wealth, Momentis Family Office, Altruist, Wealthbox, Nitrogen, DataKind
- Member Groups: ~25 including Chronicle of Philanthropy, Emotional Intelligence Network, Philanthropy Network, Impact Investor networks, MAPS (psychedelics), Nonprofit fundraising
- Member Interests: 500+ including Cloud Computing, Risk Management, Entrepreneurship, AI, Cybersecurity, Data Management, Venture Capital, Nonprofit sector, ESG, Impact Investing, Responsible AI, Social Enterprise
- Member Traits: Active contributor/influencer
- High Value Segments: Small Business Owners, Senior Tech Decision Makers, High Net Worth Individuals, Public Impact Leaders, Enterprise Builders, Millennials
- Job Functions: Business Development
- Company Revenue: Under $1M
- Years of Experience: 12+

Inferences_about_you.csv (7 rows)

Columns: Category, Type of inference, Description, Inference
Inferences: HR professional=No, tech media interest=True, oil/energy interest=True, EV affinity=True, gender=Male, job seeking=True, active contributor/influencer=True

Company Follows.csv (299 rows)

Columns: Organization, Followed On
Strategically curated: wealth management firms, foundations, impact funds, philanthropy infrastructure, mental health orgs, community builders
Date range: Feb-Mar 2026 (recent bulk follows)

Member_Follows.csv (2,030 rows)

Columns: Date, Status, FullName
All “Active” status
C-suite coaches, impact investors, founders, nonprofit leaders, sustainability experts

Profile.csv (1 data row)

Columns: First Name, Last Name, Maiden Name, Address, Birth Date, Headline, Summary, Industry, Zip Code, Geo Location, Twitter Handles, Websites, Instant Messengers
Headline: “Less Luck. More Systems. | Founder @ Solanasis | Cybersecurity • Systems Integration • Responsible AI Implementation”
Industry: IT Services and IT Consulting
Location: Boulder, Colorado 80304

Positions.csv (7 positions)

Columns: Company Name, Title, Description, Location, Started On, Finished On
Career: InReach Solutions (Jul 2007-Nov 2021, 14yr, foster care/adoption) → Cohere COO (Nov 2021-Mar 2022) → Sage’d Founder (Mar 2022-Jan 2026) → Co-Nexus Community Co-op Founder (Nov 2023-present) → Wellnia Founder (Sep-Dec 2024) → Matchkeyz Founder (May 2025-Feb 2026) → Solanasis Founder & CEO (Feb 2026-present)

Skills.csv (11 skills): ERP Software, Cybersecurity, Healthcare IT, System Architecture, Project Management, Organizational Development, Operations Management, Software Architecture, Entrepreneurship, Nonprofit Organizations, Board of Directors

Other files:

Endorsement_Received_Info.csv (11 endorsements, mostly from Sarah Lena Hofer)
Events.csv (6: Networks Festival, Sociocracy Conference, gBETA Greeley, RegenWorld, Growing Cooperative Cultures, Self-Mastery Session)
Learning.csv (7 LinkedIn Learning courses: SOLIDWORKS, Inclusive Tech, Creating Maps with R, AI Agents for SMBs, Ageism, Visual Communication, Design Thinking/Social Innovation)
InstantReposts.csv (6 reposts)
Rich_Media.csv (6 photo uploads)
Ads Clicked.csv (5,852 ad clicks)
LAN Ads Engagement.csv (87 impressions on Roku, Pluto TV, etc.)
Logins.csv (22 login events)
Security Challenges.csv (117 challenges)
Registration.csv (account created Aug 17, 2007 — 18-year LinkedIn member)
Receipts_v2.csv (18 transactions, Sales Navigator Core @ $99.99/mo, Dec 2025-Mar 2026)
Email Addresses.csv (zasyatkin@gmail.com secondary, mr.sunshine@matchkeyz.com primary)
PhoneNumbers.csv (+1 503-544-2911)
SavedJobAlerts.csv (“founder” keyword, Denver, daily)
Jobs/Saved Jobs.csv (2 full-stack engineer roles at health tech companies)
Jobs/Job Seeker Preferences.csv (Portland, full-stack/software architect)

Research-Discovered Gold Nuggets

These are non-obvious insights from deep research that should inform the analysis scripts:

ICP Validation via Company Follows: The 299 company follows heavily skew toward wealth management, foundations, impact funds, philanthropy infrastructure. This validates the declared ICP.
Relationship Decay Detection: Cross-reference message dates with connection dates. Connections with past high engagement (5+ messages) that went silent (90+ days) are priority reactivation targets.
Invitation Acceptance Rate: Cross-reference outgoing Invitations with Connections to calculate what % of invitations resulted in connections. Measures template effectiveness.
Connection Growth Velocity: Plot connections by Connected On date to find networking surges (events, campaigns, job changes).
Voice Evolution (5+ years): Comments span 2021-2026. Evolution: brief acknowledgments (2021) → curious questions (2023) → tech-forward/opinionated (2025) → consultative expert (2026).
Ad Targeting Decode: LinkedIn classifies Dmitri as “Public Impact Leader,” “Small Business Owner,” “Senior Tech Decision Maker.” Member groups include Chronicle of Philanthropy, Impact Investor networks.
Reaction Distribution: 75% Like, 19% Empathy — mostly observer posture with selective emotional engagement. Useful for voice mining (what triggers deeper reactions?).
Job Seeker Signal: LinkedIn infers “active job seeker” (true) despite CEO role. Flag this in GTM report.

Existing Voice Profile (for validation)

Dmitri has a documented voice profile at solanasis-docs/solanasis-voice-profile.md. The voice analysis script should validate these patterns against actual LinkedIn data and discover NEW patterns:

Documented rules:

NO em dashes (—) — use semicolons, periods, parentheses, commas
NO “genuinely”, “seamless”, “frictionless”, “SMBs”, “audit”
NO “Furthermore”, “Moreover”, “Additionally” — use “So,” or jump-cut
NO “game-changer”, “next level”, “leverage” (verb), “synergy”
Semicolons as warm connectors (2+ per piece)
Parenthetical asides (1+ per piece)
Signature terms (2-3 per piece): “quiet failure”, “false comfort”, “drift”, “blind spot”, “risk debt”, “operational drag”, “hidden risk”, “brittle systems”, “messy handoffs”
Approved transitions: “So,”, “Which is why”, “Here’s the thing”, “Let’s face it”, “Imagine”
Max 3 sentences per paragraph

Voice learnings from outreach refinement (Appendix A of voice profile):

“Heya” is default casual greeting (not “Hey Brother” for all)
Extends warm openers with relatable cultural comments connected by semicolons
Soft hedges: “if you happen to know of” not “Do you know of”
Reciprocal closing: invites recipient to share expertise, not just do a favor
One ask per message
Autobiographical anchoring: “went back to my roots”
“By far” as authentic intensifier
No referral fees or “zone of genius” in warm outreach

Critical Technical Requirements

Encoding Gotcha

LinkedIn exports may use ANSI/Latin-1 encoding, NOT UTF-8. European characters, accented names, smart quotes will be garbled if read as UTF-8.

Use chardet library to auto-detect encoding on first read
Convert all outputs to plain UTF-8 (no BOM) for ERPNext/Baserow compatibility
For Excel-labeled CSVs, use utf-8-sig (BOM) separately

CSV Parsing

Use stdlib csv.reader for ALL Stage 1 parsers. Consistent, fewer dependencies. Always open with newline=''.
Pandas only for Stage 2 analysis scripts.
csv.reader handles doubled double-quotes ("") natively per RFC 4180.

Output CSV Robustness

All output CSVs: csv.writer with quoting=csv.QUOTE_ALL, plain UTF-8 encoding
Test that output opens cleanly in Excel AND imports to ERPNext without encoding errors

LinkedIn URL Normalization

This is the join key across ALL files. Normalize before any matching:

def normalize_linkedin_url(url: str) -> str:
    if not url:
        return ""
    url = url.strip().rstrip("/").lower()
    if not url.startswith("http"):
        url = "https://" + url
    url = url.split("?")[0].split("#")[0]
    return url

No External API Calls

Stages 1-2 are fully local/offline. No web requests, no API calls. All analysis uses the CSV files directly.

Existing Codebase Patterns

The project has 3 existing pipelines in solanasis-scripts/ with identical architecture. Use as templates:

foundation-pipeline/ (17 scripts, 62 tests) — most mature
msp-pipeline/ (13 scripts)
fcto-pipeline/ (9 scripts)

Pattern: config.py → parse → score → generate outreach → migrate to CRM

foundation-pipeline/config.py is the structural template. It defines:

Path constants (PIPELINE_DIR, DATA_DIR, RAW_DIR, INTERMEDIATE_DIR, OUTPUT_DIR)
Scoring model with positive/negative signals and point values
Tier thresholds (A: 30+, B: 20-29, C: 10-19, D: <10)
CRM output column definitions
Intermediate and output file path constants

All scripts use --plan (dry run, show what would happen) and --run (execute) modes where applicable.

Implementation Plan: 14 Scripts Across 3 Stages

Stage 1: Parse & Clean (CSV/JSON intermediaries)

All scripts go in solanasis-scripts/linkedin-pipeline/.

Script 1: `config.py`

LINKEDIN_EXPORT_DIR pointing to the Complete export
DMITRI_PROFILE_URL = "https://www.linkedin.com/in/dmitri-sunshine"
Path constants: DATA_DIR, INTERMEDIATE_DIR, OUTPUT_DIR, REPORT_DIR
Scoring weights for warmth and strategic value (see Script 8)
ICP keyword lists for segment tagging: foundation, nonprofit, wealth-mgmt, fractional-exec, tech, startup, coliving-community, boulder-local, connector, spiritual-wellness
Spam detection patterns (for flagging, not filtering)
Topic keyword lists for engagement tagging: ai/tech, community/coliving, spirituality/wellness, entrepreneurship, philanthropy/nonprofit, cybersecurity/ops
detect_encoding() helper using chardet

Script 2: `privacy_filter.py`

Utility module:

is_spam_message(content) — keyword heuristics (unsubscribe, free training, limited time, etc.). Flags but does NOT delete.
redact_pii_for_export(text) — optional redaction for any future external sharing (not used by default)
Full message text is preserved throughout the pipeline. Nothing is stripped or filtered out.

Script 3: `parse_connections.py`

Input: Connections.csv (skip first 3 lines, CSV header at line 4)
Output: output/connections_clean.csv
Logic: Normalize LinkedIn URLs, parse “11 Mar 2026” dates to ISO 8601, split name fields, handle credentials in names (e.g., "Beckmann, RFC®"), add connection_age_days computed column

Script 4: `parse_messages.py`

Input: messages.csv (use stdlib csv.reader, NOT pandas, NOT line counting)
Output: output/messages_parsed.json (grouped by conversation), output/message_stats_per_contact.csv
Logic:
- Group by CONVERSATION ID, sort by date within each conversation
- Identify Dmitri by SENDER PROFILE URL matching his profile URL
- Per-contact stats: message count total/from-dmitri/to-dmitri, date range, direction ratio, response latency
- Filter out drafts (IS MESSAGE DRAFT = “Yes”)
- Flag spam messages (using privacy_filter.is_spam_message())
- Extract greeting patterns from Dmitri’s sent messages (first line regex)

Script 5: `parse_invitations.py`

Input: Invitations.csv
Output: output/invitations_clean.csv
Logic:
- Normalize URLs from both inviterProfileUrl and inviteeProfileUrl
- Parse dates (“3/10/26, 1:02 PM” format) to ISO 8601
- Classify direction (OUTGOING vs INCOMING)
- Detect templates using difflib.SequenceMatcher (threshold 0.85)
- The “Hey Brother” template is dominant; quantify usage count, variants, and personalization spots

Script 6: `parse_engagement.py`

Input: Comments.csv, Shares.csv, Reactions.csv, InstantReposts.csv
Output: output/engagement_activity.csv (unified timeline), output/engagement_by_topic.json
Logic:
- Merge all engagement into single timeline with unified schema: date, activity_type, content_text, target_url, reaction_type, word_count, topic_tags
- Extract text from comments and share commentary
- Compute word count, sentence count for text content
- Tag reaction types
- Apply topic keyword lists to classify engagement by theme

Script 7: `parse_profile_data.py`

Input: Profile.csv, Positions.csv, Skills.csv, Endorsement_Received_Info.csv, Ad_Targeting.csv, Inferences_about_you.csv, Member_Follows.csv, Company Follows.csv, Events.csv, Learning.csv, Ads Clicked.csv, Registration.csv, Receipts_v2.csv
Output: output/profile_summary.json, output/ad_targeting_parsed.json, output/follows_and_interests.csv, output/career_timeline.json
Logic:
- Parse Ad_Targeting’s 34 columns (semicolon-delimited lists within each cell)
- Parse Inferences (7 rows)
- Merge Member_Follows (2,030) + Company Follows (299) into single list with type column
- Parse Positions into career timeline JSON
- Parse Events and Learning for interest signals
- Include Registration date for account maturity context

Script 8: `build_unified_contacts.py` (the critical merge)

Input: All Stage 1 outputs
Output: output/contacts_unified.csv, output/contacts_unified.json
Join key: normalized linkedin_url
Merge: Connections (base) + left-join Invitations + left-join Message stats + left-join Follows
Orphan handling: people in messages/invitations but NOT in connections → non_connection_interactions.csv

Warmth scoring (algorithmic):

+20 bidirectional messages exist
+15 if 5+ messages exchanged
+10 message exchange in last 30 days
+8 message exchange in last 90 days
+5 Dmitri sent invitation (proactive interest)
+5 they sent invitation (they sought Dmitri out)
+3 Dmitri follows them
-10 connected 1yr+ with zero messages
-5 only spam messages from this contact

Strategic value scoring:

+15 title matches ICP (CEO/COO/CTO/ED at nonprofits/foundations/growing orgs)
+10 company matches target vertical (foundation, nonprofit, wealth management, financial services)
+8 position is fractional C-suite (fCTO, fCOO, fCIO, fractional)
+5 company location includes Colorado/Boulder/Denver
+5 connector role (investor, advisor, community builder, board member)
+3 engaged recently (message or invitation in last 60 days)

Tiers: A (50+), B (30-49), C (15-29), D (<15)

Segment tags: nonprofit, foundation, wealth-mgmt, tech, startup, coliving-community, fractional-exec, boulder-local, connector, spiritual-wellness

Unified contact schema (one row per person, maximum data):

linkedin_url, first_name, last_name, full_name, email, company, position,
connected_on, connection_age_days,
invitation_direction, invitation_message, invitation_date, invitation_accepted,
message_count_total, message_count_from_dmitri, message_count_to_dmitri,
first_message_date, last_message_date, last_message_direction,
avg_response_time_hours, longest_conversation_length, conversation_count,
last_message_content_preview,
dmitri_follows_them, they_follow_dmitri,
warmth_score, strategic_value_score, relationship_tier,
relationship_decay_flag, days_since_last_contact,
segment_tags, icp_match_details, notes

Also produce per-contact conversation export (output/conversations/):

One JSON file per A/B-tier contact: full conversation threads with timestamps
For C/D-tier: summary JSON with metadata only + last 3 messages

Stage 2: Analysis Reports

Output to solanasis-docs/linkedin-analysis/ as markdown.

Script 9: `analyze_network.py`

Report: network-intelligence-report.md

Network composition: industry breakdown (keyword-based from company/position), role seniority distribution, geographic distribution
Top 25 most-engaged contacts (by warmth score) with commonalities
Network gaps vs. Solanasis target verticals: foundation execs, wealth management, MSPs, fractional CTOs. What % of network is in each vertical?
ICP validation: Cross-reference actual connection industries with declared ICP
Referral partner candidates: fractional execs/consultants/advisors/attorneys/accountants with high warmth + matching segments
Dormant gold: high strategic value but low warmth (connected, never messaged; target vertical people)
Relationship decay list: contacts with past high engagement (5+ messages) but no contact in 90+ days
Connection velocity: connections per week over time, identify networking surges
Invitation effectiveness: acceptance rate by template variant, by personalization level
Company Follows alignment: 299 followed companies mapped to Solanasis verticals

Script 10: `analyze_gtm_alignment.py`

Report: gtm-alignment-report.md

Attention audit: topic breakdown of comments/shares/reactions vs. Solanasis positioning needs
Ad Targeting decode: Parse all 34 columns; what LinkedIn thinks Dmitri is and what interests are inferred
Inferences analysis: Flag misalignment with Solanasis brand (e.g., job seeking=true)
Content engagement gap: quantify % of engagement on cybersecurity vs. community vs. AI vs. other
Warm introduction map: for each target vertical, list warmest contacts who could provide intros
Company Follows as ICP signal: 299 companies mapped to verticals
Member Groups inventory: list all ~25 groups, classify GTM relevance
Actionable recommendations: specific, numbered actions

Script 11: `analyze_voice.py`

Report: voice-analysis-report.md (Python-only, no LLM needed)

Comment voice profile (225 comments, 5 years):
- Avg length, word count distribution, exclamation/question frequency
- Top 30 words/phrases (TF-IDF), vocabulary richness (type-token ratio)
- Intensifier usage: “by far”, “so”, “really”, “super”, “amazing”, “love it”
- Softener usage: “happen to know”, “would love to”, “I think”, “I believe”
Share commentary voice (258 shares):
- Similar metrics + topic framing analysis
- Signature transitions: “Here’s the thing…”, “So…”, “Which is why…”
- Pro-tip pattern detection
DM voice profile (~3,000 messages, Dmitri’s sent only):
- Greeting distribution: “Hey Brother” X%, “Hey Sister” Y%, “Heya” Z%, “Aloha” W%
- Sign-off patterns, meeting link inclusion rate
- CTA patterns, response latency
Invitation template analysis (1,549 invitations):
- Cluster using difflib.SequenceMatcher (0.85 threshold)
- Each variant: verbatim text, usage count, date range, personalization spots
- Acceptance rate per template (cross-ref Invitations → Connections)
- Gender-specific greeting analysis
Voice validation against documented profile:
- Em dash frequency (should be ~0), semicolon frequency, banned words scan
- Compare against Appendix A findings
5-year voice evolution (2021→2026):
- 2021: Brief acknowledgments
- 2023: Curious questions, learning posture
- 2025: Tech-forward, opinionated
- 2026: Consultative, expert-level, Solanasis-branded
- Quantify: avg comment length, question ratio, vocabulary complexity by year
Sentiment analysis using VADER (nltk.sentiment.vader, purpose-built for short social text)

Script 12: `generate_voice_enrichment_queue.py`

Output: output/voice_llm_queue.json (prompt-ready samples for Claude analysis)
Selects: all 225 comments, all share commentaries, top 50 conversations (Dmitri’s messages only), all unique invitation templates
For future LLM-assisted analysis: tone classification, qualitative insights, voice profile extension

Stage 3: ERPNext Integration

Script 13: `prepare_erpnext_import.py`

Input: contacts_unified.csv
Output: output/erpnext_leads.csv, output/erpnext_contacts.csv
Mapping:
- Connection with messages exchanged → Contact (linked to Lead)
- Connection, no interaction → Lead (status: Open)
- Connection in target segment → Lead (status: Qualified)
- Non-connection → Lead (status: Open, lower priority)
--plan mode: print counts and samples, no file creation
--run mode: write ERPNext Data Import-ready CSVs (UTF-8, standard format per Frappe docs)
Full message text included for A/B-tier; metadata + last messages for C/D-tier

ERPNext import format (per Frappe docs):

Standard CSV upload via Data Import tool
Download template for Lead/Contact doctypes, fill in, upload
Must be UTF-8 encoded
ID column left blank for new records (auto-generated)
No hard row limit but recommend a few thousand at a time

Script 14: `push_to_erpnext.py` (stub only)

Placeholder with Frappe REST API patterns in comments
Auth env vars in .env.example: ERPNEXT_URL, ERPNEXT_API_KEY, ERPNEXT_API_SECRET
Will be fleshed out when ERPNext at db.zasage.us has API credentials confirmed

Directory Structure

solanasis-scripts/linkedin-pipeline/
    config.py
    privacy_filter.py
    parse_connections.py
    parse_messages.py
    parse_invitations.py
    parse_engagement.py
    parse_profile_data.py
    build_unified_contacts.py
    analyze_network.py
    analyze_gtm_alignment.py
    analyze_voice.py
    generate_voice_enrichment_queue.py
    prepare_erpnext_import.py
    push_to_erpnext.py
    requirements.txt
    .env.example
    data/
        intermediate/         # Working files (git-ignored)
        output/               # Final CSV/JSON (git-ignored)
    tests/
        test_parse_*.py
        test_privacy_filter.py
        test_scoring.py

solanasis-docs/linkedin-analysis/
    network-intelligence-report.md
    gtm-alignment-report.md
    voice-analysis-report.md

requirements.txt

chardet
python-dateutil
pandas>=2.0
nltk
httpx
python-dotenv

Implementation Order

Build in this order, testing each before proceeding:

Phase A (foundation): config.py, privacy_filter.py, parse_connections.py
Phase B (core parsers): parse_invitations.py, parse_engagement.py, parse_messages.py, parse_profile_data.py
Phase C (the merge): build_unified_contacts.py — spot-check known contacts
Phase D (analysis): analyze_network.py, analyze_gtm_alignment.py, analyze_voice.py, generate_voice_enrichment_queue.py
Phase E (integration): prepare_erpnext_import.py, push_to_erpnext.py (stub)

Solanasis Docs

Explorer

linkedin-pipeline-continuation-prompt

LinkedIn Data Mining Pipeline — Implementation Prompt

Task

Who You’re Working For

Three Goals

The LinkedIn Data Export

File Inventory with Schemas

Research-Discovered Gold Nuggets

Existing Voice Profile (for validation)

Critical Technical Requirements

Encoding Gotcha

CSV Parsing

Output CSV Robustness

LinkedIn URL Normalization

No External API Calls

Existing Codebase Patterns

Implementation Plan: 14 Scripts Across 3 Stages

Stage 1: Parse & Clean (CSV/JSON intermediaries)

Script 1: config.py

Script 2: privacy_filter.py

Script 3: parse_connections.py

Script 4: parse_messages.py

Script 5: parse_invitations.py

Script 6: parse_engagement.py

Script 7: parse_profile_data.py

Script 8: build_unified_contacts.py (the critical merge)

Stage 2: Analysis Reports

Script 9: analyze_network.py

Script 10: analyze_gtm_alignment.py

Script 11: analyze_voice.py

Script 12: generate_voice_enrichment_queue.py