LinkedIn Data Mining Pipeline

Context

Dmitri has a complete LinkedIn data export (37 files, ~22K rows total) that contains untapped gold for three purposes: populating the CRM with 2,500 contacts and relationship history, extracting GTM intelligence about network composition and engagement patterns, and mining voice/style data from 225 comments, 258 shares, ~3,000 messages, and 1,549 invitation messages to refine the existing voice profile.

The approach is staged: CSV/JSON intermediaries first (system-agnostic), analysis reports second, ERPNext push third. This lets us validate data quality at each step and keeps outputs portable across systems.


What We’re Working With

FileRowsGTM ValueCRM ValueVoice Value
Connections.csv2,500Network compositionPrimary contact source-
messages.csv~3,000Relationship intelInteraction historyDM style, greetings, CTAs
Invitations.csv1,549Outreach patternsConnection contextTemplates, personalization
Comments.csv225Topic interests-Short-form authentic voice
Shares.csv258Content strategy-Opinions, framing, commentary
Reactions.csv602Attention mapping-Interest signals
Ad_Targeting.csv1 row, 34 cols (~34 KB)Psychographic profile--
Member_Follows.csv2,030Interest signals--
Profile/Positions/Skills~25Self-assessmentProfile data-
Inferences_about_you.csv8LinkedIn’s perception--

Source: C:\Users\zasya\Documents\_zasage\data-export\Complete_LinkedInDataExport_03-13-2026.zip\

Additional High-Value Files (discovered in deep exploration)

FileRowsValue
Company Follows.csv299ICP validation: Rockefeller Foundation, Ford Foundation, Kauffman Foundation, Carson Wealth, TIFIN Wealth, Momentis Family Office, Altruist, Wealthbox, Nitrogen, DataKind + 290 more
Ad_Targeting.csv1 (34 cols)500+ interests, 100+ company associations, 25+ member groups (Chronicle of Philanthropy, Impact Investor networks), high-value segments (Small Business Owners, Senior Tech Decision Makers, Public Impact Leaders)
Ads Clicked.csv5,852Ad interest signals; mirrors prospect ad behavior
Positions.csv7Career narrative: InReach (14yr) → Cohere COO → Wellnia → Matchkeyz → Sage’d → Co-Nexus → Solanasis
Events.csv6Networks Festival, Sociocracy Conference, gBETA Greeley, RegenWorld
Learning.csv7AI Agents, Design Thinking, Visual Communication (sampling, not completing)
Logins/Security140Device/IP/location patterns
Receipts_v2.csv18Sales Navigator @ $99.99/mo (active Dec 2025-Mar 2026)
Registration.csv1Account created Aug 17, 2007 (18-year member)

Research-Discovered Gold Nuggets

  1. ICP Validation via Company Follows: 299 strategically curated company follows heavily skew toward wealth management, foundations, impact funds, philanthropy infrastructure. This IS the ICP proof.
  2. Relationship Decay Detection: Cross-reference message dates with connection dates. Connections with past high engagement that went silent are priority reactivation targets.
  3. Invitation Acceptance Rate: Cross-reference Invitations (outgoing) with Connections to calculate acceptance rate. Tells us how effective the “Hey Brother” template actually is.
  4. Connection Growth Velocity: Plot connections by Connected On date to find networking surges (events, job changes, campaigns).
  5. Voice Evolution (5 years): Comments span 2021-2026. Early: brief acknowledgments (“Superman”). Mid: curious questions. Recent: consultative expert commentary. Shows clear voice maturation.
  6. Ad Targeting Decode: LinkedIn classifies Dmitri as “Public Impact Leader,” “Small Business Owner,” “Senior Tech Decision Maker.” Company follows include major foundations. Member groups include Chronicle of Philanthropy, Impact Investor networks.
  7. Reaction Distribution: 75% Like, 19% Empathy, 4% Praise, 1.5% Interest. Mostly observer posture; selective emotional engagement.
  8. Message Date Range: 2010 - Mar 2026 (16 years of history). Voice evolution analysis from messages can go much further back than comments (which start 2021).
  9. Job Seeker Signal: LinkedIn infers “active job seeker” (true) despite CEO role. Suggests exploratory mindset or stale preferences. Consider turning this off.

CSV Encoding Gotcha (Critical)

Research found LinkedIn exports may use ANSI/Latin-1 encoding, NOT UTF-8. European characters, accented names, smart quotes will be garbled if read as UTF-8. All parsers must:

  • Use chardet to auto-detect encoding on first read
  • Convert to UTF-8 for all outputs
  • ERPNext requires UTF-8 for import

Existing Tools Assessment

No existing “LinkedIn export parser” library exists on PyPI or GitHub. Existing tools are all scrapers (Selenium/BeautifulSoup), not export parsers. We build our own, which is fine since the CSV schemas are straightforward.

ERPNext Data Import Format

Per Frappe docs:

  • Standard CSV upload via Data Import tool
  • Download template for Lead/Contact doctypes, fill in, upload
  • Must be UTF-8 encoded (especially for non-ASCII names)
  • No hard row limit but recommend “a few thousand at a time”
  • ID column left blank for new records (auto-generated)
  • Can also produce Baserow-compatible CSVs (simpler flat schema)

Stage 1: Parse & Clean (CSV/JSON intermediaries)

All scripts go in solanasis-scripts/linkedin-pipeline/. Follow existing pipeline conventions (config.py, —plan/—run modes).

Script 1: config.py

Pattern from foundation-pipeline/config.py (line-for-line reference: solanasis-scripts/foundation-pipeline/config.py):

  • LINKEDIN_EXPORT_DIR pointing to the Complete export
  • DMITRI_PROFILE_URL = "https://www.linkedin.com/in/dmitri-sunshine"
  • Path constants: DATA_DIR, INTERMEDIATE_DIR, OUTPUT_DIR, REPORT_DIR
  • Scoring weights for warmth and strategic value (see Script 8)
  • ICP keyword lists for segment tagging (foundation, nonprofit, wealth-mgmt, fractional-exec, etc.)
  • Spam detection patterns (for flagging, not filtering)
  • Topic keyword lists for engagement tagging (ai/tech, community, spirituality, entrepreneurship, philanthropy, cybersecurity)
  • detect_encoding() helper using chardet (LinkedIn CSVs may be ANSI/Latin-1)

Script 2: privacy_filter.py

Utility module:

  • is_spam_message() - keyword heuristics (unsubscribe, free training, limited time, etc.) - flags but does NOT delete
  • redact_pii_for_export() - optional redaction for any future external sharing (not used by default)
  • Full message text is preserved throughout the pipeline; nothing is stripped or filtered out

Script 3: parse_connections.py

  • Input: Connections.csv (note: 3-line header before actual CSV columns at line 4)
  • Output: output/connections_clean.csv
  • Logic: Normalize LinkedIn URLs (lowercase, strip trailing /), parse dates to ISO, split name fields, handle credentials in names (e.g., “Beckmann, RFC(R)”), add connection_age_days

Script 4: parse_messages.py

  • Input: messages.csv (multiline content in quoted fields; use stdlib csv.reader, NOT pandas)
  • Output: output/messages_parsed.json (grouped by conversation), output/message_stats_per_contact.csv
  • Logic: Group by CONVERSATION ID, identify Dmitri as anchor by profile URL, compute per-contact stats (message count, direction ratio, date range, response latency), filter drafts, flag spam, extract greeting patterns from Dmitri’s sent messages

Script 5: parse_invitations.py

  • Input: Invitations.csv
  • Output: output/invitations_clean.csv
  • Logic: Normalize URLs, parse dates (“3/10/26, 1:02 PM” format), classify direction, detect templates using difflib.SequenceMatcher (threshold 0.85) - the “Hey Brother” template is dominant and should be quantified

Script 6: parse_engagement.py

  • Input: Comments.csv, Shares.csv (tricky: doubled double-quotes in commentary), Reactions.csv, InstantReposts.csv
  • Output: output/engagement_activity.csv (unified timeline), output/engagement_by_topic.json
  • Logic: Merge all into single timeline, extract text, compute word/sentence counts, tag reaction types, apply topic keyword lists (ai/tech, community/coliving, spirituality, entrepreneurship, philanthropy/nonprofit, cybersecurity)

Script 7: parse_profile_data.py

  • Input: Profile.csv, Positions.csv, Skills.csv, Endorsement_Received_Info.csv, Ad_Targeting.csv, Inferences_about_you.csv, Member_Follows.csv, Company Follows.csv, Events.csv, Learning.csv, Ads Clicked.csv, Registration.csv, Receipts_v2.csv
  • Output: output/profile_summary.json, output/ad_targeting_parsed.json, output/follows_and_interests.csv, output/career_timeline.json
  • Logic:
    • Parse Ad_Targeting’s 34 columns (semicolon-delimited lists within each column cell). Extract: Member Age, Buyer Groups, Company Names, Company Follows (100+), Industries, Interests (500+), Member Traits, High Value Segments, Job Functions, Skills, Standard Audience Segments
    • Parse Inferences (7 rows: HR professional=No, tech media=True, oil/energy=True, EV affinity=True, gender=Male, job seeking=True, active contributor=True)
    • Merge Member_Follows (2,030) + Company Follows (299) into single list with type column
    • Parse Positions into career timeline JSON with date ranges and descriptions
    • Parse Events for interest signals (sociocracy, regenerative systems, community building)
    • Parse Learning for curiosity signals (AI agents, design thinking, visual communication)
    • Include Registration date (Aug 2007, 18-year member) for account maturity context

Script 8: build_unified_contacts.py (the critical merge)

  • Input: All Stage 1 outputs
  • Output: output/contacts_unified.csv, output/contacts_unified.json
  • Join key: normalized linkedin_url
  • Merge: Connections (base) + left-join Invitations + left-join Message stats + left-join Follows
  • Orphan handling: people in messages/invitations but NOT in connections go to non_connection_interactions.csv
  • Warmth scoring (algorithmic):
    • +20 bidirectional messages, +15 if 5+ messages, +10 message in last 30d, +8 in last 90d
    • +5 Dmitri sent invitation (proactive), +5 they sent invitation, +3 Dmitri follows them
    • -10 connected 1yr+ with zero messages, -5 only spam messages
  • Strategic value scoring:
    • +15 title matches ICP (CEO/COO/CTO/ED at nonprofits/foundations/growing orgs)
    • +10 company matches target vertical, +8 fractional C-suite, +5 Colorado/Boulder/Denver
    • +5 connector role (investor, advisor, board member), +3 engaged recently
  • Tier: A (50+), B (30-49), C (15-29), D (<15)
  • Segment tags: nonprofit, foundation, wealth-mgmt, tech, startup, coliving-community, fractional-exec, boulder-local, connector, spiritual-wellness

Unified contact schema (one row per person, maximum data):

linkedin_url, first_name, last_name, full_name, email, company, position,
connected_on, connection_age_days,
invitation_direction, invitation_message, invitation_date, invitation_accepted,
message_count_total, message_count_from_dmitri, message_count_to_dmitri,
first_message_date, last_message_date, last_message_direction,
avg_response_time_hours, longest_conversation_length, conversation_count,
last_message_content_preview,
dmitri_follows_them, they_follow_dmitri,
warmth_score, strategic_value_score, relationship_tier,
relationship_decay_flag, days_since_last_contact,
segment_tags, icp_match_details, notes

Also produce per-contact conversation export (output/conversations/):

  • One JSON file per A/B-tier contact: full conversation threads with timestamps
  • For C/D-tier: summary JSON with metadata only + last 3 messages

Stage 2: Analysis Reports

Output to solanasis-docs/linkedin-analysis/ as markdown.

Script 9: analyze_network.py

Report: network-intelligence-report.md

  • Network composition: industry breakdown (keyword-based from company/position), role seniority distribution, geographic distribution
  • Top 25 most-engaged contacts (by warmth score) with commonalities
  • Network gaps vs. Solanasis target verticals: foundation execs, wealth management, MSPs, fractional CTOs. What % of network is in each vertical? Where is it thin?
  • ICP validation: Cross-reference actual connection industries with declared ICP. If <5% are nonprofit/foundation people, the network doesn’t support the GTM
  • Referral partner candidates: fractional execs/consultants/advisors/attorneys/accountants with high warmth + matching segments
  • Dormant gold: high strategic value but low warmth (connected, never messaged; target vertical people)
  • Relationship decay list: contacts with past high engagement (5+ messages) but no contact in 90+ days
  • Connection velocity: connections per week over time, identify networking surges (events, campaigns)
  • Invitation effectiveness: acceptance rate by template variant, by personalization level
  • Company Follows alignment: 299 followed companies mapped to Solanasis verticals

Script 10: analyze_gtm_alignment.py

Report: gtm-alignment-report.md

  • Attention audit: topic breakdown of comments/shares/reactions vs. Solanasis positioning needs. Currently engaging on: AI, community, spirituality, entrepreneurship. Needs more: cybersecurity, compliance, disaster recovery, operational resilience
  • Ad Targeting decode: Parse all 34 columns. What LinkedIn thinks Dmitri is: Public Impact Leader, Small Business Owner, Senior Tech Decision Maker. What interests are inferred. What member groups he’s in (Chronicle of Philanthropy, Impact Investor networks)
  • Inferences analysis: LinkedIn says: active contributor, tech media interest, job seeking (true), EV affinity. Flag misalignment with Solanasis brand
  • Content engagement gap: quantify % of engagement on cybersecurity vs. community vs. AI vs. other. Compare with desired positioning
  • Warm introduction map: for each target vertical (foundations, wealth mgmt, MSPs, fCTOs), list the warmest contacts who could provide intros
  • Company Follows as ICP signal: 299 companies mapped to verticals. Which followed companies align with each pipeline (foundation, MSP, fCTO)?
  • Member Groups inventory: list all ~25 groups from Ad_Targeting, classify relevance to Solanasis GTM
  • Actionable recommendations: specific, numbered actions (e.g., “Join 3 cybersecurity-focused LinkedIn groups,” “Increase security-related comments from X% to 25%“)

Script 11: analyze_voice.py

Report: voice-analysis-report.md (Python-only, no LLM needed)

  • Comment voice profile (225 comments, 5 years of data):
    • Avg length, word count distribution, exclamation frequency, question frequency
    • Top 30 words/phrases (TF-IDF), vocabulary richness (type-token ratio)
    • Intensifier usage: “by far”, “so”, “really”, “super”, “amazing”, “love it”
    • Softener usage: “happen to know”, “would love to”, “I think”, “I believe”
  • Share commentary voice (258 shares):
    • Similar metrics + topic framing analysis
    • Signature transitions: “Here’s the thing…”, “So…”, “Which is why…”
    • Pro-tip pattern detection (educational framing)
  • DM voice profile (~3,000 messages, Dmitri’s sent messages only):
    • Greeting distribution: “Hey Brother” X%, “Hey Sister” Y%, “Heya” Z%, “Aloha” W%, “Hey [Name]” V%
    • Sign-off patterns, meeting link inclusion rate (go.solanasis.com/meet)
    • CTA patterns: question-ending, call-booking, group-asking
    • Response latency: how fast does Dmitri reply?
  • Invitation template analysis (1,549 invitations):
    • Cluster templates using difflib.SequenceMatcher (0.85 threshold)
    • Show each template variant with: verbatim text, usage count, date range, personalization spots
    • Calculate acceptance rate per template (cross-ref Invitations → Connections)
    • Gender-specific greeting analysis (Brother vs Sister vs name-only)
  • Voice validation against documented profile (solanasis-voice-profile.md):
    • Em dash frequency (should be ~0), semicolon frequency, banned words scan
    • Comparison against Appendix A findings (Heya vs Hey Brother, soft hedges, reciprocal closing)
  • 5-year voice evolution (2021→2026):
    • 2021: Brief acknowledgments (“Superman”)
    • 2023: Curious questions, learning posture
    • 2025: Tech-forward, opinionated (Claude AI commentary)
    • 2026: Consultative, expert-level, Solanasis-branded
    • Quantify: avg comment length by year, question ratio by year, vocabulary complexity by year
  • Sentiment analysis using VADER (nltk.sentiment.vader, purpose-built for short social media text): polarity and subjectivity trends over time

Script 12: generate_voice_enrichment_queue.py

  • Output: output/voice_llm_queue.json (prompt-ready samples for Claude analysis)
  • Selects: all 225 comments, all share commentaries, top 50 conversations (Dmitri’s messages only), all unique invitation templates
  • For future LLM-assisted analysis: tone classification, qualitative insights, voice profile extension

Stage 3: ERPNext Integration

Script 13: prepare_erpnext_import.py

  • Input: contacts_unified.csv
  • Output: output/erpnext_leads.csv, output/erpnext_contacts.csv
  • Mapping:
    • Connection with messages exchanged → Contact (linked to Lead)
    • Connection, no interaction → Lead (status: Open)
    • Connection in target segment → Lead (status: Qualified)
    • Non-connection → Lead (status: Open, lower priority)
  • --plan mode: print counts and samples, no file creation
  • --run mode: write ERPNext Data Import-ready CSVs
  • Full message text included in conversation exports for A/B-tier contacts; metadata + last messages for C/D-tier

Script 14: push_to_erpnext.py (stub only - implemented when API access confirmed)

  • Placeholder with Frappe REST API patterns documented in comments
  • Auth env vars defined in .env.example: ERPNEXT_URL, ERPNEXT_API_KEY, ERPNEXT_API_SECRET
  • Will be fleshed out once ERPNext at db.zasage.us has API credentials and schema confirmed

Directory Structure

solanasis-scripts/linkedin-pipeline/
    config.py
    privacy_filter.py
    parse_connections.py
    parse_messages.py
    parse_invitations.py
    parse_engagement.py
    parse_profile_data.py
    build_unified_contacts.py
    analyze_network.py
    analyze_gtm_alignment.py
    analyze_voice.py
    generate_voice_enrichment_queue.py
    prepare_erpnext_import.py
    push_to_erpnext.py
    requirements.txt          # python-dateutil, pandas (Stage 2 only), httpx, python-dotenv
    .env.example
    data/
        intermediate/         # Working files (git-ignored)
        output/               # Final CSV/JSON (git-ignored)
    tests/
        test_parse_*.py
        test_privacy_filter.py
        test_scoring.py

solanasis-docs/linkedin-analysis/
    network-intelligence-report.md
    gtm-alignment-report.md
    voice-analysis-report.md

Implementation Order

Phase A (foundation): config.py, privacy_filter.py, parse_connections.py Phase B (core parsers): parse_invitations.py, parse_engagement.py, parse_messages.py, parse_profile_data.py Phase C (the merge): build_unified_contacts.py - spot-check known contacts (Ian Crafford, Tim Lockie, Kevin Roerty) Phase D (analysis): analyze_network.py, analyze_gtm_alignment.py, analyze_voice.py, generate_voice_enrichment_queue.py Phase E (integration): prepare_erpnext_import.py, push_to_erpnext.py


Key Technical Notes

  • Encoding: LinkedIn CSVs may be ANSI/Latin-1, not UTF-8. Use chardet.detect() on first read, convert all outputs to UTF-8. Add chardet to requirements.
  • CSV parsing: Use stdlib csv.reader for ALL Stage 1 parsers (consistent, fewer dependencies). Pandas only for Stage 2 analysis scripts. Always open with newline='' per Python csv module docs.
  • Connections.csv header: 3-line preamble (notes about email privacy) before actual column headers at line 4. Skip first 3 lines manually with csv.reader.
  • Shares.csv quoting: Commentary contains doubled double-quotes ("") for embedded quotes. Python’s csv.reader handles doubled double-quotes natively per RFC 4180 (no special dialect needed).
  • Ad_Targeting.csv: 34 columns, 1 data row, ~34K of data. Each cell contains semicolon-delimited lists. Parse by splitting on ; after CSV extraction.
  • LinkedIn URL normalization: lowercase, strip trailing /, strip query params/fragments, ensure https:// prefix. This is the join key across ALL files.
  • Output CSV robustness: All output CSVs must use csv.writer with quoting=csv.QUOTE_ALL and plain UTF-8 encoding (no BOM) for ERPNext/Baserow import. For any CSVs explicitly labeled for Excel use, add BOM (utf-8-sig) separately.
  • Raw LinkedIn export stays in _zasage/data-export/: never copied into pipeline dir or committed to git
  • All intermediate/output files git-ignored
  • requirements.txt: chardet, python-dateutil, pandas>=2.0, nltk, httpx, python-dotenv
  • No external API calls in Stage 1-2: All analysis is local, offline, using the CSV files directly

Verification

  1. Row count validation: Each parser should log input rows vs. output rows (expect 2,500 connections, ~2,700-3,000 logical messages (9,110 raw CSV lines including multiline content), ~1,549 invitations, ~1,085 engagement items)
  2. Spot-check known contacts: Ian Crafford, Tim Lockie, Kevin Roerty should appear in unified contacts with correct merge of connection + invitation + message data
  3. Tier distribution: Should be roughly Pareto (few A-tier, more B, most C/D)
  4. Voice report validation: Cross-check greeting distribution against what we see in raw data (e.g., “Hey Brother” is clearly dominant in invitations)
  5. Analysis reports: Review markdown reports for actionable insights before ERPNext push

Decisions (confirmed with Dmitri)

  1. ERPNext: CSV-only for now. No API push script; just import-ready CSVs. push_to_erpnext.py becomes a stub/placeholder.
  2. Data scope: Full messages, full data, as much as possible. No filtering to business-only. Full message text preserved in intermediate outputs and available for analysis.
  3. Enrichment: LinkedIn URL + name + company is sufficient. No Apollo enrichment step needed; can always enrich later.
  4. Scope: Build all 14 scripts in one pass.
  5. Execution: Run on a separate server (not local dev machine) due to processing power concerns.

Senior Reviewer Findings (all addressed)

Status: APPROVED WITH NOTES (all fixed)

  1. SHOULD-FIX (FIXED): Message count corrected from 9,109 to ~3,000 logical rows (multiline content inflates raw line count)
  2. SHOULD-FIX (FIXED): Message date range corrected to 2010-2026 (16 years, not 14 months)
  3. SHOULD-FIX (FIXED): Connections row count corrected to 2,500
  4. SHOULD-FIX (FIXED): CSV quoting note corrected (RFC 4180 native handling, not QUOTE_ALL)
  5. NICE-TO-HAVE (FIXED): Switched from TextBlob to VADER for sentiment (purpose-built for short social text)
  6. NICE-TO-HAVE (FIXED): Standardized csv.reader for all Stage 1 parsers
  7. NICE-TO-HAVE (FIXED): Plain UTF-8 for CRM imports, BOM only for Excel-labeled files
  8. NICE-TO-HAVE (FIXED): Ad_Targeting row count notation clarified


CONTINUATION PROMPT (for executing on another server)

Copy everything below this line into a new Claude Code session on the target server. This prompt is self-contained and references the plan above.


Task: Build LinkedIn Data Mining Pipeline

You are implementing a LinkedIn data mining pipeline for Solanasis LLC. The complete plan is in this file above the “CONTINUATION PROMPT” heading. Read the entire plan first.

Who You’re Working For

Dmitri Zasage (aka Dmitri Sunshine), CEO of Solanasis LLC, a fractional CIO/CSIO/COO firm based in Boulder, Colorado targeting nonprofits, foundations, and growing organizations. Solanasis offers: Security Assessments, Disaster Recovery Verification, Data Migrations, CRM Setup, Systems Integration, Responsible AI Implementation.

What You’re Building

14 Python scripts that parse Dmitri’s LinkedIn data export (37 CSV files) into:

  1. Clean CSV/JSON intermediaries (system-agnostic, portable)
  2. Three analysis reports (GTM intelligence, network analysis, voice/style mining)
  3. ERPNext-ready import CSVs for CRM population

Data Location

The LinkedIn data export is at:

[UPDATE THIS PATH for your server]

It is the Complete export (37 files) downloaded on 2026-03-13. The export folder name ends in Complete_LinkedInDataExport_03-13-2026.zip but is already extracted.

Key Files in the Export (with schemas)

Connections.csv (2,500 rows)

  • 3-line preamble (skip first 3 lines), then CSV header at line 4
  • Columns: First Name, Last Name, URL, Email Address, Company, Position, Connected On
  • Dates: “11 Mar 2026” format
  • Emails: sparse (~10% have them; LinkedIn privacy opt-out)
  • Names: may include credentials (e.g., “Beckmann, RFC(R)”)

messages.csv (~3,000 logical rows, 9,110 raw lines due to multiline content)

  • Columns: CONVERSATION ID, CONVERSATION TITLE, FROM, SENDER PROFILE URL, TO, RECIPIENT PROFILE URLS, DATE, SUBJECT, CONTENT, FOLDER, ATTACHMENTS, IS MESSAGE DRAFT, IS CONVERSATION DRAFT
  • Dates: “2026-03-11 23:09:42 UTC” format
  • Content: multiline, quoted, may contain newlines
  • Dmitri’s profile URL: https://www.linkedin.com/in/dmitri-sunshine
  • Date range: 2010-2026 (16 years)

Invitations.csv (1,549 rows)

  • Columns: From, To, Sent At, Message, Direction, inviterProfileUrl, inviteeProfileUrl
  • Dates: “3/10/26, 1:02 PM” format
  • Direction: OUTGOING or INCOMING
  • Dominant template: “Hey Brother, I came across your profile and would love to connect since I got the sense there may be a lot of alignment between us…”

Comments.csv (225 rows)

  • Columns: Date, Link, Message
  • Dates: “2026-03-11 02:14:50” format
  • Date range: 2021-2026
  • Authentic voice data: questions, exclamations, mentions of others

Shares.csv (258 rows)

  • Columns: Date, ShareLink, ShareCommentary, SharedUrl, MediaUrl, Visibility
  • Commentary field: doubled double-quotes for embedded quotes (RFC 4180)
  • Topics: cybersecurity, AI, community, career, philanthropy

Reactions.csv (602 rows)

  • Columns: Date, Type, Link
  • Types: LIKE (75%), EMPATHY (19%), PRAISE (4%), INTEREST (1.5%), ENTERTAINMENT, APPRECIATION

Ad_Targeting.csv (1 data row, 34 columns, ~34 KB)

  • Each column cell contains semicolon-delimited lists
  • Key columns: Member Age (25-34), Company Names, Company Follower of (100+ orgs including Rockefeller Foundation, Ford Foundation), Member Groups (~25 including Chronicle of Philanthropy), Member Interests (500+), Member Traits, High Value Audience Segments (Small Business Owners, Public Impact Leaders, Senior Tech Decision Makers), Job Titles, Skills

Company Follows.csv (299 rows)

  • Columns: Organization, Followed On
  • Strategic curation: wealth management firms, foundations, impact funds, philanthropy infrastructure

Member_Follows.csv (2,030 rows)

  • Columns: Date, Status, FullName
  • C-suite coaches, impact investors, founders, nonprofit leaders

Inferences_about_you.csv (7 rows)

  • Columns: Category, Type of inference, Description, Inference
  • Key: tech media=True, gender=Male, job seeking=True, active contributor=True

Other files: Profile.csv, Positions.csv (7 jobs), Skills.csv (11 skills), Education.csv, Events.csv (6 events), Learning.csv (7 courses), Endorsement_Received_Info.csv, InstantReposts.csv, Ads Clicked.csv (5,852), Logins.csv, Security Challenges.csv, Registration.csv (Aug 2007), Receipts_v2.csv (Sales Navigator $99.99/mo), Email Addresses.csv, PhoneNumbers.csv

Existing Codebase Patterns to Follow

The project has 3 existing pipelines with identical architecture:

  • solanasis-scripts/foundation-pipeline/ (17 scripts, 62 tests)
  • solanasis-scripts/msp-pipeline/ (13 scripts)
  • solanasis-scripts/fcto-pipeline/ (9 scripts)

All follow: config.py → parse → score → outreach → migrate pattern. Use foundation-pipeline/config.py as the structural template.

Voice Profile Context

Dmitri has an existing documented voice profile at solanasis-docs/solanasis-voice-profile.md:

  • NO em dashes; uses semicolons as warm connectors
  • Signature terms: “quiet failure”, “false comfort”, “drift”, “blind spot”, “risk debt”
  • Greetings: “Heya” (default casual), “Hey Brother” (close contacts), “Aloha” (occasional)
  • Soft hedges: “if you happen to know of”, “would love to”
  • One ask per message; autobiographical anchoring; reciprocal closing
  • Banned: “genuinely”, “seamless”, “frictionless”, “SMBs”, “audit”, em dashes

The voice analysis script should validate these documented patterns against actual LinkedIn data and discover NEW patterns not yet documented.

Critical Technical Requirements

  1. Encoding: Use chardet to detect encoding on first read. LinkedIn CSVs may be Latin-1, not UTF-8. All outputs must be UTF-8.
  2. CSV parsing: Use stdlib csv.reader for all Stage 1 parsers. Open with newline=''. Pandas only for Stage 2 analysis.
  3. Output CSVs: csv.writer with quoting=csv.QUOTE_ALL, plain UTF-8 (no BOM) for ERPNext/Baserow import.
  4. LinkedIn URL normalization: lowercase, strip trailing /, strip query params/fragments, ensure https:// prefix. This is the join key.
  5. messages.csv: ~3,000 logical rows (9,110 raw lines include multiline content). Use csv.reader, NOT line counting, to get true row count.
  6. Full data preservation: Keep all message text, all fields. Nothing is filtered or stripped. Spam is flagged but preserved.

Implementation Order

Build in this order, testing each before proceeding:

  1. config.py + privacy_filter.py (foundation)
  2. parse_connections.py (simplest parser, validates the pattern)
  3. parse_invitations.py + parse_engagement.py (medium complexity)
  4. parse_messages.py (most complex; multiline content, conversation grouping)
  5. parse_profile_data.py (Ad_Targeting parsing is the challenge)
  6. build_unified_contacts.py (the critical merge + scoring)
  7. analyze_network.py + analyze_gtm_alignment.py + analyze_voice.py + generate_voice_enrichment_queue.py
  8. prepare_erpnext_import.py + push_to_erpnext.py (stub)

Verification Checklist

  • parse_connections.py produces exactly 2,500 rows
  • parse_messages.py produces ~2,700-3,000 logical messages (NOT 9,110)
  • parse_invitations.py produces ~1,549 rows
  • parse_engagement.py produces ~1,085 rows (225+258+602 merged)
  • build_unified_contacts.py spot-check: Ian Crafford, Tim Lockie, Kevin Roerty all present with correct merged data
  • Tier distribution is roughly Pareto (few A, more B, most C/D)
  • Voice report shows “Hey Brother” as dominant invitation greeting
  • All output CSVs open cleanly in Excel and import to ERPNext without encoding errors
  • No PII (phone numbers, emails) leaked into git-tracked files