LinkedIn Data Mining Pipeline

Context

Dmitri has a complete LinkedIn data export (37 files, ~22K rows total) that contains untapped gold for three purposes: populating the CRM with 2,500 contacts and relationship history, extracting GTM intelligence about network composition and engagement patterns, and mining voice/style data from 225 comments, 258 shares, ~3,000 messages, and 1,549 invitation messages to refine the existing voice profile.

The approach is staged: CSV/JSON intermediaries first (system-agnostic), analysis reports second, ERPNext push third. This lets us validate data quality at each step and keeps outputs portable across systems.

What We’re Working With

File	Rows	GTM Value	CRM Value	Voice Value
Connections.csv	2,500	Network composition	Primary contact source	-
messages.csv	~3,000	Relationship intel	Interaction history	DM style, greetings, CTAs
Invitations.csv	1,549	Outreach patterns	Connection context	Templates, personalization
Comments.csv	225	Topic interests	-	Short-form authentic voice
Shares.csv	258	Content strategy	-	Opinions, framing, commentary
Reactions.csv	602	Attention mapping	-	Interest signals
Ad_Targeting.csv	1 row, 34 cols (~34 KB)	Psychographic profile	-	-
Member_Follows.csv	2,030	Interest signals	-	-
Profile/Positions/Skills	~25	Self-assessment	Profile data	-
Inferences_about_you.csv	8	LinkedIn’s perception	-	-

Source: C:\Users\zasya\Documents\_zasage\data-export\Complete_LinkedInDataExport_03-13-2026.zip\

Additional High-Value Files (discovered in deep exploration)

File	Rows	Value
Company Follows.csv	299	ICP validation: Rockefeller Foundation, Ford Foundation, Kauffman Foundation, Carson Wealth, TIFIN Wealth, Momentis Family Office, Altruist, Wealthbox, Nitrogen, DataKind + 290 more
Ad_Targeting.csv	1 (34 cols)	500+ interests, 100+ company associations, 25+ member groups (Chronicle of Philanthropy, Impact Investor networks), high-value segments (Small Business Owners, Senior Tech Decision Makers, Public Impact Leaders)
Ads Clicked.csv	5,852	Ad interest signals; mirrors prospect ad behavior
Positions.csv	7	Career narrative: InReach (14yr) → Cohere COO → Wellnia → Matchkeyz → Sage’d → Co-Nexus → Solanasis
Events.csv	6	Networks Festival, Sociocracy Conference, gBETA Greeley, RegenWorld
Learning.csv	7	AI Agents, Design Thinking, Visual Communication (sampling, not completing)
Logins/Security	140	Device/IP/location patterns
Receipts_v2.csv	18	Sales Navigator @ $99.99/mo (active Dec 2025-Mar 2026)
Registration.csv	1	Account created Aug 17, 2007 (18-year member)

Research-Discovered Gold Nuggets

ICP Validation via Company Follows: 299 strategically curated company follows heavily skew toward wealth management, foundations, impact funds, philanthropy infrastructure. This IS the ICP proof.
Relationship Decay Detection: Cross-reference message dates with connection dates. Connections with past high engagement that went silent are priority reactivation targets.
Invitation Acceptance Rate: Cross-reference Invitations (outgoing) with Connections to calculate acceptance rate. Tells us how effective the “Hey Brother” template actually is.
Connection Growth Velocity: Plot connections by Connected On date to find networking surges (events, job changes, campaigns).
Voice Evolution (5 years): Comments span 2021-2026. Early: brief acknowledgments (“Superman”). Mid: curious questions. Recent: consultative expert commentary. Shows clear voice maturation.
Ad Targeting Decode: LinkedIn classifies Dmitri as “Public Impact Leader,” “Small Business Owner,” “Senior Tech Decision Maker.” Company follows include major foundations. Member groups include Chronicle of Philanthropy, Impact Investor networks.
Reaction Distribution: 75% Like, 19% Empathy, 4% Praise, 1.5% Interest. Mostly observer posture; selective emotional engagement.
Message Date Range: 2010 - Mar 2026 (16 years of history). Voice evolution analysis from messages can go much further back than comments (which start 2021).
Job Seeker Signal: LinkedIn infers “active job seeker” (true) despite CEO role. Suggests exploratory mindset or stale preferences. Consider turning this off.

CSV Encoding Gotcha (Critical)

Research found LinkedIn exports may use ANSI/Latin-1 encoding, NOT UTF-8. European characters, accented names, smart quotes will be garbled if read as UTF-8. All parsers must:

Use chardet to auto-detect encoding on first read
Convert to UTF-8 for all outputs
ERPNext requires UTF-8 for import

Existing Tools Assessment

No existing “LinkedIn export parser” library exists on PyPI or GitHub. Existing tools are all scrapers (Selenium/BeautifulSoup), not export parsers. We build our own, which is fine since the CSV schemas are straightforward.

ERPNext Data Import Format

Per Frappe docs:

Standard CSV upload via Data Import tool
Download template for Lead/Contact doctypes, fill in, upload
Must be UTF-8 encoded (especially for non-ASCII names)
No hard row limit but recommend “a few thousand at a time”
ID column left blank for new records (auto-generated)
Can also produce Baserow-compatible CSVs (simpler flat schema)

Stage 1: Parse & Clean (CSV/JSON intermediaries)

All scripts go in solanasis-scripts/linkedin-pipeline/. Follow existing pipeline conventions (config.py, —plan/—run modes).

Script 1: `config.py`

Pattern from foundation-pipeline/config.py (line-for-line reference: solanasis-scripts/foundation-pipeline/config.py):

LINKEDIN_EXPORT_DIR pointing to the Complete export
DMITRI_PROFILE_URL = "https://www.linkedin.com/in/dmitri-sunshine"
Path constants: DATA_DIR, INTERMEDIATE_DIR, OUTPUT_DIR, REPORT_DIR
Scoring weights for warmth and strategic value (see Script 8)
ICP keyword lists for segment tagging (foundation, nonprofit, wealth-mgmt, fractional-exec, etc.)
Spam detection patterns (for flagging, not filtering)
Topic keyword lists for engagement tagging (ai/tech, community, spirituality, entrepreneurship, philanthropy, cybersecurity)
detect_encoding() helper using chardet (LinkedIn CSVs may be ANSI/Latin-1)

Script 2: `privacy_filter.py`

Utility module:

is_spam_message() - keyword heuristics (unsubscribe, free training, limited time, etc.) - flags but does NOT delete
redact_pii_for_export() - optional redaction for any future external sharing (not used by default)
Full message text is preserved throughout the pipeline; nothing is stripped or filtered out

Script 3: `parse_connections.py`

Input: Connections.csv (note: 3-line header before actual CSV columns at line 4)
Output: output/connections_clean.csv
Logic: Normalize LinkedIn URLs (lowercase, strip trailing /), parse dates to ISO, split name fields, handle credentials in names (e.g., “Beckmann, RFC(R)”), add connection_age_days

Script 4: `parse_messages.py`

Input: messages.csv (multiline content in quoted fields; use stdlib csv.reader, NOT pandas)
Output: output/messages_parsed.json (grouped by conversation), output/message_stats_per_contact.csv
Logic: Group by CONVERSATION ID, identify Dmitri as anchor by profile URL, compute per-contact stats (message count, direction ratio, date range, response latency), filter drafts, flag spam, extract greeting patterns from Dmitri’s sent messages

Script 5: `parse_invitations.py`

Input: Invitations.csv
Output: output/invitations_clean.csv
Logic: Normalize URLs, parse dates (“3/10/26, 1:02 PM” format), classify direction, detect templates using difflib.SequenceMatcher (threshold 0.85) - the “Hey Brother” template is dominant and should be quantified

Script 6: `parse_engagement.py`

Input: Comments.csv, Shares.csv (tricky: doubled double-quotes in commentary), Reactions.csv, InstantReposts.csv
Output: output/engagement_activity.csv (unified timeline), output/engagement_by_topic.json
Logic: Merge all into single timeline, extract text, compute word/sentence counts, tag reaction types, apply topic keyword lists (ai/tech, community/coliving, spirituality, entrepreneurship, philanthropy/nonprofit, cybersecurity)

Script 7: `parse_profile_data.py`

Input: Profile.csv, Positions.csv, Skills.csv, Endorsement_Received_Info.csv, Ad_Targeting.csv, Inferences_about_you.csv, Member_Follows.csv, Company Follows.csv, Events.csv, Learning.csv, Ads Clicked.csv, Registration.csv, Receipts_v2.csv
Output: output/profile_summary.json, output/ad_targeting_parsed.json, output/follows_and_interests.csv, output/career_timeline.json
Logic:
- Parse Ad_Targeting’s 34 columns (semicolon-delimited lists within each column cell). Extract: Member Age, Buyer Groups, Company Names, Company Follows (100+), Industries, Interests (500+), Member Traits, High Value Segments, Job Functions, Skills, Standard Audience Segments
- Parse Inferences (7 rows: HR professional=No, tech media=True, oil/energy=True, EV affinity=True, gender=Male, job seeking=True, active contributor=True)
- Merge Member_Follows (2,030) + Company Follows (299) into single list with type column
- Parse Positions into career timeline JSON with date ranges and descriptions
- Parse Events for interest signals (sociocracy, regenerative systems, community building)
- Parse Learning for curiosity signals (AI agents, design thinking, visual communication)
- Include Registration date (Aug 2007, 18-year member) for account maturity context

Script 8: `build_unified_contacts.py` (the critical merge)

Input: All Stage 1 outputs
Output: output/contacts_unified.csv, output/contacts_unified.json
Join key: normalized linkedin_url
Merge: Connections (base) + left-join Invitations + left-join Message stats + left-join Follows
Orphan handling: people in messages/invitations but NOT in connections go to non_connection_interactions.csv
Warmth scoring (algorithmic):
- +20 bidirectional messages, +15 if 5+ messages, +10 message in last 30d, +8 in last 90d
- +5 Dmitri sent invitation (proactive), +5 they sent invitation, +3 Dmitri follows them
- -10 connected 1yr+ with zero messages, -5 only spam messages
Strategic value scoring:
- +15 title matches ICP (CEO/COO/CTO/ED at nonprofits/foundations/growing orgs)
- +10 company matches target vertical, +8 fractional C-suite, +5 Colorado/Boulder/Denver
- +5 connector role (investor, advisor, board member), +3 engaged recently
Tier: A (50+), B (30-49), C (15-29), D (<15)
Segment tags: nonprofit, foundation, wealth-mgmt, tech, startup, coliving-community, fractional-exec, boulder-local, connector, spiritual-wellness

Unified contact schema (one row per person, maximum data):

linkedin_url, first_name, last_name, full_name, email, company, position,
connected_on, connection_age_days,
invitation_direction, invitation_message, invitation_date, invitation_accepted,
message_count_total, message_count_from_dmitri, message_count_to_dmitri,
first_message_date, last_message_date, last_message_direction,
avg_response_time_hours, longest_conversation_length, conversation_count,
last_message_content_preview,
dmitri_follows_them, they_follow_dmitri,
warmth_score, strategic_value_score, relationship_tier,
relationship_decay_flag, days_since_last_contact,
segment_tags, icp_match_details, notes

Also produce per-contact conversation export (output/conversations/):

One JSON file per A/B-tier contact: full conversation threads with timestamps
For C/D-tier: summary JSON with metadata only + last 3 messages

Stage 2: Analysis Reports

Output to solanasis-docs/linkedin-analysis/ as markdown.

Script 9: `analyze_network.py`

Report: network-intelligence-report.md

Network composition: industry breakdown (keyword-based from company/position), role seniority distribution, geographic distribution
Top 25 most-engaged contacts (by warmth score) with commonalities
Network gaps vs. Solanasis target verticals: foundation execs, wealth management, MSPs, fractional CTOs. What % of network is in each vertical? Where is it thin?
ICP validation: Cross-reference actual connection industries with declared ICP. If <5% are nonprofit/foundation people, the network doesn’t support the GTM
Referral partner candidates: fractional execs/consultants/advisors/attorneys/accountants with high warmth + matching segments
Dormant gold: high strategic value but low warmth (connected, never messaged; target vertical people)
Relationship decay list: contacts with past high engagement (5+ messages) but no contact in 90+ days
Connection velocity: connections per week over time, identify networking surges (events, campaigns)
Invitation effectiveness: acceptance rate by template variant, by personalization level
Company Follows alignment: 299 followed companies mapped to Solanasis verticals

Script 10: `analyze_gtm_alignment.py`

Report: gtm-alignment-report.md

Attention audit: topic breakdown of comments/shares/reactions vs. Solanasis positioning needs. Currently engaging on: AI, community, spirituality, entrepreneurship. Needs more: cybersecurity, compliance, disaster recovery, operational resilience
Ad Targeting decode: Parse all 34 columns. What LinkedIn thinks Dmitri is: Public Impact Leader, Small Business Owner, Senior Tech Decision Maker. What interests are inferred. What member groups he’s in (Chronicle of Philanthropy, Impact Investor networks)
Inferences analysis: LinkedIn says: active contributor, tech media interest, job seeking (true), EV affinity. Flag misalignment with Solanasis brand
Content engagement gap: quantify % of engagement on cybersecurity vs. community vs. AI vs. other. Compare with desired positioning
Warm introduction map: for each target vertical (foundations, wealth mgmt, MSPs, fCTOs), list the warmest contacts who could provide intros
Company Follows as ICP signal: 299 companies mapped to verticals. Which followed companies align with each pipeline (foundation, MSP, fCTO)?
Member Groups inventory: list all ~25 groups from Ad_Targeting, classify relevance to Solanasis GTM
Actionable recommendations: specific, numbered actions (e.g., “Join 3 cybersecurity-focused LinkedIn groups,” “Increase security-related comments from X% to 25%“)

Script 11: `analyze_voice.py`

Report: voice-analysis-report.md (Python-only, no LLM needed)

Comment voice profile (225 comments, 5 years of data):
- Avg length, word count distribution, exclamation frequency, question frequency
- Top 30 words/phrases (TF-IDF), vocabulary richness (type-token ratio)
- Intensifier usage: “by far”, “so”, “really”, “super”, “amazing”, “love it”
- Softener usage: “happen to know”, “would love to”, “I think”, “I believe”
Share commentary voice (258 shares):
- Similar metrics + topic framing analysis
- Signature transitions: “Here’s the thing…”, “So…”, “Which is why…”
- Pro-tip pattern detection (educational framing)
DM voice profile (~3,000 messages, Dmitri’s sent messages only):
- Greeting distribution: “Hey Brother” X%, “Hey Sister” Y%, “Heya” Z%, “Aloha” W%, “Hey [Name]” V%
- Sign-off patterns, meeting link inclusion rate (go.solanasis.com/meet)
- CTA patterns: question-ending, call-booking, group-asking
- Response latency: how fast does Dmitri reply?
Invitation template analysis (1,549 invitations):
- Cluster templates using difflib.SequenceMatcher (0.85 threshold)
- Show each template variant with: verbatim text, usage count, date range, personalization spots
- Calculate acceptance rate per template (cross-ref Invitations → Connections)
- Gender-specific greeting analysis (Brother vs Sister vs name-only)
Voice validation against documented profile (solanasis-voice-profile.md):
- Em dash frequency (should be ~0), semicolon frequency, banned words scan
- Comparison against Appendix A findings (Heya vs Hey Brother, soft hedges, reciprocal closing)
5-year voice evolution (2021→2026):
- 2021: Brief acknowledgments (“Superman”)
- 2023: Curious questions, learning posture
- 2025: Tech-forward, opinionated (Claude AI commentary)
- 2026: Consultative, expert-level, Solanasis-branded
- Quantify: avg comment length by year, question ratio by year, vocabulary complexity by year
Sentiment analysis using VADER (nltk.sentiment.vader, purpose-built for short social media text): polarity and subjectivity trends over time

Script 12: `generate_voice_enrichment_queue.py`

Output: output/voice_llm_queue.json (prompt-ready samples for Claude analysis)
Selects: all 225 comments, all share commentaries, top 50 conversations (Dmitri’s messages only), all unique invitation templates
For future LLM-assisted analysis: tone classification, qualitative insights, voice profile extension

Stage 3: ERPNext Integration

Script 13: `prepare_erpnext_import.py`

Input: contacts_unified.csv
Output: output/erpnext_leads.csv, output/erpnext_contacts.csv
Mapping:
- Connection with messages exchanged → Contact (linked to Lead)
- Connection, no interaction → Lead (status: Open)
- Connection in target segment → Lead (status: Qualified)
- Non-connection → Lead (status: Open, lower priority)
--plan mode: print counts and samples, no file creation
--run mode: write ERPNext Data Import-ready CSVs
Full message text included in conversation exports for A/B-tier contacts; metadata + last messages for C/D-tier

Script 14: `push_to_erpnext.py` (stub only - implemented when API access confirmed)

Placeholder with Frappe REST API patterns documented in comments
Auth env vars defined in .env.example: ERPNEXT_URL, ERPNEXT_API_KEY, ERPNEXT_API_SECRET
Will be fleshed out once ERPNext at db.zasage.us has API credentials and schema confirmed

Directory Structure

solanasis-scripts/linkedin-pipeline/
    config.py
    privacy_filter.py
    parse_connections.py
    parse_messages.py
    parse_invitations.py
    parse_engagement.py
    parse_profile_data.py
    build_unified_contacts.py
    analyze_network.py
    analyze_gtm_alignment.py
    analyze_voice.py
    generate_voice_enrichment_queue.py
    prepare_erpnext_import.py
    push_to_erpnext.py
    requirements.txt          # python-dateutil, pandas (Stage 2 only), httpx, python-dotenv
    .env.example
    data/
        intermediate/         # Working files (git-ignored)
        output/               # Final CSV/JSON (git-ignored)
    tests/
        test_parse_*.py
        test_privacy_filter.py
        test_scoring.py

solanasis-docs/linkedin-analysis/
    network-intelligence-report.md
    gtm-alignment-report.md
    voice-analysis-report.md

Implementation Order

Phase A (foundation): config.py, privacy_filter.py, parse_connections.py Phase B (core parsers): parse_invitations.py, parse_engagement.py, parse_messages.py, parse_profile_data.py Phase C (the merge): build_unified_contacts.py - spot-check known contacts (Ian Crafford, Tim Lockie, Kevin Roerty) Phase D (analysis): analyze_network.py, analyze_gtm_alignment.py, analyze_voice.py, generate_voice_enrichment_queue.py Phase E (integration): prepare_erpnext_import.py, push_to_erpnext.py

Key Technical Notes

Encoding: LinkedIn CSVs may be ANSI/Latin-1, not UTF-8. Use chardet.detect() on first read, convert all outputs to UTF-8. Add chardet to requirements.
CSV parsing: Use stdlib csv.reader for ALL Stage 1 parsers (consistent, fewer dependencies). Pandas only for Stage 2 analysis scripts. Always open with newline='' per Python csv module docs.
Connections.csv header: 3-line preamble (notes about email privacy) before actual column headers at line 4. Skip first 3 lines manually with csv.reader.
Shares.csv quoting: Commentary contains doubled double-quotes ("") for embedded quotes. Python’s csv.reader handles doubled double-quotes natively per RFC 4180 (no special dialect needed).
Ad_Targeting.csv: 34 columns, 1 data row, ~34K of data. Each cell contains semicolon-delimited lists. Parse by splitting on ; after CSV extraction.
LinkedIn URL normalization: lowercase, strip trailing /, strip query params/fragments, ensure https:// prefix. This is the join key across ALL files.
Output CSV robustness: All output CSVs must use csv.writer with quoting=csv.QUOTE_ALL and plain UTF-8 encoding (no BOM) for ERPNext/Baserow import. For any CSVs explicitly labeled for Excel use, add BOM (utf-8-sig) separately.
Raw LinkedIn export stays in _zasage/data-export/: never copied into pipeline dir or committed to git
All intermediate/output files git-ignored
requirements.txt: chardet, python-dateutil, pandas>=2.0, nltk, httpx, python-dotenv
No external API calls in Stage 1-2: All analysis is local, offline, using the CSV files directly

Verification

Row count validation: Each parser should log input rows vs. output rows (expect 2,500 connections, ~2,700-3,000 logical messages (9,110 raw CSV lines including multiline content), ~1,549 invitations, ~1,085 engagement items)
Spot-check known contacts: Ian Crafford, Tim Lockie, Kevin Roerty should appear in unified contacts with correct merge of connection + invitation + message data
Tier distribution: Should be roughly Pareto (few A-tier, more B, most C/D)
Voice report validation: Cross-check greeting distribution against what we see in raw data (e.g., “Hey Brother” is clearly dominant in invitations)
Analysis reports: Review markdown reports for actionable insights before ERPNext push

Decisions (confirmed with Dmitri)

ERPNext: CSV-only for now. No API push script; just import-ready CSVs. push_to_erpnext.py becomes a stub/placeholder.
Data scope: Full messages, full data, as much as possible. No filtering to business-only. Full message text preserved in intermediate outputs and available for analysis.
Enrichment: LinkedIn URL + name + company is sufficient. No Apollo enrichment step needed; can always enrich later.
Scope: Build all 14 scripts in one pass.
Execution: Run on a separate server (not local dev machine) due to processing power concerns.

Senior Reviewer Findings (all addressed)

Status: APPROVED WITH NOTES (all fixed)

SHOULD-FIX (FIXED): Message count corrected from 9,109 to ~3,000 logical rows (multiline content inflates raw line count)
SHOULD-FIX (FIXED): Message date range corrected to 2010-2026 (16 years, not 14 months)
SHOULD-FIX (FIXED): Connections row count corrected to 2,500
SHOULD-FIX (FIXED): CSV quoting note corrected (RFC 4180 native handling, not QUOTE_ALL)
NICE-TO-HAVE (FIXED): Switched from TextBlob to VADER for sentiment (purpose-built for short social text)
NICE-TO-HAVE (FIXED): Standardized csv.reader for all Stage 1 parsers
NICE-TO-HAVE (FIXED): Plain UTF-8 for CRM imports, BOM only for Excel-labeled files
NICE-TO-HAVE (FIXED): Ad_Targeting row count notation clarified

CONTINUATION PROMPT (for executing on another server)

Copy everything below this line into a new Claude Code session on the target server. This prompt is self-contained and references the plan above.

Task: Build LinkedIn Data Mining Pipeline

You are implementing a LinkedIn data mining pipeline for Solanasis LLC. The complete plan is in this file above the “CONTINUATION PROMPT” heading. Read the entire plan first.

Who You’re Working For

Dmitri Zasage (aka Dmitri Sunshine), CEO of Solanasis LLC, a fractional CIO/CSIO/COO firm based in Boulder, Colorado targeting nonprofits, foundations, and growing organizations. Solanasis offers: Security Assessments, Disaster Recovery Verification, Data Migrations, CRM Setup, Systems Integration, Responsible AI Implementation.

What You’re Building

14 Python scripts that parse Dmitri’s LinkedIn data export (37 CSV files) into:

Clean CSV/JSON intermediaries (system-agnostic, portable)
Three analysis reports (GTM intelligence, network analysis, voice/style mining)
ERPNext-ready import CSVs for CRM population

Data Location

The LinkedIn data export is at:

[UPDATE THIS PATH for your server]

It is the Complete export (37 files) downloaded on 2026-03-13. The export folder name ends in Complete_LinkedInDataExport_03-13-2026.zip but is already extracted.

Key Files in the Export (with schemas)

Connections.csv (2,500 rows)

3-line preamble (skip first 3 lines), then CSV header at line 4
Columns: First Name, Last Name, URL, Email Address, Company, Position, Connected On
Dates: “11 Mar 2026” format
Emails: sparse (~10% have them; LinkedIn privacy opt-out)
Names: may include credentials (e.g., “Beckmann, RFC(R)”)

messages.csv (~3,000 logical rows, 9,110 raw lines due to multiline content)

Columns: CONVERSATION ID, CONVERSATION TITLE, FROM, SENDER PROFILE URL, TO, RECIPIENT PROFILE URLS, DATE, SUBJECT, CONTENT, FOLDER, ATTACHMENTS, IS MESSAGE DRAFT, IS CONVERSATION DRAFT
Dates: “2026-03-11 23:09:42 UTC” format
Content: multiline, quoted, may contain newlines
Dmitri’s profile URL: https://www.linkedin.com/in/dmitri-sunshine
Date range: 2010-2026 (16 years)

Invitations.csv (1,549 rows)

Columns: From, To, Sent At, Message, Direction, inviterProfileUrl, inviteeProfileUrl
Dates: “3/10/26, 1:02 PM” format
Direction: OUTGOING or INCOMING
Dominant template: “Hey Brother, I came across your profile and would love to connect since I got the sense there may be a lot of alignment between us…”

Comments.csv (225 rows)

Columns: Date, Link, Message
Dates: “2026-03-11 02:14:50” format
Date range: 2021-2026
Authentic voice data: questions, exclamations, mentions of others

Shares.csv (258 rows)

Columns: Date, ShareLink, ShareCommentary, SharedUrl, MediaUrl, Visibility
Commentary field: doubled double-quotes for embedded quotes (RFC 4180)
Topics: cybersecurity, AI, community, career, philanthropy

Reactions.csv (602 rows)

Columns: Date, Type, Link
Types: LIKE (75%), EMPATHY (19%), PRAISE (4%), INTEREST (1.5%), ENTERTAINMENT, APPRECIATION

Ad_Targeting.csv (1 data row, 34 columns, ~34 KB)

Each column cell contains semicolon-delimited lists
Key columns: Member Age (25-34), Company Names, Company Follower of (100+ orgs including Rockefeller Foundation, Ford Foundation), Member Groups (~25 including Chronicle of Philanthropy), Member Interests (500+), Member Traits, High Value Audience Segments (Small Business Owners, Public Impact Leaders, Senior Tech Decision Makers), Job Titles, Skills

Company Follows.csv (299 rows)

Columns: Organization, Followed On
Strategic curation: wealth management firms, foundations, impact funds, philanthropy infrastructure

Member_Follows.csv (2,030 rows)

Columns: Date, Status, FullName
C-suite coaches, impact investors, founders, nonprofit leaders

Inferences_about_you.csv (7 rows)

Columns: Category, Type of inference, Description, Inference
Key: tech media=True, gender=Male, job seeking=True, active contributor=True

Other files: Profile.csv, Positions.csv (7 jobs), Skills.csv (11 skills), Education.csv, Events.csv (6 events), Learning.csv (7 courses), Endorsement_Received_Info.csv, InstantReposts.csv, Ads Clicked.csv (5,852), Logins.csv, Security Challenges.csv, Registration.csv (Aug 2007), Receipts_v2.csv (Sales Navigator $99.99/mo), Email Addresses.csv, PhoneNumbers.csv

Existing Codebase Patterns to Follow

The project has 3 existing pipelines with identical architecture:

solanasis-scripts/foundation-pipeline/ (17 scripts, 62 tests)
solanasis-scripts/msp-pipeline/ (13 scripts)
solanasis-scripts/fcto-pipeline/ (9 scripts)

All follow: config.py → parse → score → outreach → migrate pattern. Use foundation-pipeline/config.py as the structural template.

Voice Profile Context

Dmitri has an existing documented voice profile at solanasis-docs/solanasis-voice-profile.md:

NO em dashes; uses semicolons as warm connectors
Signature terms: “quiet failure”, “false comfort”, “drift”, “blind spot”, “risk debt”
Greetings: “Heya” (default casual), “Hey Brother” (close contacts), “Aloha” (occasional)
Soft hedges: “if you happen to know of”, “would love to”
One ask per message; autobiographical anchoring; reciprocal closing
Banned: “genuinely”, “seamless”, “frictionless”, “SMBs”, “audit”, em dashes

The voice analysis script should validate these documented patterns against actual LinkedIn data and discover NEW patterns not yet documented.

Critical Technical Requirements

Encoding: Use chardet to detect encoding on first read. LinkedIn CSVs may be Latin-1, not UTF-8. All outputs must be UTF-8.
CSV parsing: Use stdlib csv.reader for all Stage 1 parsers. Open with newline=''. Pandas only for Stage 2 analysis.
Output CSVs: csv.writer with quoting=csv.QUOTE_ALL, plain UTF-8 (no BOM) for ERPNext/Baserow import.
LinkedIn URL normalization: lowercase, strip trailing /, strip query params/fragments, ensure https:// prefix. This is the join key.
messages.csv: ~3,000 logical rows (9,110 raw lines include multiline content). Use csv.reader, NOT line counting, to get true row count.
Full data preservation: Keep all message text, all fields. Nothing is filtered or stripped. Spam is flagged but preserved.

Implementation Order

Build in this order, testing each before proceeding:

config.py + privacy_filter.py (foundation)
parse_connections.py (simplest parser, validates the pattern)
parse_invitations.py + parse_engagement.py (medium complexity)
parse_messages.py (most complex; multiline content, conversation grouping)
parse_profile_data.py (Ad_Targeting parsing is the challenge)
build_unified_contacts.py (the critical merge + scoring)
analyze_network.py + analyze_gtm_alignment.py + analyze_voice.py + generate_voice_enrichment_queue.py
prepare_erpnext_import.py + push_to_erpnext.py (stub)

Verification Checklist

parse_connections.py produces exactly 2,500 rows
parse_messages.py produces ~2,700-3,000 logical messages (NOT 9,110)
parse_invitations.py produces ~1,549 rows
parse_engagement.py produces ~1,085 rows (225+258+602 merged)
build_unified_contacts.py spot-check: Ian Crafford, Tim Lockie, Kevin Roerty all present with correct merged data
Tier distribution is roughly Pareto (few A, more B, most C/D)
Voice report shows “Hey Brother” as dominant invitation greeting
All output CSVs open cleanly in Excel and import to ERPNext without encoding errors
No PII (phone numbers, emails) leaked into git-tracked files

Solanasis Docs

Explorer

linkedin-pipeline-plan

LinkedIn Data Mining Pipeline

Context

What We’re Working With

Additional High-Value Files (discovered in deep exploration)

Research-Discovered Gold Nuggets

CSV Encoding Gotcha (Critical)

Existing Tools Assessment

ERPNext Data Import Format

Stage 1: Parse & Clean (CSV/JSON intermediaries)

Script 1: config.py

Script 2: privacy_filter.py

Script 3: parse_connections.py

Script 4: parse_messages.py

Script 5: parse_invitations.py

Script 6: parse_engagement.py

Script 7: parse_profile_data.py

Script 8: build_unified_contacts.py (the critical merge)

Stage 2: Analysis Reports

Script 9: analyze_network.py

Script 10: analyze_gtm_alignment.py

Script 11: analyze_voice.py

Script 12: generate_voice_enrichment_queue.py

Stage 3: ERPNext Integration

Script 13: prepare_erpnext_import.py

Script 14: push_to_erpnext.py (stub only - implemented when API access confirmed)

Directory Structure

Implementation Order

Key Technical Notes

Verification

Decisions (confirmed with Dmitri)

Senior Reviewer Findings (all addressed)

CONTINUATION PROMPT (for executing on another server)

Task: Build LinkedIn Data Mining Pipeline

Who You’re Working For

What You’re Building

Data Location

Key Files in the Export (with schemas)

Existing Codebase Patterns to Follow

Voice Profile Context

Critical Technical Requirements

Implementation Order

Verification Checklist

Graph View

Table of Contents

Script 1: `config.py`

Script 2: `privacy_filter.py`

Script 3: `parse_connections.py`

Script 4: `parse_messages.py`

Script 5: `parse_invitations.py`

Script 6: `parse_engagement.py`

Script 7: `parse_profile_data.py`

Script 8: `build_unified_contacts.py` (the critical merge)

Script 9: `analyze_network.py`

Script 10: `analyze_gtm_alignment.py`

Script 11: `analyze_voice.py`

Script 12: `generate_voice_enrichment_queue.py`

Script 13: `prepare_erpnext_import.py`

Script 14: `push_to_erpnext.py` (stub only - implemented when API access confirmed)