LinkedIn Data Mining Pipeline — Implementation Prompt
Usage: Feed this entire file as context to a new Claude Code session on any machine that has access to the LinkedIn data export and the solanasis-scripts repo. This is fully self-contained; no external plan files or memory needed.
Created: 2026-03-22 Senior Reviewed: Yes (all findings addressed)
Task
Build a complete LinkedIn data mining pipeline: 14 Python scripts that parse a LinkedIn data export (37 CSV files) into clean CSV/JSON intermediaries, generate GTM intelligence and voice analysis reports, and produce CRM-ready import files.
Who You’re Working For
Dmitri Zasage (aka Dmitri Sunshine on LinkedIn), CEO of Solanasis LLC, a fractional CIO/CSIO/COO firm based in Boulder, Colorado. One-person operation with 1099 contractors, growth hacking mindset.
Solanasis offerings: Security Assessments, Disaster Recovery Verification, Data Migrations, CRM Setup, Systems Integration, Responsible AI Implementation
Target market: Nonprofits, private foundations (50M), growing organizations, wealth management firms
CRM systems: ERPNext at db.zasage.us (primary, custom), Baserow (temporary). Both accept standard CSV imports.
Three Goals
- CRM Population — Parse 2,500 connections + message history + invitations into scored, tiered contact records ready for ERPNext/Baserow import
- GTM Intelligence — Network composition analysis, ICP validation, attention audit, warm introduction mapping, referral partner identification
- Voice/Style Mining — Extract writing patterns from 225 comments, 258 shares, ~3,000 messages, 1,549 invitations to validate and extend Dmitri’s documented voice profile
The LinkedIn Data Export
Location: Update this path for your machine:
C:\Users\zasya\Documents\_zasage\data-export\Complete_LinkedInDataExport_03-13-2026.zip\
Already extracted. 37 CSV files. Dmitri’s LinkedIn profile URL: https://www.linkedin.com/in/dmitri-sunshine
File Inventory with Schemas
Connections.csv (2,500 rows)
- IMPORTANT: 3-line preamble (notes about email privacy) before actual CSV columns at line 4. Skip first 3 lines.
- Columns:
First Name, Last Name, URL, Email Address, Company, Position, Connected On - Date format: “11 Mar 2026”
- Emails: sparse (~10% have them; LinkedIn privacy opt-out)
- Names may include credentials: e.g.,
"Beckmann, RFC®"(quoted, with comma inside)
messages.csv (~3,000 logical rows; 9,110 raw lines due to multiline quoted content)
- Columns:
CONVERSATION ID, CONVERSATION TITLE, FROM, SENDER PROFILE URL, TO, RECIPIENT PROFILE URLS, DATE, SUBJECT, CONTENT, FOLDER, ATTACHMENTS, IS MESSAGE DRAFT, IS CONVERSATION DRAFT - Date format: “2026-03-11 23:09:42 UTC”
- Content field: multiline, properly quoted per RFC 4180
- Date range: 2010-2026 (16 years of history)
- Use
csv.reader(NOT line counting) to get true logical row count
Invitations.csv (1,549 rows)
- Columns:
From, To, Sent At, Message, Direction, inviterProfileUrl, inviteeProfileUrl - Date format: “3/10/26, 1:02 PM”
- Direction: OUTGOING or INCOMING
- Dominant outgoing template: “Hey Brother, I came across your profile and would love to connect since I got the sense there may be a lot of alignment between us. I’d love to jump on a call with you sometime to see where our worlds overlap and how we might be able to help each other.”
- Variations: “Hey Sister”, “Aloha [Name]”, “Hey [Name]”, “Hy [Name]” (typo variant)
Comments.csv (225 rows)
- Columns:
Date, Link, Message - Date format: “2026-03-11 02:14:50”
- Date range: 2021-2026 (5 years)
- This is the most authentic voice data. Examples:
- “Wow! It must have been amazing to witness that kind of growth. What’s the main challenge to keep it growing at this rate?”
- “Love it! Mindset is everything! I like to call this way of thinking ‘Pronoia’; the Universe is conspiring in your favor.”
- “Claude bullshits me far more often so ‘are you sure? verify everything!’ has been one of my most used prompts!”
Shares.csv (258 rows)
- Columns:
Date, ShareLink, ShareCommentary, SharedUrl, MediaUrl, Visibility - Commentary field uses doubled double-quotes (
"") for embedded quotes. Python’scsv.readerhandles this natively per RFC 4180. - Topics: cybersecurity (password managers, zero-knowledge encryption), AI, community, career, philanthropy
- Example commentary starts: “So it’s 2026 but I still see orgs and people who don’t use a password manager…”
Reactions.csv (602 rows)
- Columns:
Date, Type, Link - Distribution: LIKE 75%, EMPATHY 19%, PRAISE 4%, INTEREST 1.5%, ENTERTAINMENT 0.5%, APPRECIATION 0.2%
- Date range: Apr 2022 - Mar 2026
Ad_Targeting.csv (1 data row, 34 columns, ~34 KB)
- Each column cell contains semicolon-delimited lists. Parse by splitting on
;after CSV extraction. - Key columns and values:
- Member Age: 25-34
- Company Follower of: 100+ organizations including Rockefeller Foundation, Ford Foundation, Ewing Marion Kauffman Foundation, Y Combinator, OpenAI, The Nature Conservancy, Solanasis, Carson Wealth, TIFIN Wealth, Momentis Family Office, Altruist, Wealthbox, Nitrogen, DataKind
- Member Groups: ~25 including Chronicle of Philanthropy, Emotional Intelligence Network, Philanthropy Network, Impact Investor networks, MAPS (psychedelics), Nonprofit fundraising
- Member Interests: 500+ including Cloud Computing, Risk Management, Entrepreneurship, AI, Cybersecurity, Data Management, Venture Capital, Nonprofit sector, ESG, Impact Investing, Responsible AI, Social Enterprise
- Member Traits: Active contributor/influencer
- High Value Segments: Small Business Owners, Senior Tech Decision Makers, High Net Worth Individuals, Public Impact Leaders, Enterprise Builders, Millennials
- Job Functions: Business Development
- Company Revenue: Under $1M
- Years of Experience: 12+
Inferences_about_you.csv (7 rows)
- Columns:
Category, Type of inference, Description, Inference - Inferences: HR professional=No, tech media interest=True, oil/energy interest=True, EV affinity=True, gender=Male, job seeking=True, active contributor/influencer=True
Company Follows.csv (299 rows)
- Columns:
Organization, Followed On - Strategically curated: wealth management firms, foundations, impact funds, philanthropy infrastructure, mental health orgs, community builders
- Date range: Feb-Mar 2026 (recent bulk follows)
Member_Follows.csv (2,030 rows)
- Columns:
Date, Status, FullName - All “Active” status
- C-suite coaches, impact investors, founders, nonprofit leaders, sustainability experts
Profile.csv (1 data row)
- Columns:
First Name, Last Name, Maiden Name, Address, Birth Date, Headline, Summary, Industry, Zip Code, Geo Location, Twitter Handles, Websites, Instant Messengers - Headline: “Less Luck. More Systems. | Founder @ Solanasis | Cybersecurity • Systems Integration • Responsible AI Implementation”
- Industry: IT Services and IT Consulting
- Location: Boulder, Colorado 80304
Positions.csv (7 positions)
- Columns:
Company Name, Title, Description, Location, Started On, Finished On - Career: InReach Solutions (Jul 2007-Nov 2021, 14yr, foster care/adoption) → Cohere COO (Nov 2021-Mar 2022) → Sage’d Founder (Mar 2022-Jan 2026) → Co-Nexus Community Co-op Founder (Nov 2023-present) → Wellnia Founder (Sep-Dec 2024) → Matchkeyz Founder (May 2025-Feb 2026) → Solanasis Founder & CEO (Feb 2026-present)
Skills.csv (11 skills): ERP Software, Cybersecurity, Healthcare IT, System Architecture, Project Management, Organizational Development, Operations Management, Software Architecture, Entrepreneurship, Nonprofit Organizations, Board of Directors
Other files:
- Endorsement_Received_Info.csv (11 endorsements, mostly from Sarah Lena Hofer)
- Events.csv (6: Networks Festival, Sociocracy Conference, gBETA Greeley, RegenWorld, Growing Cooperative Cultures, Self-Mastery Session)
- Learning.csv (7 LinkedIn Learning courses: SOLIDWORKS, Inclusive Tech, Creating Maps with R, AI Agents for SMBs, Ageism, Visual Communication, Design Thinking/Social Innovation)
- InstantReposts.csv (6 reposts)
- Rich_Media.csv (6 photo uploads)
- Ads Clicked.csv (5,852 ad clicks)
- LAN Ads Engagement.csv (87 impressions on Roku, Pluto TV, etc.)
- Logins.csv (22 login events)
- Security Challenges.csv (117 challenges)
- Registration.csv (account created Aug 17, 2007 — 18-year LinkedIn member)
- Receipts_v2.csv (18 transactions, Sales Navigator Core @ $99.99/mo, Dec 2025-Mar 2026)
- Email Addresses.csv (zasyatkin@gmail.com secondary, mr.sunshine@matchkeyz.com primary)
- PhoneNumbers.csv (+1 503-544-2911)
- SavedJobAlerts.csv (“founder” keyword, Denver, daily)
- Jobs/Saved Jobs.csv (2 full-stack engineer roles at health tech companies)
- Jobs/Job Seeker Preferences.csv (Portland, full-stack/software architect)
Research-Discovered Gold Nuggets
These are non-obvious insights from deep research that should inform the analysis scripts:
- ICP Validation via Company Follows: The 299 company follows heavily skew toward wealth management, foundations, impact funds, philanthropy infrastructure. This validates the declared ICP.
- Relationship Decay Detection: Cross-reference message dates with connection dates. Connections with past high engagement (5+ messages) that went silent (90+ days) are priority reactivation targets.
- Invitation Acceptance Rate: Cross-reference outgoing Invitations with Connections to calculate what % of invitations resulted in connections. Measures template effectiveness.
- Connection Growth Velocity: Plot connections by
Connected Ondate to find networking surges (events, campaigns, job changes). - Voice Evolution (5+ years): Comments span 2021-2026. Evolution: brief acknowledgments (2021) → curious questions (2023) → tech-forward/opinionated (2025) → consultative expert (2026).
- Ad Targeting Decode: LinkedIn classifies Dmitri as “Public Impact Leader,” “Small Business Owner,” “Senior Tech Decision Maker.” Member groups include Chronicle of Philanthropy, Impact Investor networks.
- Reaction Distribution: 75% Like, 19% Empathy — mostly observer posture with selective emotional engagement. Useful for voice mining (what triggers deeper reactions?).
- Job Seeker Signal: LinkedIn infers “active job seeker” (true) despite CEO role. Flag this in GTM report.
Existing Voice Profile (for validation)
Dmitri has a documented voice profile at solanasis-docs/solanasis-voice-profile.md. The voice analysis script should validate these patterns against actual LinkedIn data and discover NEW patterns:
Documented rules:
- NO em dashes (—) — use semicolons, periods, parentheses, commas
- NO “genuinely”, “seamless”, “frictionless”, “SMBs”, “audit”
- NO “Furthermore”, “Moreover”, “Additionally” — use “So,” or jump-cut
- NO “game-changer”, “next level”, “leverage” (verb), “synergy”
- Semicolons as warm connectors (2+ per piece)
- Parenthetical asides (1+ per piece)
- Signature terms (2-3 per piece): “quiet failure”, “false comfort”, “drift”, “blind spot”, “risk debt”, “operational drag”, “hidden risk”, “brittle systems”, “messy handoffs”
- Approved transitions: “So,”, “Which is why”, “Here’s the thing”, “Let’s face it”, “Imagine”
- Max 3 sentences per paragraph
Voice learnings from outreach refinement (Appendix A of voice profile):
- “Heya” is default casual greeting (not “Hey Brother” for all)
- Extends warm openers with relatable cultural comments connected by semicolons
- Soft hedges: “if you happen to know of” not “Do you know of”
- Reciprocal closing: invites recipient to share expertise, not just do a favor
- One ask per message
- Autobiographical anchoring: “went back to my roots”
- “By far” as authentic intensifier
- No referral fees or “zone of genius” in warm outreach
Critical Technical Requirements
Encoding Gotcha
LinkedIn exports may use ANSI/Latin-1 encoding, NOT UTF-8. European characters, accented names, smart quotes will be garbled if read as UTF-8.
- Use
chardetlibrary to auto-detect encoding on first read - Convert all outputs to plain UTF-8 (no BOM) for ERPNext/Baserow compatibility
- For Excel-labeled CSVs, use
utf-8-sig(BOM) separately
CSV Parsing
- Use stdlib
csv.readerfor ALL Stage 1 parsers. Consistent, fewer dependencies. Always open withnewline=''. - Pandas only for Stage 2 analysis scripts.
csv.readerhandles doubled double-quotes ("") natively per RFC 4180.
Output CSV Robustness
- All output CSVs:
csv.writerwithquoting=csv.QUOTE_ALL, plain UTF-8 encoding - Test that output opens cleanly in Excel AND imports to ERPNext without encoding errors
LinkedIn URL Normalization
This is the join key across ALL files. Normalize before any matching:
def normalize_linkedin_url(url: str) -> str:
if not url:
return ""
url = url.strip().rstrip("/").lower()
if not url.startswith("http"):
url = "https://" + url
url = url.split("?")[0].split("#")[0]
return urlNo External API Calls
Stages 1-2 are fully local/offline. No web requests, no API calls. All analysis uses the CSV files directly.
Existing Codebase Patterns
The project has 3 existing pipelines in solanasis-scripts/ with identical architecture. Use as templates:
foundation-pipeline/(17 scripts, 62 tests) — most maturemsp-pipeline/(13 scripts)fcto-pipeline/(9 scripts)
Pattern: config.py → parse → score → generate outreach → migrate to CRM
foundation-pipeline/config.py is the structural template. It defines:
- Path constants (
PIPELINE_DIR,DATA_DIR,RAW_DIR,INTERMEDIATE_DIR,OUTPUT_DIR) - Scoring model with positive/negative signals and point values
- Tier thresholds (A: 30+, B: 20-29, C: 10-19, D: <10)
- CRM output column definitions
- Intermediate and output file path constants
All scripts use --plan (dry run, show what would happen) and --run (execute) modes where applicable.
Implementation Plan: 14 Scripts Across 3 Stages
Stage 1: Parse & Clean (CSV/JSON intermediaries)
All scripts go in solanasis-scripts/linkedin-pipeline/.
Script 1: config.py
LINKEDIN_EXPORT_DIRpointing to the Complete exportDMITRI_PROFILE_URL = "https://www.linkedin.com/in/dmitri-sunshine"- Path constants:
DATA_DIR,INTERMEDIATE_DIR,OUTPUT_DIR,REPORT_DIR - Scoring weights for warmth and strategic value (see Script 8)
- ICP keyword lists for segment tagging: foundation, nonprofit, wealth-mgmt, fractional-exec, tech, startup, coliving-community, boulder-local, connector, spiritual-wellness
- Spam detection patterns (for flagging, not filtering)
- Topic keyword lists for engagement tagging: ai/tech, community/coliving, spirituality/wellness, entrepreneurship, philanthropy/nonprofit, cybersecurity/ops
detect_encoding()helper usingchardet
Script 2: privacy_filter.py
Utility module:
is_spam_message(content)— keyword heuristics (unsubscribe, free training, limited time, etc.). Flags but does NOT delete.redact_pii_for_export(text)— optional redaction for any future external sharing (not used by default)- Full message text is preserved throughout the pipeline. Nothing is stripped or filtered out.
Script 3: parse_connections.py
- Input:
Connections.csv(skip first 3 lines, CSV header at line 4) - Output:
output/connections_clean.csv - Logic: Normalize LinkedIn URLs, parse “11 Mar 2026” dates to ISO 8601, split name fields, handle credentials in names (e.g.,
"Beckmann, RFC®"), addconnection_age_dayscomputed column
Script 4: parse_messages.py
- Input:
messages.csv(use stdlibcsv.reader, NOT pandas, NOT line counting) - Output:
output/messages_parsed.json(grouped by conversation),output/message_stats_per_contact.csv - Logic:
- Group by
CONVERSATION ID, sort by date within each conversation - Identify Dmitri by
SENDER PROFILE URLmatching his profile URL - Per-contact stats: message count total/from-dmitri/to-dmitri, date range, direction ratio, response latency
- Filter out drafts (
IS MESSAGE DRAFT= “Yes”) - Flag spam messages (using
privacy_filter.is_spam_message()) - Extract greeting patterns from Dmitri’s sent messages (first line regex)
- Group by
Script 5: parse_invitations.py
- Input:
Invitations.csv - Output:
output/invitations_clean.csv - Logic:
- Normalize URLs from both
inviterProfileUrlandinviteeProfileUrl - Parse dates (“3/10/26, 1:02 PM” format) to ISO 8601
- Classify direction (OUTGOING vs INCOMING)
- Detect templates using
difflib.SequenceMatcher(threshold 0.85) - The “Hey Brother” template is dominant; quantify usage count, variants, and personalization spots
- Normalize URLs from both
Script 6: parse_engagement.py
- Input:
Comments.csv,Shares.csv,Reactions.csv,InstantReposts.csv - Output:
output/engagement_activity.csv(unified timeline),output/engagement_by_topic.json - Logic:
- Merge all engagement into single timeline with unified schema: date, activity_type, content_text, target_url, reaction_type, word_count, topic_tags
- Extract text from comments and share commentary
- Compute word count, sentence count for text content
- Tag reaction types
- Apply topic keyword lists to classify engagement by theme
Script 7: parse_profile_data.py
- Input:
Profile.csv,Positions.csv,Skills.csv,Endorsement_Received_Info.csv,Ad_Targeting.csv,Inferences_about_you.csv,Member_Follows.csv,Company Follows.csv,Events.csv,Learning.csv,Ads Clicked.csv,Registration.csv,Receipts_v2.csv - Output:
output/profile_summary.json,output/ad_targeting_parsed.json,output/follows_and_interests.csv,output/career_timeline.json - Logic:
- Parse Ad_Targeting’s 34 columns (semicolon-delimited lists within each cell)
- Parse Inferences (7 rows)
- Merge Member_Follows (2,030) + Company Follows (299) into single list with type column
- Parse Positions into career timeline JSON
- Parse Events and Learning for interest signals
- Include Registration date for account maturity context
Script 8: build_unified_contacts.py (the critical merge)
- Input: All Stage 1 outputs
- Output:
output/contacts_unified.csv,output/contacts_unified.json - Join key: normalized
linkedin_url - Merge: Connections (base) + left-join Invitations + left-join Message stats + left-join Follows
- Orphan handling: people in messages/invitations but NOT in connections →
non_connection_interactions.csv
Warmth scoring (algorithmic):
- +20 bidirectional messages exist
- +15 if 5+ messages exchanged
- +10 message exchange in last 30 days
- +8 message exchange in last 90 days
- +5 Dmitri sent invitation (proactive interest)
- +5 they sent invitation (they sought Dmitri out)
- +3 Dmitri follows them
- -10 connected 1yr+ with zero messages
- -5 only spam messages from this contact
Strategic value scoring:
- +15 title matches ICP (CEO/COO/CTO/ED at nonprofits/foundations/growing orgs)
- +10 company matches target vertical (foundation, nonprofit, wealth management, financial services)
- +8 position is fractional C-suite (fCTO, fCOO, fCIO, fractional)
- +5 company location includes Colorado/Boulder/Denver
- +5 connector role (investor, advisor, community builder, board member)
- +3 engaged recently (message or invitation in last 60 days)
Tiers: A (50+), B (30-49), C (15-29), D (<15)
Segment tags: nonprofit, foundation, wealth-mgmt, tech, startup, coliving-community, fractional-exec, boulder-local, connector, spiritual-wellness
Unified contact schema (one row per person, maximum data):
linkedin_url, first_name, last_name, full_name, email, company, position,
connected_on, connection_age_days,
invitation_direction, invitation_message, invitation_date, invitation_accepted,
message_count_total, message_count_from_dmitri, message_count_to_dmitri,
first_message_date, last_message_date, last_message_direction,
avg_response_time_hours, longest_conversation_length, conversation_count,
last_message_content_preview,
dmitri_follows_them, they_follow_dmitri,
warmth_score, strategic_value_score, relationship_tier,
relationship_decay_flag, days_since_last_contact,
segment_tags, icp_match_details, notes
Also produce per-contact conversation export (output/conversations/):
- One JSON file per A/B-tier contact: full conversation threads with timestamps
- For C/D-tier: summary JSON with metadata only + last 3 messages
Stage 2: Analysis Reports
Output to solanasis-docs/linkedin-analysis/ as markdown.
Script 9: analyze_network.py
Report: network-intelligence-report.md
- Network composition: industry breakdown (keyword-based from company/position), role seniority distribution, geographic distribution
- Top 25 most-engaged contacts (by warmth score) with commonalities
- Network gaps vs. Solanasis target verticals: foundation execs, wealth management, MSPs, fractional CTOs. What % of network is in each vertical?
- ICP validation: Cross-reference actual connection industries with declared ICP
- Referral partner candidates: fractional execs/consultants/advisors/attorneys/accountants with high warmth + matching segments
- Dormant gold: high strategic value but low warmth (connected, never messaged; target vertical people)
- Relationship decay list: contacts with past high engagement (5+ messages) but no contact in 90+ days
- Connection velocity: connections per week over time, identify networking surges
- Invitation effectiveness: acceptance rate by template variant, by personalization level
- Company Follows alignment: 299 followed companies mapped to Solanasis verticals
Script 10: analyze_gtm_alignment.py
Report: gtm-alignment-report.md
- Attention audit: topic breakdown of comments/shares/reactions vs. Solanasis positioning needs
- Ad Targeting decode: Parse all 34 columns; what LinkedIn thinks Dmitri is and what interests are inferred
- Inferences analysis: Flag misalignment with Solanasis brand (e.g., job seeking=true)
- Content engagement gap: quantify % of engagement on cybersecurity vs. community vs. AI vs. other
- Warm introduction map: for each target vertical, list warmest contacts who could provide intros
- Company Follows as ICP signal: 299 companies mapped to verticals
- Member Groups inventory: list all ~25 groups, classify GTM relevance
- Actionable recommendations: specific, numbered actions
Script 11: analyze_voice.py
Report: voice-analysis-report.md (Python-only, no LLM needed)
- Comment voice profile (225 comments, 5 years):
- Avg length, word count distribution, exclamation/question frequency
- Top 30 words/phrases (TF-IDF), vocabulary richness (type-token ratio)
- Intensifier usage: “by far”, “so”, “really”, “super”, “amazing”, “love it”
- Softener usage: “happen to know”, “would love to”, “I think”, “I believe”
- Share commentary voice (258 shares):
- Similar metrics + topic framing analysis
- Signature transitions: “Here’s the thing…”, “So…”, “Which is why…”
- Pro-tip pattern detection
- DM voice profile (~3,000 messages, Dmitri’s sent only):
- Greeting distribution: “Hey Brother” X%, “Hey Sister” Y%, “Heya” Z%, “Aloha” W%
- Sign-off patterns, meeting link inclusion rate
- CTA patterns, response latency
- Invitation template analysis (1,549 invitations):
- Cluster using
difflib.SequenceMatcher(0.85 threshold) - Each variant: verbatim text, usage count, date range, personalization spots
- Acceptance rate per template (cross-ref Invitations → Connections)
- Gender-specific greeting analysis
- Cluster using
- Voice validation against documented profile:
- Em dash frequency (should be ~0), semicolon frequency, banned words scan
- Compare against Appendix A findings
- 5-year voice evolution (2021→2026):
- 2021: Brief acknowledgments
- 2023: Curious questions, learning posture
- 2025: Tech-forward, opinionated
- 2026: Consultative, expert-level, Solanasis-branded
- Quantify: avg comment length, question ratio, vocabulary complexity by year
- Sentiment analysis using VADER (
nltk.sentiment.vader, purpose-built for short social text)
Script 12: generate_voice_enrichment_queue.py
- Output:
output/voice_llm_queue.json(prompt-ready samples for Claude analysis) - Selects: all 225 comments, all share commentaries, top 50 conversations (Dmitri’s messages only), all unique invitation templates
- For future LLM-assisted analysis: tone classification, qualitative insights, voice profile extension
Stage 3: ERPNext Integration
Script 13: prepare_erpnext_import.py
- Input:
contacts_unified.csv - Output:
output/erpnext_leads.csv,output/erpnext_contacts.csv - Mapping:
- Connection with messages exchanged → Contact (linked to Lead)
- Connection, no interaction → Lead (status: Open)
- Connection in target segment → Lead (status: Qualified)
- Non-connection → Lead (status: Open, lower priority)
--planmode: print counts and samples, no file creation--runmode: write ERPNext Data Import-ready CSVs (UTF-8, standard format per Frappe docs)- Full message text included for A/B-tier; metadata + last messages for C/D-tier
ERPNext import format (per Frappe docs):
- Standard CSV upload via Data Import tool
- Download template for Lead/Contact doctypes, fill in, upload
- Must be UTF-8 encoded
- ID column left blank for new records (auto-generated)
- No hard row limit but recommend a few thousand at a time
Script 14: push_to_erpnext.py (stub only)
- Placeholder with Frappe REST API patterns in comments
- Auth env vars in
.env.example:ERPNEXT_URL,ERPNEXT_API_KEY,ERPNEXT_API_SECRET - Will be fleshed out when ERPNext at db.zasage.us has API credentials confirmed
Directory Structure
solanasis-scripts/linkedin-pipeline/
config.py
privacy_filter.py
parse_connections.py
parse_messages.py
parse_invitations.py
parse_engagement.py
parse_profile_data.py
build_unified_contacts.py
analyze_network.py
analyze_gtm_alignment.py
analyze_voice.py
generate_voice_enrichment_queue.py
prepare_erpnext_import.py
push_to_erpnext.py
requirements.txt
.env.example
data/
intermediate/ # Working files (git-ignored)
output/ # Final CSV/JSON (git-ignored)
tests/
test_parse_*.py
test_privacy_filter.py
test_scoring.py
solanasis-docs/linkedin-analysis/
network-intelligence-report.md
gtm-alignment-report.md
voice-analysis-report.md
requirements.txt
chardet
python-dateutil
pandas>=2.0
nltk
httpx
python-dotenv
Implementation Order
Build in this order, testing each before proceeding:
- Phase A (foundation):
config.py,privacy_filter.py,parse_connections.py - Phase B (core parsers):
parse_invitations.py,parse_engagement.py,parse_messages.py,parse_profile_data.py - Phase C (the merge):
build_unified_contacts.py— spot-check known contacts - Phase D (analysis):
analyze_network.py,analyze_gtm_alignment.py,analyze_voice.py,generate_voice_enrichment_queue.py - Phase E (integration):
prepare_erpnext_import.py,push_to_erpnext.py(stub)
Verification Checklist
-
parse_connections.pyproduces exactly 2,500 rows -
parse_messages.pyproduces ~2,700-3,000 logical messages (NOT 9,110) -
parse_invitations.pyproduces ~1,549 rows -
parse_engagement.pyproduces ~1,085 rows (225 comments + 258 shares + 602 reactions merged) -
build_unified_contacts.pyspot-check: Ian Crafford, Tim Lockie, Kevin Roerty all present with correct merged data from connections + invitations + messages - Tier distribution is roughly Pareto (few A-tier, more B, most C/D)
- Voice report shows “Hey Brother” as dominant invitation greeting
- All output CSVs open cleanly in Excel and import to ERPNext without encoding errors
- No PII (phone numbers, emails) leaked into git-tracked files
- All intermediate/output directories are git-ignored