I am sure you have had this problem before, especially when your company decides to implement a CRM like Salesforce using existing databases. Finding the correct name of a client or unifying similar names can be a nightmare. This article explains how AI can help you.
Jorge, thanks for sharing this interesting article. The article describes a real-world case study in which Tobias Zwingmann and his team tackled a seemingly hopeless data-cleaning task—standardizing over 50,000 customer records riddled with inconsistencies in company names. The data came from registration forms for professional events, and participants had entered company names manually, creating a chaotic mix like “SIEMENS”, “Siemens AG”, “Siemens Deutschland”, and many others.
The Challenge
Nature of the data: The dataset originated from event registrations where attendees typed their company names freely.
Problems:
Many entries had missing or inconsistent company names.
Numerous variations of the same company appeared.
Human-entered data was full of typos, alternative spellings, and structural inconsistencies.
Initial estimate: Manual cleanup would take a month and cost around $10,000 in labor.
Risk: There was no way to know whether the analysis would yield useful insights until the data was clean.
The AI-Powered Cleaning Strategy
Rather than abandon the project or burn weeks on manual labor, the team took a staged approach that combined external data, traditional algorithms, and lightweight AI tools. The process was broken into three stages, each reducing the complexity of the next:
1. External Reference Matching
Goal: Match company names to reliable external sources using business domains and websites.
Data providers: Services like Apollo were used to pull accurate company names based on domain names.
Outcome: Automatically resolved 40% of the dataset—entries with clearly identifiable business domains.
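A minimal sketch of what this first stage could look like, assuming each registration carries an email address and that a domain-to-name table has already been pulled from a data provider. The table contents, the free-mail list, and the function names are all hypothetical; the article does not show the team's actual lookup code or any provider API.

```python
from typing import Optional

# Hypothetical reference table. In practice this would be filled from a
# data-provider lookup keyed by domain; the entry here is illustrative only.
REFERENCE_NAMES = {
    "siemens.com": "Siemens AG",
}

# Free-mail domains say nothing about the employer, so they are skipped.
FREE_MAIL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com"}


def extract_domain(email: str) -> Optional[str]:
    """Return the lower-cased domain of an email address, or None."""
    _, sep, domain = email.strip().lower().rpartition("@")
    if not sep or "." not in domain:
        return None
    return domain


def match_by_domain(email: str) -> Optional[str]:
    """Return the reference company name for an email, or None if unknown."""
    domain = extract_domain(email)
    if domain is None or domain in FREE_MAIL_DOMAINS:
        return None
    return REFERENCE_NAMES.get(domain)
```

Records that come back as None simply fall through to the next stage, which is what keeps the later, more expensive steps small.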
2. Similarity Detection (Non-AI)
Technique: Used algorithms like Levenshtein distance to identify near-duplicate names.
Use case: Detected variations like “Siemens AG” and “SIEMENS” as likely referring to the same entity.
Purpose: Clustered similar entries together without yet using AI, relying only on string-distance comparisons between names.
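To make the similarity stage concrete, here is a small sketch of Levenshtein distance with a naive greedy clustering on top. The normalization step (lower-casing, collapsing whitespace), the 0.6 threshold, and the first-member comparison are assumptions chosen for illustration, not details taken from the article.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so 'SIEMENS' matches 'siemens'."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Scale the distance into a 0..1 score (1.0 means identical)."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster(names: List[str], threshold: float = 0.6) -> List[List[str]]:
    """Greedy clustering: join the first group whose first member is close."""
    clusters: List[List[str]] = []
    for name in names:
        for group in clusters:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

With these settings, "SIEMENS" and "Siemens AG" land in the same cluster, while "Siemens Deutschland" does not, because the extra word makes the edit distance too large. Cases like that are the ones left over for the judgment-based stage that follows.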
3. AI Harmonization
Tool: A small LLM (o3-mini) was employed to make nuanced judgment calls.
Inputs: Company name, industry, country, and other contextual info.
Function: Determined whether pairs of entries referred to the same company or not.
Processing speed: Over 10,000 records were evaluated in under an hour.
Final rule: Retain the external reference name when available, or the most complete version of the name otherwise.
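The pairwise check and the final naming rule could be sketched roughly as follows. Everything here is an assumption made for illustration: the Record fields, the prompt wording, and the `ask_model` callable, which stands in for whatever LLM client is used (no real API is called). "Most complete" is read here as "longest name", which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Record:
    name: str
    industry: str = ""
    country: str = ""
    reference_name: Optional[str] = None  # set when stage 1 found a match


def build_prompt(a: Record, b: Record) -> str:
    """Assemble a yes/no question from the contextual fields of two entries."""
    return (
        "Do these two entries refer to the same company? Answer YES or NO.\n"
        f"A: name={a.name!r}, industry={a.industry!r}, country={a.country!r}\n"
        f"B: name={b.name!r}, industry={b.industry!r}, country={b.country!r}"
    )


def same_company(a: Record, b: Record, ask_model: Callable[[str], str]) -> bool:
    """ask_model is a hypothetical stand-in: prompt text in, answer text out."""
    answer = ask_model(build_prompt(a, b))
    return answer.strip().upper().startswith("YES")


def canonical_name(group: List[Record]) -> str:
    """Prefer an external reference name; otherwise take the longest name."""
    for record in group:
        if record.reference_name:
            return record.reference_name
    return max(group, key=lambda r: len(r.name)).name
```

Passing the model in as a plain callable keeps the sketch testable with a stub and avoids tying it to any particular provider.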
Each stage made the next more efficient. By progressively narrowing the problem scope, AI could focus on the hardest, most judgment-based decisions.
The Results
Time saved: The entire process took 6.5 hours, compared with an estimated month of manual work.
Cost saved: Avoided roughly $10,000 in manual labor costs.
Insights unlocked:
Identified key companies that sent the most attendees.
Tracked returning companies and revealed industry-specific attendance patterns.
Discovered hidden gems—small but influential companies that had gone unnoticed.
Business impact: The cleaned data enabled better event planning, more targeted marketing, and valuable follow-ups with high-profile clients.
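Once every registration carries one canonical company name, insights like the ones listed above reduce to simple counting. The following is only a sketch on made-up rows with fictional company names; the event labels and the tuple layout are assumptions, not the article's data.

```python
from collections import Counter, defaultdict

# Made-up cleaned registrations: (event label, canonical company name).
registrations = [
    ("2023", "Acme GmbH"),
    ("2023", "Acme GmbH"),
    ("2023", "Example AG"),
    ("2024", "Acme GmbH"),
    ("2024", "Sample Ltd"),
]

# Companies that sent the most attendees overall.
attendees_per_company = Counter(company for _, company in registrations)

# Companies that appear at more than one event (returning companies).
events_per_company = defaultdict(set)
for event, company in registrations:
    events_per_company[company].add(event)

returning = sorted(
    company for company, events in events_per_company.items() if len(events) > 1
)
```

The same counts on the raw, uncleaned names would split one company across several spellings, which is why the cleaning had to come first.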
Lessons and Takeaways
AI’s true role: AI didn’t do all the work—it handled the “last mile” that humans or basic algorithms couldn’t easily solve.
Start with messy data: The article challenges the belief that clean data must come before AI. Instead, it argues that AI can produce clean data.
AI for AI: Large AI labs already use similar techniques—applying AI to refine their own training datasets.
Strategic advice: The ugliest, most painful datasets often hold the most value. Rather than delay analysis for the sake of cleaning, organizations should use AI to extract insights from the mess.
Don’t wait for perfect data to embrace AI. Start with your messiest, most frustrating dataset—the one that makes your team groan. That’s where the clearest, most immediate wins with AI lie.