<i>An anonymized case study: we helped a Maitland HVAC company clean up 80,000 customer records using AI embeddings, fuzzy matching, and smart human review—cutting duplicate emails and saving 12 hours of manual work per week.</i>
(Client details are anonymized and some specifics composited at the client’s request.)
I got a call from a friend who runs a mid-sized HVAC company in Maitland. They had three locations, a dozen service trucks, and a customer database that’d been collecting dust—and duplicates—for over a decade. “Every time we send an email promotion,” he said, “half the customers get it twice. And the other half get it to the wrong address. It’s embarrassing.”
He wasn’t exaggerating. Their CRM held just over 80,000 records, but after a quick scan, I estimated at least 15–20% were duplicates or near-duplicates. Some had typos in names, others had swapped street and city fields, and many had multiple entries for the same person with slightly different phone numbers. The result: double-sent emails, confused customers, and a marketing team that spent hours each week manually reconciling lists.
This is the story of how we cleaned it up—using AI embeddings, careful fuzzy matching, and a deliberate human-in-the-loop process. It’s not glamorous. But it saved them about 12 hours of manual work per week and stopped the double-sends almost overnight.
The Situation: A Messy Database That Grew Organically
Like many small businesses, this HVAC company had never had a dedicated data person. Records came from multiple sources: their dispatch software, a legacy Excel sheet, paper invoices that were manually entered, and an old email marketing tool. Over time, the same customer might be entered as “John Smith,” “Jon Smith,” “John Smithe,” or “John Smith (Maitland office).” Addresses were all over the place: “123 Main St” vs. “123 Main Street” vs. “123 Main St, Maitland, FL.” Phone numbers had dashes, no dashes, or area codes missing.
Their marketing team was spending about 15 hours a week trying to merge duplicates by hand. They’d export lists, sort by name, and manually decide if two records were the same person. It was tedious, error-prone, and they still missed a lot. The worst part? They had no confidence in their email lists, so they often sent campaigns to the entire database, knowing full well some people would get two copies.
What They’d Tried Before: Simple Dedup Tools
Before calling me, they’d tried a few off-the-shelf dedup tools built into their CRM. Those tools used exact matching on email or phone—if the email was exactly the same, it flagged a duplicate. But most of their duplicates had different emails (work vs. personal) or different phone numbers (cell vs. home). Exact matching caught maybe 5% of the duplicates. They also tried a simple fuzzy match in Excel using the “similar” function, but it produced so many false positives they gave up.
They needed something smarter. Something that could understand that “John Smith at 123 Main St” and “Jon Smithe at 123 Main Street” are probably the same person, even if the email and phone are different. That’s where AI embeddings came in.
The AI Work: Embeddings, Fuzzy Matching, and a Smart Pipeline
We built a dedup pipeline using a combination of techniques. Here’s the plain-English version of what we did:
1. Cleaning and Normalization
First, we standardized the data. We stripped extra spaces, removed punctuation from phone numbers, expanded common abbreviations (“St” → “Street”, “Ave” → “Avenue”), and converted everything to lowercase. We also parsed addresses into components (street, city, state, zip) using a simple address parser. This step alone eliminated about 10% of obvious duplicates.
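To make that step concrete, here is a minimal sketch of the kind of normalization described above. It is illustrative, not the project's actual code: the function names and the abbreviation table (beyond "St" and "Ave") are assumptions, and address parsing into components is left out.

```python
import re

# Hypothetical abbreviation table. Only "st" and "ave" come from the write-up;
# the other entries are illustrative assumptions.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "dr": "drive"}


def normalize_phone(raw: str) -> str:
    """Keep digits only, so '(407) 555-0100' and '407.555.0100' compare equal."""
    return re.sub(r"\D", "", raw)


def normalize_text(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, expand abbreviations."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = cleaned.split()  # split() also collapses runs of whitespace
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)
```

With this, "123  Main St." and "123 Main Street" both come out as "123 main street". A blanket "st" to "street" rule will also mangle names like "St Cloud", so a real table needs more care than this one.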
2. Generating Embeddings for Each Record
We used a pre-trained sentence transformer model (all-MiniLM-L6-v2) to convert each customer record into a vector—a mathematical representation that captures the “meaning” of the text. For each record, we created a combined string of name, address, phone, and email, then generated a 384-dimensional embedding. Similar records would have similar vectors.
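As a rough sketch of that step, the code below builds the combined string and encodes it with the sentence-transformers library. The field names, their order, the " | " separator, and both function names are assumptions for illustration. The write-up only says the fields were combined into one string.

```python
def record_to_text(record: dict) -> str:
    """Join the matching fields into one string, skipping blank or missing values."""
    fields = ("name", "address", "phone", "email")  # assumed field names and order
    values = (str(record.get(field) or "").strip() for field in fields)
    return " | ".join(value for value in values if value)


def embed_records(records: list[dict]):
    """Return one 384-dimensional vector per record.

    Requires the third-party sentence-transformers package, imported lazily
    so the helper above stays usable without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [record_to_text(record) for record in records]
    # Unit-length vectors make the dot product equal to cosine similarity.
    return model.encode(texts, normalize_embeddings=True)
```

Skipping blank fields keeps a record with no email from looking artificially similar to every other record with no email.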
3. Finding Candidate Pairs with Approximate Nearest Neighbor
We couldn’t compare every record to every other record (that’d be 3.2 billion comparisons for 80,000 records). Instead, we used an approximate nearest neighbor library (FAISS) to find the top 20 most similar records for each record. This brought the candidate pairs down to about 1.6 million—still a lot, but manageable.
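The arithmetic and the search step can be sketched as follows. The index type (HNSW), its setting of 32 links per node, and the function names are assumptions. The write-up names FAISS but not the index we configured.

```python
from math import comb


def unordered_pairs(neighbor_ids: list[list[int]]) -> set[tuple[int, int]]:
    """Collapse per-record neighbor lists into unique (low, high) index pairs."""
    pairs = set()
    for i, neighbors in enumerate(neighbor_ids):
        for j in neighbors:
            if j != i and j >= 0:  # skip self-matches and FAISS's -1 padding
                pairs.add((min(i, j), max(i, j)))
    return pairs


def top_k_neighbors(embeddings, k: int = 20) -> list[list[int]]:
    """Approximate top-k search with a FAISS HNSW index (needs faiss and numpy)."""
    import faiss  # third-party, imported lazily
    import numpy as np

    vectors = np.ascontiguousarray(embeddings, dtype="float32")
    index = faiss.IndexHNSWFlat(vectors.shape[1], 32)  # 32 links per node: assumed setting
    index.add(vectors)
    # Ask for k + 1 results because each record's nearest neighbor is usually itself.
    _, ids = index.search(vectors, k + 1)
    return ids.tolist()


print(comb(80_000, 2))  # all-pairs comparisons: 3199960000
print(80_000 * 20)      # upper bound on candidate pairs at k = 20: 1600000
```

The 1.6 million figure is an upper bound. When record A lists B and B lists A, `unordered_pairs` counts that as a single pair, so the number actually scored is somewhat lower.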
4. Scoring with Fuzzy Matching Rules
For each candidate pair, we computed a similarity score using a weighted combination of: cosine similarity of the embeddings (40% weight), Jaro-Winkler similarity on names (30%), token-set ratio on addresses (20%), and exact match on phone/email (10%). We tuned these weights by manually labeling a sample of 500 pairs. Pairs with a score above 0.85 were automatically merged. Pairs between 0.70 and 0.85 were flagged for human review. Below 0.70, we ignored them.
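Here is a small sketch of the weighting and routing logic. It takes the four component similarities as inputs rather than computing them, and it assumes each one is already scaled to the 0 to 1 range. The names are hypothetical, and treating scores of exactly 0.70 or 0.85 as "review" is an assumption, since the write-up does not say which side the boundaries fall on.

```python
# Weights and thresholds as described in the write-up.
WEIGHTS = {"embedding": 0.40, "name": 0.30, "address": 0.20, "contact": 0.10}
AUTO_MERGE_ABOVE = 0.85
REVIEW_FROM = 0.70


def combined_score(sims: dict) -> float:
    """Weighted sum of component similarities, each expected in [0, 1]."""
    return sum(weight * sims[key] for key, weight in WEIGHTS.items())


def route(score: float) -> str:
    """Map a combined score to 'merge', 'review', or 'ignore'."""
    if score > AUTO_MERGE_ABOVE:
        return "merge"
    if score >= REVIEW_FROM:  # boundary handling is an assumption
        return "review"
    return "ignore"


# Example pair: near-identical name and address, but no shared phone or email.
example = {"embedding": 0.92, "name": 0.88, "address": 1.0, "contact": 0.0}
decision = route(combined_score(example))
```

The example pair scores about 0.83, so it goes to human review rather than being merged automatically. That is the intended effect of giving contact details only 10% of the weight: a missing phone or email match cannot sink a pair on its own, but it can keep it out of the automatic bucket.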
5. Human Review of the Gray Area
About 8,000 pairs fell into the 0.70–0.85 range. We built a simple web interface where their office manager could review each pair side by side: names, addresses, phones, emails, service history. She could click “Merge,” “Skip,” or “Maybe.” The “Maybe” ones we reviewed together. This took about 6 hours total over two weeks—much less than the 15 hours per week they were spending before.
6. Merging and Deduplication
For automatically merged pairs, we kept the most complete record (the one with the most filled fields) and merged service history from both. We also flagged any conflicting fields (e.g., two different phone numbers) and kept both in a notes field. The final result: about 68,000 unique customers, down from 80,000.
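The merge rule can be sketched like this. The field names ("notes", "service_history") and the note format are hypothetical, and filling a blank field from the less complete record is an assumption that goes slightly beyond what is described above.

```python
SPECIAL_FIELDS = ("notes", "service_history")


def filled_count(record: dict) -> int:
    """Number of non-empty ordinary fields, used to pick the more complete record."""
    return sum(1 for key, value in record.items() if key not in SPECIAL_FIELDS and value)


def merge_records(a: dict, b: dict) -> dict:
    """Keep the more complete record and preserve the other's conflicting values in notes."""
    primary, other = (a, b) if filled_count(a) >= filled_count(b) else (b, a)
    merged = dict(primary)
    notes = [rec["notes"] for rec in (primary, other) if rec.get("notes")]

    for key, value in other.items():
        if key in SPECIAL_FIELDS or not value:
            continue
        if not merged.get(key):
            merged[key] = value                  # fill a gap (assumed behavior)
        elif merged[key] != value:
            notes.append(f"alt {key}: {value}")  # conflict: keep both values

    merged["service_history"] = (
        list(primary.get("service_history", [])) + list(other.get("service_history", []))
    )
    merged["notes"] = "; ".join(notes)
    return merged
```

Nothing is thrown away here: the losing phone number or email ends up in the notes field, so a wrong merge can still be untangled later by hand.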
Where We Kept a Human in the Loop (And Why)
We deliberately did not automate the entire process. The gray area, meaning pairs where the AI was uncertain, needed human judgment. A father and son with the same name and address but different phones could be two different people. The AI couldn’t know that. The office manager, who knew many customers by name, could spot those cases immediately. We also kept a human review step for any merge that involved a customer with more than 5 service visits on record, to avoid accidentally combining two different accounts.
This human-in-the-loop approach added about 6 hours of work, but it prevented maybe 50–100 incorrect merges. In a business where customer relationships matter, that’s worth it.
The Measured Results
After the dedup, we ran a test email campaign to the cleaned list. The bounce rate dropped from 8% to 2%. The number of “unsubscribe” complaints fell by half. Their marketing team reported that they no longer saw the same customer appear twice in their export. They estimated they saved about 12 hours per week on manual list cleaning—time they now spend on actual marketing strategy.
More importantly, the office manager said, “I can finally trust the data.” That trust is hard to quantify, but it’s the foundation for everything else—from targeted promotions to accurate service reminders.
“We used to dread email blasts because we knew we’d get angry calls from customers who got two copies. Now, we don’t even think about it.” – Office manager, Maitland HVAC company
What We’d Do Differently (Honest Caveats)
This project wasn’t perfect. A few things we’d change if we did it again:
- Start with a data audit. We spent a lot of time cleaning data that was never going to be useful (e.g., records from 2005 with no phone or email). A quick audit upfront could’ve saved us a few hours.
- Use a better address parser. Some addresses were so mangled that even a human couldn’t decipher them. A more sophisticated address parsing library (like libpostal) would’ve helped considerably.
- Involve the marketing team earlier. We built the review interface without much input from the people who’d use it. A few tweaks (like showing service history more prominently) would’ve made their job easier.
Also, embeddings aren’t magic. They work well for text similarity, but they can be fooled by very short records (e.g., just a name and city). For those, we relied more on fuzzy string matching. The combination of both techniques was key.
The Bigger Lesson: Data Cleaning Is a Prerequisite for AI
This project wasn’t about implementing a flashy AI chatbot or a voice agent. It was about getting the fundamentals right. If you want to use AI for anything—personalized marketing, predictive maintenance, customer segmentation—you need clean, deduplicated data. Garbage in, garbage out, as they say.
For this HVAC company, the dedup project opened the door to more advanced AI use cases. They’re now looking at using the same embeddings to power a customer search tool for their dispatchers, and they’re considering an AI voice agent to handle after-hours calls. But none of that would work if the database still had 12,000 duplicates.
If you’re a small business owner in Central Florida with a messy customer database, you’re not alone. Most companies that’ve been around for a few years have this problem. The good news is that it’s solvable—and you don’t need a data science team to do it. A focused project with the right tools and a bit of human oversight can clean things up in a few weeks. Once it’s clean, you can start using your data for real business growth.
Want to see if your own data is ready for AI? Start with a free AI readiness assessment. We’ll look at your data quality, your workflows, and where a little automation could save you time and money. No buzzwords, just honest advice.
Frequently asked questions
What are embeddings and how do they help with deduplication?
Embeddings are mathematical representations of text that capture meaning. For dedup, we convert each customer record into a vector. Similar records have similar vectors, so we can find potential duplicates by comparing vector distances. This catches fuzzy matches that exact matching misses.
How long did the dedup project take?
The entire project took about 3 weeks from start to finish: 1 week for data cleaning and pipeline setup, 1 week for running the matching and human review, and 1 week for final merges and validation. The human review part took about 6 hours total.
Why not just use a commercial dedup tool?
Commercial tools often rely on exact matching or simple fuzzy logic. They work well for clean data, but our client's data had many inconsistencies (typos, swapped fields, multiple email addresses). Combining pre-trained embeddings with fuzzy-matching weights tuned on their own records gave us better accuracy for their messy dataset.
What percentage of records were duplicates?
We found about 15% of the 80,000 records were duplicates or near-duplicates. After merging, the database went from 80,000 to 68,000 unique customers. That's 12,000 duplicates removed.
How much did this cost?
Costs vary, but for a project of this size, expect to invest a few thousand dollars in consulting time and tools. The client recouped that in saved labor within a few months, not to mention improved email deliverability.
Do I need clean data before using AI for my business?
Yes, absolutely. AI models are only as good as the data you feed them. If your customer database has duplicates, errors, or missing fields, any AI application—whether it's a chatbot, recommendation engine, or predictive model—will produce unreliable results. Cleaning data first is a smart investment.
Ready to talk it through?
Send a one-line description of what you are trying to do, by email or through the form. I will reply within one business day with a plain-English next step.