*We helped a DeLand general contractor stop digging through filing cabinets by building a document-extraction pipeline that reads scanned permits and inspection reports, outputs structured data, and saves 15 hours a week. Here's exactly how we did it.*
(Client details are anonymized and some specifics composited at the client’s request.)
I walked into a small office off Highway 17 in DeLand, and the first thing I saw was a wall of filing cabinets. Four of them, two-drawer, beige, labeled with faded Sharpie. The owner—let’s call him the GC—gestured at them and said, “That’s my permit history. Every job, every inspection, every sign-off. If I need to find something, I pull a drawer.”
He wasn’t complaining. He was proud of his system. But he also knew it was breaking. His crew had grown from 8 to 25 people in three years. He was bidding on bigger commercial jobs in Volusia County, and each bid required pulling past permit data to prove compliance. The filing cabinets weren’t scaling.
This is the story of how we built a document-extraction pipeline that reads scanned permits and inspection reports, turns them into structured data, and retired the filing cabinet for good.
The Situation: What Was Breaking
The GC’s company did ground-up construction and large remodels—mostly in DeLand, Orange City, and Deltona. Every job generated a paper trail: building permits, inspection reports, certificates of occupancy, lien waivers. The county sends permits as PDFs, but inspectors hand-write notes on carbon-copy forms. The GC’s office manager scanned everything into a shared folder with filenames like scan_2024_03_12.pdf.
When the GC needed to find a specific permit number or inspection date, someone had to open dozens of PDFs and skim through them. He told me, “I’d estimate we waste 15 hours a week just hunting for documents. And sometimes we miss a deadline because we can’t find the sign-off.”
The breaking point came when a city inspector flagged a missing inspection report for a foundation pour. The GC knew the report existed—he’d been there—but it took three people two days to locate it in the digital pile. He called me the next week.
What They’d Tried Before
They’d tried a generic document scanner app that claimed to extract text. It worked fine on clean typed PDFs, but the hand-written inspection forms came back as gibberish. They’d also tried hiring a part-time data entry person, but the backlog was too deep—hundreds of documents per month—and the work was mind-numbing. Turnover was brutal.
The GC had also looked at expensive enterprise document-management systems. The quotes started at $30,000 for setup, plus annual fees. “We’re not a Fortune 500 company,” he said. “I need something that works, but I’m not spending $40K on software.”
That’s where we came in.
The Actual AI Work: OCR, LLM, and a Confidence Threshold
We built a pipeline that combines optical character recognition (OCR) with a large language model (LLM) to extract structured fields from scanned documents. The stack is straightforward: Tesseract for OCR (with some preprocessing), Python for orchestration, and GPT-4o-mini for the LLM extraction step. We deployed it on a small cloud server for about $120/month.
Here’s how it works:
- Scan ingestion. The office manager drops scanned PDFs into a watched folder. A Python script detects new files and moves them to a processing queue.
- Image preprocessing. For hand-written forms, we apply OpenCV to deskew, binarize, and remove noise. This step alone improved OCR accuracy from 60% to roughly 85%.
- OCR pass. Tesseract extracts raw text. We keep the bounding boxes so we know where each word appeared on the page.
- LLM extraction. We send the raw OCR text (plus bounding-box hints) to GPT-4o-mini with a structured prompt that asks for fields like: permit number, issue date, inspection type, inspector name, pass/fail status, and expiration date. The model returns JSON.
- Confidence threshold. The LLM also returns a confidence score (0–1) for each field. If any field falls below 0.8, the document gets flagged for human review. That queue lives in a simple web dashboard.
- Human-in-the-loop. A staff member opens flagged documents, corrects the low-confidence fields, and approves them. We store the corrections so they can later be used for fine-tuning (we’re still working on that part).
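To make the LLM extraction step more concrete, here’s a minimal sketch of how the prompt and the reply handling might look. It isn’t our production code. The snake_case field keys, the reply shape (a "value" and a "confidence" per field), and the function names are assumptions for illustration, and the sketch stops short of the actual API call and leaves out the bounding-box hints.

```python
import json

# Field names follow the extraction step described above. The snake_case keys
# and the {"value": ..., "confidence": ...} reply shape are assumptions.
FIELDS = (
    "permit_number",
    "issue_date",
    "inspection_type",
    "inspector_name",
    "pass_fail_status",
    "expiration_date",
)


def build_prompt(ocr_text, fields=FIELDS):
    """Assemble a structured extraction prompt from raw OCR text."""
    field_lines = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the scanned document text.\n"
        "Reply with a JSON object only. For each field, return an object\n"
        'with "value" (string or null) and "confidence" (number from 0 to 1).\n\n'
        f"Fields:\n{field_lines}\n\n"
        f"Document text:\n{ocr_text}"
    )


def parse_reply(reply, fields=FIELDS):
    """Parse the model's JSON reply, filling gaps with zero-confidence entries."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    parsed = {}
    for name in fields:
        entry = data.get(name)
        if isinstance(entry, dict) and isinstance(entry.get("confidence"), (int, float)):
            parsed[name] = {
                "value": entry.get("value"),
                "confidence": float(entry["confidence"]),
            }
        else:
            # Missing or malformed field: record it as unknown with zero
            # confidence so it can never be trusted silently downstream.
            parsed[name] = {"value": None, "confidence": 0.0}
    return parsed
```

Treating a missing or malformed field as zero confidence is a deliberate choice: an unparseable reply ends up in the review queue instead of being dropped.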
We deliberately kept a human in the loop because permit data has legal implications. Get an inspection date wrong and the GC could miss a deadline and face fines. The confidence threshold gives us a safety net. The office manager now spends about 30 minutes per day (roughly 2.5 hours a week) reviewing flagged documents, down from 15 hours a week of manual searching.
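The confidence check itself is simple. Here’s a small sketch, assuming each extracted field arrives as a dict with a "value" and a "confidence" (that shape, the function names, and the sample data are illustrative, not lifted from the client’s system):

```python
REVIEW_THRESHOLD = 0.8  # from the text: any field below 0.8 flags the document


def low_confidence_fields(fields, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence falls below the threshold."""
    return [
        name
        for name, entry in fields.items()
        if entry.get("confidence", 0.0) < threshold
    ]


def route(fields, threshold=REVIEW_THRESHOLD):
    """Send a document to human review if any single field is low-confidence."""
    if low_confidence_fields(fields, threshold):
        return "human_review"
    return "auto_approved"


# Made-up sample extraction for illustration.
sample = {
    "permit_number": {"value": "B-1234", "confidence": 0.97},
    "issue_date": {"value": "2024-03-12", "confidence": 0.64},
}
print(route(sample), low_confidence_fields(sample))
# prints: human_review ['issue_date']
```

Returning the list of weak fields, not just a yes/no, is what lets a review dashboard highlight the exact field a person needs to look at.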
One thing that was harder than expected: the hand-written forms. Inspector handwriting varies wildly. Some use cursive, some print, some use checkmarks that look like scribbles. We experimented with fine-tuning a smaller OCR model on a dataset of 200 hand-labeled inspection forms. That boosted accuracy on hand-written fields by another 10%, but it took two weeks of labeling. For a smaller contractor, that upfront cost might not be worth it. I’d recommend starting with the generic pipeline and only fine-tuning if the flag rate gets too high.
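If you want to follow that advice, you need to track the flag rate somewhere. One way to do it, sketched below, is a rolling window over recent documents. The class name, the window size, and the 25% cutoff are all placeholders I made up for illustration; what counts as “too high” depends on how much review time you can afford.

```python
from collections import deque


class FlagRateMonitor:
    """Track the share of recent documents that were flagged for review."""

    def __init__(self, window=400, alert_above=0.25):
        # Both defaults are illustrative, not values from the project.
        self.outcomes = deque(maxlen=window)
        self.alert_above = alert_above

    def record(self, flagged):
        """Record one processed document (True if it was flagged)."""
        self.outcomes.append(bool(flagged))

    @property
    def flag_rate(self):
        """Fraction of documents in the window that were flagged."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_consider_fine_tuning(self):
        """True once the rolling flag rate exceeds the chosen cutoff."""
        return self.flag_rate > self.alert_above


# A month like the one reported in the results: 400 documents, 60 flagged.
monitor = FlagRateMonitor()
for _ in range(340):
    monitor.record(False)
for _ in range(60):
    monitor.record(True)
print(f"{monitor.flag_rate:.0%}", monitor.should_consider_fine_tuning())
# prints: 15% False
```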
Where We Kept a Human in the Loop
Look, I want to be honest about where AI still falls short. The pipeline handles about 85% of documents without any human touch. The remaining 15% get flagged. Common issues:
- Faded carbon copies where the text is nearly invisible.
- Forms with complex tables where the LLM misaligns data (e.g., reading a date from the wrong column).
- Checkboxes—the model can’t reliably tell if a box is checked or unchecked.
For those, the human reviewer opens the original PDF, looks at the flagged field, and corrects it. The dashboard highlights the specific field and shows the raw OCR snippet. It takes about 10 seconds per correction.
We also kept a human in the loop for any document that contains a financial figure (like permit fees). The GC insisted on that, and I agreed with him. A misread dollar amount could cause accounting headaches down the road. The LLM is good, but it’s not perfect, and the cost of a mistake is higher than the cost of a quick human glance.
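A rule like this can sit next to the confidence check as a hard override. The sketch below is one possible way to write it, not a description of the deployed code. It assumes fees show up in the OCR text with a dollar sign, so an amount written without one would slip past this particular pattern; the function names are hypothetical.

```python
import re

# Matches dollar amounts such as "$85", "$1,250.00", or "$ 300".
# Assumption: fees are written with a dollar sign in the OCR text.
DOLLAR_AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?")


def contains_financial_figure(ocr_text):
    """Return True if the text contains something that looks like a dollar amount."""
    return DOLLAR_AMOUNT.search(ocr_text) is not None


def needs_human_review(ocr_text, min_confidence, threshold=0.8):
    """Force review for any document with a dollar amount, however confident
    the extraction is; otherwise fall back to the confidence threshold."""
    return contains_financial_figure(ocr_text) or min_confidence < threshold


# Made-up sample lines for illustration.
print(needs_human_review("Permit fee: $1,250.00", min_confidence=0.99))     # True
print(needs_human_review("Footing inspection: PASS", min_confidence=0.95))  # False
```

The override runs before the confidence comparison on purpose: a high score should never be able to talk a fee document out of review.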
The Measured Results
After three months of running the pipeline, here’s what we measured:
- Time saved: The GC’s office manager reported saving 15 hours per week that used to be spent searching for documents. That’s nearly two full workdays.
- Documents processed: The pipeline handled about 400 documents per month. Of those, 340 (85%) were fully automated. The remaining 60 required human correction.
- Backlog cleared: We did a one-time batch process of their existing scanned archives—about 1,200 documents—over a weekend. Backlog gone entirely.
- Missed deadlines: In the three months since, the GC hasn’t missed a single permit deadline. He attributes that to being able to pull up any inspection date in seconds.
The GC told me, “I didn’t realize how much mental energy I was spending on paper. Now I just open a spreadsheet and see every permit, every date. It’s like the filing cabinet never existed.”
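Getting from extracted JSON to a spreadsheet takes very little code. Here’s a rough sketch using Python’s standard csv module. The column names are assumptions based on the fields mentioned earlier, and the sample record is made up.

```python
import csv
import io

# Assumed column names; adjust to whatever fields the pipeline extracts.
COLUMNS = ["permit_number", "issue_date", "inspection_type", "pass_fail_status"]


def records_to_csv(records):
    """Flatten a list of extracted records into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the known columns; missing values become empty cells.
        writer.writerow({col: record.get(col, "") for col in COLUMNS})
    return buffer.getvalue()


# Made-up sample record for illustration.
rows = [
    {
        "permit_number": "B-1234",
        "issue_date": "2024-03-12",
        "inspection_type": "Foundation",
        "pass_fail_status": "PASS",
        "inspector_name": "not exported in this sketch",
    }
]
print(records_to_csv(rows))
```

The resulting text can be written to a .csv file and opened in Excel or Google Sheets.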
We also estimated the ROI. The pipeline costs $120/month for the cloud server plus about $50/month in API calls to the LLM. That’s $170/month. The GC values the office manager’s time at roughly $25/hour. At 15 hours/week saved, that’s $375/week, or roughly $1,500/month (counting four weeks to a month). Net savings: about $1,330/month. Plus the intangible benefit of not missing deadlines.
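If you want to plug in your own numbers, the arithmetic fits in a few lines. The function name and parameter names are just labels I chose for this sketch; the inputs in the example are the figures quoted above.

```python
def monthly_roi(server_cost, api_cost, hourly_rate, hours_saved_per_week, weeks_per_month=4):
    """Return (running_cost, value_of_time_saved, net_savings) in dollars per month."""
    running_cost = server_cost + api_cost
    value_of_time_saved = hourly_rate * hours_saved_per_week * weeks_per_month
    return running_cost, value_of_time_saved, value_of_time_saved - running_cost


# Figures from the case study, using a four-week month.
cost, value, net = monthly_roi(
    server_cost=120,
    api_cost=50,
    hourly_rate=25,
    hours_saved_per_week=15,
)
print(cost, value, net)
# prints: 170 1500 1330
```

A four-week month is a conservative simplification. Using 52/12 weeks per month instead puts the value of the time saved at $1,625 and the net at about $1,455.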
What We’d Do Differently (Honest Caveats)
No project is perfect. Here are the things I’d change if we did it again:
- Start with a smaller scope. We initially tried to extract 20 fields from every document. That overwhelmed the LLM and led to low confidence on half the fields. We trimmed to the 8 most important fields (permit number, date, type, etc.) and saw a big improvement. Start small, then expand.
- Invest in better scanning hardware. The GC used a $200 flatbed scanner. Some of the carbon copies were barely legible. A $600 document scanner with automatic feed and better resolution would’ve reduced the flag rate by maybe 5%. For a contractor processing 400 docs/month, that’s worth it.
- Plan for the human workflow. We built the dashboard in a weekend, and it was ugly. The office manager initially disliked it. We spent another week polishing the UI and adding keyboard shortcuts. Don’t underestimate the importance of a good interface for the human reviewer.
- Consider a managed service. If you don’t have in-house technical skills, this pipeline requires someone to maintain the server and update the LLM prompts. For a small business, a service like ours (or a tool like Document AI from Google) might be easier. But the trade-off is cost—managed services can run $500/month or more.
What This Means for Other Contractors
Honestly, if you’re a general contractor in Central Florida drowning in paper permits, you don’t need to spend $40,000 on a document management system. A simple OCR-plus-LLM pipeline can handle most of the work for under $200/month. The key is knowing where to draw the line between automation and human review.
We’ve since built similar pipelines for a roofing company in Sanford and a plumbing contractor in Lake Mary. The core pattern stays the same: scan, OCR, LLM extract, confidence check, human review. The field names change, but the workflow doesn’t.
If you’re curious whether this approach would work for your business, we offer an AI readiness assessment that looks at your document volume, types, and current pain points. No sales pitch, just an honest evaluation of whether automation makes sense.
And if you decide to go ahead, we can help with the implementation. (Our implementation page is written around voice agents, but the same principles of custom pipeline building apply.) Or just contact us directly. We’ll tell you if we think it’s a good fit.
The filing cabinet in DeLand is now a storage unit for old holiday decorations. The GC says he might keep it as a reminder of how things used to be. But he doesn’t open it anymore.
Frequently asked questions
What types of documents can this pipeline handle?
It works best with typed or hand-printed forms like building permits, inspection reports, certificates of occupancy, and lien waivers. Highly cursive handwriting or very faded carbon copies may require human review.
How much does the pipeline cost to run?
For a contractor processing about 400 documents per month, the cloud server and API calls run roughly $170/month. That includes the LLM extraction and the human-review dashboard.
Do I need technical staff to maintain it?
Some basic technical skills are needed to update prompts and handle server maintenance. If you don't have that in-house, we offer managed services or can recommend a low-code alternative.
How accurate is the extraction?
About 85% of documents are processed without human intervention. The remaining 15% are flagged for review, and a staff member corrects each low-confidence field in about 10 seconds.
Can this integrate with my accounting software?
Yes, the pipeline outputs JSON that can be fed into most accounting or project management tools via API. We've integrated with QuickBooks and Buildertrend for other clients.
What if I have a backlog of old documents?
We can do a one-time batch process to digitize your archive. For the DeLand contractor, we processed 1,200 documents over a weekend.
Ready to talk it through?
Send a one-line description of what you are trying to do. I will reply within one business day with a plain-English next step. Email or use the form.