AI Glossary
Quantization is the process of compressing an AI model by using smaller, less precise numbers for its calculations — making it cheaper and faster to run, usually with only a minor dip in quality.
What it really means
Think of quantization like converting a high-resolution photo to a smaller JPEG file. The original photo might have millions of colors and fine detail. The JPEG version has fewer colors and might look a little softer, but it loads faster and takes up less space on your phone. You still get the picture — just not the full original quality.
AI models work the same way. They store their “knowledge” as a huge collection of numbers. Normally those numbers are stored at 32-bit or 16-bit precision — think of it like using a ruler with very tiny marks. Quantization shrinks those numbers down to 8-bit or even 4-bit precision — like switching to a ruler with only inch marks. The model still works, but it’s less precise in its calculations.
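If you're curious what that shrinking actually looks like, here's a minimal Python sketch of the core idea, known as symmetric 8-bit quantization. The weight values are made up for illustration; real models have billions of them:

```python
import numpy as np

# A tiny slice of "model weights" stored at full 32-bit precision.
weights = np.array([0.0213, -0.4811, 0.9742, -0.0034, 0.3301], dtype=np.float32)

# Map the float range onto the integers -127..127 using one shared scale.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # what gets stored on disk
recovered = quantized.astype(np.float32) * scale        # what the model computes with

print(quantized)   # [  3 -63 127   0  43]
print(recovered)   # close to the originals, but not exact
```

The recovered values land near the originals but lose the fine detail, which is exactly the ruler-with-fewer-marks trade-off described above.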
I’ve seen this technique become essential for small businesses. Without quantization, running a capable AI model requires expensive server hardware or cloud subscriptions. With quantization, you can run a surprisingly smart model on a decent laptop or even a phone.
Where it shows up
Quantization is happening behind the scenes in many tools you might already use. If you’ve ever used an AI writing assistant that runs locally on your computer, or a voice assistant that responds quickly without an internet connection, quantization is likely involved.
In the AI world, you’ll hear terms like “int8 quantization” (8-bit integer) or “4-bit quantization.” These refer to how many bits are used to store each number. Lower bits mean more compression. A model quantized to 4 bits might be one-eighth the size of its original 32-bit version.
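To make the size math concrete, here's a quick back-of-the-envelope calculation. The 7-billion-parameter count is just a common size for open-source models, used here for illustration, and real quantized files run slightly larger because they also store scaling metadata:

```python
params = 7_000_000_000  # a common size for open-source models

fp32_gb = params * 4 / 1e9    # 32 bits = 4 bytes per number
int8_gb = params * 1 / 1e9    # 8 bits = 1 byte per number
q4_gb   = params * 0.5 / 1e9  # 4 bits = half a byte per number

print(f"FP32: {fp32_gb:.0f} GB, INT8: {int8_gb:.0f} GB, 4-bit: {q4_gb:.1f} GB")
# FP32: 28 GB, INT8: 7 GB, 4-bit: 3.5 GB
```

That's the difference between a model that needs a dedicated GPU and one that fits comfortably in an ordinary laptop's memory.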
Some popular open-source models, like Llama, Mistral, and Phi, are commonly distributed in quantized versions. You'll see tags like "Q4" or "Q8" in the file name to indicate the quantization level.
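As a taste of how simple running one of these can be, here's a short sketch using the open-source llama-cpp-python package. The model file name is illustrative; you'd download a real quantized GGUF file from a model hub first:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a 4-bit quantized model (illustrative file name -- download a real
# GGUF file first).
llm = Llama(model_path="mistral-7b-instruct-Q4_K_M.gguf", verbose=False)

response = llm("Q: What does quantization do to a model? A:", max_tokens=64)
print(response["choices"][0]["text"])
```

No GPU cluster, no cloud account. Just a file on disk and a few lines of code.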
Common SMB use cases
- Running AI on existing hardware. A Maitland HVAC company I worked with wanted to use an AI assistant for their dispatch team but didn’t want to pay for cloud API calls. I helped them quantize a small model that runs on a refurbished office PC. No monthly fees, no internet required.
- On-device document analysis. A Winter Park dental practice needed to scan patient intake forms for key information. A quantized model runs locally on their office computer, keeping patient data private and avoiding cloud costs.
- Faster customer service chatbots. A Lake Nona restaurant chain wanted a menu chatbot for their website. Quantization let them host the model on a cheap $20/month server instead of a $200/month cloud AI service.
- Mobile apps. A pool service in Clermont built a simple app for their technicians to log service notes. A quantized model helps auto-complete common entries without needing an internet connection in the field.
Pitfalls (what gets oversold)
Quantization isn’t magic. Here’s what I’ve seen go wrong:
- Quality loss is real. A heavily quantized model (like 4-bit) can make more mistakes, especially on complex tasks. For a law firm in downtown Orlando doing contract review, I’d recommend a less aggressive quantization or no quantization at all. The savings aren’t worth missing a key clause.
- Not all models quantize equally. Some models handle compression better than others. I've tested quantized versions of the same model that varied wildly in quality. You need to test on your actual data, not just trust the file name (see the sketch after this list).
- It’s not free. Setting up quantization requires some technical work. You need to know which tool to use (like llama.cpp or AutoGPTQ) and how to calibrate it. If that sounds like Greek, you’ll want someone to help.
- Speed gains are situational. Quantization usually makes models run faster, but on very small models the difference can be negligible. And on some hardware, the overhead of converting between precision levels can actually slow things down.
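Here's what that testing can look like in practice: a minimal sketch that runs the same real-world prompts through two quantization levels of the same model so you can compare the answers side by side. The file names and prompts are placeholders; substitute your own:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder file names -- swap in the quantized variants you actually downloaded.
models = {
    "Q8 (lighter compression)": "model-Q8_0.gguf",
    "Q4 (heavier compression)": "model-Q4_K_M.gguf",
}

# Pull prompts from your real workload, not generic benchmarks.
test_prompts = [
    "Summarize this service note: 'AC low on refrigerant; recommended coil cleaning.'",
    "Extract the appointment date from: 'Patient requests a cleaning on March 12.'",
]

for label, path in models.items():
    llm = Llama(model_path=path, verbose=False)
    print(f"--- {label} ---")
    for prompt in test_prompts:
        out = llm(prompt, max_tokens=80)
        print(out["choices"][0]["text"].strip())
        print()
```

If the heavier compression misses details the lighter one catches, that gap tells you how far you can push it for your use case.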
Related terms
- Pruning: Another compression technique that removes less important connections in the model. Often used alongside quantization.
- Inference: The process of running a trained model to make predictions. Quantization mainly affects inference speed and cost.
- FP16 / BF16 / INT8: Different number formats used in AI. FP16 and BF16 are both 16-bit floating-point formats (BF16 keeps more range but less precision), while INT8 uses 8-bit integers. Quantization typically moves from FP32 or FP16 down to INT8 or lower (see the short demo after this list).
- Local AI / Edge AI: Running models on local devices rather than in the cloud. Quantization is often the key that makes local AI practical.
- Model size: The storage space a model takes up. Quantization directly reduces this, often by 50-75% or more.
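To see the precision difference for yourself, here's a tiny NumPy demonstration, as referenced in the FP16 entry above. BF16 isn't a built-in NumPy type, so it's only described in a comment:

```python
import numpy as np

x = 0.123456789

print(np.float32(x))  # 0.12345679 -- FP32 keeps roughly 7 significant digits
print(np.float16(x))  # 0.1235     -- FP16 keeps roughly 3-4 significant digits
# BF16 (not built into NumPy) keeps FP32's range with even fewer precision
# bits, which is why it's popular on modern AI hardware.
```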
Want help with this in your business?
If you’re curious whether quantization could help your business run AI more affordably — or if you’ve heard the buzz and want a straight answer — send me an email or use the contact form. I’ll tell you if it fits your situation.