I didn't set out to compare fine-tuning and zero-shot. It just kind of happened.
I was building a computer vision system to detect electrical components in field photographs — meters, fuse boxes, wiring — the kind of images engineers take on-site at UK properties. Early on I wanted to quickly validate whether the problem was even solvable with a model, so I grabbed 5 images, labelled them by hand, and trained a YOLOv8 model. Later, when the scope grew and I needed to process hundreds of images across multiple object classes without spending weeks annotating, I spun up a SAM3 server and used text prompts instead.
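For reference, that first sanity check was just the stock Ultralytics recipe. The sketch below is roughly the shape of it, with the dataset config, class names, and epoch count as illustrative placeholders rather than my exact settings:

```python
# Rough shape of the YOLOv8 sanity-check run (paths and hyperparameters
# are illustrative, not my exact setup).
from ultralytics import YOLO

# Start from a small pretrained checkpoint so 5 images is even viable.
model = YOLO("yolov8n.pt")

# data.yaml points at the handful of hand-labelled images and the class
# names, e.g. ["meter", "fuse_box", "wiring"].
model.train(data="data.yaml", epochs=100, imgsz=640)

# Quick eyeball test on a held-out field photo.
results = model.predict("field_photo.jpg", conf=0.25)
results[0].show()
```

The only reason a handful of images gets you anything at all here is that you're starting from a pretrained checkpoint and nudging it towards your classes.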
Same domain. Roughly the same objects. Two completely different bets on how to solve it.
SAM3 is the third generation of Meta's Segment Anything Model — you tell it what to look for in plain English and it finds it. No training data, no labels. I ran three prompts across 300 images: "electric meters", "fuse", and "wires".
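Mechanically the whole pass is just a loop over photos and prompts; the numbers that follow are the fraction of images where each prompt came back with at least one mask. Roughly this shape, with the server endpoint and response format as placeholders for whatever your SAM3 deployment actually exposes:

```python
# Sketch of the zero-shot pass. The /segment endpoint and the JSON shape
# are placeholders for whatever your SAM3 server actually exposes.
from pathlib import Path
import requests

SERVER = "http://localhost:8000/segment"  # hypothetical endpoint
PROMPTS = ["electric meters", "fuse", "wires"]

images = sorted(Path("field_photos").glob("*.jpg"))
hits = {p: 0 for p in PROMPTS}

for img in images:
    for prompt in PROMPTS:
        with open(img, "rb") as f:
            resp = requests.post(
                SERVER,
                files={"image": f},
                data={"prompt": prompt},
                timeout=60,
            )
        # Count the image as a "hit" if the model returned any masks.
        if resp.json().get("masks"):
            hits[prompt] += 1

for prompt in PROMPTS:
    print(f"{prompt}: found in {hits[prompt] / len(images):.1%} of images")
```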
Wires: found in 71.3% of images. Meters: 57.3%. Fuses: 6.7%.
The wire and meter numbers felt reasonable. Wires are visually prominent, and "electric meters" is a specific enough phrase that SAM3 probably saw plenty of matching examples during pretraining. But 6.7% on fuses genuinely surprised me.
The problem is that "fuse" means different things. A ceramic fuse, a fuse carrier, a cut-out — they all look different, and SAM3 has no idea which one I'm after. It's guessing based on whatever "fuse" looked like in its training data, which won't necessarily have been UK domestic electrical installations. You're at the mercy of what the model thinks your word means.
Still — three object classes, no annotation, a working system in a couple of days.
So I tried something. I fine-tuned SAM3 on a small set of fuse cutout images — not hundreds, just enough to show the model what I actually meant by "fuse" in this context. The detection rate jumped significantly. Same model, same architecture, just a handful of domain-specific examples and it suddenly understood what it was looking for.
That's the part that stuck with me — it didn't need much. It needed context. The base model already knew how to segment objects, it just had no idea that a UK fuse cutout was the thing I cared about. A few labelled examples closed that gap almost immediately.
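I won't pretend the sketch below is the exact training script, but the shape of a small-data fine-tune like this is fairly generic: freeze most of the model, adapt a small piece of it, and run a few epochs over the handful of labelled fuse-cutout masks. The loader function, dataset class, module name, and model call signature here are all hypothetical stand-ins, not the real SAM3 training API.

```python
# Illustrative sketch only: load_sam3(), FuseCutoutMasks, the "mask_head"
# module name, and the model call signature are hypothetical stand-ins.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

model = load_sam3()                        # hypothetical: load pretrained SAM3
dataset = FuseCutoutMasks("fuse_labels/")  # hypothetical: a handful of image/mask pairs
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Freeze the bulk of the model so a few examples can't wreck its general
# segmentation ability; only the mask head gets updated.
for name, p in model.named_parameters():
    p.requires_grad = "mask_head" in name

opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

model.train()
for epoch in range(20):
    for images, masks in loader:
        logits = model(images, prompt="fuse")  # hypothetical call signature
        loss = F.binary_cross_entropy_with_logits(logits, masks)
        opt.zero_grad()
        loss.backward()
        opt.step()
```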
Put that next to the earlier SAM3 zero-shot 6.7% fuse detection rate and it's a stark difference. A model trained on 5–10 images found more fuse boxes than a 0.9B-parameter zero-shot model prompted with the word "fuse". The small model's precision issues go away with more data; getting the zero-shot model's recall from 6.7% to something useful is a harder problem.
There's a tendency right now to treat large foundation models as a ceiling — like the question is just "which one do I call?" and fine-tuning is some legacy approach for when you don't have access to something better. But what I found was almost the opposite.
These billion-parameter models are sitting on an enormous amount of general visual understanding. What they don't have is your domain.
Although these experiments take time to set up, the results feel almost too easy once they work. It doesn't feel like you did much. But that's the point.