1. Annotation Is a Grind
Training an object detection model takes labeled data. Lots of it. And drawing bounding boxes by hand is, frankly, miserable work.
Open an image. Scan for the object. Carefully drag a rectangle around it. Next image. And the next. Hundreds, thousands of times. It's tedious, slow, draining, and you inevitably miss things. Nobody wants to do this.

Open-vocabulary models — Grounding DINO, SAM3, and their ilk — might save us. Type "cat" and they find cats. Type "traffic light" and they nail it. So can they label your objects automatically too?
If the model already knows the concept, yes. For objects within an open-vocabulary model's training vocabulary, text prompts alone are enough to auto-generate usable annotations.
But what happens when the model has never seen your object?
For this experiment, I happened to pick shrimp in underwater footage. Feed "shrimp" to Grounding DINO, and here's what you get:
"shrimp" → Recall = 0.000
"fish" → Recall = 0.761
"crab" → Recall = 0.843Zero. Not low confidence — zero detections. The word "shrimp" simply isn't in the model's vocabulary. I tried 11 vision models across both text and visual prompting modes. Not a single one could reliably detect shrimp from reference images alone.
What This Article Covers
I built a pipeline that auto-labels unknown objects using just a handful of reference crop images. No object names. No fine-tuning. No iterative training.
The goal isn't full automation. It's a fundamental change in what the work looks like:
- Before: Open each image, hunt for the object, draw a bounding box from scratch
- After: Look at pre-generated candidate crops, hit Accept or Reject
No more searching. No more drawing. Just making calls. That alone is a dramatic improvement in throughput.
A note up front: The results and techniques shown here are specific to underwater shrimp. Different objects will have different sweet spots. The pipeline is designed to be customizable — you run multiple strategies in parallel and iteratively converge on whatever works best for your particular object.
2. Experimental Setup
Dataset
AAU Brackish Underwater Dataset (CC BY 4.0) — underwater footage containing six classes: crab, fish, jellyfish, shrimp, small_fish, starfish. All six fall outside COCO-80, making them genuinely unknown to standard open-vocabulary detectors.
Target: shrimp — 76 instances across 1,467 validation images. Small, translucent, and easily lost in the background.
Hardware & Models
- GPU: NVIDIA RTX 5090 (32GB VRAM)
- SAM3: Segment Anything 3 (via Ultralytics, SAM3SemanticPredictor)
- Grounding DINO v1: Text-guided detection (HuggingFace transformers)
- Qwen3-VL-8B: Vision-Language Model (for environment detection)
Design Philosophy: Don't Try to Nail It in One Shot
Annotation generation isn't a real-time process. You create the labels once, then train your model as many times as you want. There's no need to optimize for speed.
Instead, run multiple approaches in parallel, compare the outputs, and pick whatever configuration works best for your object. Try SAM3 at several confidence levels. Generate a range of prompt templates. Mix and match verification methods. Look at everything, then decide.
There's no single approach that works universally. Every object has its own sweet spot. This "parallel exploration → interactive selection" mindset is the foundation of the pipeline design.
3. Baseline: 11 Models, Zero-Shot
Before building anything, I benchmarked what's already out there:
Model               Mode                   shrimp F1        Verdict
─────────────────────────────────────────────────────────────────────────
Grounding DINO v1   Text: "shrimp"         0.000            Not in vocabulary
Grounding DINO v1   Text: all 6 classes    0.694 (overall)  Best overall
SAM3                Text                   0.392            Moderate
DINO-X API          Text                   0.119            Paid + poor
OWL-ViT v2          Visual                 0.004            Failed
YOLOE26             Text / Visual          0.018 / 0.002    Failed
DINOv2              Prototype              0.000            Failed
Rex-Omni            Visual                 0.000            Failed
Qwen3-VL-8B         VLM grounding          0.019            Failed
Two findings:
- Text prompts are the only viable path. All five visual prompting approaches failed outright. Reference images couldn't get the job done for unknown objects.
- For known objects, auto-labeling already works. The challenge is exclusively with unknowns.
4. Descriptive Prompts: Detecting What You Can't Name
This is the insight that made everything else possible.
GDino doesn't know "shrimp." But it understands descriptions:
"shrimp" → R = 0.000 ❌
"underwater creature" → R = 0.553 ★
"small translucent creature" → R = 0.368 ★The model can't look up "shrimp" in its vocabulary, but it can match the visual concept of "underwater creature" against what it sees in the image. GDino's text encoder learned to associate descriptive phrases with visual features during pre-training — it just never encountered the specific word "shrimp."
Automating Prompt Generation
Users shouldn't have to guess which descriptions work. I use a VLM (Qwen3-VL-8B) to analyze reference crops and detect the environment automatically:
Reference crop + full image → VLM → "underwater"
→ Templates: "underwater creature", "underwater animal", "underwater organism"Why rigid templates instead of free-form VLM descriptions? I'll cover that in Section 6, but the short version: adjectives destroy GDino's discriminative power.
Discovery Is Where It All Starts
Descriptive prompts can now find shrimp. But they also find everything else. GDino returns over 9,000 detections across the dataset, of which roughly 70 are real shrimp.
Here's what matters: you can't annotate what you can't find. No filter, however precise, can rescue a ground-truth object that never made it into the candidate pool. That makes Discovery — the fraction of GT objects captured at the candidate stage — the top priority.
False positives can always be dealt with later. A missed object is gone for good.
5. The Key Shift: Segment First, Then Classify
I tried filtering GDino's full-image detections in every way I could think of — CLIP similarity, confidence deltas, area filters, VLM verification, SVMs. The best any of them managed was getting half the labels right.
The breakthrough came from reframing the problem entirely.

The Old Way: Detect → Filter (Didn't Work)
Full image → GDino "underwater creature" → 9,145 detections (TP=69, FP=9,076)
→ Filter → Best result: half the labels are wrong
GDino's full-image detections are noisy: overlapping boxes, multiple objects per box, pure background. Trying to sort the good from the bad in that mess is extremely hard.
The New Way: Segment → Classify (Worked)
Full image → SAM3 segmentation → ~10K clean segments (1 crop = 1 object)
→ Each crop → GDino "underwater creature" → classification score
→ Threshold → High-quality labeling ★★
Why this works: SAM3 produces non-overlapping masks with clean object boundaries. Each crop contains exactly one thing. At that point, GDino isn't doing noisy detection anymore — it's doing binary classification: "Is this crop an underwater creature?" Separating TPs from FPs gets dramatically easier.
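The mask-to-crop-to-score loop can be sketched in a few lines. This is a minimal, assumed shape of the step, not the article's actual implementation; `score_fn` stands in for the real per-crop GDino call:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray, pad: int = 4):
    """Tight (x1, y1, x2, y2) box around a binary SAM3 mask,
    with a small padding margin clipped to the image bounds."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    h, w = mask.shape
    return (max(int(xs.min()) - pad, 0), max(int(ys.min()) - pad, 0),
            min(int(xs.max()) + pad, w - 1), min(int(ys.max()) + pad, h - 1))

def classify_crops(image, masks, score_fn, threshold=0.30):
    """Score each crop (e.g. GDino on 'underwater creature') and keep
    those above threshold. score_fn is a stand-in for the model call."""
    kept = []
    for m in masks:
        box = mask_to_bbox(m)
        if box is None:
            continue
        x1, y1, x2, y2 = box
        crop = image[y1:y2 + 1, x1:x2 + 1]
        score = score_fn(crop)
        if score >= threshold:
            kept.append((box, score))
    return kept
```

The key property being exploited: because each mask yields exactly one crop, `score_fn` only ever has to answer a binary question about a single object.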
SAM3 Configuration Matters
Two settings that made a significant difference:
1. Domain-specific category words are essential.
SAM3 with ["object"] only: Recall = 53.9% ❌
SAM3 with ["animal", "creature", "object"]: Recall = 78.9% ✅SAM3 is a vision-language model. Broad category words like "animal" activate attention toward biological entities in a way that "object" alone doesn't. This is a biology-specific setting — other domains would need different category words.
2. Lowering confidence captures more TPs.
SAM3 conf = 0.15: Recall = 60.5%, TP = 124
SAM3 conf = 0.10: Recall = 80.3%, TP = 215 (+73%)
More false positives come along for the ride, but the downstream GDino classification step handles those. The optimal threshold depends on the object — which is why you run multiple levels and compare.
from ultralytics.models.sam.predict import SAM3SemanticPredictor
predictor = SAM3SemanticPredictor(overrides={
"model": "sam3.pt", "conf": 0.10, "imgsz": 1008
})
predictor.prompts = {"text": ["animal", "creature", "object"]}
results = predictor(image_path)
(Note: SAM3's auto-segmentation mode isn't supported in the Ultralytics API. You need SAM3SemanticPredictor with text prompts.)
6. Full Automation: What VLMs Should (and Shouldn't) Do
The segment→classify pipeline works — but I was hand-picking prompts. To be genuinely reusable, everything needs to run on its own.
The Adjective Trap
First attempt: have the VLM generate rich descriptive prompts like "smooth undersea item" or "pale round organism." Sounds reasonable. But tested against 9,318 SAM3 crops:
Prompt                              GDino discrimination   Notes
                                    (TP vs FP mean gap)
──────────────────────────────────────────────────────────────────────────
"underwater animal"                 0.144 ★★★              2-word template
"underwater creature"               0.136 ★★★              2-word template
"pale underwater animal"            0.138                  +1 adjective
"small underwater creature"         0.122                  +1 adjective
"underwater smooth round organism"  0.034 ❌               +3 adjectives (4x worse!)

The more adjectives you add, the worse GDino gets at telling TPs from FPs. Why? GDino's text-image matching is a coarse category judgment. Words like "smooth" and "round" match target crops and background crops equally, collapsing the gap.
The sweet spot for the VLM: just detect the environment.
VLM → ENV = "underwater"
→ Templates: "underwater creature", "underwater animal", "underwater organism"This is a finding from shrimp specifically — adjectives might be useful for other objects. That's exactly why you generate multiple prompt candidates, let GDino's discriminative screening rank them automatically, and go with whatever the data says works best.
The SAM3 Prompt Contamination Problem
A subtler issue: VLM-generated prompts like "smooth undersea item" were being passed to SAM3 alongside the template prompts. They didn't just fail at classification — they actively misdirected SAM3's segmentation:
SAM3 with base + VLM prompts: TP = 173
SAM3 with base + templates only: TP = 215 (+24%)
The fix: only feed template prompts to SAM3. Reserve everything else for the GDino classification step.
Fully Automated Results
Metric               Manual    Automated
─────────────────────────────────────────
Human intervention   3 steps   0
Label quality        Good      Comparable
The automated pipeline matches the manual version's performance with zero human intervention. The entire flow — from reference crops to labeling data — runs unattended.
7. Discovery: Capturing 96% of Ground Truth
With the pipeline automated, the next question is: how many GT objects actually make it into the candidate pool? That's the Discovery rate, and it sets the ceiling for everything downstream.
What doesn't get found can't get annotated. Discovery is the hard limit.
SAM3 + GDino Complement
SAM3 at conf=0.10 catches 61/76 GT objects. But running GDino directly on full images picks up some of what SAM3 misses — the two models fail in different ways:
SAM3 alone: 61/76 GT (80.3%)
+ GDino complement: 73/76 GT (96.1%) ★ +16 points
That puts 96% of GT objects in the candidate pool. That's the ceiling for what we can annotate.
The trade-off: GDino complement adds 4,241 extra candidates, only 18 of which (0.4%) are true positives. More candidates, but almost all noise. How you handle those depends on your verification strategy.
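The complement step boils down to an IoU-deduplicated union: SAM3 boxes are primary, and a GDino box is added only if it doesn't overlap anything already in the pool. This is a sketch under that assumption; the function names and the 0.5 threshold are mine, not the article's:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def merge_candidates(sam3_boxes, gdino_boxes, iou_thr=0.5):
    """SAM3 boxes are primary; a GDino complement box is added only when
    it doesn't overlap an existing candidate (dedup by IoU)."""
    merged = list(sam3_boxes)
    for g in gdino_boxes:
        if all(iou(g, s) < iou_thr for s in merged):
            merged.append(g)
    return merged
```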
Annotation Quality
Discovery is meaningless if the bounding boxes are junk. They aren't:
- Mean IoU vs. GT: 0.799
- GT objects with IoU ≥ 0.7: 86%
I visually inspected all 73 overlay images. The bboxes closely follow the shrimp outlines. Pixel-perfect precision isn't the goal — what matters is whether a human reviewer would accept them as-is. They would.




The GDino Complement Trade-off
Whether to use GDino complement depends on the object:
- If SAM3 alone already gives you sufficient Discovery (say, 90%+), skip the complement. It just adds FPs.
- If SAM3 falls short (80% → 96% in this case), the complement is a major win.
Again, don't lock in a choice upfront — run it both ways and compare.
8. Verification: Separating Signal from Noise
Discovery puts 96% of GT into the candidate pool. Now comes the sorting.
Automated Verification Attempts
I wanted to filter down to TPs automatically. Several approaches:
Method                               Result                     Issue
────────────────────────────────────────────────────────────────────────────────────
CLIP cosine similarity               Near-zero discrimination   Underwater crops look too similar
VLM "Is this a creature?"            Too many TPs rejected      VLM can't judge small, blurry crops
GDino threshold (with complements)   Complement FPs overwhelm   Only 0.4% of complement candidates are TPs
For shrimp, fully automatic verification hit a wall. But this is a shrimp-specific outcome. For more visually distinctive objects, or domains where CLIP has better coverage, automatic verification may work much better.
A Better Use for GDino Scores
GDino scores aren't reliable enough for automatic filtering, but they're excellent for something else: prioritizing human review.
Higher-scoring candidates are more likely to be TPs. Lower-scoring ones are mostly FPs. Sort by score, review from the top, and stop when you have enough labels.
Strategy                       GT Coverage   Review Time
────────────────────────────────────────────────────────
Auto-accept high scores only   30%           0h
Auto + review top tier         54%           ~1h
Auto + review mid tier         83%           ~2h
Full review                    96%           ~4h
Where you draw the line depends on how many labels your project needs and how much time you can spare.
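The tiering itself is simple once every candidate carries a GDino score. A minimal sketch, with thresholds that are illustrative rather than the article's actual cutoffs:

```python
def review_plan(candidates, auto_accept=0.35, review_floor=0.10):
    """Split GDino-scored candidates into auto-accept / human-review /
    discard tiers. Thresholds here are hypothetical; in practice you pick
    them by looking at the score distribution for your object."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    auto = [c for c in ranked if c["score"] >= auto_accept]
    review = [c for c in ranked if review_floor <= c["score"] < auto_accept]
    discard = [c for c in ranked if c["score"] < review_floor]
    return auto, review, discard
```

Because `review` comes back sorted high-to-low, a human working top-down hits TPs early and can stop whenever the label count is sufficient.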
9. Not Replacing Humans — Making Them Faster
This pipeline doesn't fully automate annotation.
But "can it replace humans?" is the wrong question. The right one is "how much faster can it make them?"
Before vs. After
Without the pipeline (traditional annotation):
- Open 1,467 underwater images in a viewer
- Scan each one for shrimp (small, translucent, easy to miss)
- Draw bounding boxes from scratch for each one found
- Repeat. Miss some. Lose focus. Make mistakes.
With the pipeline:
- Run the pipeline
- Review pre-generated candidate crops with bboxes already drawn
- Accept or Reject each candidate
- Export accepted labels as labeling data
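The final export step above is just a format conversion. As a sketch (assuming YOLO-style label files, which is one common target; the function name is mine):

```python
def to_yolo_line(box, img_w, img_h, class_id=0):
    """Convert an accepted (x1, y1, x2, y2) pixel box to a YOLO-format
    label line: 'class cx cy w h', all normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(to_yolo_line((0, 0, 100, 50), 200, 100))
# → 0 0.250000 0.250000 0.500000 0.500000
```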
The cognitive load is entirely different. No searching for objects — the pipeline already found 96% of them. No drawing boxes — bboxes at IoU = 0.80 are ready to go. You're just making judgment calls.
This isn't a failure of automation. It's the right division of labor between human and AI. The pipeline handles the heavy lifting — detection and candidate generation. The human handles the easy part — deciding yes or no.
10. Pipeline Overview: Customization Axes
The pipeline has three phases, each with a knob you should tune per object.

PHASE 1: Prompt Generation
┌──────────────────────────────────────────────────┐
│ VLM → Environment detection → Template prompts │
│ │
│ Customization axis: CATEGORY NOUNS │
│ Biology: creature, animal, organism │
│ Industrial: part, component, piece (untested) │
│ Medical: lesion, tissue, mass (untested) │
└──────────────────────────────────────────────────┘
PHASE 2: Discovery
┌──────────────────────────────────────────────────┐
│ SAM3 primary segmentation │
│ + GDino complement detection (optional) │
│ │
│ Customization axis: COMPLEMENT ON/OFF │
│ SAM3 sufficient → complement OFF (fewer FPs) │
│ SAM3 insufficient → complement ON (Discovery ↑) │
└──────────────────────────────────────────────────┘
PHASE 3: Verification
┌──────────────────────────────────────────────────┐
│ GDino classification scoring → review / auto-accept│
│ │
│ Customization axis: VERIFICATION STRATEGY │
│ High GDino gap → lean into auto-accept │
│ Low GDino gap → lean into human review │
└──────────────────────────────────────────────────┘
OUTPUT
┌──────────────────────────────────────────────────┐
│ Labeling data (bboxes + class name) │
└──────────────────────────────────────────────────┘

Parallel Exploration → Interactive Selection
You can't know the optimal configuration in advance. Instead:
- Run SAM3 at multiple confidence levels (0.10, 0.12, 0.15) — compare the TP/FP trade-off
- Generate multiple prompt templates — let GDino's discriminative screening rank them automatically
- Try both with and without GDino complement — weigh Discovery against FP volume
- Look at the results, then pick a strategy — the GDino gap tells you which verification approach to use
All of this runs offline. A single pipeline run gives you everything you need to make an informed call. From there, you tune interactively based on what the data shows for your particular object.
A different object — say, defect inspection on an assembly line — will have a different optimal setup. That's expected. The pipeline gives you the tools to find it.
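The parallel-exploration loop is, mechanically, a small offline grid search. A sketch under my own assumptions: `run_fn` stands in for one full pipeline run returning whatever score you care about (e.g. TP count minus an FP penalty):

```python
from itertools import product

def sweep(run_fn, confs=(0.10, 0.12, 0.15), complements=(False, True)):
    """Evaluate every (SAM3 conf, GDino complement on/off) combination
    offline and rank by the caller's score. run_fn(conf, complement)
    -> float is a stand-in for an actual pipeline run."""
    results = [((c, comp), run_fn(c, comp))
               for c, comp in product(confs, complements)]
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Since annotation generation is offline, running all six combinations overnight and inspecting the ranked results the next morning is a perfectly reasonable workflow.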
What the Pipeline Requires
Input: 1–3 reference crop images + 1 full-scene image + class name
What it does NOT require:
- An object name the model already knows
- Model fine-tuning
- Iterative training loops
- Real-time inference
- A large labeled dataset
11. Takeaways
1. Unknown objects need a different approach. Open-vocabulary detection fails for objects outside the model's training vocabulary. Descriptive prompts ("shrimp" → "underwater creature") bridge the gap.
2. Segment first, then classify. Running GDino on full images makes filtering nearly impossible. Segment with SAM3 first, then classify individual crops with GDino, and discrimination improves dramatically.
3. Let the VLM detect the environment — nothing more. Having VLMs generate adjective-heavy descriptions degrades GDino's discriminative power. The best prompt is just "{environment} {category noun}." That said, this is a shrimp-specific finding and may not hold for every object.
4. Discovery sets the ceiling. You can't annotate what you can't find. SAM3 + GDino complement captured 96% of GT objects, with bbox quality at IoU = 0.80 — fully usable for annotation.
5. Don't try to get it right in one shot. Run multiple confidence levels, prompt variants, and complement settings in parallel. Review the results interactively and converge on the best configuration for your object. Annotation is an offline process — spend the compute.
6. Don't replace humans — make them faster. Drawing bboxes from scratch → Accept/Reject on pre-generated candidates. That shift alone slashes annotation cost and cognitive load.
7. Results will differ for every object. Everything in this article is tuned for shrimp. A different object will call for different settings. The customization axes — category nouns, Discovery strategy, verification strategy — exist for exactly this reason.
Appendix
A. Detailed Comparison: 27 Methods Tested
(For readers who want the full picture.)
Verification Methods
Method Verdict Notes
───────────────────────────────────────────────────────────
v4.4 fixed (auto) ★★★ best fully automated
SAM3(+VLM,c=0.10)→GDino ★★ best with manual prompts
SAM3(+VLM,c=0.15)→GDino ★ highest precision
v4.3 template (auto) ★ first auto success
SAM3→GDino ★ first segment→classify
SAM3→CLIP LR ★ learning-based
CLIP SVM ★ high precision, low recall
───────────────────────────────────────────────────────────
Quad filter ❌ best detect→filter result
v5.1 GDino+VLM ❌ VLM verification limited
CLIP similarity ❌ ineffective underwater
Multi-prompt intersection ❌ detects same region
Discovery Methods
Method GT Coverage Notes
──────────────────────────────────────────────────────
SAM3(c=0.05)+GDino 98.7% FP explosion
SAM3(c=0.10)+GDino 96.1% ← practical best
GDino v1 t=0.15 90.8% standalone
SAM3 c=0.10 80.3% primary method
SAM3 c=0.15 60.5% conservative

B. Key Negative Results
Dead ends that might save you time:
- GDino scores don't predict correctness. A high score doesn't mean it's a TP. You can't threshold your way out.
- More reference images can make things worse. Averaging dilutes the discriminative signal.
- CLIP couldn't separate underwater crops. Domain-dependent limitation.
- Multi-prompt intersection doesn't filter. GDino detects the same spatial region regardless of prompt.
- Letting the VLM pick SAM3 base nouns fails. It chooses the wrong words and recall drops.
C. References
Models & Libraries
- SAM3 (Segment Anything 3) — Meta (SAM License)
- Grounding DINO v1 — IDEA Research (Apache 2.0)
- Qwen3-VL-8B — Alibaba Qwen (Apache 2.0)
- CLIP ViT-L/14 — OpenAI (MIT)
- Ultralytics — Ultralytics (AGPL-3.0)
- HuggingFace Transformers — HuggingFace (Apache 2.0)
Dataset
- AAU Brackish Underwater Dataset — Aalborg University (CC BY 4.0)
Notes
All experiments ran on a single RTX 5090 (32GB VRAM).