"Why am I getting thousands of numbers instead of bounding boxes?"
This confusion usually comes from a common misconception. YOLO does not output final detections. What it actually produces is a large tensor of raw predictions, containing thousands of candidate boxes, each with geometry, objectness confidence, and class probabilities. At this stage, YOLO is essentially saying:
"Here are many possible boxes. You decide which ones are real."
Understanding this output tensor is the single most important step in correctly deploying YOLO models in production, especially when working with C++ and ONNX Runtime, where nothing is abstracted away for you.
If you've ever struggled with YOLO post-processing, incorrect indexing, or confidence calculations that "almost work but not quite", this post is meant to remove that confusion for good.
Let's start by understanding what YOLO actually outputs and why bounding boxes don't exist yet.
Before we look at tensor shapes or indexing, it's important to understand where exactly we are in the YOLO inference flow.
A typical YOLO inference pipeline looks like this:
Image
↓
Preprocessing (resize, letterbox, normalize)
↓
ONNX Runtime inference
↓
session.Run()
↓
Ort::Value (YOLO output tensor) ← WE ARE HERE
↓
Post-processing (decode + filter + NMS)
↓
Final bounding boxes

At this point:
- The model has already run successfully
- ONNX Runtime has returned an Ort::Value
- No bounding boxes exist yet
What we have is a raw output tensor, and our job is to interpret and decode it correctly.
What session.Run() Actually Returns
After calling ONNX Runtime inference:
std::vector<Ort::Value> output_tensors = session.Run(
    Ort::RunOptions{nullptr},
    input_names.data(),
    &input_tensor,
    1,
    output_names.data(),
    output_names.size()
);

You receive one or more Ort::Value objects.
For YOLO models, the first output is typically:
- A 3D tensor
- Containing predictions for thousands of candidate boxes
- With no filtering or decoding applied
This tensor is not human-readable, and it is not final output.
First: what YOLO actually outputs
YOLO does NOT output final boxes. It outputs raw predictions for many candidate boxes.
Think of YOLO as a "proposal generator"
YOLO effectively says:
"Here are thousands of possible boxes. You decide which ones are real."
What is a "candidate box"?
Each candidate corresponds to:
- A grid cell
- An anchor (depending on the YOLO version)
YOLO predicts many such candidates per image.
What each candidate contains
For each candidate, YOLO outputs a vector with three parts.
1. Box geometry
Usually 4 values:
tx, ty, tw, th

These are not final pixel coordinates.
They represent:
- Offsets relative to a grid cell
- Width and height in model space
2. Objectness confidence
One value:
objectness

This answers:
"Is there any object here?"
Value range:
0 → background
1 → object

3. Class scores
One value per class:
c1, c2, c3, ..., cN

These answer:
"If there is an object, what is it?"
These are conditional probabilities.
One candidate vector (example)
For COCO (80 classes):
[ tx, ty, tw, th, objectness, c1, c2, ..., c80 ]

Total values per candidate:
4 + 1 + 80 = 85

How many candidates are there?
For example, YOLOv5 at 640×640 uses:
- Multiple detection heads
- Multiple grid sizes
- Multiple anchors
Total candidates ≈ 25,000+.
Now that we understand what YOLO predicts, the next step is to see how these predictions are laid out in memory.
Typical YOLOv5 / YOLOv8 ONNX Output Shape
Once inference completes, ONNX Runtime returns the YOLO output as a 3D tensor.
The most common shape looks like this:
[1, N, C]

Where:
- 1 is the batch size
- N is the number of predicted boxes
- C is the number of values per box
Example: YOLOv8 trained on COCO
A typical output shape is:
[1, 8400, 85]

This means:
- One image was processed
- The model predicted 8400 candidate boxes
- Each candidate box is represented by 85 values
What determines C?
The value of C is fixed by the model configuration:
C = 4 + 1 + num_classes

For COCO (80 classes):
C = 4 + 1 + 80 = 85

So for every candidate box, YOLO outputs:
- 4 values for box geometry
- 1 value for objectness
- 80 values for class scores
Important clarification
This tensor does not represent final detections.
Each of the N rows corresponds to a candidate box, not a confirmed object.
At this stage:
- All candidates are present
- No filtering has happened
- No decoding has happened
Knowing the tensor shape is not enough; we also need to understand what each value inside C actually represents.
What Is Inside C (VERY IMPORTANT)
The last dimension of the YOLO output tensor contains all the information for one candidate box.
Its size is defined as:
C = 4 + 1 + num_classes

Per-box layout
Each row in the [N, C] tensor follows this layout:
[x, y, w, h, obj, cls0, cls1, ..., clsN]

Where:
- x, y: box center coordinates
- w, h: box width and height
- obj: objectness confidence
- cls0 ... clsN: per-class scores
Important notes
- (x, y, w, h) are not corner coordinates
- They are expressed in model input space
- Depending on the export, values may be normalized or already scaled
These values still require decoding before they can be drawn on the image.
At this point, you must still:
- Convert (x, y, w, h) → (x1, y1, x2, y2)
- Combine objectness with class probability
- Filter low-confidence predictions
- Apply Non-Maximum Suppression (NMS)
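As a concrete sketch of the first conversion step, here is a small helper that turns a center-format box into corner coordinates. The function name is illustrative, and it assumes the values are already in model input pixels (normalized exports would need scaling by the input size first):

```cpp
#include <array>

// Convert a YOLO center-format box (x, y, w, h) to corner format
// (x1, y1, x2, y2). Assumes values are already in model input pixels.
std::array<float, 4> xywh_to_xyxy(float x, float y, float w, float h) {
    return {
        x - w / 2.0f,  // x1: left edge
        y - h / 2.0f,  // y1: top edge
        x + w / 2.0f,  // x2: right edge
        y + h / 2.0f   // y2: bottom edge
    };
}
```

If the model was fed a letterboxed image, the resulting corners still need to be mapped back to the original image by undoing the letterbox scale and padding.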
Now that we understand what each candidate box contains, the next step is to see how to access this tensor in C++ using ONNX Runtime.
Accessing the Output Tensor in C++
After running inference with ONNX Runtime, the YOLO output is available as an Ort::Value.
A typical inference call looks like this:
std::vector<Ort::Value> output_tensors = session.Run(
    Ort::RunOptions{nullptr},
    input_names.data(),
    &input_tensor,
    1,
    output_names.data(),
    output_names.size()
);

For most YOLO models, the first output tensor contains all predictions.
Getting the tensor shape
To interpret the output, the first step is to inspect its shape:
auto& output = output_tensors[0];
auto type_info = output.GetTensorTypeAndShapeInfo();
std::vector<int64_t> shape = type_info.GetShape();

Example result:
shape = {1, 8400, 85}

This directly matches the [1, N, C] layout discussed earlier.
Accessing the raw data pointer
Once the shape is known, you can access the underlying data:
float* data = output.GetTensorMutableData<float>();

At this point:
- data is a flat 1D array
- All tensor values are stored contiguously in memory
- The original 3D shape is no longer explicit
Understanding how to index into this array is critical.
How Indexing Works (THIS IS KEY)
At this stage, ONNX Runtime has given us:
float* data;

This is a flat 1D array, even though the original tensor shape was:
[1, num_boxes, elements_per_box]

Understanding how this flattened memory maps back to the tensor is critical.
What the YOLO output really is in memory
Conceptually, the output tensor looks like:
[batch][boxes][elements]

With batch size = 1:
[1][num_boxes][elements_per_box]

In memory, ONNX Runtime stores this tensor in row-major order.
What "row-major" means here
Row-major means:
The last dimension changes fastest.
So the memory layout is:
Box 0: [x y w h obj cls0 cls1 ...]
Box 1: [x y w h obj cls0 cls1 ...]
Box 2: [x y w h obj cls0 cls1 ...]
...

All values for one box are stored contiguously, followed by the next box.
Flattened memory view (example)
Imagine a small tensor:
[1][3][5]   // batch = 1, boxes = 3, elements = 5

Flattened in memory, it looks like:
Index: 0 1 2 3 4 | 5 6 7 8 9 | 10 11 12 13 14
Data : [b0e0 b0e1 b0e2 b0e3 b0e4 | b1e0 b1e1 b1e2 b1e3 b1e4 | b2e0 ...]

Each box occupies a fixed-size block in memory.
Why this indexing formula works
To access:
- box i
- element j inside that box

You use:
index = i * elements_per_box + j

Because:
- Each box uses elements_per_box values
- To reach box i, you skip i full blocks
- Then offset by j within that block
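The formula can be checked against the small [1][3][5] example with a tiny helper (the function name is illustrative):

```cpp
#include <vector>

// Read element j of candidate box i from the flattened buffer.
// Each box owns a contiguous block of elements_per_box floats,
// so the flat index is i * elements_per_box + j.
float box_element(const std::vector<float>& data, int i, int j,
                  int elements_per_box) {
    return data[i * elements_per_box + j];
}
```

Filling the buffer with the values 0..14 and reading box 2, element 3 returns the value at flat index 2 * 5 + 3 = 13, matching the diagram above.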
Applying this to the YOLO output
From the tensor shape:
int num_boxes = shape[1];
int elements_per_box = shape[2];

This means:
- shape[1] → number of candidate boxes
- shape[2] → values per box
Extracting box values
Using the indexing rule:
for (int i = 0; i < num_boxes; i++) {
    float x   = data[i * elements_per_box + 0];
    float y   = data[i * elements_per_box + 1];
    float w   = data[i * elements_per_box + 2];
    float h   = data[i * elements_per_box + 3];
    float obj = data[i * elements_per_box + 4];
}

Each iteration processes one candidate box.
Why batch is ignored
The batch dimension exists, but here:
batch = 1

So its offset is zero.
If batch size were greater than 1, indexing would become:
index = b * (num_boxes * elements_per_box)
      + i * elements_per_box
      + j

But for YOLO inference, batch size is typically 1.
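For completeness, the batched formula can also be verified with a small helper (illustrative, not part of any ONNX Runtime API):

```cpp
// Flat index into a [batch][num_boxes][elements_per_box] tensor
// stored in row-major order: skip b full images, then i full boxes,
// then offset by j within the box.
int flat_index(int b, int i, int j, int num_boxes, int elements_per_box) {
    return b * (num_boxes * elements_per_box)
         + i * elements_per_box
         + j;
}
```

With batch = 0 this reduces to the single-image formula used throughout this post.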
We can use the same logic to extract class scores and determine which class each box predicts.
Class Score Extraction
So far, we've extracted:
- Box geometry
- Objectness score
The remaining values in each candidate box correspond to class probabilities.
Where class scores start
From the per-box layout:
[x, y, w, h, obj, cls0, cls1, ..., clsN]

Class scores begin at index:
int class_start = 5;

Determining the number of classes
Rather than hardcoding the number of classes, we compute it from the tensor shape:
int num_classes = elements_per_box - class_start;

For example, if:
elements_per_box = 85

Then:
num_classes = 80

This makes the code model-agnostic.
Finding the best class for a candidate
For each candidate box, we scan all class scores and select the highest one:
float max_class_score = 0.0f;
int class_id = -1;

for (int c = 0; c < num_classes; c++) {
    float score = data[i * elements_per_box + class_start + c];
    if (score > max_class_score) {
        max_class_score = score;
        class_id = c;
    }
}

At the end of this loop:
- class_id is the predicted class
- max_class_score is the highest class probability
Important clarification
This is not the final confidence score.
At this stage:
- We have the best class prediction
- But objectness has not been applied yet
Next, we'll combine objectness and class probability to compute the final confidence score used for filtering detections.
Final Confidence Score
Once we have:
- Objectness score
- Highest class probability
We combine them to compute the final confidence for a candidate box.
float confidence = obj * max_class_score;

Why this formula is correct
YOLO predicts two probabilities:
- Objectness: P(object exists)
- Class score: P(class | object exists)
From probability theory:
P(class AND object) = P(object) × P(class | object)

This is exactly what the confidence score represents.
Both signals must be high.
Examples:
obj = 0.9, class = 0.9 → confidence = 0.81
obj = 0.1, class = 0.9 → confidence = 0.09
obj = 0.9, class = 0.1 → confidence = 0.09

A high class score alone is not sufficient.
What this confidence is used for
This value is the one you should:
- Threshold
- Use for ranking detections
- Pass into NMS
Not:
- Objectness alone
- Class score alone
Typical thresholding
A common threshold looks like:
if (confidence < conf_threshold)
    continue;

Typical values:
- 0.2 – 0.3 → more detections
- 0.4 – 0.5 → fewer, cleaner detections
After computing confidence, the next step is to filter low-quality candidates before running NMS. This ordering is important for both correctness and performance.
Thresholding and NMS Order
Once the final confidence score is computed, the next step is filtering.
This step is applied before Non-Maximum Suppression (NMS).
Why thresholding comes first
NMS is an expensive operation.
Its complexity grows with the number of candidate boxes:
O(n²)

Thresholding early:
- Removes low-quality predictions
- Reduces the number of boxes passed to NMS
- Improves performance significantly
Correct order of operations
The correct post-processing flow is:
YOLO raw output
↓
Decode box + class
↓
confidence = objectness × class_score
↓
Threshold (filter)
↓
NMS (remove duplicates)
↓
Final detections

Common mistake
Running NMS before thresholding.
This:
- Slows down inference
- Produces unstable results
- Wastes computation on background predictions
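For reference, a minimal greedy NMS can be sketched as follows. This is a simplified single-class version (production code usually runs NMS per class, or offsets boxes by class id), and the struct and function names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Corner-format box with a confidence score.
struct Det {
    float x1, y1, x2, y2;
    float score;
};

// Intersection-over-Union of two corner-format boxes.
float iou(const Det& a, const Det& b) {
    float ix1 = std::max(a.x1, b.x1);
    float iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2);
    float iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.0f, ix2 - ix1) * std::max(0.0f, iy2 - iy1);
    float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    float uni = area_a + area_b - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: keep the highest-scoring box, drop any remaining box
// that overlaps it above iou_threshold, repeat.
std::vector<Det> nms(std::vector<Det> dets, float iou_threshold) {
    std::sort(dets.begin(), dets.end(),
              [](const Det& a, const Det& b) { return a.score > b.score; });
    std::vector<Det> kept;
    for (const Det& d : dets) {
        bool suppressed = false;
        for (const Det& k : kept) {
            if (iou(d, k) > iou_threshold) { suppressed = true; break; }
        }
        if (!suppressed)
            kept.push_back(d);
    }
    return kept;
}
```

Because both loops run over surviving boxes, the cost is quadratic in the number of candidates, which is exactly why thresholding first matters.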
Before wrapping up, it's important to mention model variations that can affect how this output tensor should be interpreted.
Important Variation: YOLOv8 Output Format
Not all YOLO ONNX exports follow the same output layout.
Some YOLOv8 models remove the explicit objectness score and output:
[x, y, w, h, cls0, cls1, ...]

In this case:
- There is no separate objectness value
- Class scores already represent final confidence
Why this matters
Assuming the wrong layout will lead to:
- Incorrect confidence computation
- Invalid filtering
- Broken detections
This is why you should never assume the output format.
What to always check
Before writing decoding logic:
- Print the output tensor shape
- Check the value of C
- Inspect the first row of the output tensor
These steps immediately reveal:
- Whether objectness is present
- Where class scores start
- How many values exist per box
Debug tip
A simple inspection loop is often enough:
for (int j = 0; j < elements_per_box; j++)
    std::cout << data[j] << " ";

This helps confirm:
- Value ranges
- Layout correctness
- Normalization behavior
With the tensor decoded and confidence computed, we can summarize the core ideas that make YOLO post-processing work correctly.
By now, you should have a clear picture of what happens after session.Run() in YOLO inference. The model has done its work, producing a large tensor of raw predictions — it's just up to us to interpret it.
Decoding these predictions in C++ is straightforward once you understand the tensor shape, memory layout, and how objectness interacts with class probabilities. Filtering low-confidence boxes and applying NMS turns this raw output into clean, usable detections.
Getting this step right removes most of the common headaches: misplaced boxes, weird confidence scores, and inconsistent results. Once the decoding logic is in place, everything downstream from drawing boxes on an image to integrating YOLO into a larger application becomes predictable and reliable.
If you'd like to connect, you can find me on LinkedIn: https://www.linkedin.com/in/isvidhi/