"Why am I getting thousands of numbers instead of bounding boxes?"

This confusion usually comes from a common misconception: YOLO does not output final detections. What it actually produces is a large tensor of raw predictions containing thousands of candidate boxes, each with geometry, objectness confidence, and class probabilities. At this stage, YOLO is essentially saying:

"Here are many possible boxes. You decide which ones are real."

Understanding this output tensor is the single most important step in correctly deploying YOLO models in production, especially when working with C++ and ONNX Runtime, where nothing is abstracted away for you.

If you've ever struggled with YOLO post-processing, incorrect indexing, or confidence calculations that "almost work but not quite", this post is meant to remove that confusion for good.

Let's start by understanding what YOLO actually outputs and why bounding boxes don't exist yet.

Before we look at tensor shapes or indexing, it's important to understand where exactly we are in the YOLO inference flow.

A typical YOLO inference pipeline looks like this:

Image
  ↓
Preprocessing (resize, letterbox, normalize)
  ↓
ONNX Runtime inference
  ↓
session.Run()
  ↓
Ort::Value (YOLO output tensor)   ← WE ARE HERE
  ↓
Post-processing (decode + filter + NMS)
  ↓
Final bounding boxes

At this point:

  • The model has already run successfully
  • ONNX Runtime has returned an Ort::Value
  • No bounding boxes exist yet

What we have is a raw output tensor, and our job is to interpret and decode it correctly.

What session.Run() Actually Returns

After calling ONNX Runtime inference:

std::vector<Ort::Value> output_tensors = session.Run(
    Ort::RunOptions{nullptr},
    input_names.data(),
    &input_tensor,
    1,
    output_names.data(),
    output_names.size()
);

You receive one or more Ort::Value objects.

For YOLO models, the first output is typically:

  • A 3D tensor
  • Containing predictions for thousands of candidate boxes
  • With no filtering or decoding applied

This tensor is not human-readable, and it is not the final output.

First: what YOLO actually outputs

YOLO does NOT output final boxes. It outputs raw predictions for many candidate boxes.

Think of YOLO as a "proposal generator"

YOLO effectively says:

"Here are thousands of possible boxes. You decide which ones are real."

What is a "candidate box"?

Each candidate corresponds to:

  • A grid cell
  • An anchor (depending on the YOLO version)

YOLO predicts many such candidates per image.

What each candidate contains

For each candidate, YOLO outputs a vector with three parts.

1. Box geometry

Usually 4 values:

tx, ty, tw, th

These are not final pixel coordinates.

They represent:

  • Offsets relative to a grid cell
  • Width and height in model space

2. Objectness confidence

One value:

objectness

This answers:

"Is there any object here?"

Value range:

0 → background
1 → object

3. Class scores

One value per class:

c1, c2, c3, ..., cN

These answer:

"If there is an object, what is it?"

These are conditional probabilities.

One candidate vector (example)

For COCO (80 classes):

[ tx, ty, tw, th, objectness, c1, c2, ..., c80 ]

Total values per candidate:

4 + 1 + 80 = 85
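As a concrete sketch of this layout, the three parts map to fixed offsets inside one candidate vector. The constants and helper name below are illustrative, assuming the COCO layout above:

```cpp
// Layout assumed: [tx, ty, tw, th, objectness, c1..cN] (COCO: N = 80)
constexpr int kNumClasses = 80;
constexpr int kValuesPerCandidate = 4 + 1 + kNumClasses; // 85

// Offsets of the three parts inside one candidate vector
constexpr int kBoxOffset = 0;        // tx, ty, tw, th
constexpr int kObjectnessOffset = 4; // single objectness value
constexpr int kClassOffset = 5;      // c1..cN

// Read the objectness value of one candidate vector
float objectness_of(const float* candidate) {
    return candidate[kObjectnessOffset];
}
```

These offsets are what all later indexing builds on.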

How many candidates are there?

For example, YOLOv5 at 640×640 uses:

  • Multiple detection heads
  • Multiple grid sizes
  • Multiple anchors

Total candidates ≈ 25,200.

Now that we understand what YOLO predicts, the next step is to see how these predictions are laid out in memory.

Typical YOLOv5 / YOLOv8 ONNX Output Shape

Once inference completes, ONNX Runtime returns the YOLO output as a 3D tensor.

The most common shape looks like this:

[1, N, C]

Where:

  • 1 — batch size
  • N — number of predicted boxes
  • C — values per box

Example: a COCO-trained YOLO export

A typical output shape is:

[1, 8400, 85]

This means:

  • One image was processed
  • The model predicted 8400 candidate boxes
  • Each candidate box is represented by 85 values

What determines C?

The value of C is fixed by the model configuration:

C = 4 + 1 + num_classes

For COCO (80 classes):

C = 4 + 1 + 80 = 85

So for every candidate box, YOLO outputs:

  • 4 values for box geometry
  • 1 value for objectness
  • 80 values for class scores

Important clarification

This tensor does not represent final detections.

Each of the N rows corresponds to a candidate box, not a confirmed object.

At this stage:

  • All candidates are present
  • No filtering has happened
  • No decoding has happened

Knowing the tensor shape is not enough; we also need to understand what each value inside C actually represents.


What Is Inside C (VERY IMPORTANT)

The last dimension of the YOLO output tensor contains all the information for one candidate box.

Its size is defined as:

C = 4 + 1 + num_classes

Per-box layout

Each row in the [N, C] tensor follows this layout:

[x, y, w, h, obj, cls0, cls1, ..., clsN]

Where:

  • x, y → box center coordinates
  • w, h → box width and height
  • obj → objectness score
  • cls0 ... clsN → per-class scores

Important notes

  • (x, y, w, h) are not corner coordinates
  • They are expressed in model input space
  • Depending on the export, values may be normalized or already scaled

These values still require decoding before they can be drawn on the image.

At this point, you must still:

  1. Convert (x, y, w, h) → (x1, y1, x2, y2)
  2. Combine objectness with class probability
  3. Filter low-confidence predictions
  4. Apply Non-Maximum Suppression (NMS)
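Step 1 in this list, the center-to-corner conversion, is a small helper. A minimal sketch (struct and function names are illustrative):

```cpp
struct BoxXYXY { float x1, y1, x2, y2; };

// Convert YOLO's center-format (x, y, w, h) to corner format (x1, y1, x2, y2).
// Assumes all values are already in the same coordinate space (model input space).
BoxXYXY xywh_to_xyxy(float x, float y, float w, float h) {
    return { x - w * 0.5f, y - h * 0.5f,
             x + w * 0.5f, y + h * 0.5f };
}
```

If the export emits normalized coordinates, they must be scaled to pixels before or after this conversion, and letterbox padding must be undone as well.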

Now that we understand what each candidate box contains, the next step is to see how to access this tensor in C++ using ONNX Runtime.

Accessing the Output Tensor in C++

After running inference with ONNX Runtime, the YOLO output is available as an Ort::Value.

A typical inference call looks like this:

std::vector<Ort::Value> output_tensors = session.Run(
    Ort::RunOptions{nullptr},
    input_names.data(),
    &input_tensor,
    1,
    output_names.data(),
    output_names.size()
);

For most YOLO models, the first output tensor contains all predictions.

Getting the tensor shape

To interpret the output, the first step is to inspect its shape:

auto& output = output_tensors[0];
auto type_info = output.GetTensorTypeAndShapeInfo();
std::vector<int64_t> shape = type_info.GetShape();

Example result:

shape = {1, 8400, 85}

This directly matches the [1, N, C] layout discussed earlier.

Accessing the raw data pointer

Once the shape is known, you can access the underlying data:

float* data = output.GetTensorMutableData<float>();

At this point:

  • data is a flat 1D array
  • All tensor values are stored contiguously in memory
  • The original 3D shape is no longer explicit

Understanding how to index into this array is critical.

How Indexing Works (THIS IS KEY)

At this stage, ONNX Runtime has given us:

float* data;

This is a flat 1D array, even though the original tensor shape was:

[1, num_boxes, elements_per_box]

Understanding how this flattened memory maps back to the tensor is critical.

What the YOLO output really is in memory

Conceptually, the output tensor looks like:

[batch][boxes][elements]

With batch size = 1:

[1][num_boxes][elements_per_box]

In memory, ONNX Runtime stores this tensor in row-major order.

What "row-major" means here

Row-major means:

The last dimension changes fastest.

So the memory layout is:

Box 0: [x y w h obj cls0 cls1 ...]
Box 1: [x y w h obj cls0 cls1 ...]
Box 2: [x y w h obj cls0 cls1 ...]
...

All values for one box are stored contiguously, followed by the next box.

Flattened memory view (example)

Imagine a small tensor:

[1][3][5]   // batch = 1, boxes = 3, elements = 5

Flattened in memory, it looks like:

Index:  0   1   2   3   4 | 5   6   7   8   9 | 10  11  12  13  14
Data : [b0e0 b0e1 b0e2 b0e3 b0e4 | b1e0 b1e1 b1e2 b1e3 b1e4 | b2e0 ...]

Each box occupies a fixed-size block in memory.
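This block layout can be verified with a tiny simulated tensor, using a plain array in place of the ONNX buffer (the demo function is illustrative):

```cpp
// Simulated [1][3][5] tensor, flattened row-major: the element index changes
// fastest, so box i's block starts at i * elements_per_box.
float flat_index_demo(int i, int j) {
    const int elements_per_box = 5;
    float data[15];
    for (int k = 0; k < 15; ++k) data[k] = static_cast<float>(k);
    return data[i * elements_per_box + j]; // index = i * elements_per_box + j
}
```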

Why this indexing formula works

To access:

  • box i
  • element j inside that box

You use:

index = i * elements_per_box + j

Because:

  • Each box uses elements_per_box values
  • To reach box i, you skip i full blocks
  • Then offset by j within that block

Applying this to the YOLO output

From the tensor shape:

int num_boxes = shape[1];
int elements_per_box = shape[2];

This means:

  • shape[1] → number of candidate boxes
  • shape[2] → values per box

Extracting box values

Using the indexing rule:

for (int i = 0; i < num_boxes; i++) {
    float x   = data[i * elements_per_box + 0];
    float y   = data[i * elements_per_box + 1];
    float w   = data[i * elements_per_box + 2];
    float h   = data[i * elements_per_box + 3];
    float obj = data[i * elements_per_box + 4];
}

Each iteration processes one candidate box.

Why batch is ignored

The batch dimension exists, but here:

batch = 1

So its offset is zero.

If batch size were greater than 1, indexing would become:

index = b * (num_boxes * elements_per_box)
      + i * elements_per_box
      + j

But for YOLO inference, batch size is typically 1.
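The batched formula can be wrapped in a small helper for clarity (the name is illustrative):

```cpp
// General row-major index for a [batch][num_boxes][elements_per_box] tensor.
// With batch = 1, the first term is always zero and drops out.
int batched_index(int b, int i, int j, int num_boxes, int elements_per_box) {
    return b * (num_boxes * elements_per_box)
         + i * elements_per_box
         + j;
}
```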

We can use the same logic to extract class scores and determine which class each box predicts.

Class Score Extraction

So far, we've extracted:

  • Box geometry
  • Objectness score

The remaining values in each candidate box correspond to class probabilities.

Where class scores start

From the per-box layout:

[x, y, w, h, obj, cls0, cls1, ..., clsN]

Class scores begin at index:

int class_start = 5;

Determining the number of classes

Rather than hardcoding the number of classes, we compute it from the tensor shape:

int num_classes = elements_per_box - class_start;

For example, if:

elements_per_box = 85

Then:

num_classes = 80

This makes the code model-agnostic.

Finding the best class for a candidate

For each candidate box, we scan all class scores and select the highest one:

float max_class_score = 0.0f;
int class_id = -1;
for (int c = 0; c < num_classes; c++) {
    float score = data[i * elements_per_box + class_start + c];
    if (score > max_class_score) {
        max_class_score = score;
        class_id = c;
    }
}

At the end of this loop:

  • class_id is the predicted class
  • max_class_score is the highest class probability
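The same arg-max scan can also be written with std::max_element. A sketch, assuming `scores` points at the first class score of one candidate (i.e. data + i * elements_per_box + 5):

```cpp
#include <algorithm>
#include <utility>

// Equivalent arg-max over the class scores using the standard library.
// Returns { class_id, max_class_score }.
std::pair<int, float> best_class(const float* scores, int num_classes) {
    const float* best = std::max_element(scores, scores + num_classes);
    return { static_cast<int>(best - scores), *best };
}
```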

Important clarification

This is not the final confidence score.

At this stage:

  • We have the best class prediction
  • But objectness has not been applied yet

Next, we'll combine objectness and class probability to compute the final confidence score used for filtering detections.

Final Confidence Score

Once we have:

  • Objectness score
  • Highest class probability

We combine them to compute the final confidence for a candidate box.

float confidence = obj * max_class_score;

Why this formula is correct

YOLO predicts two probabilities:

  • Objectness → P(object exists)
  • Class score → P(class | object exists)

From probability theory:

P(class AND object) = P(object) × P(class | object)

This is exactly what the confidence score represents.

Both signals must be high.

Examples:

obj = 0.9, class = 0.9 → confidence = 0.81
obj = 0.1, class = 0.9 → confidence = 0.09
obj = 0.9, class = 0.1 → confidence = 0.09

A high class score alone is not sufficient.

What this confidence is used for

This value is the one you should:

  • Threshold
  • Use for ranking detections
  • Pass into NMS

Not:

  • Objectness alone
  • Class score alone

Typical thresholding

A common threshold looks like:

if (confidence < conf_threshold)
    continue;

Typical values:

  • 0.2 – 0.3 → more detections
  • 0.4 – 0.5 → fewer, cleaner detections

After computing confidence, the next step is to filter low-quality candidates before running NMS. This ordering is important for both correctness and performance.

Thresholding and NMS Order

Once the final confidence score is computed, the next step is filtering.

This step is applied before Non-Maximum Suppression (NMS).

Why thresholding comes first

NMS is an expensive operation.

Its complexity grows with the number of candidate boxes:

O(n²)

Thresholding early:

  • Removes low-quality predictions
  • Reduces the number of boxes passed to NMS
  • Improves performance significantly

Correct order of operations

The correct post-processing flow is:

YOLO raw output
   ↓
Decode box + class
   ↓
confidence = objectness × class_score
   ↓
Threshold (filter)
   ↓
NMS (remove duplicates)
   ↓
Final detections
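The NMS step in this flow can be sketched as a minimal greedy implementation over corner-format boxes (illustrative; production code often uses a library routine such as OpenCV's cv::dnn::NMSBoxes, and usually runs NMS per class):

```cpp
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-Union between two corner-format boxes.
float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    float inter = iw * ih;
    float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter);
}

// Greedy NMS: keep the highest-scoring box, drop overlaps above the threshold.
std::vector<Box> nms(std::vector<Box> boxes, float iou_threshold) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(candidate, k) > iou_threshold) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(candidate);
    }
    return kept;
}
```

This is the O(n²) step the earlier thresholding keeps cheap: fewer surviving candidates means far fewer pairwise IoU checks.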

Common mistake

Running NMS before thresholding.

This:

  • Slows down inference
  • Produces unstable results
  • Wastes computation on background predictions

Before wrapping up, it's important to mention model variations that can affect how this output tensor should be interpreted.

Important Variation: YOLOv8 Output Format

Not all YOLO ONNX exports follow the same output layout.

Some YOLOv8 models remove the explicit objectness score and output:

[x, y, w, h, cls0, cls1, ...]

In this case:

  • There is no separate objectness value
  • Class scores already represent final confidence

Why this matters

Assuming the wrong layout will lead to:

  • Incorrect confidence computation
  • Invalid filtering
  • Broken detections

This is why you should never assume the output format.

What to always check

Before writing decoding logic:

  1. Print the output tensor shape
  2. Check the value of C
  3. Inspect the first row of the output tensor

These steps immediately reveal:

  • Whether objectness is present
  • Where class scores start
  • How many values exist per box
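One way to act on this checklist is a small heuristic that guesses the layout from C and a known class count (illustrative only; always confirm by inspecting actual values):

```cpp
// Heuristic: given the per-box size C and the known class count,
// guess whether the layout includes a separate objectness value.
// C == 4 + 1 + num_classes -> objectness present (e.g. 85 for COCO)
// C == 4 + num_classes     -> objectness absent  (e.g. 84 for COCO)
// Returns 1 (present), 0 (absent), or -1 when neither matches.
int objectness_layout(int elements_per_box, int num_classes) {
    if (elements_per_box == 4 + 1 + num_classes) return 1;
    if (elements_per_box == 4 + num_classes) return 0;
    return -1;
}
```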

Debug tip

A simple inspection loop is often enough:

for (int j = 0; j < elements_per_box; j++)
    std::cout << data[j] << " ";

This helps confirm:

  • Value ranges
  • Layout correctness
  • Normalization behavior

With the tensor decoded and confidence computed, we can summarize the core ideas that make YOLO post-processing work correctly.

By now, you should have a clear picture of what happens after session.Run() in YOLO inference. The model has done its work, producing a large tensor of raw predictions — it's just up to us to interpret it.

Decoding these predictions in C++ is straightforward once you understand the tensor shape, memory layout, and how objectness interacts with class probabilities. Filtering low-confidence boxes and applying NMS turns this raw output into clean, usable detections.

Getting this step right removes most of the common headaches: misplaced boxes, weird confidence scores, and inconsistent results. Once the decoding logic is in place, everything downstream from drawing boxes on an image to integrating YOLO into a larger application becomes predictable and reliable.

If you'd like to connect, you can find me on LinkedIn: https://www.linkedin.com/in/isvidhi/